225 lines
7.2 KiB
ReStructuredText
225 lines
7.2 KiB
ReStructuredText
PEP: 223
|
|
Title: Change the Meaning of ``\x`` Escapes
|
|
Author: Tim Peters <tim.peters@gmail.com>
|
|
Status: Final
|
|
Type: Standards Track
|
|
Content-Type: text/x-rst
|
|
Created: 20-Aug-2000
|
|
Python-Version: 2.0
|
|
Post-History: 23-Aug-2000
|
|
|
|
|
|
Abstract
|
|
========
|
|
|
|
Change ``\x`` escapes, in both 8-bit and Unicode strings, to consume
|
|
exactly the two hex digits following. The proposal views this as
|
|
correcting an original design flaw, leading to clearer expression
|
|
in all flavors of string, a cleaner Unicode story, better
|
|
compatibility with Perl regular expressions, and with minimal risk
|
|
to existing code.
|
|
|
|
|
|
Syntax
|
|
======
|
|
|
|
The syntax of ``\x`` escapes, in all flavors of non-raw strings, becomes ::
|
|
|
|
\xhh
|
|
|
|
where h is a hex digit (0-9, a-f, A-F). The exact syntax in 1.5.2 is
|
|
not clearly specified in the Reference Manual; it says ::
|
|
|
|
\xhh...
|
|
|
|
implying "two or more" hex digits, but one-digit forms are also
|
|
accepted by the 1.5.2 compiler, and a plain ``\x`` is "expanded" to
|
|
itself (i.e., a backslash followed by the letter x). It's unclear
|
|
whether the Reference Manual intended either of the 1-digit or
|
|
0-digit behaviors.
|
|
|
|
|
|
Semantics
|
|
=========
|
|
|
|
In an 8-bit non-raw string, ::
|
|
|
|
\xij
|
|
|
|
expands to the character ::
|
|
|
|
chr(int(ij, 16))
|
|
|
|
Note that this is the same as in 1.6 and before.
|
|
|
|
In a Unicode string,
|
|
::
|
|
|
|
\xij
|
|
|
|
acts the same as ::
|
|
|
|
\u00ij
|
|
|
|
i.e. it expands to the obvious Latin-1 character from the initial
|
|
segment of the Unicode space.
|
|
|
|
An ``\x`` not followed by at least two hex digits is a compile-time error,
|
|
specifically ``ValueError`` in 8-bit strings, and ``UnicodeError`` (a subclass
|
|
of ``ValueError``) in Unicode strings. Note that if an ``\x`` is followed by
|
|
more than two hex digits, only the first two are "consumed". In 1.6
|
|
and before all but the *last* two were silently ignored.
|
|
|
|
|
|
Example
|
|
=======
|
|
|
|
In 1.5.2::
|
|
|
|
>>> "\x123465" # same as "\x65"
|
|
'e'
|
|
>>> "\x65"
|
|
'e'
|
|
>>> "\x1"
|
|
'\001'
|
|
>>> "\x\x"
|
|
'\\x\\x'
|
|
>>>
|
|
|
|
In 2.0::
|
|
|
|
>>> "\x123465" # \x12 -> \022, "3456" left alone
|
|
'\0223456'
|
|
>>> "\x65"
|
|
'e'
|
|
>>> "\x1"
|
|
[ValueError is raised]
|
|
>>> "\x\x"
|
|
[ValueError is raised]
|
|
>>>
|
|
|
|
|
|
History and Rationale
|
|
=====================
|
|
|
|
``\x`` escapes were introduced in C as a way to specify variable-width
|
|
character encodings. Exactly which encodings those were, and how many
|
|
hex digits they required, was left up to each implementation. The
|
|
language simply stated that ``\x`` "consumed" *all* hex digits following,
|
|
and left the meaning up to each implementation. So, in effect, ``\x`` in C
|
|
is a standard hook to supply platform-defined behavior.
|
|
|
|
Because Python explicitly aims at platform independence, the ``\x`` escape
|
|
in Python (up to and including 1.6) has been treated the same way
|
|
across all platforms: all *except* the last two hex digits were
|
|
silently ignored. So the only actual use for ``\x`` escapes in Python was
|
|
to specify a single byte using hex notation.
|
|
|
|
Larry Wall appears to have realized that this was the only real use for
|
|
``\x`` escapes in a platform-independent language, as the proposed rule for
|
|
Python 2.0 is in fact what Perl has done from the start (although you
|
|
need to run in Perl -w mode to get warned about ``\x`` escapes with fewer
|
|
than 2 hex digits following -- it's clearly more Pythonic to insist on
|
|
2 all the time).
|
|
|
|
When Unicode strings were introduced to Python, ``\x`` was generalized so
|
|
as to ignore all but the last *four* hex digits in Unicode strings.
|
|
This caused a technical difficulty for the new regular expression engine:
|
|
SRE tries very hard to allow mixing 8-bit and Unicode patterns and
|
|
strings in intuitive ways, and it no longer had any way to guess what,
|
|
for example, ``r"\x123456"`` should mean as a pattern: is it asking to match
|
|
the 8-bit character ``\x56`` or the Unicode character ``\u3456``?
|
|
|
|
There are hacky ways to guess, but it doesn't end there. The ISO C99
|
|
standard also introduces 8-digit ``\U12345678`` escapes to cover the entire
|
|
ISO 10646 character space, and it's also desired that Python 2 support
|
|
that from the start. But then what are ``\x`` escapes supposed to mean?
|
|
Do they ignore all but the last *eight* hex digits then? And if less
|
|
than 8 following in a Unicode string, all but the last 4? And if less
|
|
than 4, all but the last 2?
|
|
|
|
This was getting messier by the minute, and the proposal cuts the
|
|
Gordian knot by making ``\x`` simpler instead of more complicated. Note
|
|
that the 4-digit generalization to ``\xijkl`` in Unicode strings was also
|
|
redundant, because it meant exactly the same thing as ``\uijkl`` in Unicode
|
|
strings. It's more Pythonic to have just one obvious way to specify a
|
|
Unicode character via hex notation.
|
|
|
|
|
|
Development and Discussion
|
|
==========================
|
|
|
|
The proposal was worked out among Guido van Rossum, Fredrik Lundh and
|
|
Tim Peters in email. It was subsequently explained and discussed on
|
|
Python-Dev under subject "Go \x yourself" [1]_, starting 2000-08-03.
|
|
Response was overwhelmingly positive; no objections were raised.
|
|
|
|
|
|
Backward Compatibility
|
|
======================
|
|
|
|
Changing the meaning of ``\x`` escapes does carry risk of breaking existing
|
|
code, although no instances of incompatibility have yet been discovered.
|
|
The risk is believed to be minimal.
|
|
|
|
Tim Peters verified that, except for pieces of the standard test suite
|
|
deliberately provoking end cases, there are no instances of ``\xabcdef...``
|
|
with fewer or more than 2 hex digits following, in either the Python
|
|
CVS development tree, or in assorted Python packages sitting on his
|
|
machine.
|
|
|
|
It's unlikely there are any with fewer than 2, because the Reference
|
|
Manual implied they weren't legal (although this is debatable!). If
|
|
there are any with more than 2, Guido is ready to argue they were buggy
|
|
anyway <0.9 wink>.
|
|
|
|
Guido reported that the O'Reilly Python books *already* document that
|
|
Python works the proposed way, likely due to their Perl editing
|
|
heritage (as above, Perl worked (very close to) the proposed way from
|
|
its start).
|
|
|
|
Finn Bock reported that what JPython does with ``\x`` escapes is
|
|
unpredictable today. This proposal gives a clear meaning that can be
|
|
consistently and easily implemented across all Python implementations.
|
|
|
|
|
|
Effects on Other Tools
|
|
======================
|
|
|
|
Believed to be none. The candidates for breakage would mostly be
|
|
parsing tools, but the author knows of none that worry about the
|
|
internal structure of Python strings beyond the approximation "when
|
|
there's a backslash, swallow the next character". Tim Peters checked
|
|
``python-mode.el``, the std ``tokenize.py`` and ``pyclbr.py``, and the IDLE syntax
|
|
coloring subsystem, and believes there's no need to change any of
|
|
them. Tools like ``tabnanny.py`` and ``checkappend.py`` inherit their immunity
|
|
from ``tokenize.py``.
|
|
|
|
|
|
Reference Implementation
|
|
========================
|
|
|
|
The code changes are so simple that a separate patch will not be produced.
|
|
Fredrik Lundh is writing the code, is an expert in the area, and will
|
|
simply check the changes in before 2.0b1 is released.
|
|
|
|
|
|
BDFL Pronouncements
|
|
===================
|
|
|
|
Yes, ``ValueError``, not ``SyntaxError``. "Problems with literal interpretations
|
|
traditionally raise 'runtime' exceptions rather than syntax errors."
|
|
|
|
|
|
References
|
|
==========
|
|
|
|
.. [1] Tim Peters, Go \x yourself
|
|
https://mail.python.org/pipermail/python-dev/2000-August/007825.html
|
|
|
|
|
|
Copyright
|
|
=========
|
|
|
|
This document has been placed in the public domain.
|