Completed, about to post.
This commit is contained in:
parent
e7c1e697c0
commit
156e7dfa35
182
pep-0223.txt
182
pep-0223.txt
|
@ -2,11 +2,11 @@ PEP: 223
|
|||
Title: Change the Meaning of \x Escapes
|
||||
Version: $Revision$
|
||||
Author: tpeters@beopen.com (Tim Peters)
|
||||
Status: Draft
|
||||
Status: Active
|
||||
Type: Standards Track
|
||||
Python-Version: 2.0
|
||||
Created: 20-Aug-2000
|
||||
Post-History:
|
||||
Post-History: 23-Aug-2000
|
||||
|
||||
|
||||
Abstract
|
||||
|
@ -19,6 +19,184 @@ Abstract
|
|||
to existing code.
|
||||
|
||||
|
||||
Syntax
|
||||
|
||||
The syntax of \x escapes, in all flavors of non-raw strings, becomes
|
||||
|
||||
\xhh
|
||||
|
||||
where h is a hex digit (0-9, a-f, A-F). The exact syntax in 1.5.2 is
|
||||
not clearly specified in the Reference Manual; it says
|
||||
|
||||
\xhh...
|
||||
|
||||
implying "two or more" hex digits, but one-digit forms are also
|
||||
accepted by the 1.5.2 compiler, and a plain \x is "expanded" to
|
||||
itself (i.e., a backslash followed by the letter x). It's unclear
|
||||
whether the Reference Manual intended either of the 1-digit or
|
||||
0-digit behaviors.
|
||||
|
||||
|
||||
Semantics
|
||||
|
||||
In an 8-bit non-raw string,
|
||||
\xij
|
||||
expands to the character
|
||||
chr(int(ij, 16))
|
||||
Note that this is the same as in 1.6 and before.
|
||||
|
||||
In a Unicode string,
|
||||
\xij
|
||||
acts the same as
|
||||
\u00ij
|
||||
i.e. it expands to the obvious Latin-1 character from the initial
|
||||
segment of the Unicode space.
|
||||
|
||||
An \x not followed by at least two hex digits is a compile-time error,
|
||||
specifically ValueError in 8-bit strings, and UnicodeError (a subclass
|
||||
of ValueError) in Unicode strings. Note that if an \x is followed by
|
||||
more than two hex digits, only the first two are "consumed". In 1.6
|
||||
and before all but the *last* two were silently ignored.
|
||||
|
||||
|
||||
Example
|
||||
|
||||
In 1.5.2:
|
||||
|
||||
>>> "\x123465" # same as "\x65"
|
||||
'e'
|
||||
>>> "\x65"
|
||||
'e'
|
||||
>>> "\x1"
|
||||
'\001'
|
||||
>>> "\x\x"
|
||||
'\\x\\x'
|
||||
>>>
|
||||
|
||||
In 2.0:
|
||||
|
||||
>>> "\x123465" # \x12 -> \022, "3456" left alone
|
||||
'\0223456'
|
||||
>>> "\x65"
|
||||
'e'
|
||||
>>> "\x1"
|
||||
[ValueError is raised]
|
||||
>>> "\x\x"
|
||||
[ValueError is raised]
|
||||
>>>
|
||||
|
||||
|
||||
History and Rationale
|
||||
|
||||
\x escapes were introduced in C as a way to specify variable-width
|
||||
character encodings. Exactly which encodings those were, and how many
|
||||
hex digits they required, was left up to each implementation. The
|
||||
language simply stated that \x "consumed" *all* hex digits following,
|
||||
and left the meaning up to each implementation. So, in effect, \x in C
|
||||
is a standard hook to supply platform-defined behavior.
|
||||
|
||||
Because Python explicitly aims at platform independence, the \x escape
|
||||
in Python (up to and including 1.6) has been treated the same way
|
||||
across all platforms: all *except* the last two hex digits were
|
||||
silently ignored. So the only actual use for \x escapes in Python was
|
||||
to specify a single byte using hex notation.
|
||||
|
||||
Larry Wall appears to have realized that this was the only real use for
|
||||
\x escapes in a platform-independent language, as the proposed rule for
|
||||
Python 2.0 is in fact what Perl has done from the start (although you
|
||||
need to run in Perl -w mode to get warned about \x escapes with fewer
|
||||
than 2 hex digits following -- it's clearly more Pythonic to insist on
|
||||
2 all the time).
|
||||
|
||||
When Unicode strings were introduced to Python, \x was generalized so
|
||||
as to ignore all but the last *four* hex digits in Unicode strings.
|
||||
This caused a technical difficulty for the new regular expression engine:
|
||||
SRE tries very hard to allow mixing 8-bit and Unicode patterns and
|
||||
strings in intuitive ways, and it no longer had any way to guess what,
|
||||
for example, r"\x123456" should mean as a pattern: is it asking to match
|
||||
the 8-bit character \x56 or the Unicode character \u3456?
|
||||
|
||||
There are hacky ways to guess, but it doesn't end there. The ISO C99
|
||||
standard also introduces 8-digit \U12345678 escapes to cover the entire
|
||||
ISO 10646 character space, and it's also desired that Python 2 support
|
||||
that from the start. But then what are \x escapes supposed to mean?
|
||||
Do they ignore all but the last *eight* hex digits then? And if less
|
||||
than 8 following in a Unicode string, all but the last 4? And if less
|
||||
than 4, all but the last 2?
|
||||
|
||||
This was getting messier by the minute, and the proposal cuts the
|
||||
Gordian knot by making \x simpler instead of more complicated. Note
|
||||
that the 4-digit generalization to \xijkl in Unicode strings was also
|
||||
redundant, because it meant exactly the same thing as \uijkl in Unicode
|
||||
strings. It's more Pythonic to have just one obvious way to specify a
|
||||
Unicode character via hex notation.
|
||||
|
||||
|
||||
Development and Discussion
|
||||
|
||||
The proposal was worked out among Guido van Rossum, Fredrik Lundh and
|
||||
Tim Peters in email. It was subsequently explained and disussed on
|
||||
Python-Dev under subject "Go \x yourself", starting 2000-08-03.
|
||||
Response was overwhelmingly positive; no objections were raised.
|
||||
|
||||
|
||||
Backward Compatibility
|
||||
|
||||
Changing the meaning of \x escapes does carry risk of breaking existing
|
||||
code, although no instances of incompabitility have yet been discovered.
|
||||
The risk is believed to be minimal.
|
||||
|
||||
Tim Peters verified that, except for pieces of the standard test suite
|
||||
deliberately provoking end cases, there are no instances of \xabcdef...
|
||||
with fewer or more than 2 hex digits following, in either the Python
|
||||
CVS development tree, or in assorted Python packages sitting on his
|
||||
machine.
|
||||
|
||||
It's unlikely there are any with fewer than 2, because the Reference
|
||||
Manual implied they weren't legal (although this is debatable!). If
|
||||
there are any with more than 2, Guido is ready to argue they were buggy
|
||||
anyway <0.9 wink>.
|
||||
|
||||
Guido reported that the O'Reilly Python books *already* document that
|
||||
Python works the proposed way, likely due to their Perl editing
|
||||
heritage (as above, Perl worked (very close to) the proposed way from
|
||||
its start).
|
||||
|
||||
Finn Bock reported that what JPython does with \x escapes is
|
||||
unpredictable today. This proposal gives a clear meaning that can be
|
||||
consistently and easily implemented across all Python implementations.
|
||||
|
||||
|
||||
Effects on Other Tools
|
||||
|
||||
Believed to be none. The candidates for breakage would mostly be
|
||||
parsing tools, but the author knows of none that worry about the
|
||||
internal structure of Python strings beyond the approximation "when
|
||||
there's a backslash, swallow the next character". Tim Peters checked
|
||||
python-mode.el, the std tokenize.py and pyclbr.py, and the IDLE syntax
|
||||
coloring subsystem, and believes there's no need to change any of
|
||||
them. Tools like tabnanny.py and checkappend.py inherit their immunity
|
||||
from tokenize.py.
|
||||
|
||||
|
||||
Reference Implementation
|
||||
|
||||
The code changes are so simple that a separate patch will not be produced.
|
||||
Fredrik Lundh is writing the code, is an expert in the area, and will
|
||||
simply check the changes in before 2.0b1 is released.
|
||||
|
||||
|
||||
BDFL Pronouncements
|
||||
|
||||
Yes, ValueError, not SyntaxError. "Problems with literal interpretations
|
||||
traditionally raise 'runtime' exceptions rather than syntax errors."
|
||||
|
||||
|
||||
Copyright
|
||||
|
||||
This document has been placed in the public domain.
|
||||
|
||||
|
||||
|
||||
Local Variables:
|
||||
mode: indented-text
|
||||
|
|
Loading…
Reference in New Issue