Completed, about to post.

2000-08-24 03:26:42 +00:00 · 2000-08-24 03:26:42 +00:00 · 156e7dfa35
parent e7c1e697c0
commit 156e7dfa35
1 changed file with 180 additions and 2 deletions
--- a/pep-0223.txt
+++ b/pep-0223.txt
@@ -2,11 +2,11 @@ PEP: 223
 Title: Change the Meaning of \x Escapes
 Version: $Revision$
 Author: tpeters@beopen.com (Tim Peters)
-Status: Draft
+Status: Active
 Type: Standards Track
 Python-Version: 2.0
 Created: 20-Aug-2000
-Post-History:
+Post-History: 23-Aug-2000


 Abstract
@@ -19,6 +19,184 @@ Abstract
    to existing code.


+Syntax
+
+    The syntax of \x escapes, in all flavors of non-raw strings, becomes
+
+        \xhh
+
+    where h is a hex digit (0-9, a-f, A-F).  The exact syntax in 1.5.2 is
+    not clearly specified in the Reference Manual; it says
+
+        \xhh...
+
+    implying "two or more" hex digits, but one-digit forms are also
+    accepted by the 1.5.2 compiler, and a plain \x is "expanded" to
+    itself (i.e., a backslash followed by the letter x).  It's unclear
+    whether the Reference Manual intended either of the 1-digit or
+    0-digit behaviors.
+
+
+Semantics
+
+    In an 8-bit non-raw string,
+        \xij
+    expands to the character
+        chr(int(ij, 16))
+    Note that this is the same as in 1.6 and before.
+
+    In a Unicode string,
+        \xij
+    acts the same as
+        \u00ij
+    i.e. it expands to the obvious Latin-1 character from the initial
+    segment of the Unicode space.
+
+    An \x not followed by at least two hex digits is a compile-time error,
+    specifically ValueError in 8-bit strings, and UnicodeError (a subclass
+    of ValueError) in Unicode strings.  Note that if an \x is followed by
+    more than two hex digits, only the first two are "consumed".  In 1.6
+    and before, all but the *last* two were silently ignored.
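
The rule above can be sketched in a few lines of modern Python.  This is
only an illustration, not the code that was checked in: the function name
expand_x_escapes is hypothetical, it handles \x and nothing else (every
backslash followed by x is treated as the start of an escape, so other
escapes such as a doubled backslash are not recognized), and it raises a
plain ValueError for both string types.

```python
import string

HEX_DIGITS = set(string.hexdigits)  # 0-9, a-f, A-F


def expand_x_escapes(text):
    """Expand \\xhh escapes using the proposed 2.0 rule.

    Hypothetical helper for illustration: exactly two hex digits are
    consumed after each \\x; anything else raises ValueError.
    """
    out = []
    i = 0
    while i < len(text):
        if text.startswith("\\x", i):
            digits = text[i + 2:i + 4]
            if len(digits) < 2 or not all(d in HEX_DIGITS for d in digits):
                raise ValueError("invalid \\x escape at position %d" % i)
            out.append(chr(int(digits, 16)))
            i += 4  # backslash, x, and exactly two hex digits
        else:
            out.append(text[i])
            i += 1
    return "".join(out)
```

Called on the raw string r"\x123465", the sketch consumes only "12" and
leaves "3465" untouched; r"\x1" and r"\x\x" both raise ValueError.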
+
+
+Example
+
+    In 1.5.2:
+
+        >>> "\x123465"  # same as "\x65"
+        'e'
+        >>> "\x65"
+        'e'
+        >>> "\x1"
+        '\001'
+        >>> "\x\x"
+        '\\x\\x'
+        >>>
+
+    In 2.0:
+
+        >>> "\x123465" # \x12 -> \022, "3465" left alone
+        '\0223465'
+        >>> "\x65"
+        'e'
+        >>> "\x1"
+        [ValueError is raised]
+        >>> "\x\x"
+        [ValueError is raised]
+        >>>
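
For comparison, the 1.5.2 behavior shown in the first transcript can be
approximated as follows.  This is a sketch written from the description
in this PEP, not taken from the 1.5.2 compiler; the name expand_x_legacy
is hypothetical, and only \x escapes are handled.

```python
import re

# A backslash-x followed by any run of hex digits (possibly empty).
_X_RUN = re.compile(r"\\x([0-9a-fA-F]*)")


def expand_x_legacy(text):
    """Approximate the pre-2.0 rule for \\x escapes in 8-bit strings.

    Hypothetical helper: all following hex digits are consumed, all but
    the last two are ignored, and a bare \\x is left unchanged.
    """
    def replace(match):
        digits = match.group(1)
        if not digits:
            return match.group(0)  # plain \x "expands" to itself
        return chr(int(digits[-2:], 16))

    return _X_RUN.sub(replace, text)
```

With this sketch, r"\x123465" and r"\x65" both yield 'e', r"\x1" yields
chr(1), and r"\x\x" comes back unchanged, matching the 1.5.2 transcript.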
+
+
+History and Rationale
+
+    \x escapes were introduced in C as a way to specify variable-width
+    character encodings.  Exactly which encodings those were, and how many
+    hex digits they required, was left up to each implementation.  The
+    language simply stated that \x "consumed" *all* hex digits following,
+    and left the meaning up to each implementation.  So, in effect, \x in C
+    is a standard hook to supply platform-defined behavior.
+
+    Because Python explicitly aims at platform independence, the \x escape
+    in Python (up to and including 1.6) has been treated the same way
+    across all platforms:  all *except* the last two hex digits were
+    silently ignored.  So the only actual use for \x escapes in Python was
+    to specify a single byte using hex notation.
+
+    Larry Wall appears to have realized that this was the only real use for
+    \x escapes in a platform-independent language, as the proposed rule for
+    Python 2.0 is in fact what Perl has done from the start (although you
+    need to run in Perl -w mode to get warned about \x escapes with fewer
+    than 2 hex digits following -- it's clearly more Pythonic to insist on
+    2 all the time).
+
+    When Unicode strings were introduced to Python, \x was generalized so
+    as to ignore all but the last *four* hex digits in Unicode strings.
+    This caused a technical difficulty for the new regular expression engine:
+    SRE tries very hard to allow mixing 8-bit and Unicode patterns and
+    strings in intuitive ways, and it no longer had any way to guess what,
+    for example, r"\x123456" should mean as a pattern:  is it asking to match
+    the 8-bit character \x56 or the Unicode character \u3456?
+
+    There are hacky ways to guess, but it doesn't end there.  The ISO C99
+    standard also introduces 8-digit \U12345678 escapes to cover the entire
+    ISO 10646 character space, and it's also desired that Python 2 support
+    that from the start.  But then what are \x escapes supposed to mean?
+    Do they ignore all but the last *eight* hex digits then?  And if fewer
+    than 8 follow in a Unicode string, all but the last 4?  And if fewer
+    than 4, all but the last 2?
+
+    This was getting messier by the minute, and the proposal cuts the
+    Gordian knot by making \x simpler instead of more complicated.  Note
+    that the 4-digit generalization to \xijkl in Unicode strings was also
+    redundant, because it meant exactly the same thing as \uijkl in Unicode
+    strings.  It's more Pythonic to have just one obvious way to specify a
+    Unicode character via hex notation.
+
+
+Development and Discussion
+
+    The proposal was worked out among Guido van Rossum, Fredrik Lundh and
+    Tim Peters in email.  It was subsequently explained and discussed on
+    Python-Dev under subject "Go \x yourself", starting 2000-08-03.
+    Response was overwhelmingly positive; no objections were raised.
+
+
+Backward Compatibility
+
+    Changing the meaning of \x escapes does carry risk of breaking existing
+    code, although no instances of incompatibility have yet been discovered.
+    The risk is believed to be minimal.
+
+    Tim Peters verified that, except for pieces of the standard test suite
+    deliberately provoking end cases, there are no instances of \xabcdef...
+    with fewer or more than 2 hex digits following, in either the Python
+    CVS development tree, or in assorted Python packages sitting on his
+    machine.
+
+    It's unlikely there are any with fewer than 2, because the Reference
+    Manual implied they weren't legal (although this is debatable!).  If
+    there are any with more than 2, Guido is ready to argue they were buggy
+    anyway <0.9 wink>.
+
+    Guido reported that the O'Reilly Python books *already* document that
+    Python works the proposed way, likely due to their Perl editing
+    heritage (as above, Perl worked (very close to) the proposed way from
+    its start).
+
+    Finn Bock reported that what JPython does with \x escapes is
+    unpredictable today.  This proposal gives a clear meaning that can be
+    consistently and easily implemented across all Python implementations.
+
+
+Effects on Other Tools
+
+    Believed to be none.  The candidates for breakage would mostly be
+    parsing tools, but the author knows of none that worry about the
+    internal structure of Python strings beyond the approximation "when
+    there's a backslash, swallow the next character".  Tim Peters checked
+    python-mode.el, the std tokenize.py and pyclbr.py, and the IDLE syntax
+    coloring subsystem, and believes there's no need to change any of
+    them.  Tools like tabnanny.py and checkappend.py inherit their immunity
+    from tokenize.py.
+
+
+Reference Implementation
+
+    The code changes are so simple that a separate patch will not be produced.
+    Fredrik Lundh is writing the code, is an expert in the area, and will
+    simply check the changes in before 2.0b1 is released.
+
+
+BDFL Pronouncements
+
+    Yes, ValueError, not SyntaxError.  "Problems with literal interpretations
+    traditionally raise 'runtime' exceptions rather than syntax errors."
+
+
+Copyright
+
+    This document has been placed in the public domain.
+
+

 Local Variables:
 mode: indented-text