Use Hisao's strategy of converting to UTF-8.
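
For readers of this change, the revised pipeline (decode with the per-file encoding, convert to UTF-8, tokenize, then reencode string literals into the file encoding) might be sketched roughly as below. This is an illustration in modern Python, not the patch itself. The helper names and the declaration regex are assumptions made for the example, and the "iso-8859-1" fallback mirrors the phase 1 default described in the diff.

```python
import re

# Assumed pattern for the encoding declaration comment; the diff itself
# only says the encoding is declared at the top of the file.
_DECLARATION = re.compile(rb"coding[:=]\s*([-\w.]+)")


def detect_encoding(raw, default="iso-8859-1"):
    """Return the declared encoding, or the phase 1 default if none is found."""
    for line in raw.splitlines()[:2]:
        if line.lstrip().startswith(b"#"):
            match = _DECLARATION.search(line)
            if match:
                return match.group(1).decode("ascii")
    return default


def source_to_utf8(raw):
    """Decode the raw bytes (step 2) and convert them to UTF-8 (step 3)."""
    encoding = detect_encoding(raw)
    text = raw.decode(encoding)
    return encoding, text.encode("utf-8")


def literal_to_file_encoding(utf8_literal, encoding):
    """Reencode the UTF-8 data of a string literal into the file encoding (step 5)."""
    return utf8_literal.decode("utf-8").encode(encoding)
```

With no declaration, every byte maps to a code point under "iso-8859-1", so an arbitrary byte string survives the trip through UTF-8 and back unchanged, which is the round-trip property phase 1 relies on.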

2002-04-19 17:32:14 +00:00 · 6c432cfe3f
parent 8a055a911e
commit 6c432cfe3f
1 changed file with 31 additions and 39 deletions
--- a/pep-0263.txt
+++ b/pep-0263.txt
@@ -1,7 +1,8 @@
 PEP: 0263
 Title: Defining Python Source Code Encodings
 Version: $Revision$
-Author: mal@lemburg.com (Marc-André Lemburg)
+Author: mal@lemburg.com (Marc-André Lemburg),
+  loewis@informatik.hu-berlin.de (Martin v. Löwis)
 Status: Draft
 Type: Standards Track
 Python-Version: 2.3
@@ -25,7 +26,7 @@ Problem
    programming environment rather unfriendly to Python users who live
    and work in non-Latin-1 locales such as many of the Asian 
    countries. Programmers can write their 8-bit strings using the
-    favourite encoding, but are bound to the "unicode-escape" encoding
+    favorite encoding, but are bound to the "unicode-escape" encoding
    for Unicode literals.

 Proposed Solution
@@ -35,7 +36,7 @@ Proposed Solution
    at the top of the file to declare the encoding.

    To make Python aware of this encoding declaration a number of
-    concept changes are necessary with repect to the handling of
+    concept changes are necessary with respect to the handling of
    Python source code data.

 Defining the Encoding
@@ -95,54 +96,43 @@ Concepts

       2. decode it into Unicode assuming a fixed per-file encoding

-       3. tokenize the Unicode content
+       3. convert it into a UTF-8 byte string

-       4. compile it, creating Unicode objects from the given Unicode data
+       4. tokenize the UTF-8 content
+
+       5. compile it, creating Unicode objects from the given Unicode data
          and creating string objects from the Unicode literal data
-          by first reencoding the Unicode data into 8-bit string data
+          by first reencoding the UTF-8 data into 8-bit string data
          using the given file encoding

-       5. variable names and other identifiers will be reencoded into
-          8-bit strings using the file encoding to assure backward
-          compatibility with the existing implementation
-
       Note that Python identifiers are restricted to the ASCII
-       subset of the encoding.
+       subset of the encoding, and thus need no further conversion
+       after step 4.

 Implementation

-    Since changing the Python tokenizer/parser combination will
-    require major changes in the internals of the interpreter and
-    enforcing the use of magic comments in source code files which
-    place non-ASCII characters in string literals, comments
-    and Unicode literals, the proposed solution should be implemented
-    in two phases:
+    For backwards-compatibility with existing code which currently
+    uses non-ASCII in string literals without declaring an encoding,
+    the implementation will be introduced in two phases:

-    1. Implement the magic comment detection, but only apply the
-       detected encoding to Unicode literals in the source file.
+    1. Allow non-ASCII in string literals and comments, by internally
+       treating a missing encoding declaration as a declaration of
+       "iso-8859-1". This will cause arbitrary byte strings to
+       correctly round-trip between step 2 and step 5 of the
+       processing, and provide compatibility with Python 2.2 for
+       Unicode literals that contain non-ASCII bytes.

-       If no magic comment is used, Python should continue to
-       use the standard [raw-]unicode-escape codecs for Unicode
-       literals.
+       A warning will be issued if non-ASCII bytes are found in the
+       input, once per improperly encoded input file.

-       In addition to this step and to aid in the transition to
-       explicit encoding declaration, the tokenizer must check the
-       complete source file for compliance with the declared
-       encoding. If the source file does not properly decode, a single
-       warning is generated per file.
+    2. Remove the warning, and change the default encoding to "ascii".

-    2. Change the tokenizer/compiler base string type from char* to
-       Py_UNICODE* and apply the encoding to the complete file.
+    The builtin compile() API will be enhanced to accept Unicode as
+    input. 8-bit string input is subject to the standard procedure for
+    encoding detection as described above.

-       Source files which fail to decode cause an error to be raised
-       during compilation.
-
-       The builtin compile() API will be enhanced to accept Unicode as
-       input. 8-bit string input is subject to the standard procedure
-       for encoding detection as decsribed above.
-
-    Martin v. Loewis is working on a patch which implements phase 1.
-    See [1] for details.
+    SUZUKI Hisao is working on a patch; see [2] for details. A patch
+    implementing only phase 1 is available at [1].

 Scope

@@ -153,7 +143,9 @@ Scope
 References

    [1] Phase 1 implementation:
-        http://sourceforge.net/tracker/?func=detail&atid=305470&aid=526840&group_id=5470
+        http://python.org/sf/526840
+    [2] Phase 2 implementation:
+        http://python.org/sf/534304

 History