From 6c432cfe3f9bb61477d53560ebfa0f2d60040ce8 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Martin=20v=2E=20L=C3=B6wis?=
Date: Fri, 19 Apr 2002 17:32:14 +0000
Subject: [PATCH] Use Hisao's strategy of converting to UTF-8.

---
 pep-0263.txt | 70 +++++++++++++++++++++++-----------------------------
 1 file changed, 31 insertions(+), 39 deletions(-)

diff --git a/pep-0263.txt b/pep-0263.txt
index 3f69795e5..c7e6a8b90 100644
--- a/pep-0263.txt
+++ b/pep-0263.txt
@@ -1,7 +1,8 @@
 PEP: 0263
 Title: Defining Python Source Code Encodings
 Version: $Revision$
-Author: mal@lemburg.com (Marc-André Lemburg)
+Author: mal@lemburg.com (Marc-André Lemburg),
+        loewis@informatik.hu-berlin.de (Martin v. Löwis)
 Status: Draft
 Type: Standards Track
 Python-Version: 2.3
@@ -25,7 +26,7 @@ Problem
     programming environment rather unfriendly to Python users who
     live and work in non-Latin-1 locales such as many of the Asian
     countries. Programmers can write their 8-bit strings using the
-    favourite encoding, but are bound to the "unicode-escape" encoding
+    favorite encoding, but are bound to the "unicode-escape" encoding
     for Unicode literals.
 
 Proposed Solution
@@ -35,7 +36,7 @@ Proposed Solution
     at the top of the file to declare the encoding.
 
     To make Python aware of this encoding declaration a number of
-    concept changes are necessary with repect to the handling of
+    concept changes are necessary with respect to the handling of
     Python source code data.
 
 Defining the Encoding
@@ -95,54 +96,43 @@ Concepts
 
     2. decode it into Unicode assuming a fixed per-file encoding
 
-    3. tokenize the Unicode content
+    3. convert it into a UTF-8 byte string
 
-    4. compile it, creating Unicode objects from the given Unicode data
+    4. tokenize the UTF-8 content
+
+    5. compile it, creating Unicode objects from the given Unicode data
        and creating string objects from the Unicode literal data
-       by first reencoding the Unicode data into 8-bit string data
+       by first reencoding the UTF-8 data into 8-bit string data
        using the given file encoding
 
-    5. variable names and other identifiers will be reencoded into
-       8-bit strings using the file encoding to assure backward
-       compatibility with the existing implementation
-
     Note that Python identifiers are restricted to the ASCII
-    subset of the encoding.
+    subset of the encoding, and thus need no further conversion
+    after step 4.
 
 Implementation
 
-    Since changing the Python tokenizer/parser combination will
-    require major changes in the internals of the interpreter and
-    enforcing the use of magic comments in source code files which
-    place non-ASCII characters in string literals, comments
-    and Unicode literals, the proposed solution should be implemented
-    in two phases:
+    For backwards-compatibility with existing code which currently
+    uses non-ASCII in string literals without declaring an encoding,
+    the implementation will be introduced in two phases:
 
-    1. Implement the magic comment detection, but only apply the
-       detected encoding to Unicode literals in the source file.
+    1. Allow non-ASCII in string literals and comments, by internally
+       treating a missing encoding declaration as a declaration of
+       "iso-8859-1". This will cause arbitrary byte strings to
+       correctly round-trip between step 2 and step 5 of the
+       processing, and provide compatibility with Python 2.2 for
+       Unicode literals that contain non-ASCII bytes.
 
-       If no magic comment is used, Python should continue to
-       use the standard [raw-]unicode-escape codecs for Unicode
-       literals.
+       A warning will be issued if non-ASCII bytes are found in the
+       input, once per improperly encoded input file.
 
-       In addition to this step and to aid in the transition to
-       explicit encoding declaration, the tokenizer must check the
-       complete source file for compliance with the declared
-       encoding. If the source file does not properly decode, a single
-       warning is generated per file.
+    2. Remove the warning, and change the default encoding to "ascii".
 
-    2. Change the tokenizer/compiler base string type from char* to
-       Py_UNICODE* and apply the encoding to the complete file.
+    The builtin compile() API will be enhanced to accept Unicode as
+    input. 8-bit string input is subject to the standard procedure for
+    encoding detection as described above.
 
-       Source files which fail to decode cause an error to be raised
-       during compilation.
-
-    The builtin compile() API will be enhanced to accept Unicode as
-    input. 8-bit string input is subject to the standard procedure
-    for encoding detection as decsribed above.
-
-    Martin v. Loewis is working on a patch which implements phase 1.
-    See [1] for details.
+    SUZUKI Hisao is working on a patch; see [2] for details. A patch
+    implementing only phase 1 is available at [1].
 
 Scope
 
@@ -153,7 +143,9 @@ Scope
 References
 
     [1] Phase 1 implementation:
-        http://sourceforge.net/tracker/?func=detail&atid=305470&aid=526840&group_id=5470
+        http://python.org/sf/526840
+    [2] Phase 2 implementation:
+        http://python.org/sf/534304
 
 History
 
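
Editorial note: the following is a small illustrative sketch, not part of
the patch above and not the actual CPython tokenizer change it discusses.
It is written in modern Python with hypothetical helper names, and mirrors
the data flow of the revised "Concepts" section: detect the declared
source encoding, decode the byte string into Unicode, and convert it into
a UTF-8 byte string for the later tokenize/compile steps, with a missing
declaration treated as "iso-8859-1" as in phase 1.

    import re

    # Simplified form of the magic comment this PEP proposes, e.g.
    #   # -*- coding: iso-8859-1 -*-
    CODING_RE = re.compile(rb"coding[:=]\s*([-\w.]+)")

    def detect_encoding(source_bytes, default="iso-8859-1"):
        # Only the first two lines of the file may carry the declaration;
        # a missing declaration falls back to the phase 1 default.
        for line in source_bytes.splitlines()[:2]:
            match = CODING_RE.search(line)
            if match:
                return match.group(1).decode("ascii")
        return default

    def process_source(source_bytes):
        # Step 2: decode the file into Unicode using the per-file encoding.
        text = source_bytes.decode(detect_encoding(source_bytes))
        # Step 3: convert it into a UTF-8 byte string; the tokenizer and
        # compiler (steps 4 and 5) would then operate on this data.
        return text.encode("utf-8")

    example = b"# -*- coding: iso-8859-1 -*-\ns = '\xe4\xf6\xfc'\n"
    print(process_source(example).decode("utf-8"))

The real changes live in the tokenizer and compiler (see the patches
referenced in [1] and [2]); the sketch only illustrates the intermediate
representations involved.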