From 6c432cfe3f9bb61477d53560ebfa0f2d60040ce8 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Martin=20v=2E=20L=C3=B6wis?=
Date: Fri, 19 Apr 2002 17:32:14 +0000
Subject: [PATCH] Use Hisao's strategy of converting to UTF-8.

---
 pep-0263.txt | 70 +++++++++++++++++++++++-----------------------------
 1 file changed, 31 insertions(+), 39 deletions(-)

diff --git a/pep-0263.txt b/pep-0263.txt
index 3f69795e5..c7e6a8b90 100644
--- a/pep-0263.txt
+++ b/pep-0263.txt
@@ -1,7 +1,8 @@
 PEP: 0263
 Title: Defining Python Source Code Encodings
 Version: $Revision$
-Author: mal@lemburg.com (Marc-André Lemburg)
+Author: mal@lemburg.com (Marc-André Lemburg),
+        loewis@informatik.hu-berlin.de (Martin v. Löwis)
 Status: Draft
 Type: Standards Track
 Python-Version: 2.3
@@ -25,7 +26,7 @@ Problem
     programming environment rather unfriendly to Python users who
     live and work in non-Latin-1 locales such as many of the Asian
     countries. Programmers can write their 8-bit strings using the
-    favourite encoding, but are bound to the "unicode-escape" encoding
+    favorite encoding, but are bound to the "unicode-escape" encoding
     for Unicode literals.
 
 Proposed Solution
@@ -35,7 +36,7 @@ Proposed Solution
     at the top of the file to declare the encoding.
 
     To make Python aware of this encoding declaration a number of
-    concept changes are necessary with repect to the handling of
+    concept changes are necessary with respect to the handling of
     Python source code data.
 
 Defining the Encoding
@@ -95,54 +96,43 @@ Concepts
 
     2. decode it into Unicode assuming a fixed per-file encoding
 
-    3. tokenize the Unicode content
+    3. convert it into a UTF-8 byte string
 
-    4. compile it, creating Unicode objects from the given Unicode data
+    4. tokenize the UTF-8 content
+
+    5. compile it, creating Unicode objects from the given Unicode data
        and creating string objects from the Unicode literal data
-       by first reencoding the Unicode data into 8-bit string data
+       by first reencoding the UTF-8 data into 8-bit string data
        using the given file encoding
 
-    5. variable names and other identifiers will be reencoded into
-       8-bit strings using the file encoding to assure backward
-       compatibility with the existing implementation
-
     Note that Python identifiers are restricted to the ASCII
-    subset of the encoding.
+    subset of the encoding, and thus need no further conversion
+    after step 4.
 
 Implementation
 
-    Since changing the Python tokenizer/parser combination will
-    require major changes in the internals of the interpreter and
-    enforcing the use of magic comments in source code files which
-    place non-ASCII characters in string literals, comments
-    and Unicode literals, the proposed solution should be implemented
-    in two phases:
+    For backwards-compatibility with existing code which currently
+    uses non-ASCII in string literals without declaring an encoding,
+    the implementation will be introduced in two phases:
 
-    1. Implement the magic comment detection, but only apply the
-       detected encoding to Unicode literals in the source file.
+    1. Allow non-ASCII in string literals and comments, by internally
+       treating a missing encoding declaration as a declaration of
+       "iso-8859-1". This will cause arbitrary byte strings to
+       correctly round-trip between step 2 and step 5 of the
+       processing, and provide compatibility with Python 2.2 for
+       Unicode literals that contain non-ASCII bytes.
 
-       If no magic comment is used, Python should continue to
-       use the standard [raw-]unicode-escape codecs for Unicode
-       literals.
+       A warning will be issued if non-ASCII bytes are found in the
+       input, once per improperly encoded input file.
 
-       In addition to this step and to aid in the transition to
-       explicit encoding declaration, the tokenizer must check the
-       complete source file for compliance with the declared
-       encoding. If the source file does not properly decode, a single
-       warning is generated per file.
+    2. Remove the warning, and change the default encoding to "ascii".
 
-    2. Change the tokenizer/compiler base string type from char* to
-       Py_UNICODE* and apply the encoding to the complete file.
+    The builtin compile() API will be enhanced to accept Unicode as
+    input. 8-bit string input is subject to the standard procedure for
+    encoding detection as described above.
 
-       Source files which fail to decode cause an error to be raised
-       during compilation.
-
-    The builtin compile() API will be enhanced to accept Unicode as
-    input. 8-bit string input is subject to the standard procedure
-    for encoding detection as decsribed above.
-
-    Martin v. Loewis is working on a patch which implements phase 1.
-    See [1] for details.
+    SUZUKI Hisao is working on a patch; see [2] for details. A patch
+    implementing only phase 1 is available at [1].
 
 Scope
 
@@ -153,7 +143,9 @@ Scope
 References
 
     [1] Phase 1 implementation:
-        http://sourceforge.net/tracker/?func=detail&atid=305470&aid=526840&group_id=5470
+        http://python.org/sf/526840
+    [2] Phase 2 implementation:
+        http://python.org/sf/534304
 
 History
 
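
Editorial note: the following is a small illustrative sketch, not part of
the patch above and not the actual CPython tokenizer change it discusses.
It is written in modern Python with hypothetical helper names, and mirrors
the data flow of the revised "Concepts" section: detect the declared
source encoding, decode the byte string into Unicode, and convert it into
a UTF-8 byte string for the later tokenize/compile steps, with a missing
declaration treated as "iso-8859-1" as in phase 1.

    import re

    # Simplified form of the magic comment this PEP proposes, e.g.
    #   # -*- coding: iso-8859-1 -*-
    CODING_RE = re.compile(rb"coding[:=]\s*([-\w.]+)")

    def detect_encoding(source_bytes, default="iso-8859-1"):
        # Only the first two lines of the file may carry the declaration;
        # a missing declaration falls back to the phase 1 default.
        for line in source_bytes.splitlines()[:2]:
            match = CODING_RE.search(line)
            if match:
                return match.group(1).decode("ascii")
        return default

    def process_source(source_bytes):
        # Step 2: decode the file into Unicode using the per-file encoding.
        text = source_bytes.decode(detect_encoding(source_bytes))
        # Step 3: convert it into a UTF-8 byte string; the tokenizer and
        # compiler (steps 4 and 5) would then operate on this data.
        return text.encode("utf-8")

    example = b"# -*- coding: iso-8859-1 -*-\ns = '\xe4\xf6\xfc'\n"
    print(process_source(example).decode("utf-8"))

The real changes live in the tokenizer and compiler (see the patches
referenced in [1] and [2]); the sketch only illustrates the intermediate
representations involved.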