Use Hisao's strategy of converting to UTF-8.

Martin v. Löwis 2002-04-19 17:32:14 +00:00
parent 8a055a911e
commit 6c432cfe3f
1 changed file with 31 additions and 39 deletions


@@ -1,7 +1,8 @@
 PEP: 0263
 Title: Defining Python Source Code Encodings
 Version: $Revision$
-Author: mal@lemburg.com (Marc-André Lemburg)
+Author: mal@lemburg.com (Marc-André Lemburg),
+        loewis@informatik.hu-berlin.de (Martin v. Löwis)
 Status: Draft
 Type: Standards Track
 Python-Version: 2.3
@@ -25,7 +26,7 @@ Problem
     programming environment rather unfriendly to Python users who live
     and work in non-Latin-1 locales such as many of the Asian
     countries. Programmers can write their 8-bit strings using the
-    favourite encoding, but are bound to the "unicode-escape" encoding
+    favorite encoding, but are bound to the "unicode-escape" encoding
     for Unicode literals.
 
 Proposed Solution
@@ -35,7 +36,7 @@ Proposed Solution
     at the top of the file to declare the encoding.
 
     To make Python aware of this encoding declaration a number of
-    concept changes are necessary with repect to the handling of
+    concept changes are necessary with respect to the handling of
     Python source code data.
 
 Defining the Encoding
@@ -95,54 +96,43 @@ Concepts
     2. decode it into Unicode assuming a fixed per-file encoding
 
-    3. tokenize the Unicode content
+    3. convert it into a UTF-8 byte string
 
-    4. compile it, creating Unicode objects from the given Unicode data
+    4. tokenize the UTF-8 content
+
+    5. compile it, creating Unicode objects from the given Unicode data
        and creating string objects from the Unicode literal data
-       by first reencoding the Unicode data into 8-bit string data
+       by first reencoding the UTF-8 data into 8-bit string data
        using the given file encoding
 
-    5. variable names and other identifiers will be reencoded into
-       8-bit strings using the file encoding to assure backward
-       compatibility with the existing implementation
-
        Note that Python identifiers are restricted to the ASCII
-       subset of the encoding.
+       subset of the encoding, and thus need no further conversion
+       after step 4.
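The five-step pipeline introduced in this hunk can be sketched in present-day Python. This is a hypothetical illustration, not the CPython tokenizer: `SOURCE_ENCODING` and the sample source line are assumptions, and step 4 is stubbed out by simply extracting the quoted literal.

```python
# Hypothetical sketch of steps 1-5 (not the actual CPython tokenizer).
SOURCE_ENCODING = "iso-8859-1"  # assumed declared (or defaulted) encoding

raw = b's = "caf\xe9"\n'                 # step 1: read the source file

text = raw.decode(SOURCE_ENCODING)       # step 2: decode into Unicode
utf8 = text.encode("utf-8")              # step 3: convert to a UTF-8 byte string

# Step 4 would tokenize the UTF-8 content; here we only pull out the
# quoted literal to show what the compiler sees.
literal_utf8 = utf8.split(b'"')[1]

# Step 5: string objects are created by reencoding the UTF-8 data back
# into the declared file encoding, so the original bytes survive.
literal = literal_utf8.decode("utf-8").encode(SOURCE_ENCODING)
assert literal == b"caf\xe9"
```

Because the declared encoding is applied symmetrically in steps 2 and 5, the bytes of the literal in the compiled code match the bytes in the source file.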
 Implementation
 
-    Since changing the Python tokenizer/parser combination will
-    require major changes in the internals of the interpreter and
-    enforcing the use of magic comments in source code files which
-    place non-ASCII characters in string literals, comments
-    and Unicode literals, the proposed solution should be implemented
-    in two phases:
+    For backwards-compatibility with existing code which currently
+    uses non-ASCII in string literals without declaring an encoding,
+    the implementation will be introduced in two phases:
 
-    1. Implement the magic comment detection, but only apply the
-       detected encoding to Unicode literals in the source file.
+    1. Allow non-ASCII in string literals and comments, by internally
+       treating a missing encoding declaration as a declaration of
+       "iso-8859-1". This will cause arbitrary byte strings to
+       correctly round-trip between step 2 and step 5 of the
+       processing, and provide compatibility with Python 2.2 for
+       Unicode literals that contain non-ASCII bytes.
 
-       If no magic comment is used, Python should continue to
-       use the standard [raw-]unicode-escape codecs for Unicode
-       literals.
+       A warning will be issued if non-ASCII bytes are found in the
+       input, once per improperly encoded input file.
 
-       In addition to this step and to aid in the transition to
-       explicit encoding declaration, the tokenizer must check the
-       complete source file for compliance with the declared
-       encoding. If the source file does not properly decode, a single
-       warning is generated per file.
+    2. Remove the warning, and change the default encoding to "ascii".
 
-    2. Change the tokenizer/compiler base string type from char* to
-       Py_UNICODE* and apply the encoding to the complete file.
-
-       The builtin compile() API will be enhanced to accept Unicode as
-       input. 8-bit string input is subject to the standard procedure for
-       encoding detection as described above.
+    Source files which fail to decode cause an error to be raised
+    during compilation.
 
-    Martin v. Loewis is working on a patch which implements phase 1.
-    See [1] for details.
+    The builtin compile() API will be enhanced to accept Unicode as
+    input. 8-bit string input is subject to the standard procedure
+    for encoding detection as described above.
+
+    SUZUKI Hisao is working on a patch; see [2] for details. A patch
+    implementing only phase 1 is available at [1].
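Why "iso-8859-1" as the implicit phase-1 default makes arbitrary byte strings round-trip between step 2 and step 5 can be checked directly in present-day Python (the commit itself predates these APIs): Latin-1 maps each byte 0x00-0xFF to the Unicode code point of the same value, so no byte sequence can fail to decode or change under reencoding.

```python
# Every possible byte value, in order.
every_byte = bytes(range(256))

text = every_byte.decode("iso-8859-1")                # step 2, implicit default
utf8 = text.encode("utf-8")                           # step 3
restored = utf8.decode("utf-8").encode("iso-8859-1")  # step 5 reencoding

# Arbitrary byte strings survive the pipeline unchanged.
assert restored == every_byte
```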
Scope
@@ -153,7 +143,9 @@ Scope
 References
 
     [1] Phase 1 implementation:
-        http://sourceforge.net/tracker/?func=detail&atid=305470&aid=526840&group_id=5470
+        http://python.org/sf/526840
 
+    [2] Phase 2 implementation:
+        http://python.org/sf/534304
 
 History
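For context, the encoding declaration this PEP standardizes is a magic comment in the first or second line of the file. A minimal detection sketch in present-day Python, assuming the PEP's published pattern (the regex and the `detect_encoding` helper are illustrative, not part of this diff):

```python
import re

def detect_encoding(source: bytes) -> str:
    """Scan the first two lines for a PEP 263 coding declaration."""
    for line in source.splitlines()[:2]:
        # Pattern modeled on the PEP's "coding[:=]\s*([-\w.]+)".
        m = re.search(rb"coding[:=]\s*([-\w.]+)", line)
        if m:
            return m.group(1).decode("ascii")
    # Assumed phase-2 default; phase 1 falls back to "iso-8859-1".
    return "ascii"

src = b'# -*- coding: iso-8859-1 -*-\ns = "caf\xe9"\n'
print(detect_encoding(src))        # -> iso-8859-1
print(detect_encoding(b"x = 1\n")) # -> ascii
```

The standard library's `tokenize.detect_encoding` later implemented this scan (with UTF-8 BOM handling on top) for all tools that read Python source.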