Use Hisao's strategy of converting to UTF-8.
parent
8a055a911e
commit
6c432cfe3f
70
pep-0263.txt
|
@ -1,7 +1,8 @@
|
|||
PEP: 0263
|
||||
Title: Defining Python Source Code Encodings
|
||||
Version: $Revision$
|
||||
Author: mal@lemburg.com (Marc-André Lemburg)
|
||||
Author: mal@lemburg.com (Marc-André Lemburg),
|
||||
loewis@informatik.hu-berlin.de (Martin v. Löwis)
|
||||
Status: Draft
|
||||
Type: Standards Track
|
||||
Python-Version: 2.3
|
||||
|
@ -25,7 +26,7 @@ Problem
|
|||
programming environment rather unfriendly to Python users who live
|
||||
and work in non-Latin-1 locales such as many of the Asian
|
||||
countries. Programmers can write their 8-bit strings using the
|
||||
favourite encoding, but are bound to the "unicode-escape" encoding
|
||||
favorite encoding, but are bound to the "unicode-escape" encoding
|
||||
for Unicode literals.
|
||||
|
||||
Proposed Solution
|
||||
|
@ -35,7 +36,7 @@ Proposed Solution
|
|||
at the top of the file to declare the encoding.
|
||||
|
||||
To make Python aware of this encoding declaration a number of
|
||||
concept changes are necessary with repect to the handling of
|
||||
concept changes are necessary with respect to the handling of
|
||||
Python source code data.
|
||||
|
||||
Defining the Encoding
|
||||
|
@ -95,54 +96,43 @@ Concepts
|
|||
|
||||
2. decode it into Unicode assuming a fixed per-file encoding
|
||||
|
||||
3. tokenize the Unicode content
|
||||
3. convert it into a UTF-8 byte string
|
||||
|
||||
4. compile it, creating Unicode objects from the given Unicode data
|
||||
4. tokenize the UTF-8 content
|
||||
|
||||
5. compile it, creating Unicode objects from the given Unicode data
|
||||
and creating string objects from the Unicode literal data
|
||||
by first reencoding the Unicode data into 8-bit string data
|
||||
by first reencoding the UTF-8 data into 8-bit string data
|
||||
using the given file encoding
|
||||
|
||||
5. variable names and other identifiers will be reencoded into
|
||||
8-bit strings using the file encoding to assure backward
|
||||
compatibility with the existing implementation
|
||||
|
||||
Note that Python identifiers are restricted to the ASCII
|
||||
subset of the encoding.
|
||||
subset of the encoding, and thus need no further conversion
|
||||
after step 4.
|
||||
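The revised processing steps (decode with the file encoding, convert to UTF-8, then reencode literal data with the file encoding) can be sketched in modern Python as below. This is only an illustration, not the patch referenced by this commit: the helper names, the simplified magic-comment pattern, and the choice to look only at the first two lines are assumptions, while the "iso-8859-1" fallback mirrors phase 1 of the new text.

```python
import re
from typing import Tuple

# Assumption: a simplified pattern for the encoding declaration;
# this diff does not show the exact rule the PEP uses.
CODING_RE = re.compile(rb"coding[:=]\s*([-\w.]+)")


def detect_encoding(raw: bytes, default: str = "iso-8859-1") -> str:
    """Return the encoding named in a comment near the top, else the default."""
    for line in raw.splitlines()[:2]:  # assumption: only the first two lines count
        if line.lstrip().startswith(b"#"):
            match = CODING_RE.search(line)
            if match:
                return match.group(1).decode("ascii")
    return default


def to_utf8(raw: bytes) -> Tuple[bytes, str]:
    """Steps 2 and 3: decode with the file encoding, then encode as UTF-8."""
    encoding = detect_encoding(raw)
    text = raw.decode(encoding)
    return text.encode("utf-8"), encoding


def reencode(utf8_data: bytes, encoding: str) -> bytes:
    """Step 5 for 8-bit strings: turn UTF-8 data back into the file encoding."""
    return utf8_data.decode("utf-8").encode(encoding)


# Byte 0xA4 is the euro sign in ISO-8859-15.
source = b"# -*- coding: iso-8859-15 -*-\nprice = '\xa4'\n"
utf8_source, declared = to_utf8(source)
```

Because identifiers are restricted to ASCII, their bytes are identical in the UTF-8 form, which is why only string literal data needs the `reencode` step.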
|
||||
Implementation
|
||||
|
||||
Since changing the Python tokenizer/parser combination will
|
||||
require major changes in the internals of the interpreter and
|
||||
enforcing the use of magic comments in source code files which
|
||||
place non-ASCII characters in string literals, comments
|
||||
and Unicode literals, the proposed solution should be implemented
|
||||
in two phases:
|
||||
For backwards-compatibility with existing code which currently
|
||||
uses non-ASCII in string literals without declaring an encoding,
|
||||
the implementation will be introduced in two phases:
|
||||
|
||||
1. Implement the magic comment detection, but only apply the
|
||||
detected encoding to Unicode literals in the source file.
|
||||
1. Allow non-ASCII in string literals and comments, by internally
|
||||
treating a missing encoding declaration as a declaration of
|
||||
"iso-8859-1". This will cause arbitrary byte strings to
|
||||
correctly round-trip between step 2 and step 5 of the
|
||||
processing, and provide compatibility with Python 2.2 for
|
||||
Unicode literals that contain non-ASCII bytes.
|
||||
|
||||
If no magic comment is used, Python should continue to
|
||||
use the standard [raw-]unicode-escape codecs for Unicode
|
||||
literals.
|
||||
A warning will be issued if non-ASCII bytes are found in the
|
||||
input, once per improperly encoded input file.
|
||||
|
||||
In addition to this step and to aid in the transition to
|
||||
explicit encoding declaration, the tokenizer must check the
|
||||
complete source file for compliance with the declared
|
||||
encoding. If the source file does not properly decode, a single
|
||||
warning is generated per file.
|
||||
2. Remove the warning, and change the default encoding to "ascii".
|
||||
|
||||
2. Change the tokenizer/compiler base string type from char* to
|
||||
Py_UNICODE* and apply the encoding to the complete file.
|
||||
The builtin compile() API will be enhanced to accept Unicode as
|
||||
input. 8-bit string input is subject to the standard procedure for
|
||||
encoding detection as described above.
|
||||
|
||||
Source files which fail to decode cause an error to be raised
|
||||
during compilation.
|
||||
|
||||
The builtin compile() API will be enhanced to accept Unicode as
|
||||
input. 8-bit string input is subject to the standard procedure
|
||||
for encoding detection as decsribed above.
|
||||
|
||||
Martin v. Loewis is working on a patch which implements phase 1.
|
||||
See [1] for details.
|
||||
SUZUKI Hisao is working on a patch; see [2] for details. A patch
|
||||
implementing only phase 1 is available at [1].
|
||||
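The phase 1 claim that arbitrary byte strings round-trip under an implied "iso-8859-1" declaration can be checked directly, since that encoding maps every byte value to a code point. The sketch below is an illustration in modern Python with hypothetical function names; it does not reproduce either referenced patch, and it models the phase 2 behaviour simply as a strict "ascii" decode.

```python
def round_trips_as_latin1(raw: bytes) -> bool:
    """Decode as iso-8859-1, convert to UTF-8, and reencode; compare with the input."""
    utf8 = raw.decode("iso-8859-1").encode("utf-8")
    return utf8.decode("utf-8").encode("iso-8859-1") == raw


def needs_warning(raw: bytes) -> bool:
    """Phase 1: an undeclared file containing non-ASCII bytes triggers the warning."""
    return any(byte > 127 for byte in raw)


def phase2_accepts(raw: bytes) -> bool:
    """Phase 2: with the default changed to "ascii", non-ASCII input fails to decode."""
    try:
        raw.decode("ascii")
    except UnicodeDecodeError:
        return False
    return True


every_byte = bytes(range(256))
legacy_source = b"s = '\xe9'\n"  # undeclared file with one non-ASCII byte
```

Here `round_trips_as_latin1(every_byte)` holds for all 256 byte values, while `legacy_source` would draw the phase 1 warning and be rejected once the default becomes "ascii".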
|
||||
Scope
|
||||
|
||||
|
@ -153,7 +143,9 @@ Scope
|
|||
References
|
||||
|
||||
[1] Phase 1 implementation:
|
||||
http://sourceforge.net/tracker/?func=detail&atid=305470&aid=526840&group_id=5470
|
||||
http://python.org/sf/526840
|
||||
[2] Phase 2 implementation:
|
||||
http://python.org/sf/534304
|
||||
|
||||
History
|
||||
|
||||