Changed Python's source code encoding default to ASCII.
Added note about handling of Unicode literals in phase 1.
parent 130b28e2a4
commit 4903b95001

pep-0263.txt (33 lines changed)
@@ -40,10 +40,8 @@ Proposed Solution
 
 Defining the Encoding
 
-    Just as in coercion of strings to Unicode, Python will default to
-    the interpreter's default encoding (which is ASCII in standard
-    Python installations) as standard encoding if no other encoding
-    hints are given.
+    Python will default to ASCII as standard encoding if no other
+    encoding hints are given.
 
     To define a source code encoding, a magic comment must
     be placed into the source files either as first or second
@@ -76,12 +74,12 @@ Concepts
        result in a decoding error during compilation of the Python
        source code.
 
-       Any encoding which allows processing the first two lines in
-       the way indicated above is allowed as source code encoding,
-       this includes ASCII compatible encodings as well as certain
+       Any encoding which allows processing the first two lines in the
+       way indicated above is allowed as source code encoding, this
+       includes ASCII compatible encodings as well as certain
        multi-byte encodings such as Shift_JIS. It does not include
-       encodings which use two or more bytes for all characters
-       like e.g. UTF-16. The reason for this is to keep the encoding
+       encodings which use two or more bytes for all characters like
+       e.g. UTF-16. The reason for this is to keep the encoding
        detection algorithm in the tokenizer simple.
 
     2. Handling of escape sequences should continue to work as it does
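The detection rule in the hunks above (a magic comment on the first or second line, ASCII as the fallback) can be illustrated with a short Python sketch. This is not the tokenizer's implementation: the function name `declared_encoding` and the regular expression are assumptions made for illustration, since the diff does not spell out the exact comment syntax.

```python
import re

# Assumed pattern for the magic comment; the hunks above do not show the
# exact syntax, so this follows the common "coding: <name>" form.
MAGIC_COMMENT = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-\w.]+)")


def declared_encoding(source: bytes, default: str = "ascii") -> str:
    """Return the encoding named on line 1 or 2 of `source`, else `default`."""
    for line in source.splitlines()[:2]:
        match = MAGIC_COMMENT.match(line)
        if match:
            return match.group(1).decode("ascii")
    return default


# A declaration on the second line (after a shebang line) is honoured.
print(declared_encoding(b"#!/usr/bin/python\n# -*- coding: latin-1 -*-\n"))

# UTF-16 uses two or more bytes per character, so a byte-level scan of the
# first two lines never sees the comment and the default is returned.
print(declared_encoding("# coding: utf-8\n".encode("utf-16")))
```

The second call hints at why encodings such as UTF-16 are excluded: a scan that works on raw bytes stays simple only if the first two lines are readable before the encoding is known.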
@@ -116,19 +114,22 @@ Implementation
     Since changing the Python tokenizer/parser combination will
     require major changes in the internals of the interpreter and
     enforcing the use of magic comments in source code files which
-    place non-default encoding characters in string literals, comments
+    place non-ASCII characters in string literals, comments
     and Unicode literals, the proposed solution should be implemented
     in two phases:
 
-    1. Implement the magic comment detection and default encoding
-       handling, but only apply the detected encoding to Unicode
-       literals in the source file.
+    1. Implement the magic comment detection, but only apply the
+       detected encoding to Unicode literals in the source file.
+
+       If no magic comment is used, Python should continue to
+       use the standard [raw-]unicode-escape codecs for Unicode
+       literals.
 
        In addition to this step and to aid in the transition to
        explicit encoding declaration, the tokenizer must check the
-       complete source file for compliance with the default encoding
-       (which usually is ASCII). If the source file does not properly
-       decode, a single warning is generated per file.
+       complete source file for compliance with the declared
+       encoding. If the source file does not properly decode, a single
+       warning is generated per file.
 
     2. Change the tokenizer/compiler base string type from char* to
        Py_UNICODE* and apply the encoding to the complete file.
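The phase 1 compliance check described in this hunk could look roughly like the following Python sketch. It is an illustration under stated assumptions, not the tokenizer's code: `check_source` is a hypothetical helper, and using the `warnings` module to report the problem is a choice made here for the example.

```python
import warnings


def check_source(source: bytes, filename: str, encoding: str = "ascii") -> bool:
    """Check that the complete source decodes with `encoding`.

    Hypothetical helper: emits at most one warning per call (that is, per
    file), because decoding stops at the first offending byte.
    """
    try:
        source.decode(encoding)
    except UnicodeDecodeError as exc:
        warnings.warn(
            f"{filename}: source does not decode as {encoding} "
            f"({exc.reason} at byte {exc.start})",
            stacklevel=2,
        )
        return False
    return True


# Two non-ASCII bytes in one file still produce a single warning.
check_source(b"s = '\xe9\xe9'\n", "bad.py")
```

Passing the encoding found in the magic comment as `encoding`, and leaving the default of `"ascii"` when no comment is present, matches the wording of the added lines ("compliance with the declared encoding").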