Changed Python's source code encoding default to ASCII.

Added note about handling of Unicode literals in phase 1.
Marc-André Lemburg 2002-03-15 17:07:12 +00:00
parent 130b28e2a4
commit 4903b95001
1 changed file with 17 additions and 16 deletions

@@ -40,10 +40,8 @@ Proposed Solution
 Defining the Encoding
-Just as in coercion of strings to Unicode, Python will default to
-the interpreter's default encoding (which is ASCII in standard
-Python installations) as standard encoding if no other encoding
-hints are given.
+Python will default to ASCII as standard encoding if no other
+encoding hints are given.
 To define a source code encoding, a magic comment must
 be placed into the source files either as first or second
@@ -76,12 +74,12 @@ Concepts
 result in a decoding error during compilation of the Python
 source code.
-Any encoding which allows processing the first two lines in
-the way indicated above is allowed as source code encoding,
-this includes ASCII compatible encodings as well as certain
+Any encoding which allows processing the first two lines in the
+way indicated above is allowed as source code encoding, this
+includes ASCII compatible encodings as well as certain
 multi-byte encodings such as Shift_JIS. It does not include
-encodings which use two or more bytes for all characters
-like e.g. UTF-16. The reason for this is to keep the encoding
+encodings which use two or more bytes for all characters like
+e.g. UTF-16. The reason for this is to keep the encoding
 detection algorithm in the tokenizer simple.
 2. Handling of escape sequences should continue to work as it does
@@ -116,19 +114,22 @@ Implementation
 Since changing the Python tokenizer/parser combination will
 require major changes in the internals of the interpreter and
 enforcing the use of magic comments in source code files which
-place non-default encoding characters in string literals, comments
+place non-ASCII characters in string literals, comments
 and Unicode literals, the proposed solution should be implemented
 in two phases:
-1. Implement the magic comment detection and default encoding
-handling, but only apply the detected encoding to Unicode
-literals in the source file.
+1. Implement the magic comment detection, but only apply the
+detected encoding to Unicode literals in the source file.
+If no magic comment is used, Python should continue to
+use the standard [raw-]unicode-escape codecs for Unicode
+literals.
 In addition to this step and to aid in the transition to
 explicit encoding declaration, the tokenizer must check the
-complete source file for compliance with the default encoding
-(which usually is ASCII). If the source file does not properly
-decode, a single warning is generated per file.
+complete source file for compliance with the declared
+encoding. If the source file does not properly decode, a single
+warning is generated per file.
 2. Change the tokenizer/compiler base string type from char* to
 Py_UNICODE* and apply the encoding to the complete file.