Changes regarding the default encoding and other minor tweaks.
See history for details.
parent ea3edf8588
commit 2e572852cf

pep-0263.txt (52 lines changed)
@@ -39,8 +39,10 @@ Proposed Solution
 
 Defining the Encoding
 
-Python will default to Latin-1 as standard encoding if no other
-encoding hints are given.
+Just as in coercion of strings to Unicode, Python will default to
+the interpreter's default encoding (which is ASCII in standard
+Python installations) as standard encoding if no other encoding
+hints are given.
 
 To define a source code encoding, a magic comment must
 be placed into the source files either as first or second
@@ -49,6 +51,11 @@ Defining the Encoding
 #!/usr/bin/python
 # -*- coding: <encoding name> -*-
 
+More precisely, the first or second line must match the regular
+expression "coding[:=]\s*([\w-_]+)". The first group of this
+expression is then interpreted as encoding name. If the encoding
+is unknown to Python, an error is raised during compilation.
+
 To aid with platforms such as Windows, which add Unicode BOM marks
 to the beginning of Unicode files, the UTF-8 signature
 '\xef\xbb\xbf' will be interpreted as 'utf-8' encoding as well
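
The detection rule added in the hunk above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tokenizer change itself: the function name `detect_source_encoding` is hypothetical, the character class is written as `[-\w]` because current Python 3 `re` rejects `[\w-_]` as a bad range, the default of `"ascii"` follows the "interpreter's default encoding" wording, and letting the UTF-8 signature take precedence over a magic comment is an assumption the text does not settle.

```python
import codecs
import re

# Pattern adapted from the PEP text. The class is spelled [-\w] (assumption)
# so that it compiles under Python 3's re module.
CODING_RE = re.compile(rb"coding[:=]\s*([-\w]+)")


def detect_source_encoding(source: bytes, default: str = "ascii") -> str:
    """Return the encoding declared in `source`, or `default` if none is.

    Hypothetical helper: only the first two lines are inspected, as the
    text requires, and an unknown encoding name raises LookupError.
    """
    # A UTF-8 signature at the start of the file is read as 'utf-8'.
    if source.startswith(codecs.BOM_UTF8):
        return "utf-8"
    for line in source.splitlines()[:2]:
        match = CODING_RE.search(line)
        if match:
            name = match.group(1).decode("ascii")
            codecs.lookup(name)  # raises LookupError for unknown names
            return name
    return default
```

A magic comment on the third line or later is deliberately ignored, so such a file falls back to the default encoding.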
@@ -66,7 +73,7 @@ Concepts
 1. The complete Python source file should use a single encoding.
 Embedding of differently encoded data is not allowed and will
 result in a decoding error during compilation of the Python
-source code.
+source code.
 
 Only ASCII compatible encodings are allowed as source code
 encoding to assure that Python language elements other than
@@ -101,34 +108,47 @@ Concepts
 Note that Python identifiers are restricted to the ASCII
 subset of the encoding.
 
-For backwards compatibility, the implementation must assume
-Latin-1 as the original file encoding if not given (otherwise,
-binary data currently stored in 8-bit strings wouldn't make the
-roundtrip).
-
 Implementation
 
 Since changing the Python tokenizer/parser combination will
-require major changes in the internals of the interpreter, the
-proposed solution should be implemented in two phases:
+require major changes in the internals of the interpreter and
+enforcing the use of magic comments in source code files which
+place non-default encoding characters in string literals, comments
+and Unicode literals, the proposed solution should be implemented
+in two phases:
 
 1. Implement the magic comment detection and default encoding
 handling, but only apply the detected encoding to Unicode
 literals in the source file.
 
+In addition to this step and to aid in the transition to
+explicit encoding declaration, the tokenizer must check the
+complete source file for compliance with the default encoding
+(which usually is ASCII). If the source file does not properly
+decode, a single warning is generated per file.
+
 2. Change the tokenizer/compiler base string type from char* to
 Py_UNICODE* and apply the encoding to the complete file.
 
+Source files which fail to decode cause an error to be raised
+during compilation.
+
+The builtin compile() API will be enhanced to accept Unicode as
+input. 8-bit string input is subject to the standard procedure
+for encoding detection as described above.
+
 Scope
 
-This PEP only affects Python source code which makes use of the
-proposed magic comment. Without the magic comment in the proposed
-position, Python will treat the source file as it does currently
-(using the Latin-1 encoding assumption) to maintain backwards
-compatibility.
+This PEP intends to provide an upgrade path from the current
+(more-or-less) undefined source code encoding situation to a more
+robust and portable definition.
 
 History
 
+1.7: Added warnings to phase 1 implementation. Replaced the
+Latin-1 default encoding with the interpreter's default
+encoding. Added tweaks to compile().
+1.4 - 1.6: Minor tweaks
 1.3: Worked in comments by Martin v. Loewis:
 UTF-8 BOM mark detection, Emacs style magic comment,
 two phase approach to the implementation
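
The difference between the two implementation phases in the hunk above (one warning per file in phase 1, a compile-time error in phase 2) can be illustrated roughly as follows. This is a sketch under assumptions, not the proposed tokenizer code: `check_phase1` and `decode_phase2` are hypothetical names, the warning text is invented, and raising `SyntaxError` stands in for "an error to be raised during compilation", which the text does not specify further.

```python
import warnings


def check_phase1(source: bytes, filename: str, default: str = "ascii") -> bool:
    """Phase 1 sketch: check the whole file against the default encoding.

    Returns True if the file decodes cleanly. Otherwise it emits a single
    warning for the file and returns False; nothing is rejected yet.
    """
    try:
        source.decode(default)
    except UnicodeDecodeError:
        warnings.warn(
            f"{filename}: bytes outside {default} but no encoding declared",
            stacklevel=2,
        )
        return False
    return True


def decode_phase2(source: bytes, encoding: str) -> str:
    """Phase 2 sketch: decode the complete file; failure is a hard error."""
    try:
        return source.decode(encoding)
    except UnicodeDecodeError as exc:
        # Assumption: SyntaxError models the compile-time error.
        raise SyntaxError(f"cannot decode source as {encoding}: {exc}") from exc
```

In this reading, a file containing the byte `0xe9` without a magic comment would only trigger a warning in phase 1, but would stop compilation in phase 2 unless an encoding such as Latin-1 is declared.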
@@ -137,10 +157,8 @@ Copyright
 
 This document has been placed in the public domain.
 
-
-
 Local Variables:
 mode: indented-text
 indent-tabs-mode: nil
 fill-column: 70
 End: