Adapted to Martin's comments.
parent b40374d85b
commit 4073747f5d
79
pep-0263.txt
@@ -37,6 +37,27 @@ Proposed Solution
 concept changes are necessary with repect to the handling of
 Python source code data.
 
+Defining the Encoding
+
+Python will default to Latin-1 as standard encoding if no other
+encoding hints are given.
+
+To define a source code encoding, a magic comment must
+be placed into the source files either as first or second
+line in the file:
+
+#!/usr/bin/python
+# -*- coding: <encoding name> -*-
+
+To aid with platforms such as Windows, which add Unicode BOM marks
+to the beginning of Unicode files, the UTF-8 signature
+'\xef\xbb\xbf' will be interpreted as 'utf-8' encoding as well
+(even if no magic encoding comment is given).
+
+If a source file uses both the UTF-8 BOM mark signature and a
+magic encoding comment, the only allowed encoding for the comment
+is 'utf-8'. Any other encoding will cause an error.
+
 Concepts
 
 The PEP is based on the following concepts which would have to be
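Note on the hunk above: the detection rules it adds (magic comment on line one or two, UTF-8 signature, Latin-1 default, error on a conflicting comment) can be sketched in a few lines of modern Python. This is only an illustration of the rules as worded in this revision, not the interpreter's implementation. The function name `detect_source_encoding`, the regular expression, and the choice of `SyntaxError` are assumptions made for the example.

```python
import re

UTF8_BOM = b"\xef\xbb\xbf"

# Assumed pattern for the Emacs-style comment shown in the hunk:
#   # -*- coding: <encoding name> -*-
MAGIC_COMMENT = re.compile(
    rb"^[ \t]*#[ \t]*-\*-[ \t]*coding:[ \t]*([-\w.]+)[ \t]*-\*-"
)


def detect_source_encoding(data: bytes) -> str:
    """Return the encoding name implied by the rules in this revision."""
    has_bom = data.startswith(UTF8_BOM)
    if has_bom:
        data = data[len(UTF8_BOM):]

    # Only the first or second line may carry the magic comment.
    declared = None
    for line in data.splitlines()[:2]:
        match = MAGIC_COMMENT.match(line)
        if match:
            declared = match.group(1).decode("ascii").lower()
            break

    if has_bom:
        # With a UTF-8 signature, the comment may only say 'utf-8'.
        if declared not in (None, "utf-8"):
            raise SyntaxError(
                "UTF-8 signature conflicts with declared encoding %r" % declared
            )
        return "utf-8"

    # No hints at all: fall back to Latin-1, as this revision proposes.
    return declared or "latin-1"
```

For example, `detect_source_encoding(b"#!/usr/bin/python\n# -*- coding: iso-8859-15 -*-\n")` yields `"iso-8859-15"`, while a file with neither a signature nor a comment yields `"latin-1"`.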
@@ -47,6 +68,12 @@ Concepts
 result in a decoding error during compilation of the Python
 source code.
 
+Only ASCII compatible encodings are allowed as source code
+encoding to assure that Python language elements other than
+literals and comments remain readable by ASCII processing tools
+and to avoid problems with wide characters encodings such as
+UTF-16.
+
 2. Handling of escape sequences should continue to work as it does
 now, but with all possible source code encodings, that is
 standard string literals (both 8-bit and Unicode) are subject to
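Note on the hunk above: one way to picture the "ASCII compatible" restriction is to test whether a codec encodes the 128 ASCII characters as the same single bytes. The sketch below does only that. It is a necessary condition rather than a complete definition, since it does not check how non-ASCII characters are encoded. The helper name `is_ascii_compatible` is hypothetical.

```python
def is_ascii_compatible(encoding_name: str) -> bool:
    """Rough check: do all ASCII characters encode to their own byte values?"""
    ascii_text = "".join(chr(code) for code in range(128))
    try:
        return ascii_text.encode(encoding_name) == bytes(range(128))
    except (LookupError, UnicodeError):
        # Unknown codec, non-text codec, or a codec that rejects some
        # ASCII characters: treat all of these as not compatible.
        return False
```

Under this check `utf-8` and `latin-1` pass, while `utf-16` fails because every character becomes two bytes and a byte order mark is prepended.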
@@ -71,50 +98,40 @@ Concepts
 8-bit strings using the file encoding to assure backward
 compatibility with the existing implementation
 
-ISSUE:
+Note that Python identifiers are restricted to the ASCII
+subset of the encoding.
 
-Should we restrict identifiers to ASCII ?
-
-To make this backwards compatible, the implementation would have to
-assume Latin-1 as the original file encoding if not given (otherwise,
+For backwards compatibility, the implementation must assume
+Latin-1 as the original file encoding if not given (otherwise,
 binary data currently stored in 8-bit strings wouldn't make the
 roundtrip).
 
-Comment Syntax
+Implementation
 
-The magic comment will use the following syntax. It will have to
-appear as first or second line in the Python source file.
+Since changing the Python tokenizer/parser combination will
+require major changes in the internals of the interpreter, the
+proposed solution should be implemented in two phases:
 
-ISSUE:
+1. Implement the magic comment detection and default encoding
+handling, but only apply the detected encoding to Unicode
+literals in the source file.
 
-Possible choices for the format:
-
-1. Emacs style:
-
-#!/usr/bin/python
-# -*- coding: utf-8; -*-
-
-2. Via a pseudo-option to the interpreter (one which is not used
-by the interpreter):
-
-#!/usr/bin/python --encoding=utf-8
-
-3. Using a special comment format:
-
-#!/usr/bin/python
-#!encoding = 'utf-8'
-
-4. XML-style format:
-
-#!/usr/bin/python
-#?python encoding = 'utf-8'
+2. Change the tokenizer/compiler base string type from char* to
+Py_UNICODE* and apply the encoding to the complete file.
 
 Scope
 
 This PEP only affects Python source code which makes use of the
 proposed magic comment. Without the magic comment in the proposed
 position, Python will treat the source file as it does currently
-to maintain backwards compatibility.
+(using the Latin-1 encoding assumption) to maintain backwards
+compatibility.
+
+History
+
+1.3: Worked in comments by Martin v. Loewis:
+UTF-8 BOM mark detection, Emacs style magic comment,
+two phase approach to the implementation
 
 Copyright
 
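Note on the Latin-1 default in the last hunk: the "roundtrip" argument rests on the fact that Latin-1 maps each of the 256 byte values to exactly one code point, so arbitrary binary data survives decoding and re-encoding. The sketch below illustrates this with modern Python `bytes` standing in for the 8-bit strings the text refers to. The helper name `survives_roundtrip` is hypothetical.

```python
def survives_roundtrip(data: bytes, encoding_name: str) -> bool:
    """Decode and re-encode data; report whether the bytes come back unchanged."""
    try:
        return data.decode(encoding_name).encode(encoding_name) == data
    except UnicodeError:
        # Some byte sequence is not valid in this encoding.
        return False


# Every possible byte value, as might appear in binary data in a string literal.
all_bytes = bytes(range(256))
```

With these definitions, `survives_roundtrip(all_bytes, "latin-1")` is true, whereas the same call with `"utf-8"` or `"ascii"` is false because bytes such as `0x80` cannot be decoded.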