Adapted to Martin's comments.
parent b40374d85b
commit 4073747f5d
pep-0263.txt: 85 lines changed
@@ -37,6 +37,27 @@ Proposed Solution
     concept changes are necessary with respect to the handling of
     Python source code data.

+Defining the Encoding
+
+    Python will default to Latin-1 as standard encoding if no other
+    encoding hints are given.
+
+    To define a source code encoding, a magic comment must
+    be placed into the source files either as first or second
+    line in the file:
+
+        #!/usr/bin/python
+        # -*- coding: <encoding name> -*-
+
+    To aid with platforms such as Windows, which add Unicode BOM marks
+    to the beginning of Unicode files, the UTF-8 signature
+    '\xef\xbb\xbf' will be interpreted as 'utf-8' encoding as well
+    (even if no magic encoding comment is given).
+
+    If a source file uses both the UTF-8 BOM mark signature and a
+    magic encoding comment, the only allowed encoding for the comment
+    is 'utf-8'. Any other encoding will cause an error.
+
 Concepts

     The PEP is based on the following concepts which would have to be
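Review note: the detection rules added in the hunk above (magic comment on the first or second line, UTF-8 signature, Latin-1 default, error on conflict) can be sketched roughly as follows. This is only an illustration and not the interpreter's implementation. The function name `detect_source_encoding`, the regular expression, and the choice of `SyntaxError` are assumptions made for this sketch; the diff itself only specifies the `# -*- coding: <encoding name> -*-` form and says that a conflict "will cause an error".

```python
import codecs
import re

# Hypothetical pattern for the Emacs-style magic comment described in the
# diff. The exact grammar is an assumption made for this sketch.
MAGIC_COMMENT = re.compile(rb"^[ \t]*#.*-\*-\s*coding:\s*([-\w.]+)\s*-\*-")


def detect_source_encoding(source: bytes) -> str:
    """Return the normalized codec name for a source file's raw bytes."""
    # A leading UTF-8 signature counts as an encoding hint on its own.
    has_bom = source.startswith(codecs.BOM_UTF8)
    if has_bom:
        source = source[len(codecs.BOM_UTF8):]

    # Only the first two lines may carry the magic comment.
    declared = None
    for line in source.splitlines()[:2]:
        match = MAGIC_COMMENT.match(line)
        if match:
            # codecs.lookup() normalizes aliases and raises LookupError
            # for unknown encoding names.
            declared = codecs.lookup(match.group(1).decode("ascii")).name
            break

    if has_bom:
        # With a signature present, the comment may only say 'utf-8'.
        if declared is not None and declared != "utf-8":
            raise SyntaxError("UTF-8 signature conflicts with declared "
                              "encoding %r" % declared)
        return "utf-8"
    if declared is None:
        # No hints at all: fall back to the Latin-1 default.
        return codecs.lookup("latin-1").name
    return declared
```

A file starting with `#!/usr/bin/python` followed by `# -*- coding: utf-8 -*-` would be reported as `utf-8`, while a file with no hints falls back to the codec that Python registers for Latin-1.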
@@ -45,7 +66,13 @@ Concepts
     1. The complete Python source file should use a single encoding.
        Embedding of differently encoded data is not allowed and will
        result in a decoding error during compilation of the Python
-       source code.
+       source code.
+
+       Only ASCII compatible encodings are allowed as source code
+       encoding to assure that Python language elements other than
+       literals and comments remain readable by ASCII processing tools
+       and to avoid problems with wide character encodings such as
+       UTF-16.

     2. Handling of escape sequences should continue to work as it does
        now, but with all possible source code encodings, that is
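Review note: the "ASCII compatible" restriction added in the hunk above can be illustrated with a small check. The helper name `is_ascii_compatible` and the test it applies (all 128 ASCII characters must encode to the same bytes as in ASCII) are assumptions made for this sketch; the diff does not define the term formally.

```python
def is_ascii_compatible(encoding: str) -> bool:
    """Heuristic: ASCII text must encode to the same bytes as in ASCII."""
    sample = "".join(chr(code) for code in range(128))
    try:
        return sample.encode(encoding) == sample.encode("ascii")
    except (UnicodeError, LookupError):
        # Unknown codec, or a codec that cannot represent plain ASCII.
        return False


# UTF-8 and Latin-1 keep keywords such as "def" byte-identical to ASCII,
# while UTF-16 adds a byte order mark and a second byte per character.
for name in ("utf-8", "latin-1", "utf-16"):
    print(name, is_ascii_compatible(name))
```

Under this check UTF-16 fails, which matches the hunk's reasoning that wide character encodings would leave keywords unreadable to ASCII processing tools.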
@@ -71,50 +98,40 @@ Concepts
        8-bit strings using the file encoding to assure backward
        compatibility with the existing implementation

-    ISSUE:
+    Note that Python identifiers are restricted to the ASCII
+    subset of the encoding.

-        Should we restrict identifiers to ASCII ?
+    For backwards compatibility, the implementation must assume
+    Latin-1 as the original file encoding if not given (otherwise,
+    binary data currently stored in 8-bit strings wouldn't make the
+    roundtrip).

-    To make this backwards compatible, the implementation would have to
-    assume Latin-1 as the original file encoding if not given (otherwise,
-    binary data currently stored in 8-bit strings wouldn't make the
-    roundtrip).
+Implementation

-Comment Syntax
+    Since changing the Python tokenizer/parser combination will
+    require major changes in the internals of the interpreter, the
+    proposed solution should be implemented in two phases:

-    The magic comment will use the following syntax. It will have to
-    appear as first or second line in the Python source file.
+    1. Implement the magic comment detection and default encoding
+       handling, but only apply the detected encoding to Unicode
+       literals in the source file.

-    ISSUE:
-
-        Possible choices for the format:
-
-        1. Emacs style:
-
-            #!/usr/bin/python
-            # -*- coding: utf-8; -*-
-
-        2. Via a pseudo-option to the interpreter (one which is not used
-           by the interpreter):
-
-            #!/usr/bin/python --encoding=utf-8
-
-        3. Using a special comment format:
-
-            #!/usr/bin/python
-            #!encoding = 'utf-8'
-
-        4. XML-style format:
-
-            #!/usr/bin/python
-            #?python encoding = 'utf-8'
+    2. Change the tokenizer/compiler base string type from char* to
+       Py_UNICODE* and apply the encoding to the complete file.

 Scope

     This PEP only affects Python source code which makes use of the
     proposed magic comment. Without the magic comment in the proposed
     position, Python will treat the source file as it does currently
-    to maintain backwards compatibility.
+    (using the Latin-1 encoding assumption) to maintain backwards
+    compatibility.

+History
+
+    1.3: Worked in comments by Martin v. Loewis:
+         UTF-8 BOM mark detection, Emacs style magic comment,
+         two phase approach to the implementation
+
 Copyright

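Review note: the roundtrip argument for the Latin-1 default in the hunk above can be checked directly. Latin-1 maps each of the 256 byte values to exactly one code point, so arbitrary binary data stored in an 8-bit string survives decoding and re-encoding. The helper name `roundtrips` is an assumption made for this sketch.

```python
def roundtrips(data: bytes, encoding: str) -> bool:
    """Return True if decoding and re-encoding reproduces the same bytes."""
    try:
        return data.decode(encoding).encode(encoding) == data
    except UnicodeError:
        # Some byte sequence is not valid in this encoding.
        return False


# Every possible byte value, standing in for binary data embedded in an
# 8-bit string literal of an existing source file.
all_bytes = bytes(range(256))

for name in ("latin-1", "ascii", "utf-8"):
    print(name, roundtrips(all_bytes, name))
```

Only Latin-1 passes for the full byte range here; ASCII and UTF-8 both reject bytes such as 0x80 on their own, which is why assuming either of them as the default could break existing files.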