Adapted to Martin's comments.

Marc-André Lemburg 2002-02-26 10:01:25 +00:00
parent b40374d85b
commit 4073747f5d
1 changed file with 51 additions and 34 deletions


@@ -37,6 +37,27 @@ Proposed Solution
concept changes are necessary with respect to the handling of
Python source code data.
Defining the Encoding
Python will default to Latin-1 as standard encoding if no other
encoding hints are given.
To define a source code encoding, a magic comment must
be placed in the source file as either the first or second
line of the file:
#!/usr/bin/python
# -*- coding: <encoding name> -*-
To aid with platforms such as Windows, which add a Unicode BOM
to the beginning of Unicode files, the UTF-8 signature
'\xef\xbb\xbf' will be interpreted as 'utf-8' encoding as well
(even if no magic encoding comment is given).
If a source file uses both the UTF-8 BOM signature and a
magic encoding comment, the only allowed encoding for the comment
is 'utf-8'. Any other encoding will cause an error.
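The detection rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the interpreter's implementation: the function name detect_source_encoding and the regular expression are hypothetical, and encoding aliases such as 'utf8' are not normalized.

```python
import re

UTF8_SIGNATURE = b"\xef\xbb\xbf"

# Hypothetical pattern for a comment such as "# -*- coding: <encoding name> -*-".
CODING_RE = re.compile(rb"^[ \t]*#.*?coding[:=][ \t]*([-\w.]+)")


def detect_source_encoding(data: bytes) -> str:
    """Return the encoding implied by a UTF-8 signature and/or magic comment."""
    has_signature = data.startswith(UTF8_SIGNATURE)
    if has_signature:
        data = data[len(UTF8_SIGNATURE):]

    # Only the first or second line may carry the magic comment.
    declared = None
    for line in data.splitlines()[:2]:
        match = CODING_RE.match(line)
        if match:
            declared = match.group(1).decode("ascii").lower()
            break

    if has_signature:
        # With a UTF-8 signature, the comment may only name 'utf-8'.
        if declared not in (None, "utf-8"):
            raise SyntaxError(
                "encoding %r conflicts with the UTF-8 signature" % declared
            )
        return "utf-8"

    # No hints at all: fall back to the Latin-1 default described above.
    return declared or "latin-1"
```

Calling detect_source_encoding on the raw bytes of a file that starts with the UTF-8 signature but declares 'latin-1' in its magic comment raises SyntaxError in this sketch, mirroring the "any other encoding will cause an error" rule.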
Concepts
The PEP is based on the following concepts which would have to be
@@ -45,7 +66,13 @@ Concepts
1. The complete Python source file should use a single encoding.
Embedding of differently encoded data is not allowed and will
result in a decoding error during compilation of the Python
source code.
Only ASCII-compatible encodings are allowed as source code
encodings, to ensure that Python language elements other than
literals and comments remain readable by ASCII processing tools
and to avoid problems with wide-character encodings such as
UTF-16.
2. Handling of escape sequences should continue to work as it does
now, but with all possible source code encodings, that is
@@ -71,50 +98,40 @@ Concepts
8-bit strings using the file encoding to assure backward
compatibility with the existing implementation
ISSUE:
Note that Python identifiers are restricted to the ASCII
subset of the encoding.
Should we restrict identifiers to ASCII?
For backwards compatibility, the implementation must assume
Latin-1 as the original file encoding if not given (otherwise,
binary data currently stored in 8-bit strings wouldn't make the
roundtrip).
To make this backwards compatible, the implementation would have to
assume Latin-1 as the original file encoding if not given (otherwise,
binary data currently stored in 8-bit strings wouldn't make the
roundtrip).
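The roundtrip argument can be checked directly. The sketch below is an illustration only; the helper name survives_roundtrip is hypothetical. It decodes every possible byte value and encodes it again: Latin-1 maps each byte to exactly one code point, so the data comes back unchanged, whereas an encoding such as UTF-8 rejects some byte sequences outright.

```python
def survives_roundtrip(raw: bytes, encoding: str) -> bool:
    """Check whether decoding and re-encoding returns the original bytes."""
    try:
        return raw.decode(encoding).encode(encoding) == raw
    except UnicodeDecodeError:
        # Some byte sequences are not valid in this encoding at all.
        return False


all_bytes = bytes(range(256))  # stands in for arbitrary binary data
print(survives_roundtrip(all_bytes, "latin-1"))  # True
print(survives_roundtrip(all_bytes, "utf-8"))    # False
```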
Implementation
Comment Syntax
Since changing the Python tokenizer/parser combination will
require major changes in the internals of the interpreter, the
proposed solution should be implemented in two phases:
The magic comment will use the following syntax. It will have to
appear as the first or second line in the Python source file.
1. Implement the magic comment detection and default encoding
handling, but only apply the detected encoding to Unicode
literals in the source file.
ISSUE:
Possible choices for the format:
1. Emacs style:
#!/usr/bin/python
# -*- coding: utf-8; -*-
2. Via a pseudo-option to the interpreter (one which is not used
by the interpreter):
#!/usr/bin/python --encoding=utf-8
3. Using a special comment format:
#!/usr/bin/python
#!encoding = 'utf-8'
4. XML-style format:
#!/usr/bin/python
#?python encoding = 'utf-8'
2. Change the tokenizer/compiler base string type from char* to
Py_UNICODE* and apply the encoding to the complete file.
Scope
This PEP only affects Python source code which makes use of the
proposed magic comment. Without the magic comment in the proposed
position, Python will treat the source file as it does currently
to maintain backwards compatibility.
(using the Latin-1 encoding assumption) to maintain backwards
compatibility.
History
1.3: Worked in comments by Martin v. Loewis:
UTF-8 BOM mark detection, Emacs style magic comment,
two phase approach to the implementation
Copyright