Changes regarding the default encoding and other minor tweaks.
See history for details.

parent ea3edf8588
commit 2e572852cf

pep-0263.txt | 50
@@ -39,8 +39,10 @@ Proposed Solution
 
 Defining the Encoding
 
-Python will default to Latin-1 as standard encoding if no other
-encoding hints are given.
+Just as in coercion of strings to Unicode, Python will default to
+the interpreter's default encoding (which is ASCII in standard
+Python installations) as standard encoding if no other encoding
+hints are given.
 
 To define a source code encoding, a magic comment must
 be placed into the source files either as first or second
@@ -49,6 +51,11 @@ Defining the Encoding
 #!/usr/bin/python
 # -*- coding: <encoding name> -*-
 
+More precisely, the first or second line must match the regular
+expression "coding[:=]\s*([\w-_]+)". The first group of this
+expression is then interpreted as encoding name. If the encoding
+is unknown to Python, an error is raised during compilation.
+
 To aid with platforms such as Windows, which add Unicode BOM marks
 to the beginning of Unicode files, the UTF-8 signature
 '\xef\xbb\xbf' will be interpreted as 'utf-8' encoding as well
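The detection rule added in the "Defining the Encoding" hunk can be sketched in Python as follows. This is only an illustration under stated assumptions, not the tokenizer implementation: the function name `detect_source_encoding` and its `default` parameter are hypothetical, and the character class is written as `[-\w]` because current Python `re` reports the literal `[\w-_]` from the text as a bad character range.

```python
import codecs
import re

# Pattern from the PEP text, with the hyphen moved to the front of the
# character class (assumption: "[\w-_]" was meant as "word chars or hyphen").
CODING_RE = re.compile(rb"coding[:=]\s*([-\w]+)")
UTF8_BOM = b"\xef\xbb\xbf"


def detect_source_encoding(source: bytes, default: str = "ascii") -> str:
    """Return the declared source encoding, or `default` if none is given.

    Hypothetical helper: checks the UTF-8 signature first, then searches
    only the first two lines for the magic comment.
    """
    if source.startswith(UTF8_BOM):
        return "utf-8"
    for line in source.splitlines()[:2]:
        match = CODING_RE.search(line)
        if match:
            name = match.group(1).decode("ascii")
            # Unknown encodings are an error; codecs.lookup raises LookupError.
            codecs.lookup(name)
            return name
    return default
```

A call such as `detect_source_encoding(b"#!/usr/bin/python\n# -*- coding: latin-1 -*-\n")` returns the name as written in the comment, while a declaration on the third line or later is ignored and the default is returned.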
@@ -101,34 +108,47 @@ Concepts
 Note that Python identifiers are restricted to the ASCII
 subset of the encoding.
 
-For backwards compatibility, the implementation must assume
-Latin-1 as the original file encoding if not given (otherwise,
-binary data currently stored in 8-bit strings wouldn't make the
-roundtrip).
-
 Implementation
 
 Since changing the Python tokenizer/parser combination will
-require major changes in the internals of the interpreter, the
-proposed solution should be implemented in two phases:
+require major changes in the internals of the interpreter and
+enforcing the use of magic comments in source code files which
+place non-default encoding characters in string literals, comments
+and Unicode literals, the proposed solution should be implemented
+in two phases:
 
 1. Implement the magic comment detection and default encoding
 handling, but only apply the detected encoding to Unicode
 literals in the source file.
 
+In addition to this step and to aid in the transition to
+explicit encoding declaration, the tokenizer must check the
+complete source file for compliance with the default encoding
+(which usually is ASCII). If the source file does not properly
+decode, a single warning is generated per file.
+
 2. Change the tokenizer/compiler base string type from char* to
 Py_UNICODE* and apply the encoding to the complete file.
 
+Source files which fail to decode cause an error to be raised
+during compilation.
+
+The builtin compile() API will be enhanced to accept Unicode as
+input. 8-bit string input is subject to the standard procedure
+for encoding detection as described above.
+
 Scope
 
-This PEP only affects Python source code which makes use of the
-proposed magic comment. Without the magic comment in the proposed
-position, Python will treat the source file as it does currently
-(using the Latin-1 encoding assumption) to maintain backwards
-compatibility.
+This PEP intends to provide an upgrade path from the current
+(more-or-less) undefined source code encoding situation to a more
+robust and portable definition.
 
 History
 
+1.7: Added warnings to phase 1 implementation. Replaced the
+Latin-1 default encoding with the interpreter's default
+encoding. Added tweaks to compile().
+1.4 - 1.6: Minor tweaks
 1.3: Worked in comments by Martin v. Loewis:
 UTF-8 BOM mark detection, Emacs style magic comment,
 two phase approach to the implementation
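The phase 1 compliance check added in the Implementation hunk (one warning per file that does not decode with the default encoding) might look roughly like this in Python. It is a sketch only: the real check is meant to live in the tokenizer, and the name `check_default_encoding`, its parameters, and the warning text are all assumptions made for illustration.

```python
import warnings


def check_default_encoding(source: bytes, filename: str,
                           default: str = "ascii") -> bool:
    """Return True if `source` decodes with `default`, else warn once.

    Hypothetical helper: decoding the whole file in one call fails at the
    first offending byte, so at most one warning is issued per file.
    """
    try:
        source.decode(default)
    except UnicodeDecodeError as exc:
        warnings.warn(
            f"{filename}: byte at offset {exc.start} is not valid "
            f"{default}; declare the source encoding with a magic comment",
            stacklevel=2,
        )
        return False
    return True
```

Under phase 2, the same failure would be reported as a compilation error instead of a warning, which in this sketch would amount to letting the `UnicodeDecodeError` propagate.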
@@ -137,10 +157,8 @@ Copyright
 
 This document has been placed in the public domain.
 
-
-
 Local Variables:
 mode: indented-text
 indent-tabs-mode: nil
-fill-column: 70
 End:
+