diff --git a/pep-0263.txt b/pep-0263.txt
index e03e762d9..a79be3601 100644
--- a/pep-0263.txt
+++ b/pep-0263.txt
@@ -37,6 +37,27 @@ Proposed Solution
     concept changes are necessary with repect to the handling of
     Python source code data.
 
+Defining the Encoding
+
+    Python will default to Latin-1 as standard encoding if no other
+    encoding hints are given.
+
+    To define a source code encoding, a magic comment must
+    be placed into the source files either as first or second
+    line in the file:
+
+          #!/usr/bin/python
+          # -*- coding: <encoding name> -*-
+
+    To aid with platforms such as Windows, which add Unicode BOM marks
+    to the beginning of Unicode files, the UTF-8 signature
+    '\xef\xbb\xbf' will be interpreted as 'utf-8' encoding as well
+    (even if no magic encoding comment is given).
+
+    If a source file uses both the UTF-8 BOM mark signature and a
+    magic encoding comment, the only allowed encoding for the comment
+    is 'utf-8'. Any other encoding will cause an error.
+
 Concepts
 
     The PEP is based on the following concepts which would have to be
@@ -45,7 +66,13 @@ Concepts
     1. The complete Python source file should use a single encoding.
        Embedding of differently encoded data is not allowed and will
        result in a decoding error during compilation of the Python
-       source code.
+       source code.
+
+       Only ASCII compatible encodings are allowed as source code
+       encoding to assure that Python language elements other than
+       literals and comments remain readable by ASCII processing tools
+       and to avoid problems with wide characters encodings such as
+       UTF-16.
 
     2. Handling of escape sequences should continue to work as it does
        now, but with all possible source code encodings, that is
@@ -71,50 +98,40 @@ Concepts
        8-bit strings using the file encoding to assure backward
        compatibility with the existing implementation
 
-    ISSUE:
+    Note that Python identifiers are restricted to the ASCII
+    subset of the encoding.
 
-    Should we restrict identifiers to ASCII ?
+    For backwards compatibility, the implementation must assume
+    Latin-1 as the original file encoding if not given (otherwise,
+    binary data currently stored in 8-bit strings wouldn't make the
+    roundtrip).
 
-    To make this backwards compatible, the implementation would have to
-    assume Latin-1 as the original file encoding if not given (otherwise,
-    binary data currently stored in 8-bit strings wouldn't make the
-    roundtrip).
+Implementation
 
-Comment Syntax
+    Since changing the Python tokenizer/parser combination will
+    require major changes in the internals of the interpreter, the
+    proposed solution should be implemented in two phases:
 
-    The magic comment will use the following syntax. It will have to
-    appear as first or second line in the Python source file.
+    1. Implement the magic comment detection and default encoding
+       handling, but only apply the detected encoding to Unicode
+       literals in the source file.
 
-    ISSUE:
-
-    Possible choices for the format:
-
-    1. Emacs style:
-
-          #!/usr/bin/python
-          # -*- coding: utf-8; -*-
-
-    2. Via a pseudo-option to the interpreter (one which is not used
-       by the interpreter):
-
-          #!/usr/bin/python --encoding=utf-8
-
-    3. Using a special comment format:
-
-          #!/usr/bin/python
-          #!encoding = 'utf-8'
-
-    4. XML-style format:
-
-          #!/usr/bin/python
-          #?python encoding = 'utf-8'
+    2. Change the tokenizer/compiler base string type from char* to
+       Py_UNICODE* and apply the encoding to the complete file.
 
 Scope
 
     This PEP only affects Python source code which makes use of the
     proposed magic comment. Without the magic comment in the proposed
     position, Python will treat the source file as it does currently
-    to maintain backwards compatibility.
+    (using the Latin-1 encoding assumption) to maintain backwards
+    compatibility.
+
+History
+
+    1.3: Worked in comments by Martin v. Loewis:
+         UTF-8 BOM mark detection, Emacs style magic comment,
+         two phase approach to the implementation
 
 Copyright
 
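
Note on the "Defining the Encoding" hunk: the detection rules it adds (UTF-8 signature, magic comment on the first or second line, Latin-1 fallback, error on a conflicting comment) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the interpreter's implementation: the function name `detect_source_encoding`, the regular expression `MAGIC`, and the choice of `SyntaxError` are hypothetical, and the encoding name is compared literally after lowercasing, so aliases such as `utf8` are not normalized.

```python
import re

# UTF-8 signature named in the patch.
UTF8_BOM = b"\xef\xbb\xbf"

# Hypothetical pattern for the Emacs-style comment "# -*- coding: NAME -*-".
# The patch shows the comment format but does not specify a regex.
MAGIC = re.compile(rb"^[ \t]*#.*?-\*-[ \t]*coding:[ \t]*([-\w.]+)[ \t]*-\*-")


def detect_source_encoding(data: bytes) -> str:
    """Return the source encoding implied by the rules in the patch."""
    has_bom = data.startswith(UTF8_BOM)
    if has_bom:
        data = data[len(UTF8_BOM):]

    # The magic comment only counts on the first or second line.
    declared = None
    for line in data.splitlines()[:2]:
        match = MAGIC.match(line)
        if match:
            declared = match.group(1).decode("ascii").lower()
            break

    if has_bom:
        # With a BOM, the only allowed declared encoding is 'utf-8'.
        if declared not in (None, "utf-8"):
            raise SyntaxError(
                "encoding comment %r conflicts with UTF-8 BOM" % declared
            )
        return "utf-8"

    # No hints at all: fall back to Latin-1, as the patch specifies.
    return declared or "latin-1"
```

For example, `detect_source_encoding(b"#!/usr/bin/python\n# -*- coding: utf-8 -*-\n")` yields `'utf-8'`, while a file with no BOM and no comment yields `'latin-1'`. A comment on the third line or later is ignored by design.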