Adapted to Martin's comments.
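
Reviewer aid, not part of the patch: the detection rules added in this
revision (Latin-1 default, magic comment on the first or second line,
UTF-8 signature handling) can be sketched roughly as below. This is only
an illustrative sketch under assumptions: the function name
detect_source_encoding, the regular expression, and the choice of
SyntaxError are hypothetical and are not specified by the PEP text.

```python
import re

UTF8_BOM = b"\xef\xbb\xbf"

# Assumed pattern for the Emacs-style comment "# -*- coding: <name> -*-".
# The PEP text in this revision does not prescribe an exact regex.
MAGIC_COMMENT = re.compile(
    rb"^[ \t]*#.*?-\*-[ \t]*coding:[ \t]*([-\w.]+)[ \t;]*-\*-"
)


def detect_source_encoding(data: bytes) -> str:
    """Return the encoding name implied by the raw bytes of a source file."""
    has_bom = data.startswith(UTF8_BOM)
    if has_bom:
        data = data[len(UTF8_BOM):]

    declared = None
    # Only the first or second line may carry the magic comment.
    for line in data.splitlines()[:2]:
        match = MAGIC_COMMENT.match(line)
        if match:
            declared = match.group(1).decode("ascii").lower()
            break

    if has_bom:
        # With a UTF-8 signature, 'utf-8' is the only allowed declaration.
        # (Strict spelling check; alias handling is left out of this sketch.)
        if declared not in (None, "utf-8"):
            raise SyntaxError(
                "encoding %r conflicts with the UTF-8 signature" % declared
            )
        return "utf-8"

    # No encoding hints at all: fall back to the Latin-1 default.
    return declared or "latin-1"
```

For example, a file that starts with a shebang line followed by
"# -*- coding: utf-8 -*-" is reported as utf-8, while a file with no
hints at all falls back to latin-1.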

2002-02-26 10:01:25 +00:00 · 4073747f5d
parent b40374d85b
commit 4073747f5d
1 changed file with 51 additions and 34 deletions
--- a/pep-0263.txt
+++ b/pep-0263.txt
@@ -37,6 +37,27 @@ Proposed Solution
    concept changes are necessary with repect to the handling of
    Python source code data.

+Defining the Encoding
+
+    Python will default to Latin-1 as standard encoding if no other
+    encoding hints are given.
+
+    To define a source code encoding, a magic comment must
+    be placed in the source file as either the first or second
+    line of the file:
+
+          #!/usr/bin/python
+          # -*- coding: <encoding name> -*-
+
+    To aid with platforms such as Windows, which add a Unicode BOM
+    to the beginning of Unicode files, the UTF-8 signature
+    '\xef\xbb\xbf' will be interpreted as 'utf-8' encoding as well
+    (even if no magic encoding comment is given).
+
+    If a source file uses both the UTF-8 BOM signature and a
+    magic encoding comment, the only allowed encoding for the comment
+    is 'utf-8'.  Any other encoding will cause an error.
+
 Concepts

    The PEP is based on the following concepts which would have to be
@@ -45,7 +66,13 @@ Concepts
    1. The complete Python source file should use a single encoding.
       Embedding of differently encoded data is not allowed and will
       result in a decoding error during compilation of the Python
-       source code.
+       source code. 
+
+       Only ASCII-compatible encodings are allowed as source code
+       encodings, to ensure that Python language elements other than
+       literals and comments remain readable by ASCII processing tools
+       and to avoid problems with wide character encodings such as
+       UTF-16.

    2. Handling of escape sequences should continue to work as it does 
       now, but with all possible source code encodings, that is
@@ -71,50 +98,40 @@ Concepts
          8-bit strings using the file encoding to assure backward
          compatibility with the existing implementation

-          ISSUE: 
+       Note that Python identifiers are restricted to the ASCII
+       subset of the encoding.

-              Should we restrict identifiers to ASCII ?
+    For backwards compatibility, the implementation must assume
+    Latin-1 as the original file encoding if none is given (otherwise,
+    binary data currently stored in 8-bit strings would not survive
+    the round trip).

-       To make this backwards compatible, the implementation would have to
-       assume Latin-1 as the original file encoding if not given (otherwise,
-       binary data currently stored in 8-bit strings wouldn't make the
-       roundtrip).
+Implementation

-Comment Syntax
+    Since changing the Python tokenizer/parser combination will
+    require major changes in the internals of the interpreter, the
+    proposed solution should be implemented in two phases:

-    The magic comment will use the following syntax. It will have to
-    appear as first or second line in the Python source file.
+    1. Implement the magic comment detection and default encoding
+       handling, but only apply the detected encoding to Unicode
+       literals in the source file.

-    ISSUE:
-
-        Possible choices for the format:
-
-        1. Emacs style:
-
-          #!/usr/bin/python
-          # -*- coding: utf-8; -*-
-
-        2. Via a pseudo-option to the interpreter (one which is not used
-           by the interpreter):
-
-          #!/usr/bin/python --encoding=utf-8
-
-        3. Using a special comment format:
-
-          #!/usr/bin/python
-          #!encoding = 'utf-8'
-
-        4. XML-style format:
-
-          #!/usr/bin/python
-          #?python encoding = 'utf-8'
+    2. Change the tokenizer/compiler base string type from char* to
+       Py_UNICODE* and apply the encoding to the complete file.

 Scope

    This PEP only affects Python source code which makes use of the
    proposed magic comment. Without the magic comment in the proposed
    position, Python will treat the source file as it does currently
-    to maintain backwards compatibility.
+    (using the Latin-1 encoding assumption) to maintain backwards
+    compatibility.
+
+History
+
+    1.3: Incorporated comments by Martin v. Loewis:
+         UTF-8 BOM detection, Emacs-style magic comment,
+         two-phase approach to the implementation

 Copyright