Adapted to Martin's comments.

2002-02-26 10:01:25 +00:00 · 4073747f5d
parent b40374d85b
commit 4073747f5d
1 changed file with 51 additions and 34 deletions
--- a/pep-0263.txt
+++ b/pep-0263.txt
@@ -37,6 +37,27 @@ Proposed Solution
    concept changes are necessary with repect to the handling of
    Python source code data.
+Defining the Encoding
+    Python will default to Latin-1 as standard encoding if no other
+    encoding hints are given.
+    To define a source code encoding, a magic comment must
+    be placed into the source files either as first or second
+    line in the file:
+          #!/usr/bin/python
+          # -*- coding: <encoding name> -*-
+    To aid with platforms such as Windows, which add Unicode BOM marks
+    to the beginning of Unicode files, the UTF-8 signature
+    '\xef\xbb\xbf' will be interpreted as 'utf-8' encoding as well
+    (even if no magic encoding comment is given).
+    If a source file uses both the UTF-8 BOM mark signature and a
+    magic encoding comment, the only allowed encoding for the comment
+    is 'utf-8'.  Any other encoding will cause an error.
 Concepts
    The PEP is based on the following concepts which would have to be
@@ -47,6 +68,12 @@ Concepts
       result in a decoding error during compilation of the Python
       source code. 
+       Only ASCII compatible encodings are allowed as source code
+       encoding to assure that Python language elements other than
+       literals and comments remain readable by ASCII processing tools
+       and to avoid problems with wide characters encodings such as
+       UTF-16.
    2. Handling of escape sequences should continue to work as it does 
       now, but with all possible source code encodings, that is
       standard string literals (both 8-bit and Unicode) are subject to 
@@ -71,50 +98,40 @@ Concepts
          8-bit strings using the file encoding to assure backward
          compatibility with the existing implementation
-          ISSUE:
-              Should we restrict identifiers to ASCII ?
+       Note that Python identifiers are restricted to the ASCII
+       subset of the encoding.
-       To make this backwards compatible, the implementation would have to
-       assume Latin-1 as the original file encoding if not given (otherwise,
+    For backwards compatibility, the implementation must assume
+    Latin-1 as the original file encoding if not given (otherwise,
     binary data currently stored in 8-bit strings wouldn't make the
     roundtrip).
-Comment Syntax
+Implementation
-    The magic comment will use the following syntax. It will have to
-    appear as first or second line in the Python source file.
-    ISSUE:
-        Possible choices for the format:
-        1. Emacs style:
-          #!/usr/bin/python
-          # -*- coding: utf-8; -*-
-        2. Via a pseudo-option to the interpreter (one which is not used
-           by the interpreter):
-          #!/usr/bin/python --encoding=utf-8
-        3. Using a special comment format:
-          #!/usr/bin/python
-          #!encoding = 'utf-8'
-        4. XML-style format:
-          #!/usr/bin/python
-          #?python encoding = 'utf-8'
+    Since changing the Python tokenizer/parser combination will
+    require major changes in the internals of the interpreter, the
+    proposed solution should be implemented in two phases:
+    1. Implement the magic comment detection and default encoding
+       handling, but only apply the detected encoding to Unicode
+       literals in the source file.
+    2. Change the tokenizer/compiler base string type from char* to
+       Py_UNICODE* and apply the encoding to the complete file.
 Scope
    This PEP only affects Python source code which makes use of the
    proposed magic comment. Without the magic comment in the proposed
    position, Python will treat the source file as it does currently
-    to maintain backwards compatibility.
+    (using the Latin-1 encoding assumption) to maintain backwards
+    compatibility.
+History
+    1.3: Worked in comments by Martin v. Loewis:
+         UTF-8 BOM mark detection, Emacs style magic comment,
+         two phase approach to the implementation
 Copyright
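
Review note (not part of the patch): the detection rules added under "Defining the Encoding" can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the interpreter's implementation. The function name `detect_source_encoding` and the regular expression are hypothetical; the patch gives only the comment form `# -*- coding: <encoding name> -*-`, and the strict comparison against the literal name 'utf-8' is likewise an assumption.

```python
import re

UTF8_BOM = b"\xef\xbb\xbf"

# Hypothetical pattern for the Emacs-style magic comment described in the
# patch; the exact grammar is an assumption made for this sketch.
MAGIC_COMMENT = re.compile(rb"^[ \t]*#[ \t]*-\*-[ \t]*coding:[ \t]*([-\w.]+)")


def detect_source_encoding(data: bytes) -> str:
    """Return the encoding name implied by the BOM and/or magic comment."""
    has_bom = data.startswith(UTF8_BOM)
    if has_bom:
        data = data[len(UTF8_BOM):]

    # The magic comment only counts on the first or second line.
    declared = None
    for line in data.splitlines()[:2]:
        match = MAGIC_COMMENT.match(line)
        if match:
            declared = match.group(1).decode("ascii").lower()
            break

    if has_bom:
        # With a UTF-8 signature, the only allowed declaration is 'utf-8'.
        if declared not in (None, "utf-8"):
            raise SyntaxError(
                "encoding %r conflicts with the UTF-8 BOM" % declared
            )
        return "utf-8"

    # No hints at all: fall back to the Latin-1 default named in the patch.
    return declared or "latin-1"
```

For example, `detect_source_encoding(b"#!/usr/bin/python\n# -*- coding: utf-8 -*-\n")` returns `'utf-8'`, while a file with neither a BOM nor a magic comment falls back to `'latin-1'`.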