Changes regarding the default encoding and other minor tweaks.

See history for details.
2002-02-27 11:07:16 +00:00 · 2e572852cf
parent ea3edf8588
commit 2e572852cf
1 changed file with 35 additions and 17 deletions
--- a/pep-0263.txt
+++ b/pep-0263.txt
@@ -39,8 +39,10 @@ Proposed Solution

 Defining the Encoding

-    Python will default to Latin-1 as standard encoding if no other
-    encoding hints are given.
+    Just as in coercion of strings to Unicode, Python will default to
+    the interpreter's default encoding (which is ASCII in standard
+    Python installations) as standard encoding if no other encoding
+    hints are given.

    To define a source code encoding, a magic comment must
    be placed into the source files either as first or second
@@ -49,6 +51,11 @@ Defining the Encoding
          #!/usr/bin/python
          # -*- coding: <encoding name> -*-

+    More precisely, the first or second line must match the regular
+    expression "coding[:=]\s*([\w-_]+)". The first group of this
+    expression is then interpreted as encoding name. If the encoding
+    is unknown to Python, an error is raised during compilation.
+
    To aid with platforms such as Windows, which add Unicode BOM marks
    to the beginning of Unicode files, the UTF-8 signature
    '\xef\xbb\xbf' will be interpreted as 'utf-8' encoding as well
@@ -66,7 +73,7 @@ Concepts
    1. The complete Python source file should use a single encoding.
       Embedding of differently encoded data is not allowed and will
       result in a decoding error during compilation of the Python
-       source code. 
+       source code.

       Only ASCII compatible encodings are allowed as source code
       encoding to assure that Python language elements other than
@@ -101,34 +108,47 @@ Concepts
       Note that Python identifiers are restricted to the ASCII
       subset of the encoding.

-    For backwards compatibility, the implementation must assume
-    Latin-1 as the original file encoding if not given (otherwise,
-    binary data currently stored in 8-bit strings wouldn't make the
-    roundtrip).
-
 Implementation

    Since changing the Python tokenizer/parser combination will
-    require major changes in the internals of the interpreter, the
-    proposed solution should be implemented in two phases:
+    require major changes in the internals of the interpreter and
+    enforcing the use of magic comments in source code files which
+    place non-default encoding characters in string literals, comments
+    and Unicode literals, the proposed solution should be implemented
+    in two phases:

    1. Implement the magic comment detection and default encoding
       handling, but only apply the detected encoding to Unicode
       literals in the source file.

+       In addition to this step and to aid in the transition to
+       explicit encoding declaration, the tokenizer must check the
+       complete source file for compliance with the default encoding
+       (which usually is ASCII). If the source file does not properly
+       decode, a single warning is generated per file.
+
    2. Change the tokenizer/compiler base string type from char* to
       Py_UNICODE* and apply the encoding to the complete file.

+       Source files which fail to decode cause an error to be raised
+       during compilation.
+
+       The builtin compile() API will be enhanced to accept Unicode as
+       input. 8-bit string input is subject to the standard procedure
+       for encoding detection as described above.
+
 Scope

-    This PEP only affects Python source code which makes use of the
-    proposed magic comment. Without the magic comment in the proposed
-    position, Python will treat the source file as it does currently
-    (using the Latin-1 encoding assumption) to maintain backwards
-    compatibility.
+    This PEP intends to provide an upgrade path from the current
+    (more-or-less) undefined source code encoding situation to a more
+    robust and portable definition.

 History

+    1.7: Added warnings to phase 1 implementation. Replaced the
+         Latin-1 default encoding with the interpreter's default
+         encoding. Added tweaks to compile().
+    1.4 - 1.6: Minor tweaks
    1.3: Worked in comments by Martin v. Loewis: 
         UTF-8 BOM mark detection, Emacs style magic comment,
         two phase approach to the implementation
@@ -137,10 +157,8 @@ Copyright

    This document has been placed in the public domain.

-

 Local Variables:
 mode: indented-text
 indent-tabs-mode: nil
-fill-column: 70
 End:
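
The detection rules this revision adds (magic comment on the first or
second line, UTF-8 signature, fallback to the interpreter's default
encoding, and a single warning per file in phase 1) can be sketched in
Python as below. This is a minimal illustration under stated assumptions,
not the interpreter's implementation. The function names are hypothetical,
the default encoding is hard-coded to ASCII, and the pattern is searched
for anywhere in the line rather than anchored. The character class is
written as [-\w] because the form quoted in the diff, [\w-_], is rejected
as a bad character range by current versions of Python's re module.

```python
import codecs
import re
import warnings

# Hedged equivalent of the pattern quoted in the diff; see the note above
# about why the class is spelled [-\w] instead of [\w-_].
CODING_RE = re.compile(rb"coding[:=]\s*([-\w]+)")
UTF8_SIGNATURE = b"\xef\xbb\xbf"


def detect_source_encoding(source, default="ascii"):
    """Return the declared encoding of `source` (bytes), or `default`.

    Hypothetical helper: only the first two lines are inspected, and an
    unknown encoding name raises LookupError (standing in for the
    compile-time error the text describes).
    """
    # A leading UTF-8 signature is treated as a 'utf-8' declaration.
    if source.startswith(UTF8_SIGNATURE):
        return "utf-8"
    for line in source.splitlines()[:2]:
        match = CODING_RE.search(line)
        if match:
            name = match.group(1).decode("ascii")
            codecs.lookup(name)  # raises LookupError if the codec is unknown
            return name
    return default


def check_default_encoding(source, filename, default="ascii"):
    """Phase 1 sketch: warn once if an undeclared file is not `default`-clean.

    Returns True when the file decodes with its detected encoding being the
    default, or when an explicit declaration makes the check unnecessary.
    """
    if detect_source_encoding(source, default) != default:
        return True  # explicit declaration present; nothing to check here
    try:
        source.decode(default)
    except UnicodeDecodeError:
        # One warning per file, not one per offending byte.
        warnings.warn(
            "%s: source does not decode as %s; add an encoding declaration"
            % (filename, default),
            stacklevel=2,
        )
        return False
    return True
```

For example, detect_source_encoding(b"#!/usr/bin/python\n# -*- coding: latin-1 -*-\n")
returns "latin-1", while a file with no declaration and a stray 0xE9 byte
makes check_default_encoding() emit its warning and return False.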