Initial checkin of the PEP 263.

2001-07-18 20:27:09 +00:00 · 2001-07-18 20:27:09 +00:00 · ca40cb4bf0
parent feddfef167
commit ca40cb4bf0
1 changed files with 132 additions and 0 deletions
--- a/pep-0263.txt
+++ b/pep-0263.txt
@@ -0,0 +1,132 @@
+PEP: 0263
+Title: Defining Python Source Code Encodings
+Version: $Revision$
+Author: mal@lemburg.com (Marc-André Lemburg)
+Status: Draft
+Type: Standards Track
+Python-Version: 2.3
+Created: 06-Jun-2001
+Post-History: 
+Requires: 244
+
+Abstract
+
+    This PEP proposes to introduce a syntax to declare the encoding of
+    a Python source file. The encoding information is then used by the
+    Python parser to interpret the file using the given encoding. Most
+    notably, this enhances the interpretation of Unicode literals in
+    the source code and makes it possible to write Unicode literals
+    using e.g. UTF-8 directly in a Unicode-aware editor.
+
+Problem
+
+    In Python 2.1, Unicode literals can only be written using the
+    Latin-1 based encoding "unicode-escape". This makes the
+    programming environment rather unfriendly to Python users who live
+    and work in non-Latin-1 locales such as many of the Asian
+    countries. Programmers can write their 8-bit strings using their
+    favourite encoding, but are bound to the "unicode-escape" encoding
+    for Unicode literals.
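+
+    The limitation can be illustrated with a minimal sketch. It is
+    written for present-day Python 3 (an assumption; this PEP targets
+    Python 2) and uses the standard "unicode_escape" codec. The sample
+    byte strings are made up for illustration.
+
```python
# Sketch in Python 3; "unicode_escape" is the standard-library codec
# corresponding to the "unicode-escape" encoding named above.

# A non-Latin-1 character has to be spelled as an escape sequence.
euro = b"\\u20ac".decode("unicode_escape")

# A raw Latin-1 byte passes through as the matching code point.
e_acute = b"\xe9".decode("unicode_escape")

# UTF-8 bytes typed directly are misread one byte at a time as Latin-1.
mangled = "\u00e9".encode("utf-8").decode("unicode_escape")

print(repr(euro), repr(e_acute), repr(mangled))
```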
+
+Proposed Solution
+
+    I propose to make the Python source code encoding both visible and
+    changeable on a per-source file basis by using a special comment
+    at the top of the file to declare the encoding.
+
+    To make Python aware of this encoding declaration, a number of
+    conceptual changes are necessary with respect to the handling of
+    Python source code data.
+
+Concepts
+
+    The PEP is based on the following concepts which would have to be
+    implemented to enable usage of such a magic comment:
+
+    1. The complete Python source file should use a single encoding.
+       Embedding of differently encoded data is not allowed and will
+       result in a decoding error during compilation of the Python
+       source code.
+
+    2. Handling of escape sequences should continue to work as it does
+       now, but with all possible source code encodings; that is,
+       standard string literals (both 8-bit and Unicode) are subject to
+       escape sequence expansion while raw string literals only expand
+       a very small subset of escape sequences.
+
+    3. Python's tokenizer/compiler combo will need to be updated to
+       work as follows:
+
+       1. read the file
+
+       2. decode it into Unicode assuming a fixed per-file encoding
+
+       3. tokenize the Unicode content
+
+       4. compile it, creating Unicode objects from the given Unicode data
+          and creating string objects from the Unicode literal data
+          by first reencoding the Unicode data into 8-bit string data
+          using the given file encoding
+
+       5. variable names and other identifiers will be reencoded into
+          8-bit strings using the file encoding to ensure backward
+          compatibility with the existing implementation
+
+          ISSUE:
+
+              Should we restrict identifiers to ASCII?
+
+       To make this backwards compatible, the implementation would have to
+       assume Latin-1 as the original file encoding if none is given
+       (otherwise, binary data currently stored in 8-bit strings would
+       not survive the decode/reencode round trip).
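+
+    The decode and reencode steps, and the reason Latin-1 is the safe
+    default, can be sketched as follows. This is a Python 3
+    illustration under stated assumptions, not the tokenizer/compiler
+    change itself; the helper names decode_source and reencode_literal
+    are hypothetical.
+
```python
# Illustrative sketch only; the function names are hypothetical.

def decode_source(raw: bytes, encoding: str = "latin-1") -> str:
    """Steps 1-2: turn the raw file bytes into Unicode text.

    Latin-1 is the assumed default because it maps every byte value
    0-255 to exactly one code point, so decoding can never fail.
    """
    return raw.decode(encoding)


def reencode_literal(text: str, encoding: str = "latin-1") -> bytes:
    """Step 4: recover 8-bit string data from decoded literal text."""
    return text.encode(encoding)


# Arbitrary binary data, as might sit inside an 8-bit string literal.
binary = bytes(range(256))

# With Latin-1 the decode/reencode round trip is lossless.
assert reencode_literal(decode_source(binary)) == binary

# With a different default, such as UTF-8, the same bytes fail to decode.
try:
    decode_source(binary, "utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
```
+
+    A file that declares an encoding follows the same path, with the
+    declared name passed in place of the Latin-1 default.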
+
+Comment Syntax
+
+    The magic comment will use the following syntax. It will have to
+    appear as the first or second line in the Python source file.
+
+    ISSUE:
+
+        Possible choices for the format:
+
+        1. Emacs style:
+
+          #!/usr/bin/python
+          # -*- coding: utf-8; -*-
+
+        2. Via a pseudo-option to the interpreter (one which is not used
+           by the interpreter):
+
+          #!/usr/bin/python --encoding=utf-8
+
+        3. Using a special comment format:
+
+          #!/usr/bin/python
+          #!encoding = 'utf-8'
+
+        4. XML-style format:
+
+          #!/usr/bin/python
+          #?python encoding = 'utf-8'
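+
+    As an illustration of how choice 1 (Emacs style) could be
+    recognised, here is a minimal Python 3 sketch. The regular
+    expression and the helper name find_declared_encoding are
+    assumptions made for this example; the format is still an open
+    issue.
+
```python
import re

# Hypothetical pattern for the Emacs-style form only; the final comment
# format is undecided, so this is an assumption for illustration.
CODING_RE = re.compile(rb"^#.*?coding:\s*([-\w.]+)")


def find_declared_encoding(source: bytes):
    """Return the encoding named in the first or second line, or None."""
    for line in source.splitlines()[:2]:
        match = CODING_RE.match(line)
        if match:
            return match.group(1).decode("ascii")
    return None


example = b"#!/usr/bin/python\n# -*- coding: utf-8; -*-\nprint('hi')\n"
declared = find_declared_encoding(example)
```
+
+    Only the first two lines are inspected, so a matching comment
+    further down the file is ignored.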
+
+    Usage of a new keyword "directive" (see PEP 244) for this purpose
+    has been proposed, but was put aside due to PEP 244 not being
+    widely accepted (yet).
+
+Scope
+
+    This PEP only affects Python source code which makes use of the
+    proposed magic comment. Without the magic comment in the proposed
+    position, Python will treat the source file as it does currently
+    to maintain backwards compatibility.
+
+Copyright
+
+    This document has been placed in the public domain.
+
+
+Local Variables:
+mode: indented-text
+indent-tabs-mode: nil
+End: