diff --git a/pep-0263.txt b/pep-0263.txt
new file mode 100644
index 000000000..6188022e7
--- /dev/null
+++ b/pep-0263.txt
@@ -0,0 +1,132 @@
+PEP: 0263
+Title: Defining Python Source Code Encodings
+Version: $Revision$
+Author: mal@lemburg.com (Marc-André Lemburg)
+Status: Draft
+Type: Standards Track
+Python-Version: 2.3
+Created: 06-Jun-2001
+Post-History:
+Requires: 244
+
+Abstract
+
+    This PEP proposes to introduce a syntax to declare the encoding of
+    a Python source file. The encoding information is then used by the
+    Python parser to interpret the file using the given encoding. Most
+    notably, this enhances the interpretation of Unicode literals in
+    the source code and makes it possible to write Unicode literals
+    using e.g. UTF-8 directly in a Unicode-aware editor.
+
+Problem
+
+    In Python 2.1, Unicode literals can only be written using the
+    Latin-1 based encoding "unicode-escape". This makes the
+    programming environment rather unfriendly to Python users who live
+    and work in non-Latin-1 locales such as many of the Asian
+    countries. Programmers can write their 8-bit strings using their
+    favourite encoding, but are bound to the "unicode-escape" encoding
+    for Unicode literals.
+
+Proposed Solution
+
+    I propose to make the Python source code encoding both visible and
+    changeable on a per-source file basis by using a special comment
+    at the top of the file to declare the encoding.
+
+    To make Python aware of this encoding declaration, a number of
+    conceptual changes are necessary with respect to the handling of
+    Python source code data.
+
+Concepts
+
+    The PEP is based on the following concepts which would have to be
+    implemented to enable usage of such a magic comment:
+
+    1. The complete Python source file should use a single encoding.
+       Embedding of differently encoded data is not allowed and will
+       result in a decoding error during compilation of the Python
+       source code.
+
+    2. Handling of escape sequences should continue to work as it does
+       now, but with all possible source code encodings; that is,
+       standard string literals (both 8-bit and Unicode) are subject to
+       escape sequence expansion while raw string literals only expand
+       a very small subset of escape sequences.
+
+    3. Python's tokenizer/compiler combo will need to be updated to
+       work as follows:
+
+       1. read the file
+
+       2. decode it into Unicode assuming a fixed per-file encoding
+
+       3. tokenize the Unicode content
+
+       4. compile it, creating Unicode objects from the given Unicode data
+          and creating string objects from the Unicode literal data
+          by first reencoding the Unicode data into 8-bit string data
+          using the given file encoding
+
+       5. variable names and other identifiers will be reencoded into
+          8-bit strings using the file encoding to ensure backward
+          compatibility with the existing implementation
+
+       ISSUE:
+
+           Should we restrict identifiers to ASCII?
+
+    To make this backwards compatible, the implementation would have to
+    assume Latin-1 as the original file encoding if not given (otherwise,
+    binary data currently stored in 8-bit strings wouldn't make the
+    roundtrip).
+
+Comment Syntax
+
+    The magic comment will use the following syntax. It will have to
+    appear as the first or second line in the Python source file.
+
+    ISSUE:
+
+        Possible choices for the format:
+
+        1. Emacs style:
+
+            #!/usr/bin/python
+            # -*- coding: utf-8; -*-
+
+        2. Via a pseudo-option to the interpreter (one which is not used
+           by the interpreter):
+
+            #!/usr/bin/python --encoding=utf-8
+
+        3. Using a special comment format:
+
+            #!/usr/bin/python
+            #!encoding = 'utf-8'
+
+        4. XML-style format:
+
+            #!/usr/bin/python
+            #?python encoding = 'utf-8'
+
+        Usage of a new keyword "directive" (see PEP 244) for this purpose
+        has been proposed, but was put aside due to PEP 244 not being
+        widely accepted (yet).
+
+Scope
+
+    This PEP only affects Python source code which makes use of the
+    proposed magic comment. Without the magic comment in the proposed
+    position, Python will treat the source file as it does currently
+    to maintain backwards compatibility.
+
+Copyright
+
+    This document has been placed in the public domain.
+
+
+Local Variables:
+mode: indented-text
+indent-tabs-mode: nil
+End:
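
The read/decode/reencode steps listed under Concepts, together with the Latin-1 fallback, can be illustrated with a minimal sketch. It is not the interpreter's implementation. It assumes the Emacs-style comment (only one of the candidate formats in the draft), and the names CODING_RE, detect_encoding, decode_source and reencode_literal are hypothetical helpers introduced here for illustration.

```python
import re

# Hypothetical pattern for the Emacs-style candidate
# ("# -*- coding: utf-8; -*-"); the draft has not settled on a format.
CODING_RE = re.compile(rb"^#.*?coding\s*[:=]\s*([-\w.]+)")

# Assumed fallback, following the draft's backwards-compatibility note.
DEFAULT_ENCODING = "latin-1"


def detect_encoding(raw: bytes) -> str:
    """Return the encoding declared on the first two lines, else Latin-1."""
    for line in raw.splitlines()[:2]:
        match = CODING_RE.match(line)
        if match:
            return match.group(1).decode("ascii")
    return DEFAULT_ENCODING


def decode_source(raw: bytes) -> str:
    """Steps 1-2: decode the whole file using one per-file encoding."""
    return raw.decode(detect_encoding(raw))


def reencode_literal(text: str, encoding: str) -> bytes:
    """Step 4: turn decoded literal text back into 8-bit string data."""
    return text.encode(encoding)


if __name__ == "__main__":
    raw = (
        "#!/usr/bin/python\n"
        "# -*- coding: utf-8; -*-\n"
        "s = 'caf\u00e9'\n"
    ).encode("utf-8")
    print(detect_encoding(raw))
    print(reencode_literal("caf\u00e9", detect_encoding(raw)))
```

Without a declaration the sketch falls back to Latin-1, which maps every byte value to a code point and back, so undeclared 8-bit data survives the decode/reencode roundtrip unchanged.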