Initial checkin of the PEP 263.
parent feddfef167
commit ca40cb4bf0
@ -0,0 +1,132 @@
PEP: 0263
Title: Defining Python Source Code Encodings
Version: $Revision$
Author: mal@lemburg.com (Marc-André Lemburg)
Status: Draft
Type: Standards Track
Python-Version: 2.3
Created: 06-Jun-2001
Post-History:
Requires: 244

Abstract

This PEP proposes to introduce a syntax to declare the encoding of
a Python source file. The Python parser then uses this encoding
information to interpret the file. Most notably, this improves the
interpretation of Unicode literals in the source code and makes it
possible to write Unicode literals using e.g. UTF-8 directly in a
Unicode-aware editor.

Problem

In Python 2.1, Unicode literals can only be written using the
Latin-1 based encoding "unicode-escape". This makes the
programming environment rather unfriendly to Python users who live
and work in non-Latin-1 locales, such as many Asian countries.
Programmers can write their 8-bit strings using their favourite
encoding, but are bound to the "unicode-escape" encoding for
Unicode literals.

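As a rough illustration of this limitation, the sketch below uses
the "unicode_escape" codec of a current Python 3 interpreter. This is
an assumption made for the example: the PEP itself targets Python 2.x,
and escape_literal is a hypothetical helper name. It shows that a
character outside Latin-1 has to be spelled as an escape sequence,
while a byte above 127 is read as a Latin-1 code point.

```python
def escape_literal(text):
    """Spell text the way a "unicode-escape" source literal would need it."""
    return text.encode("unicode_escape")


# A character outside Latin-1 (the euro sign) cannot be typed directly;
# it has to be written as a backslash escape.
euro_escaped = escape_literal("\u20ac")

# When decoding, every byte that is not part of an escape sequence is
# treated as a Latin-1 code point.
latin1_char = b"\xe9".decode("unicode_escape")
```
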
Proposed Solution

I propose to make the Python source code encoding both visible and
changeable on a per-source-file basis by using a special comment
at the top of the file to declare the encoding.

To make Python aware of this encoding declaration, a number of
conceptual changes are necessary with respect to the handling of
Python source code data.

Concepts

The PEP is based on the following concepts, which would have to be
implemented to enable usage of such a magic comment:

1. The complete Python source file should use a single encoding.
   Embedding differently encoded data is not allowed and will
   result in a decoding error during compilation of the Python
   source code.

2. Handling of escape sequences should continue to work as it does
   now, but with all possible source code encodings; that is,
   standard string literals (both 8-bit and Unicode) are subject to
   escape sequence expansion, while raw string literals expand only
   a very small subset of escape sequences.

3. Python's tokenizer/compiler combo will need to be updated to
   work as follows:

   1. read the file

   2. decode it into Unicode assuming a fixed per-file encoding

   3. tokenize the Unicode content

   4. compile it, creating Unicode objects from the given Unicode data
      and creating string objects from the Unicode literal data
      by first reencoding the Unicode data into 8-bit string data
      using the given file encoding

   5. variable names and other identifiers will be reencoded into
      8-bit strings using the file encoding to ensure backward
      compatibility with the existing implementation

   ISSUE:

   Should we restrict identifiers to ASCII?

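The decode and re-encode steps listed above can be sketched roughly
as follows. This is not the proposed implementation: it is written
for a current Python 3 interpreter (where str holds Unicode and bytes
holds 8-bit data), the helper names decode_source and
string_literal_data are hypothetical, and a naive split stands in
for the real tokenizer.

```python
def decode_source(raw, encoding):
    """Steps 1-2: decode the raw file bytes with the single file encoding."""
    # A file that mixes encodings raises UnicodeDecodeError here (concept 1).
    return raw.decode(encoding)


def string_literal_data(literal_text, encoding):
    """Step 4: re-encode 8-bit string literal data with the file encoding."""
    return literal_text.encode(encoding)


# Hypothetical one-line source file, saved as UTF-8 by the editor.
raw = 'greeting = "Gr\u00fc\u00dfe"\n'.encode("utf-8")
text = decode_source(raw, "utf-8")

# Stand-in for step 3: pull out the literal body with a naive split.
literal = text.split('"')[1]

# A Unicode literal keeps the decoded text as it is.
unicode_value = literal

# An 8-bit string literal gets its original bytes back via re-encoding.
byte_value = string_literal_data(literal, "utf-8")
```

With this model, the bytes stored in an 8-bit string are the same
bytes the editor wrote to disk, while a Unicode literal holds the
decoded characters.
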
To keep this processing model backwards compatible, the
implementation would have to assume Latin-1 as the file encoding
when none is declared (otherwise, binary data currently stored in
8-bit strings would not survive the round trip).

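The reason Latin-1 works as a default can be checked with a small
experiment. The sketch below assumes a current Python 3 interpreter,
and survives_roundtrip is a hypothetical helper name. Latin-1 maps
each of the 256 byte values to exactly one code point, so decoding
and re-encoding returns the original bytes, whereas an encoding such
as UTF-8 rejects arbitrary binary data.

```python
def survives_roundtrip(data, encoding):
    """Return True if decoding then re-encoding gives back the same bytes."""
    try:
        return data.decode(encoding).encode(encoding) == data
    except UnicodeError:
        # The bytes are not valid in this encoding at all.
        return False


# Every possible byte value, as might appear in binary string data.
all_bytes = bytes(range(256))

latin1_ok = survives_roundtrip(all_bytes, "latin-1")
utf8_ok = survives_roundtrip(all_bytes, "utf-8")
```
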
Comment Syntax

The magic comment will use the following syntax. It will have to
appear as the first or second line in the Python source file.

ISSUE:

Possible choices for the format:

1. Emacs style:

       #!/usr/bin/python
       # -*- coding: utf-8; -*-

2. Via a pseudo-option to the interpreter (one which is not used
   by the interpreter):

       #!/usr/bin/python --encoding=utf-8

3. Using a special comment format:

       #!/usr/bin/python
       #!encoding = 'utf-8'

4. XML-style format:

       #!/usr/bin/python
       #?python encoding = 'utf-8'

Usage of a new keyword "directive" (see PEP 244) for this purpose
has been proposed, but was put aside because PEP 244 is not
widely accepted (yet).

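Since the format is still an open issue, the following is only a
sketch of how a reader might look for the declaration, taking the
Emacs-style choice (1) as the example. The pattern CODING_RE, the
function declared_encoding and its Latin-1 default are assumptions
made for illustration, not part of the proposal's wording.

```python
import re

# Hypothetical pattern for the Emacs-style choice:
#   # -*- coding: <name>; -*-
CODING_RE = re.compile(r"^#.*?-\*-.*?coding:\s*([-\w.]+)")


def declared_encoding(source_bytes, default="latin-1"):
    """Return the encoding named in the first two lines, else the default."""
    for line in source_bytes.splitlines()[:2]:
        # The declaration itself is assumed to be plain ASCII.
        match = CODING_RE.match(line.decode("ascii", errors="replace"))
        if match:
            return match.group(1)
    return default


example = b"#!/usr/bin/python\n# -*- coding: utf-8; -*-\nprint('hi')\n"
found = declared_encoding(example)
```

A declaration on the third line or later is ignored by this sketch,
which matches the requirement that the comment appear as the first or
second line.
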
Scope

This PEP only affects Python source code which makes use of the
proposed magic comment. Without the magic comment in the proposed
position, Python will treat the source file as it does currently
to maintain backwards compatibility.

Copyright

This document has been placed in the public domain.


Local Variables:
mode: indented-text
indent-tabs-mode: nil
End: