Initial checkin of the PEP 263.
parent feddfef167
commit ca40cb4bf0

@@ -0,0 +1,132 @@
PEP: 0263
Title: Defining Python Source Code Encodings
Version: $Revision$
Author: mal@lemburg.com (Marc-André Lemburg)
Status: Draft
Type: Standards Track
Python-Version: 2.3
Created: 06-Jun-2001
Post-History:
Requires: 244

Abstract

    This PEP proposes to introduce a syntax to declare the encoding of
    a Python source file. The encoding information is then used by the
    Python parser to interpret the file using the given encoding. Most
    notably, this enhances the interpretation of Unicode literals in
    the source code and makes it possible to write Unicode literals
    using e.g. UTF-8 directly in a Unicode-aware editor.

Problem

    In Python 2.1, Unicode literals can only be written using the
    Latin-1 based encoding "unicode-escape". This makes the
    programming environment rather unfriendly to Python users who live
    and work in non-Latin-1 locales such as many Asian countries.
    Programmers can write their 8-bit strings using their favourite
    encoding, but are bound to the "unicode-escape" encoding for
    Unicode literals.

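    The limitation can be illustrated with the codec itself. The
    following is a minimal sketch, written for a current Python 3
    interpreter rather than Python 2.1; the sample bytes and variable
    names are chosen purely for illustration.

```python
# Illustration only: run under Python 3, not Python 2.1.
# The "unicode-escape" codec treats every non-escape byte as Latin-1.

# An escape sequence is the only way to name a non-Latin-1 character.
euro = b"\\u20ac".decode("unicode_escape")

# The UTF-8 bytes for U+00E9, as a UTF-8 editor would store them,
# are read as two separate Latin-1 characters instead of one.
utf8_bytes = "\u00e9".encode("utf-8")
misread = utf8_bytes.decode("unicode_escape")

print(repr(euro))     # the single character U+20AC
print(repr(misread))  # two characters: U+00C3 followed by U+00A9
```

    Text typed directly in a non-Latin-1 editor is therefore
    misinterpreted unless every such character is spelled as an escape.
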
Proposed Solution

    I propose to make the Python source code encoding both visible and
    changeable on a per-source file basis by using a special comment
    at the top of the file to declare the encoding.

    To make Python aware of this encoding declaration a number of
    conceptual changes are necessary with respect to the handling of
    Python source code data.

Concepts

    The PEP is based on the following concepts which would have to be
    implemented to enable usage of such a magic comment:

    1. The complete Python source file should use a single encoding.
       Embedding of differently encoded data is not allowed and will
       result in a decoding error during compilation of the Python
       source code.

    2. Handling of escape sequences should continue to work as it does
       now, but with all possible source code encodings, that is,
       standard string literals (both 8-bit and Unicode) are subject to
       escape sequence expansion while raw string literals only expand
       a very small subset of escape sequences.

    3. Python's tokenizer/compiler combo will need to be updated to
       work as follows:

       1. read the file

       2. decode it into Unicode assuming a fixed per-file encoding

       3. tokenize the Unicode content

       4. compile it, creating Unicode objects from the given Unicode
          data and creating string objects from the Unicode literal
          data by first reencoding the Unicode data into 8-bit string
          data using the given file encoding

       5. variable names and other identifiers will be reencoded into
          8-bit strings using the file encoding to ensure backward
          compatibility with the existing implementation

       ISSUE:

           Should we restrict identifiers to ASCII?

    To make this backwards compatible, the implementation would have to
    assume Latin-1 as the original file encoding if none is declared
    (otherwise, binary data currently stored in 8-bit strings wouldn't
    make the roundtrip).

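    The decode and reencode steps, and the reason for the Latin-1
    default, can be sketched as follows. This is a rough illustration
    in current Python 3, not the interpreter's implementation; the
    helper names decode_source and reencode_literal are hypothetical.

```python
# Hypothetical helper names; this is not the interpreter's implementation.

def decode_source(raw: bytes, encoding: str = "latin-1") -> str:
    """Steps 1-2: turn the raw file bytes into Unicode text.

    Latin-1 maps every byte value 0-255 to exactly one code point,
    so decoding with it can never fail and never loses information.
    """
    return raw.decode(encoding)


def reencode_literal(text: str, encoding: str = "latin-1") -> bytes:
    """Step 4: recover the 8-bit string data for a non-Unicode literal."""
    return text.encode(encoding)


# Arbitrary binary data embedded in an 8-bit string survives the
# round trip when no encoding is declared (Latin-1 default).
binary = bytes(range(256))
assert reencode_literal(decode_source(binary)) == binary

# With a declared encoding, text literals round-trip as well.
raw = "s = 'gr\u00fc\u00df'".encode("utf-8")
text = decode_source(raw, "utf-8")
assert reencode_literal(text, "utf-8") == raw

# Concept 1: data that is not valid in the declared encoding is
# rejected with a decoding error.
try:
    decode_source(b"s = '\xff'", "utf-8")
except UnicodeDecodeError:
    mixed_encoding_rejected = True
else:
    mixed_encoding_rejected = False
```

    The same round trip would not be guaranteed with UTF-8 as the
    default, since many byte sequences are not valid UTF-8.
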
Comment Syntax

    The magic comment will use the following syntax. It will have to
    appear as the first or second line in the Python source file.

    ISSUE:

        Possible choices for the format:

        1. Emacs style:

           #!/usr/bin/python
           # -*- coding: utf-8; -*-

        2. Via a pseudo-option to the interpreter (one which is not used
           by the interpreter):

           #!/usr/bin/python --encoding=utf-8

        3. Using a special comment format:

           #!/usr/bin/python
           #!encoding = 'utf-8'

        4. XML-style format:

           #!/usr/bin/python
           #?python encoding = 'utf-8'

    Usage of a new keyword "directive" (see PEP 244) for this purpose
    has been proposed, but was put aside due to PEP 244 not being
    widely accepted (yet).

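    As a rough illustration of how a reader of the file might locate
    the declaration, the sketch below handles only the Emacs-style
    variant (choice 1). The format is still an open issue, so the
    regular expression and the function name declared_encoding are
    assumptions made for this example, written in current Python 3.

```python
import re

# Assumed pattern for the Emacs-style variant only; the final format
# of the magic comment is still undecided.
CODING_RE = re.compile(rb"-\*-.*?coding:\s*([-\w.]+)")


def declared_encoding(raw: bytes, default: str = "latin-1") -> str:
    """Return the encoding named in the first two lines, else the default."""
    for line in raw.splitlines()[:2]:
        if not line.lstrip().startswith(b"#"):
            continue  # only comment lines may carry the declaration
        match = CODING_RE.search(line)
        if match:
            return match.group(1).decode("ascii")
    return default


with_comment = b"#!/usr/bin/python\n# -*- coding: utf-8; -*-\nx = 1\n"
without_comment = b"#!/usr/bin/python\nx = 1\n"
too_late = b"#!/usr/bin/python\nx = 1\n# -*- coding: utf-8; -*-\n"

print(declared_encoding(with_comment))
print(declared_encoding(without_comment))
print(declared_encoding(too_late))
```

    A declaration on the third line or later is ignored, so such a file
    falls back to the Latin-1 default described under Concepts.
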
Scope

    This PEP only affects Python source code which makes use of the
    proposed magic comment. Without the magic comment in the proposed
    position, Python will treat the source file as it does currently
    to maintain backwards compatibility.

Copyright

    This document has been placed in the public domain.


Local Variables:
mode: indented-text
indent-tabs-mode: nil
End: