PEP: 0263
Title: Defining Python Source Code Encodings
Version: $Revision$
Author: mal@lemburg.com (Marc-André Lemburg)
Status: Draft
Type: Standards Track
Python-Version: 2.3
Created: 06-Jun-2001
Last-Modified:
Post-History:

Abstract

This PEP proposes to introduce a syntax to declare the encoding of
a Python source file. The encoding information is then used by the
Python parser to interpret the file using the given encoding. Most
notably, this enhances the interpretation of Unicode literals in
the source code and makes it possible to write Unicode literals
using e.g. UTF-8 directly in a Unicode-aware editor.

Problem

In Python 2.1, Unicode literals can only be written using the
Latin-1 based encoding "unicode-escape". This makes the
programming environment rather unfriendly to Python users who live
and work in non-Latin-1 locales, such as those of many Asian
countries. Programmers can write their 8-bit strings using their
favourite encoding, but are bound to the "unicode-escape" encoding
for Unicode literals.

Proposed Solution

I propose to make the Python source code encoding both visible and
changeable on a per-source-file basis by using a special comment
at the top of the file to declare the encoding.

To make Python aware of this encoding declaration, a number of
conceptual changes are necessary with respect to the handling of
Python source code data.

Defining the Encoding

Python will default to ASCII as the standard encoding if no other
encoding hints are given.

To define a source code encoding, a magic comment must be placed
into the source file as either the first or second line of the
file:

    #!/usr/bin/python
    # -*- coding: <encoding name> -*-

More precisely, the first or second line must match the regular
expression "coding[:=]\s*([\w-_.]+)". The first group of this
expression is then interpreted as the encoding name. If the
encoding is unknown to Python, an error is raised during
compilation.
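As a rough illustration of the matching rule above, here is a minimal
sketch in present-day Python. The helper name declared_encoding is
hypothetical and not part of this proposal. The character class is
written as [-\w.] (hyphen first), which is assumed to accept the same
names as the class quoted above while being unambiguous for Python's
re module.

```python
import re

# Assumed equivalent of the pattern quoted in the text; the hyphen is placed
# first in the character class so that it is read as a literal character.
CODING_RE = re.compile(rb"coding[:=]\s*([-\w.]+)")


def declared_encoding(raw):
    """Return the encoding named in the first two lines of raw, or None.

    raw is the undecoded content of a source file (a bytes object).
    """
    for line in raw.splitlines()[:2]:
        match = CODING_RE.search(line)
        if match:
            # Group 1 holds the encoding name; it only contains ASCII bytes.
            return match.group(1).decode("ascii")
    return None
```

Only the first two lines are inspected, so a declaration on the third
line or later is ignored by this sketch, and a file without any
declaration yields None.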

To aid with platforms such as Windows, which add Unicode BOM marks
to the beginning of Unicode files, the UTF-8 signature
'\xef\xbb\xbf' will be interpreted as 'utf-8' encoding as well
(even if no magic encoding comment is given).

If a source file uses both the UTF-8 BOM mark signature and a
magic encoding comment, the only allowed encoding for the comment
is 'utf-8'. Any other encoding will cause an error.
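The rules of this section (ASCII default, magic comment, UTF-8
signature) could be combined roughly as follows. This is only a sketch
under stated assumptions: the function name source_encoding is
hypothetical, SyntaxError is an assumed choice of error type, and
encoding aliases are normalised through the codec registry, which the
text above does not specify.

```python
import codecs
import re

# Assumed equivalent of the pattern in the text (hyphen first, so literal).
CODING_RE = re.compile(rb"coding[:=]\s*([-\w.]+)")


def source_encoding(raw):
    """Return the encoding to use for the undecoded source bytes in raw."""
    has_bom = raw.startswith(codecs.BOM_UTF8)  # the signature '\xef\xbb\xbf'
    if has_bom:
        raw = raw[len(codecs.BOM_UTF8):]

    # Look for a magic comment in the first two lines only.
    declared = None
    for line in raw.splitlines()[:2]:
        match = CODING_RE.search(line)
        if match:
            declared = match.group(1).decode("ascii")
            break

    if declared is not None:
        try:
            # Assumption: aliases such as "UTF-8" and "utf8" count as equal.
            declared = codecs.lookup(declared).name
        except LookupError:
            raise SyntaxError("unknown encoding: %s" % declared)

    if has_bom:
        # With a signature present, only a utf-8 declaration is acceptable.
        if declared not in (None, "utf-8"):
            raise SyntaxError("UTF-8 signature conflicts with %s" % declared)
        return "utf-8"

    # Without any hint, fall back to the ASCII default.
    return declared or "ascii"
```

For example, a file that starts with the signature and declares latin-1
is rejected, while a file with neither signature nor comment is treated
as ASCII.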

Concepts

The PEP is based on the following concepts, which would have to be
implemented to enable usage of such a magic comment:

1. The complete Python source file should use a single encoding.
   Embedding of differently encoded data is not allowed and will
   result in a decoding error during compilation of the Python
   source code.

   Any encoding which allows processing the first two lines in the
   way indicated above is allowed as source code encoding. This
   includes ASCII-compatible encodings as well as certain
   multi-byte encodings such as Shift_JIS. It does not include
   encodings which use two or more bytes for all characters, such
   as UTF-16. The reason for this is to keep the encoding
   detection algorithm in the tokenizer simple.

2. Handling of escape sequences should continue to work as it does
   now, but with all possible source code encodings; that is,
   standard string literals (both 8-bit and Unicode) are subject to
   escape sequence expansion, while raw string literals only expand
   a very small subset of escape sequences.

3. Python's tokenizer/compiler combo will need to be updated to
   work as follows:

   1. read the file

   2. decode it into Unicode assuming a fixed per-file encoding

   3. tokenize the Unicode content

   4. compile it, creating Unicode objects from the given Unicode
      data and creating string objects from the Unicode literal
      data by first reencoding the Unicode data into 8-bit string
      data using the given file encoding

   5. variable names and other identifiers will be reencoded into
      8-bit strings using the file encoding to ensure backward
      compatibility with the existing implementation

   Note that Python identifiers are restricted to the ASCII
   subset of the encoding.
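Steps 2 to 4 of the third concept can be modelled in present-day Python
as shown below. This is a sketch, not the proposed implementation: the
function name literal_values is hypothetical, the standard tokenize and
ast modules stand in for the interpreter internals, u-prefixed literals
are treated as Unicode literals and unprefixed ones as 8-bit strings,
and escape sequences and identifiers (step 5) are not modelled
faithfully.

```python
import ast
import io
import tokenize


def literal_values(raw, encoding):
    """Return the values of the string literals found in raw.

    raw is the undecoded file content and encoding the per-file encoding.
    Assumes only plain "..." and u"..." literals occur in the source.
    """
    text = raw.decode(encoding)                     # step 2: bytes -> Unicode
    readline = io.StringIO(text).readline
    values = []
    for tok in tokenize.generate_tokens(readline):  # step 3: tokenize Unicode
        if tok.type != tokenize.STRING:
            continue
        value = ast.literal_eval(tok.string)        # step 4: build the object
        if tok.string[0] in "uU":
            values.append(value)                    # Unicode literal: keep text
        else:
            # 8-bit literal: reencode the Unicode data with the file encoding.
            values.append(value.encode(encoding))
    return values
```

With a Latin-1 file containing the same non-ASCII character in an 8-bit
literal and in a Unicode literal, the first comes back as the original
byte and the second as a Unicode string.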

Implementation

Since changing the Python tokenizer/parser combination will
require major changes in the internals of the interpreter and
enforcing the use of magic comments in source code files which
place non-ASCII characters in string literals, comments,
and Unicode literals, the proposed solution should be implemented
in two phases:

1. Implement the magic comment detection, but only apply the
   detected encoding to Unicode literals in the source file.

   If no magic comment is used, Python should continue to
   use the standard [raw-]unicode-escape codecs for Unicode
   literals.

   In addition to this step and to aid in the transition to
   explicit encoding declaration, the tokenizer must check the
   complete source file for compliance with the declared
   encoding. If the source file does not properly decode, a single
   warning is generated per file.

2. Change the tokenizer/compiler base string type from char* to
   Py_UNICODE* and apply the encoding to the complete file.

   Source files which fail to decode cause an error to be raised
   during compilation.
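The difference between the two phases in how a non-decodable file is
treated might look like this. It is a sketch only: the function name
check_source and its phase parameter are hypothetical, and the warning
category (the default UserWarning) and the SyntaxError type are
assumptions that the text above does not fix.

```python
import warnings


def check_source(raw, encoding, phase):
    """Check that the source bytes in raw decode with the declared encoding.

    Returns True if the file decodes. In phase 1 a failure produces one
    warning and returns False; in phase 2 it raises an error instead.
    """
    try:
        # The whole file is decoded in one go, so at most one problem
        # (and hence at most one warning) is reported per file.
        raw.decode(encoding)
    except UnicodeDecodeError as exc:
        message = "source does not decode as %s: %s" % (encoding, exc)
        if phase == 1:
            warnings.warn(message)
            return False
        raise SyntaxError(message)
    return True
```

A caller in phase 1 can therefore keep compiling after the warning,
whereas in phase 2 the same file stops compilation.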

The builtin compile() API will be enhanced to accept Unicode as
input. 8-bit string input is subject to the standard procedure
for encoding detection as described above.
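For comparison, the two kinds of compile() input can be tried out in a
current Python 3 interpreter, where text input is already Unicode and
byte input goes through encoding detection. The snippet is only an
illustration of that present-day behaviour, not of the patch discussed
in this PEP, and the helper name run_source is hypothetical.

```python
def run_source(source):
    """Compile and execute source (text or bytes) and return its namespace."""
    namespace = {}
    exec(compile(source, "<example>", "exec"), namespace)
    return namespace


# Text input: the source is already Unicode, so nothing has to be detected.
text_ns = run_source("x = '\u00e9'\n")

# Byte input: the magic comment says how the byte 0xE9 is to be decoded.
bytes_ns = run_source(b"# -*- coding: latin-1 -*-\nx = '\xe9'\n")
```

Both namespaces end up with the same one-character string bound to x.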

Martin v. Loewis is working on a patch which implements phase 1.
See [1] for details.

Scope

This PEP intends to provide an upgrade path from the current
(more-or-less) undefined source code encoding situation to a more
robust and portable definition.

References

[1] Phase 1 implementation:
    http://sourceforge.net/tracker/?func=detail&atid=305470&aid=526840&group_id=5470

History

1.10 and above: see CVS history
1.8: Added '.' to the coding RE.
1.7: Added warnings to phase 1 implementation. Replaced the
     Latin-1 default encoding with the interpreter's default
     encoding. Added tweaks to compile().
1.4 - 1.6: Minor tweaks
1.3: Worked in comments by Martin v. Loewis:
     UTF-8 BOM mark detection, Emacs style magic comment,
     two phase approach to the implementation

Copyright

This document has been placed in the public domain.

Local Variables:
mode: indented-text
indent-tabs-mode: nil
End: