PEP: 0263
Title: Defining Python Source Code Encodings
Version: $Revision$
Author: mal@lemburg.com (Marc-André Lemburg),
    loewis@informatik.hu-berlin.de (Martin v. Löwis)
Status: Final
Type: Standards Track
Python-Version: 2.3
Created: 06-Jun-2001
Last-Modified:
Post-History:

Abstract

    This PEP proposes to introduce a syntax to declare the encoding of
    a Python source file. The encoding information is then used by the
    Python parser to interpret the file using the given encoding. Most
    notably this enhances the interpretation of Unicode literals in
    the source code and makes it possible to write Unicode literals
    using e.g. UTF-8 directly in a Unicode-aware editor.

Problem

    In Python 2.1, Unicode literals can only be written using the
    Latin-1 based encoding "unicode-escape". This makes the
    programming environment rather unfriendly to Python users who live
    and work in non-Latin-1 locales such as many of the Asian
    countries. Programmers can write their 8-bit strings using their
    favorite encoding, but are bound to the "unicode-escape" encoding
    for Unicode literals.

Proposed Solution

    I propose to make the Python source code encoding both visible and
    changeable on a per-source file basis by using a special comment
    at the top of the file to declare the encoding.

    To make Python aware of this encoding declaration a number of
    conceptual changes are necessary with respect to the handling of
    Python source code data.

Defining the Encoding

    Python will default to ASCII as the standard encoding if no other
    encoding hints are given.

    To define a source code encoding, a magic comment must
    be placed into the source file either as the first or second
    line in the file:

        #!/usr/bin/python
        # -*- coding: <encoding name> -*-

    More precisely, the first or second line must match the regular
    expression "coding[:=]\s*([-\w.]+)". The first group of this
    expression is then interpreted as the encoding name. If the encoding
    is unknown to Python, an error is raised during compilation.

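    The matching rule can be sketched in a few lines of Python. This is
    only an illustration written for a current Python 3 interpreter, not
    the tokenizer code: the function name declared_encoding and its
    default parameter are made up for the example, and the sketch writes
    the hyphen first in the character class, which Python's re module
    requires for a literal hyphen.

```python
import codecs
import re

# Pattern from the text, with the hyphen placed first in the character
# class so that it is not read as a range.
CODING_RE = re.compile(r"coding[:=]\s*([-\w.]+)")


def declared_encoding(source, default="ascii"):
    """Return the encoding named in the first or second line of source.

    source is the raw file content as bytes. Only the first two lines are
    examined; default is returned when neither carries a declaration.
    """
    for raw_line in source.splitlines()[:2]:
        # The declaration itself is ASCII, so a lossy ASCII decode is
        # enough to search the line.
        line = raw_line.decode("ascii", "replace")
        match = CODING_RE.search(line)
        if match:
            name = match.group(1)
            codecs.lookup(name)  # raises LookupError for unknown encodings
            return name
    return default
```

    Because the expression is searched for anywhere in the line, other
    editor conventions such as "# vim: set fileencoding=utf-8 :" are
    picked up by the same rule.
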
    To aid with platforms such as Windows, which add Unicode byte
    order marks (BOMs) to the beginning of Unicode files, the UTF-8
    signature '\xef\xbb\xbf' will be interpreted as 'utf-8' encoding
    as well (even if no magic encoding comment is given).

    If a source file uses both the UTF-8 BOM signature and a
    magic encoding comment, the only allowed encoding for the comment
    is 'utf-8'. Any other encoding will cause an error.

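    The interaction between the signature and the magic comment might
    be sketched as follows. The helper name source_encoding, its return
    value and the choice of SyntaxError are assumptions made for the
    example; the text above only says that a conflicting declaration
    is an error.

```python
import codecs
import re

CODING_RE = re.compile(r"coding[:=]\s*([-\w.]+)")


def source_encoding(source, default="ascii"):
    """Return (encoding, body) for raw source bytes, honouring a UTF-8 BOM."""
    has_bom = source.startswith(codecs.BOM_UTF8)  # b'\xef\xbb\xbf'
    body = source[len(codecs.BOM_UTF8):] if has_bom else source

    # Look for a magic comment in the first two lines after the signature.
    declared = None
    for raw_line in body.splitlines()[:2]:
        match = CODING_RE.search(raw_line.decode("ascii", "replace"))
        if match:
            declared = match.group(1)
            break

    if has_bom:
        # With a signature present, a magic comment may only say utf-8.
        if declared is not None and codecs.lookup(declared).name != "utf-8":
            raise SyntaxError("encoding problem: %s with UTF-8 BOM" % declared)
        return "utf-8", body
    return (declared or default), body
```

    Comparing the normalized codec name rather than the raw string lets
    spellings such as "UTF-8" and "utf8" pass alongside "utf-8".
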
Concepts

    The PEP is based on the following concepts which would have to be
    implemented to enable usage of such a magic comment:

    1. The complete Python source file should use a single encoding.
       Embedding of differently encoded data is not allowed and will
       result in a decoding error during compilation of the Python
       source code.

       Any encoding which allows processing the first two lines in the
       way indicated above is allowed as source code encoding. This
       includes ASCII compatible encodings as well as certain
       multi-byte encodings such as Shift_JIS. It does not include
       encodings which use two or more bytes for all characters, such
       as UTF-16. The reason for this is to keep the encoding
       detection algorithm in the tokenizer simple.

    2. Handling of escape sequences should continue to work as it does
       now, but with all possible source code encodings, that is
       standard string literals (both 8-bit and Unicode) are subject to
       escape sequence expansion while raw string literals only expand
       a very small subset of escape sequences.

    3. Python's tokenizer/compiler combo will need to be updated to
       work as follows:

       1. read the file

       2. decode it into Unicode assuming a fixed per-file encoding

       3. convert it into a UTF-8 byte string

       4. tokenize the UTF-8 content

       5. compile it, creating Unicode objects from the given Unicode data
          and creating string objects from the Unicode literal data
          by first reencoding the UTF-8 data into 8-bit string data
          using the given file encoding

       Note that Python identifiers are restricted to the ASCII
       subset of the encoding, and thus need no further conversion
       after step 4.

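    The byte-level effect of steps 2, 3 and 5 can be illustrated with
    ordinary codec calls. This is a rough sketch in current Python 3,
    not the tokenizer itself; the helper names to_tokenizer_input and
    eight_bit_literal are invented for the example, and the literal is
    cut out by searching for the quotes instead of by tokenizing.

```python
def to_tokenizer_input(raw, file_encoding):
    """Steps 2 and 3: decode the raw file, then re-encode it as UTF-8."""
    text = raw.decode(file_encoding)   # step 2: bytes -> Unicode
    return text.encode("utf-8")        # step 3: Unicode -> UTF-8 bytes


def eight_bit_literal(utf8_literal, file_encoding):
    """Step 5 for 8-bit strings: UTF-8 data back to the file's encoding."""
    return utf8_literal.decode("utf-8").encode(file_encoding)


# The bytes 0xE4 0xF6 0xFC are three accented letters in ISO-8859-1.
raw = b"s = '\xe4\xf6\xfc'\n"
utf8 = to_tokenizer_input(raw, "iso-8859-1")

# Crude stand-in for step 4: take whatever sits between the quotes.
literal = utf8[utf8.index(b"'") + 1:utf8.rindex(b"'")]

# An 8-bit string literal ends up with the bytes that were in the file,
# while a Unicode literal would be built from the decoded characters.
restored = eight_bit_literal(literal, "iso-8859-1")
```

    The tokenizer only ever sees UTF-8, yet an 8-bit string literal
    keeps exactly the bytes the author typed, because step 5 undoes
    the conversion with the same file encoding.
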
Implementation

    For backwards-compatibility with existing code which currently
    uses non-ASCII in string literals without declaring an encoding,
    the implementation will be introduced in two phases:

    1. Allow non-ASCII in string literals and comments, by internally
       treating a missing encoding declaration as a declaration of
       "iso-8859-1". This will cause arbitrary byte strings to
       correctly round-trip between step 2 and step 5 of the
       processing, and provide compatibility with Python 2.2 for
       Unicode literals that contain non-ASCII bytes.

       A warning will be issued if non-ASCII bytes are found in the
       input, once per improperly encoded input file.

    2. Remove the warning, and change the default encoding to "ascii".

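    The round-trip property that phase 1 relies on is easy to check:
    ISO-8859-1 maps each of the 256 byte values to one code point, so
    decoding and re-encoding never loses data. The snippet below is a
    small demonstration in current Python 3 and is not taken from
    either patch.

```python
# Every possible byte value, as might appear in an undeclared 8-bit literal.
all_bytes = bytes(range(256))

# Steps 2 and 3: decode with the implied "iso-8859-1", then go to UTF-8.
as_utf8 = all_bytes.decode("iso-8859-1").encode("utf-8")

# Step 5: re-encode the UTF-8 data using the file encoding.
round_tripped = as_utf8.decode("utf-8").encode("iso-8859-1")

# The stricter phase 2 default rejects the same input at step 2.
try:
    all_bytes.decode("ascii")
    ascii_accepts = True
except UnicodeDecodeError:
    ascii_accepts = False
```

    Here round_tripped equals all_bytes, while the ASCII decode fails,
    which is why the default is only tightened in the second phase.
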
    The builtin compile() API will be enhanced to accept Unicode as
    input. 8-bit string input is subject to the standard procedure for
    encoding detection as described above.

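    Both kinds of input can be tried on a current Python 3 interpreter,
    where str plays the role of the Unicode type and bytes the role of
    the 8-bit string. The example only shows the behaviour described
    above as it appears today; the file names passed to compile() and
    the namespace variables are arbitrary.

```python
namespace_text = {}
namespace_bytes = {}

# Unicode (str) input: already decoded, so no detection is needed.
code = compile("s = '\u00e9'\n", "<unicode-input>", "exec")
exec(code, namespace_text)

# Byte input: the declaration on the first line selects the decoder,
# so the single byte 0xE9 is read as U+00E9.
source = b"# -*- coding: iso-8859-1 -*-\ns = '\xe9'\n"
code = compile(source, "<bytes-input>", "exec")
exec(code, namespace_bytes)
```

    In both cases the name s ends up bound to the same one-character
    string.
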
    SUZUKI Hisao is working on a patch; see [2] for details. A patch
    implementing only phase 1 is available at [1].

Scope

    This PEP intends to provide an upgrade path from the current
    (more-or-less) undefined source code encoding situation to a more
    robust and portable definition.

References

    [1] Phase 1 implementation:
        http://python.org/sf/526840

    [2] Phase 2 implementation:
        http://python.org/sf/534304

History

    1.10 and above: see CVS history
    1.8: Added '.' to the coding RE.
    1.7: Added warnings to phase 1 implementation. Replaced the
         Latin-1 default encoding with the interpreter's default
         encoding. Added tweaks to compile().
    1.4 - 1.6: Minor tweaks
    1.3: Worked in comments by Martin v. Loewis:
         UTF-8 BOM mark detection, Emacs style magic comment,
         two phase approach to the implementation

Copyright

    This document has been placed in the public domain.


Local Variables:
mode: indented-text
indent-tabs-mode: nil
End: