PEP: 0263
Title: Defining Python Source Code Encodings
Version: $Revision$
Author: mal@lemburg.com (Marc-André Lemburg)
Status: Draft
Type: Standards Track
Python-Version: 2.3
Created: 06-Jun-2001
Post-History:
Requires: 244

Abstract

This PEP proposes to introduce a syntax to declare the encoding of
a Python source file. The Python parser then uses this encoding
information to interpret the file. Most notably, this enhances the
interpretation of Unicode literals in the source code and makes it
possible to write Unicode literals using e.g. UTF-8 directly in a
Unicode-aware editor.

Problem

In Python 2.1, Unicode literals can only be written using the
Latin-1 based encoding "unicode-escape". This makes the
programming environment rather unfriendly to Python users who live
and work in non-Latin-1 locales, such as many of the Asian
countries. Programmers can write their 8-bit strings using their
favourite encoding, but are bound to the "unicode-escape" encoding
for Unicode literals.

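As a rough illustration of the restriction described above, the
sketch below uses the "unicode_escape" codec of present-day Python 3
as a stand-in for the "unicode-escape" encoding of Python 2.1. That
substitution is an assumption made for illustration; the variable
names are hypothetical.

```python
# Illustration only: Python 3's "unicode_escape" codec stands in for
# the "unicode-escape" encoding discussed above.

# A character outside Latin-1 has to be spelled as an escape sequence.
euro_escaped = "\u20ac".encode("unicode_escape")

# Bytes that are not part of an escape are interpreted as Latin-1.
e_acute = b"\xe9".decode("unicode_escape")

print(euro_escaped)
print(e_acute)
```

A programmer working in, say, a Japanese encoding therefore cannot
type the characters directly and has to fall back on escapes.
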
Proposed Solution

I propose to make the Python source code encoding both visible and
changeable on a per-source-file basis by using a special comment
at the top of the file to declare the encoding.

To make Python aware of this encoding declaration, a number of
conceptual changes are necessary with respect to the handling of
Python source code data.

Concepts

The PEP is based on the following concepts, which would have to be
implemented to enable usage of such a magic comment:

1. The complete Python source file should use a single encoding.
   Embedding of differently encoded data is not allowed and will
   result in a decoding error during compilation of the Python
   source code.

2. Handling of escape sequences should continue to work as it does
   now, but with all possible source code encodings; that is,
   standard string literals (both 8-bit and Unicode) are subject to
   escape sequence expansion, while raw string literals only expand
   a very small subset of escape sequences.

3. Python's tokenizer/compiler combo will need to be updated to
   work as follows:

   1. read the file

   2. decode it into Unicode assuming a fixed per-file encoding

   3. tokenize the Unicode content

   4. compile it, creating Unicode objects from the given Unicode data
      and creating string objects from the Unicode literal data
      by first reencoding the Unicode data into 8-bit string data
      using the given file encoding

   5. variable names and other identifiers will be reencoded into
      8-bit strings using the file encoding to assure backward
      compatibility with the existing implementation

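The read/decode/tokenize/compile sequence above can be sketched in
present-day Python 3. This is only an approximation under stated
assumptions: the helper name compile_source is hypothetical, the
standard tokenize module and the built-in compile() stand in for the
interpreter's internal tokenizer and compiler, and the reencoding
into 8-bit strings (steps 4 and 5) is not reproduced because Python 3
string objects are already Unicode.

```python
import io
import tokenize


def compile_source(raw_bytes, encoding, filename="<sketch>"):
    """Hypothetical helper mirroring steps 2 to 4 above."""
    # Step 2: decode the whole file with one fixed encoding.
    # Data in any other encoding raises UnicodeDecodeError here.
    text = raw_bytes.decode(encoding)

    # Step 3: tokenize the decoded Unicode text.
    tokens = list(tokenize.generate_tokens(io.StringIO(text).readline))

    # Step 4: compile the decoded text into a code object.
    code = compile(text, filename, "exec")
    return tokens, code


# Step 1 would normally be open(path, "rb").read(); an in-memory
# byte string keeps the sketch self-contained.
raw = 'greeting = "gr\u00fc\u00df dich"\n'.encode("utf-8")
tokens, code = compile_source(raw, "utf-8")

namespace = {}
exec(code, namespace)
print(namespace["greeting"])

# Concept 1: a file mixing UTF-8 and Latin-1 data fails to decode.
mixed = "x = '\u00e9'".encode("utf-8") + b"  # \xe9\n"
try:
    compile_source(mixed, "utf-8")
    mixed_failed = False
except UnicodeDecodeError:
    mixed_failed = True
print(mixed_failed)
```

Decoding before tokenizing is what lets a single declared encoding
govern every literal in the file.
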
ISSUE:

Should we restrict identifiers to ASCII?

To make this backwards compatible, the implementation would have to
assume Latin-1 as the original file encoding if none is given
(otherwise, binary data currently stored in 8-bit strings wouldn't
survive the roundtrip).

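The roundtrip argument can be checked directly. The sketch below,
written for present-day Python 3 with hypothetical variable names,
shows that decoding arbitrary bytes as Latin-1 and encoding them
again is lossless, which is not true of an encoding such as UTF-8.

```python
# Every possible byte value, as might appear in a legacy 8-bit string.
all_bytes = bytes(range(256))

# Latin-1 maps each byte to the code point with the same number, so
# decoding and reencoding returns the original bytes unchanged.
roundtrip = all_bytes.decode("latin-1").encode("latin-1")
print(roundtrip == all_bytes)

# UTF-8, by contrast, rejects many byte sequences outright.
try:
    all_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print(utf8_ok)
```
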
Comment Syntax

The magic comment will use the following syntax. It will have to
appear as the first or second line in the Python source file.

ISSUE:

Possible choices for the format:

1. Emacs style:

      #!/usr/bin/python
      # -*- coding: utf-8; -*-

2. Via a pseudo-option to the interpreter (one which is not used
   by the interpreter):

      #!/usr/bin/python --encoding=utf-8

3. Using a special comment format:

      #!/usr/bin/python
      #!encoding = 'utf-8'

4. XML-style format:

      #!/usr/bin/python
      #?python encoding = 'utf-8'

Usage of a new keyword "directive" (see PEP 244) for this purpose
has been proposed, but was put aside because PEP 244 is not (yet)
widely accepted.

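To make the "first or second line" rule concrete, here is a minimal
sketch of how a tool might look for the Emacs-style comment (format
1). The format is still an open issue in this draft, so the regular
expression, the function name declared_encoding and the Latin-1
fallback are all assumptions chosen for illustration, not part of
the proposal.

```python
import re

# Hypothetical pattern for format 1 (Emacs style); the draft leaves
# the final format open.
CODING_RE = re.compile(r"coding[:=]\s*([-\w.]+)")


def declared_encoding(source_lines, default="latin-1"):
    """Return the encoding named in the first two lines, else default."""
    for line in source_lines[:2]:
        # Only comment lines may carry the declaration.
        if line.lstrip().startswith("#"):
            match = CODING_RE.search(line)
            if match:
                return match.group(1)
    return default


lines = ["#!/usr/bin/python", "# -*- coding: utf-8; -*-", "print('hi')"]
print(declared_encoding(lines))
```

A declaration on the third line or later is ignored by this sketch,
matching the positional requirement stated above.
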
Scope

This PEP only affects Python source code which makes use of the
proposed magic comment. Without the magic comment in the proposed
position, Python will treat the source file as it does currently
to maintain backwards compatibility.

Copyright

This document has been placed in the public domain.


Local Variables:
mode: indented-text
indent-tabs-mode: nil
End: