PEP: 0263
Title: Defining Python Source Code Encodings
Version: $Revision$
Author: mal@lemburg.com (Marc-André Lemburg)
Status: Draft
Type: Standards Track
Python-Version: 2.3
Created: 06-Jun-2001
Post-History:
Requires: 244

Abstract

    This PEP proposes to introduce a syntax to declare the encoding of
    a Python source file. The encoding information is then used by the
    Python parser to interpret the file using the given encoding. Most
    notably, this enhances the interpretation of Unicode literals in
    the source code and makes it possible to write Unicode literals
    using e.g. UTF-8 directly in a Unicode-aware editor.

Problem

    In Python 2.1, Unicode literals can only be written using the
    Latin-1 based encoding "unicode-escape". This makes the
    programming environment rather unfriendly to Python users who live
    and work in non-Latin-1 locales, such as many of the Asian
    countries. Programmers can write their 8-bit strings using their
    favourite encoding, but are bound to the "unicode-escape" encoding
    for Unicode literals.
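
    The limitation can be illustrated with a small sketch. It is
    written for a current Python 3 interpreter rather than Python 2.1,
    and it assumes that the "unicode_escape" codec is a fair stand-in
    for how the parser reads the bytes of a Unicode literal: every
    byte is taken as the Latin-1 character with the same value, and
    only \uXXXX escapes can reach characters outside that range.

```python
# The same word, saved by two editors using different encodings.
latin1_bytes = "café".encode("latin-1")   # b'caf\xe9'
utf8_bytes = "café".encode("utf-8")       # b'caf\xc3\xa9'

# "unicode_escape" treats each byte as a Latin-1 character and
# expands backslash escapes such as \u20ac.
from_latin1 = latin1_bytes.decode("unicode_escape")
from_utf8 = utf8_bytes.decode("unicode_escape")
escaped = b"\\u20ac".decode("unicode_escape")

print(from_latin1 == "café")   # Latin-1 input is read as intended
print(from_utf8 == "café")     # UTF-8 input is misread as two characters
print(escaped == "\u20ac")     # non-Latin-1 characters need an escape
```

    The UTF-8 bytes come out as two Latin-1 characters instead of one
    accented letter, which is the situation the proposal aims to fix.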

Proposed Solution

    I propose to make the Python source code encoding both visible and
    changeable on a per-source-file basis by using a special comment
    at the top of the file to declare the encoding.

    To make Python aware of this encoding declaration, a number of
    conceptual changes are necessary with respect to the handling of
    Python source code data.

Concepts

    The PEP is based on the following concepts, which would have to be
    implemented to enable usage of such a magic comment:

    1. The complete Python source file should use a single encoding.
       Embedding of differently encoded data is not allowed and will
       result in a decoding error during compilation of the Python
       source code.

    2. Handling of escape sequences should continue to work as it does
       now, but with all possible source code encodings; that is,
       standard string literals (both 8-bit and Unicode) are subject to
       escape sequence expansion, while raw string literals only expand
       a very small subset of escape sequences.

    3. Python's tokenizer/compiler combo will need to be updated to
       work as follows:

       1. read the file

       2. decode it into Unicode assuming a fixed per-file encoding

       3. tokenize the Unicode content

       4. compile it, creating Unicode objects from the given Unicode data
          and creating string objects from the Unicode literal data
          by first reencoding the Unicode data into 8-bit string data
          using the given file encoding

       5. variable names and other identifiers will be reencoded into
          8-bit strings using the file encoding to ensure backward
          compatibility with the existing implementation

       ISSUE:

           Should we restrict identifiers to ASCII?
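
    The tokenizer/compiler steps listed under concept 3 can be
    sketched in a few lines. This is only an illustration written for
    Python 3, not the proposed implementation: compile_literals() and
    its regular expression are hypothetical, the "file" is a bytes
    object already held in memory (step 1), and only simple
    double-quoted literals without escapes are recognised. Step 5
    (identifiers) is left out.

```python
import re

# Hypothetical, deliberately tiny "tokenizer": an optional u prefix
# followed by a double-quoted literal containing no escapes.
_LITERAL = re.compile(r'(u?)"([^"\\]*)"')

def compile_literals(raw, encoding="latin-1"):
    """Return (kind, value) pairs for the literals in raw source bytes."""
    # Step 2: decode the whole file using the single per-file encoding.
    # Bytes that are invalid in that encoding raise UnicodeDecodeError.
    text = raw.decode(encoding)
    results = []
    # Step 3: tokenize the Unicode content.
    for match in _LITERAL.finditer(text):
        prefix, body = match.groups()
        if prefix == "u":
            # Step 4: a Unicode literal keeps the decoded text.
            results.append(("unicode", body))
        else:
            # Step 4: an 8-bit string literal is re-encoded with the
            # file encoding, giving back the bytes found in the file.
            results.append(("string", body.encode(encoding)))
    return results

source = 'a = u"grüß"\nb = "grüß"\n'.encode("utf-8")
for kind, value in compile_literals(source, "utf-8"):
    print(kind, repr(value))
```

    Passing Latin-1 encoded bytes together with encoding="utf-8"
    raises UnicodeDecodeError in step 2, which corresponds to the
    decoding error described in concept 1.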

    To make this backwards compatible, the implementation would have to
    assume Latin-1 as the original file encoding if none is given
    (otherwise, binary data currently stored in 8-bit strings would not
    survive the roundtrip).
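
    Why Latin-1 is the natural default can be checked directly: it
    maps each of the 256 byte values to exactly one character, so a
    decode followed by an encode returns the original bytes. The
    helper name survives_roundtrip() below is made up for this
    Python 3 sketch.

```python
def survives_roundtrip(data, encoding):
    """True if decoding and then re-encoding returns the original bytes."""
    try:
        return data.decode(encoding).encode(encoding) == data
    except UnicodeError:
        # The bytes are not valid in this encoding at all.
        return False

# Arbitrary binary data, as it might be stored in an 8-bit string.
all_bytes = bytes(range(256))

print(survives_roundtrip(all_bytes, "latin-1"))  # True
print(survives_roundtrip(all_bytes, "utf-8"))    # False
print(survives_roundtrip(all_bytes, "ascii"))    # False
```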

Comment Syntax

    The magic comment will use the following syntax. It will have to
    appear as the first or second line in the Python source file.

    ISSUE:

        Possible choices for the format:

        1. Emacs style:

            #!/usr/bin/python
            # -*- coding: utf-8; -*-

        2. Via a pseudo-option to the interpreter (one which is not used
           by the interpreter):

            #!/usr/bin/python --encoding=utf-8

        3. Using a special comment format:

            #!/usr/bin/python
            #!encoding = 'utf-8'

        4. XML-style format:

            #!/usr/bin/python
            #?python encoding = 'utf-8'

        Usage of a new keyword "directive" (see PEP 244) for this purpose
        has been proposed, but was put aside because PEP 244 is not
        widely accepted (yet).
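
    Since the format is still an open issue, the following Python 3
    sketch simply assumes that candidate 1 (Emacs style) is chosen.
    The function declared_encoding(), its regular expression and the
    "latin-1" fallback name are illustrative assumptions; the lines
    are taken to be already available as text, which works because
    all four candidate formats are plain ASCII.

```python
import re

# Assumed pattern for the Emacs-style candidate only, e.g.
#   # -*- coding: utf-8; -*-
_CODING = re.compile(r"^#.*-\*-.*coding:\s*([-\w.]+).*-\*-")

def declared_encoding(source_lines, default="latin-1"):
    """Return the encoding named in the first two lines, else the default."""
    # Only the first or second line may carry the magic comment.
    for line in source_lines[:2]:
        match = _CODING.match(line)
        if match:
            return match.group(1)
    return default

with_comment = ["#!/usr/bin/python", "# -*- coding: utf-8; -*-", "x = 1"]
without_comment = ["#!/usr/bin/python", "x = 1"]
too_late = ["#!/usr/bin/python", "x = 1", "# -*- coding: utf-8; -*-"]

print(declared_encoding(with_comment))     # utf-8
print(declared_encoding(without_comment))  # latin-1
print(declared_encoding(too_late))         # latin-1
```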

Scope

    This PEP only affects Python source code which makes use of the
    proposed magic comment. Without the magic comment in the proposed
    position, Python will treat the source file as it does currently
    to maintain backwards compatibility.

Copyright

    This document has been placed in the public domain.


Local Variables:
mode: indented-text
indent-tabs-mode: nil
End: