Add PEP 3142.
This commit is contained in:
parent
fc960b4572
commit
3c28c51417
|
@ -118,6 +118,7 @@ Index by Category
|
||||||
S 3118 Revising the buffer protocol Oliphant, Banks
|
S 3118 Revising the buffer protocol Oliphant, Banks
|
||||||
S 3119 Introducing Abstract Base Classes GvR, Talin
|
S 3119 Introducing Abstract Base Classes GvR, Talin
|
||||||
S 3141 A Type Hierarchy for Numbers Yasskin
|
S 3141 A Type Hierarchy for Numbers Yasskin
|
||||||
|
S 3142 Using UTF-8 as the default source encoding von Löwis
|
||||||
|
|
||||||
Finished PEPs (done, implemented in Subversion)
|
Finished PEPs (done, implemented in Subversion)
|
||||||
|
|
||||||
|
@ -475,6 +476,7 @@ Numerical Index
|
||||||
S 3118 Revising the buffer protocol Oliphant, Banks
|
S 3118 Revising the buffer protocol Oliphant, Banks
|
||||||
S 3119 Introducing Abstract Base Classes GvR, Talin
|
S 3119 Introducing Abstract Base Classes GvR, Talin
|
||||||
S 3141 A Type Hierarchy for Numbers Yasskin
|
S 3141 A Type Hierarchy for Numbers Yasskin
|
||||||
|
S 3142 Using UTF-8 as the default source encoding von Löwis
|
||||||
|
|
||||||
|
|
||||||
Key
|
Key
|
||||||
|
|
|
@ -0,0 +1,107 @@
|
||||||
|
PEP: 3142
|
||||||
|
Title: Using UTF-8 as the default source encoding
|
||||||
|
Version: $Revision $
|
||||||
|
Last-Modified: $Date $
|
||||||
|
Author: Martin v. Löwis <martin@v.loewis.de>
|
||||||
|
Status: Draft
|
||||||
|
Type: Standards Track
|
||||||
|
Content-Type: text/x-rst
|
||||||
|
Created: 15-Apr-2007
|
||||||
|
Python-Version: 3.0
|
||||||
|
Post-History:
|
||||||
|
|
||||||
|
|
||||||
|
Specification
|
||||||
|
=============
|
||||||
|
|
||||||
|
This PEP proposes to change the default source encoding from ASCII to
|
||||||
|
UTF-8. Support for alternative source encodings [#pep263]_ continues to
|
||||||
|
exist; an explicit encoding declaration takes precedence over the
|
||||||
|
default.
|
||||||
|
|
||||||
|
|
||||||
|
A Bit of History
|
||||||
|
================
|
||||||
|
|
||||||
|
In Python 1, the source encoding was unspecified, except that the
|
||||||
|
source encoding had to be a superset of the system's basic execution
|
||||||
|
character set (i.e. an ASCII superset, on most systems). The source
|
||||||
|
encoding was only relevant for the lexis itself (bytes representing
|
||||||
|
letters for keywords, identifiers, punctuation, line breaks, etc).
|
||||||
|
The contents of a string literal was copied literally from the file
|
||||||
|
on source.
|
||||||
|
|
||||||
|
In Python 2.0, the source encoding changed to Latin-1 as a side effect
|
||||||
|
of introducing Unicode. For Unicode string literals, the characters
|
||||||
|
were still copied literally from the source file, but widened on a
|
||||||
|
character-by-character basis. As Unicode gives a fixed interpretation
|
||||||
|
to code points, this algorithm effectively fixed a source encoding, at
|
||||||
|
least for files containing non-ASCII characters in Unicode literals.
|
||||||
|
|
||||||
|
PEP 263 identified the problem that you can use only those Unicode
|
||||||
|
characters in a Unicode literal which are also in Latin-1, and
|
||||||
|
introduced a syntax for declaring the source encoding. If no source
|
||||||
|
encoding was given, the default should be ASCII. For compatibility
|
||||||
|
with Python 2.0 and 2.1, files were interpreted as Latin-1 for a
|
||||||
|
transitional period. This transition ended with Python 2.5, which
|
||||||
|
gives an error if non-ASCII characters are encountered and no source
|
||||||
|
encoding is declared.
|
||||||
|
|
||||||
|
Rationale
|
||||||
|
=========
|
||||||
|
|
||||||
|
With PEP 263, using arbitrary non-ASCII characters in a Python file is
|
||||||
|
possible, but tedious. One has to explicitly add an encoding
|
||||||
|
declaration. Even though some editors (like IDLE and Emacs) support
|
||||||
|
the declarations of PEP 263, many editors still do not (and never
|
||||||
|
will); users have to explicitly adjust the encoding which the editor
|
||||||
|
assumes on a file-by-file basis.
|
||||||
|
|
||||||
|
When the default encoding is changed to UTF-8, adding non-ASCII text
|
||||||
|
to Python files becomes easier and more portable: On some systems,
|
||||||
|
editors will automatically choose UTF-8 when saving text (e.g. on Unix
|
||||||
|
systems where the locale uses UTF-8). On other systems, editors will
|
||||||
|
guess the encoding when reading the file, and UTF-8 is easy to
|
||||||
|
guess. Yet other editors support associating a default encoding with a
|
||||||
|
file extension, allowing users to associate .py with UTF-8.
|
||||||
|
|
||||||
|
For Python 2, an important reason for using non-UTF-8 encodings was
|
||||||
|
that byte string literals would be in the source encoding at run-time,
|
||||||
|
allowing then to output them to a file or render them to the user
|
||||||
|
as-is. With Python 3, all strings will be Unicode strings, so the
|
||||||
|
original encoding of the source will have no impact at run-time.
|
||||||
|
|
||||||
|
Implementation
|
||||||
|
==============
|
||||||
|
|
||||||
|
The parser needs to be changed to accept bytes > 127 if no source
|
||||||
|
encoding is specified; instead of giving an error, it needs to check
|
||||||
|
that the bytes are well-formed UTF-8 (decoding is not necessary,
|
||||||
|
as the parser converts all source code to UTF-8, anyway).
|
||||||
|
|
||||||
|
IDLE needs to be changed to use UTF-8 as the default encoding.
|
||||||
|
|
||||||
|
|
||||||
|
References
|
||||||
|
==========
|
||||||
|
|
||||||
|
.. [#pep263]
|
||||||
|
http://www.python.org/dev/peps/pep-0263/
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
Copyright
|
||||||
|
=========
|
||||||
|
|
||||||
|
This document has been placed in the public domain.
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
..
|
||||||
|
Local Variables:
|
||||||
|
mode: indented-text
|
||||||
|
indent-tabs-mode: nil
|
||||||
|
sentence-end-double-space: t
|
||||||
|
fill-column: 70
|
||||||
|
coding: utf-8
|
||||||
|
End:
|
Loading…
Reference in New Issue