Add PEP 383.
parent
2d14b54016
commit
61cb395f24
@@ -0,0 +1,118 @@
PEP: 383
Title: Non-decodable Bytes in System Character Interfaces
Version: $Revision$
Last-Modified: $Date$
Author: Martin v. Löwis <martin@v.loewis.de>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 22-Apr-2009
Python-Version: 3.1
Post-History:

Abstract
========

File names, environment variables, and command line arguments are
defined as being character data in POSIX; the C APIs, however, allow
passing arbitrary bytes - whether these conform to a certain encoding
or not. This PEP proposes a means of dealing with such irregularities
by embedding the bytes in character strings in a way that allows
recreation of the original byte string.

Rationale
=========

The C char type is a data type that is commonly used to represent both
character data and bytes. Certain POSIX interfaces are specified and
widely understood as operating on character data; however, the system
call interfaces make no assumption about the encoding of these data,
and pass them on as-is. With Python 3, character strings use a
Unicode-based internal representation, making it difficult to ignore
the encoding of byte strings in the same way that the C interfaces can
ignore the encoding.

On the other hand, Microsoft Windows NT has corrected the original
design limitation of Unix, and made it explicit in its system
interfaces that these data (file names, environment variables, command
line arguments) are indeed character data, by providing a
Unicode-based API (keeping a C-char-based one for backwards
compatibility).

For Python 3, one proposed solution is to provide two sets of APIs: a
byte-oriented one, and a character-oriented one, where the
character-oriented one would be unable to represent all data
accurately. Unfortunately, for Windows, the situation would be exactly
the opposite: the byte-oriented interface cannot represent all data;
only the character-oriented API can. As a consequence, libraries and
applications that want to support all user data in a cross-platform
manner have to accept a mish-mash of bytes and characters exactly in
the way that caused endless troubles for Python 2.x.

With this PEP, a uniform treatment of these data as characters becomes
possible. The uniformity is achieved by using specific encoding
algorithms, meaning that the data can be converted back to bytes on
POSIX systems only if the same encoding is used.

Specification
=============

On Windows, Python uses the wide character APIs to access
character-oriented APIs, allowing direct conversion of the
environmental data to Python str objects.

On POSIX systems, Python currently applies the locale's encoding to
convert the byte data to Unicode. If the locale's encoding is UTF-8,
it can represent the full set of Unicode characters; otherwise, only a
subset is representable. In the latter case, using private-use
characters to represent the non-decodable bytes would be an option.
For UTF-8, doing so would create an ambiguity, as the private-use
characters may regularly occur in the input also.

To convert non-decodable bytes, a new error handler "python-escape" is
introduced, which decodes non-decodable bytes into a private-use
character U+F01xx, which is believed to not conflict with private-use
characters that currently exist in Python codecs.

The error handler interface is extended to allow the encode error
handler to return byte strings immediately, in addition to returning
Unicode strings which then get encoded again.
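
The two paragraphs above can be sketched with the existing
``codecs.register_error`` hook. This is only an illustration, not the
proposed implementation: the handler name "pua-escape-sketch" is
hypothetical, and reading U+F01xx as "U+F0100 plus the byte value" is
an assumption. The encode branch returns a byte string directly, which
Python 3 as shipped accepts from encode error handlers.

```python
import codecs

# Assumption: "U+F01xx" means U+F0100 + the value of the raw byte.
PUA_BASE = 0xF0100


def pua_escape(exc):
    """Hypothetical stand-in for the proposed "python-escape" handler."""
    if isinstance(exc, UnicodeDecodeError):
        # Map every undecodable byte to one private-use character.
        bad = exc.object[exc.start:exc.end]
        return "".join(chr(PUA_BASE + b) for b in bad), exc.end
    if isinstance(exc, UnicodeEncodeError):
        chars = exc.object[exc.start:exc.end]
        if all(PUA_BASE <= ord(c) <= PUA_BASE + 0xFF for c in chars):
            # Return bytes directly; they are copied to the output as-is.
            return bytes(ord(c) - PUA_BASE for c in chars), exc.end
    # Anything else is a genuine error and is re-raised unchanged.
    raise exc


codecs.register_error("pua-escape-sketch", pua_escape)

raw = b"caf\xe9.txt"  # 0xE9 is not decodable as ASCII
text = raw.decode("ascii", "pua-escape-sketch")
restored = text.encode("ascii", "pua-escape-sketch")
```

Here ``text`` carries the escaped byte as a single private-use
character, and ``restored`` is expected to equal ``raw`` again.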

If the locale's encoding is UTF-8, the file system encoding is set to
a new encoding "utf-8b". The UTF-8b codec decodes non-decodable bytes
(which must be >= 0x80) into half surrogate codes U+DC80..U+DCFF.
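
The byte-to-surrogate mapping can be tried out in current Python 3
releases, which ship a "surrogateescape" error handler with the same
U+DC80..U+DCFF mapping. Using UTF-8 plus that handler as a stand-in
for the proposed "utf-8b" codec is an assumption of this sketch; no
codec of that name is used here.

```python
raw = b"caf\xe9.txt"  # a lone 0xE9 byte is not valid UTF-8

# Stand-in for "utf-8b": UTF-8 with the surrogateescape handler.
name = raw.decode("utf-8", "surrogateescape")

# Collect the half surrogates that carry the undecodable bytes.
escaped = [hex(ord(ch)) for ch in name if 0xDC80 <= ord(ch) <= 0xDCFF]

# Encoding with the same handler restores the original bytes.
restored = name.encode("utf-8", "surrogateescape")

# Well-formed UTF-8 input is decoded normally, without surrogates.
plain = "é".encode("utf-8").decode("utf-8", "surrogateescape")
```

Each undecodable byte 0xNN appears in ``name`` as the code point
U+DCNN, so the mapping is reversible byte by byte.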

Discussion
==========

While providing a uniform API to non-decodable bytes, this interface
has the limitation that the chosen representation only "works" if the
data get converted back to bytes with the python-escape error handler
also. Encoding the data with the locale's encoding and the (default)
strict error handler will raise an exception; encoding them with UTF-8
will produce nonsensical data.
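
A small sketch of both failure modes follows. It assumes a
private-use escape of the form U+F01E9 for the byte 0xE9 (the
"xx = byte value" reading is an assumption) and uses Latin-1 as an
example of a non-UTF-8 locale encoding.

```python
raw = b"caf\xe9.txt"

# Assumed result of escaping 0xE9 into the private-use range.
escaped = "caf\U000F01E9.txt"

# 1. The strict handler cannot encode the escape character.
try:
    escaped.encode("latin-1")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# 2. UTF-8 can encode it, but the bytes are not the original ones.
as_utf8 = escaped.encode("utf-8")
differs_from_original = as_utf8 != raw
```

In the second case no error is reported, which is why the mismatch is
easy to miss.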

For most applications, we assume that they eventually pass data
received from a system interface back into the same system
interfaces. For example, an application invoking os.listdir() will
likely pass the result strings back into APIs like os.stat() or
open(), which then encode them back into their original byte
representation. Applications that need to process the original byte
strings can obtain them by encoding the character strings with the
file system encoding, passing "python-escape" as the error handler
name.
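
The round trip described above might look as follows in current
Python 3. The helpers ``os.fsencode()`` and ``os.fsdecode()`` are not
named in this draft; they are used here as today's way of applying the
file system encoding, and the file name "example.txt" is made up for
the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "example.txt"), "w") as f:
        f.write("data")

    # Names come back from the system interface as str objects ...
    names = os.listdir(tmp)

    # ... and are passed straight back into other system interfaces.
    sizes = {n: os.stat(os.path.join(tmp, n)).st_size for n in names}

    # Applications that need the original byte strings encode the
    # names with the file system encoding.
    raw_names = [os.fsencode(n) for n in names]
    decoded_again = [os.fsdecode(b) for b in raw_names]
```

The application never needs to know which encoding, if any, the file
names were originally written in.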

Copyright
=========

This document has been placed in the public domain.

..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End: