PEP: 383
Title: Non-decodable Bytes in System Character Interfaces
Version: $Revision$
Last-Modified: $Date$
Author: Martin v. Löwis <martin@v.loewis.de>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 22-Apr-2009
Python-Version: 3.1
Post-History:

Abstract
========

File names, environment variables, and command line arguments are
defined as being character data in POSIX; the C APIs, however, allow
passing arbitrary bytes, whether or not these conform to a certain
encoding. This PEP proposes a means of dealing with such
irregularities by embedding the bytes in character strings in a way
that allows recreation of the original byte string.

Rationale
=========

The C char type is a data type that is commonly used to represent both
character data and bytes. Certain POSIX interfaces are specified and
widely understood as operating on character data; however, the system
call interfaces make no assumption about the encoding of these data
and pass them on as-is. With Python 3, character strings use a
Unicode-based internal representation, making it difficult to ignore
the encoding of byte strings in the same way that the C interfaces
can.

On the other hand, Microsoft Windows NT has corrected the original
design limitation of Unix and made it explicit in its system
interfaces that these data (file names, environment variables, command
line arguments) are indeed character data, by providing a
Unicode-based API (keeping a C-char-based one for backwards
compatibility).

For Python 3, one proposed solution is to provide two sets of APIs: a
byte-oriented one and a character-oriented one, where the
character-oriented one would not be able to represent all data
accurately. Unfortunately, for Windows, the situation would be exactly
the opposite: the byte-oriented interface cannot represent all data;
only the character-oriented API can. As a consequence, libraries and
applications that want to support all user data in a cross-platform
manner would have to accept a mish-mash of bytes and characters,
exactly in the way that caused endless troubles for Python 2.x.

With this PEP, a uniform treatment of these data as characters becomes
possible. The uniformity is achieved by using specific encoding
algorithms, meaning that the data can be converted back to bytes on
POSIX systems only if the same encoding is used.

Being able to treat such strings uniformly will allow application
writers to abstract from details specific to the operating system, and
will reduce the risk of one API failing when the other API would have
worked.

Specification
=============

On Windows, Python uses the wide character APIs to access
character-oriented APIs, allowing direct conversion of the
environmental data to Python str objects ([1]).

On POSIX systems, Python currently applies the locale's encoding to
convert the byte data to Unicode, failing for bytes that cannot be
decoded. With this PEP, non-decodable bytes will be represented as
lone half surrogate codes U+DCxx.

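The mapping implied by the U+DCxx notation can be sketched as simple
arithmetic. This is an illustrative sketch only, assuming the code
point is 0xDC00 plus the byte value (consistent with the range
U+DC80..U+DCFF given for bytes >= 0x80); the helper names
byte_to_half_surrogate and half_surrogate_to_byte are hypothetical and
not part of any proposed API:

```python
def byte_to_half_surrogate(byte):
    """Map a non-decodable byte (0x80..0xFF) to a lone half surrogate."""
    if not 0x80 <= byte <= 0xFF:
        raise ValueError("only bytes >= 0x80 are escaped")
    # Assumption: code point = 0xDC00 + byte value.
    return chr(0xDC00 + byte)


def half_surrogate_to_byte(char):
    """Recover the original byte from a half surrogate U+DC80..U+DCFF."""
    code = ord(char)
    if not 0xDC80 <= code <= 0xDCFF:
        raise ValueError("not an escaped byte")
    return code - 0xDC00


# Every escapable byte survives the round trip.
assert all(half_surrogate_to_byte(byte_to_half_surrogate(b)) == b
           for b in range(0x80, 0x100))
```
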
To convert non-decodable bytes, a new error handler ([2])
"python-escape" is introduced, which produces these half
surrogates. On encoding, the error handler converts the half surrogate
back to the corresponding byte. This error handler will be used in any
API that receives or produces file names, command line arguments, or
environment variables.

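The round trip can be tried out in released Python 3 interpreters,
where a handler with this behaviour is available under the name
surrogateescape; the name python-escape used in this draft is not
registered there. The sketch below relies on that substitution, and
the sample file name is made up for illustration:

```python
raw = b"caf\xe9.txt"  # hypothetical Latin-1 file name; not valid UTF-8

# Decoding: the undecodable byte 0xE9 becomes the half surrogate U+DCE9.
name = raw.decode("utf-8", "surrogateescape")
assert name == "caf\udce9.txt"

# Encoding: the half surrogate is turned back into the original byte.
assert name.encode("utf-8", "surrogateescape") == raw
```
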
The error handler interface is extended to allow the encode error
handler to return byte strings immediately, in addition to returning
Unicode strings which then get encoded again (also see the discussion
below).

If the locale's encoding is UTF-8, the file system encoding is set to
a new encoding "utf-8b", as the regular UTF-8 codec would not
re-encode half surrogates as single bytes. The UTF-8b codec decodes
invalid bytes (which must be >= 0x80) into half surrogate codes
U+DC80..U+DCFF. Unlike the utf-8 codec, the utf-8b codec follows the
strict definition of UTF-8 to determine what an invalid byte is
(which, among other restrictions, disallows encoding surrogate codes
in UTF-8).

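The effect of the strict definition can be illustrated with a byte
sequence that looks like an encoded surrogate. This sketch assumes a
released Python 3 interpreter and uses its utf-8 codec together with
the surrogateescape handler as a stand-in for the utf-8b codec
described here; no codec named utf-8b is assumed to exist:

```python
# 0xED 0xB3 0xA9 is what a lax encoder would emit for U+DCE9.  Strict
# UTF-8 forbids encoded surrogates, so all three bytes count as invalid
# and each one is escaped individually.
raw = b"\xed\xb3\xa9"
text = raw.decode("utf-8", "surrogateescape")
assert text == "\udced\udcb3\udca9"

# The escaped form still encodes back to the original three bytes.
assert text.encode("utf-8", "surrogateescape") == raw
```
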
Byte-oriented interfaces that already exist in Python 3.0 are not
affected by this specification. They are neither enhanced nor
deprecated.

Discussion
==========

While providing a uniform API to non-decodable bytes, this interface
has the limitation that the chosen representation only "works" if the
data get converted back to bytes with the python-escape error handler
as well. Encoding the data with the locale's encoding and the
(default) strict error handler will raise an exception; encoding them
with UTF-8 will produce nonsensical data.

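The failure with the strict handler can be sketched as follows, again
assuming a released Python 3 interpreter and its surrogateescape
handler in place of python-escape; ascii stands in for an arbitrary
locale encoding:

```python
name = b"caf\xe9".decode("ascii", "surrogateescape")

# With the default strict handler, the embedded half surrogate cannot
# be encoded, so the conversion back to bytes fails.
try:
    name.encode("ascii")
except UnicodeEncodeError:
    strict_failed = True
else:
    strict_failed = False
assert strict_failed

# Using the same handler on the way back restores the original bytes.
assert name.encode("ascii", "surrogateescape") == b"caf\xe9"
```
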
Data obtained from other sources may conflict with data produced
by this PEP. Dealing with such conflicts is out of scope of the PEP.

For most applications, we assume that they eventually pass data
received from a system interface back into the same system
interfaces. For example, an application invoking os.listdir() will
likely pass the result strings back into APIs like os.stat() or
open(), which then encode them back into their original byte
representation. Applications that need to process the original byte
strings can obtain them by encoding the character strings with the
file system encoding, passing "python-escape" as the error handler
name. For example, a function that works like os.listdir, except
for accepting and returning bytes, would be written as::

  import os
  import sys

  def listdir_b(dirname):
      # dirname is a bytes object; "python-escape" is the handler
      # name proposed in this PEP
      fse = sys.getfilesystemencoding()
      dirname = dirname.decode(fse, "python-escape")
      for fn in os.listdir(dirname):
          # fn is now a str object
          yield fn.encode(fse, "python-escape")

The encode error handler interface presently requires replacement
Unicode to be provided in lieu of the non-encodable Unicode from the
source string. It promptly encodes that replacement Unicode. In some
error handlers, such as the python-escape proposed here, it is simpler
and more efficient for the error handler to provide a pre-encoded
replacement byte string, rather than forcing it to calculate Unicode
from which the encoder would create the desired bytes. In fact, with
python-escape, there are required byte sequences which cannot be
generated from replacement Unicode.

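A handler of this kind can be sketched with codecs.register_error.
The sketch assumes a released Python 3 interpreter, whose encode error
handlers may return bytes as the replacement; the handler name
escape-sketch and the function escape_sketch are hypothetical, chosen
so as not to clash with any registered handler:

```python
import codecs


def escape_sketch(exc):
    """Illustrative handler: escape bad bytes, unescape half surrogates."""
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        if any(b < 0x80 for b in bad):
            raise exc  # only bytes >= 0x80 are escaped
        return "".join(chr(0xDC00 + b) for b in bad), exc.end
    if isinstance(exc, UnicodeEncodeError):
        out = bytearray()
        for ch in exc.object[exc.start:exc.end]:
            code = ord(ch)
            if not 0xDC80 <= code <= 0xDCFF:
                raise exc  # a genuinely unencodable character
            out.append(code - 0xDC00)
        # Return pre-encoded bytes; the codec copies them to the output
        # instead of encoding a replacement string.
        return bytes(out), exc.end
    raise exc


codecs.register_error("escape-sketch", escape_sketch)

raw = b"caf\xe9.txt"
text = raw.decode("utf-8", "escape-sketch")
assert text == "caf\udce9.txt"
assert text.encode("utf-8", "escape-sketch") == raw
```

Had the handler returned replacement Unicode instead, the UTF-8
encoder could not have produced the lone byte 0xE9, since that byte on
its own is not valid UTF-8.
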
A few alternative approaches have been proposed:

* create a new string subclass that supports embedded bytes
* use different escape schemes, such as escaping with a NUL
  character, or mapping to infrequent characters.

References
==========

[1] PEP 277
    "Unicode file name support for Windows NT"
    http://www.python.org/dev/peps/pep-0277/

[2] PEP 293
    "Codec Error Handling Callbacks"
    http://www.python.org/dev/peps/pep-0293/

Copyright
=========

This document has been placed in the public domain.


..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End: