PEP: 383
Title: Non-decodable Bytes in System Character Interfaces
Version: $Revision$
Last-Modified: $Date$
Author: Martin v. Löwis <martin@v.loewis.de>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 22-Apr-2009
Python-Version: 3.1
Post-History:

Abstract
========

File names, environment variables, and command line arguments are
defined as being character data in POSIX; the C APIs, however, allow
passing arbitrary bytes, whether or not they conform to a particular
encoding. This PEP proposes a means of dealing with such
irregularities by embedding the bytes in character strings in a way
that allows the original byte string to be recreated.

Rationale
=========

The C char type is a data type that is commonly used to represent both
character data and bytes. Certain POSIX interfaces are specified and
widely understood as operating on character data; however, the system
call interfaces make no assumption about the encoding of these data
and pass them on as-is. With Python 3, character strings use a
Unicode-based internal representation, making it difficult to ignore
the encoding of byte strings in the same way that the C interfaces can
ignore the encoding.

On the other hand, Microsoft Windows NT has corrected the original
design limitation of Unix, and made it explicit in its system
interfaces that these data (file names, environment variables, command
line arguments) are indeed character data, by providing a
Unicode-based API (keeping a C-char-based one for backwards
compatibility).

For Python 3, one proposed solution is to provide two sets of APIs: a
byte-oriented one, and a character-oriented one, where the
character-oriented one would be unable to represent all data
accurately. Unfortunately, for Windows, the situation would be exactly
the opposite: the byte-oriented interface cannot represent all data;
only the character-oriented API can. As a consequence, libraries and
applications that want to support all user data in a cross-platform
manner have to accept a mish-mash of bytes and characters, exactly the
situation that caused endless trouble in Python 2.x.

With this PEP, a uniform treatment of these data as characters becomes
possible. The uniformity is achieved by using specific encoding
algorithms, meaning that the data can be converted back to bytes on
POSIX systems only if the same encoding is used.

Being able to treat such strings uniformly will allow application
writers to abstract from details specific to the operating system, and
will reduce the risk of one API failing when the other API would have
worked.

Specification
=============

On Windows, Python uses the wide character APIs to access
character-oriented APIs, allowing direct conversion of the
environmental data to Python str objects ([1]).

On POSIX systems, Python currently applies the locale's encoding to
convert the byte data to Unicode, failing for byte sequences that
cannot be decoded. With this PEP, non-decodable bytes will be
represented as lone half surrogate codes U+DCxx.

To convert non-decodable bytes, a new error handler ([2])
"python-escape" is introduced, which produces these half
surrogates. On encoding, the error handler converts the half surrogate
back to the corresponding byte. This error handler will be used in any
API that receives or produces file names, command line arguments, or
environment variables.

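The round trip described above can be sketched as follows. This is a
minimal illustration, assuming a Python release in which the handler
is available under the name "surrogateescape" (the name under which it
later shipped) instead of the "python-escape" name used in this draft.
The sample file name is made up.

```python
# Assumption: released Python versions spell the handler
# "surrogateescape"; this draft calls it "python-escape".
HANDLER = "surrogateescape"

# Hypothetical file name; the byte 0xFF can never occur in valid UTF-8.
raw = b"report-\xff.txt"

# Decoding maps the undecodable byte 0xFF to the lone surrogate U+DCFF.
name = raw.decode("utf-8", HANDLER)

# Encoding with the same codec and handler restores the original bytes.
restored = name.encode("utf-8", HANDLER)
```

The decoded value is an ordinary str object, so it can be stored and
passed around like any other file name before it is encoded again.
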
The error handler interface is extended to allow the encode error
handler to return byte strings immediately, in addition to returning
Unicode strings which then get encoded again.

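As an illustration of an encode error handler that returns bytes
directly, here is a small sketch. The handler name "demo-byte-escape"
and the function are hypothetical and exist only for this example; the
sketch assumes a Python version in which codecs.register_error()
accepts handlers that return a bytes replacement.

```python
import codecs


def demo_byte_escape(exc):
    """Hypothetical handler: turn U+DC80..U+DCFF back into single bytes."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    out = bytearray()
    for ch in exc.object[exc.start:exc.end]:
        code = ord(ch)
        if 0xDC80 <= code <= 0xDCFF:
            # Recover the original byte from the low eight bits.
            out.append(code - 0xDC00)
        else:
            # Anything else is a genuine encoding error.
            raise exc
    # Returning bytes (not str) means the codec copies them to the
    # output unchanged instead of encoding them again.
    return bytes(out), exc.end


# "demo-byte-escape" is an illustrative name, not part of this PEP.
codecs.register_error("demo-byte-escape", demo_byte_escape)

encoded = "caf\udce9".encode("utf-8", "demo-byte-escape")
```

Returning a str here would not work for this purpose, since the
replacement would be encoded again and could never yield the single
byte 0xE9.
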
If the locale's encoding is UTF-8, the file system encoding is set to
a new encoding "utf-8b", as the regular UTF-8 codec would not
re-encode half surrogates as single bytes. The utf-8b codec decodes
invalid bytes (which must be >= 0x80) into half surrogate codes
U+DC80..U+DCFF. Unlike the utf-8 codec, the utf-8b codec follows the
strict definition of UTF-8 to determine what an invalid byte is
(which, among other restrictions, disallows encoding surrogate codes
in UTF-8).

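The effect of the strict definition can be seen in a short sketch. It
does not use a codec named "utf-8b"; as an assumption, it approximates
the described behaviour with the "utf-8" codec and the
"surrogateescape" handler of released Python versions, whose UTF-8
decoder rejects encoded surrogates.

```python
# Assumption: "utf-8" plus "surrogateescape" stands in for the
# "utf-8b" codec described in this draft.
HANDLER = "surrogateescape"

# 0xED 0xB3 0xBF is what a lenient encoder would emit for U+DCFF.
# A strict decoder treats all three bytes as invalid, so each one is
# escaped on its own instead of being combined into one surrogate.
lenient_form = b"\xed\xb3\xbf"
decoded = lenient_form.decode("utf-8", HANDLER)

# Every escaped byte is >= 0x80, so it lands in U+DC80..U+DCFF.
in_range = all(0xDC80 <= ord(ch) <= 0xDCFF for ch in decoded)

# Encoding again restores the three original bytes.
round_tripped = decoded.encode("utf-8", HANDLER)
```

Because the decoder never produces a half surrogate from a multi-byte
sequence, every half surrogate in a decoded string corresponds to
exactly one original byte.
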
Byte-oriented interfaces that already exist in Python 3.0 are not
affected by this specification. They are neither enhanced nor
deprecated.

Discussion
==========

While providing a uniform API to non-decodable bytes, this interface
has the limitation that the chosen representation only "works" if the
data are also converted back to bytes with the python-escape error
handler. Encoding the data with the locale's encoding and the
(default) strict error handler will raise an exception; encoding them
with UTF-8 will produce nonsensical data.

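Both failure modes can be sketched as follows. The example assumes a
released Python version, where the handler is named "surrogateescape"
and where the strict UTF-8 codec also rejects lone surrogates; the
"surrogatepass" handler is used here only to imitate a lenient UTF-8
encoder.

```python
# Assumption: "surrogateescape" is the released name of the handler
# that this draft calls "python-escape".
name = b"\xff".decode("utf-8", "surrogateescape")

# The default strict handler refuses to encode the lone surrogate.
try:
    name.encode("utf-8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# A lenient encoder emits three bytes that have nothing to do with
# the single original byte 0xFF.
lenient = name.encode("utf-8", "surrogatepass")
```
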
We assume that most applications eventually pass data received from a
system interface back into the same system interfaces. For example, an
application invoking os.listdir() will likely pass the result strings
back into APIs like os.stat() or open(), which then encode them back
into their original byte representation. Applications that need to
process the original byte strings can obtain them by encoding the
character strings with the file system encoding, passing
"python-escape" as the error handler name. For example, a function
that works like os.listdir, except for accepting and returning bytes,
would be written as::

  import os
  import sys

  def listdir_b(dirname):
      # dirname is a bytes object; decode it for the str-based API
      fse = sys.getfilesystemencoding()
      dirname = dirname.decode(fse, "python-escape")
      for fn in os.listdir(dirname):
          # fn is now a str object
          yield fn.encode(fse, "python-escape")

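A runnable variant of the same idea, as a sketch under stated
assumptions: it relies on os.fsdecode() and os.fsencode(), which are
not part of this draft and which, in later Python releases on POSIX,
apply the file system encoding together with the escaping error
handler. The temporary directory and the file name "a.txt" are
illustrative only.

```python
import os
import tempfile


def listdir_b(dirname):
    """Yield the entries of a directory (given as bytes) as bytes."""
    # Assumption: os.fsdecode/os.fsencode stand in for the explicit
    # decode/encode calls with the "python-escape" handler.
    for fn in os.listdir(os.fsdecode(dirname)):
        # fn is a str object here; convert it back to bytes.
        yield os.fsencode(fn)


with tempfile.TemporaryDirectory() as tmp:
    # Create one file so that the listing is not empty.
    open(os.path.join(tmp, "a.txt"), "w").close()
    entries = sorted(listdir_b(os.fsencode(tmp)))
```
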
References
==========

[1] PEP 277
    "Unicode file name support for Windows NT"
    http://www.python.org/dev/peps/pep-0277/

[2] PEP 293
    "Codec Error Handling Callbacks"
    http://www.python.org/dev/peps/pep-0293/

Copyright
=========

This document has been placed in the public domain.

..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End: