PEP: 383
Title: Non-decodable Bytes in System Character Interfaces
Author: Martin von Löwis <martin@v.loewis.de>
Status: Final
Type: Standards Track
Content-Type: text/x-rst
Created: 22-Apr-2009
Python-Version: 3.1
Post-History:

Abstract
========

File names, environment variables, and command line arguments are
defined as being character data in POSIX; the C APIs, however, allow
passing arbitrary bytes, whether or not these conform to a particular
encoding. This PEP proposes a means of dealing with such irregularities
by embedding the bytes in character strings in a way that allows
recreation of the original byte string.

Rationale
=========

The C char type is a data type that is commonly used to represent both
character data and bytes. Certain POSIX interfaces are specified and
widely understood as operating on character data; however, the system
call interfaces make no assumption about the encoding of these data,
and pass them on as-is. With Python 3, character strings use a
Unicode-based internal representation, making it difficult to ignore
the encoding of byte strings in the same way that the C interfaces can
ignore the encoding.

On the other hand, Microsoft Windows NT has corrected the original
design limitation of Unix, and made it explicit in its system
interfaces that these data (file names, environment variables, command
line arguments) are indeed character data, by providing a
Unicode-based API (keeping a C-char-based one for backwards
compatibility).

For Python 3, one proposed solution is to provide two sets of APIs: a
byte-oriented one, and a character-oriented one, where the
character-oriented one would be unable to represent all data
accurately. Unfortunately, for Windows, the situation would
be exactly the opposite: the byte-oriented interface cannot represent
all data; only the character-oriented API can. As a consequence,
libraries and applications that want to support all user data in a
cross-platform manner have to accept a mish-mash of bytes and
characters, exactly in the way that caused endless troubles for
Python 2.x.

With this PEP, a uniform treatment of these data as characters becomes
possible. The uniformity is achieved by using specific encoding
algorithms, meaning that the data can be converted back to bytes on
POSIX systems only if the same encoding is used.

Being able to treat such strings uniformly will allow application
writers to abstract from details specific to the operating system, and
will reduce the risk of one API failing when the other API would have
worked.

Specification
=============

On Windows, Python uses the wide character APIs to access
character-oriented APIs, allowing direct conversion of the
environmental data to Python str objects (:pep:`277`).

On POSIX systems, Python currently applies the locale's encoding to
convert the byte data to Unicode, failing for byte sequences that
cannot be decoded. With this PEP, non-decodable bytes >= 128 will be
represented as lone surrogate codes U+DC80..U+DCFF. Bytes below
128 will produce exceptions; see the discussion below.

To convert non-decodable bytes, a new error handler (:pep:`293`)
"surrogateescape" is introduced, which produces these surrogates. On
encoding, the error handler converts the surrogate back to the
corresponding byte. This error handler will be used in any API that
receives or produces file names, command line arguments, or
environment variables.

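As a minimal sketch of the round trip described above, the following
decodes a byte string that is not valid UTF-8 and encodes it again.
The sample bytes and the choice of UTF-8 are assumptions made for
illustration; any ASCII-compatible codec should behave analogously:

```python
# Bytes that are not valid UTF-8; 0xE9 and 0xFF are both >= 128.
raw = b"caf\xe9-\xff.txt"

# Decoding maps each non-decodable byte 0xNN to the lone surrogate U+DCNN.
name = raw.decode("utf-8", "surrogateescape")
print(ascii(name))  # 'caf\udce9-\udcff.txt'

# Encoding with the same codec and the same handler restores the bytes.
restored = name.encode("utf-8", "surrogateescape")
print(restored == raw)
```

The string is printed through ascii() because a lone surrogate cannot
normally be written to a text stream directly.
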
The error handler interface is extended to allow the encode error
handler to return byte strings immediately, in addition to returning
Unicode strings which then get encoded again (also see the discussion
below).

Byte-oriented interfaces that already exist in Python 3.0 are not
affected by this specification. They are neither enhanced nor
deprecated.

External libraries that operate on file names (such as GUI file
choosers) should also encode them according to the PEP.

Discussion
==========

This surrogateescape encoding is based on Markus Kuhn's idea that
he called UTF-8b [3]_.

While providing a uniform API to non-decodable bytes, this interface
has the limitation that the chosen representation only "works" if the
data get converted back to bytes with the surrogateescape error
handler as well. Encoding the data with the locale's encoding and the
(default) strict error handler will raise an exception; encoding them
with UTF-8 will produce nonsensical data.

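The limitation can be illustrated with a short sketch. It reflects
the behaviour of current CPython 3 releases, where the strict UTF-8
codec also rejects lone surrogates; the "surrogatepass" handler is
used here only as an assumption-laden stand-in to show what encoding
the surrogate code point itself would yield:

```python
name = b"\xff".decode("utf-8", "surrogateescape")  # the str '\udcff'

# The default strict handler refuses to encode the lone surrogate.
try:
    name.encode("utf-8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# "surrogatepass" encodes the surrogate code point itself, giving three
# bytes that are unrelated to the original single byte 0xFF.
passed = name.encode("utf-8", "surrogatepass")

# Only "surrogateescape" recovers the original byte.
escaped = name.encode("utf-8", "surrogateescape")
```
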
Data obtained from other sources may conflict with data produced
by this PEP. Dealing with such conflicts is out of scope of the PEP.

This PEP allows the possibility of "smuggling" bytes in character
strings. This would be a security risk if the bytes are
security-critical when interpreted as characters on a target system,
such as path name separators. For this reason, the PEP rejects
smuggling bytes below 128. If the target system uses EBCDIC, such
smuggled bytes may still be a security risk, allowing smuggling of
e.g. square brackets or the backslash. Python currently does not
support EBCDIC, so this should not be a problem in practice. Anybody
porting Python to an EBCDIC system might want to adjust the error
handlers, or come up with other approaches to address the security
risks.

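A small sketch of this restriction follows. The string is a
hypothetical attempt to smuggle the byte 0x2F ("/") as the surrogate
U+DC2F; the handler is expected to refuse it, while a surrogate in
the range U+DC80..U+DCFF is accepted:

```python
# U+DC2F would correspond to byte 0x2F ("/"), which is below 128.
suspicious = "etc\udc2fpasswd"

try:
    suspicious.encode("utf-8", "surrogateescape")
    rejected = False
except UnicodeEncodeError:
    rejected = True

# A surrogate in U+DC80..U+DCFF is mapped back to its byte.
accepted = "\udc80".encode("utf-8", "surrogateescape")
```
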
Encodings that are not compatible with ASCII are not supported by
this specification; bytes in the ASCII range that fail to decode
will cause an exception. It is widely agreed that such encodings
should not be used as locale charsets.

For most applications, we assume that they eventually pass data
received from a system interface back into the same system
interfaces. For example, an application invoking os.listdir() will
likely pass the result strings back into APIs like os.stat() or
open(), which then encode them back into their original byte
representation. Applications that need to process the original byte
strings can obtain them by encoding the character strings with the
file system encoding, passing "surrogateescape" as the error handler
name. For example, a function that works like os.listdir, except for
accepting and returning bytes, would be written as::

    import os
    import sys

    def listdir_b(dirname):
        # Encoding used by the str-based file system APIs.
        fse = sys.getfilesystemencoding()
        dirname = dirname.decode(fse, "surrogateescape")
        for fn in os.listdir(dirname):
            # fn is now a str object
            yield fn.encode(fse, "surrogateescape")

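Python releases after this PEP added the helpers os.fsdecode() and
os.fsencode(), which are not part of this specification. On POSIX
systems they are documented as applying the file system encoding
together with the surrogateescape handler, so the same conversion can
be sketched as follows (the file name is an illustrative assumption,
and the round trip shown is expected to hold on POSIX only):

```python
import os

raw = b"caf\xe9.txt"        # not valid UTF-8

# str form, as a str-based API such as os.listdir() would return it.
name = os.fsdecode(raw)

# Back to the original bytes, as passed on to the system call.
restored = os.fsencode(name)
```
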
The extension to the encode error handler interface proposed by this
PEP is necessary to implement the 'surrogateescape' error handler,
because there are required byte sequences which cannot be generated
from replacement Unicode. However, the encode error handler interface
presently requires replacement Unicode to be provided in lieu of the
non-encodable Unicode from the source string. Then it promptly
encodes that replacement Unicode. In some error handlers, such as the
'surrogateescape' proposed here, it is also simpler and more efficient
for the error handler to provide a pre-encoded replacement byte
string, rather than forcing it to calculate Unicode from which the
encoder would create the desired bytes.

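The extended interface can be sketched with a custom handler
registered through codecs.register_error(). The handler name
"demo_high_byte" and the function to_high_byte are hypothetical,
chosen only for illustration; the point is that the handler returns
bytes that the ASCII codec could never produce from a replacement
str:

```python
import codecs

def to_high_byte(exc):
    """Hypothetical handler: emit 0xFF for each unencodable character."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    count = exc.end - exc.start
    # Return bytes directly, plus the position at which to resume.
    return b"\xff" * count, exc.end

codecs.register_error("demo_high_byte", to_high_byte)

encoded = "a\u20acb".encode("ascii", "demo_high_byte")
```
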
A few alternative approaches have been proposed:

* create a new string subclass that supports embedded bytes
* use different escape schemes, such as escaping with a NUL
  character, or mapping to infrequent characters.

Of these proposals, the approach of escaping each byte XX
with the sequence U+0000 U+00XX has the disadvantage that
encoding to UTF-8 will introduce a NUL byte in the UTF-8
sequence. As a consequence, C libraries may interpret this
as a string termination, even though the string continues.
In particular, the gtk libraries will truncate text in this
case; other libraries may show similar problems.

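The truncation problem can be sketched as follows. The escaped string
is a hypothetical result of the rejected NUL-escape scheme, and
splitting at the first NUL byte merely simulates how a C library
would read a NUL-terminated string:

```python
# Hypothetical NUL escape: byte 0xE9 is represented as U+0000 U+00E9.
escaped = "caf\x00\xe9.txt"

utf8 = escaped.encode("utf-8")

# A C library treats the first NUL byte as the end of the string.
seen_by_c = utf8.split(b"\x00", 1)[0]
```
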
References
==========

.. [3] UTF-8b
   https://web.archive.org/web/20090830064219/http://mail.nl.linux.org/linux-utf8/2000-07/msg00040.html

Copyright
=========

This document has been placed in the public domain.