PEP: 383
Title: Non-decodable Bytes in System Character Interfaces
Version: $Revision$
Last-Modified: $Date$
Author: Martin v. Löwis <martin@v.loewis.de>
Status: Final
Type: Standards Track
Content-Type: text/x-rst
Created: 22-Apr-2009
Python-Version: 3.1
Post-History:

Abstract
========

File names, environment variables, and command line arguments are
defined as being character data in POSIX; the C APIs however allow
passing arbitrary bytes - whether these conform to a certain encoding
or not. This PEP proposes a means of dealing with such irregularities
by embedding the bytes in character strings in such a way that allows
recreation of the original byte string.

Rationale
=========

The C char type is a data type that is commonly used to represent both
character data and bytes. Certain POSIX interfaces are specified and
widely understood as operating on character data; however, the system
call interfaces make no assumption on the encoding of these data, and
pass them on as-is. With Python 3, character strings use a
Unicode-based internal representation, making it difficult to ignore
the encoding of byte strings in the same way that the C interfaces can
ignore the encoding.

On the other hand, Microsoft Windows NT has corrected the original
design limitation of Unix, and made it explicit in its system
interfaces that these data (file names, environment variables, command
line arguments) are indeed character data, by providing a
Unicode-based API (keeping a C-char-based one for backwards
compatibility).

For Python 3, one proposed solution is to provide two sets of APIs: a
byte-oriented one, and a character-oriented one, where the
character-oriented one would be limited to not being able to represent
all data accurately. Unfortunately, for Windows, the situation would
be exactly the opposite: the byte-oriented interface cannot represent
all data; only the character-oriented API can. As a consequence,
libraries and applications that want to support all user data in a
cross-platform manner have to accept a mish-mash of bytes and
characters exactly in the way that caused endless troubles for
Python 2.x.

With this PEP, a uniform treatment of these data as characters becomes
possible. The uniformity is achieved by using specific encoding
algorithms, meaning that the data can be converted back to bytes on
POSIX systems only if the same encoding is used.

Being able to treat such strings uniformly will allow application
writers to abstract from details specific to the operating system, and
reduce the risk of one API failing when the other API would have
worked.

Specification
=============

On Windows, Python uses the wide character APIs to access
character-oriented APIs, allowing direct conversion of the
environmental data to Python str objects (:pep:`277`).

On POSIX systems, Python currently applies the locale's encoding to
convert the byte data to Unicode, failing for characters that cannot
be decoded. With this PEP, non-decodable bytes >= 128 will be
represented as lone surrogate codes U+DC80..U+DCFF. Bytes below
128 will produce exceptions; see the discussion below.

To convert non-decodable bytes, a new error handler (:pep:`293`)
"surrogateescape" is introduced, which produces these surrogates. On
encoding, the error handler converts the surrogate back to the
corresponding byte. This error handler will be used in any API that
receives or produces file names, command line arguments, or
environment variables.

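As a rough illustration of the two paragraphs above, the following
sketch decodes a byte string that is not valid UTF-8 and then encodes
it back. The sample file name ``raw`` is made up for the example, and
UTF-8 merely stands in for whatever the locale's encoding happens to
be.

```python
# A made-up file name: valid UTF-8 except for the final byte 0xFF.
raw = b"caf\xc3\xa9-\xff"

# Decoding with "surrogateescape" maps the undecodable byte 0xFF
# to the lone surrogate U+DCFF instead of raising an exception.
name = raw.decode("utf-8", "surrogateescape")

# Encoding with the same codec and error handler restores the bytes.
restored = name.encode("utf-8", "surrogateescape")
```

Here ``name`` ends in U+DCFF, and ``restored`` is equal to ``raw``.
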
The error handler interface is extended to allow the encode error
handler to return byte strings immediately, in addition to returning
Unicode strings which then get encoded again (also see the discussion
below).

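The extended interface can be tried out with a user-defined handler.
The sketch below is only an illustration: the handler function
``high_byte`` and its registered name ``example_high_byte`` are
hypothetical and not part of this PEP. It returns a ``bytes``
replacement, which the codec copies to the output without encoding it
again.

```python
import codecs

def high_byte(exc):
    """Hypothetical handler: emit byte 0xFF per unencodable character."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    # The replacement is a bytes object, so the codec copies it to the
    # output as-is; the ASCII codec could never produce 0xFF itself.
    return (b"\xff" * (exc.end - exc.start), exc.end)

codecs.register_error("example_high_byte", high_byte)

encoded = "a\u20acb".encode("ascii", "example_high_byte")
```

The result ``encoded`` is ``b"a\xffb"``, a byte sequence that no
replacement Unicode string could have produced through the ASCII
codec.
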
Byte-oriented interfaces that already exist in Python 3.0 are not
affected by this specification. They are neither enhanced nor
deprecated.

External libraries that operate on file names (such as GUI file
choosers) should also encode them according to the PEP.

Discussion
==========

This surrogateescape encoding is based on Markus Kuhn's idea that
he called UTF-8b [3]_.

While providing a uniform API to non-decodable bytes, this interface
has the limitation that the chosen representation only "works" if the
data get converted back to bytes with the surrogateescape error
handler also. Encoding the data with the locale's encoding and the
(default) strict error handler will raise an exception; encoding them
with UTF-8 will produce nonsensical data.

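The limitation can be seen in a small sketch. The sample name is made
up, and ASCII stands in for the locale's encoding. Note that in
current CPython the strict UTF-8 codec also refuses lone surrogates,
so the ``surrogatepass`` handler is used here only to show what the
nonsensical UTF-8 output would look like.

```python
# U+DC80 stands for the undecodable byte 0x80 (made-up sample name).
name = b"data\x80".decode("ascii", "surrogateescape")

try:
    name.encode("ascii")  # default "strict" error handler
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Writing the surrogate out as UTF-8 yields three bytes that are
# unrelated to the original single byte 0x80.
mangled = name.encode("utf-8", "surrogatepass")

# Only the matching error handler recovers the original bytes.
recovered = name.encode("ascii", "surrogateescape")
```
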
Data obtained from other sources may conflict with data produced
by this PEP. Dealing with such conflicts is out of scope of the PEP.

This PEP allows the possibility of "smuggling" bytes in character
strings. This would be a security risk if the bytes are
security-critical when interpreted as characters on a target system,
such as path name separators. For this reason, the PEP rejects
smuggling bytes below 128. If the target system uses EBCDIC, such
smuggled bytes may still be a security risk, allowing smuggling of
e.g. square brackets or the backslash. Python currently does not
support EBCDIC, so this should not be a problem in practice. Anybody
porting Python to an EBCDIC system might want to adjust the error
handlers, or come up with other approaches to address the security
risks.

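The rejection of bytes below 128 can be checked on the encoding side.
In the following sketch the helper ``can_smuggle`` is a hypothetical
name introduced for illustration; it reports whether the
``surrogateescape`` handler is willing to turn a given code point into
a byte.

```python
def can_smuggle(code_point):
    """Return True if the handler turns this code point into a byte."""
    try:
        chr(code_point).encode("ascii", "surrogateescape")
        return True
    except UnicodeEncodeError:
        return False

high = can_smuggle(0xDC00 + 0xFF)   # U+DCFF, stands for byte 0xFF
slash = can_smuggle(0xDC00 + 0x2F)  # U+DC2F, would be "/" (0x2F)
```

``high`` is true, while ``slash`` is false: a surrogate outside
U+DC80..U+DCFF cannot be used to sneak in a path name separator.
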
Encodings that are not compatible with ASCII are not supported by
this specification; bytes in the ASCII range that fail to decode
will cause an exception. It is widely agreed that such encodings
should not be used as locale charsets.

For most applications, we assume that they eventually pass data
received from a system interface back into the same system
interfaces. For example, an application invoking os.listdir() will
likely pass the result strings back into APIs like os.stat() or
open(), which then encode them back into their original byte
representation. Applications that need to process the original byte
strings can obtain them by encoding the character strings with the
file system encoding, passing "surrogateescape" as the error handler
name. For example, a function that works like os.listdir, except for
accepting and returning bytes, would be written as::

    import os
    import sys

    def listdir_b(dirname):
        # Decode the bytes directory name the same way the file
        # system interfaces do.
        fse = sys.getfilesystemencoding()
        dirname = dirname.decode(fse, "surrogateescape")
        for fn in os.listdir(dirname):
            # fn is now a str object
            yield fn.encode(fse, "surrogateescape")

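Python versions after the one targeted by this PEP (3.2 onward) ship
the helpers ``os.fsdecode()`` and ``os.fsencode()``, which wrap the
same decode and encode steps. The sketch below assumes a POSIX system
with an ASCII-compatible file system encoding; the sample name ``raw``
is made up.

```python
import os

# A made-up file name containing the byte 0xFF.
raw = b"report-\xff.txt"

# bytes -> str with the file system encoding and "surrogateescape".
as_text = os.fsdecode(raw)

# str -> bytes again; on POSIX this reverses the step above.
round_trip = os.fsencode(as_text)
```
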
The extension to the encode error handler interface proposed by this
PEP is necessary to implement the 'surrogateescape' error handler,
because there are required byte sequences which cannot be generated
from replacement Unicode. However, the encode error handler interface
presently requires replacement Unicode to be provided in lieu of the
non-encodable Unicode from the source string. Then it promptly
encodes that replacement Unicode. In some error handlers, such as the
'surrogateescape' proposed here, it is also simpler and more efficient
for the error handler to provide a pre-encoded replacement byte
string, rather than forcing it to calculate Unicode from which the
encoder would create the desired bytes.

A few alternative approaches have been proposed:

* create a new string subclass that supports embedded bytes
* use different escape schemes, such as escaping with a NUL
  character, or mapping to infrequent characters.

Of these proposals, the approach of escaping each byte XX
with the sequence U+0000 U+00XX has the disadvantage that
encoding to UTF-8 will introduce a NUL byte in the UTF-8
sequence. As a consequence, C libraries may interpret this
as a string termination, even though the string continues.
In particular, the gtk libraries will truncate text in this
case; other libraries may show similar problems.

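The truncation problem with the NUL-based scheme can be sketched as
follows. The escaped string is a made-up example, and the split on the
first NUL byte only imitates how a C library reads a NUL-terminated
string; no actual C call is involved.

```python
# Hypothetical NUL-based escape of the byte 0xFF: U+0000 U+00FF.
escaped = "name" + "\u0000\u00ff" + ".txt"
utf8 = escaped.encode("utf-8")

# A C library reading a NUL-terminated string stops at the first NUL.
seen_by_c = utf8.split(b"\x00", 1)[0]
```

The UTF-8 form contains an embedded NUL byte, so the imitated C view
``seen_by_c`` holds only ``b"name"``.
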
References
==========

.. [3] UTF-8b
   http://permalink.gmane.org/gmane.comp.internationalization.linux/920

Copyright
=========

This document has been placed in the public domain.

..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End: