utf8b -> surrogateescape.

Martin v. Löwis 2009-05-10 07:53:39 +00:00
parent b36ace5076
commit c6afe672bd
1 changed file with 20 additions and 19 deletions

@@ -72,11 +72,12 @@ be decoded. With this PEP, non-decodable bytes >= 128 will be
 represented as lone surrogate codes U+DC80..U+DCFF. Bytes below
 128 will produce exceptions; see the discussion below.
 
-To convert non-decodable bytes, a new error handler ([2]) "utf8b" is
-introduced, which produces these surrogates. On encoding, the error
-handler converts the surrogate back to the corresponding byte. This
-error handler will be used in any API that receives or produces file
-names, command line arguments, or environment variables.
+To convert non-decodable bytes, a new error handler ([2])
+"surrogateescape" is introduced, which produces these surrogates. On
+encoding, the error handler converts the surrogate back to the
+corresponding byte. This error handler will be used in any API that
+receives or produces file names, command line arguments, or
+environment variables.
 
 The error handler interface is extended to allow the encode error
 handler to return byte strings immediately, in addition to returning
@@ -95,7 +96,7 @@ Discussion
 
 While providing a uniform API to non-decodable bytes, this interface
 has the limitation that chosen representation only "works" if the data
-get converted back to bytes with the utf8b error handler
+get converted back to bytes with the surrogateescape error handler
 also. Encoding the data with the locale's encoding and the (default)
 strict error handler will raise an exception, encoding them with UTF-8
 will produce non-sensical data.
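
As an illustration of the behaviour described in the two hunks above, here is a minimal sketch of the decode/encode round trip with the "surrogateescape" handler. The byte string is an invented example, not taken from the PEP. One assumption worth flagging: in current CPython 3 releases the strict UTF-8 encoder rejects lone surrogates with an exception, which may differ from the behaviour the PEP text had in mind when it was written.

```python
# Invented sample: a Latin-1 encoded name that is not valid UTF-8.
raw = b"caf\xe9.txt"

# Decoding maps the undecodable byte 0xE9 to the lone surrogate U+DCE9.
name = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler turns the surrogate back into the byte.
restored = name.encode("utf-8", "surrogateescape")

# Encoding with the default strict handler does not round-trip.
try:
    name.encode("utf-8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```

The point of the sketch is that the string is only safe to hand back to an interface that encodes with the same handler; anything that encodes it strictly has to be prepared for a failure.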
@@ -115,28 +116,28 @@ likely pass the result strings back into APIs like os.stat() or
 open(), which then encodes them back into their original byte
 representation. Applications that need to process the original byte
 strings can obtain them by encoding the character strings with the
-file system encoding, passing "utf8b" as the error handler name. For
-example, a function that works like os.listdir, except for accepting
-and returning bytes, would be written as::
+file system encoding, passing "surrogateescape" as the error handler
+name. For example, a function that works like os.listdir, except for
+accepting and returning bytes, would be written as::
 
   def listdir_b(dirname):
       fse = sys.getfilesystemencoding()
-      dirname = dirname.decode(fse, "utf8b")
+      dirname = dirname.decode(fse, "surrogateescape")
       for fn in os.listdir(dirname):
           # fn is now a str object
-          yield fn.encode(fse, "utf8b")
+          yield fn.encode(fse, "surrogateescape")
 
 The extension to the encode error handler interface proposed by this
-PEP is necessary to implement the 'utf8b' error handler, because there
-are required byte sequences which cannot be generated from replacement
-Unicode. However, the encode error handler interface presently
-requires replacement Unicode to be provided in lieu of the
+PEP is necessary to implement the 'surrogateescape' error handler,
+because there are required byte sequences which cannot be generated
+from replacement Unicode. However, the encode error handler interface
+presently requires replacement Unicode to be provided in lieu of the
 non-encodable Unicode from the source string. Then it promptly
 encodes that replacement Unicode. In some error handlers, such as the
-'utf8b' proposed here, it is also simpler and more efficient for the
-error handler to provide a pre-encoded replacement byte string, rather
-than forcing it to calculating Unicode from which the encoder would
-create the desired bytes.
+'surrogateescape' proposed here, it is also simpler and more efficient
+for the error handler to provide a pre-encoded replacement byte
+string, rather than forcing it to calculating Unicode from which the
+encoder would create the desired bytes.
 
 A few alternative approaches have been proposed:
 
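
To make the interface extension discussed in the last hunk concrete, the following is a rough sketch of an encode error handler that returns pre-encoded bytes instead of replacement Unicode. The handler name "demo_escape" and the function `escape_to_bytes` are hypothetical, chosen only for this illustration; this is not the actual implementation of "surrogateescape". It assumes a Python 3 version in which `codecs.register_error` handlers may return `bytes` for encoding errors.

```python
import codecs


def escape_to_bytes(exc):
    """Hypothetical handler: map U+DC80..U+DCFF back to bytes 0x80..0xFF."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    out = bytearray()
    for ch in exc.object[exc.start:exc.end]:
        code = ord(ch)
        if 0xDC80 <= code <= 0xDCFF:
            out.append(code - 0xDC00)
        else:
            # Anything other than an escaped byte stays an error.
            raise exc
    # Returning bytes lets the encoder copy them to the output unchanged.
    return bytes(out), exc.end


# "demo_escape" is an invented name used only in this sketch.
codecs.register_error("demo_escape", escape_to_bytes)

raw = b"caf\xe9"
text = raw.decode("ascii", "surrogateescape")
roundtrip = text.encode("ascii", "demo_escape")
```

The handler could not produce the byte 0xE9 by returning replacement Unicode to the ASCII encoder, since no ASCII-encodable character yields that byte; returning bytes directly sidesteps the second encoding pass.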