From c6afe672bd5510ea4f36b9bba4ca871f8f862d85 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Martin=20v=2E=20L=C3=B6wis?=
Date: Sun, 10 May 2009 07:53:39 +0000
Subject: [PATCH] utf8b -> surrogateescape.

---
 pep-0383.txt | 39 ++++++++++++++++++++-------------------
 1 file changed, 20 insertions(+), 19 deletions(-)

diff --git a/pep-0383.txt b/pep-0383.txt
index 6e580b472..a8ecc1a65 100644
--- a/pep-0383.txt
+++ b/pep-0383.txt
@@ -72,11 +72,12 @@ be decoded. With this PEP, non-decodable bytes >= 128 will be
 represented as lone surrogate codes U+DC80..U+DCFF. Bytes below 128
 will produce exceptions; see the discussion below.
 
-To convert non-decodable bytes, a new error handler ([2]) "utf8b" is
-introduced, which produces these surrogates. On encoding, the error
-handler converts the surrogate back to the corresponding byte. This
-error handler will be used in any API that receives or produces file
-names, command line arguments, or environment variables.
+To convert non-decodable bytes, a new error handler ([2])
+"surrogateescape" is introduced, which produces these surrogates. On
+encoding, the error handler converts the surrogate back to the
+corresponding byte. This error handler will be used in any API that
+receives or produces file names, command line arguments, or
+environment variables.
 
 The error handler interface is extended to allow the encode error
 handler to return byte strings immediately, in addition to returning
@@ -95,7 +96,7 @@ Discussion
 
 While providing a uniform API to non-decodable bytes, this interface
 has the limitation that chosen representation only "works" if the data
-get converted back to bytes with the utf8b error handler
+get converted back to bytes with the surrogateescape error handler
 also. Encoding the data with the locale's encoding and the (default)
 strict error handler will raise an exception, encoding them with UTF-8
 will produce non-sensical data.
@@ -115,28 +116,28 @@ likely pass the result strings back into APIs like os.stat() or
 open(), which then encodes them back into their original byte
 representation. Applications that need to process the original byte
 strings can obtain them by encoding the character strings with the
-file system encoding, passing "utf8b" as the error handler name. For
-example, a function that works like os.listdir, except for accepting
-and returning bytes, would be written as::
+file system encoding, passing "surrogateescape" as the error handler
+name. For example, a function that works like os.listdir, except for
+accepting and returning bytes, would be written as::
 
     def listdir_b(dirname):
         fse = sys.getfilesystemencoding()
-        dirname = dirname.decode(fse, "utf8b")
+        dirname = dirname.decode(fse, "surrogateescape")
         for fn in os.listdir(dirname):
             # fn is now a str object
-            yield fn.encode(fse, "utf8b")
+            yield fn.encode(fse, "surrogateescape")
 
 The extension to the encode error handler interface proposed by this
-PEP is necessary to implement the 'utf8b' error handler, because there
-are required byte sequences which cannot be generated from replacement
-Unicode. However, the encode error handler interface presently
-requires replacement Unicode to be provided in lieu of the
+PEP is necessary to implement the 'surrogateescape' error handler,
+because there are required byte sequences which cannot be generated
+from replacement Unicode. However, the encode error handler interface
+presently requires replacement Unicode to be provided in lieu of the
 non-encodable Unicode from the source string. Then it promptly encodes
 that replacement Unicode. In some error handlers, such as the
-'utf8b' proposed here, it is also simpler and more efficient for the
-error handler to provide a pre-encoded replacement byte string, rather
-than forcing it to calculating Unicode from which the encoder would
-create the desired bytes.
+'surrogateescape' proposed here, it is also simpler and more efficient
+for the error handler to provide a pre-encoded replacement byte
+string, rather than forcing it to calculating Unicode from which the
+encoder would create the desired bytes.
 
 A few alternative approaches have been proposed:
 