utf8b -> surrogateescape.
parent b36ace5076
commit c6afe672bd

pep-0383.txt | 39
@@ -72,11 +72,12 @@ be decoded. With this PEP, non-decodable bytes >= 128 will be
 represented as lone surrogate codes U+DC80..U+DCFF. Bytes below
 128 will produce exceptions; see the discussion below.
 
-To convert non-decodable bytes, a new error handler ([2]) "utf8b" is
-introduced, which produces these surrogates. On encoding, the error
-handler converts the surrogate back to the corresponding byte. This
-error handler will be used in any API that receives or produces file
-names, command line arguments, or environment variables.
+To convert non-decodable bytes, a new error handler ([2])
+"surrogateescape" is introduced, which produces these surrogates. On
+encoding, the error handler converts the surrogate back to the
+corresponding byte. This error handler will be used in any API that
+receives or produces file names, command line arguments, or
+environment variables.
 
 The error handler interface is extended to allow the encode error
 handler to return byte strings immediately, in addition to returning
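
Note on the hunk above: the round trip it describes can be tried directly with the "surrogateescape" handler as it exists in current Python 3. This is a minimal sketch, not part of the commit; the sample byte string is an invented file name, and UTF-8 stands in for whatever the file system encoding happens to be.

```python
# Illustrative input (an assumption): a Latin-1 encoded name that is
# not valid UTF-8, because 0xE9 is not followed by continuation bytes.
raw = b"caf\xe9.txt"

# Decoding maps the undecodable byte 0xE9 to the lone surrogate U+DCE9.
name = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler turns the surrogate back into 0xE9.
restored = name.encode("utf-8", "surrogateescape")

print(ascii(name))       # shows the escaped surrogate in the str
print(restored == raw)   # the original bytes are recovered
```

The mapping is byte 0xNN to U+DCNN for NN in 0x80..0xFF, which is why the str keeps enough information to reproduce the original bytes.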
@@ -95,7 +96,7 @@ Discussion
 
 While providing a uniform API to non-decodable bytes, this interface
 has the limitation that chosen representation only "works" if the data
-get converted back to bytes with the utf8b error handler
+get converted back to bytes with the surrogateescape error handler
 also. Encoding the data with the locale's encoding and the (default)
 strict error handler will raise an exception, encoding them with UTF-8
 will produce non-sensical data.
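
Note on the hunk above: the limitation it states can be observed with a short check. This is a sketch under assumptions: the sample name is invented, and the helper `strict_fails` is hypothetical. The behaviour shown is that of current CPython 3 releases, where the strict handler rejects lone surrogates for UTF-8 as well, so the failure is an exception rather than nonsensical output.

```python
def strict_fails(text, encoding):
    """Hypothetical helper: True if strict encoding raises UnicodeEncodeError."""
    try:
        text.encode(encoding)  # default error handler is "strict"
    except UnicodeEncodeError:
        return True
    return False

# A str carrying one escaped byte (0xE9 became U+DCE9 on decoding).
name = b"caf\xe9.txt".decode("utf-8", "surrogateescape")

# Strict encoding fails because the lone surrogate is not encodable.
ascii_fails = strict_fails(name, "ascii")
utf8_fails = strict_fails(name, "utf-8")

# Only the matching handler converts the data back to the original bytes.
recovered = name.encode("utf-8", "surrogateescape")
```

In practice this means strings obtained from file name APIs should be encoded with "surrogateescape" whenever the original bytes are needed.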
@@ -115,28 +116,28 @@ likely pass the result strings back into APIs like os.stat() or
 open(), which then encodes them back into their original byte
 representation. Applications that need to process the original byte
 strings can obtain them by encoding the character strings with the
-file system encoding, passing "utf8b" as the error handler name. For
-example, a function that works like os.listdir, except for accepting
-and returning bytes, would be written as::
+file system encoding, passing "surrogateescape" as the error handler
+name. For example, a function that works like os.listdir, except for
+accepting and returning bytes, would be written as::
 
     def listdir_b(dirname):
         fse = sys.getfilesystemencoding()
-        dirname = dirname.decode(fse, "utf8b")
+        dirname = dirname.decode(fse, "surrogateescape")
         for fn in os.listdir(dirname):
             # fn is now a str object
-            yield fn.encode(fse, "utf8b")
+            yield fn.encode(fse, "surrogateescape")
 
 The extension to the encode error handler interface proposed by this
-PEP is necessary to implement the 'utf8b' error handler, because there
-are required byte sequences which cannot be generated from replacement
-Unicode. However, the encode error handler interface presently
-requires replacement Unicode to be provided in lieu of the
+PEP is necessary to implement the 'surrogateescape' error handler,
+because there are required byte sequences which cannot be generated
+from replacement Unicode. However, the encode error handler interface
+presently requires replacement Unicode to be provided in lieu of the
 non-encodable Unicode from the source string. Then it promptly
 encodes that replacement Unicode. In some error handlers, such as the
-'utf8b' proposed here, it is also simpler and more efficient for the
-error handler to provide a pre-encoded replacement byte string, rather
-than forcing it to calculating Unicode from which the encoder would
-create the desired bytes.
+'surrogateescape' proposed here, it is also simpler and more efficient
+for the error handler to provide a pre-encoded replacement byte
+string, rather than forcing it to calculating Unicode from which the
+encoder would create the desired bytes.
 
 A few alternative approaches have been proposed:
 
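
Note on the hunk above: the encode error handler extension it motivates (a handler returning a pre-encoded byte string instead of replacement Unicode) is available through `codecs.register_error` in current Python 3. The sketch below is not the implementation from this commit. The handler name `demo_escape` and the sample string are hypothetical, and the handler covers only the encoding direction.

```python
import codecs

def demo_escape(exc):
    """Hypothetical encode-only handler: U+DC80..U+DCFF -> bytes 0x80..0xFF."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    out = bytearray()
    for ch in exc.object[exc.start:exc.end]:
        code = ord(ch)
        if not 0xDC80 <= code <= 0xDCFF:
            raise exc  # not an escaped byte, so keep the original error
        out.append(code - 0xDC00)
    # Returning bytes: the encoder copies them to the output as-is
    # instead of encoding a replacement str.
    return bytes(out), exc.end

codecs.register_error("demo_escape", demo_escape)

# ASCII cannot represent byte 0xE9 through any replacement str,
# yet the bytes-returning handler can still emit it.
restored = "caf\udce9.txt".encode("ascii", "demo_escape")
```

The ASCII case illustrates the point made in the hunk: no replacement Unicode exists that the ASCII encoder would turn into byte 0xE9, so the handler has to supply the byte itself.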