utf8b -> surrogateescape.
parent b36ace5076
commit c6afe672bd
39
pep-0383.txt
@@ -72,11 +72,12 @@ be decoded. With this PEP, non-decodable bytes >= 128 will be
 represented as lone surrogate codes U+DC80..U+DCFF. Bytes below
 128 will produce exceptions; see the discussion below.
 
-To convert non-decodable bytes, a new error handler ([2]) "utf8b" is
-introduced, which produces these surrogates. On encoding, the error
-handler converts the surrogate back to the corresponding byte. This
-error handler will be used in any API that receives or produces file
-names, command line arguments, or environment variables.
+To convert non-decodable bytes, a new error handler ([2])
+"surrogateescape" is introduced, which produces these surrogates. On
+encoding, the error handler converts the surrogate back to the
+corresponding byte. This error handler will be used in any API that
+receives or produces file names, command line arguments, or
+environment variables.
 
 The error handler interface is extended to allow the encode error
 handler to return byte strings immediately, in addition to returning
@@ -95,7 +96,7 @@ Discussion
 
 While providing a uniform API to non-decodable bytes, this interface
 has the limitation that chosen representation only "works" if the data
-get converted back to bytes with the utf8b error handler
+get converted back to bytes with the surrogateescape error handler
 also. Encoding the data with the locale's encoding and the (default)
 strict error handler will raise an exception, encoding them with UTF-8
 will produce non-sensical data.
@@ -115,28 +116,28 @@ likely pass the result strings back into APIs like os.stat() or
 open(), which then encodes them back into their original byte
 representation. Applications that need to process the original byte
 strings can obtain them by encoding the character strings with the
-file system encoding, passing "utf8b" as the error handler name. For
-example, a function that works like os.listdir, except for accepting
-and returning bytes, would be written as::
+file system encoding, passing "surrogateescape" as the error handler
+name. For example, a function that works like os.listdir, except for
+accepting and returning bytes, would be written as::
 
     def listdir_b(dirname):
         fse = sys.getfilesystemencoding()
-        dirname = dirname.decode(fse, "utf8b")
+        dirname = dirname.decode(fse, "surrogateescape")
         for fn in os.listdir(dirname):
             # fn is now a str object
-            yield fn.encode(fse, "utf8b")
+            yield fn.encode(fse, "surrogateescape")
 
 The extension to the encode error handler interface proposed by this
-PEP is necessary to implement the 'utf8b' error handler, because there
-are required byte sequences which cannot be generated from replacement
-Unicode. However, the encode error handler interface presently
-requires replacement Unicode to be provided in lieu of the
+PEP is necessary to implement the 'surrogateescape' error handler,
+because there are required byte sequences which cannot be generated
+from replacement Unicode. However, the encode error handler interface
+presently requires replacement Unicode to be provided in lieu of the
 non-encodable Unicode from the source string. Then it promptly
 encodes that replacement Unicode. In some error handlers, such as the
-'utf8b' proposed here, it is also simpler and more efficient for the
-error handler to provide a pre-encoded replacement byte string, rather
-than forcing it to calculating Unicode from which the encoder would
-create the desired bytes.
+'surrogateescape' proposed here, it is also simpler and more efficient
+for the error handler to provide a pre-encoded replacement byte
+string, rather than forcing it to calculating Unicode from which the
+encoder would create the desired bytes.
 
 A few alternative approaches have been proposed:
 
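For readers checking the rename, here is a minimal sketch of the round trip the changed paragraphs describe, using the "surrogateescape" handler as it ships in Python 3. The sample bytes and the helper name roundtrip are illustrative assumptions, not part of this commit, and UTF-8 stands in for the file system encoding.

```python
def roundtrip(raw, encoding="utf-8"):
    """Decode bytes with surrogateescape, then encode them back.

    Hypothetical helper for illustration only; it is not in the PEP.
    """
    text = raw.decode(encoding, "surrogateescape")
    return text, text.encode(encoding, "surrogateescape")


# 0xE9 is "e with acute" in Latin-1, but an incomplete sequence in UTF-8.
raw = b"caf\xe9.txt"
text, restored = roundtrip(raw)

# The undecodable byte 0xE9 is carried as the lone surrogate U+DCE9 ...
assert text == "caf\udce9.txt"
# ... and encoding with the same handler restores the original bytes.
assert restored == raw

# Without the handler, the strict default refuses the lone surrogate.
try:
    text.encode("utf-8")
except UnicodeEncodeError:
    strict_failed = True
else:
    strict_failed = False
```

The last step illustrates the limitation noted in the Discussion hunk: the string only converts back to bytes when the same error handler is used for encoding. In current CPython, strict UTF-8 encoding raises on the surrogate instead of emitting data.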