Accept Lino Mastrodomenico's proposal of always using
low surrogates to represent non-decodable bytes.
parent 1eb61be116
commit 1b7ea9323b
pep-0383.txt | 20
@@ -62,25 +62,23 @@ character-oriented APIs, allowing direct conversion of the
 environmental data to Python str objects.
 
 On POSIX systems, Python currently applies the locale's encoding to
-convert the byte data to Unicode. If the locale's encoding is UTF-8,
-it can represent the full set of Unicode characters, otherwise, only a
-subset is representable. In the latter case, using private-use
-characters to represent these bytes would be an option. For UTF-8,
-doing so would create an ambiguity, as the private-use characters may
-regularly occur in the input also.
+convert the byte data to Unicode. Non-decodable bytes will be
+represented as lone half surrogate codes U+DCxx.
 
 To convert non-decodable bytes, a new error handler "python-escape" is
-introduced, which decodes non-decodable bytes using into a private-use
-character U+F01xx, which is believed to not conflict with private-use
-characters that currently exist in Python codecs.
+introduced, which produces these half surrogates. On encoding, the
+error handler converts the half surrogate back to the corresponding
+byte.
 
 The error handler interface is extended to allow the encode error
 handler to return byte strings immediately, in addition to returning
 Unicode strings which then get encoded again.
 
 If the locale's encoding is UTF-8, the file system encoding is set to
-a new encoding "utf-8b". The UTF-8b codec decodes non-decodable bytes
-(which must be >= 0x80) into half surrogate codes U+DC80..U+DCFF.
+a new encoding "utf-8b", as the regular UTF-8 codec would not
+re-encode half surrogates as single bytes. The UTF-8b codec decodes
+non-decodable bytes (which must be >= 0x80) into half surrogate codes
+U+DC80..U+DCFF.
 
 Discussion
 ==========
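
The round trip described by the added lines can be illustrated with a minimal Python sketch. It is not part of this commit. It assumes a current CPython, where an error handler with this behaviour is available under the name "surrogateescape" rather than the "python-escape" name used in this revision of the draft, and the file name bytes are made up for the example.

```python
# Sketch only: "surrogateescape" is the handler name in current CPython,
# not the "python-escape" name used in this revision of the draft.
raw = b"caf\xe9.txt"  # hypothetical file name; 0xE9 is not valid UTF-8 here

# Decoding: each non-decodable byte 0xNN becomes the lone surrogate U+DCNN.
name = raw.decode("utf-8", errors="surrogateescape")
print(ascii(name))

# Encoding: the handler converts U+DCNN back to the single byte 0xNN.
round_tripped = name.encode("utf-8", errors="surrogateescape")
print(round_tripped == raw)

# The strict UTF-8 codec refuses to encode a lone surrogate, which is
# the gap the commit's "utf-8b" wording refers to.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
print(strict_ok)
```

The example avoids printing the decoded string directly, since writing a lone surrogate to a strictly encoded output stream would itself raise an error.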