Accept Lino Mastrodomenico's proposal of always using

low surrogates to represent non-decodable bytes.
This commit is contained in:
Martin v. Löwis 2009-04-24 20:25:20 +00:00
parent 1eb61be116
commit 1b7ea9323b
1 changed files with 9 additions and 11 deletions

View File

@ -62,25 +62,23 @@ character-oriented APIs, allowing direct conversion of the
environmental data to Python str objects. environmental data to Python str objects.
On POSIX systems, Python currently applies the locale's encoding to On POSIX systems, Python currently applies the locale's encoding to
convert the byte data to Unicode. If the locale's encoding is UTF-8, convert the byte data to Unicode. Non-decodable bytes will be
it can represent the full set of Unicode characters, otherwise, only a represented as lone half surrogate codes U+DCxx.
subset is representable. In the latter case, using private-use
characters to represent these bytes would be an option. For UTF-8,
doing so would create an ambiguity, as the private-use characters may
regularly occur in the input also.
To convert non-decodable bytes, a new error handler "python-escape" is To convert non-decodable bytes, a new error handler "python-escape" is
introduced, which decodes non-decodable bytes using into a private-use introduced, which produces these half surrogates. On encoding, the
character U+F01xx, which is believed to not conflict with private-use error handler converts the half surrogate back to the corresponding
characters that currently exist in Python codecs. byte.
The error handler interface is extended to allow the encode error The error handler interface is extended to allow the encode error
handler to return byte strings immediately, in addition to returning handler to return byte strings immediately, in addition to returning
Unicode strings which then get encoded again. Unicode strings which then get encoded again.
If the locale's encoding is UTF-8, the file system encoding is set to If the locale's encoding is UTF-8, the file system encoding is set to
a new encoding "utf-8b". The UTF-8b codec decodes non-decodable bytes a new encoding "utf-8b", as the regular UTF-8 codec would not
(which must be >= 0x80) into half surrogate codes U+DC80..U+DCFF. re-encode half surrogates as single bytes. The UTF-8b codec decodes
non-decodable bytes (which must be >= 0x80) into half surrogate codes
U+DC80..U+DCFF.
Discussion Discussion
========== ==========