Accept Lino Mastrodomenico's proposal of always using low surrogates
to represent non-decodable bytes.
Martin v. Löwis 2009-04-24 20:25:20 +00:00
parent 1eb61be116
commit 1b7ea9323b
1 changed files with 9 additions and 11 deletions

@@ -62,25 +62,23 @@ character-oriented APIs, allowing direct conversion of the
 environmental data to Python str objects.
 
 On POSIX systems, Python currently applies the locale's encoding to
-convert the byte data to Unicode. If the locale's encoding is UTF-8,
-it can represent the full set of Unicode characters, otherwise, only a
-subset is representable. In the latter case, using private-use
-characters to represent these bytes would be an option. For UTF-8,
-doing so would create an ambiguity, as the private-use characters may
-regularly occur in the input also.
+convert the byte data to Unicode. Non-decodable bytes will be
+represented as lone half surrogate codes U+DCxx.
 
 To convert non-decodable bytes, a new error handler "python-escape" is
-introduced, which decodes non-decodable bytes using into a private-use
-character U+F01xx, which is believed to not conflict with private-use
-characters that currently exist in Python codecs.
+introduced, which produces these half surrogates. On encoding, the
+error handler converts the half surrogate back to the corresponding
+byte.
 
 The error handler interface is extended to allow the encode error
 handler to return byte strings immediately, in addition to returning
 Unicode strings which then get encoded again.
 
 If the locale's encoding is UTF-8, the file system encoding is set to
-a new encoding "utf-8b". The UTF-8b codec decodes non-decodable bytes
-(which must be >= 0x80) into half surrogate codes U+DC80..U+DCFF.
+a new encoding "utf-8b", as the regular UTF-8 codec would not
+re-encode half surrogates as single bytes. The UTF-8b codec decodes
+non-decodable bytes (which must be >= 0x80) into half surrogate codes
+U+DC80..U+DCFF.
 
 Discussion
 ==========
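
The round trip described by the revised text in this hunk can be
sketched in Python as follows. This is only an illustration under
stated assumptions: it uses "surrogateescape", the name under which
this error handler is available in Python 3.1 and later, whereas the
text in this commit still calls it "python-escape". The handler name
"demo-escape" and the function bytes_from_surrogates are hypothetical
and exist only to show an encode error handler returning a byte string
directly.

```python
import codecs

# Assumption: "surrogateescape" stands in for the handler that this
# revision of the text calls "python-escape".
raw = b"caf\xe9.txt"  # Latin-1 bytes; 0xE9 is not valid UTF-8 here

# Decoding: the non-decodable byte 0xE9 becomes the lone surrogate U+DCE9.
name = raw.decode("utf-8", "surrogateescape")

# Encoding: the handler converts the surrogate back to the original byte.
roundtrip = name.encode("utf-8", "surrogateescape")

# The regular (strict) UTF-8 codec refuses to encode lone surrogates.
try:
    name.encode("utf-8")
except UnicodeEncodeError:
    strict_rejects = True
else:
    strict_rejects = False


def bytes_from_surrogates(exc):
    """Hypothetical encode handler: map U+DC80..U+DCFF to bytes 0x80..0xFF."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    out = bytearray()
    for ch in exc.object[exc.start:exc.end]:
        code = ord(ch)
        if not 0xDC80 <= code <= 0xDCFF:
            raise exc
        out.append(code - 0xDC00)
    # Returning bytes (not str) hands the result to the codec as-is,
    # without a second encoding pass.
    return bytes(out), exc.end


codecs.register_error("demo-escape", bytes_from_surrogates)
custom = name.encode("utf-8", "demo-escape")
```

Under these assumptions, name holds U+DCE9 in place of the byte 0xE9,
and both roundtrip and custom reproduce the original byte string.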