Accept Lino Mastrodomenico's proposal of always using low surrogates
to represent non-decodable bytes.
2009-04-24 20:25:20 +00:00
parent 1eb61be116
commit 1b7ea9323b
1 changed files with 9 additions and 11 deletions
--- a/pep-0383.txt
+++ b/pep-0383.txt
@@ -62,25 +62,23 @@ character-oriented APIs, allowing direct conversion of the
 environmental data to Python str objects.

 On POSIX systems, Python currently applies the locale's encoding to
-convert the byte data to Unicode. If the locale's encoding is UTF-8,
-it can represent the full set of Unicode characters, otherwise, only a
-subset is representable. In the latter case, using private-use
-characters to represent these bytes would be an option. For UTF-8,
-doing so would create an ambiguity, as the private-use characters may
-regularly occur in the input also.
+convert the byte data to Unicode. Non-decodable bytes will be
+represented as lone half surrogate codes U+DCxx.

 To convert non-decodable bytes, a new error handler "python-escape" is
-introduced, which decodes non-decodable bytes using into a private-use
-character U+F01xx, which is believed to not conflict with private-use
-characters that currently exist in Python codecs.
+introduced, which produces these half surrogates. On encoding, the
+error handler converts the half surrogate back to the corresponding
+byte.

 The error handler interface is extended to allow the encode error
 handler to return byte strings immediately, in addition to returning
 Unicode strings which then get encoded again.

 If the locale's encoding is UTF-8, the file system encoding is set to
-a new encoding "utf-8b". The UTF-8b codec decodes non-decodable bytes
-(which must be >= 0x80) into half surrogate codes U+DC80..U+DCFF.
+a new encoding "utf-8b", as the regular UTF-8 codec would not
+re-encode half surrogates as single bytes. The UTF-8b codec decodes
+non-decodable bytes (which must be >= 0x80) into half surrogate codes
+U+DC80..U+DCFF.

 Discussion
 ==========
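
For illustration only, the round trip this change describes can be sketched in Python. The handler name "python-escape" in this revision is not assumed to exist in any release; the sketch instead uses the "surrogateescape" error handler that ships with released Python 3, which maps each non-decodable byte 0xNN (>= 0x80) to the lone surrogate U+DCNN and back. The helper names decode_filename and encode_filename are hypothetical and appear nowhere in the PEP.

```python
def decode_filename(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode environmental byte data, mapping each non-decodable byte
    0xNN to the lone surrogate U+DCNN instead of raising an error.

    Assumption: "surrogateescape" stands in for the handler this
    revision of the PEP calls "python-escape".
    """
    return raw.decode(encoding, errors="surrogateescape")


def encode_filename(name: str, encoding: str = "utf-8") -> bytes:
    """Encode back to bytes, turning each U+DCNN surrogate into the
    single original byte 0xNN."""
    return name.encode(encoding, errors="surrogateescape")


# 0xE9 is "e with acute" in Latin-1 but is not valid UTF-8 here.
raw = b"caf\xe9.txt"
name = decode_filename(raw)

# The undecodable byte survives as a lone surrogate in the str object...
assert name == "caf\udce9.txt"
# ...and is restored to the original byte on encoding.
assert encode_filename(name) == raw

# A strict UTF-8 encode refuses the lone surrogate, which is why the
# diff above notes that the regular codec cannot re-encode it.
try:
    name.encode("utf-8")
except UnicodeEncodeError:
    pass
```

Well-formed input is unaffected by the handler, so the escape path only triggers for bytes the codec would otherwise reject.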