reSTify PEP 293 (#353)

Huang Huang 2017-08-19 03:02:48 +08:00 committed by Brett Cannon
parent e6fe4f377f
commit 3ea921ba72
1 changed files with 328 additions and 311 deletions

@ -5,12 +5,14 @@ Last-Modified: $Date$
Author: Walter Dörwald <walter@livinglogic.de> Author: Walter Dörwald <walter@livinglogic.de>
Status: Final Status: Final
Type: Standards Track Type: Standards Track
Content-Type: text/x-rst
Created: 18-Jun-2002 Created: 18-Jun-2002
Python-Version: 2.3 Python-Version: 2.3
Post-History: 19-Jun-2002 Post-History: 19-Jun-2002
Abstract Abstract
========
This PEP aims at extending Python's fixed codec error handling This PEP aims at extending Python's fixed codec error handling
schemes with a more flexible callback based approach. schemes with a more flexible callback based approach.
@ -25,6 +27,7 @@ Abstract
Specification Specification
=============
Currently the set of codec error handling algorithms is fixed to Currently the set of codec error handling algorithms is fixed to
either "strict", "replace" or "ignore" and the semantics of these either "strict", "replace" or "ignore" and the semantics of these
@ -33,19 +36,19 @@ Specification
The proposed patch will make the set of error handling algorithms The proposed patch will make the set of error handling algorithms
extensible through a codec error handler registry which maps extensible through a codec error handler registry which maps
handler names to handler functions. This registry consists of the handler names to handler functions. This registry consists of the
following two C functions: following two C functions::
int PyCodec_RegisterError(const char *name, PyObject *error) int PyCodec_RegisterError(const char *name, PyObject *error)
PyObject *PyCodec_LookupError(const char *name) PyObject *PyCodec_LookupError(const char *name)
and their Python counterparts and their Python counterparts::
codecs.register_error(name, error) codecs.register_error(name, error)
codecs.lookup_error(name) codecs.lookup_error(name)
PyCodec_LookupError raises a LookupError if no callback function ``PyCodec_LookupError`` raises a ``LookupError`` if no callback function
has been registered under this name. has been registered under this name.
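
The registry described above can be exercised from Python. The following is a minimal sketch, assuming a current Python 3 interpreter (where ``str`` plays the role of the PEP's unicode object) rather than the Python 2.3 the PEP targets. The handler name ``pep293-demo-question`` and the function ``question_mark`` are hypothetical and exist only for illustration.

```python
import codecs


def question_mark(exc):
    # Hypothetical handler: replace each unencodable character with "?".
    if isinstance(exc, UnicodeEncodeError):
        return ("?" * (exc.end - exc.start), exc.end)
    raise TypeError("can't handle %s" % type(exc).__name__)


# Register the handler under a hypothetical name, then look it up again.
codecs.register_error("pep293-demo-question", question_mark)
handler = codecs.lookup_error("pep293-demo-question")

# The registered name can now be passed as the errors argument.
encoded = "a\u20acb".encode("ascii", "pep293-demo-question")

# Looking up a name that was never registered raises LookupError.
try:
    codecs.lookup_error("pep293-demo-unregistered")
    lookup_failed = False
except LookupError:
    lookup_failed = True
```

Because registration is global, a real handler would normally use a name that is unlikely to clash with names chosen by other libraries.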
Similar to the encoding name registry there is no way of Similar to the encoding name registry there is no way of
@ -59,7 +62,7 @@ Specification
with this object. The callback returns information about how to with this object. The callback returns information about how to
proceed (or raises an exception). proceed (or raises an exception).
For encoding, the exception object will look like this: For encoding, the exception object will look like this::
class UnicodeEncodeError(UnicodeError): class UnicodeEncodeError(UnicodeError):
def __init__(self, encoding, object, start, end, reason): def __init__(self, encoding, object, start, end, reason):
@ -77,14 +80,14 @@ Specification
getter methods for the attributes, which have the following getter methods for the attributes, which have the following
meaning: meaning:
* encoding: The name of the encoding; * ``encoding``: The name of the encoding;
* object: The original unicode object for which encode() has * ``object``: The original unicode object for which ``encode()`` has
been called; been called;
* start: The position of the first unencodable character; * ``start``: The position of the first unencodable character;
* end: (The position of the last unencodable character)+1 (or * ``end``: (The position of the last unencodable character)+1 (or
the length of object, if all characters from start to the end the length of object, if all characters from start to the end
of object are unencodable); of object are unencodable);
* reason: The reason why object[start:end] couldn't be encoded. * ``reason``: The reason why ``object[start:end]`` couldn't be encoded.
If object has consecutive unencodable characters, the encoder If object has consecutive unencodable characters, the encoder
should collect those characters for one call to the callback if should collect those characters for one call to the callback if
@ -95,18 +98,18 @@ Specification
The callback must not modify the exception object. If the The callback must not modify the exception object. If the
callback does not raise an exception (either the one passed in, or callback does not raise an exception (either the one passed in, or
a different one), it must return a tuple: a different one), it must return a tuple::
(replacement, newpos) (replacement, newpos)
replacement is a unicode object that the encoder will encode and replacement is a unicode object that the encoder will encode and
emit instead of the unencodable object[start:end] part, newpos emit instead of the unencodable ``object[start:end]`` part, newpos
specifies a new position within object, where (after encoding the specifies a new position within object, where (after encoding the
replacement) the encoder will continue encoding. replacement) the encoder will continue encoding.
Negative values for newpos are treated as being relative to Negative values for newpos are treated as being relative to
end of object. If newpos is out of bounds the encoder will raise end of object. If newpos is out of bounds the encoder will raise
an IndexError. an ``IndexError``.
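
To illustrate the ``(replacement, newpos)`` return value, here is a short sketch with two handlers, assuming Python 3. The names ``bracket_codepoints``, ``truncate_at_error`` and the registered handler names are hypothetical. The first handler resumes right after the unencodable run; the second returns the length of the input as ``newpos``, so encoding stops at the first error.

```python
import codecs


def bracket_codepoints(exc):
    # Hypothetical handler: emit "[U+XXXX]" for every unencodable character
    # and resume encoding right after the unencodable run.
    if not isinstance(exc, UnicodeEncodeError):
        raise TypeError("can't handle %s" % type(exc).__name__)
    bad = exc.object[exc.start:exc.end]
    replacement = "".join("[U+%04X]" % ord(ch) for ch in bad)
    return (replacement, exc.end)


def truncate_at_error(exc):
    # Hypothetical handler: emit nothing and continue at the end of the
    # input, which drops everything from the first error onwards.
    if not isinstance(exc, UnicodeEncodeError):
        raise TypeError("can't handle %s" % type(exc).__name__)
    return ("", len(exc.object))


codecs.register_error("pep293-demo-bracket", bracket_codepoints)
codecs.register_error("pep293-demo-truncate", truncate_at_error)

bracketed = "x\u20ac\u20acy".encode("ascii", "pep293-demo-bracket")
truncated = "x\u20acy".encode("ascii", "pep293-demo-truncate")
```

Both handlers raise ``TypeError`` for exception types they do not understand, matching the convention the PEP sets for callbacks.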
If the replacement string itself contains an unencodable character If the replacement string itself contains an unencodable character
the encoder raises the exception object (but may set a different the encoder raises the exception object (but may set a different
@ -115,44 +118,45 @@ Specification
Should further encoding errors occur, the encoder is allowed to Should further encoding errors occur, the encoder is allowed to
reuse the exception object for the next call to the callback. reuse the exception object for the next call to the callback.
Furthermore, the encoder is allowed to cache the result of Furthermore, the encoder is allowed to cache the result of
codecs.lookup_error. ``codecs.lookup_error``.
If the callback does not know how to handle the exception, it must If the callback does not know how to handle the exception, it must
raise a TypeError. raise a ``TypeError``.
Decoding works similarly to encoding with the following differences: Decoding works similarly to encoding with the following differences:
The exception class is named UnicodeDecodeError and the attribute
* The exception class is named ``UnicodeDecodeError`` and the attribute
object is the original 8bit string that the decoder is currently object is the original 8bit string that the decoder is currently
decoding. decoding.
The decoder will call the callback with those bytes that * The decoder will call the callback with those bytes that
constitute one undecodable sequence, even if there is more than constitute one undecodable sequence, even if there is more than
one undecodable sequence that is undecodable for the same reason one undecodable sequence that is undecodable for the same reason
directly after the first one. E.g. for the "unicode-escape" directly after the first one. E.g. for the "unicode-escape"
encoding, when decoding the illegal string "\\u00\\u01x", the encoding, when decoding the illegal string ``\\u00\\u01x``, the
callback will be called twice (once for "\\u00" and once for callback will be called twice (once for ``\\u00`` and once for
"\\u01"). This is done to be able to generate the correct number ``\\u01``). This is done to be able to generate the correct number
of replacement characters. of replacement characters.
The replacement returned from the callback is a unicode object * The replacement returned from the callback is a unicode object
that will be emitted by the decoder as-is without further that will be emitted by the decoder as-is without further
processing instead of the undecodable object[start:end] part. processing instead of the undecodable ``object[start:end]`` part.
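
The decoding side can be sketched the same way. This example assumes Python 3, where the ``object`` attribute of a ``UnicodeDecodeError`` is a ``bytes`` object rather than an 8-bit ``str``. The handler ``hex_escape_bytes`` and its registered name are hypothetical.

```python
import codecs


def hex_escape_bytes(exc):
    # Hypothetical handler: show every undecodable byte as "<xx>".
    if not isinstance(exc, UnicodeDecodeError):
        raise TypeError("can't handle %s" % type(exc).__name__)
    bad = exc.object[exc.start:exc.end]
    # The returned text is emitted as-is, without further processing.
    return ("".join("<%02x>" % b for b in bad), exc.end)


codecs.register_error("pep293-demo-hexbytes", hex_escape_bytes)

decoded = b"a\xff\xfeb".decode("ascii", "pep293-demo-hexbytes")
```

The result is the same whether the decoder reports the two bad bytes in one call or in two, because the handler formats each byte in the reported range separately.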
There is a third API that uses the old strict/ignore/replace error There is a third API that uses the old strict/ignore/replace error
handling scheme: handling scheme::
PyUnicode_TranslateCharmap/unicode.translate PyUnicode_TranslateCharmap/unicode.translate
The proposed patch will enhance PyUnicode_TranslateCharmap, so The proposed patch will enhance ``PyUnicode_TranslateCharmap``, so
that it also supports the callback registry. This has the that it also supports the callback registry. This has the
additional side effect that PyUnicode_TranslateCharmap will additional side effect that ``PyUnicode_TranslateCharmap`` will
support multi-character replacement strings (see SF feature support multi-character replacement strings (see SF feature
request #403100 [1]). request #403100 [1]_).
For PyUnicode_TranslateCharmap the exception class will be named For ``PyUnicode_TranslateCharmap`` the exception class will be named
UnicodeTranslateError. PyUnicode_TranslateCharmap will collect ``UnicodeTranslateError``. ``PyUnicode_TranslateCharmap`` will collect
all consecutive untranslatable characters (i.e. those that map to all consecutive untranslatable characters (i.e. those that map to
None) and call the callback with them. The replacement returned ``None``) and call the callback with them. The replacement returned
from the callback is a unicode object that will be put in the from the callback is a unicode object that will be put in the
translated result as-is, without further processing. translated result as-is, without further processing.
@ -163,9 +167,9 @@ Specification
callback names: "backslashreplace" and "xmlcharrefreplace", which callback names: "backslashreplace" and "xmlcharrefreplace", which
can be used for encoding and translating and which will also be can be used for encoding and translating and which will also be
implemented in-place for all encoders and implemented in-place for all encoders and
PyUnicode_TranslateCharmap. ``PyUnicode_TranslateCharmap``.
The Python equivalent of these five callbacks will look like this: The Python equivalent of these five callbacks will look like this::
def strict(exc): def strict(exc):
raise exc raise exc
@ -212,16 +216,17 @@ Specification
raise TypeError("can't handle %s" % exc.__name__) raise TypeError("can't handle %s" % exc.__name__)
These five callback handlers will also be accessible to Python as These five callback handlers will also be accessible to Python as
codecs.strict_error, codecs.ignore_error, codecs.replace_error, ``codecs.strict_error``, ``codecs.ignore_error``, ``codecs.replace_error``,
codecs.backslashreplace_error and codecs.xmlcharrefreplace_error. ``codecs.backslashreplace_error`` and ``codecs.xmlcharrefreplace_error``.
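
As a quick illustration of the predefined handlers, the sketch below encodes one string to ASCII with four of the five names. It assumes Python 3 and selects the handlers by their registered string names only.

```python
# Sample text with two characters that ASCII cannot represent.
text = "caf\u00e9 \u20ac"

# Each call selects a predefined handler by its registered name.
xml_refs = text.encode("ascii", "xmlcharrefreplace")
backslashed = text.encode("ascii", "backslashreplace")
replaced = text.encode("ascii", "replace")
ignored = text.encode("ascii", "ignore")

# "strict" raises instead of producing output.
try:
    text.encode("ascii", "strict")
    strict_raised = False
except UnicodeEncodeError:
    strict_raised = True
```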
Rationale Rationale
=========
Most legacy encodings do not support the full range of Unicode Most legacy encodings do not support the full range of Unicode
characters. For these cases many high level protocols support a characters. For these cases many high level protocols support a
way of escaping a Unicode character (e.g. Python itself supports way of escaping a Unicode character (e.g. Python itself supports
the \x, \u and \U convention, XML supports character references the ``\x``, ``\u`` and ``\U`` convention, XML supports character references
via &#xxx; etc.). via &#xxx; etc.).
When implementing such an encoding algorithm, a problem with the When implementing such an encoding algorithm, a problem with the
@ -231,12 +236,16 @@ Rationale
because encode does not provide any information about the location because encode does not provide any information about the location
of the error(s), so of the error(s), so
::
# (1) # (1)
us = u"xxx" us = u"xxx"
s = us.encode(encoding) s = us.encode(encoding)
has to be replaced by has to be replaced by
::
# (2) # (2)
us = u"xxx" us = u"xxx"
v = [] v = []
@ -257,7 +266,7 @@ Rationale
character. character.
To work around this problem, a stream writer - which keeps state To work around this problem, a stream writer - which keeps state
between calls to the encoding function - has to be used: between calls to the encoding function - has to be used::
# (3) # (3)
us = u"xxx" us = u"xxx"
@ -274,7 +283,7 @@ Rationale
s = v.getvalue() s = v.getvalue()
To compare the speed of (1) and (3) the following test script has To compare the speed of (1) and (3) the following test script has
been used: been used::
# (4) # (4)
import time import time
@ -306,7 +315,7 @@ Rationale
print "2:", t3-t2 print "2:", t3-t2
print "factor:", (t3-t2)/(t2-t1) print "factor:", (t3-t2)/(t2-t1)
On Linux this gives the following output (with Python 2.3a0): On Linux this gives the following output (with Python 2.3a0)::
1: 0.274321913719 1: 0.274321913719
2: 51.1284689903 2: 51.1284689903
@ -316,19 +325,23 @@ Rationale
Callbacks must be stateless, because as soon as a callback is Callbacks must be stateless, because as soon as a callback is
registered it is available globally and can be called by multiple registered it is available globally and can be called by multiple
encode() calls. To be able to use stateful callbacks, the errors ``encode()`` calls. To be able to use stateful callbacks, the errors
parameter for encode/decode/translate would have to be changed parameter for encode/decode/translate would have to be changed
from char * to PyObject *, so that the callback could be used from ``char *`` to ``PyObject *``, so that the callback could be used
directly, without the need to register the callback globally. As directly, without the need to register the callback globally. As
this requires changes to lots of C prototypes, this approach was this requires changes to lots of C prototypes, this approach was
rejected. rejected.
Currently all encoding/decoding functions have arguments Currently all encoding/decoding functions have arguments
::
const Py_UNICODE *p, int size const Py_UNICODE *p, int size
or or
::
const char *p, int size const char *p, int size
to specify the unicode characters/8bit characters to be to specify the unicode characters/8bit characters to be
@ -343,35 +356,36 @@ Rationale
For stream readers/writers the errors attribute must be changeable For stream readers/writers the errors attribute must be changeable
to be able to switch between different error handling methods to be able to switch between different error handling methods
during the lifetime of the stream reader/writer. This is currently during the lifetime of the stream reader/writer. This is currently
the case for codecs.StreamReader and codecs.StreamWriter and the case for ``codecs.StreamReader`` and ``codecs.StreamWriter`` and
all their subclasses. All core codecs and probably most of the all their subclasses. All core codecs and probably most of the
third party codecs (e.g. JapaneseCodecs) derive their stream third party codecs (e.g. ``JapaneseCodecs``) derive their stream
readers/writers from these classes so this already works, readers/writers from these classes so this already works,
but the attribute errors should be documented as a requirement. but the attribute errors should be documented as a requirement.
Implementation Notes Implementation Notes
====================
A sample implementation is available as SourceForge patch #432401 A sample implementation is available as SourceForge patch #432401
[2] including a script for testing the speed of various [2]_ including a script for testing the speed of various
string/encoding/error combinations and a test script. string/encoding/error combinations and a test script.
Currently the new exception classes are old style Python Currently the new exception classes are old style Python
classes. This means that accessing attributes results classes. This means that accessing attributes results
in a dict lookup. The C API is implemented in a way in a dict lookup. The C API is implemented in a way
that makes it possible to switch to new style classes that makes it possible to switch to new style classes
behind the scene, if Exception (and UnicodeError) will behind the scene, if ``Exception`` (and ``UnicodeError``) will
be changed to new style classes implemented in C for be changed to new style classes implemented in C for
improved performance. improved performance.
The class codecs.StreamReaderWriter uses the errors parameter for The class ``codecs.StreamReaderWriter`` uses the errors parameter for
both reading and writing. To be more flexible this should both reading and writing. To be more flexible this should
probably be changed to two separate parameters for reading and probably be changed to two separate parameters for reading and
writing. writing.
The errors parameter of PyUnicode_TranslateCharmap is not The errors parameter of ``PyUnicode_TranslateCharmap`` is not
available to Python, which makes testing of the new functionality available to Python, which makes testing of the new functionality
of PyUnicode_TranslateCharmap impossible with Python scripts. The of ``PyUnicode_TranslateCharmap`` impossible with Python scripts. The
patch should add an optional argument errors to unicode.translate patch should add an optional argument errors to unicode.translate
to expose the functionality and make testing possible. to expose the functionality and make testing possible.
@ -379,11 +393,12 @@ Implementation Notes
unicode and want to use the new machinery can define their own unicode and want to use the new machinery can define their own
exception classes and the strict handlers will automatically work exception classes and the strict handlers will automatically work
with it. The other predefined error handlers are unicode specific with it. The other predefined error handlers are unicode specific
and expect to get a Unicode(Encode|Decode|Translate)Error and expect to get a ``Unicode(Encode|Decode|Translate)Error``
exception object so they won't work. exception object so they won't work.
Backwards Compatibility Backwards Compatibility
=======================
The semantics of unicode.encode with errors="replace" has changed: The semantics of unicode.encode with errors="replace" has changed:
The old version always stored a ? character in the output string The old version always stored a ? character in the output string
@ -393,26 +408,28 @@ Backwards Compatibility
supported encodings are ASCII based, and thus map ? to ?, this supported encodings are ASCII based, and thus map ? to ?, this
should not be a problem in practice. should not be a problem in practice.
Illegal values for the errors argument raised ValueError before, Illegal values for the errors argument raised ``ValueError`` before,
now they will raise LookupError. now they will raise ``LookupError``.
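
The change in exception type can be checked with a small sketch, assuming Python 3. The handler name ``pep293-demo-no-such-handler`` is hypothetical and deliberately unregistered. The input contains an unencodable character so that the encoder actually has to look the handler up.

```python
# Encode with an errors value that has never been registered.
try:
    "\u20ac".encode("ascii", "pep293-demo-no-such-handler")
    outcome = "encoded"
except LookupError:
    outcome = "LookupError"
except ValueError:
    outcome = "ValueError"
```

Code that used to catch ``ValueError`` for a bad ``errors`` argument would need to catch ``LookupError`` instead, since ``LookupError`` is not a subclass of ``ValueError``.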
References References
==========
[1] SF feature request #403100 .. [1] SF feature request #403100
"Multicharacter replacements in PyUnicode_TranslateCharmap" "Multicharacter replacements in PyUnicode_TranslateCharmap"
http://www.python.org/sf/403100 http://www.python.org/sf/403100
[2] SF patch #432401 "unicode encoding error callbacks" .. [2] SF patch #432401 "unicode encoding error callbacks"
http://www.python.org/sf/432401 http://www.python.org/sf/432401
Copyright Copyright
=========
This document has been placed in the public domain. This document has been placed in the public domain.
..
Local Variables: Local Variables:
mode: indented-text mode: indented-text
indent-tabs-mode: nil indent-tabs-mode: nil