PEP 293, Codec Error Handling Callbacks, Walter Dörwald

2002-06-19 03:22:11 +00:00 · 2002-06-19 03:22:11 +00:00 · 90997b3c1d
parent 77fe9cc31e
commit 90997b3c1d
1 changed files with 404 additions and 0 deletions
--- a/pep-0293.txt
+++ b/pep-0293.txt
@@ -0,0 +1,404 @@
+PEP: 293
+Title: Codec Error Handling Callbacks
+Version: $Revision$
+Last-Modified: $Date$
+Author: Walter Dörwald
+Status: Draft
+Type: Standards Track
+Created: 18-Jun-2002
+Python-Version: 2.3
+Post-History:
+
+
+Abstract
+
+    This PEP aims at extending Python's fixed codec error handling
+    schemes with a more flexible callback based approach.
+
+    Python currently uses a fixed set of error handling schemes for
+    codecs.  This PEP describes a mechanism which allows Python to
+    use function callbacks as error handlers.  With these more
+    flexible error handlers it is possible to add new functionality to
+    existing codecs by e.g. providing fallback solutions or different
+    encodings for cases where the standard codec mapping does not
+    apply.
+
+
+Specification
+
+    Currently the set of codec error handling algorithms is fixed to
+    either "strict", "replace" or "ignore" and the semantics of these
+    algorithms is implemented separately for each codec.
+
+    The proposed patch will make the set of error handling algorithms
+    extensible through a codec error handler registry which maps
+    handler names to handler functions.  This registry consists of the
+    following two C functions:
+
+        int PyCodec_RegisterError(const char *name, PyObject *error)
+
+        PyObject *PyCodec_LookupError(const char *name)
+
+    and their Python counterparts
+
+        codecs.register_error(name, error)
+
+        codecs.lookup_error(name)
+
+    PyCodec_LookupError raises a LookupError if no callback function
+    has been registered under this name.
+
+    Similar to the encoding name registry there is no way of
+    unregistering callback functions or iterating through the
+    available functions.
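A minimal sketch of the Python-level registry described above, written against the codecs API as it exists in current Python 3 (an assumption relative to this draft). The handler function and the names "example.qmark" and "example.not-registered" are hypothetical, chosen only for illustration.

```python
import codecs

def hypothetical_question_mark(exc):
    # Hypothetical handler: emit "?" for the offending span and
    # continue right after it.
    return ("?", exc.end)

# Register the handler under an illustrative name.
codecs.register_error("example.qmark", hypothetical_question_mark)

# Looking the name up returns the function that was registered.
found = codecs.lookup_error("example.qmark")

# Looking up a name that was never registered raises LookupError.
try:
    codecs.lookup_error("example.not-registered")
    missing_raises = False
except LookupError:
    missing_raises = True
```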
+
+    The callback functions will be used in the following way by the
+    codecs: when the codec encounters an encoding/decoding error, the
+    callback function is looked up by name, the information about the
+    error is stored in an exception object and the callback is called
+    with this object.  The callback returns information about how to
+    proceed (or raises an exception).
+
+    For encoding, the exception object will look like this:
+
+        class UnicodeEncodeError(UnicodeError):
+            def __init__(self, encoding, object, start, end, reason):
+                # Adjacent literals are joined before % is applied.
+                UnicodeError.__init__(self,
+                    "encoding '%s' can't encode characters "
+                    "in positions %d-%d: %s" % (encoding,
+                        start, end-1, reason))
+                self.encoding = encoding
+                self.object = object
+                self.start = start
+                self.end = end
+                self.reason = reason
+
+    This type will be implemented in C with the appropriate setter and
+    getter methods for the attributes, which have the following
+    meaning:
+
+      * encoding: The name of the encoding;
+      * object: The original unicode object for which encode() has
+        been called;
+      * start: The position of the first unencodable character;
+      * end: (The position of the last unencodable character)+1 (or
+        the length of object, if all characters from start to the end
+        of object are unencodable);
+      * reason: The reason why object[start:end] couldn't be encoded.
+
+    If object has consecutive unencodable characters that cannot be
+    encoded for the same reason, the encoder should collect them and
+    pass them to the callback in a single call.  The encoder is not
+    required to do so and may call the callback once per character,
+    but collecting is strongly recommended.
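As an illustration of the attributes listed above, the following sketch inspects the exception raised by a strict ASCII encode. It assumes current Python 3, where UnicodeEncodeError carries these attributes; the sample string is arbitrary.

```python
# Two adjacent non-ASCII characters sit at positions 2 and 3.
text = "ab\u00e4\u00f6c"

try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    # Copy the attributes out, since the name exc is cleared when
    # the except block ends.
    info = (exc.encoding, exc.object, exc.start, exc.end, exc.reason)

encoding, obj, start, end, reason = info
# The slice object[start:end] is the unencodable part.
unencodable = obj[start:end]
```

Here the ASCII encoder reports both characters in one exception, which is the collecting behaviour recommended above.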
+
+    The callback must not modify the exception object.  If the
+    callback does not raise an exception (either the one passed in, or
+    a different one), it must return a tuple:
+
+        (replacement, newpos)
+
+    replacement is a unicode object that the encoder will encode and
+    emit instead of the unencodable object[start:end] part, newpos
+    specifies a new position within object, where (after encoding the
+    replacement) the encoder will continue encoding.
+
+    If the replacement string itself contains an unencodable character
+    the encoder raises the exception object (but may set a different
+    reason string before raising).
+
+    Should further encoding errors occur, the encoder is allowed to
+    reuse the exception object for the next call to the callback.
+    Furthermore the encoder is allowed to cache the result of
+    codecs.lookup_error.
+
+    If the callback does not know how to handle the exception, it must
+    raise a TypeError.
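The callback protocol above can be sketched as follows, assuming current Python 3. The handler, its "[U+XXXX]" output format and the name "example.codepoint-tag" are hypothetical and not part of the proposal.

```python
import codecs

def hypothetical_codepoint_tag(exc):
    # Only encoding errors are handled; anything else is rejected
    # with a TypeError, as the protocol requires.
    if not isinstance(exc, UnicodeEncodeError):
        raise TypeError("can't handle %s" % exc.__class__.__name__)
    bad = exc.object[exc.start:exc.end]
    # The replacement must itself be encodable by the codec.
    replacement = "".join("[U+%04X]" % ord(c) for c in bad)
    # newpos == exc.end: resume right after the unencodable span.
    return (replacement, exc.end)

codecs.register_error("example.codepoint-tag", hypothetical_codepoint_tag)

encoded = "a\u00e4b".encode("ascii", "example.codepoint-tag")
```

The callback does not modify the exception object; it only reads object, start and end.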
+
+    Decoding works similarly to encoding, with the following
+    differences: the exception class is named UnicodeDecodeError and
+    the attribute object is the original 8-bit string that the decoder
+    is currently decoding.
+
+    The decoder will call the callback separately for each
+    undecodable sequence, even if several sequences that are
+    undecodable for the same reason follow one another directly.
+    E.g. for the "unicode-escape"
+    encoding, when decoding the illegal string "\\u00\\u01x", the
+    callback will be called twice (once for "\\u00" and once for
+    "\\u01").  This is done to be able to generate the correct number
+    of replacement characters.
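A small sketch of the per-sequence behaviour for decoding, assuming current Python 3. It uses the "ascii" codec rather than "unicode-escape" to keep the input simple; the recording handler and the name "example.record" are hypothetical.

```python
import codecs

calls = []

def hypothetical_recording_handler(exc):
    if not isinstance(exc, UnicodeDecodeError):
        raise TypeError("can't handle %s" % exc.__class__.__name__)
    # Remember which span the decoder reported for this call.
    calls.append((exc.start, exc.end))
    # One replacement character per reported sequence.
    return ("\ufffd", exc.end)

codecs.register_error("example.record", hypothetical_recording_handler)

# Two undecodable bytes in a row, followed by one valid byte.
decoded = b"\xff\xfea".decode("ascii", "example.record")
```

Each undecodable byte is reported in its own call, so the result contains one replacement character per byte.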
+
+    The replacement returned from the callback is a unicode object
+    that will be emitted by the decoder as-is without further
+    processing instead of the undecodable object[start:end] part.
+
+    There is a third API that uses the old strict/ignore/replace error
+    handling scheme:
+
+        PyUnicode_TranslateCharmap/unicode.translate
+
+    The proposed patch will enhance PyUnicode_TranslateCharmap, so
+    that it also supports the callback registry.  This has the
+    additional side effect that PyUnicode_TranslateCharmap will
+    support multi-character replacement strings (see SF feature
+    request #403100 [1]).
+
+    For PyUnicode_TranslateCharmap the exception class will be named
+    UnicodeTranslateError.  PyUnicode_TranslateCharmap will collect
+    all consecutive untranslatable characters (i.e. those that map to
+    None) and call the callback with them.  The replacement returned
+    from the callback is a unicode object that will be put in the
+    translated result as-is, without further processing.
+
+    All encoders and decoders are allowed to implement the callback
+    functionality themselves, if they recognize the callback name
+    (i.e. if it is a system callback like "strict", "replace" and
+    "ignore").  The proposed patch will add two additional system
+    callback names: "backslashreplace" and "xmlcharrefreplace", which
+    can be used for encoding and translating and which will also be
+    implemented in-place for all encoders and
+    PyUnicode_TranslateCharmap.
+
+    The Python equivalent of these five callbacks will look like this:
+
+        def strict(exc):
+            raise exc
+
+        def ignore(exc):
+            if isinstance(exc, UnicodeError):
+                return (u"", exc.end)
+            else:
+                raise TypeError("can't handle %s" %
+                    exc.__class__.__name__)
+
+        def replace(exc):
+            if isinstance(exc, UnicodeEncodeError):
+                return ((exc.end-exc.start)*u"?", exc.end)
+            elif isinstance(exc, UnicodeDecodeError):
+                return (u"\ufffd", exc.end)
+            elif isinstance(exc, UnicodeTranslateError):
+                return ((exc.end-exc.start)*u"\ufffd", exc.end)
+            else:
+                raise TypeError("can't handle %s" %
+                    exc.__class__.__name__)
+
+        def backslashreplace(exc):
+            if isinstance(exc,
+                (UnicodeEncodeError, UnicodeTranslateError)):
+                s = u""
+                for c in exc.object[exc.start:exc.end]:
+                    if ord(c)<=0xff:
+                        s += u"\\x%02x" % ord(c)
+                    elif ord(c)<=0xffff:
+                        s += u"\\u%04x" % ord(c)
+                    else:
+                        s += u"\\U%08x" % ord(c)
+                return (s, exc.end)
+            else:
+                raise TypeError("can't handle %s" %
+                    exc.__class__.__name__)
+
+        def xmlcharrefreplace(exc):
+            if isinstance(exc,
+                (UnicodeEncodeError, UnicodeTranslateError)):
+                s = u""
+                for c in exc.object[exc.start:exc.end]:
+                    s += u"&#%d;" % ord(c)
+                return (s, exc.end)
+            else:
+                raise TypeError("can't handle %s" %
+                    exc.__class__.__name__)
+
+    These five callback handlers will also be accessible to Python as
+    codecs.strict_error, codecs.ignore_error, codecs.replace_error,
+    codecs.backslashreplace_error and codecs.xmlcharrefreplace_error.
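For illustration, the two new system callback names can be used as follows in current Python 3 (an assumption relative to this draft). The sketch looks a handler up by name rather than relying on the attribute names listed above.

```python
import codecs

# U+00E4 fits in two hex digits, U+20AC needs four.
text = "\u00e4\u20ac"

# Numeric XML character references for the unencodable characters.
as_xml = text.encode("ascii", "xmlcharrefreplace")

# Python-style backslash escapes for the unencodable characters.
as_escapes = text.encode("ascii", "backslashreplace")

# The system callbacks are also reachable through the registry.
handler = codecs.lookup_error("xmlcharrefreplace")
```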
+
+
+Rationale
+
+    Most legacy encodings do not support the full range of Unicode
+    characters.  For these cases many high level protocols support a
+    way of escaping a Unicode character (e.g. Python itself supports
+    the \x, \u and \U convention, XML supports character references
+    via &#xxx; etc.).
+
+    When implementing such an encoding algorithm, a problem with the
+    current implementation of the encode method of Unicode objects
+    becomes apparent: For determining which characters are unencodable
+    by a certain encoding, every single character has to be tried,
+    because encode does not provide any information about the location
+    of the error(s), so
+
+        # (1)
+        us = u"xxx"
+        s = us.encode(encoding)
+
+    has to be replaced by
+
+        # (2)
+        us = u"xxx"
+        v = []
+        for c in us:
+            try:
+                v.append(c.encode(encoding))
+            except UnicodeError:
+                v.append("&#%d;" % ord(c))
+        s = "".join(v)
+
+    This slows down encoding dramatically as now the loop through the
+    string is done in Python code and no longer in C code.
+
+    Furthermore this solution poses problems with stateful encodings.
+    For example UTF-16 uses a Byte Order Mark at the start of the
+    encoded byte string to specify the byte order.  Using (2) with
+    UTF-16 results in an 8-bit string with a BOM in front of every
+    character.
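The BOM problem can be reproduced with a short sketch, assuming current Python 3, where encode() returns bytes. The two-character sample string is arbitrary.

```python
import codecs

text = "ab"

# Encoding the whole string writes a single BOM at the start.
whole = text.encode("utf-16")

# Encoding character by character, as in (2), writes a BOM in
# front of every character.
per_char = b"".join(c.encode("utf-16") for c in text)

# Each character takes 2 bytes and each BOM takes 2 bytes.
sizes = (len(whole), len(per_char))
```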
+
+    To work around this problem, a stream writer - which keeps state
+    between calls to the encoding function - has to be used:
+
+        # (3)
+        us = u"xxx"
+        import codecs, cStringIO as StringIO
+        writer = codecs.getwriter(encoding)
+
+        v = StringIO.StringIO()
+        uv = writer(v)
+        for c in us:
+            try:
+                uv.write(c)
+            except UnicodeError:
+                uv.write(u"&#%d;" % ord(c))
+        s = v.getvalue()
+
+    To compare the speed of (1) and (3) the following test script has
+    been used:
+
+        # (4)
+        import time
+        us = u"äa"*1000000
+        encoding = "ascii"
+        import codecs, cStringIO as StringIO
+
+        t1 = time.time()
+
+        s1 = us.encode(encoding, "replace")
+
+        t2 = time.time()
+
+        writer = codecs.getwriter(encoding)
+
+        v = StringIO.StringIO()
+        uv = writer(v)
+        for c in us:
+            try:
+                uv.write(c)
+            except UnicodeError:
+                uv.write(u"?")
+        s2 = v.getvalue()
+
+        t3 = time.time()
+
+        assert(s1==s2)
+        print "1:", t2-t1
+        print "2:", t3-t2
+        print "factor:", (t3-t2)/(t2-t1)
+
+    On Linux this gives the following output (with Python 2.3a0):
+
+        1: 0.274321913719
+        2: 51.1284689903
+        factor: 186.381278466
+
+    i.e. (3) is roughly 186 times slower than (1).
+
+    Callbacks must be stateless, because as soon as a callback is
+    registered it is available globally and can be called by multiple
+    encode() calls.  To be able to use stateful callbacks, the errors
+    parameter for encode/decode/translate would have to be changed
+    from char * to PyObject *, so that the callback could be used
+    directly, without the need to register the callback globally.  As
+    this requires changes to lots of C prototypes, this approach was
+    rejected.
+
+    Currently all encoding/decoding functions have arguments
+
+        const Py_UNICODE *p, int size
+
+    or
+
+        const char *p, int size
+
+    to specify the unicode characters/8bit characters to be
+    encoded/decoded.  So in case of an error the codec has to create a
+    new unicode or str object from these parameters and store it in
+    the exception object.  The callers of these encoding/decoding
+    functions extract these parameters from str/unicode objects
+    themselves most of the time, so it could speed up error handling
+    if these objects were passed directly.  As this again requires
+    changes to many C functions, this approach has been rejected.
+
+
+Implementation Notes
+
+    A sample implementation is available as SourceForge patch #432401
+    [2].  The current version of this patch differs from the
+    specification in the following way:
+
+      * The error information is passed from the codec to the callback
+        not as an exception object, but as a tuple, which has an
+        additional entry state, which can be used for additional
+        information the codec might want to pass to the callback.
+      * There are two separate registries (one for
+        encoding/translating and one for decoding)
+
+    The class codecs.StreamReaderWriter uses the errors parameter for
+    both reading and writing.  To be more flexible this should
+    probably be changed to two separate parameters for reading and
+    writing.
+
+    The errors parameter of PyUnicode_TranslateCharmap is not
+    available to Python, which makes testing of the new functionality
+    of PyUnicode_TranslateCharmap impossible with Python scripts.  The
+    patch should add an optional argument errors to unicode.translate
+    to expose the functionality and make testing possible.
+
+    Codecs that do something other than encoding/decoding from/to
+    unicode and want to use the new machinery can define their own
+    exception classes, and the strict handler will automatically work
+    with them.  The other predefined error handlers are unicode
+    specific and expect to get a
+    Unicode(Encode|Decode|Translate)Error exception object, so they
+    won't work.
+
+
+Backwards Compatibility
+
+    The semantics of unicode.encode with errors="replace" has changed:
+    The old version always stored a ? character in the output string
+    even if no character was mapped to ? in the mapping.  With the
+    proposed patch, the replacement string from the callback
+    will again be looked up in the mapping dictionary.  But as all
+    supported encodings are ASCII based, and thus map ? to ?, this
+    should not be a problem in practice.
+
+
+References
+
+    [1] SF feature request #403100
+        "Multicharacter replacements in PyUnicode_TranslateCharmap"
+        http://www.python.org/sf/403100
+
+    [2] SF patch #432401 "unicode encoding error callbacks"
+        http://www.python.org/sf/432401
+
+
+Copyright
+
+    This document has been placed in the public domain.
+
+
+
+Local Variables:
+mode: indented-text
+indent-tabs-mode: nil
+sentence-end-double-space: t
+fill-column: 70
+End: