440 lines
15 KiB
ReStructuredText
440 lines
15 KiB
ReStructuredText
PEP: 293
|
|
Title: Codec Error Handling Callbacks
|
|
Version: $Revision$
|
|
Last-Modified: $Date$
|
|
Author: Walter Dörwald <walter@livinglogic.de>
|
|
Status: Final
|
|
Type: Standards Track
|
|
Content-Type: text/x-rst
|
|
Created: 18-Jun-2002
|
|
Python-Version: 2.3
|
|
Post-History: 19-Jun-2002
|
|
|
|
|
|
Abstract
|
|
========
|
|
|
|
This PEP aims at extending Python's fixed codec error handling
|
|
schemes with a more flexible callback based approach.
|
|
|
|
Python currently uses a fixed error handling for codec error
|
|
handlers. This PEP describes a mechanism which allows Python to
|
|
use function callbacks as error handlers. With these more
|
|
flexible error handlers it is possible to add new functionality to
|
|
existing codecs by e.g. providing fallback solutions or different
|
|
encodings for cases where the standard codec mapping does not
|
|
apply.
|
|
|
|
|
|
Specification
|
|
=============
|
|
|
|
Currently the set of codec error handling algorithms is fixed to
|
|
either "strict", "replace" or "ignore" and the semantics of these
|
|
algorithms is implemented separately for each codec.
|
|
|
|
The proposed patch will make the set of error handling algorithms
|
|
extensible through a codec error handler registry which maps
|
|
handler names to handler functions. This registry consists of the
|
|
following two C functions::
|
|
|
|
int PyCodec_RegisterError(const char *name, PyObject *error)
|
|
|
|
PyObject *PyCodec_LookupError(const char *name)
|
|
|
|
and their Python counterparts::
|
|
|
|
codecs.register_error(name, error)
|
|
|
|
codecs.lookup_error(name)
|
|
|
|
``PyCodec_LookupError`` raises a ``LookupError`` if no callback function
|
|
has been registered under this name.
|
|
|
|
Similar to the encoding name registry there is no way of
|
|
unregistering callback functions or iterating through the
|
|
available functions.
|
|
|
|
The callback functions will be used in the following way by the
|
|
codecs: when the codec encounters an encoding/decoding error, the
|
|
callback function is looked up by name, the information about the
|
|
error is stored in an exception object and the callback is called
|
|
with this object. The callback returns information about how to
|
|
proceed (or raises an exception).
|
|
|
|
For encoding, the exception object will look like this::
|
|
|
|
class UnicodeEncodeError(UnicodeError):
|
|
def __init__(self, encoding, object, start, end, reason):
|
|
UnicodeError.__init__(self,
|
|
"encoding '%s' can't encode characters " +
|
|
"in positions %d-%d: %s" % (encoding,
|
|
start, end-1, reason))
|
|
self.encoding = encoding
|
|
self.object = object
|
|
self.start = start
|
|
self.end = end
|
|
self.reason = reason
|
|
|
|
This type will be implemented in C with the appropriate setter and
|
|
getter methods for the attributes, which have the following
|
|
meaning:
|
|
|
|
* ``encoding``: The name of the encoding;
|
|
* ``object``: The original unicode object for which ``encode()`` has
|
|
been called;
|
|
* ``start``: The position of the first unencodable character;
|
|
* ``end``: (The position of the last unencodable character)+1 (or
|
|
the length of object, if all characters from start to the end
|
|
of object are unencodable);
|
|
* ``reason``: The reason why ``object[start:end]`` couldn't be encoded.
|
|
|
|
If object has consecutive unencodable characters, the encoder
|
|
should collect those characters for one call to the callback if
|
|
those characters can't be encoded for the same reason. The
|
|
encoder is not required to implement this behaviour but may call
|
|
the callback for every single character, but it is strongly
|
|
suggested that the collecting method is implemented.
|
|
|
|
The callback must not modify the exception object. If the
|
|
callback does not raise an exception (either the one passed in, or
|
|
a different one), it must return a tuple::
|
|
|
|
(replacement, newpos)
|
|
|
|
replacement is a unicode object that the encoder will encode and
|
|
emit instead of the unencodable ``object[start:end]`` part, newpos
|
|
specifies a new position within object, where (after encoding the
|
|
replacement) the encoder will continue encoding.
|
|
|
|
Negative values for newpos are treated as being relative to
|
|
end of object. If newpos is out of bounds the encoder will raise
|
|
an ``IndexError``.
|
|
|
|
If the replacement string itself contains an unencodable character
|
|
the encoder raises the exception object (but may set a different
|
|
reason string before raising).
|
|
|
|
Should further encoding errors occur, the encoder is allowed to
|
|
reuse the exception object for the next call to the callback.
|
|
Furthermore, the encoder is allowed to cache the result of
|
|
``codecs.lookup_error``.
|
|
|
|
If the callback does not know how to handle the exception, it must
|
|
raise a ``TypeError``.
|
|
|
|
Decoding works similar to encoding with the following differences:
|
|
|
|
* The exception class is named ``UnicodeDecodeError`` and the attribute
|
|
object is the original 8bit string that the decoder is currently
|
|
decoding.
|
|
|
|
* The decoder will call the callback with those bytes that
|
|
constitute one undecodable sequence, even if there is more than
|
|
one undecodable sequence that is undecodable for the same reason
|
|
directly after the first one. E.g. for the "unicode-escape"
|
|
encoding, when decoding the illegal string ``\\u00\\u01x``, the
|
|
callback will be called twice (once for ``\\u00`` and once for
|
|
``\\u01``). This is done to be able to generate the correct number
|
|
of replacement characters.
|
|
|
|
* The replacement returned from the callback is a unicode object
|
|
that will be emitted by the decoder as-is without further
|
|
processing instead of the undecodable ``object[start:end]`` part.
|
|
|
|
There is a third API that uses the old strict/ignore/replace error
|
|
handling scheme::
|
|
|
|
PyUnicode_TranslateCharmap/unicode.translate
|
|
|
|
The proposed patch will enhance ``PyUnicode_TranslateCharmap``, so
|
|
that it also supports the callback registry. This has the
|
|
additional side effect that ``PyUnicode_TranslateCharmap`` will
|
|
support multi-character replacement strings (see SF feature
|
|
request #403100 [1]_).
|
|
|
|
For ``PyUnicode_TranslateCharmap`` the exception class will be named
|
|
``UnicodeTranslateError``. ``PyUnicode_TranslateCharmap`` will collect
|
|
all consecutive untranslatable characters (i.e. those that map to
|
|
``None``) and call the callback with them. The replacement returned
|
|
from the callback is a unicode object that will be put in the
|
|
translated result as-is, without further processing.
|
|
|
|
All encoders and decoders are allowed to implement the callback
|
|
functionality themselves, if they recognize the callback name
|
|
(i.e. if it is a system callback like "strict", "replace" and
|
|
"ignore"). The proposed patch will add two additional system
|
|
callback names: "backslashreplace" and "xmlcharrefreplace", which
|
|
can be used for encoding and translating and which will also be
|
|
implemented in-place for all encoders and
|
|
``PyUnicode_TranslateCharmap``.
|
|
|
|
The Python equivalent of these five callbacks will look like this::
|
|
|
|
def strict(exc):
|
|
raise exc
|
|
|
|
def ignore(exc):
|
|
if isinstance(exc, UnicodeError):
|
|
return (u"", exc.end)
|
|
else:
|
|
raise TypeError("can't handle %s" % exc.__name__)
|
|
|
|
def replace(exc):
|
|
if isinstance(exc, UnicodeEncodeError):
|
|
return ((exc.end-exc.start)*u"?", exc.end)
|
|
elif isinstance(exc, UnicodeDecodeError):
|
|
return (u"\\ufffd", exc.end)
|
|
elif isinstance(exc, UnicodeTranslateError):
|
|
return ((exc.end-exc.start)*u"\\ufffd", exc.end)
|
|
else:
|
|
raise TypeError("can't handle %s" % exc.__name__)
|
|
|
|
def backslashreplace(exc):
|
|
if isinstance(exc,
|
|
(UnicodeEncodeError, UnicodeTranslateError)):
|
|
s = u""
|
|
for c in exc.object[exc.start:exc.end]:
|
|
if ord(c)<=0xff:
|
|
s += u"\\x%02x" % ord(c)
|
|
elif ord(c)<=0xffff:
|
|
s += u"\\u%04x" % ord(c)
|
|
else:
|
|
s += u"\\U%08x" % ord(c)
|
|
return (s, exc.end)
|
|
else:
|
|
raise TypeError("can't handle %s" % exc.__name__)
|
|
|
|
def xmlcharrefreplace(exc):
|
|
if isinstance(exc,
|
|
(UnicodeEncodeError, UnicodeTranslateError)):
|
|
s = u""
|
|
for c in exc.object[exc.start:exc.end]:
|
|
s += u"&#%d;" % ord(c)
|
|
return (s, exc.end)
|
|
else:
|
|
raise TypeError("can't handle %s" % exc.__name__)
|
|
|
|
These five callback handlers will also be accessible to Python as
|
|
``codecs.strict_error``, ``codecs.ignore_error``, ``codecs.replace_error``,
|
|
``codecs.backslashreplace_error`` and ``codecs.xmlcharrefreplace_error``.
|
|
|
|
|
|
Rationale
|
|
=========
|
|
|
|
Most legacy encoding do not support the full range of Unicode
|
|
characters. For these cases many high level protocols support a
|
|
way of escaping a Unicode character (e.g. Python itself supports
|
|
the ``\x``, ``\u`` and ``\U`` convention, XML supports character references
|
|
via &#xxx; etc.).
|
|
|
|
When implementing such an encoding algorithm, a problem with the
|
|
current implementation of the encode method of Unicode objects
|
|
becomes apparent: For determining which characters are unencodable
|
|
by a certain encoding, every single character has to be tried,
|
|
because encode does not provide any information about the location
|
|
of the error(s), so
|
|
|
|
::
|
|
|
|
# (1)
|
|
us = u"xxx"
|
|
s = us.encode(encoding)
|
|
|
|
has to be replaced by
|
|
|
|
::
|
|
|
|
# (2)
|
|
us = u"xxx"
|
|
v = []
|
|
for c in us:
|
|
try:
|
|
v.append(c.encode(encoding))
|
|
except UnicodeError:
|
|
v.append("&#%d;" % ord(c))
|
|
s = "".join(v)
|
|
|
|
This slows down encoding dramatically as now the loop through the
|
|
string is done in Python code and no longer in C code.
|
|
|
|
Furthermore, this solution poses problems with stateful encodings.
|
|
For example, UTF-16 uses a Byte Order Mark at the start of the
|
|
encoded byte string to specify the byte order. Using (2) with
|
|
UTF-16, results in an 8 bit string with a BOM between every
|
|
character.
|
|
|
|
To work around this problem, a stream writer - which keeps state
|
|
between calls to the encoding function - has to be used::
|
|
|
|
# (3)
|
|
us = u"xxx"
|
|
import codecs, cStringIO as StringIO
|
|
writer = codecs.getwriter(encoding)
|
|
|
|
v = StringIO.StringIO()
|
|
uv = writer(v)
|
|
for c in us:
|
|
try:
|
|
uv.write(c)
|
|
except UnicodeError:
|
|
uv.write(u"&#%d;" % ord(c))
|
|
s = v.getvalue()
|
|
|
|
To compare the speed of (1) and (3) the following test script has
|
|
been used::
|
|
|
|
# (4)
|
|
import time
|
|
us = u"äa"*1000000
|
|
encoding = "ascii"
|
|
import codecs, cStringIO as StringIO
|
|
|
|
t1 = time.time()
|
|
|
|
s1 = us.encode(encoding, "replace")
|
|
|
|
t2 = time.time()
|
|
|
|
writer = codecs.getwriter(encoding)
|
|
|
|
v = StringIO.StringIO()
|
|
uv = writer(v)
|
|
for c in us:
|
|
try:
|
|
uv.write(c)
|
|
except UnicodeError:
|
|
uv.write(u"?")
|
|
s2 = v.getvalue()
|
|
|
|
t3 = time.time()
|
|
|
|
assert(s1==s2)
|
|
print "1:", t2-t1
|
|
print "2:", t3-t2
|
|
print "factor:", (t3-t2)/(t2-t1)
|
|
|
|
On Linux this gives the following output (with Python 2.3a0)::
|
|
|
|
1: 0.274321913719
|
|
2: 51.1284689903
|
|
factor: 186.381278466
|
|
|
|
i.e. (3) is 180 times slower than (1).
|
|
|
|
Callbacks must be stateless, because as soon as a callback is
|
|
registered it is available globally and can be called by multiple
|
|
``encode()`` calls. To be able to use stateful callbacks, the errors
|
|
parameter for encode/decode/translate would have to be changed
|
|
from ``char *`` to ``PyObject *``, so that the callback could be used
|
|
directly, without the need to register the callback globally. As
|
|
this requires changes to lots of C prototypes, this approach was
|
|
rejected.
|
|
|
|
Currently all encoding/decoding functions have arguments
|
|
|
|
::
|
|
|
|
const Py_UNICODE *p, int size
|
|
|
|
or
|
|
|
|
::
|
|
|
|
const char *p, int size
|
|
|
|
to specify the unicode characters/8bit characters to be
|
|
encoded/decoded. So in case of an error the codec has to create a
|
|
new unicode or str object from these parameters and store it in
|
|
the exception object. The callers of these encoding/decoding
|
|
functions extract these parameters from str/unicode objects
|
|
themselves most of the time, so it could speed up error handling
|
|
if these object were passed directly. As this again requires
|
|
changes to many C functions, this approach has been rejected.
|
|
|
|
For stream readers/writers the errors attribute must be changeable
|
|
to be able to switch between different error handling methods
|
|
during the lifetime of the stream reader/writer. This is currently
|
|
the case for ``codecs.StreamReader`` and ``codecs.StreamWriter`` and
|
|
all their subclasses. All core codecs and probably most of the
|
|
third party codecs (e.g. ``JapaneseCodecs``) derive their stream
|
|
readers/writers from these classes so this already works,
|
|
but the attribute errors should be documented as a requirement.
|
|
|
|
|
|
Implementation Notes
|
|
====================
|
|
|
|
A sample implementation is available as SourceForge patch #432401
|
|
[2]_ including a script for testing the speed of various
|
|
string/encoding/error combinations and a test script.
|
|
|
|
Currently the new exception classes are old style Python
|
|
classes. This means that accessing attributes results
|
|
in a dict lookup. The C API is implemented in a way
|
|
that makes it possible to switch to new style classes
|
|
behind the scene, if ``Exception`` (and ``UnicodeError``) will
|
|
be changed to new style classes implemented in C for
|
|
improved performance.
|
|
|
|
The class ``codecs.StreamReaderWriter`` uses the errors parameter for
|
|
both reading and writing. To be more flexible this should
|
|
probably be changed to two separate parameters for reading and
|
|
writing.
|
|
|
|
The errors parameter of ``PyUnicode_TranslateCharmap`` is not
|
|
availably to Python, which makes testing of the new functionality
|
|
of ``PyUnicode_TranslateCharmap`` impossible with Python scripts. The
|
|
patch should add an optional argument errors to unicode.translate
|
|
to expose the functionality and make testing possible.
|
|
|
|
Codecs that do something different than encoding/decoding from/to
|
|
unicode and want to use the new machinery can define their own
|
|
exception classes and the strict handlers will automatically work
|
|
with it. The other predefined error handlers are unicode specific
|
|
and expect to get a ``Unicode(Encode|Decode|Translate)Error``
|
|
exception object so they won't work.
|
|
|
|
|
|
Backwards Compatibility
|
|
=======================
|
|
|
|
The semantics of unicode.encode with errors="replace" has changed:
|
|
The old version always stored a ? character in the output string
|
|
even if no character was mapped to ? in the mapping. With the
|
|
proposed patch, the replacement string from the callback will
|
|
again be looked up in the mapping dictionary. But as all
|
|
supported encodings are ASCII based, and thus map ? to ?, this
|
|
should not be a problem in practice.
|
|
|
|
Illegal values for the errors argument raised ``ValueError`` before,
|
|
now they will raise ``LookupError``.
|
|
|
|
|
|
References
|
|
==========
|
|
|
|
.. [1] SF feature request #403100
|
|
"Multicharacter replacements in PyUnicode_TranslateCharmap"
|
|
https://bugs.python.org/issue403100
|
|
|
|
.. [2] SF patch #432401 "unicode encoding error callbacks"
|
|
https://bugs.python.org/issue432401
|
|
|
|
|
|
Copyright
|
|
=========
|
|
|
|
This document has been placed in the public domain.
|
|
|
|
|
|
..
|
|
Local Variables:
|
|
mode: indented-text
|
|
indent-tabs-mode: nil
|
|
sentence-end-double-space: t
|
|
fill-column: 70
|
|
coding: utf-8
|
|
End:
|