PEP 293, Codec Error Handling Callbacks, Walter Dörwald
This commit is contained in:
parent
77fe9cc31e
commit
90997b3c1d
|
@ -0,0 +1,404 @@
|
|||
PEP: 293
|
||||
Title: Codec Error Handling Callbacks
|
||||
Version: $Revision$
|
||||
Last-Modified: $Date$
|
||||
Author: Walter Dörwald
|
||||
Status: Draft
|
||||
Type: Standards Track
|
||||
Created: 18-Jun-2002
|
||||
Python-Version: 2.3
|
||||
Post-History:
|
||||
|
||||
|
||||
Abstract
|
||||
|
||||
This PEP aims at extending Python's fixed codec error handling
|
||||
schemes with a more flexible callback based approach.
|
||||
|
||||
Python currently uses a fixed error handling for codec error
|
||||
handlers. This PEP describes a mechanism which allows Python to
|
||||
use function callbacks as error handlers. With these more
|
||||
flexible error handlers it is possible to add new functionality to
|
||||
existing codecs by e.g. providing fallback solutions or different
|
||||
encodings for cases where the standard codec mapping does not
|
||||
apply.
|
||||
|
||||
|
||||
Specification
|
||||
|
||||
Currently the set of codec error handling algorithms is fixed to
|
||||
either "strict", "replace" or "ignore" and the semantics of these
|
||||
algorithms is implemented separately for each codec.
|
||||
|
||||
The proposed patch will make the set of error handling algorithms
|
||||
extensible through a codec error handler registry which maps
|
||||
handler names to handler functions. This registry consists of the
|
||||
following two C functions:
|
||||
|
||||
int PyCodec_RegisterError(const char *name, PyObject *error)
|
||||
|
||||
PyObject *PyCodec_LookupError(const char *name)
|
||||
|
||||
and their Python counterparts
|
||||
|
||||
codecs.register_error(name, error)
|
||||
|
||||
codecs.lookup_error(name)
|
||||
|
||||
PyCodec_LookupError raises a LookupError if no callback function
|
||||
has been registered under this name.
|
||||
|
||||
Similar to the encoding name registry there is no way of
|
||||
unregistering callback functions or iterating through the
|
||||
available functions.
|
||||
|
||||
The callback functions will be used in the following way by the
|
||||
codecs: when the codec encounters an encoding/decoding error, the
|
||||
callback function is looked up by name, the information about the
|
||||
error is stored in an exception object and the callback is called
|
||||
with this object. The callback returns information about how to
|
||||
proceed (or raises an exception).
|
||||
|
||||
For encoding, the exception object will look like this:
|
||||
|
||||
class UnicodeEncodeError(UnicodeError):
|
||||
def __init__(self, encoding, object, start, end, reason):
|
||||
UnicodeError.__init__(self,
|
||||
"encoding '%s' can't encode characters " +
|
||||
"in positions %d-%d: %s" % (encoding,
|
||||
start, end-1, reason))
|
||||
self.encoding = encoding
|
||||
self.object = object
|
||||
self.start = start
|
||||
self.end = end
|
||||
self.reason = reason
|
||||
|
||||
This type will be implemented in C with the appropriate setter and
|
||||
getter methods for the attributes, which have the following
|
||||
meaning:
|
||||
|
||||
* encoding: The name of the encoding;
|
||||
* object: The original unicode object for which encode() has
|
||||
been called;
|
||||
* start: The position of the first unencodable character;
|
||||
* end: (The position of the last unencodable character)+1 (or
|
||||
the length of object, if all characters from start to the end
|
||||
of object are unencodable);
|
||||
* reason: The reason why object[start:end] couldn't be encoded.
|
||||
|
||||
If object has consecutive unencodable characters, the encoder
|
||||
should collect those characters for one call to the callback if
|
||||
those characters can't be encoded for the same reason. The
|
||||
encoder is not required to implement this behaviour but may call
|
||||
the callback for every single character, but it is strongly
|
||||
suggested that the collecting method is implemented.
|
||||
|
||||
The callback must not modify the exception object. If the
|
||||
callback does not raise an exception (either the one passed in, or
|
||||
a different one), it must return a tuple:
|
||||
|
||||
(replacement, newpos)
|
||||
|
||||
replacement is a unicode object that the encoder will encode and
|
||||
emit instead of the unencodable object[start:end] part, newpos
|
||||
specifies a new position within object, where (after encoding the
|
||||
replacement) the encoder will continue encoding.
|
||||
|
||||
If the replacement string itself contains an unencodable character
|
||||
the encoder raises the exception object (but may set a different
|
||||
reason string before raising).
|
||||
|
||||
Should further encoding errors occur, the encoder is allowed to
|
||||
reuse the exception object for the next call to the callback.
|
||||
Furthermore the encoder is allowed to cache the result of
|
||||
codecs.lookup_error.
|
||||
|
||||
If the callback does not know how to handle the exception, it must
|
||||
raise a TypeError.
|
||||
|
||||
Decoding works similar to encoding with the following differences:
|
||||
The exception class is named UnicodeDecodeError and the attribute
|
||||
object is the original 8bit string that the decoder is currently
|
||||
decoding.
|
||||
|
||||
The decoder will call the callback with those bytes that
|
||||
constitute one undecodable sequence, even if there is more than
|
||||
one undecodable sequence that is undecodable for the same reason
|
||||
directly after the first one. E.g. for the "unicode-escape"
|
||||
encoding, when decoding the illegal string "\\u00\\u01x", the
|
||||
callback will be called twice (once for "\\u00" and once for
|
||||
"\\u01"). This is done to be able to generate the correct number
|
||||
of replacement characters.
|
||||
|
||||
The replacement returned from the callback is a unicode object
|
||||
that will be emitted by the decoder as-is without further
|
||||
processing instead of the undecodable object[start:end] part.
|
||||
|
||||
There is a third API that uses the old strict/ignore/replace error
|
||||
handling scheme:
|
||||
|
||||
PyUnicode_TranslateCharmap/unicode.translate
|
||||
|
||||
The proposed patch will enhance PyUnicode_TranslateCharmap, so
|
||||
that it also supports the callback registry. This has the
|
||||
additional side effect that PyUnicode_TranslateCharmap will
|
||||
support multi-character replacement strings (see SF feature
|
||||
request #403100 [1]).
|
||||
|
||||
For PyUnicode_TranslateCharmap the exception class will be named
|
||||
UnicodeTranslateError. PyUnicode_TranslateCharmap will collect
|
||||
all consecutive untranslatable characters (i.e. those that map to
|
||||
None) and call the callback with them. The replacement returned
|
||||
from the callback is a unicode object that will be put in the
|
||||
translated result as-is, without further processing.
|
||||
|
||||
All encoders and decoders are allowed to implement the callback
|
||||
functionality themselves, if they recognize the callback name
|
||||
(i.e. if it is a system callback like "strict", "replace" and
|
||||
"ignore"). The proposed patch will add two additional system
|
||||
callback names: "backslashreplace" and "xmlcharrefreplace", which
|
||||
can be used for encoding and translating and which will also be
|
||||
implemented in-place for all encoders and
|
||||
PyUnicode_TranslateCharmap.
|
||||
|
||||
The Python equivalent of these five callbacks will look like this:
|
||||
|
||||
def strict(exc):
|
||||
raise exc
|
||||
|
||||
def ignore(exc):
|
||||
if isinstance(exc, UnicodeError):
|
||||
return (u"", exc.end)
|
||||
else:
|
||||
raise TypeError("can't handle %s" % exc.__name__)
|
||||
|
||||
def replace(exc):
|
||||
if isinstance(exc, UnicodeEncodeError):
|
||||
return ((exc.end-exc.start)*u"?", exc.end)
|
||||
elif isinstance(exc, UnicodeDecodeError):
|
||||
return (u"\\ufffd", exc.end)
|
||||
elif isinstance(exc, UnicodeTranslateError):
|
||||
return ((exc.end-exc.start)*u"\\ufffd", exc.end)
|
||||
else:
|
||||
raise TypeError("can't handle %s" % exc.__name__)
|
||||
|
||||
def backslashreplace(exc):
|
||||
if isinstance(exc,
|
||||
(UnicodeEncodeError, UnicodeTranslateError)):
|
||||
s = u""
|
||||
for c in exc.object[exc.start:exc.end]:
|
||||
if ord(c)<=0xff:
|
||||
s += u"\\x%02x" % ord(c)
|
||||
elif ord(c)<=0xffff:
|
||||
s += u"\\u%04x" % ord(c)
|
||||
else:
|
||||
s += u"\\U%08x" % ord(c)
|
||||
return (s, exc.end)
|
||||
else:
|
||||
raise TypeError("can't handle %s" % exc.__name__)
|
||||
|
||||
def xmlcharrefreplace(exc):
|
||||
if isinstance(exc,
|
||||
(UnicodeEncodeError, UnicodeTranslateError)):
|
||||
s = u""
|
||||
for c in exc.object[exc.start:exc.end]:
|
||||
s += u"&#%d;" % ord(c)
|
||||
return (s, exc.end)
|
||||
else:
|
||||
raise TypeError("can't handle %s" % exc.__name__)
|
||||
|
||||
These five callback handlers will also be accessible to Python as
|
||||
codecs.strict_error, codecs.ignore_error, codecs.replace_error,
|
||||
codecs.backslashreplace_error and codecs.xmlcharrefreplace_error.
|
||||
|
||||
|
||||
Rationale
|
||||
|
||||
Most legacy encoding do not support the full range of Unicode
|
||||
characters. For these cases many high level protocols support a
|
||||
way of escaping a Unicode character (e.g. Python itself supports
|
||||
the \x, \u and \U convention, XML supports character references
|
||||
via &#xxx; etc.).
|
||||
|
||||
When implementing such an encoding algorithm, a problem with the
|
||||
current implementation of the encode method of Unicode objects
|
||||
becomes apparent: For determining which characters are unencodable
|
||||
by a certain encoding, every single character has to be tried,
|
||||
because encode does not provide any information about the location
|
||||
of the error(s), so
|
||||
|
||||
# (1)
|
||||
us = u"xxx"
|
||||
s = us.encode(encoding)
|
||||
|
||||
has to be replaced by
|
||||
|
||||
# (2)
|
||||
us = u"xxx"
|
||||
v = []
|
||||
for c in us:
|
||||
try:
|
||||
v.append(c.encode(encoding))
|
||||
except UnicodeError:
|
||||
v.append("&#%d;" % ord(c))
|
||||
s = "".join(v)
|
||||
|
||||
This slows down encoding dramatically as now the loop through the
|
||||
string is done in Python code and no longer in C code.
|
||||
|
||||
Furthermore this solution poses problems with stateful encodings.
|
||||
For example UTF-16 uses a Byte Order Mark at the start of the
|
||||
encoded byte string to specify the byte order. Using (2) with
|
||||
UTF-16, results in an 8 bit string with a BOM between every
|
||||
character.
|
||||
|
||||
To work around this problem, a stream writer - which keeps state
|
||||
between calls to the encoding function - has to be used:
|
||||
|
||||
# (3)
|
||||
us = u"xxx"
|
||||
import codecs, cStringIO as StringIO
|
||||
writer = codecs.getwriter(encoding)
|
||||
|
||||
v = StringIO.StringIO()
|
||||
uv = writer(v)
|
||||
for c in us:
|
||||
try:
|
||||
uv.write(c)
|
||||
except UnicodeError:
|
||||
uv.write(u"&#%d;" % ord(c))
|
||||
s = v.getvalue()
|
||||
|
||||
To compare the speed of (1) and (3) the following test script has
|
||||
been used:
|
||||
|
||||
# (4)
|
||||
import time
|
||||
us = u"äa"*1000000
|
||||
encoding = "ascii"
|
||||
import codecs, cStringIO as StringIO
|
||||
|
||||
t1 = time.time()
|
||||
|
||||
s1 = us.encode(encoding, "replace")
|
||||
|
||||
t2 = time.time()
|
||||
|
||||
writer = codecs.getwriter(encoding)
|
||||
|
||||
v = StringIO.StringIO()
|
||||
uv = writer(v)
|
||||
for c in us:
|
||||
try:
|
||||
uv.write(c)
|
||||
except UnicodeError:
|
||||
uv.write(u"?")
|
||||
s2 = v.getvalue()
|
||||
|
||||
t3 = time.time()
|
||||
|
||||
assert(s1==s2)
|
||||
print "1:", t2-t1
|
||||
print "2:", t3-t2
|
||||
print "factor:", (t3-t2)/(t2-t1)
|
||||
|
||||
On Linux this gives the following output (with Python 2.3a0):
|
||||
|
||||
1: 0.274321913719
|
||||
2: 51.1284689903
|
||||
factor: 186.381278466
|
||||
|
||||
i.e. (3) is 180 times slower than (1).
|
||||
|
||||
Codecs must be stateless, because as soon as a callback is
|
||||
registered it is available globally and can be called by multiple
|
||||
encode() calls. To be able to use stateful callbacks, the errors
|
||||
parameter for encode/decode/translate would have to be changed
|
||||
from char * to PyObject *, so that the callback could be used
|
||||
directly, without the need to register the callback globally. As
|
||||
this requires changes to lots of C prototypes, this approach was
|
||||
rejected.
|
||||
|
||||
Currently all encoding/decoding functions have arguments
|
||||
|
||||
const Py_UNICODE *p, int size
|
||||
|
||||
or
|
||||
|
||||
const char *p, int size
|
||||
|
||||
to specify the unicode characters/8bit characters to be
|
||||
encoded/decoded. So in case of an error the codec has to create a
|
||||
new unicode or str object from these parameters and store it in
|
||||
the exception object. The callers of these encoding/decoding
|
||||
functions extract these parameters from str/unicode objects
|
||||
themselves most of the time, so it could speed up error handling
|
||||
if these object were passed directly. As this again requires
|
||||
changes to many C functions, this approach has been rejected.
|
||||
|
||||
|
||||
Implementation Notes
|
||||
|
||||
A sample implementation is available as SourceForge patch #432401
|
||||
[2]. The current version of this patch differs from the
|
||||
specification in the following way:
|
||||
|
||||
* The error information is passed from the codec to the callback
|
||||
not as an exception object, but as a tuple, which has an
|
||||
additional entry state, which can be used for additional
|
||||
information the codec might want to pass to the callback.
|
||||
* There are two separate registries (one for
|
||||
encoding/translating and one for decoding)
|
||||
|
||||
The class codecs.StreamReaderWriter uses the errors parameter for
|
||||
both reading and writing. To be more flexible this should
|
||||
probably be changed to two separate parameters for reading and
|
||||
writing.
|
||||
|
||||
The errors parameter of PyUnicode_TranslateCharmap is not
|
||||
availably to Python, which makes testing of the new functionality
|
||||
of PyUnicode_TranslateCharmap impossible with Python scripts. The
|
||||
patch should add an optional argument errors to unicode.translate
|
||||
to expose the functionality and make testing possible.
|
||||
|
||||
Codecs that do something different than encoding/decoding from/to
|
||||
unicode and want to use the new machinery can define their own
|
||||
exception classes and the strict handlers will automatically work
|
||||
with it. The other predefined error handlers are unicode specific
|
||||
and expect to get a Unicode(Encode|Decode|Translate)Error
|
||||
exception object so they won't work.
|
||||
|
||||
|
||||
Backwards Compatibility
|
||||
|
||||
The semantics of unicode.encode with errors="replace" has changed:
|
||||
The old version always stored a ? character in the output string
|
||||
even if no character was mapped to ? in the mapping. With the
|
||||
proposed patch, the replacement string from the callback callback
|
||||
will again be looked up in the mapping dictionary. But as all
|
||||
supported encodings are ASCII based, and thus map ? to ?, this
|
||||
should not be a problem in practice.
|
||||
|
||||
|
||||
References
|
||||
|
||||
[1] SF feature request #403100
|
||||
"Multicharacter replacements in PyUnicode_TranslateCharmap"
|
||||
http://www.python.org/sf/403100
|
||||
|
||||
[2] SF patch #432401 "unicode encoding error callbacks"
|
||||
http://www.python.org/sf/432401
|
||||
|
||||
|
||||
Copyright
|
||||
|
||||
This document has been placed in the public domain.
|
||||
|
||||
|
||||
|
||||
Local Variables:
|
||||
mode: indented-text
|
||||
indent-tabs-mode: nil
|
||||
sentence-end-double-space: t
|
||||
fill-column: 70
|
||||
End:
|
Loading…
Reference in New Issue