422 lines
16 KiB
Plaintext
422 lines
16 KiB
Plaintext
PEP: 293
|
||
Title: Codec Error Handling Callbacks
|
||
Version: $Revision$
|
||
Last-Modified: $Date$
|
||
Author: Walter Dörwald
|
||
Status: Final
|
||
Type: Standards Track
|
||
Created: 18-Jun-2002
|
||
Python-Version: 2.3
|
||
Post-History: 19-Jun-2002
|
||
|
||
|
||
Abstract
|
||
|
||
This PEP aims at extending Python's fixed codec error handling
|
||
schemes with a more flexible callback based approach.
|
||
|
||
Python currently uses a fixed error handling for codec error
|
||
handlers. This PEP describes a mechanism which allows Python to
|
||
use function callbacks as error handlers. With these more
|
||
flexible error handlers it is possible to add new functionality to
|
||
existing codecs by e.g. providing fallback solutions or different
|
||
encodings for cases where the standard codec mapping does not
|
||
apply.
|
||
|
||
|
||
Specification
|
||
|
||
Currently the set of codec error handling algorithms is fixed to
|
||
either "strict", "replace" or "ignore" and the semantics of these
|
||
algorithms is implemented separately for each codec.
|
||
|
||
The proposed patch will make the set of error handling algorithms
|
||
extensible through a codec error handler registry which maps
|
||
handler names to handler functions. This registry consists of the
|
||
following two C functions:
|
||
|
||
int PyCodec_RegisterError(const char *name, PyObject *error)
|
||
|
||
PyObject *PyCodec_LookupError(const char *name)
|
||
|
||
and their Python counterparts
|
||
|
||
codecs.register_error(name, error)
|
||
|
||
codecs.lookup_error(name)
|
||
|
||
PyCodec_LookupError raises a LookupError if no callback function
|
||
has been registered under this name.
|
||
|
||
Similar to the encoding name registry there is no way of
|
||
unregistering callback functions or iterating through the
|
||
available functions.
|
||
|
||
The callback functions will be used in the following way by the
|
||
codecs: when the codec encounters an encoding/decoding error, the
|
||
callback function is looked up by name, the information about the
|
||
error is stored in an exception object and the callback is called
|
||
with this object. The callback returns information about how to
|
||
proceed (or raises an exception).
|
||
|
||
For encoding, the exception object will look like this:
|
||
|
||
class UnicodeEncodeError(UnicodeError):
|
||
def __init__(self, encoding, object, start, end, reason):
|
||
UnicodeError.__init__(self,
|
||
"encoding '%s' can't encode characters " +
|
||
"in positions %d-%d: %s" % (encoding,
|
||
start, end-1, reason))
|
||
self.encoding = encoding
|
||
self.object = object
|
||
self.start = start
|
||
self.end = end
|
||
self.reason = reason
|
||
|
||
This type will be implemented in C with the appropriate setter and
|
||
getter methods for the attributes, which have the following
|
||
meaning:
|
||
|
||
* encoding: The name of the encoding;
|
||
* object: The original unicode object for which encode() has
|
||
been called;
|
||
* start: The position of the first unencodable character;
|
||
* end: (The position of the last unencodable character)+1 (or
|
||
the length of object, if all characters from start to the end
|
||
of object are unencodable);
|
||
* reason: The reason why object[start:end] couldn't be encoded.
|
||
|
||
If object has consecutive unencodable characters, the encoder
|
||
should collect those characters for one call to the callback if
|
||
those characters can't be encoded for the same reason. The
|
||
encoder is not required to implement this behaviour but may call
|
||
the callback for every single character, but it is strongly
|
||
suggested that the collecting method is implemented.
|
||
|
||
The callback must not modify the exception object. If the
|
||
callback does not raise an exception (either the one passed in, or
|
||
a different one), it must return a tuple:
|
||
|
||
(replacement, newpos)
|
||
|
||
replacement is a unicode object that the encoder will encode and
|
||
emit instead of the unencodable object[start:end] part, newpos
|
||
specifies a new position within object, where (after encoding the
|
||
replacement) the encoder will continue encoding.
|
||
|
||
Negative values for newpos are treated as being relative to
|
||
end of object. If newpos is out of bounds the encoder will raise
|
||
an IndexError.
|
||
|
||
If the replacement string itself contains an unencodable character
|
||
the encoder raises the exception object (but may set a different
|
||
reason string before raising).
|
||
|
||
Should further encoding errors occur, the encoder is allowed to
|
||
reuse the exception object for the next call to the callback.
|
||
Furthermore the encoder is allowed to cache the result of
|
||
codecs.lookup_error.
|
||
|
||
If the callback does not know how to handle the exception, it must
|
||
raise a TypeError.
|
||
|
||
Decoding works similar to encoding with the following differences:
|
||
The exception class is named UnicodeDecodeError and the attribute
|
||
object is the original 8bit string that the decoder is currently
|
||
decoding.
|
||
|
||
The decoder will call the callback with those bytes that
|
||
constitute one undecodable sequence, even if there is more than
|
||
one undecodable sequence that is undecodable for the same reason
|
||
directly after the first one. E.g. for the "unicode-escape"
|
||
encoding, when decoding the illegal string "\\u00\\u01x", the
|
||
callback will be called twice (once for "\\u00" and once for
|
||
"\\u01"). This is done to be able to generate the correct number
|
||
of replacement characters.
|
||
|
||
The replacement returned from the callback is a unicode object
|
||
that will be emitted by the decoder as-is without further
|
||
processing instead of the undecodable object[start:end] part.
|
||
|
||
There is a third API that uses the old strict/ignore/replace error
|
||
handling scheme:
|
||
|
||
PyUnicode_TranslateCharmap/unicode.translate
|
||
|
||
The proposed patch will enhance PyUnicode_TranslateCharmap, so
|
||
that it also supports the callback registry. This has the
|
||
additional side effect that PyUnicode_TranslateCharmap will
|
||
support multi-character replacement strings (see SF feature
|
||
request #403100 [1]).
|
||
|
||
For PyUnicode_TranslateCharmap the exception class will be named
|
||
UnicodeTranslateError. PyUnicode_TranslateCharmap will collect
|
||
all consecutive untranslatable characters (i.e. those that map to
|
||
None) and call the callback with them. The replacement returned
|
||
from the callback is a unicode object that will be put in the
|
||
translated result as-is, without further processing.
|
||
|
||
All encoders and decoders are allowed to implement the callback
|
||
functionality themselves, if they recognize the callback name
|
||
(i.e. if it is a system callback like "strict", "replace" and
|
||
"ignore"). The proposed patch will add two additional system
|
||
callback names: "backslashreplace" and "xmlcharrefreplace", which
|
||
can be used for encoding and translating and which will also be
|
||
implemented in-place for all encoders and
|
||
PyUnicode_TranslateCharmap.
|
||
|
||
The Python equivalent of these five callbacks will look like this:
|
||
|
||
def strict(exc):
|
||
raise exc
|
||
|
||
def ignore(exc):
|
||
if isinstance(exc, UnicodeError):
|
||
return (u"", exc.end)
|
||
else:
|
||
raise TypeError("can't handle %s" % exc.__name__)
|
||
|
||
def replace(exc):
|
||
if isinstance(exc, UnicodeEncodeError):
|
||
return ((exc.end-exc.start)*u"?", exc.end)
|
||
elif isinstance(exc, UnicodeDecodeError):
|
||
return (u"\\ufffd", exc.end)
|
||
elif isinstance(exc, UnicodeTranslateError):
|
||
return ((exc.end-exc.start)*u"\\ufffd", exc.end)
|
||
else:
|
||
raise TypeError("can't handle %s" % exc.__name__)
|
||
|
||
def backslashreplace(exc):
|
||
if isinstance(exc,
|
||
(UnicodeEncodeError, UnicodeTranslateError)):
|
||
s = u""
|
||
for c in exc.object[exc.start:exc.end]:
|
||
if ord(c)<=0xff:
|
||
s += u"\\x%02x" % ord(c)
|
||
elif ord(c)<=0xffff:
|
||
s += u"\\u%04x" % ord(c)
|
||
else:
|
||
s += u"\\U%08x" % ord(c)
|
||
return (s, exc.end)
|
||
else:
|
||
raise TypeError("can't handle %s" % exc.__name__)
|
||
|
||
def xmlcharrefreplace(exc):
|
||
if isinstance(exc,
|
||
(UnicodeEncodeError, UnicodeTranslateError)):
|
||
s = u""
|
||
for c in exc.object[exc.start:exc.end]:
|
||
s += u"&#%d;" % ord(c)
|
||
return (s, exc.end)
|
||
else:
|
||
raise TypeError("can't handle %s" % exc.__name__)
|
||
|
||
These five callback handlers will also be accessible to Python as
|
||
codecs.strict_error, codecs.ignore_error, codecs.replace_error,
|
||
codecs.backslashreplace_error and codecs.xmlcharrefreplace_error.
|
||
|
||
|
||
Rationale
|
||
|
||
Most legacy encoding do not support the full range of Unicode
|
||
characters. For these cases many high level protocols support a
|
||
way of escaping a Unicode character (e.g. Python itself supports
|
||
the \x, \u and \U convention, XML supports character references
|
||
via &#xxx; etc.).
|
||
|
||
When implementing such an encoding algorithm, a problem with the
|
||
current implementation of the encode method of Unicode objects
|
||
becomes apparent: For determining which characters are unencodable
|
||
by a certain encoding, every single character has to be tried,
|
||
because encode does not provide any information about the location
|
||
of the error(s), so
|
||
|
||
# (1)
|
||
us = u"xxx"
|
||
s = us.encode(encoding)
|
||
|
||
has to be replaced by
|
||
|
||
# (2)
|
||
us = u"xxx"
|
||
v = []
|
||
for c in us:
|
||
try:
|
||
v.append(c.encode(encoding))
|
||
except UnicodeError:
|
||
v.append("&#%d;" % ord(c))
|
||
s = "".join(v)
|
||
|
||
This slows down encoding dramatically as now the loop through the
|
||
string is done in Python code and no longer in C code.
|
||
|
||
Furthermore this solution poses problems with stateful encodings.
|
||
For example UTF-16 uses a Byte Order Mark at the start of the
|
||
encoded byte string to specify the byte order. Using (2) with
|
||
UTF-16, results in an 8 bit string with a BOM between every
|
||
character.
|
||
|
||
To work around this problem, a stream writer - which keeps state
|
||
between calls to the encoding function - has to be used:
|
||
|
||
# (3)
|
||
us = u"xxx"
|
||
import codecs, cStringIO as StringIO
|
||
writer = codecs.getwriter(encoding)
|
||
|
||
v = StringIO.StringIO()
|
||
uv = writer(v)
|
||
for c in us:
|
||
try:
|
||
uv.write(c)
|
||
except UnicodeError:
|
||
uv.write(u"&#%d;" % ord(c))
|
||
s = v.getvalue()
|
||
|
||
To compare the speed of (1) and (3) the following test script has
|
||
been used:
|
||
|
||
# (4)
|
||
import time
|
||
us = u"äa"*1000000
|
||
encoding = "ascii"
|
||
import codecs, cStringIO as StringIO
|
||
|
||
t1 = time.time()
|
||
|
||
s1 = us.encode(encoding, "replace")
|
||
|
||
t2 = time.time()
|
||
|
||
writer = codecs.getwriter(encoding)
|
||
|
||
v = StringIO.StringIO()
|
||
uv = writer(v)
|
||
for c in us:
|
||
try:
|
||
uv.write(c)
|
||
except UnicodeError:
|
||
uv.write(u"?")
|
||
s2 = v.getvalue()
|
||
|
||
t3 = time.time()
|
||
|
||
assert(s1==s2)
|
||
print "1:", t2-t1
|
||
print "2:", t3-t2
|
||
print "factor:", (t3-t2)/(t2-t1)
|
||
|
||
On Linux this gives the following output (with Python 2.3a0):
|
||
|
||
1: 0.274321913719
|
||
2: 51.1284689903
|
||
factor: 186.381278466
|
||
|
||
i.e. (3) is 180 times slower than (1).
|
||
|
||
Callbacks must be stateless, because as soon as a callback is
|
||
registered it is available globally and can be called by multiple
|
||
encode() calls. To be able to use stateful callbacks, the errors
|
||
parameter for encode/decode/translate would have to be changed
|
||
from char * to PyObject *, so that the callback could be used
|
||
directly, without the need to register the callback globally. As
|
||
this requires changes to lots of C prototypes, this approach was
|
||
rejected.
|
||
|
||
Currently all encoding/decoding functions have arguments
|
||
|
||
const Py_UNICODE *p, int size
|
||
|
||
or
|
||
|
||
const char *p, int size
|
||
|
||
to specify the unicode characters/8bit characters to be
|
||
encoded/decoded. So in case of an error the codec has to create a
|
||
new unicode or str object from these parameters and store it in
|
||
the exception object. The callers of these encoding/decoding
|
||
functions extract these parameters from str/unicode objects
|
||
themselves most of the time, so it could speed up error handling
|
||
if these object were passed directly. As this again requires
|
||
changes to many C functions, this approach has been rejected.
|
||
|
||
For stream readers/writers the errors attribute must be changeable
|
||
to be able to switch between different error handling methods
|
||
during the lifetime of the stream reader/writer. This is currently
|
||
the case for codecs.StreamReader and codecs.StreamWriter and
|
||
all their subclasses. All core codecs and probably most of the
|
||
third party codecs (e.g. JapaneseCodecs) derive their stream
|
||
readers/writers from these classes so this already works,
|
||
but the attribute errors should be documented as a requirement.
|
||
|
||
|
||
Implementation Notes
|
||
|
||
A sample implementation is available as SourceForge patch #432401
|
||
[2] including a script for testing the speed of various
|
||
string/encoding/error combinations and a test script.
|
||
|
||
Currently the new exception classes are old style Python
|
||
classes. This means that accessing attributes results
|
||
in a dict lookup. The C API is implemented in a way
|
||
that makes it possible to switch to new style classes
|
||
behind the scene, if Exception (and UnicodeError) will
|
||
be changed to new style classes implemented in C for
|
||
improved performance.
|
||
|
||
The class codecs.StreamReaderWriter uses the errors parameter for
|
||
both reading and writing. To be more flexible this should
|
||
probably be changed to two separate parameters for reading and
|
||
writing.
|
||
|
||
The errors parameter of PyUnicode_TranslateCharmap is not
|
||
availably to Python, which makes testing of the new functionality
|
||
of PyUnicode_TranslateCharmap impossible with Python scripts. The
|
||
patch should add an optional argument errors to unicode.translate
|
||
to expose the functionality and make testing possible.
|
||
|
||
Codecs that do something different than encoding/decoding from/to
|
||
unicode and want to use the new machinery can define their own
|
||
exception classes and the strict handlers will automatically work
|
||
with it. The other predefined error handlers are unicode specific
|
||
and expect to get a Unicode(Encode|Decode|Translate)Error
|
||
exception object so they won't work.
|
||
|
||
|
||
Backwards Compatibility
|
||
|
||
The semantics of unicode.encode with errors="replace" has changed:
|
||
The old version always stored a ? character in the output string
|
||
even if no character was mapped to ? in the mapping. With the
|
||
proposed patch, the replacement string from the callback will
|
||
again be looked up in the mapping dictionary. But as all
|
||
supported encodings are ASCII based, and thus map ? to ?, this
|
||
should not be a problem in practice.
|
||
|
||
Illegal values for the errors argument raised ValueError before,
|
||
now they will raise LookupError.
|
||
|
||
|
||
References
|
||
|
||
[1] SF feature request #403100
|
||
"Multicharacter replacements in PyUnicode_TranslateCharmap"
|
||
http://www.python.org/sf/403100
|
||
|
||
[2] SF patch #432401 "unicode encoding error callbacks"
|
||
http://www.python.org/sf/432401
|
||
|
||
|
||
Copyright
|
||
|
||
This document has been placed in the public domain.
|
||
|
||
|
||
|
||
Local Variables:
|
||
mode: indented-text
|
||
indent-tabs-mode: nil
|
||
sentence-end-double-space: t
|
||
fill-column: 70
|
||
End:
|