PEP: 293
Title: Codec Error Handling Callbacks
Version: $Revision$
Last-Modified: $Date$
Author: Walter Dörwald
Status: Draft
Type: Standards Track
Created: 18-Jun-2002
Python-Version: 2.3
Post-History: 19-Jun-2002


Abstract

This PEP aims at extending Python's fixed codec error handling
schemes with a more flexible callback-based approach.

Python currently offers only a fixed set of codec error handling
schemes. This PEP describes a mechanism which allows Python to
use function callbacks as error handlers. With these more
flexible error handlers it is possible to add new functionality to
existing codecs, e.g. by providing fallback solutions or different
encodings for cases where the standard codec mapping does not
apply.


Specification

Currently the set of codec error handling algorithms is fixed to
either "strict", "replace" or "ignore" and the semantics of these
algorithms are implemented separately for each codec.

The proposed patch will make the set of error handling algorithms
extensible through a codec error handler registry which maps
handler names to handler functions. This registry consists of the
following two C functions:

    int PyCodec_RegisterError(const char *name, PyObject *error)

    PyObject *PyCodec_LookupError(const char *name)

and their Python counterparts

    codecs.register_error(name, error)

    codecs.lookup_error(name)

PyCodec_LookupError raises a LookupError if no callback function
has been registered under the given name.

Similar to the encoding name registry, there is no way of
unregistering callback functions or iterating through the
available functions.

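As an illustration of the registry, the following sketch registers a
handler and looks it up again. It is written for Python 3, where str
plays the role of the unicode type. The handler question_marks and the
names "demo.question_marks" and "demo.no_such_handler" are made up for
this example and are not part of the proposal.

```python
import codecs


def question_marks(exc):
    """Hypothetical handler: one '?' per unencodable character."""
    if isinstance(exc, UnicodeEncodeError):
        return ("?" * (exc.end - exc.start), exc.end)
    raise TypeError("can't handle %s" % type(exc).__name__)


# Register the handler under an illustrative name.
codecs.register_error("demo.question_marks", question_marks)

# lookup_error returns the object that was registered under the name.
found = codecs.lookup_error("demo.question_marks")

# Looking up a name that was never registered raises LookupError.
try:
    codecs.lookup_error("demo.no_such_handler")
    missing_raises = False
except LookupError:
    missing_raises = True

# The registered name can now be passed as the errors argument.
encoded = "a\u20acb".encode("ascii", "demo.question_marks")
```

Because the registry is global and offers no way to unregister, a
prefixed name such as the one above helps avoid clashes with handlers
registered elsewhere.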
The callback functions will be used in the following way by the
codecs: when the codec encounters an encoding/decoding error, the
callback function is looked up by name, the information about the
error is stored in an exception object and the callback is called
with this object. The callback returns information about how to
proceed (or raises an exception).

For encoding, the exception object will look like this:

    class UnicodeEncodeError(UnicodeError):
        def __init__(self, encoding, object, start, end, reason):
            # Adjacent string literals are joined before % is
            # applied, so the whole message is formatted.
            UnicodeError.__init__(self,
                "encoding '%s' can't encode characters "
                "in positions %d-%d: %s" % (encoding,
                    start, end-1, reason))
            self.encoding = encoding
            self.object = object
            self.start = start
            self.end = end
            self.reason = reason

This type will be implemented in C with the appropriate setter and
getter methods for the attributes, which have the following
meaning:

* encoding: The name of the encoding;
* object: The original unicode object for which encode() has
  been called;
* start: The position of the first unencodable character;
* end: (The position of the last unencodable character)+1 (or
  the length of object, if all characters from start to the end
  of object are unencodable);
* reason: The reason why object[start:end] couldn't be encoded.

If object has consecutive unencodable characters, the encoder
should collect those characters for one call to the callback if
those characters can't be encoded for the same reason. The
encoder is not required to implement this behaviour and may
instead call the callback for every single character, but
implementing the collecting method is strongly suggested.

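The attributes described above can be inspected on a real encoding
failure. The sketch below is written for Python 3 and assumes the
ASCII encoder of current CPython, which collects a run of consecutive
unencodable characters into a single exception; other encoders may
report each character separately.

```python
# "a", two non-ASCII characters, "b": the failing run is positions 1-2.
text = "a\xe4\xf6b"

try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    # Copy the attributes out; exc is unbound when the block ends.
    info = {
        "encoding": exc.encoding,
        "object": exc.object,
        "start": exc.start,
        "end": exc.end,
        "reason": exc.reason,
    }

# The slice object[start:end] is exactly the unencodable part.
unencodable = info["object"][info["start"]:info["end"]]
```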
The callback must not modify the exception object. If the
callback does not raise an exception (either the one passed in, or
a different one), it must return a tuple:

    (replacement, newpos)

replacement is a unicode object that the encoder will encode and
emit instead of the unencodable object[start:end] part; newpos
specifies a new position within object, where (after encoding the
replacement) the encoder will continue encoding.

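A minimal sketch of a callback that follows this protocol, written
for Python 3. The handler bracket_codepoints and the name
"demo.bracket" are hypothetical; the handler builds a replacement
from the unencodable slice and asks the encoder to resume at exc.end.

```python
import codecs


def bracket_codepoints(exc):
    """Hypothetical handler: emit '[U+XXXX]' per unencodable character."""
    if not isinstance(exc, UnicodeEncodeError):
        raise TypeError("can't handle %s" % type(exc).__name__)
    bad = exc.object[exc.start:exc.end]
    replacement = "".join("[U+%04X]" % ord(c) for c in bad)
    # newpos == exc.end: continue right after the unencodable run.
    return (replacement, exc.end)


codecs.register_error("demo.bracket", bracket_codepoints)

result = "x\u20acy".encode("ascii", "demo.bracket")
```

The replacement consists of ASCII characters only, so the encoder can
encode it without triggering a further error.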
If the replacement string itself contains an unencodable character,
the encoder raises the exception object (but may set a different
reason string before raising).

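The rule for unencodable replacements can be observed with a
deliberately faulty handler. This is a sketch for Python 3 using the
ASCII encoder; the handler bad_replacement and the name
"demo.bad_replacement" are invented for the example.

```python
import codecs


def bad_replacement(exc):
    # Hypothetical handler: the euro sign cannot be encoded as ASCII,
    # so the replacement is itself unencodable.
    return ("\u20ac", exc.end)


codecs.register_error("demo.bad_replacement", bad_replacement)

try:
    "\xe4".encode("ascii", "demo.bad_replacement")
    raised = False
except UnicodeEncodeError:
    # The encoder does not call the handler again for the
    # replacement; it raises instead.
    raised = True
```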
Should further encoding errors occur, the encoder is allowed to
reuse the exception object for the next call to the callback.
Furthermore the encoder is allowed to cache the result of
codecs.lookup_error.

If the callback does not know how to handle the exception, it must
raise a TypeError.

Decoding works similarly to encoding, with the following
differences: the exception class is named UnicodeDecodeError and
the attribute object is the original 8-bit string that the decoder
is currently decoding.

The decoder will call the callback with those bytes that
constitute one undecodable sequence, even if there is more than
one undecodable sequence that is undecodable for the same reason
directly after the first one. E.g. for the "unicode-escape"
encoding, when decoding the illegal string "\\u00\\u01x", the
callback will be called twice (once for "\\u00" and once for
"\\u01"). This is done to be able to generate the correct number
of replacement characters.

The replacement returned from the callback is a unicode object
that will be emitted by the decoder as-is without further
processing instead of the undecodable object[start:end] part.

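A decoding callback can be sketched in the same way. The example is
written for Python 3, where the 8-bit string is a bytes object, so
exc.object[exc.start:exc.end] yields bytes. The handler
hex_placeholder and the name "demo.hex_placeholder" are hypothetical.

```python
import codecs


def hex_placeholder(exc):
    """Hypothetical handler: show each undecodable byte as '<XX>'."""
    if not isinstance(exc, UnicodeDecodeError):
        raise TypeError("can't handle %s" % type(exc).__name__)
    bad = exc.object[exc.start:exc.end]   # undecodable bytes
    # Iterating over bytes yields integers in Python 3.
    replacement = "".join("<%02X>" % b for b in bad)
    return (replacement, exc.end)


codecs.register_error("demo.hex_placeholder", hex_placeholder)

text = b"a\xffb".decode("ascii", "demo.hex_placeholder")
```

The returned string goes into the decoded result unchanged, which is
why a decoding callback may return any text it likes.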
There is a third API that uses the old strict/ignore/replace error
handling scheme:

    PyUnicode_TranslateCharmap/unicode.translate

The proposed patch will enhance PyUnicode_TranslateCharmap, so
that it also supports the callback registry. This has the
additional side effect that PyUnicode_TranslateCharmap will
support multi-character replacement strings (see SF feature
request #403100 [1]).

For PyUnicode_TranslateCharmap the exception class will be named
UnicodeTranslateError. PyUnicode_TranslateCharmap will collect
all consecutive untranslatable characters (i.e. those that map to
None) and call the callback with them. The replacement returned
from the callback is a unicode object that will be put in the
translated result as-is, without further processing.

All encoders and decoders are allowed to implement the callback
functionality themselves, if they recognize the callback name
(i.e. if it is a system callback like "strict", "replace" and
"ignore"). The proposed patch will add two additional system
callback names: "backslashreplace" and "xmlcharrefreplace", which
can be used for encoding and translating and which will also be
implemented in-place for all encoders and
PyUnicode_TranslateCharmap.

The Python equivalent of these five callbacks will look like this:

    def strict(exc):
        raise exc

    def ignore(exc):
        if isinstance(exc, UnicodeError):
            return (u"", exc.end)
        else:
            raise TypeError("can't handle %s" % exc.__class__.__name__)

    def replace(exc):
        if isinstance(exc, UnicodeEncodeError):
            return ((exc.end-exc.start)*u"?", exc.end)
        elif isinstance(exc, UnicodeDecodeError):
            return (u"\ufffd", exc.end)
        elif isinstance(exc, UnicodeTranslateError):
            return ((exc.end-exc.start)*u"\ufffd", exc.end)
        else:
            raise TypeError("can't handle %s" % exc.__class__.__name__)

    def backslashreplace(exc):
        if isinstance(exc,
            (UnicodeEncodeError, UnicodeTranslateError)):
            s = u""
            for c in exc.object[exc.start:exc.end]:
                if ord(c)<=0xff:
                    s += u"\\x%02x" % ord(c)
                elif ord(c)<=0xffff:
                    s += u"\\u%04x" % ord(c)
                else:
                    s += u"\\U%08x" % ord(c)
            return (s, exc.end)
        else:
            raise TypeError("can't handle %s" % exc.__class__.__name__)

    def xmlcharrefreplace(exc):
        if isinstance(exc,
            (UnicodeEncodeError, UnicodeTranslateError)):
            s = u""
            for c in exc.object[exc.start:exc.end]:
                s += u"&#%d;" % ord(c)
            return (s, exc.end)
        else:
            raise TypeError("can't handle %s" % exc.__class__.__name__)

These five callback handlers will also be accessible to Python as
codecs.strict_error, codecs.ignore_error, codecs.replace_error,
codecs.backslashreplace_error and codecs.xmlcharrefreplace_error.


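For comparison, the sketch below exercises the system callbacks by
name in Python 3. Note one difference from the names used in this
draft: released CPython exposes the handler objects with a plural
suffix (codecs.replace_errors and so on), which is what the last
line relies on.

```python
import codecs

text = "\xe4\u20ac"   # LATIN SMALL LETTER A WITH DIAERESIS, EURO SIGN

as_escapes = text.encode("ascii", "backslashreplace")
as_charrefs = text.encode("ascii", "xmlcharrefreplace")
as_question = text.encode("ascii", "replace")
as_nothing = text.encode("ascii", "ignore")

# The name "replace" resolves to the same object the codecs module
# exposes as replace_errors.
same_object = codecs.lookup_error("replace") is codecs.replace_errors
```

With these handlers, escaping unencodable characters takes a single
encode() call whose loop runs in C.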
Rationale

Most legacy encodings do not support the full range of Unicode
characters. For these cases many high level protocols support a
way of escaping a Unicode character (e.g. Python itself supports
the \x, \u and \U convention, XML supports character references
via &#xxx; etc.).

When implementing such an encoding algorithm, a problem with the
current implementation of the encode method of Unicode objects
becomes apparent: to determine which characters are unencodable
by a certain encoding, every single character has to be tried,
because encode does not provide any information about the location
of the error(s), so

    # (1)
    us = u"xxx"
    s = us.encode(encoding)

has to be replaced by

    # (2)
    us = u"xxx"
    v = []
    for c in us:
        try:
            v.append(c.encode(encoding))
        except UnicodeError:
            v.append("&#%d;" % ord(c))
    s = "".join(v)

This slows down encoding dramatically as now the loop through the
string is done in Python code and no longer in C code.

Furthermore this solution poses problems with stateful encodings.
For example UTF-16 uses a Byte Order Mark at the start of the
encoded byte string to specify the byte order. Using (2) with
UTF-16 results in an 8-bit string with a BOM between every
character.

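The BOM problem is easy to reproduce. The following sketch is written
for Python 3 (so the encoded pieces are bytes objects) and compares
character-by-character encoding, as in (2), with a single encode()
call.

```python
import codecs

us = "ab"

# Approach (2): encode every character on its own, then join.
piecewise = b"".join(c.encode("utf-16") for c in us)

# One call over the whole string emits the BOM only once.
whole = us.encode("utf-16")

# BOM in the platform's native byte order, as used by "utf-16".
bom = codecs.BOM_UTF16

bom_count_piecewise = piecewise.count(bom)
bom_count_whole = whole.count(bom)
```

Each separately encoded character carries its own two-byte BOM, so the
joined result is longer than the correct encoding and no longer valid
as a single UTF-16 stream.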
To work around this problem, a stream writer (which keeps state
between calls to the encoding function) has to be used:

    # (3)
    us = u"xxx"
    import codecs, cStringIO as StringIO
    writer = codecs.getwriter(encoding)

    v = StringIO.StringIO()
    uv = writer(v)
    for c in us:
        try:
            uv.write(c)
        except UnicodeError:
            uv.write(u"&#%d;" % ord(c))
    s = v.getvalue()

To compare the speed of (1) and (3) the following test script has
been used:

    # (4)
    import time
    us = u"äa"*1000000
    encoding = "ascii"
    import codecs, cStringIO as StringIO

    t1 = time.time()

    s1 = us.encode(encoding, "replace")

    t2 = time.time()

    writer = codecs.getwriter(encoding)

    v = StringIO.StringIO()
    uv = writer(v)
    for c in us:
        try:
            uv.write(c)
        except UnicodeError:
            uv.write(u"?")
    s2 = v.getvalue()

    t3 = time.time()

    assert(s1==s2)
    print "1:", t2-t1
    print "2:", t3-t2
    print "factor:", (t3-t2)/(t2-t1)

On Linux this gives the following output (with Python 2.3a0):

    1: 0.274321913719
    2: 51.1284689903
    factor: 186.381278466

i.e. (3) is more than 180 times slower than (1).

Callbacks must be stateless, because as soon as a callback is
registered it is available globally and can be called by multiple
encode() calls. To be able to use stateful callbacks, the errors
parameter for encode/decode/translate would have to be changed
from char * to PyObject *, so that the callback could be used
directly, without the need to register the callback globally. As
this requires changes to lots of C prototypes, this approach was
rejected.

Currently all encoding/decoding functions have arguments

    const Py_UNICODE *p, int size

or

    const char *p, int size

to specify the unicode characters/8-bit characters to be
encoded/decoded. So in case of an error the codec has to create a
new unicode or str object from these parameters and store it in
the exception object. The callers of these encoding/decoding
functions extract these parameters from str/unicode objects
themselves most of the time, so it could speed up error handling
if these objects were passed directly. As this again requires
changes to many C functions, this approach has been rejected.

For stream readers/writers the errors attribute must be changeable
to be able to switch between different error handling methods
during the lifetime of the stream reader/writer. This is currently
the case for codecs.StreamReader and codecs.StreamWriter and
all their subclasses. All core codecs and probably most of the
third-party codecs (e.g. JapaneseCodecs) derive their stream
readers/writers from these classes, so this already works,
but the attribute errors should be documented as a requirement.


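Switching the error handling of a live stream writer can be sketched
as follows. The example is written for Python 3 and uses io.BytesIO
as the underlying stream; it relies only on the errors attribute of
codecs.StreamWriter discussed above.

```python
import codecs
import io

buffer = io.BytesIO()
# getwriter returns the StreamWriter class; errors defaults to "strict".
writer = codecs.getwriter("ascii")(buffer)

try:
    writer.write("\xe4")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Change the error handling scheme during the writer's lifetime.
writer.errors = "xmlcharrefreplace"
writer.write("\xe4")

written = buffer.getvalue()
```

The failed strict write leaves nothing in the buffer, so only the
character reference from the second write appears in the output.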
Implementation Notes

A sample implementation is available as SourceForge patch #432401
[2] including a script for testing the speed of various
string/encoding/error combinations and a test script.

Currently the new exception classes are old-style Python
classes. This means that accessing attributes results
in a dict lookup. The C API is implemented in a way
that makes it possible to switch to new-style classes
behind the scenes, if Exception (and UnicodeError) are
changed to new-style classes implemented in C for
improved performance.

The class codecs.StreamReaderWriter uses the errors parameter for
both reading and writing. To be more flexible this should
probably be changed to two separate parameters for reading and
writing.

The errors parameter of PyUnicode_TranslateCharmap is not
available to Python, which makes testing of the new functionality
of PyUnicode_TranslateCharmap impossible with Python scripts. The
patch should add an optional argument errors to unicode.translate
to expose the functionality and make testing possible.

Codecs that do something other than encoding/decoding from/to
unicode and want to use the new machinery can define their own
exception classes, and the strict handler will automatically work
with them. The other predefined error handlers are unicode specific
and expect to get a Unicode(Encode|Decode|Translate)Error
exception object, so they won't work.


Backwards Compatibility

The semantics of unicode.encode with errors="replace" has changed:
The old version always stored a ? character in the output string
even if no character was mapped to ? in the mapping. With the
proposed patch, the replacement string from the callback
will again be looked up in the mapping dictionary. But as all
supported encodings are ASCII based, and thus map ? to ?, this
should not be a problem in practice.


References

[1] SF feature request #403100
    "Multicharacter replacements in PyUnicode_TranslateCharmap"
    http://www.python.org/sf/403100

[2] SF patch #432401 "unicode encoding error callbacks"
    http://www.python.org/sf/432401


Copyright

This document has been placed in the public domain.



Local Variables:
mode: indented-text
indent-tabs-mode: nil
sentence-end-double-space: t
fill-column: 70
End: