From 90997b3c1d1f1586601279c5f5c04664cfccd603 Mon Sep 17 00:00:00 2001 From: Barry Warsaw Date: Wed, 19 Jun 2002 03:22:11 +0000 Subject: [PATCH] =?UTF-8?q?PEP=20293,=20Codec=20Error=20Handling=20Callbac?= =?UTF-8?q?ks,=20Walter=20D=C3=B6rwald?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- pep-0293.txt | 404 +++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 404 insertions(+) create mode 100644 pep-0293.txt diff --git a/pep-0293.txt b/pep-0293.txt new file mode 100644 index 000000000..641738c23 --- /dev/null +++ b/pep-0293.txt @@ -0,0 +1,404 @@ +PEP: 293 +Title: Codec Error Handling Callbacks +Version: $Revision$ +Last-Modified: $Date$ +Author: Walter Dörwald +Status: Draft +Type: Standards Track +Created: 18-Jun-2002 +Python-Version: 2.3 +Post-History: + + +Abstract + + This PEP aims at extending Python's fixed codec error handling + schemes with a more flexible callback based approach. + + Python currently uses a fixed error handling for codec error + handlers. This PEP describes a mechanism which allows Python to + use function callbacks as error handlers. With these more + flexible error handlers it is possible to add new functionality to + existing codecs by e.g. providing fallback solutions or different + encodings for cases where the standard codec mapping does not + apply. + + +Specification + + Currently the set of codec error handling algorithms is fixed to + either "strict", "replace" or "ignore" and the semantics of these + algorithms is implemented separately for each codec. + + The proposed patch will make the set of error handling algorithms + extensible through a codec error handler registry which maps + handler names to handler functions. This registry consists of the + following two C functions: + + int PyCodec_RegisterError(const char *name, PyObject *error) + + PyObject *PyCodec_LookupError(const char *name) + + and their Python counterparts + + codecs.register_error(name, error) + + codecs.lookup_error(name) + + PyCodec_LookupError raises a LookupError if no callback function + has been registered under this name. + + Similar to the encoding name registry there is no way of + unregistering callback functions or iterating through the + available functions. + + The callback functions will be used in the following way by the + codecs: when the codec encounters an encoding/decoding error, the + callback function is looked up by name, the information about the + error is stored in an exception object and the callback is called + with this object. The callback returns information about how to + proceed (or raises an exception). 
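+
+    As an illustration (this snippet is not part of the patch, and the
+    handler name "hexreplace" is invented for the example), a custom
+    error handler could be registered and used like this, assuming the
+    registry functions above and the exception objects described
+    below:
+
+        import codecs
+
+        # Replace every unencodable character by a hexadecimal escape
+        # and continue encoding after the collected run.
+        def hexreplace(exc):
+            if isinstance(exc, UnicodeEncodeError):
+                s = u""
+                for c in exc.object[exc.start:exc.end]:
+                    s += u"\\x%x" % ord(c)
+                return (s, exc.end)
+            else:
+                raise TypeError("can't handle %s" %
+                                exc.__class__.__name__)
+
+        codecs.register_error("hexreplace", hexreplace)
+
+        # u"a\u3042b".encode("ascii", "hexreplace") == "a\\x3042b"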
+
+    For encoding, the exception object will look like this:
+
+        class UnicodeEncodeError(UnicodeError):
+            def __init__(self, encoding, object, start, end, reason):
+                UnicodeError.__init__(self,
+                    ("encoding '%s' can't encode characters "
+                     "in positions %d-%d: %s") % (encoding,
+                        start, end-1, reason))
+                self.encoding = encoding
+                self.object = object
+                self.start = start
+                self.end = end
+                self.reason = reason
+
+    This type will be implemented in C with the appropriate setter and
+    getter methods for the attributes, which have the following
+    meaning:
+
+    * encoding: The name of the encoding;
+    * object: The original unicode object for which encode() has
+      been called;
+    * start: The position of the first unencodable character;
+    * end: (The position of the last unencodable character)+1 (or
+      the length of object, if all characters from start to the end
+      of object are unencodable);
+    * reason: The reason why object[start:end] couldn't be encoded.
+
+    If object has consecutive unencodable characters, the encoder
+    should collect those characters for one call to the callback if
+    those characters can't be encoded for the same reason.  The
+    encoder is not required to implement this behaviour and may call
+    the callback for every single character, but it is strongly
+    suggested that the collecting method is implemented.
+
+    The callback must not modify the exception object.  If the
+    callback does not raise an exception (either the one passed in, or
+    a different one), it must return a tuple:
+
+        (replacement, newpos)
+
+    replacement is a unicode object that the encoder will encode and
+    emit instead of the unencodable object[start:end] part, newpos
+    specifies a new position within object, where (after encoding the
+    replacement) the encoder will continue encoding.
+
+    If the replacement string itself contains an unencodable character
+    the encoder raises the exception object (but may set a different
+    reason string before raising).
+
+    Should further encoding errors occur, the encoder is allowed to
+    reuse the exception object for the next call to the callback.
+    Furthermore the encoder is allowed to cache the result of
+    codecs.lookup_error.
+
+    If the callback does not know how to handle the exception, it must
+    raise a TypeError.
+
+    Decoding works similarly to encoding with the following
+    differences: the exception class is named UnicodeDecodeError and
+    the attribute object is the original 8-bit string that the decoder
+    is currently decoding.
+
+    The decoder will call the callback with only those bytes that
+    constitute one undecodable sequence, even if several sequences
+    that are undecodable for the same reason follow each other
+    directly.  E.g. for the "unicode-escape" encoding, when decoding
+    the illegal string "\\u00\\u01x", the callback will be called
+    twice (once for "\\u00" and once for "\\u01").  This is done to be
+    able to generate the correct number of replacement characters.
+
+    The replacement returned from the callback is a unicode object
+    that will be emitted by the decoder as-is without further
+    processing instead of the undecodable object[start:end] part.
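+
+    To make the decoding side of this contract concrete, here is a
+    sketch of a handler that decodes unknown bytes as hexadecimal
+    escapes (again not part of the patch; the name "hexbytes" is only
+    used for illustration):
+
+        import codecs
+
+        # exc.object is the 8-bit input string; the returned unicode
+        # replacement is emitted as-is and decoding resumes at newpos.
+        def hexbytes(exc):
+            if isinstance(exc, UnicodeDecodeError):
+                s = u""
+                for c in exc.object[exc.start:exc.end]:
+                    s += u"\\x%02x" % ord(c)
+                return (s, exc.end)
+            else:
+                raise TypeError("can't handle %s" %
+                                exc.__class__.__name__)
+
+        codecs.register_error("hexbytes", hexbytes)
+
+        # "a\xffb".decode("ascii", "hexbytes") == u"a\\xffb"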
+
+    There is a third API that uses the old strict/ignore/replace error
+    handling scheme:
+
+        PyUnicode_TranslateCharmap/unicode.translate
+
+    The proposed patch will enhance PyUnicode_TranslateCharmap, so
+    that it also supports the callback registry.  This has the
+    additional side effect that PyUnicode_TranslateCharmap will
+    support multi-character replacement strings (see SF feature
+    request #403100 [1]).
+
+    For PyUnicode_TranslateCharmap the exception class will be named
+    UnicodeTranslateError.  PyUnicode_TranslateCharmap will collect
+    all consecutive untranslatable characters (i.e. those that map to
+    None) and call the callback with them.  The replacement returned
+    from the callback is a unicode object that will be put in the
+    translated result as-is, without further processing.
+
+    All encoders and decoders are allowed to implement the callback
+    functionality themselves, if they recognize the callback name
+    (i.e. if it is a system callback like "strict", "replace" and
+    "ignore").  The proposed patch will add two additional system
+    callback names: "backslashreplace" and "xmlcharrefreplace", which
+    can be used for encoding and translating and which will also be
+    implemented in-place for all encoders and
+    PyUnicode_TranslateCharmap.
+
+    The Python equivalent of these five callbacks will look like this:
+
+        def strict(exc):
+            raise exc
+
+        def ignore(exc):
+            if isinstance(exc, UnicodeError):
+                return (u"", exc.end)
+            else:
+                raise TypeError("can't handle %s" %
+                                exc.__class__.__name__)
+
+        def replace(exc):
+            if isinstance(exc, UnicodeEncodeError):
+                return ((exc.end-exc.start)*u"?", exc.end)
+            elif isinstance(exc, UnicodeDecodeError):
+                return (u"\ufffd", exc.end)
+            elif isinstance(exc, UnicodeTranslateError):
+                return ((exc.end-exc.start)*u"\ufffd", exc.end)
+            else:
+                raise TypeError("can't handle %s" %
+                                exc.__class__.__name__)
+
+        def backslashreplace(exc):
+            if isinstance(exc,
+                (UnicodeEncodeError, UnicodeTranslateError)):
+                s = u""
+                for c in exc.object[exc.start:exc.end]:
+                    if ord(c)<=0xff:
+                        s += u"\\x%02x" % ord(c)
+                    elif ord(c)<=0xffff:
+                        s += u"\\u%04x" % ord(c)
+                    else:
+                        s += u"\\U%08x" % ord(c)
+                return (s, exc.end)
+            else:
+                raise TypeError("can't handle %s" %
+                                exc.__class__.__name__)
+
+        def xmlcharrefreplace(exc):
+            if isinstance(exc,
+                (UnicodeEncodeError, UnicodeTranslateError)):
+                s = u""
+                for c in exc.object[exc.start:exc.end]:
+                    s += u"&#%d;" % ord(c)
+                return (s, exc.end)
+            else:
+                raise TypeError("can't handle %s" %
+                                exc.__class__.__name__)
+
+    These five callback handlers will also be accessible to Python as
+    codecs.strict_error, codecs.ignore_error, codecs.replace_error,
+    codecs.backslashreplace_error and codecs.xmlcharrefreplace_error.
+
+
+Rationale
+
+    Most legacy encodings do not support the full range of Unicode
+    characters.  For these cases many high level protocols support a
+    way of escaping a Unicode character (e.g. Python itself supports
+    the \x, \u and \U convention, XML supports character references
+    via &#xxx; etc.).
+
+    When implementing such an encoding algorithm, a problem with the
+    current implementation of the encode method of Unicode objects
+    becomes apparent: for determining which characters are unencodable
+    by a certain encoding, every single character has to be tried,
+    because encode does not provide any information about the location
+    of the error(s), so
+
+        # (1)
+        us = u"xxx"
+        s = us.encode(encoding)
+
+    has to be replaced by
+
+        # (2)
+        us = u"xxx"
+        v = []
+        for c in us:
+            try:
+                v.append(c.encode(encoding))
+            except UnicodeError:
+                v.append("&#%d;" % ord(c))
+        s = "".join(v)
+
+    This slows down encoding dramatically as now the loop through the
+    string is done in Python code and no longer in C code.
+
+    Furthermore this solution poses problems with stateful encodings.
+
+    For example, UTF-16 uses a Byte Order Mark at the start of the
+    encoded byte string to specify the byte order.  Using (2) with
+    UTF-16 results in an 8-bit string with a BOM between every
+    character.
+
+    To work around this problem, a stream writer - which keeps state
+    between calls to the encoding function - has to be used:
+
+        # (3)
+        us = u"xxx"
+        import codecs, cStringIO as StringIO
+        writer = codecs.getwriter(encoding)
+
+        v = StringIO.StringIO()
+        uv = writer(v)
+        for c in us:
+            try:
+                uv.write(c)
+            except UnicodeError:
+                uv.write(u"&#%d;" % ord(c))
+        s = v.getvalue()
+
+    To compare the speed of (1) and (3) the following test script has
+    been used:
+
+        # (4)
+        import time
+        us = u"äa"*1000000
+        encoding = "ascii"
+        import codecs, cStringIO as StringIO
+
+        t1 = time.time()
+
+        s1 = us.encode(encoding, "replace")
+
+        t2 = time.time()
+
+        writer = codecs.getwriter(encoding)
+
+        v = StringIO.StringIO()
+        uv = writer(v)
+        for c in us:
+            try:
+                uv.write(c)
+            except UnicodeError:
+                uv.write(u"?")
+        s2 = v.getvalue()
+
+        t3 = time.time()
+
+        assert(s1==s2)
+        print "1:", t2-t1
+        print "2:", t3-t2
+        print "factor:", (t3-t2)/(t2-t1)
+
+    On Linux this gives the following output (with Python 2.3a0):
+
+        1: 0.274321913719
+        2: 51.1284689903
+        factor: 186.381278466
+
+    i.e. (3) is more than 180 times slower than (1).
+
+    Callbacks must be stateless, because as soon as a callback is
+    registered it is available globally and can be called by multiple
+    encode() calls.  To be able to use stateful callbacks, the errors
+    parameter for encode/decode/translate would have to be changed
+    from char * to PyObject *, so that the callback could be used
+    directly, without the need to register the callback globally.  As
+    this requires changes to lots of C prototypes, this approach was
+    rejected.
+
+    Currently all encoding/decoding functions have arguments
+
+        const Py_UNICODE *p, int size
+
+    or
+
+        const char *p, int size
+
+    to specify the unicode characters/8-bit characters to be
+    encoded/decoded.  So in case of an error the codec has to create a
+    new unicode or str object from these parameters and store it in
+    the exception object.  The callers of these encoding/decoding
+    functions extract these parameters from str/unicode objects
+    themselves most of the time, so it could speed up error handling
+    if these objects were passed directly.  As this again requires
+    changes to many C functions, this approach has been rejected.
+
+
+Implementation Notes
+
+    A sample implementation is available as SourceForge patch #432401
+    [2].  The current version of this patch differs from the
+    specification in the following ways:
+
+    * The error information is passed from the codec to the callback
+      not as an exception object, but as a tuple, which has an
+      additional entry, state, which can be used for additional
+      information the codec might want to pass to the callback.
+    * There are two separate registries (one for
+      encoding/translating and one for decoding).
+
+    The class codecs.StreamReaderWriter uses the errors parameter for
+    both reading and writing.  To be more flexible this should
+    probably be changed to two separate parameters for reading and
+    writing.
+
+    The errors parameter of PyUnicode_TranslateCharmap is not
+    available to Python, which makes testing of the new functionality
+    of PyUnicode_TranslateCharmap impossible from Python scripts.  The
+    patch should add an optional argument errors to unicode.translate
+    to expose the functionality and make testing possible.
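+
+    If such an argument is added, a test could look roughly like the
+    following sketch (the handler name "fallback" is invented, and the
+    two-argument form of unicode.translate is an assumption of the
+    proposed patch, not an existing API):
+
+        import codecs
+
+        # Replace every untranslatable character (one that maps to
+        # None) by a question mark and continue after the run.
+        def fallback(exc):
+            if isinstance(exc, UnicodeTranslateError):
+                return (u"?"*(exc.end-exc.start), exc.end)
+            else:
+                raise TypeError("can't handle %s" %
+                                exc.__class__.__name__)
+
+        codecs.register_error("fallback", fallback)
+
+        # Under the proposed semantics this would give u"a?c":
+        # u"abc".translate({ord(u"b"): None}, "fallback")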
+
+    Codecs that do something different than encoding/decoding from/to
+    unicode and want to use the new machinery can define their own
+    exception classes and the strict handler will automatically work
+    with them.  The other predefined error handlers are unicode
+    specific and expect to get a Unicode(Encode|Decode|Translate)Error
+    exception object, so they won't work.
+
+
+Backwards Compatibility
+
+    The semantics of unicode.encode with errors="replace" has changed:
+    The old version always stored a ? character in the output string
+    even if no character was mapped to ? in the mapping.  With the
+    proposed patch, the replacement string from the callback will
+    again be looked up in the mapping dictionary.  But as all
+    supported encodings are ASCII based, and thus map ? to ?, this
+    should not be a problem in practice.
+
+
+References
+
+    [1] SF feature request #403100
+        "Multicharacter replacements in PyUnicode_TranslateCharmap"
+        http://www.python.org/sf/403100
+
+    [2] SF patch #432401 "unicode encoding error callbacks"
+        http://www.python.org/sf/432401
+
+
+Copyright
+
+    This document has been placed in the public domain.
+
+
+
+Local Variables:
+mode: indented-text
+indent-tabs-mode: nil
+sentence-end-double-space: t
+fill-column: 70
+End: