diff --git a/pep-0293.txt b/pep-0293.txt index 0d0ab3ee7..48e4de38a 100644 --- a/pep-0293.txt +++ b/pep-0293.txt @@ -5,418 +5,435 @@ Last-Modified: $Date$ Author: Walter Dörwald Status: Final Type: Standards Track +Content-Type: text/x-rst Created: 18-Jun-2002 Python-Version: 2.3 Post-History: 19-Jun-2002 Abstract +======== - This PEP aims at extending Python's fixed codec error handling - schemes with a more flexible callback based approach. +This PEP aims at extending Python's fixed codec error handling +schemes with a more flexible callback based approach. - Python currently uses a fixed error handling for codec error - handlers. This PEP describes a mechanism which allows Python to - use function callbacks as error handlers. With these more - flexible error handlers it is possible to add new functionality to - existing codecs by e.g. providing fallback solutions or different - encodings for cases where the standard codec mapping does not - apply. +Python currently uses a fixed error handling for codec error +handlers. This PEP describes a mechanism which allows Python to +use function callbacks as error handlers. With these more +flexible error handlers it is possible to add new functionality to +existing codecs by e.g. providing fallback solutions or different +encodings for cases where the standard codec mapping does not +apply. Specification +============= - Currently the set of codec error handling algorithms is fixed to - either "strict", "replace" or "ignore" and the semantics of these - algorithms is implemented separately for each codec. +Currently the set of codec error handling algorithms is fixed to +either "strict", "replace" or "ignore" and the semantics of these +algorithms is implemented separately for each codec. - The proposed patch will make the set of error handling algorithms - extensible through a codec error handler registry which maps - handler names to handler functions. This registry consists of the - following two C functions: +The proposed patch will make the set of error handling algorithms +extensible through a codec error handler registry which maps +handler names to handler functions. This registry consists of the +following two C functions:: - int PyCodec_RegisterError(const char *name, PyObject *error) + int PyCodec_RegisterError(const char *name, PyObject *error) - PyObject *PyCodec_LookupError(const char *name) + PyObject *PyCodec_LookupError(const char *name) - and their Python counterparts +and their Python counterparts:: - codecs.register_error(name, error) + codecs.register_error(name, error) - codecs.lookup_error(name) + codecs.lookup_error(name) - PyCodec_LookupError raises a LookupError if no callback function - has been registered under this name. +``PyCodec_LookupError`` raises a ``LookupError`` if no callback function +has been registered under this name. - Similar to the encoding name registry there is no way of - unregistering callback functions or iterating through the - available functions. +Similar to the encoding name registry there is no way of +unregistering callback functions or iterating through the +available functions. - The callback functions will be used in the following way by the - codecs: when the codec encounters an encoding/decoding error, the - callback function is looked up by name, the information about the - error is stored in an exception object and the callback is called - with this object. The callback returns information about how to - proceed (or raises an exception). +The callback functions will be used in the following way by the +codecs: when the codec encounters an encoding/decoding error, the +callback function is looked up by name, the information about the +error is stored in an exception object and the callback is called +with this object. The callback returns information about how to +proceed (or raises an exception). - For encoding, the exception object will look like this: +For encoding, the exception object will look like this:: - class UnicodeEncodeError(UnicodeError): - def __init__(self, encoding, object, start, end, reason): - UnicodeError.__init__(self, - "encoding '%s' can't encode characters " + - "in positions %d-%d: %s" % (encoding, - start, end-1, reason)) - self.encoding = encoding - self.object = object - self.start = start - self.end = end - self.reason = reason + class UnicodeEncodeError(UnicodeError): + def __init__(self, encoding, object, start, end, reason): + UnicodeError.__init__(self, + "encoding '%s' can't encode characters " + + "in positions %d-%d: %s" % (encoding, + start, end-1, reason)) + self.encoding = encoding + self.object = object + self.start = start + self.end = end + self.reason = reason - This type will be implemented in C with the appropriate setter and - getter methods for the attributes, which have the following - meaning: +This type will be implemented in C with the appropriate setter and +getter methods for the attributes, which have the following +meaning: - * encoding: The name of the encoding; - * object: The original unicode object for which encode() has - been called; - * start: The position of the first unencodable character; - * end: (The position of the last unencodable character)+1 (or - the length of object, if all characters from start to the end - of object are unencodable); - * reason: The reason why object[start:end] couldn't be encoded. +* ``encoding``: The name of the encoding; +* ``object``: The original unicode object for which ``encode()`` has + been called; +* ``start``: The position of the first unencodable character; +* ``end``: (The position of the last unencodable character)+1 (or + the length of object, if all characters from start to the end + of object are unencodable); +* ``reason``: The reason why ``object[start:end]`` couldn't be encoded. - If object has consecutive unencodable characters, the encoder - should collect those characters for one call to the callback if - those characters can't be encoded for the same reason. The - encoder is not required to implement this behaviour but may call - the callback for every single character, but it is strongly - suggested that the collecting method is implemented. +If object has consecutive unencodable characters, the encoder +should collect those characters for one call to the callback if +those characters can't be encoded for the same reason. The +encoder is not required to implement this behaviour but may call +the callback for every single character, but it is strongly +suggested that the collecting method is implemented. - The callback must not modify the exception object. If the - callback does not raise an exception (either the one passed in, or - a different one), it must return a tuple: +The callback must not modify the exception object. If the +callback does not raise an exception (either the one passed in, or +a different one), it must return a tuple:: - (replacement, newpos) + (replacement, newpos) - replacement is a unicode object that the encoder will encode and - emit instead of the unencodable object[start:end] part, newpos - specifies a new position within object, where (after encoding the - replacement) the encoder will continue encoding. +replacement is a unicode object that the encoder will encode and +emit instead of the unencodable ``object[start:end]`` part, newpos +specifies a new position within object, where (after encoding the +replacement) the encoder will continue encoding. - Negative values for newpos are treated as being relative to - end of object. If newpos is out of bounds the encoder will raise - an IndexError. +Negative values for newpos are treated as being relative to +end of object. If newpos is out of bounds the encoder will raise +an ``IndexError``. - If the replacement string itself contains an unencodable character - the encoder raises the exception object (but may set a different - reason string before raising). +If the replacement string itself contains an unencodable character +the encoder raises the exception object (but may set a different +reason string before raising). - Should further encoding errors occur, the encoder is allowed to - reuse the exception object for the next call to the callback. - Furthermore, the encoder is allowed to cache the result of - codecs.lookup_error. +Should further encoding errors occur, the encoder is allowed to +reuse the exception object for the next call to the callback. +Furthermore, the encoder is allowed to cache the result of +``codecs.lookup_error``. - If the callback does not know how to handle the exception, it must - raise a TypeError. +If the callback does not know how to handle the exception, it must +raise a ``TypeError``. - Decoding works similar to encoding with the following differences: - The exception class is named UnicodeDecodeError and the attribute - object is the original 8bit string that the decoder is currently - decoding. +Decoding works similar to encoding with the following differences: - The decoder will call the callback with those bytes that - constitute one undecodable sequence, even if there is more than - one undecodable sequence that is undecodable for the same reason - directly after the first one. E.g. for the "unicode-escape" - encoding, when decoding the illegal string "\\u00\\u01x", the - callback will be called twice (once for "\\u00" and once for - "\\u01"). This is done to be able to generate the correct number - of replacement characters. +* The exception class is named ``UnicodeDecodeError`` and the attribute + object is the original 8bit string that the decoder is currently + decoding. - The replacement returned from the callback is a unicode object - that will be emitted by the decoder as-is without further - processing instead of the undecodable object[start:end] part. +* The decoder will call the callback with those bytes that + constitute one undecodable sequence, even if there is more than + one undecodable sequence that is undecodable for the same reason + directly after the first one. E.g. for the "unicode-escape" + encoding, when decoding the illegal string ``\\u00\\u01x``, the + callback will be called twice (once for ``\\u00`` and once for + ``\\u01``). This is done to be able to generate the correct number + of replacement characters. - There is a third API that uses the old strict/ignore/replace error - handling scheme: +* The replacement returned from the callback is a unicode object + that will be emitted by the decoder as-is without further + processing instead of the undecodable ``object[start:end]`` part. - PyUnicode_TranslateCharmap/unicode.translate +There is a third API that uses the old strict/ignore/replace error +handling scheme:: - The proposed patch will enhance PyUnicode_TranslateCharmap, so - that it also supports the callback registry. This has the - additional side effect that PyUnicode_TranslateCharmap will - support multi-character replacement strings (see SF feature - request #403100 [1]). + PyUnicode_TranslateCharmap/unicode.translate - For PyUnicode_TranslateCharmap the exception class will be named - UnicodeTranslateError. PyUnicode_TranslateCharmap will collect - all consecutive untranslatable characters (i.e. those that map to - None) and call the callback with them. The replacement returned - from the callback is a unicode object that will be put in the - translated result as-is, without further processing. +The proposed patch will enhance ``PyUnicode_TranslateCharmap``, so +that it also supports the callback registry. This has the +additional side effect that ``PyUnicode_TranslateCharmap`` will +support multi-character replacement strings (see SF feature +request #403100 [1]_). - All encoders and decoders are allowed to implement the callback - functionality themselves, if they recognize the callback name - (i.e. if it is a system callback like "strict", "replace" and - "ignore"). The proposed patch will add two additional system - callback names: "backslashreplace" and "xmlcharrefreplace", which - can be used for encoding and translating and which will also be - implemented in-place for all encoders and - PyUnicode_TranslateCharmap. +For ``PyUnicode_TranslateCharmap`` the exception class will be named +``UnicodeTranslateError``. ``PyUnicode_TranslateCharmap`` will collect +all consecutive untranslatable characters (i.e. those that map to +``None``) and call the callback with them. The replacement returned +from the callback is a unicode object that will be put in the +translated result as-is, without further processing. - The Python equivalent of these five callbacks will look like this: +All encoders and decoders are allowed to implement the callback +functionality themselves, if they recognize the callback name +(i.e. if it is a system callback like "strict", "replace" and +"ignore"). The proposed patch will add two additional system +callback names: "backslashreplace" and "xmlcharrefreplace", which +can be used for encoding and translating and which will also be +implemented in-place for all encoders and +``PyUnicode_TranslateCharmap``. - def strict(exc): - raise exc +The Python equivalent of these five callbacks will look like this:: - def ignore(exc): - if isinstance(exc, UnicodeError): - return (u"", exc.end) - else: - raise TypeError("can't handle %s" % exc.__name__) + def strict(exc): + raise exc - def replace(exc): - if isinstance(exc, UnicodeEncodeError): - return ((exc.end-exc.start)*u"?", exc.end) - elif isinstance(exc, UnicodeDecodeError): - return (u"\\ufffd", exc.end) - elif isinstance(exc, UnicodeTranslateError): - return ((exc.end-exc.start)*u"\\ufffd", exc.end) - else: - raise TypeError("can't handle %s" % exc.__name__) + def ignore(exc): + if isinstance(exc, UnicodeError): + return (u"", exc.end) + else: + raise TypeError("can't handle %s" % exc.__name__) - def backslashreplace(exc): - if isinstance(exc, - (UnicodeEncodeError, UnicodeTranslateError)): - s = u"" - for c in exc.object[exc.start:exc.end]: - if ord(c)<=0xff: - s += u"\\x%02x" % ord(c) - elif ord(c)<=0xffff: - s += u"\\u%04x" % ord(c) - else: - s += u"\\U%08x" % ord(c) - return (s, exc.end) - else: - raise TypeError("can't handle %s" % exc.__name__) + def replace(exc): + if isinstance(exc, UnicodeEncodeError): + return ((exc.end-exc.start)*u"?", exc.end) + elif isinstance(exc, UnicodeDecodeError): + return (u"\\ufffd", exc.end) + elif isinstance(exc, UnicodeTranslateError): + return ((exc.end-exc.start)*u"\\ufffd", exc.end) + else: + raise TypeError("can't handle %s" % exc.__name__) - def xmlcharrefreplace(exc): - if isinstance(exc, - (UnicodeEncodeError, UnicodeTranslateError)): - s = u"" - for c in exc.object[exc.start:exc.end]: - s += u"&#%d;" % ord(c) - return (s, exc.end) - else: - raise TypeError("can't handle %s" % exc.__name__) + def backslashreplace(exc): + if isinstance(exc, + (UnicodeEncodeError, UnicodeTranslateError)): + s = u"" + for c in exc.object[exc.start:exc.end]: + if ord(c)<=0xff: + s += u"\\x%02x" % ord(c) + elif ord(c)<=0xffff: + s += u"\\u%04x" % ord(c) + else: + s += u"\\U%08x" % ord(c) + return (s, exc.end) + else: + raise TypeError("can't handle %s" % exc.__name__) - These five callback handlers will also be accessible to Python as - codecs.strict_error, codecs.ignore_error, codecs.replace_error, - codecs.backslashreplace_error and codecs.xmlcharrefreplace_error. + def xmlcharrefreplace(exc): + if isinstance(exc, + (UnicodeEncodeError, UnicodeTranslateError)): + s = u"" + for c in exc.object[exc.start:exc.end]: + s += u"&#%d;" % ord(c) + return (s, exc.end) + else: + raise TypeError("can't handle %s" % exc.__name__) + +These five callback handlers will also be accessible to Python as +``codecs.strict_error``, ``codecs.ignore_error``, ``codecs.replace_error``, +``codecs.backslashreplace_error`` and ``codecs.xmlcharrefreplace_error``. Rationale +========= - Most legacy encoding do not support the full range of Unicode - characters. For these cases many high level protocols support a - way of escaping a Unicode character (e.g. Python itself supports - the \x, \u and \U convention, XML supports character references - via &#xxx; etc.). +Most legacy encoding do not support the full range of Unicode +characters. For these cases many high level protocols support a +way of escaping a Unicode character (e.g. Python itself supports +the ``\x``, ``\u`` and ``\U`` convention, XML supports character references +via &#xxx; etc.). - When implementing such an encoding algorithm, a problem with the - current implementation of the encode method of Unicode objects - becomes apparent: For determining which characters are unencodable - by a certain encoding, every single character has to be tried, - because encode does not provide any information about the location - of the error(s), so +When implementing such an encoding algorithm, a problem with the +current implementation of the encode method of Unicode objects +becomes apparent: For determining which characters are unencodable +by a certain encoding, every single character has to be tried, +because encode does not provide any information about the location +of the error(s), so - # (1) - us = u"xxx" - s = us.encode(encoding) +:: - has to be replaced by + # (1) + us = u"xxx" + s = us.encode(encoding) - # (2) - us = u"xxx" - v = [] - for c in us: - try: - v.append(c.encode(encoding)) - except UnicodeError: - v.append("&#%d;" % ord(c)) - s = "".join(v) +has to be replaced by - This slows down encoding dramatically as now the loop through the - string is done in Python code and no longer in C code. +:: - Furthermore, this solution poses problems with stateful encodings. - For example, UTF-16 uses a Byte Order Mark at the start of the - encoded byte string to specify the byte order. Using (2) with - UTF-16, results in an 8 bit string with a BOM between every - character. + # (2) + us = u"xxx" + v = [] + for c in us: + try: + v.append(c.encode(encoding)) + except UnicodeError: + v.append("&#%d;" % ord(c)) + s = "".join(v) - To work around this problem, a stream writer - which keeps state - between calls to the encoding function - has to be used: +This slows down encoding dramatically as now the loop through the +string is done in Python code and no longer in C code. - # (3) - us = u"xxx" - import codecs, cStringIO as StringIO - writer = codecs.getwriter(encoding) +Furthermore, this solution poses problems with stateful encodings. +For example, UTF-16 uses a Byte Order Mark at the start of the +encoded byte string to specify the byte order. Using (2) with +UTF-16, results in an 8 bit string with a BOM between every +character. - v = StringIO.StringIO() - uv = writer(v) - for c in us: - try: - uv.write(c) - except UnicodeError: - uv.write(u"&#%d;" % ord(c)) - s = v.getvalue() +To work around this problem, a stream writer - which keeps state +between calls to the encoding function - has to be used:: - To compare the speed of (1) and (3) the following test script has - been used: + # (3) + us = u"xxx" + import codecs, cStringIO as StringIO + writer = codecs.getwriter(encoding) - # (4) - import time - us = u"äa"*1000000 - encoding = "ascii" - import codecs, cStringIO as StringIO + v = StringIO.StringIO() + uv = writer(v) + for c in us: + try: + uv.write(c) + except UnicodeError: + uv.write(u"&#%d;" % ord(c)) + s = v.getvalue() - t1 = time.time() +To compare the speed of (1) and (3) the following test script has +been used:: - s1 = us.encode(encoding, "replace") + # (4) + import time + us = u"äa"*1000000 + encoding = "ascii" + import codecs, cStringIO as StringIO - t2 = time.time() + t1 = time.time() - writer = codecs.getwriter(encoding) + s1 = us.encode(encoding, "replace") - v = StringIO.StringIO() - uv = writer(v) - for c in us: - try: - uv.write(c) - except UnicodeError: - uv.write(u"?") - s2 = v.getvalue() + t2 = time.time() - t3 = time.time() + writer = codecs.getwriter(encoding) - assert(s1==s2) - print "1:", t2-t1 - print "2:", t3-t2 - print "factor:", (t3-t2)/(t2-t1) + v = StringIO.StringIO() + uv = writer(v) + for c in us: + try: + uv.write(c) + except UnicodeError: + uv.write(u"?") + s2 = v.getvalue() - On Linux this gives the following output (with Python 2.3a0): + t3 = time.time() - 1: 0.274321913719 - 2: 51.1284689903 - factor: 186.381278466 + assert(s1==s2) + print "1:", t2-t1 + print "2:", t3-t2 + print "factor:", (t3-t2)/(t2-t1) - i.e. (3) is 180 times slower than (1). +On Linux this gives the following output (with Python 2.3a0):: - Callbacks must be stateless, because as soon as a callback is - registered it is available globally and can be called by multiple - encode() calls. To be able to use stateful callbacks, the errors - parameter for encode/decode/translate would have to be changed - from char * to PyObject *, so that the callback could be used - directly, without the need to register the callback globally. As - this requires changes to lots of C prototypes, this approach was - rejected. + 1: 0.274321913719 + 2: 51.1284689903 + factor: 186.381278466 - Currently all encoding/decoding functions have arguments +i.e. (3) is 180 times slower than (1). - const Py_UNICODE *p, int size +Callbacks must be stateless, because as soon as a callback is +registered it is available globally and can be called by multiple +``encode()`` calls. To be able to use stateful callbacks, the errors +parameter for encode/decode/translate would have to be changed +from ``char *`` to ``PyObject *``, so that the callback could be used +directly, without the need to register the callback globally. As +this requires changes to lots of C prototypes, this approach was +rejected. - or +Currently all encoding/decoding functions have arguments - const char *p, int size +:: - to specify the unicode characters/8bit characters to be - encoded/decoded. So in case of an error the codec has to create a - new unicode or str object from these parameters and store it in - the exception object. The callers of these encoding/decoding - functions extract these parameters from str/unicode objects - themselves most of the time, so it could speed up error handling - if these object were passed directly. As this again requires - changes to many C functions, this approach has been rejected. + const Py_UNICODE *p, int size - For stream readers/writers the errors attribute must be changeable - to be able to switch between different error handling methods - during the lifetime of the stream reader/writer. This is currently - the case for codecs.StreamReader and codecs.StreamWriter and - all their subclasses. All core codecs and probably most of the - third party codecs (e.g. JapaneseCodecs) derive their stream - readers/writers from these classes so this already works, - but the attribute errors should be documented as a requirement. +or + +:: + + const char *p, int size + +to specify the unicode characters/8bit characters to be +encoded/decoded. So in case of an error the codec has to create a +new unicode or str object from these parameters and store it in +the exception object. The callers of these encoding/decoding +functions extract these parameters from str/unicode objects +themselves most of the time, so it could speed up error handling +if these object were passed directly. As this again requires +changes to many C functions, this approach has been rejected. + +For stream readers/writers the errors attribute must be changeable +to be able to switch between different error handling methods +during the lifetime of the stream reader/writer. This is currently +the case for ``codecs.StreamReader`` and ``codecs.StreamWriter`` and +all their subclasses. All core codecs and probably most of the +third party codecs (e.g. ``JapaneseCodecs``) derive their stream +readers/writers from these classes so this already works, +but the attribute errors should be documented as a requirement. Implementation Notes +==================== - A sample implementation is available as SourceForge patch #432401 - [2] including a script for testing the speed of various - string/encoding/error combinations and a test script. +A sample implementation is available as SourceForge patch #432401 +[2]_ including a script for testing the speed of various +string/encoding/error combinations and a test script. - Currently the new exception classes are old style Python - classes. This means that accessing attributes results - in a dict lookup. The C API is implemented in a way - that makes it possible to switch to new style classes - behind the scene, if Exception (and UnicodeError) will - be changed to new style classes implemented in C for - improved performance. +Currently the new exception classes are old style Python +classes. This means that accessing attributes results +in a dict lookup. The C API is implemented in a way +that makes it possible to switch to new style classes +behind the scene, if ``Exception`` (and ``UnicodeError``) will +be changed to new style classes implemented in C for +improved performance. - The class codecs.StreamReaderWriter uses the errors parameter for - both reading and writing. To be more flexible this should - probably be changed to two separate parameters for reading and - writing. +The class ``codecs.StreamReaderWriter`` uses the errors parameter for +both reading and writing. To be more flexible this should +probably be changed to two separate parameters for reading and +writing. - The errors parameter of PyUnicode_TranslateCharmap is not - availably to Python, which makes testing of the new functionality - of PyUnicode_TranslateCharmap impossible with Python scripts. The - patch should add an optional argument errors to unicode.translate - to expose the functionality and make testing possible. +The errors parameter of ``PyUnicode_TranslateCharmap`` is not +availably to Python, which makes testing of the new functionality +of ``PyUnicode_TranslateCharmap`` impossible with Python scripts. The +patch should add an optional argument errors to unicode.translate +to expose the functionality and make testing possible. - Codecs that do something different than encoding/decoding from/to - unicode and want to use the new machinery can define their own - exception classes and the strict handlers will automatically work - with it. The other predefined error handlers are unicode specific - and expect to get a Unicode(Encode|Decode|Translate)Error - exception object so they won't work. +Codecs that do something different than encoding/decoding from/to +unicode and want to use the new machinery can define their own +exception classes and the strict handlers will automatically work +with it. The other predefined error handlers are unicode specific +and expect to get a ``Unicode(Encode|Decode|Translate)Error`` +exception object so they won't work. Backwards Compatibility +======================= - The semantics of unicode.encode with errors="replace" has changed: - The old version always stored a ? character in the output string - even if no character was mapped to ? in the mapping. With the - proposed patch, the replacement string from the callback will - again be looked up in the mapping dictionary. But as all - supported encodings are ASCII based, and thus map ? to ?, this - should not be a problem in practice. +The semantics of unicode.encode with errors="replace" has changed: +The old version always stored a ? character in the output string +even if no character was mapped to ? in the mapping. With the +proposed patch, the replacement string from the callback will +again be looked up in the mapping dictionary. But as all +supported encodings are ASCII based, and thus map ? to ?, this +should not be a problem in practice. - Illegal values for the errors argument raised ValueError before, - now they will raise LookupError. +Illegal values for the errors argument raised ``ValueError`` before, +now they will raise ``LookupError``. References +========== - [1] SF feature request #403100 - "Multicharacter replacements in PyUnicode_TranslateCharmap" - http://www.python.org/sf/403100 +.. [1] SF feature request #403100 + "Multicharacter replacements in PyUnicode_TranslateCharmap" + http://www.python.org/sf/403100 - [2] SF patch #432401 "unicode encoding error callbacks" - http://www.python.org/sf/432401 +.. [2] SF patch #432401 "unicode encoding error callbacks" + http://www.python.org/sf/432401 Copyright +========= - This document has been placed in the public domain. +This document has been placed in the public domain. - -Local Variables: -mode: indented-text -indent-tabs-mode: nil -sentence-end-double-space: t -fill-column: 70 -coding: utf-8 -End: +.. + Local Variables: + mode: indented-text + indent-tabs-mode: nil + sentence-end-double-space: t + fill-column: 70 + coding: utf-8 + End: