reSTify PEP 293 (#353)
parent e6fe4f377f
commit 3ea921ba72
119 pep-0293.txt
|
@ -5,12 +5,14 @@ Last-Modified: $Date$
|
||||||
Author: Walter Dörwald <walter@livinglogic.de>
|
Author: Walter Dörwald <walter@livinglogic.de>
|
||||||
Status: Final
|
Status: Final
|
||||||
Type: Standards Track
|
Type: Standards Track
|
||||||
|
Content-Type: text/x-rst
|
||||||
Created: 18-Jun-2002
|
Created: 18-Jun-2002
|
||||||
Python-Version: 2.3
|
Python-Version: 2.3
|
||||||
Post-History: 19-Jun-2002
|
Post-History: 19-Jun-2002
|
||||||
|
|
||||||
|
|
||||||
Abstract
|
Abstract
|
||||||
|
========
|
||||||
|
|
||||||
This PEP aims at extending Python's fixed codec error handling
|
This PEP aims at extending Python's fixed codec error handling
|
||||||
schemes with a more flexible callback based approach.
|
schemes with a more flexible callback based approach.
|
||||||
|
@ -25,6 +27,7 @@ Abstract
|
||||||
|
|
||||||
|
|
||||||
Specification
|
Specification
|
||||||
|
=============
|
||||||
|
|
||||||
Currently the set of codec error handling algorithms is fixed to
|
Currently the set of codec error handling algorithms is fixed to
|
||||||
either "strict", "replace" or "ignore" and the semantics of these
|
either "strict", "replace" or "ignore" and the semantics of these
|
||||||
|
@ -33,19 +36,19 @@ Specification
|
||||||
The proposed patch will make the set of error handling algorithms
|
The proposed patch will make the set of error handling algorithms
|
||||||
extensible through a codec error handler registry which maps
|
extensible through a codec error handler registry which maps
|
||||||
handler names to handler functions. This registry consists of the
|
handler names to handler functions. This registry consists of the
|
||||||
following two C functions:
|
following two C functions::
|
||||||
|
|
||||||
int PyCodec_RegisterError(const char *name, PyObject *error)
|
int PyCodec_RegisterError(const char *name, PyObject *error)
|
||||||
|
|
||||||
PyObject *PyCodec_LookupError(const char *name)
|
PyObject *PyCodec_LookupError(const char *name)
|
||||||
|
|
||||||
and their Python counterparts
|
and their Python counterparts::
|
||||||
|
|
||||||
codecs.register_error(name, error)
|
codecs.register_error(name, error)
|
||||||
|
|
||||||
codecs.lookup_error(name)
|
codecs.lookup_error(name)
|
||||||
|
|
||||||
PyCodec_LookupError raises a LookupError if no callback function
|
``PyCodec_LookupError`` raises a ``LookupError`` if no callback function
|
||||||
has been registered under this name.
|
has been registered under this name.
|
||||||
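As a minimal sketch of the registry described in this hunk, the Python-level functions can be exercised as follows in a current Python 3 interpreter. The handler `question_marks` and the names `example.qmarks` and `example.never-registered` are hypothetical and chosen only for illustration; they do not appear in the PEP.

```python
import codecs


def question_marks(exc):
    """Hypothetical handler: emit one '?' per unencodable character."""
    if isinstance(exc, UnicodeEncodeError):
        # Return (replacement, newpos) as the specification requires.
        return ("?" * (exc.end - exc.start), exc.end)
    raise TypeError("can't handle %s" % type(exc).__name__)


# "example.qmarks" is an illustrative name, not one defined by the PEP.
codecs.register_error("example.qmarks", question_marks)
looked_up = codecs.lookup_error("example.qmarks")

# Looking up a name that was never registered raises LookupError.
try:
    codecs.lookup_error("example.never-registered")
    lookup_failed = False
except LookupError:
    lookup_failed = True

# The registered name can now be passed as the errors argument.
encoded = "a\u20acb".encode("ascii", "example.qmarks")
```

Because the registry is global, a name registered this way is visible to every later `encode()` call in the process.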
|
|
||||||
Similar to the encoding name registry there is no way of
|
Similar to the encoding name registry there is no way of
|
||||||
|
@ -59,7 +62,7 @@ Specification
|
||||||
with this object. The callback returns information about how to
|
with this object. The callback returns information about how to
|
||||||
proceed (or raises an exception).
|
proceed (or raises an exception).
|
||||||
|
|
||||||
For encoding, the exception object will look like this:
|
For encoding, the exception object will look like this::
|
||||||
|
|
||||||
class UnicodeEncodeError(UnicodeError):
|
class UnicodeEncodeError(UnicodeError):
|
||||||
def __init__(self, encoding, object, start, end, reason):
|
def __init__(self, encoding, object, start, end, reason):
|
||||||
|
@ -77,14 +80,14 @@ Specification
|
||||||
getter methods for the attributes, which have the following
|
getter methods for the attributes, which have the following
|
||||||
meaning:
|
meaning:
|
||||||
|
|
||||||
* encoding: The name of the encoding;
|
* ``encoding``: The name of the encoding;
|
||||||
* object: The original unicode object for which encode() has
|
* ``object``: The original unicode object for which ``encode()`` has
|
||||||
been called;
|
been called;
|
||||||
* start: The position of the first unencodable character;
|
* ``start``: The position of the first unencodable character;
|
||||||
* end: (The position of the last unencodable character)+1 (or
|
* ``end``: (The position of the last unencodable character)+1 (or
|
||||||
the length of object, if all characters from start to the end
|
the length of object, if all characters from start to the end
|
||||||
of object are unencodable);
|
of object are unencodable);
|
||||||
* reason: The reason why object[start:end] couldn't be encoded.
|
* ``reason``: The reason why ``object[start:end]`` couldn't be encoded.
|
||||||
|
|
||||||
If object has consecutive unencodable characters, the encoder
|
If object has consecutive unencodable characters, the encoder
|
||||||
should collect those characters for one call to the callback if
|
should collect those characters for one call to the callback if
|
||||||
|
@ -95,18 +98,18 @@ Specification
|
||||||
|
|
||||||
The callback must not modify the exception object. If the
|
The callback must not modify the exception object. If the
|
||||||
callback does not raise an exception (either the one passed in, or
|
callback does not raise an exception (either the one passed in, or
|
||||||
a different one), it must return a tuple:
|
a different one), it must return a tuple::
|
||||||
|
|
||||||
(replacement, newpos)
|
(replacement, newpos)
|
||||||
|
|
||||||
replacement is a unicode object that the encoder will encode and
|
replacement is a unicode object that the encoder will encode and
|
||||||
emit instead of the unencodable object[start:end] part, newpos
|
emit instead of the unencodable ``object[start:end]`` part, newpos
|
||||||
specifies a new position within object, where (after encoding the
|
specifies a new position within object, where (after encoding the
|
||||||
replacement) the encoder will continue encoding.
|
replacement) the encoder will continue encoding.
|
||||||
|
|
||||||
Negative values for newpos are treated as being relative to
|
Negative values for newpos are treated as being relative to
|
||||||
end of object. If newpos is out of bounds the encoder will raise
|
end of object. If newpos is out of bounds the encoder will raise
|
||||||
an IndexError.
|
an ``IndexError``.
|
||||||
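The `(replacement, newpos)` contract can be illustrated with two hypothetical handlers, run here under Python 3 with the ASCII codec. `bracket_hex` returns a replacement that is itself encodable and resumes at `exc.end`; `bad_position` deliberately returns an out-of-range `newpos` to show the `IndexError` case. Both handler names and both registry names are assumptions made for this sketch.

```python
import codecs


def bracket_hex(exc):
    """Hypothetical handler: write each bad character as <XXXX>."""
    if not isinstance(exc, UnicodeEncodeError):
        raise TypeError("can't handle %s" % type(exc).__name__)
    bad = exc.object[exc.start:exc.end]
    replacement = "".join("<%04X>" % ord(ch) for ch in bad)
    # Continue encoding right after the unencodable run.
    return (replacement, exc.end)


def bad_position(exc):
    """Hypothetical handler that returns a newpos far outside the input."""
    return ("", len(exc.object) + 100)


codecs.register_error("example.bracket_hex", bracket_hex)
codecs.register_error("example.bad_position", bad_position)

result = "x\u20ac\u00e9y".encode("ascii", "example.bracket_hex")

# An out-of-bounds newpos makes the encoder raise IndexError.
try:
    "x\u20acy".encode("ascii", "example.bad_position")
    out_of_bounds = False
except IndexError:
    out_of_bounds = True
```

The output is the same whether the encoder reports the two adjacent unencodable characters in one call or in two, since the handler formats each character of `object[start:end]` separately.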
|
|
||||||
If the replacement string itself contains an unencodable character
|
If the replacement string itself contains an unencodable character
|
||||||
the encoder raises the exception object (but may set a different
|
the encoder raises the exception object (but may set a different
|
||||||
|
@ -115,44 +118,45 @@ Specification
|
||||||
Should further encoding errors occur, the encoder is allowed to
|
Should further encoding errors occur, the encoder is allowed to
|
||||||
reuse the exception object for the next call to the callback.
|
reuse the exception object for the next call to the callback.
|
||||||
Furthermore, the encoder is allowed to cache the result of
|
Furthermore, the encoder is allowed to cache the result of
|
||||||
codecs.lookup_error.
|
``codecs.lookup_error``.
|
||||||
|
|
||||||
If the callback does not know how to handle the exception, it must
|
If the callback does not know how to handle the exception, it must
|
||||||
raise a TypeError.
|
raise a ``TypeError``.
|
||||||
|
|
||||||
Decoding works similar to encoding with the following differences:
|
Decoding works similar to encoding with the following differences:
|
||||||
The exception class is named UnicodeDecodeError and the attribute
|
|
||||||
|
* The exception class is named ``UnicodeDecodeError`` and the attribute
|
||||||
object is the original 8bit string that the decoder is currently
|
object is the original 8bit string that the decoder is currently
|
||||||
decoding.
|
decoding.
|
||||||
|
|
||||||
The decoder will call the callback with those bytes that
|
* The decoder will call the callback with those bytes that
|
||||||
constitute one undecodable sequence, even if there is more than
|
constitute one undecodable sequence, even if there is more than
|
||||||
one undecodable sequence that is undecodable for the same reason
|
one undecodable sequence that is undecodable for the same reason
|
||||||
directly after the first one. E.g. for the "unicode-escape"
|
directly after the first one. E.g. for the "unicode-escape"
|
||||||
encoding, when decoding the illegal string "\\u00\\u01x", the
|
encoding, when decoding the illegal string ``\\u00\\u01x``, the
|
||||||
callback will be called twice (once for "\\u00" and once for
|
callback will be called twice (once for ``\\u00`` and once for
|
||||||
"\\u01"). This is done to be able to generate the correct number
|
``\\u01``). This is done to be able to generate the correct number
|
||||||
of replacement characters.
|
of replacement characters.
|
||||||
|
|
||||||
The replacement returned from the callback is a unicode object
|
* The replacement returned from the callback is a unicode object
|
||||||
that will be emitted by the decoder as-is without further
|
that will be emitted by the decoder as-is without further
|
||||||
processing instead of the undecodable object[start:end] part.
|
processing instead of the undecodable ``object[start:end]`` part.
|
||||||
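A decoding callback follows the same shape. The sketch below assumes Python 3, where the `object` attribute of `UnicodeDecodeError` is a `bytes` object; the handler `hex_marker` and its registry name are hypothetical.

```python
import codecs


def hex_marker(exc):
    """Hypothetical handler: show each undecodable byte as [XX]."""
    if not isinstance(exc, UnicodeDecodeError):
        raise TypeError("can't handle %s" % type(exc).__name__)
    bad = exc.object[exc.start:exc.end]  # the undecodable bytes
    # The returned text is emitted as-is, without further processing.
    return ("".join("[%02X]" % b for b in bad), exc.end)


codecs.register_error("example.hex_marker", hex_marker)

text = b"ab\xffcd".decode("ascii", "example.hex_marker")
```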
|
|
||||||
There is a third API that uses the old strict/ignore/replace error
|
There is a third API that uses the old strict/ignore/replace error
|
||||||
handling scheme:
|
handling scheme::
|
||||||
|
|
||||||
PyUnicode_TranslateCharmap/unicode.translate
|
PyUnicode_TranslateCharmap/unicode.translate
|
||||||
|
|
||||||
The proposed patch will enhance PyUnicode_TranslateCharmap, so
|
The proposed patch will enhance ``PyUnicode_TranslateCharmap``, so
|
||||||
that it also supports the callback registry. This has the
|
that it also supports the callback registry. This has the
|
||||||
additional side effect that PyUnicode_TranslateCharmap will
|
additional side effect that ``PyUnicode_TranslateCharmap`` will
|
||||||
support multi-character replacement strings (see SF feature
|
support multi-character replacement strings (see SF feature
|
||||||
request #403100 [1]).
|
request #403100 [1]_).
|
||||||
|
|
||||||
For PyUnicode_TranslateCharmap the exception class will be named
|
For ``PyUnicode_TranslateCharmap`` the exception class will be named
|
||||||
UnicodeTranslateError. PyUnicode_TranslateCharmap will collect
|
``UnicodeTranslateError``. ``PyUnicode_TranslateCharmap`` will collect
|
||||||
all consecutive untranslatable characters (i.e. those that map to
|
all consecutive untranslatable characters (i.e. those that map to
|
||||||
None) and call the callback with them. The replacement returned
|
``None``) and call the callback with them. The replacement returned
|
||||||
from the callback is a unicode object that will be put in the
|
from the callback is a unicode object that will be put in the
|
||||||
translated result as-is, without further processing.
|
translated result as-is, without further processing.
|
||||||
|
|
||||||
|
@ -163,9 +167,9 @@ Specification
|
||||||
callback names: "backslashreplace" and "xmlcharrefreplace", which
|
callback names: "backslashreplace" and "xmlcharrefreplace", which
|
||||||
can be used for encoding and translating and which will also be
|
can be used for encoding and translating and which will also be
|
||||||
implemented in-place for all encoders and
|
implemented in-place for all encoders and
|
||||||
PyUnicode_TranslateCharmap.
|
``PyUnicode_TranslateCharmap``.
|
||||||
|
|
||||||
The Python equivalent of these five callbacks will look like this:
|
The Python equivalent of these five callbacks will look like this::
|
||||||
|
|
||||||
def strict(exc):
|
def strict(exc):
|
||||||
raise exc
|
raise exc
|
||||||
|
@ -212,16 +216,17 @@ Specification
|
||||||
raise TypeError("can't handle %s" % exc.__class__.__name__)
|
raise TypeError("can't handle %s" % exc.__class__.__name__)
|
||||||
|
|
||||||
These five callback handlers will also be accessible to Python as
|
These five callback handlers will also be accessible to Python as
|
||||||
codecs.strict_error, codecs.ignore_error, codecs.replace_error,
|
``codecs.strict_error``, ``codecs.ignore_error``, ``codecs.replace_error``,
|
||||||
codecs.backslashreplace_error and codecs.xmlcharrefreplace_error.
|
``codecs.backslashreplace_error`` and ``codecs.xmlcharrefreplace_error``.
|
||||||
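For illustration, the two new predefined handlers can be tried directly from Python 3. One caveat that is not part of this diff: in released versions of Python the module-level functions are spelled with a trailing "s" (for example `codecs.xmlcharrefreplace_errors`), which is the spelling the sketch below uses.

```python
import codecs

text = "caf\u00e9 \u20ac"

# XML character references for the two non-ASCII characters.
as_xml = text.encode("ascii", "xmlcharrefreplace")

# Python-style backslash escapes for the same characters.
as_escapes = text.encode("ascii", "backslashreplace")

# The predefined handlers are also reachable through the registry.
same_handler = (
    codecs.lookup_error("xmlcharrefreplace") is codecs.xmlcharrefreplace_errors
)
```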
|
|
||||||
|
|
||||||
Rationale
|
Rationale
|
||||||
|
=========
|
||||||
|
|
||||||
Most legacy encodings do not support the full range of Unicode
|
Most legacy encodings do not support the full range of Unicode
|
||||||
characters. For these cases many high level protocols support a
|
characters. For these cases many high level protocols support a
|
||||||
way of escaping a Unicode character (e.g. Python itself supports
|
way of escaping a Unicode character (e.g. Python itself supports
|
||||||
the \x, \u and \U convention, XML supports character references
|
the ``\x``, ``\u`` and ``\U`` convention, XML supports character references
|
||||||
via &#xxx; etc.).
|
via &#xxx; etc.).
|
||||||
|
|
||||||
When implementing such an encoding algorithm, a problem with the
|
When implementing such an encoding algorithm, a problem with the
|
||||||
|
@ -231,12 +236,16 @@ Rationale
|
||||||
because encode does not provide any information about the location
|
because encode does not provide any information about the location
|
||||||
of the error(s), so
|
of the error(s), so
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
# (1)
|
# (1)
|
||||||
us = u"xxx"
|
us = u"xxx"
|
||||||
s = us.encode(encoding)
|
s = us.encode(encoding)
|
||||||
|
|
||||||
has to be replaced by
|
has to be replaced by
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
# (2)
|
# (2)
|
||||||
us = u"xxx"
|
us = u"xxx"
|
||||||
v = []
|
v = []
|
||||||
|
@ -257,7 +266,7 @@ Rationale
|
||||||
character.
|
character.
|
||||||
|
|
||||||
To work around this problem, a stream writer - which keeps state
|
To work around this problem, a stream writer - which keeps state
|
||||||
between calls to the encoding function - has to be used:
|
between calls to the encoding function - has to be used::
|
||||||
|
|
||||||
# (3)
|
# (3)
|
||||||
us = u"xxx"
|
us = u"xxx"
|
||||||
|
@ -274,7 +283,7 @@ Rationale
|
||||||
s = v.getvalue()
|
s = v.getvalue()
|
||||||
|
|
||||||
To compare the speed of (1) and (3) the following test script has
|
To compare the speed of (1) and (3) the following test script has
|
||||||
been used:
|
been used::
|
||||||
|
|
||||||
# (4)
|
# (4)
|
||||||
import time
|
import time
|
||||||
|
@ -306,7 +315,7 @@ Rationale
|
||||||
print "2:", t3-t2
|
print "2:", t3-t2
|
||||||
print "factor:", (t3-t2)/(t2-t1)
|
print "factor:", (t3-t2)/(t2-t1)
|
||||||
|
|
||||||
On Linux this gives the following output (with Python 2.3a0):
|
On Linux this gives the following output (with Python 2.3a0)::
|
||||||
|
|
||||||
1: 0.274321913719
|
1: 0.274321913719
|
||||||
2: 51.1284689903
|
2: 51.1284689903
|
||||||
|
@ -316,19 +325,23 @@ Rationale
|
||||||
|
|
||||||
Callbacks must be stateless, because as soon as a callback is
|
Callbacks must be stateless, because as soon as a callback is
|
||||||
registered it is available globally and can be called by multiple
|
registered it is available globally and can be called by multiple
|
||||||
encode() calls. To be able to use stateful callbacks, the errors
|
``encode()`` calls. To be able to use stateful callbacks, the errors
|
||||||
parameter for encode/decode/translate would have to be changed
|
parameter for encode/decode/translate would have to be changed
|
||||||
from char * to PyObject *, so that the callback could be used
|
from ``char *`` to ``PyObject *``, so that the callback could be used
|
||||||
directly, without the need to register the callback globally. As
|
directly, without the need to register the callback globally. As
|
||||||
this requires changes to lots of C prototypes, this approach was
|
this requires changes to lots of C prototypes, this approach was
|
||||||
rejected.
|
rejected.
|
||||||
|
|
||||||
Currently all encoding/decoding functions have arguments
|
Currently all encoding/decoding functions have arguments
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
const Py_UNICODE *p, int size
|
const Py_UNICODE *p, int size
|
||||||
|
|
||||||
or
|
or
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
const char *p, int size
|
const char *p, int size
|
||||||
|
|
||||||
to specify the unicode characters/8bit characters to be
|
to specify the unicode characters/8bit characters to be
|
||||||
|
@ -343,35 +356,36 @@ Rationale
|
||||||
For stream readers/writers the errors attribute must be changeable
|
For stream readers/writers the errors attribute must be changeable
|
||||||
to be able to switch between different error handling methods
|
to be able to switch between different error handling methods
|
||||||
during the lifetime of the stream reader/writer. This is currently
|
during the lifetime of the stream reader/writer. This is currently
|
||||||
the case for codecs.StreamReader and codecs.StreamWriter and
|
the case for ``codecs.StreamReader`` and ``codecs.StreamWriter`` and
|
||||||
all their subclasses. All core codecs and probably most of the
|
all their subclasses. All core codecs and probably most of the
|
||||||
third party codecs (e.g. JapaneseCodecs) derive their stream
|
third party codecs (e.g. ``JapaneseCodecs``) derive their stream
|
||||||
readers/writers from these classes so this already works,
|
readers/writers from these classes so this already works,
|
||||||
but the attribute errors should be documented as a requirement.
|
but the attribute errors should be documented as a requirement.
|
||||||
|
|
||||||
|
|
||||||
Implementation Notes
|
Implementation Notes
|
||||||
|
====================
|
||||||
|
|
||||||
A sample implementation is available as SourceForge patch #432401
|
A sample implementation is available as SourceForge patch #432401
|
||||||
[2] including a script for testing the speed of various
|
[2]_ including a script for testing the speed of various
|
||||||
string/encoding/error combinations and a test script.
|
string/encoding/error combinations and a test script.
|
||||||
|
|
||||||
Currently the new exception classes are old style Python
|
Currently the new exception classes are old style Python
|
||||||
classes. This means that accessing attributes results
|
classes. This means that accessing attributes results
|
||||||
in a dict lookup. The C API is implemented in a way
|
in a dict lookup. The C API is implemented in a way
|
||||||
that makes it possible to switch to new style classes
|
that makes it possible to switch to new style classes
|
||||||
behind the scene, if Exception (and UnicodeError) will
|
behind the scene, if ``Exception`` (and ``UnicodeError``) will
|
||||||
be changed to new style classes implemented in C for
|
be changed to new style classes implemented in C for
|
||||||
improved performance.
|
improved performance.
|
||||||
|
|
||||||
The class codecs.StreamReaderWriter uses the errors parameter for
|
The class ``codecs.StreamReaderWriter`` uses the errors parameter for
|
||||||
both reading and writing. To be more flexible this should
|
both reading and writing. To be more flexible this should
|
||||||
probably be changed to two separate parameters for reading and
|
probably be changed to two separate parameters for reading and
|
||||||
writing.
|
writing.
|
||||||
|
|
||||||
The errors parameter of PyUnicode_TranslateCharmap is not
|
The errors parameter of ``PyUnicode_TranslateCharmap`` is not
|
||||||
available to Python, which makes testing of the new functionality
|
available to Python, which makes testing of the new functionality
|
||||||
of PyUnicode_TranslateCharmap impossible with Python scripts. The
|
of ``PyUnicode_TranslateCharmap`` impossible with Python scripts. The
|
||||||
patch should add an optional argument errors to unicode.translate
|
patch should add an optional argument errors to unicode.translate
|
||||||
to expose the functionality and make testing possible.
|
to expose the functionality and make testing possible.
|
||||||
|
|
||||||
|
@ -379,11 +393,12 @@ Implementation Notes
|
||||||
unicode and want to use the new machinery can define their own
|
unicode and want to use the new machinery can define their own
|
||||||
exception classes and the strict handlers will automatically work
|
exception classes and the strict handlers will automatically work
|
||||||
with it. The other predefined error handlers are unicode specific
|
with it. The other predefined error handlers are unicode specific
|
||||||
and expect to get a Unicode(Encode|Decode|Translate)Error
|
and expect to get a ``Unicode(Encode|Decode|Translate)Error``
|
||||||
exception object so they won't work.
|
exception object so they won't work.
|
||||||
|
|
||||||
|
|
||||||
Backwards Compatibility
|
Backwards Compatibility
|
||||||
|
=======================
|
||||||
|
|
||||||
The semantics of unicode.encode with errors="replace" has changed:
|
The semantics of unicode.encode with errors="replace" has changed:
|
||||||
The old version always stored a ? character in the output string
|
The old version always stored a ? character in the output string
|
||||||
|
@ -393,26 +408,28 @@ Backwards Compatibility
|
||||||
supported encodings are ASCII based, and thus map ? to ?, this
|
supported encodings are ASCII based, and thus map ? to ?, this
|
||||||
should not be a problem in practice.
|
should not be a problem in practice.
|
||||||
|
|
||||||
Illegal values for the errors argument raised ValueError before,
|
Illegal values for the errors argument raised ``ValueError`` before,
|
||||||
now they will raise LookupError.
|
now they will raise ``LookupError``.
|
||||||
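The changed exception type can be observed with a short check, sketched here for Python 3. The handler name `example.unregistered` is hypothetical and is assumed not to have been registered anywhere.

```python
# An unknown errors value surfaces as LookupError once an encoding
# error actually has to be handled.
try:
    "\u20ac".encode("ascii", "example.unregistered")
    raised = None
except LookupError as err:
    raised = err
```

The input contains a character that ASCII cannot encode, so the codec is forced to resolve the handler name.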
|
|
||||||
|
|
||||||
References
|
References
|
||||||
|
==========
|
||||||
|
|
||||||
[1] SF feature request #403100
|
.. [1] SF feature request #403100
|
||||||
"Multicharacter replacements in PyUnicode_TranslateCharmap"
|
"Multicharacter replacements in PyUnicode_TranslateCharmap"
|
||||||
http://www.python.org/sf/403100
|
http://www.python.org/sf/403100
|
||||||
|
|
||||||
[2] SF patch #432401 "unicode encoding error callbacks"
|
.. [2] SF patch #432401 "unicode encoding error callbacks"
|
||||||
http://www.python.org/sf/432401
|
http://www.python.org/sf/432401
|
||||||
|
|
||||||
|
|
||||||
Copyright
|
Copyright
|
||||||
|
=========
|
||||||
|
|
||||||
This document has been placed in the public domain.
|
This document has been placed in the public domain.
|
||||||
|
|
||||||
|
|
||||||
|
..
|
||||||
Local Variables:
|
Local Variables:
|
||||||
mode: indented-text
|
mode: indented-text
|
||||||
indent-tabs-mode: nil
|
indent-tabs-mode: nil
|
||||||
|
|