reSTify PEP 100 (#422)

Parent:  5a15c92dcf
Commit:  0fcce00dad

pep-0100.txt | 447
@@ -5,12 +5,14 @@ Last-Modified: $Date$
 Author: mal@lemburg.com (Marc-André Lemburg)
 Status: Final
 Type: Standards Track
+Content-Type: text/x-rst
 Created: 10-Mar-2000
 Python-Version: 2.0
 Post-History:
 
 
 Historical Note
+===============
 
 This document was first written by Marc-Andre in the pre-PEP days,
 and was originally distributed as Misc/unicode.txt in Python
@@ -26,6 +28,7 @@ Historical Note
 
 
 Introduction
+============
 
 The idea of this proposal is to add native Unicode 3.0 support to
 Python in a way that makes use of Unicode strings as simple as
@@ -40,11 +43,9 @@ Introduction
 integration.
 
 The latest version of this document is always available at:
-
 http://starship.python.net/~lemburg/unicode-proposal.txt
 
 Older versions are available as:
-
 http://starship.python.net/~lemburg/unicode-proposal-X.X.txt
 
 [ed. note: new revisions should be made to this PEP document,
@@ -53,6 +54,7 @@ Introduction
 
 
 Conventions
+===========
 
 - In examples we use u = Unicode object and s = Python string
 
@@ -60,6 +62,7 @@ Conventions
 
 
 General Remarks
+===============
 
 - Unicode encoding names should be lower case on output and
   case-insensitive on input (they will be converted to lower case
@@ -70,10 +73,11 @@ General Remarks
   16' is written as 'utf-16'.
 
 - Codec modules should use the same names, but with hyphens
-  converted to underscores, e.g. utf_8, utf_16, iso_8859_1.
+  converted to underscores, e.g. ``utf_8``, ``utf_16``, ``iso_8859_1``.
 
 
 Unicode Default Encoding
+========================
 
 The Unicode implementation has to make some assumption about the
 encoding of 8-bit strings passed to it for coercion and about the
@@ -86,16 +90,16 @@ Unicode Default Encoding
 possible. The <default encoding> can be set and queried using the
 two sys module APIs:
 
-sys.setdefaultencoding(encoding)
-   --> Sets the <default encoding> used by the Unicode implementation.
+``sys.setdefaultencoding(encoding)``
+   Sets the <default encoding> used by the Unicode implementation.
    encoding has to be an encoding which is supported by the
    Python installation, otherwise, a LookupError is raised.
 
    Note: This API is only available in site.py! It is
    removed from the sys module by site.py after usage.
 
-sys.getdefaultencoding()
-   --> Returns the current <default encoding>.
+``sys.getdefaultencoding()``
+   Returns the current <default encoding>.
 
 If not otherwise defined or set, the <default encoding> defaults
 to 'ascii'. This encoding is also the startup default of Python
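[ed. note: a present-day sketch of this API, for the modern reader. Only the query half survives: ``sys.setdefaultencoding()`` was removed long ago, and in Python 3 the default encoding is permanently ``'utf-8'`` rather than the ``'ascii'`` startup default described above.]

```python
import sys

# In Python 2.0 the <default encoding> started out as 'ascii' and could be
# changed from site.py; in Python 3 only the query API remains and the
# answer is fixed.
default = sys.getdefaultencoding()

# Implicit decoding without an explicit encoding argument uses this default.
decoded = b"abc".decode()
```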
@@ -113,9 +117,10 @@ Unicode Default Encoding
 
 
 Unicode Constructors
+====================
 
 Python should provide a built-in constructor for Unicode strings
-which is available through __builtins__:
+which is available through ``__builtins__``::
 
    u = unicode(encoded_string[,encoding=<default encoding>][,errors="strict"])
 
@@ -129,17 +134,17 @@ Unicode Constructors
   ordinal (e.g. 'a' -> U+0061).
 
 - all existing defined Python escape sequences are interpreted as
-  Unicode ordinals; note that \xXXXX can represent all Unicode
-  ordinals, and \OOO (octal) can represent Unicode ordinals up to
+  Unicode ordinals; note that ``\xXXXX`` can represent all Unicode
+  ordinals, and ``\OOO`` (octal) can represent Unicode ordinals up to
   U+01FF.
 
-- a new escape sequence, \uXXXX, represents U+XXXX; it is a syntax
-  error to have fewer than 4 digits after \u.
+- a new escape sequence, ``\uXXXX``, represents U+XXXX; it is a syntax
+  error to have fewer than 4 digits after ``\u``.
 
 For an explanation of possible values for errors see the Codec
 section below.
 
-Examples:
+Examples::
 
    u'abc' -> U+0061 U+0062 U+0063
    u'\u1234' -> U+1234
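[ed. note: the ``\u`` escape introduced here survives unchanged in modern Python, where every string literal is a Unicode string. A minimal sketch of the examples above:]

```python
# \xXX and \uXXXX escapes denote Unicode ordinals, as specified above.
a = "abc"        # three characters: U+0061 U+0062 U+0063
b = "\u1234"     # one character: U+1234 (exactly four hex digits after \u)

ordinals = [hex(ord(c)) for c in a]
```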
@@ -147,7 +152,7 @@ Unicode Constructors
 
 The 'raw-unicode-escape' encoding is defined as follows:
 
-- \uXXXX sequence represent the U+XXXX Unicode character if and
+- ``\uXXXX`` sequences represent the U+XXXX Unicode character if and
   only if the number of leading backslashes is odd
 
 - all other characters represent themselves as Unicode ordinal
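[ed. note: the 'raw-unicode-escape' codec defined here still ships with Python 3 and behaves as specified: only a ``\uXXXX`` with an odd number of leading backslashes is taken as an escape, everything else passes through by ordinal. A small sketch:]

```python
# A single backslash before uXXXX (odd count) triggers the escape;
# the remaining bytes map straight to the same ordinals.
raw = b"\\u0061 plain"
text = raw.decode("raw-unicode-escape")
```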
@@ -164,17 +169,21 @@ Unicode Constructors
 
 
 Unicode Type Object
+===================
 
 Unicode objects should have the type UnicodeType with type name
 'unicode', made available through the standard types module.
 
 
 Unicode Output
+==============
 
 Unicode objects have a method .encode([encoding=<default encoding>])
 which returns a Python string encoding the Unicode string using the
 given scheme (see Codecs).
 
+::
+
    print u := print u.encode() # using the <default encoding>
 
    str(u) := u.encode() # using the <default encoding>
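[ed. note: ``.encode()`` survives into Python 3 essentially as proposed, except that it now returns a ``bytes`` object and ``str(u)`` no longer encodes. A sketch of the surviving behaviour:]

```python
# Encoding a (Unicode) string into bytes with an explicit scheme.
u = "äbc"
as_utf8 = u.encode("utf-8")      # default scheme in Python 3
as_latin1 = u.encode("latin-1")  # one byte per character here
```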
@@ -186,10 +195,11 @@ Unicode Output
 
 
 Unicode Ordinals
+================
 
 Since Unicode 3.0 has a 32-bit ordinal character set, the
 implementation should provide 32-bit aware ordinal conversion
-APIs:
+APIs::
 
    ord(u[:1]) (this is the standard ord() extended to work with Unicode
                objects)
@@ -199,8 +209,8 @@ Unicode Ordinals
    --> Unicode object for character i (provided it is 32-bit);
        ValueError otherwise
 
-Both APIs should go into __builtins__ just like their string
-counterparts ord() and chr().
+Both APIs should go into ``__builtins__`` just like their string
+counterparts ``ord()`` and ``chr()``.
 
 Note that Unicode provides space for private encodings. Usage of
 these can cause different output representations on different
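[ed. note: in Python 3 the built-ins ``ord()`` and ``chr()`` are exactly the 32-bit-aware pair asked for here (``unichr()`` is gone); ``chr()`` accepts any ordinal up to U+10FFFF and raises ``ValueError`` beyond that. A sketch:]

```python
# Round-tripping a code point outside the 16-bit range.
cp = 0x1F40D
snake = chr(cp)

# Out-of-range ordinals raise ValueError, as the proposal requires.
try:
    chr(0x110000)
    overflow_raised = False
except ValueError:
    overflow_raised = True
```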
@@ -209,6 +219,7 @@ Unicode Ordinals
 
 
 Comparison & Hash Value
+=======================
 
 Unicode objects should compare equal to other objects after these
 other objects have been coerced to Unicode. For strings this
@@ -220,10 +231,10 @@ Comparison & Hash Value
 not guaranteed to return the same hash values as the default
 encoded equivalent string representation.
 
-When compared using cmp() (or PyObject_Compare()) the
-implementation should mask TypeErrors raised during the conversion
+When compared using ``cmp()`` (or ``PyObject_Compare()``) the
+implementation should mask ``TypeErrors`` raised during the conversion
 to remain in synch with the string behavior. All other errors
-such as ValueErrors raised during coercion of strings to Unicode
+such as ``ValueErrors`` raised during coercion of strings to Unicode
 should not be masked and passed through to the user.
 
 In containment tests ('a' in u'abc' and u'a' in 'abc') both sides
@@ -233,11 +244,14 @@ Comparison & Hash Value
 
 
 Coercion
+========
 
 Using Python strings and Unicode objects to form new objects
 should always coerce to the more precise format, i.e. Unicode
 objects.
 
+::
+
    u + s := u + unicode(s)
 
    s + u := unicode(s) + u
@@ -247,6 +261,8 @@ Coercion
 Unicode and then applying the arguments to the Unicode method of
 the same name, e.g.
 
+::
+
    string.join((s,u),sep) := (s + sep) + u
 
    sep.join((s,u)) := (s + sep) + u
@@ -256,17 +272,19 @@ Coercion
 
 
 Exceptions
+==========
 
-UnicodeError is defined in the exceptions module as a subclass of
-ValueError. It is available at the C level via
-PyExc_UnicodeError. All exceptions related to Unicode
-encoding/decoding should be subclasses of UnicodeError.
+``UnicodeError`` is defined in the exceptions module as a subclass of
+``ValueError``. It is available at the C level via
+``PyExc_UnicodeError``. All exceptions related to Unicode
+encoding/decoding should be subclasses of ``UnicodeError``.
 
 
 Codecs (Coder/Decoders) Lookup
+==============================
 
 A Codec (see Codec Interface Definition) search registry should be
-implemented by a module "codecs":
+implemented by a module "codecs"::
 
    codecs.register(search_function)
 
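[ed. note: ``codecs.register()`` exists today with the interface proposed here, except that the search function now returns a ``codecs.CodecInfo`` object (the successor of the 4-tuple) or ``None``. A minimal sketch with a hypothetical codec name, reusing the stock latin-1 functions purely for illustration:]

```python
import codecs

def search(name):
    # Return a CodecInfo for the one name we know, None otherwise so
    # that the registry moves on to the next search function.
    if name == "identity":
        info = codecs.lookup("latin-1")
        return codecs.CodecInfo(info.encode, info.decode, name="identity")
    return None

codecs.register(search)

# The registered name is now usable anywhere an encoding name is accepted.
data = codecs.decode(b"abc", "identity")
```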
@@ -276,22 +294,20 @@ Codecs (Coder/Decoders) Lookup
 (encoder, decoder, stream_reader, stream_writer) taking the
 following arguments:
 
-encoder and decoder:
-
+encoder and decoder
    These must be functions or methods which have the same
-   interface as the .encode/.decode methods of Codec instances
+   interface as the ``.encode``/``.decode`` methods of Codec instances
    (see Codec Interface). The functions/methods are expected to
    work in a stateless mode.
 
-stream_reader and stream_writer:
-
+stream_reader and stream_writer
    These need to be factory functions with the following
-   interface:
+   interface::
 
        factory(stream,errors='strict')
 
    The factory functions must return objects providing the
-   interfaces defined by StreamWriter/StreamReader resp. (see
+   interfaces defined by ``StreamWriter``/``StreamReader`` resp. (see
    Codec Interface). Stream codecs can maintain state.
 
 Possible values for errors are defined in the Codec section
@@ -309,24 +325,27 @@ Codecs (Coder/Decoders) Lookup
 codecs tuple is found, a LookupError is raised. Otherwise, the
 codecs tuple is stored in the cache and returned to the caller.
 
-To query the Codec instance the following API should be used:
+To query the Codec instance the following API should be used::
 
    codecs.lookup(encoding)
 
 This will either return the found codecs tuple or raise a
-LookupError.
+``LookupError``.
 
 
 Standard Codecs
+===============
 
 Standard codecs should live inside an encodings/ package directory
-in the Standard Python Code Library. The __init__.py file of that
+in the Standard Python Code Library. The ``__init__.py`` file of that
 directory should include a Codec Lookup compatible search function
 implementing a lazy module based codec lookup.
 
 Python should provide a few standard codecs for the most relevant
 encodings, e.g.
 
+::
+
    'utf-8': 8-bit variable length encoding
    'utf-16': 16-bit variable length encoding (little/big endian)
    'utf-16-le': utf-16 but explicitly little endian
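[ed. note: ``codecs.lookup()`` works today exactly as described, with the four entries exposed as attributes of the returned ``CodecInfo``; names are case-insensitive on input and unknown names raise ``LookupError``. A sketch:]

```python
import codecs

# Case-insensitive lookup; the canonical lower-case name is reported back.
info = codecs.lookup("UTF-8")
text, consumed = info.decode(b"abc")

# Unknown encodings raise LookupError rather than returning None.
try:
    codecs.lookup("no-such-encoding")
    lookup_failed = False
except LookupError:
    lookup_failed = True
```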
@@ -350,6 +369,7 @@ Standard Codecs
 
 
 Codecs Interface Definition
+===========================
 
 The following base class should be defined in the module "codecs".
 They provide not only templates for use by encoding module
@@ -358,15 +378,17 @@ Codecs Interface Definition
 
 Note that the Codec Interface defined here is well suitable for a
 larger range of applications. The Unicode implementation expects
-Unicode objects on input for .encode() and .write() and character
-buffer compatible objects on input for .decode(). Output of
-.encode() and .read() should be a Python string and .decode() must
+Unicode objects on input for ``.encode()`` and ``.write()`` and character
+buffer compatible objects on input for ``.decode()``. Output of
+``.encode()`` and ``.read()`` should be a Python string and ``.decode()`` must
 return an Unicode object.
 
 First, we have the stateless encoders/decoders. These do not work
 in chunks as the stream codecs (see below) do, because all
 components are expected to be available in memory.
 
+::
+
    class Codec:
 
        """Defines the interface for stateless encoders/decoders.
@@ -415,14 +437,16 @@ Codecs Interface Definition
 
        """
 
-StreamWriter and StreamReader define the interface for stateful
+``StreamWriter`` and ``StreamReader`` define the interface for stateful
 encoders/decoders which work on streams. These allow processing
 of the data in chunks to efficiently use memory. If you have
-large strings in memory, you may want to wrap them with cStringIO
+large strings in memory, you may want to wrap them with ``cStringIO``
 objects and then use these codecs on them to be able to do chunk
 processing as well, e.g. to provide progress information to the
 user.
 
+::
+
    class StreamWriter(Codec):
 
        def __init__(self,stream,errors='strict'):
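[ed. note: the ``factory(stream, errors='strict')`` interface survives as ``codecs.getwriter()`` in modern Python; a chunked-write sketch using ``io.BytesIO`` as the stand-in byte stream (``cStringIO`` no longer exists):]

```python
import codecs
import io

# Wrap a byte stream in a stateful UTF-8 StreamWriter and feed it
# string chunks; the writer encodes each chunk onto the stream.
raw = io.BytesIO()
writer = codecs.getwriter("utf-8")(raw, errors="strict")
for chunk in ("Hello ", "wörld"):
    writer.write(chunk)

encoded = raw.getvalue()
```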
@@ -593,8 +617,8 @@ Codecs Interface Definition
            return getattr(self.stream,name)
 
 
-Stream codec implementors are free to combine the StreamWriter and
-StreamReader interfaces into one class. Even combining all these
+Stream codec implementors are free to combine the ``StreamWriter`` and
+``StreamReader`` interfaces into one class. Even combining all these
 with the Codec class should be possible.
 
 Implementors are free to add additional methods to enhance the
@@ -616,12 +640,14 @@ Codecs Interface Definition
 
 
 Whitespace
+==========
 
-The .split() method will have to know about what is considered
+The ``.split()`` method will have to know about what is considered
 whitespace in Unicode.
 
 
 Case Conversion
+===============
 
 Case conversion is rather complicated with Unicode data, since
 there are many different conditions to respect. See
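[ed. note: today's ``str.split()`` is Unicode-aware as required here; with no separator argument it breaks on any character the Unicode database classifies as whitespace, not only ASCII. A sketch:]

```python
# NO-BREAK SPACE (U+00A0) and EM SPACE (U+2003) both count as
# whitespace for the default split.
parts = "one\u00a0two\u2003three".split()
```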
@@ -635,24 +661,26 @@ Case Conversion
 (see the Unicode standard file SpecialCasing.txt) should be left
 to user land routines and not go into the core interpreter.
 
-The methods .capitalize() and .iscapitalized() should follow the
+The methods ``.capitalize()`` and ``.iscapitalized()`` should follow the
 case mapping algorithm defined in the above technical report as
 closely as possible.
 
 
 Line Breaks
+===========
 
 Line breaking should be done for all Unicode characters having the
 B property as well as the combinations CRLF, CR, LF (interpreted
 in that order) and other special line separators defined by the
 standard.
 
-The Unicode type should provide a .splitlines() method which
+The Unicode type should provide a ``.splitlines()`` method which
 returns a list of lines according to the above specification. See
 Unicode Methods.
 
 
 Unicode Character Properties
+============================
 
 A separate module "unicodedata" should provide a compact interface
 to all Unicode character properties defined in the standard's
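[ed. note: both proposals landed and still work as specified: ``.splitlines()`` honours the special Unicode line separators, and the ``unicodedata`` module exposes the character database. A combined sketch:]

```python
import unicodedata

# LINE SEPARATOR (U+2028) breaks a line just like \n and \r\n do.
lines = "a\nb\r\nc\u2028d".splitlines()

# unicodedata reports the general category that makes this so:
# 'Zl' is the line-separator category.
cat = unicodedata.category("\u2028")
```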
@@ -675,14 +703,16 @@ Unicode Character Properties
 
 
 Private Code Point Areas
+========================
 
 Support for these is left to user land Codecs and not explicitly
 integrated into the core. Note that due to the Internal Format
-being implemented, only the area between \uE000 and \uF8FF is
+being implemented, only the area between ``\uE000`` and ``\uF8FF`` is
 usable for private encodings.
 
 
 Internal Format
+===============
 
 The internal format for Unicode objects should use a Python
 specific fixed format <PythonUnicode> implemented as 'unsigned
@@ -720,10 +750,10 @@ Internal Format
 Interning is not needed (for now), since Python identifiers are
 defined as being ASCII only.
 
-codecs.BOM should return the byte order mark (BOM) for the format
+``codecs.BOM`` should return the byte order mark (BOM) for the format
 used internally. The codecs module should provide the following
-additional constants for convenience and reference (codecs.BOM
-will either be BOM_BE or BOM_LE depending on the platform):
+additional constants for convenience and reference (``codecs.BOM``
+will either be ``BOM_BE`` or ``BOM_LE`` depending on the platform)::
 
    BOM_BE: '\376\377'
    (corresponds to Unicode U+0000FEFF in UTF-16 on big endian
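[ed. note: these constants exist in the ``codecs`` module today, as ``bytes`` objects; ``codecs.BOM`` matches whichever of the two fits the native byte order. A sketch:]

```python
import codecs

# The UTF-16 byte order marks proposed above.
big = codecs.BOM_BE       # '\376\377' in the PEP's octal notation
little = codecs.BOM_LE    # '\377\376'
```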
@@ -744,19 +774,20 @@ Internal Format
 format, hence the illegal character definition.
 
 The configure script should provide aid in deciding whether Python
-can use the native wchar_t type or not (it has to be a 16-bit
+can use the native ``wchar_t`` type or not (it has to be a 16-bit
 unsigned type).
 
 
 Buffer Interface
+================
 
 Implement the buffer interface using the <defenc> Python string
-object as basis for bf_getcharbuf and the internal buffer for
-bf_getreadbuf. If bf_getcharbuf is requested and the <defenc>
+object as basis for ``bf_getcharbuf`` and the internal buffer for
+``bf_getreadbuf``. If ``bf_getcharbuf`` is requested and the <defenc>
 object does not yet exist, it is created first.
 
 Note that as special case, the parser marker "s#" will not return
-raw Unicode UTF-16 data (which the bf_getreadbuf returns), but
+raw Unicode UTF-16 data (which the ``bf_getreadbuf`` returns), but
 instead tries to encode the Unicode object using the default
 encoding and then returns a pointer to the resulting string object
 (or raises an exception in case the conversion fails). This was
@@ -768,13 +799,14 @@ Buffer Interface
 specification of the encoding to use.
 
 If you need to access the read buffer interface of Unicode
-objects, use the PyObject_AsReadBuffer() interface.
+objects, use the ``PyObject_AsReadBuffer()`` interface.
 
 The internal format can also be accessed using the
-'unicode-internal' codec, e.g. via u.encode('unicode-internal').
+'unicode-internal' codec, e.g. via ``u.encode('unicode-internal')``.
 
 
 Pickle/Marshalling
+==================
 
 Should have native Unicode object support. The objects should be
 encoded using platform independent encodings.
@@ -786,6 +818,7 @@ Pickle/Marshalling
 
 
 Regular Expressions
+===================
 
 Secret Labs AB is working on a Unicode-aware regular expression
 machinery. It works on plain 8-bit, UCS-2, and (optionally) UCS-4
@@ -799,10 +832,11 @@ Regular Expressions
 
 
 Formatting Markers
+==================
 
 Format markers are used in Python format strings. If Python
 strings are used as format strings, the following interpretations
-should be in effect:
+should be in effect::
 
    '%s': For Unicode objects this will cause coercion of the
          whole format string to Unicode. Note that you should use
@@ -814,71 +848,78 @@ Formatting Markers
 according to the format string. Numbers are first converted to
 strings and then to Unicode.
 
+::
+
    '%s': Python strings are interpreted as Unicode
          string using the <default encoding>. Unicode objects are
          taken as is.
 
 All other string formatters should work accordingly.
 
-Example:
+Example::
 
    u"%s %s" % (u"abc", "abc") == u"abc abc"
 
 
 Internal Argument Parsing
+=========================
 
-These markers are used by the PyArg_ParseTuple() APIs:
+These markers are used by the ``PyArg_ParseTuple()`` APIs:
 
-"U": Check for Unicode object and return a pointer to it
+"U"
+   Check for Unicode object and return a pointer to it
 
-"s": For Unicode objects: return a pointer to the object's
+"s"
+   For Unicode objects: return a pointer to the object's
    <defenc> buffer (which uses the <default encoding>).
 
-"s#": Access to the default encoded version of the Unicode object
+"s#"
+   Access to the default encoded version of the Unicode object
    (see Buffer Interface); note that the length relates to
    the length of the default encoded string rather than the
    Unicode object length.
 
-"t#": Same as "s#".
+"t#"
+   Same as "s#".
 
-"es":
-   Takes two parameters: encoding (const char *) and buffer
-   (char **).
+"es"
+   Takes two parameters: encoding (``const char *``) and buffer
+   (``char **``).
 
    The input object is first coerced to Unicode in the usual
    way and then encoded into a string using the given
    encoding.
 
    On output, a buffer of the needed size is allocated and
-   returned through *buffer as NULL-terminated string. The
+   returned through ``*buffer`` as NULL-terminated string. The
    encoded may not contain embedded NULL characters. The
-   caller is responsible for calling PyMem_Free() to free the
-   allocated *buffer after usage.
+   caller is responsible for calling ``PyMem_Free()`` to free the
+   allocated ``*buffer`` after usage.
 
-"es#":
-   Takes three parameters: encoding (const char *), buffer
-   (char **) and buffer_len (int *).
+"es#"
+   Takes three parameters: encoding (``const char *``), buffer
+   (``char **``) and buffer_len (``int *``).
 
    The input object is first coerced to Unicode in the usual
    way and then encoded into a string using the given
    encoding.
 
-   If *buffer is non-NULL, *buffer_len must be set to
-   sizeof(buffer) on input. Output is then copied to *buffer.
+   If ``*buffer`` is non-NULL, ``*buffer_len`` must be set to
+   ``sizeof(buffer)`` on input. Output is then copied to ``*buffer``.
 
-   If *buffer is NULL, a buffer of the needed size is
-   allocated and output copied into it. *buffer is then
+   If ``*buffer`` is NULL, a buffer of the needed size is
+   allocated and output copied into it. ``*buffer`` is then
    updated to point to the allocated memory area. The caller
-   is responsible for calling PyMem_Free() to free the
-   allocated *buffer after usage.
+   is responsible for calling ``PyMem_Free()`` to free the
+   allocated ``*buffer`` after usage.
 
-   In both cases *buffer_len is updated to the number of
+   In both cases ``*buffer_len`` is updated to the number of
    characters written (excluding the trailing NULL-byte).
    The output buffer is assured to be NULL-terminated.
 
 Examples:
 
-Using "es#" with auto-allocation:
+Using "es#" with auto-allocation::
 
    static PyObject *
    test_parser(PyObject *self,
@@ -902,7 +943,7 @@ Internal Argument Parsing
        return str;
    }
 
-Using "es" with auto-allocation returning a NULL-terminated string:
+Using "es" with auto-allocation returning a NULL-terminated string::
 
    static PyObject *
    test_parser(PyObject *self,
@@ -925,7 +966,7 @@ Internal Argument Parsing
        return str;
    }
 
-Using "es#" with a pre-allocated buffer:
+Using "es#" with a pre-allocated buffer::
 
    static PyObject *
    test_parser(PyObject *self,
@@ -951,6 +992,7 @@ Internal Argument Parsing
 
 
 File/Stream Output
+==================
 
 Since file.write(object) and most other stream writers use the
 "s#" or "t#" argument parsing marker for querying the data to
@@ -966,6 +1008,7 @@ File/Stream Output
 
 
 File/Stream Input
+=================
 
 Only the user knows what encoding the input data uses, so no
 special magic is applied. The user will have to explicitly
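[ed. note: this "only the user knows the encoding" rule is how Python still works; a sketch of explicit decoding of a byte stream, with ``io.TextIOWrapper`` as the modern counterpart of the stream readers proposed in this PEP and ``io.BytesIO`` standing in for a file:]

```python
import io

# The byte stream carries latin-1 data; nothing in Python guesses that,
# so the reader is told explicitly.
raw = io.BytesIO("hällo\n".encode("latin-1"))
reader = io.TextIOWrapper(raw, encoding="latin-1")
text = reader.read()
```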
@@ -975,8 +1018,9 @@ File/Stream Input
 
 
 Unicode Methods & Attributes
+============================
 
-All Python string methods, plus:
+All Python string methods, plus::
 
    .encode([encoding=<default encoding>][,errors="strict"])
    --> see Unicode Output
@@ -989,6 +1033,7 @@ Unicode Methods & Attributes
 
 
 Code Base
+=========
 
 We should use Fredrik Lundh's Unicode object implementation as
 basis. It already implements most of the string methods needed
@@ -999,6 +1044,7 @@ Code Base
 
 
 Test Cases
+==========
 
 Test cases should follow those in Lib/test/test_string.py and
 include additional checks for the Codec Registry and the Standard
@@ -1006,130 +1052,205 @@ Test Cases
 
 
 References
+==========
 
-Unicode Consortium:
-   http://www.unicode.org/
+* Unicode Consortium: http://www.unicode.org/
 
-Unicode FAQ:
-   http://www.unicode.org/unicode/faq/
+* Unicode FAQ: http://www.unicode.org/unicode/faq/
 
-Unicode 3.0:
-   http://www.unicode.org/unicode/standard/versions/Unicode3.0.html
+* Unicode 3.0: http://www.unicode.org/unicode/standard/versions/Unicode3.0.html
 
-Unicode-TechReports:
-   http://www.unicode.org/unicode/reports/techreports.html
+* Unicode-TechReports: http://www.unicode.org/unicode/reports/techreports.html
 
-Unicode-Mappings:
-   ftp://ftp.unicode.org/Public/MAPPINGS/
+* Unicode-Mappings: ftp://ftp.unicode.org/Public/MAPPINGS/
 
-Introduction to Unicode (a little outdated by still nice to read):
+* Introduction to Unicode (a little outdated but still nice to read):
   http://www.nada.kth.se/i18n/ucs/unicode-iso10646-oview.html
 
-For comparison:
+* For comparison:
   Introducing Unicode to ECMAScript (aka JavaScript) --
   http://www-4.ibm.com/software/developer/library/internationalization-support.html
 
-IANA Character Set Names:
+* IANA Character Set Names:
   ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets
 
-Discussion of UTF-8 and Unicode support for POSIX and Linux:
+* Discussion of UTF-8 and Unicode support for POSIX and Linux:
   http://www.cl.cam.ac.uk/~mgk25/unicode.html
 
-Encodings:
+* Encodings:
 
-   Overview:
-      http://czyborra.com/utf/
+  * Overview: http://czyborra.com/utf/
 
-   UCS-2:
-      http://www.uazone.org/multiling/unicode/ucs2.html
+  * UCS-2: http://www.uazone.org/multiling/unicode/ucs2.html
 
-   UTF-7:
-      Defined in RFC2152, e.g.
+  * UTF-7: Defined in RFC2152, e.g.
     http://www.uazone.org/multiling/ml-docs/rfc2152.txt
 
-   UTF-8:
-      Defined in RFC2279, e.g.
+  * UTF-8: Defined in RFC2279, e.g.
     https://tools.ietf.org/html/rfc2279
 
-   UTF-16:
-      http://www.uazone.org/multiling/unicode/wg2n1035.html
+  * UTF-16: http://www.uazone.org/multiling/unicode/wg2n1035.html
 
 
 History of this Proposal
+========================
 
 [ed. note: revisions prior to 1.7 are available in the CVS history
 of Misc/unicode.txt from the standard Python distribution. All
 subsequent history is available via the CVS revisions on this
 file.]
 
-1.7: Added note about the changed behaviour of "s#".
-1.6: Changed <defencstr> to <defenc> since this is the name used in the
-     implementation. Added notes about the usage of <defenc> in
+1.7
+---
+
+* Added note about the changed behaviour of "s#".
+
+1.6
+---
+
+* Changed <defencstr> to <defenc> since this is the name used in the
+  implementation.
+* Added notes about the usage of <defenc> in
  the buffer protocol implementation.
-1.5: Added notes about setting the <default encoding>. Fixed some
-     typos (thanks to Andrew Kuchling). Changed <defencstr> to
-     <utf8str>.
-1.4: Added note about mixed type comparisons and contains tests.
-     Changed treating of Unicode objects in format strings (if
-     used with '%s' % u they will now cause the format string to
+
+1.5
+---
+
+* Added notes about setting the <default encoding>.
+* Fixed some typos (thanks to Andrew Kuchling).
+* Changed <defencstr> to <utf8str>.
+
+1.4
+---
+
+* Added note about mixed type comparisons and contains tests.
+* Changed treating of Unicode objects in format strings (if
+  used with ``'%s' % u`` they will now cause the format string to
  be coerced to Unicode, thus producing a Unicode object on
-     return). Added link to IANA charset names (thanks to Lars
-     Marius Garshol). Added new codec methods .readline(),
-     .readlines() and .writelines().
-1.3: Added new "es" and "es#" parser markers
-1.2: Removed POD about codecs.open()
-1.1: Added note about comparisons and hash values. Added note about
-     case mapping algorithms. Changed stream codecs .read() and
-     .write() method to match the standard file-like object
+  return).
+* Added link to IANA charset names (thanks to Lars
+  Marius Garshol).
+* Added new codec methods ``.readline()``,
+  ``.readlines()`` and ``.writelines()``.
+
+1.3
+---
+
+* Added new "es" and "es#" parser markers
+
+1.2
+---
+
+* Removed POD about ``codecs.open()``
+
+1.1
+---
+
+* Added note about comparisons and hash values.
+* Added note about case mapping algorithms.
+* Changed stream codecs ``.read()`` and ``.write()`` method
+  to match the standard file-like object
  methods (bytes consumed information is no longer returned by
  the methods)
-1.0: changed encode Codec method to be symmetric to the decode method
+
+1.0
+---
+
+* changed encode Codec method to be symmetric to the decode method
  (they both return (object, data consumed) now and thus become
-     interchangeable); removed __init__ method of Codec class (the
+  interchangeable);
+* removed ``__init__`` method of Codec class (the
  methods are stateless) and moved the errors argument down to
the methods; made the Codec design more generic w/r to type
|
||||
of input and output objects; changed StreamWriter.flush to
|
||||
StreamWriter.reset in order to avoid overriding the stream's
|
||||
.flush() method; renamed .breaklines() to .splitlines();
|
||||
renamed the module unicodec to codecs; modified the File I/O
|
||||
section to refer to the stream codecs.
|
||||
0.9: changed errors keyword argument definition; added 'replace' error
|
||||
handling; changed the codec APIs to accept buffer like
|
||||
objects on input; some minor typo fixes; added Whitespace
|
||||
section and included references for Unicode characters that
|
||||
have the whitespace and the line break characteristic; added
|
||||
note that search functions can expect lower-case encoding
|
||||
names; dropped slicing and offsets in the codec APIs
|
||||
0.8: added encodings package and raw unicode escape encoding; untabified
|
||||
the proposal; added notes on Unicode format strings; added
|
||||
.breaklines() method
|
||||
0.7: added a whole new set of codec APIs; added a different
|
||||
encoder lookup scheme; fixed some names
|
||||
0.6: changed "s#" to "t#"; changed <defencbuf> to <defencstr> holding
|
||||
a real Python string object; changed Buffer Interface to
|
||||
delegate requests to <defencstr>'s buffer interface; removed
|
||||
the explicit reference to the unicodec.codecs dictionary (the
|
||||
the methods;
|
||||
* made the Codec design more generic w/r to type
|
||||
of input and output objects;
|
||||
* changed ``StreamWriter.flush`` to ``StreamWriter.reset`` in order to
|
||||
avoid overriding the stream's ``.flush()`` method;
|
||||
* renamed ``.breaklines()`` to ``.splitlines()``;
|
||||
* renamed the module unicodec to codecs;
|
||||
* modified the File I/O section to refer to the stream codecs.
|
||||
|
||||
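The symmetric encode/decode design adopted in 1.0 is still visible in the ``codecs`` module today. As an illustration (written in modern Python 3 syntax rather than the Python 2.0-era API this PEP targeted), both directions return an ``(object, length consumed)`` pair:

```python
import codecs

# Look up the stateless encode/decode functions for a codec.
encode = codecs.getencoder("utf-8")
decode = codecs.getdecoder("utf-8")

# Both calls return a (result, length_consumed) pair, which is
# what makes the two directions symmetric and interchangeable.
data, consumed = encode("abc")
text, used = decode(data)

print((data, consumed))  # (b'abc', 3)
print((text, used))      # ('abc', 3)
```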
0.9
---

* changed errors keyword argument definition;

* added 'replace' error handling;

* changed the codec APIs to accept buffer like
  objects on input;

* some minor typo fixes;

* added Whitespace section and included references for Unicode characters
  that have the whitespace and the line break characteristic;

* added note that search functions can expect lower-case encoding names;

* dropped slicing and offsets in the codec APIs

0.8
---

* added encodings package and raw unicode escape encoding;

* untabified the proposal;

* added notes on Unicode format strings;

* added ``.breaklines()`` method

0.7
---

* added a whole new set of codec APIs;

* added a different encoder lookup scheme;

* fixed some names

0.6
---

* changed "s#" to "t#";

* changed <defencbuf> to <defencstr> holding
  a real Python string object;

* changed Buffer Interface to
  delegate requests to <defencstr>'s buffer interface;

* removed the explicit reference to the unicodec.codecs dictionary (the
  module can implement this in a way fit for the purpose);

* removed the settable default encoding;

* moved ``UnicodeError`` from unicodec to exceptions;

* "s#" now returns the internal data;

* passed the UCS-2/UTF-16 checking from the Unicode constructor
  to the Codecs

0.5
---

* moved ``sys.bom`` to ``unicodec.BOM``;

* added sections on case mapping, private use encodings and
  Unicode character properties

0.4
---

* added Codec interface, notes on %-formatting;

* changed some encoding details;

* added comments on stream wrappers;

* fixed some discussion points (most important: Internal Format);

* clarified the 'unicode-escape' encoding, added encoding
  references

0.3
---

* added references, comments on codec modules, the internal format,
  bf_getcharbuffer and the RE engine;

* added 'unicode-escape'
  encoding proposed by Tim Peters and fixed repr(u) accordingly

0.2
---

* integrated Guido's suggestions, added stream codecs and file wrapping


0.1
---

* first version

..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil