PEP: 100
|
|
Title: Python Unicode Integration
|
|
Author: Marc-André Lemburg <mal@lemburg.com>
|
|
Status: Final
|
|
Type: Standards Track
|
|
Content-Type: text/x-rst
|
|
Created: 10-Mar-2000
|
|
Python-Version: 2.0
|
|
Post-History:
|
|
|
|
|
|
Historical Note
|
|
===============
|
|
|
|
This document was first written by Marc-Andre in the pre-PEP days,
|
|
and was originally distributed as Misc/unicode.txt in Python
|
|
distributions up to and including Python 2.1. The last revision of
|
|
the proposal in that location was labeled version 1.7 (CVS
|
|
revision 3.10). Because the document clearly serves the purpose
|
|
of an informational PEP in the post-PEP era, it has been moved
|
|
here and reformatted to comply with PEP guidelines. Future
|
|
revisions will be made to this document, while Misc/unicode.txt
|
|
will contain a pointer to this PEP.
|
|
|
|
-Barry Warsaw, PEP editor
|
|
|
|
|
|
Introduction
|
|
============
|
|
|
|
The idea of this proposal is to add native Unicode 3.0 support to
|
|
Python in a way that makes use of Unicode strings as simple as
|
|
possible without introducing too many pitfalls along the way.
|
|
|
|
Since this goal is not easy to achieve -- strings being one of the
|
|
most fundamental objects in Python -- we expect this proposal to
|
|
undergo some significant refinements.
|
|
|
|
Note that the current version of this proposal is still a bit
|
|
unsorted due to the many different aspects of the Unicode-Python
|
|
integration.
|
|
|
|
The latest version of this document is always available at:
|
|
http://starship.python.net/~lemburg/unicode-proposal.txt
|
|
|
|
Older versions are available as:
|
|
http://starship.python.net/~lemburg/unicode-proposal-X.X.txt
|
|
|
|
[ed. note: new revisions should be made to this PEP document,
|
|
while the historical record previous to version 1.7 should be
|
|
retrieved from MAL's url, or Misc/unicode.txt]
|
|
|
|
|
|
Conventions
|
|
===========
|
|
|
|
- In examples we use u = Unicode object and s = Python string
|
|
|
|
- 'XXX' markings indicate points of discussion (PODs)
|
|
|
|
|
|
General Remarks
|
|
===============
|
|
|
|
- Unicode encoding names should be lower case on output and
|
|
case-insensitive on input (they will be converted to lower case
|
|
by all APIs taking an encoding name as input).
|
|
|
|
- Encoding names should follow the name conventions as used by the
|
|
Unicode Consortium: spaces are converted to hyphens, e.g. 'utf
|
|
16' is written as 'utf-16'.
|
|
|
|
- Codec modules should use the same names, but with hyphens
|
|
converted to underscores, e.g. ``utf_8``, ``utf_16``, ``iso_8859_1``.
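The naming rules above can be sketched in a few lines of Python 3. The helper names ``normalize_encoding_name`` and ``codec_module_name`` are hypothetical and only illustrate the conventions; they are not part of the proposed API.

```python
def normalize_encoding_name(name):
    """Lower-case an encoding name and write spaces as hyphens."""
    return "-".join(name.lower().split())


def codec_module_name(name):
    """Derive the codec module name: hyphens become underscores."""
    return normalize_encoding_name(name).replace("-", "_")


print(normalize_encoding_name("UTF 16"))
print(codec_module_name("ISO-8859-1"))
```

Keeping the two steps separate mirrors the text: one spelling for encoding names, a second one for module names.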
|
|
|
|
|
|
Unicode Default Encoding
|
|
========================
|
|
|
|
The Unicode implementation has to make some assumption about the
|
|
encoding of 8-bit strings passed to it for coercion and about the
|
|
encoding to use as default for conversion of Unicode to strings when
|
|
no specific encoding is given. This encoding is called <default
|
|
encoding> throughout this text.
|
|
|
|
For this, the implementation maintains a global which can be set
|
|
in the site.py Python startup script. Subsequent changes are not
|
|
possible. The <default encoding> can be set and queried using the
|
|
two sys module APIs:
|
|
|
|
``sys.setdefaultencoding(encoding)``
|
|
Sets the <default encoding> used by the Unicode implementation.
|
|
encoding has to be an encoding which is supported by the
|
|
Python installation, otherwise, a LookupError is raised.
|
|
|
|
Note: This API is only available in site.py! It is
|
|
removed from the sys module by site.py after usage.
|
|
|
|
``sys.getdefaultencoding()``
|
|
Returns the current <default encoding>.
|
|
|
|
If not otherwise defined or set, the <default encoding> defaults
|
|
to 'ascii'. This encoding is also the startup default of Python
|
|
(and in effect before site.py is executed).
|
|
|
|
Note that the default site.py startup module contains disabled
|
|
optional code which can set the <default encoding> according to
|
|
the encoding defined by the current locale. The locale module is
|
|
used to extract the encoding from the locale default settings
|
|
defined by the OS environment (see locale.py). If the encoding
|
|
cannot be determined, is unknown or unsupported, the code defaults
|
|
to setting the <default encoding> to 'ascii'. To enable this
|
|
code, edit the site.py file or place the appropriate code into the
|
|
sitecustomize.py module of your Python installation.
|
|
|
|
|
|
Unicode Constructors
|
|
====================
|
|
|
|
Python should provide a built-in constructor for Unicode strings
|
|
which is available through ``__builtins__``::
|
|
|
|
u = unicode(encoded_string[,encoding=<default encoding>][,errors="strict"])
|
|
|
|
u = u'<unicode-escape encoded Python string>'
|
|
|
|
u = ur'<raw-unicode-escape encoded Python string>'
|
|
|
|
With the 'unicode-escape' encoding being defined as:
|
|
|
|
- all non-escape characters represent themselves as Unicode
|
|
ordinal (e.g. 'a' -> U+0061).
|
|
|
|
- all existing defined Python escape sequences are interpreted as
|
|
Unicode ordinals; note that ``\xXXXX`` can represent all Unicode
|
|
ordinals, and ``\OOO`` (octal) can represent Unicode ordinals up to
|
|
U+01FF.
|
|
|
|
- a new escape sequence, ``\uXXXX``, represents U+XXXX; it is a syntax
|
|
error to have fewer than 4 digits after ``\u``.
|
|
|
|
For an explanation of possible values for errors see the Codec
|
|
section below.
|
|
|
|
Examples::
|
|
|
|
u'abc' -> U+0061 U+0062 U+0063
|
|
u'\u1234' -> U+1234
|
|
u'abc\u1234\n' -> U+0061 U+0062 U+0063 U+1234 U+000A
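The 'unicode-escape' rules can be tried out with the codec of the same name. This is a minimal sketch for a Python 3 interpreter, where ``bytes`` stands in for the 8-bit Python string and ``str`` for the Unicode object; the variable names are illustrative only.

```python
# The 11 ASCII bytes  a b c \ u 1 2 3 4 \ n  (escapes not yet interpreted).
encoded = b"abc\\u1234\\n"

# Decoding interprets \uXXXX and the classic Python escapes.
decoded = encoded.decode("unicode_escape")

ordinals = ["U+%04X" % ord(ch) for ch in decoded]
print(" ".join(ordinals))
```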
|
|
|
|
The 'raw-unicode-escape' encoding is defined as follows:
|
|
|
|
- ``\uXXXX`` sequences represent the U+XXXX Unicode character if and
|
|
only if the number of leading backslashes is odd
|
|
|
|
- all other characters represent themselves as Unicode ordinal
|
|
(e.g. 'b' -> U+0062)
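The odd/even backslash rule is easiest to see by decoding a few byte strings. The following sketch assumes a Python 3 interpreter and its ``raw_unicode_escape`` codec; the variable names are illustrative.

```python
# One leading backslash (odd): \u1234 is taken as an escape sequence.
single = rb"\u1234".decode("raw_unicode_escape")

# Two leading backslashes (even): every character represents itself.
double = rb"\\u1234".decode("raw_unicode_escape")

# Other escapes such as \n are not interpreted by this encoding.
newline = rb"\n".decode("raw_unicode_escape")

print(len(single), len(double), len(newline))
```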
|
|
|
|
Note that you should provide some hint to the encoding you used to
|
|
write your programs as pragma line in one of the first few comment
|
|
lines of the source file (e.g. '# source file encoding: latin-1').
|
|
If you only use 7-bit ASCII then everything is fine and no such
|
|
notice is needed, but if you include Latin-1 characters not
|
|
defined in ASCII, it may well be worthwhile including a hint since
|
|
people in other countries will want to be able to read your source
|
|
strings too.
|
|
|
|
|
|
Unicode Type Object
|
|
===================
|
|
|
|
Unicode objects should have the type UnicodeType with type name
|
|
'unicode', made available through the standard types module.
|
|
|
|
|
|
Unicode Output
|
|
==============
|
|
|
|
Unicode objects have a method .encode([encoding=<default encoding>])
|
|
which returns a Python string encoding the Unicode string using the
|
|
given scheme (see Codecs).
|
|
|
|
::
|
|
|
|
print u := print u.encode() # using the <default encoding>
|
|
|
|
str(u) := u.encode() # using the <default encoding>
|
|
|
|
repr(u) := "u%s" % repr(u.encode('unicode-escape'))
|
|
|
|
Also see Internal Argument Parsing and Buffer Interface for
|
|
details on how other APIs written in C will treat Unicode objects.
|
|
|
|
|
|
Unicode Ordinals
|
|
================
|
|
|
|
Since Unicode 3.0 has a 32-bit ordinal character set, the
|
|
implementation should provide 32-bit aware ordinal conversion
|
|
APIs::
|
|
|
|
ord(u[:1]) (this is the standard ord() extended to work with Unicode
|
|
objects)
|
|
--> Unicode ordinal number (32-bit)
|
|
|
|
unichr(i)
|
|
--> Unicode object for character i (provided it is 32-bit);
|
|
ValueError otherwise
|
|
|
|
Both APIs should go into ``__builtins__`` just like their string
|
|
counterparts ``ord()`` and ``chr()``.
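As a rough illustration of the two conversion APIs, the sketch below uses Python 3, where ``chr()`` took over the role described here for ``unichr()``. The variable names are illustrative.

```python
code_point = ord("\u1234")      # Unicode object of length 1 -> ordinal
round_trip = chr(code_point)    # ordinal -> Unicode object

try:
    chr(0x110000)               # one past the last Unicode code point
    out_of_range_rejected = False
except ValueError:
    out_of_range_rejected = True

print(hex(code_point), round_trip == "\u1234", out_of_range_rejected)
```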
|
|
|
|
Note that Unicode provides space for private encodings. Usage of
|
|
these can cause different output representations on different
|
|
machines. This problem is not a Python or Unicode problem, but a
|
|
machine setup and maintenance one.
|
|
|
|
|
|
Comparison & Hash Value
|
|
=======================
|
|
|
|
Unicode objects should compare equal to other objects after these
|
|
other objects have been coerced to Unicode. For strings this
|
|
means that they are interpreted as Unicode string using the
|
|
<default encoding>.
|
|
|
|
Unicode objects should return the same hash value as their ASCII
|
|
equivalent strings. Unicode strings holding non-ASCII values are
|
|
not guaranteed to return the same hash values as the default
|
|
encoded equivalent string representation.
|
|
|
|
When compared using ``cmp()`` (or ``PyObject_Compare()``) the
|
|
implementation should mask ``TypeErrors`` raised during the conversion
|
|
to remain in synch with the string behavior. All other errors
|
|
such as ``ValueErrors`` raised during coercion of strings to Unicode
|
|
should not be masked and passed through to the user.
|
|
|
|
In containment tests ('a' in u'abc' and u'a' in 'abc') both sides
|
|
should be coerced to Unicode before applying the test. Errors
|
|
occurring during coercion (e.g. None in u'abc') should not be
|
|
masked.
|
|
|
|
|
|
Coercion
|
|
========
|
|
|
|
Using Python strings and Unicode objects to form new objects
|
|
should always coerce to the more precise format, i.e. Unicode
|
|
objects.
|
|
|
|
::
|
|
|
|
u + s := u + unicode(s)
|
|
|
|
s + u := unicode(s) + u
|
|
|
|
All string methods should delegate the call to an equivalent
|
|
Unicode object method call by converting all involved strings to
|
|
Unicode and then applying the arguments to the Unicode method of
|
|
the same name, e.g.
|
|
|
|
::
|
|
|
|
string.join((s,u),sep) := (s + sep) + u
|
|
|
|
sep.join((s,u)) := (s + sep) + u
|
|
|
|
For a discussion of %-formatting with respect to Unicode objects, see
|
|
Formatting Markers.
|
|
|
|
|
|
Exceptions
|
|
==========
|
|
|
|
``UnicodeError`` is defined in the exceptions module as a subclass of
|
|
``ValueError``. It is available at the C level via
|
|
``PyExc_UnicodeError``. All exceptions related to Unicode
|
|
encoding/decoding should be subclasses of ``UnicodeError``.
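Because of this hierarchy, code that already catches ``ValueError`` also catches Unicode conversion failures. A minimal sketch, assuming Python 3 (where the concrete exception raised is ``UnicodeDecodeError``):

```python
try:
    b"\xff".decode("ascii")     # 0xFF is not valid 7-bit ASCII
    caught = None
except UnicodeError as exc:
    caught = exc

print(type(caught).__name__, isinstance(caught, ValueError))
```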
|
|
|
|
|
|
Codecs (Coder/Decoders) Lookup
|
|
==============================
|
|
|
|
A Codec (see Codec Interface Definition) search registry should be
|
|
implemented by a module "codecs"::
|
|
|
|
codecs.register(search_function)
|
|
|
|
Search functions are expected to take one argument, the encoding
|
|
name in all lower case letters and with hyphens and spaces
|
|
converted to underscores, and return a tuple of functions
|
|
(encoder, decoder, stream_reader, stream_writer) taking the
|
|
following arguments:
|
|
|
|
encoder and decoder
|
|
These must be functions or methods which have the same
|
|
interface as the ``.encode``/``.decode`` methods of Codec instances
|
|
(see Codec Interface). The functions/methods are expected to
|
|
work in a stateless mode.
|
|
|
|
stream_reader and stream_writer
|
|
These need to be factory functions with the following
|
|
interface::
|
|
|
|
factory(stream,errors='strict')
|
|
|
|
The factory functions must return objects providing the
|
|
interfaces defined by ``StreamWriter``/``StreamReader`` resp. (see
|
|
Codec Interface). Stream codecs can maintain state.
|
|
|
|
Possible values for errors are defined in the Codec section
|
|
below.
|
|
|
|
In case a search function cannot find a given encoding, it should
|
|
return None.
|
|
|
|
Aliasing support for encodings is left to the search functions to
|
|
implement.
|
|
|
|
The codecs module will maintain an encoding cache for performance
|
|
reasons. Encodings are first looked up in the cache. If not
|
|
found, the list of registered search functions is scanned. If no
|
|
codecs tuple is found, a LookupError is raised. Otherwise, the
|
|
codecs tuple is stored in the cache and returned to the caller.
|
|
|
|
To query the Codec instance the following API should be used::
|
|
|
|
codecs.lookup(encoding)
|
|
|
|
This will either return the found codecs tuple or raise a
|
|
``LookupError``.
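The registry can be exercised as follows. This is a sketch for Python 3, where the value returned by ``codecs.lookup()`` is a ``CodecInfo`` object that still unpacks like the 4-tuple described above. The search function ``demo_search`` and the encoding name ``demolatin`` are hypothetical.

```python
import codecs


def demo_search(name):
    """Hypothetical search function: serve 'demolatin' as a Latin-1 alias."""
    if name == "demolatin":
        return codecs.lookup("latin-1")
    return None     # unknown encoding: let other search functions try


codecs.register(demo_search)

# The name is lower-cased before the search functions see it.
encoder, decoder, stream_reader, stream_writer = codecs.lookup("DemoLatin")
data, consumed = encoder("caf\xe9")
print(data, consumed)

try:
    codecs.lookup("no-such-encoding-xyz")
    missing_raises = False
except LookupError:
    missing_raises = True
```

Aliasing is done here inside the search function itself, which is where the proposal places that responsibility.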
|
|
|
|
|
|
Standard Codecs
|
|
===============
|
|
|
|
Standard codecs should live inside an encodings/ package directory
|
|
in the Standard Python Code Library. The ``__init__.py`` file of that
|
|
directory should include a Codec Lookup compatible search function
|
|
implementing a lazy module based codec lookup.
|
|
|
|
Python should provide a few standard codecs for the most relevant
|
|
encodings, e.g.
|
|
|
|
::
|
|
|
|
'utf-8': 8-bit variable length encoding
|
|
'utf-16': 16-bit variable length encoding (little/big endian)
|
|
'utf-16-le': utf-16 but explicitly little endian
|
|
'utf-16-be': utf-16 but explicitly big endian
|
|
'ascii': 7-bit ASCII codepage
|
|
'iso-8859-1': ISO 8859-1 (Latin 1) codepage
|
|
'unicode-escape': See Unicode Constructors for a definition
|
|
'raw-unicode-escape': See Unicode Constructors for a definition
|
|
'native': Dump of the Internal Format used by Python
|
|
|
|
Common aliases should also be provided per default, e.g.
|
|
'latin-1' for 'iso-8859-1'.
|
|
|
|
Note: 'utf-16' should be implemented by using and requiring byte
|
|
order marks (BOM) for file input/output.
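The difference between the UTF-16 variants and the Latin-1 alias can be observed directly. The sketch below assumes the codecs shipped with a current Python 3; it only shows the encoding side and makes no claim about how strictly a decoder insists on a BOM.

```python
import codecs

text = "a"
little = text.encode("utf-16-le")    # explicit byte order, no BOM
big = text.encode("utf-16-be")       # explicit byte order, no BOM

with_bom = text.encode("utf-16")     # native byte order, BOM first
starts_with_bom = with_bom.startswith(codecs.BOM)

latin = "\xe9".encode("latin-1")     # alias ...
iso = "\xe9".encode("iso-8859-1")    # ... of this encoding

print(little, big, starts_with_bom, latin == iso)
```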
|
|
|
|
All other encodings such as the CJK ones to support Asian scripts
|
|
should be implemented in separate packages which do not get
|
|
included in the core Python distribution and are not a part of
|
|
this proposal.
|
|
|
|
|
|
Codecs Interface Definition
|
|
===========================
|
|
|
|
The following base classes should be defined in the module "codecs".
|
|
They provide not only templates for use by encoding module
|
|
implementors, but also define the interface which is expected by
|
|
the Unicode implementation.
|
|
|
|
Note that the Codec Interface defined here is well suited for a
|
|
larger range of applications. The Unicode implementation expects
|
|
Unicode objects on input for ``.encode()`` and ``.write()`` and character
|
|
buffer compatible objects on input for ``.decode()``. Output of
|
|
``.encode()`` and ``.read()`` should be a Python string and ``.decode()`` must
|
|
return a Unicode object.
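To make the input and output contract concrete, here is a small stateless codec written against the ``codecs.Codec`` base class of Python 3, where ``str`` plays the role of the Unicode object and ``bytes`` that of the Python string. The class ``AsciiCodec`` is hypothetical; it simply delegates to the built-in ASCII codec.

```python
import codecs


class AsciiCodec(codecs.Codec):
    """Hypothetical stateless codec delegating to the built-in ASCII codec."""

    def encode(self, input, errors="strict"):
        # Return a tuple (output object, length consumed).
        return input.encode("ascii", errors), len(input)

    def decode(self, input, errors="strict"):
        data = bytes(input)     # accept any object exposing a read buffer
        return data.decode("ascii", errors), len(data)


codec = AsciiCodec()
encoded = codec.encode("abc")
replaced = codec.decode(b"a\xff", "replace")
ignored = codec.decode(b"a\xffb", "ignore")
print(encoded, replaced, ignored)
```

No state is kept on the instance, so the same object can be shared between callers.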
|
|
|
|
First, we have the stateless encoders/decoders. These do not work
|
|
in chunks as the stream codecs (see below) do, because all
|
|
components are expected to be available in memory.
|
|
|
|
::
|
|
|
|
class Codec:
|
|
|
|
"""Defines the interface for stateless encoders/decoders.
|
|
|
|
The .encode()/.decode() methods may implement different
|
|
error handling schemes by providing the errors argument.
|
|
These string values are defined:
|
|
|
|
'strict' - raise a ValueError (or a subclass)
|
|
'ignore' - ignore the character and continue with the next
|
|
'replace' - replace with a suitable replacement character;
|
|
Python will use the official U+FFFD
|
|
REPLACEMENT CHARACTER for the builtin Unicode
|
|
codecs.
|
|
"""
|
|
|
|
def encode(self,input,errors='strict'):
|
|
|
|
"""Encodes the object input and returns a tuple (output
|
|
object, length consumed).
|
|
|
|
errors defines the error handling to apply. It
|
|
defaults to 'strict' handling.
|
|
|
|
The method may not store state in the Codec instance.
|
|
Use StreamWriter/StreamReader for codecs which have to keep state in
|
|
order to make encoding/decoding efficient.
|
|
"""
|
|
|
|
def decode(self,input,errors='strict'):
|
|
|
|
"""Decodes the object input and returns a tuple (output
|
|
object, length consumed).
|
|
|
|
input must be an object which provides the
|
|
bf_getreadbuf buffer slot. Python strings, buffer
|
|
objects and memory mapped files are examples of objects
|
|
providing this slot.
|
|
|
|
errors defines the error handling to apply. It
|
|
defaults to 'strict' handling.
|
|
|
|
The method may not store state in the Codec instance.
|
|
Use StreamWriter/StreamReader for codecs which have to keep state in
|
|
order to make encoding/decoding efficient.
|
|
|
|
"""
|
|
|
|
``StreamWriter`` and ``StreamReader`` define the interface for stateful
|
|
encoders/decoders which work on streams. These allow processing
|
|
of the data in chunks to efficiently use memory. If you have
|
|
large strings in memory, you may want to wrap them with ``cStringIO``
|
|
objects and then use these codecs on them to be able to do chunk
|
|
processing as well, e.g. to provide progress information to the
|
|
user.
|
|
|
|
::
|
|
|
|
class StreamWriter(Codec):
|
|
|
|
def __init__(self,stream,errors='strict'):
|
|
|
|
"""Creates a StreamWriter instance.
|
|
|
|
stream must be a file-like object open for writing
|
|
(binary) data.
|
|
|
|
The StreamWriter may implement different error handling
|
|
schemes by providing the errors keyword argument.
|
|
These parameters are defined:
|
|
|
|
'strict' - raise a ValueError (or a subclass)
|
|
'ignore' - ignore the character and continue with the next
|
|
'replace'- replace with a suitable replacement character
|
|
"""
|
|
self.stream = stream
|
|
self.errors = errors
|
|
|
|
def write(self,object):
|
|
|
|
"""Writes the object's contents encoded to self.stream.
|
|
"""
|
|
data, consumed = self.encode(object,self.errors)
|
|
self.stream.write(data)
|
|
|
|
def writelines(self, list):
|
|
|
|
"""Writes the concatenated list of strings to the stream
|
|
using .write().
|
|
"""
|
|
self.write(''.join(list))
|
|
|
|
def reset(self):
|
|
|
|
"""Flushes and resets the codec buffers used for keeping state.
|
|
|
|
Calling this method should ensure that the data on the
|
|
output is put into a clean state, that allows appending
|
|
of new fresh data without having to rescan the whole
|
|
stream to recover state.
|
|
"""
|
|
pass
|
|
|
|
def __getattr__(self,name, getattr=getattr):
|
|
|
|
"""Inherit all other methods from the underlying stream.
|
|
"""
|
|
return getattr(self.stream,name)
|
|
|
|
|
|
class StreamReader(Codec):
|
|
|
|
def __init__(self,stream,errors='strict'):
|
|
|
|
"""Creates a StreamReader instance.
|
|
|
|
stream must be a file-like object open for reading
|
|
(binary) data.
|
|
|
|
The StreamReader may implement different error handling
|
|
schemes by providing the errors keyword argument.
|
|
These parameters are defined:
|
|
|
|
'strict' - raise a ValueError (or a subclass)
|
|
'ignore' - ignore the character and continue with the next
|
|
'replace'- replace with a suitable replacement character;
|
|
"""
|
|
self.stream = stream
|
|
self.errors = errors
|
|
|
|
def read(self,size=-1):
|
|
|
|
"""Decodes data from the stream self.stream and returns the
|
|
resulting object.
|
|
|
|
size indicates the approximate maximum number of bytes
|
|
to read from the stream for decoding purposes. The
|
|
decoder can modify this setting as appropriate. The
|
|
default value -1 indicates to read and decode as much
|
|
as possible. size is intended to prevent having to
|
|
decode huge files in one step.
|
|
|
|
The method should use a greedy read strategy meaning
|
|
that it should read as much data as is allowed within
|
|
the definition of the encoding and the given size, e.g.
|
|
if optional encoding endings or state markers are
|
|
available on the stream, these should be read too.
|
|
"""
|
|
# Unsliced reading:
|
|
if size < 0:
|
|
return self.decode(self.stream.read())[0]
|
|
|
|
# Sliced reading:
|
|
read = self.stream.read
|
|
decode = self.decode
|
|
data = read(size)
|
|
i = 0
|
|
while 1:
|
|
try:
|
|
object, decodedbytes = decode(data)
|
|
except ValueError,why:
|
|
# This method is slow but should work under pretty
|
|
# much all conditions; at most 10 tries are made
|
|
i = i + 1
|
|
newdata = read(1)
|
|
if not newdata or i > 10:
|
|
raise
|
|
data = data + newdata
|
|
else:
|
|
return object
|
|
|
|
def readline(self, size=None):
|
|
|
|
"""Read one line from the input stream and return the
|
|
decoded data.
|
|
|
|
Note: Unlike the .readlines() method, this method
|
|
inherits the line breaking knowledge from the
|
|
underlying stream's .readline() method -- there is
|
|
currently no support for line breaking using the codec
|
|
decoder due to lack of line buffering. Subclasses
|
|
should however, if possible, try to implement this
|
|
method using their own knowledge of line breaking.
|
|
|
|
size, if given, is passed as size argument to the
|
|
stream's .readline() method.
|
|
"""
|
|
if size is None:
|
|
line = self.stream.readline()
|
|
else:
|
|
line = self.stream.readline(size)
|
|
return self.decode(line)[0]
|
|
|
|
def readlines(self, sizehint=None):
|
|
|
|
"""Read all lines available on the input stream
|
|
and return them as list of lines.
|
|
|
|
Line breaks are implemented using the codec's decoder
|
|
method and are included in the list entries.
|
|
|
|
sizehint, if given, is passed as size argument to the
|
|
stream's .read() method.
|
|
"""
|
|
if sizehint is None:
|
|
data = self.stream.read()
|
|
else:
|
|
data = self.stream.read(sizehint)
|
|
return self.decode(data)[0].splitlines(1)
|
|
|
|
def reset(self):
|
|
|
|
"""Resets the codec buffers used for keeping state.
|
|
|
|
Note that no stream repositioning should take place.
|
|
This method is primarily intended to be able to recover
|
|
from decoding errors.
|
|
|
|
"""
|
|
pass
|
|
|
|
def __getattr__(self,name, getattr=getattr):
|
|
|
|
""" Inherit all other methods from the underlying stream.
|
|
"""
|
|
return getattr(self.stream,name)
|
|
|
|
|
|
Stream codec implementors are free to combine the ``StreamWriter`` and
|
|
``StreamReader`` interfaces into one class. Even combining all these
|
|
with the Codec class should be possible.
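A short round trip shows how the two stream interfaces are meant to be used together. The sketch assumes Python 3, in-memory ``io.BytesIO`` streams, and the factory functions returned by ``codecs.lookup()``; the sample text is arbitrary.

```python
import codecs
import io

encoder, decoder, stream_reader, stream_writer = codecs.lookup("utf-8")

# Writing: Unicode text goes in, encoded bytes reach the underlying stream.
raw = io.BytesIO()
writer = stream_writer(raw, errors="strict")
writer.write("line one\n")
writer.writelines(["line two \xe9\n", "line three\n"])
encoded = raw.getvalue()

# Reading: encoded bytes come from the stream, Unicode text comes out.
reader = stream_reader(io.BytesIO(encoded))
lines = reader.readlines()
print(lines)
```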
|
|
|
|
Implementors are free to add additional methods to enhance the
|
|
codec functionality or provide extra state information needed for
|
|
them to work. The internal codec implementation will only use the
|
|
above interfaces, though.
|
|
|
|
It is not required by the Unicode implementation to use these base
|
|
classes, only the interfaces must match; this allows writing
|
|
Codecs as extension types.
|
|
|
|
As a guideline, large mapping tables should be implemented using
|
|
static C data in separate (shared) extension modules. That way
|
|
multiple processes can share the same data.
|
|
|
|
A tool to auto-convert Unicode mapping files to mapping modules
|
|
should be provided to simplify support for additional mappings
|
|
(see References).
|
|
|
|
|
|
Whitespace
|
|
==========
|
|
|
|
The ``.split()`` method will have to know about what is considered
|
|
whitespace in Unicode.
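As an illustration, Python 3's ``str.split()`` treats Unicode space characters such as EM SPACE (U+2003) and IDEOGRAPHIC SPACE (U+3000) as separators, in addition to the ASCII ones. The sample text is arbitrary.

```python
text = "one\u2003two\u3000three\tfour"
words = text.split()    # no argument: split on any whitespace
print(words)
```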
|
|
|
|
|
|
Case Conversion
|
|
===============
|
|
|
|
Case conversion is rather complicated with Unicode data, since
|
|
there are many different conditions to respect. See
|
|
|
|
http://www.unicode.org/unicode/reports/tr13/
|
|
|
|
for some guidelines on implementing case conversion.
|
|
|
|
For Python, we should only implement the 1-1 conversions included
|
|
in Unicode. Locale dependent and other special case conversions
|
|
(see the Unicode standard file SpecialCasing.txt) should be left
|
|
to user land routines and not go into the core interpreter.
|
|
|
|
The methods ``.capitalize()`` and ``.iscapitalized()`` should follow the
|
|
case mapping algorithm defined in the above technical report as
|
|
closely as possible.
|
|
|
|
|
|
Line Breaks
|
|
===========
|
|
|
|
Line breaking should be done for all Unicode characters having the
|
|
B property as well as the combinations CRLF, CR, LF (interpreted
|
|
in that order) and other special line separators defined by the
|
|
standard.
|
|
|
|
The Unicode type should provide a ``.splitlines()`` method which
|
|
returns a list of lines according to the above specification. See
|
|
Unicode Methods.
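The rule can be checked against ``str.splitlines()`` in Python 3, which recognizes CRLF, CR, LF and, among others, the Unicode LINE SEPARATOR (U+2028). The sample text is arbitrary.

```python
text = "a\r\nb\rc\nd\u2028e"

lines = text.splitlines()        # line ends removed
kept = text.splitlines(True)     # line ends kept in each entry
print(lines)
```

Note that CRLF counts as a single line break because it is tried before CR and LF on their own.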
|
|
|
|
|
|
Unicode Character Properties
|
|
============================
|
|
|
|
A separate module "unicodedata" should provide a compact interface
|
|
to all Unicode character properties defined in the standard's
|
|
UnicodeData.txt file.
|
|
|
|
Among other things, these properties provide ways to recognize
|
|
numbers, digits, spaces, whitespace, etc.
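A few lookups give an impression of the kind of interface meant here. The sketch uses the ``unicodedata`` module as shipped with Python 3; the chosen characters are arbitrary examples.

```python
import unicodedata

upper_a = unicodedata.category("A")        # general category of a letter
space = unicodedata.category(" ")          # general category of a space
three = unicodedata.decimal("\u0663")      # ARABIC-INDIC DIGIT THREE
half = unicodedata.numeric("\u00bd")       # VULGAR FRACTION ONE HALF

print(upper_a, space, three, half)
```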
|
|
|
|
Since this module will have to provide access to all Unicode
|
|
characters, it will eventually have to contain the data from
|
|
UnicodeData.txt which takes up around 600kB. For this reason, the
|
|
data should be stored in static C data. This enables compilation
|
|
as shared module which the underlying OS can share between
|
|
processes (unlike normal Python code modules).
|
|
|
|
There should be a standard Python interface for accessing this
|
|
information so that other implementors can plug in their own
|
|
possibly enhanced versions, e.g. ones that do decompressing of the
|
|
data on-the-fly.
|
|
|
|
|
|
Private Code Point Areas
|
|
========================
|
|
|
|
Support for these is left to user land Codecs and not explicitly
|
|
integrated into the core. Note that due to the Internal Format
|
|
being implemented, only the area between ``\uE000`` and ``\uF8FF`` is
|
|
usable for private encodings.
|
|
|
|
|
|
Internal Format
|
|
===============
|
|
|
|
The internal format for Unicode objects should use a Python
|
|
specific fixed format <PythonUnicode> implemented as 'unsigned
|
|
short' (or another unsigned numeric type having 16 bits). Byte
|
|
order is platform dependent.
|
|
|
|
This format will hold UTF-16 encodings of the corresponding
|
|
Unicode ordinals. The Python Unicode implementation will address
|
|
these values as if they were UCS-2 values. UCS-2 and UTF-16 are
|
|
the same for all currently defined Unicode character points.
|
|
UTF-16 without surrogates provides access to about 64k characters
|
|
and covers all characters in the Basic Multilingual Plane (BMP) of
|
|
Unicode.
|
|
|
|
It is the Codec's responsibility to ensure that the data they pass
|
|
to the Unicode object constructor respects this assumption. The
|
|
constructor does not check the data for Unicode compliance or use
|
|
of surrogates.
|
|
|
|
Future implementations can extend the 16 bit restriction to the
|
|
full set of all UTF-16 addressable characters (around 1M
|
|
characters).
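The difference between BMP characters and characters that need surrogates shows up in the length of their UTF-16 encoding. A minimal sketch for Python 3, with illustrative variable names:

```python
bmp = "\u1234".encode("utf-16-be")          # inside the BMP: one 16-bit unit
astral = "\U00010000".encode("utf-16-be")   # outside the BMP: two units

# Split the second result into its high and low surrogate.
high = int.from_bytes(astral[:2], "big")
low = int.from_bytes(astral[2:], "big")

print(len(bmp), len(astral), hex(high), hex(low))
```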
|
|
|
|
The Unicode API should provide interface routines from
|
|
<PythonUnicode> to the compiler's wchar_t which can be 16 or 32
|
|
bit depending on the compiler/libc/platform being used.
|
|
|
|
Unicode objects should have a pointer to a cached Python string
|
|
object <defenc> holding the object's value using the <default
|
|
encoding>. This is needed for performance and internal parsing
|
|
(see Internal Argument Parsing) reasons. The buffer is filled
|
|
when the first conversion request to the <default encoding> is
|
|
issued on the object.
|
|
|
|
Interning is not needed (for now), since Python identifiers are
|
|
defined as being ASCII only.
|
|
|
|
``codecs.BOM`` should return the byte order mark (BOM) for the format
|
|
used internally. The codecs module should provide the following
|
|
additional constants for convenience and reference (``codecs.BOM``
|
|
will either be ``BOM_BE`` or ``BOM_LE`` depending on the platform)::
|
|
|
|
BOM_BE: '\376\377'
|
|
(corresponds to Unicode U+0000FEFF in UTF-16 on big endian
|
|
platforms == ZERO WIDTH NO-BREAK SPACE)
|
|
|
|
BOM_LE: '\377\376'
|
|
(corresponds to Unicode U+0000FFFE in UTF-16 on little endian
|
|
platforms == defined as being an illegal Unicode character)
|
|
|
|
BOM4_BE: '\000\000\376\377'
|
|
(corresponds to Unicode U+0000FEFF in UCS-4)
|
|
|
|
BOM4_LE: '\377\376\000\000'
|
|
(corresponds to Unicode U+0000FFFE in UCS-4)
|
|
|
|
Note that Unicode sees big endian byte order as being "correct".
|
|
The swapped order is taken to be an indicator for a "wrong"
|
|
format, hence the illegal character definition.
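The 16-bit constants can be inspected as follows. The sketch assumes the ``codecs`` module of Python 3, which provides ``BOM``, ``BOM_BE`` and ``BOM_LE`` as byte strings; the 4-byte constants are left out here.

```python
import codecs
import sys

big_matches = codecs.BOM_BE == b"\376\377"
little_matches = codecs.BOM_LE == b"\377\376"

# codecs.BOM follows the byte order of the platform.
native = codecs.BOM_LE if sys.byteorder == "little" else codecs.BOM_BE
matches_platform = codecs.BOM == native

# Read in big endian order, the mark is U+FEFF.
decoded = codecs.BOM_BE.decode("utf-16-be")
print(big_matches, little_matches, matches_platform, hex(ord(decoded)))
```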
|
|
|
|
The configure script should provide aid in deciding whether Python
|
|
can use the native ``wchar_t`` type or not (it has to be a 16-bit
|
|
unsigned type).
|
|
|
|
|
|
Buffer Interface
|
|
================
|
|
|
|
Implement the buffer interface using the <defenc> Python string
|
|
object as basis for ``bf_getcharbuf`` and the internal buffer for
|
|
``bf_getreadbuf``. If ``bf_getcharbuf`` is requested and the <defenc>
|
|
object does not yet exist, it is created first.
|
|
|
|
Note that as a special case, the parser marker "s#" will not return
|
|
raw Unicode UTF-16 data (which the ``bf_getreadbuf`` returns), but
|
|
instead tries to encode the Unicode object using the default
|
|
encoding and then returns a pointer to the resulting string object
|
|
(or raises an exception in case the conversion fails). This was
|
|
done in order to prevent accidentally writing binary data to an
|
|
output stream which the other end might not recognize.
|
|
|
|
This has the advantage of being able to write to output streams
|
|
(which typically use this interface) without additional
|
|
specification of the encoding to use.
|
|
|
|
If you need to access the read buffer interface of Unicode
|
|
objects, use the ``PyObject_AsReadBuffer()`` interface.
|
|
|
|
The internal format can also be accessed using the
|
|
'unicode-internal' codec, e.g. via ``u.encode('unicode-internal')``.
|
|
|
|
|
|
Pickle/Marshalling
|
|
==================
|
|
|
|
Should have native Unicode object support. The objects should be
|
|
encoded using platform independent encodings.
|
|
|
|
Marshal should use UTF-8 and Pickle should either choose
|
|
Raw-Unicode-Escape (in text mode) or UTF-8 (in binary mode) as
|
|
encoding. Using UTF-8 instead of UTF-16 has the advantage of
|
|
eliminating the need to store a BOM.
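The encodings chosen here can still be seen in the output of current implementations. The sketch below assumes CPython 3, with pickle protocol 0 as the text mode and protocol 2 as a binary mode; the sample value is arbitrary.

```python
import marshal
import pickle

value = "abc\u1234"

text_pickle = pickle.dumps(value, protocol=0)     # raw-unicode-escape
binary_pickle = pickle.dumps(value, protocol=2)   # UTF-8
marshalled = marshal.dumps(value)                 # UTF-8

print(text_pickle)
print(binary_pickle)
```

None of the three outputs carries a byte order mark, and all of them can be read back on a platform with a different byte order.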
|
|
|
|
|
|
Regular Expressions
|
|
===================
|
|
|
|
Secret Labs AB is working on a Unicode-aware regular expression
|
|
machinery. It works on plain 8-bit, UCS-2, and (optionally) UCS-4
|
|
internal character buffers.
|
|
|
|
Also see
|
|
|
|
http://www.unicode.org/unicode/reports/tr18/
|
|
|
|
for some remarks on how to treat Unicode REs.
|
|
|
|
|
|
Formatting Markers
|
|
==================
|
|
|
|
Format markers are used in Python format strings. If Python
|
|
strings are used as format strings, the following interpretations
|
|
should be in effect::
|
|
|
|
'%s': For Unicode objects this will cause coercion of the
|
|
whole format string to Unicode. Note that you should use
|
|
a Unicode format string to start with for performance
|
|
reasons.
|
|
|
|
In case the format string is a Unicode object, all parameters are
|
|
coerced to Unicode first and then put together and formatted
|
|
according to the format string. Numbers are first converted to
|
|
strings and then to Unicode.
|
|
|
|
::
|
|
|
|
'%s': Python strings are interpreted as Unicode
|
|
string using the <default encoding>. Unicode objects are
|
|
taken as is.
|
|
|
|
All other string formatters should work accordingly.
|
|
|
|
Example::
|
|
|
|
u"%s %s" % (u"abc", "abc") == u"abc abc"
|
|
|
|
|
|
Internal Argument Parsing
|
|
=========================
|
|
|
|
These markers are used by the ``PyArg_ParseTuple()`` APIs:
|
|
|
|
"U"
|
|
Check for Unicode object and return a pointer to it
|
|
|
|
"s"
|
|
For Unicode objects: return a pointer to the object's
|
|
<defenc> buffer (which uses the <default encoding>).
|
|
|
|
"s#"
|
|
Access to the default encoded version of the Unicode object
|
|
(see Buffer Interface); note that the length relates to
|
|
the length of the default encoded string rather than the
|
|
Unicode object length.
|
|
|
|
"t#"
|
|
Same as "s#".
|
|
|
|
"es"
|
|
Takes two parameters: encoding (``const char *``) and buffer
|
|
(``char **``).
|
|
|
|
The input object is first coerced to Unicode in the usual
|
|
way and then encoded into a string using the given
|
|
encoding.
|
|
|
|
On output, a buffer of the needed size is allocated and
|
|
returned through ``*buffer`` as NULL-terminated string. The
|
|
encoded string may not contain embedded NULL characters. The
|
|
caller is responsible for calling ``PyMem_Free()`` to free the
|
|
allocated ``*buffer`` after usage.
|
|
|
|
"es#"
|
|
Takes three parameters: encoding (``const char *``), buffer
|
|
(``char **``) and buffer_len (``int *``).
|
|
|
|
The input object is first coerced to Unicode in the usual
|
|
way and then encoded into a string using the given
|
|
encoding.
|
|
|
|
If ``*buffer`` is non-NULL, ``*buffer_len`` must be set to
|
|
``sizeof(buffer)`` on input. Output is then copied to ``*buffer``.
|
|
|
|
If ``*buffer`` is NULL, a buffer of the needed size is
|
|
allocated and output copied into it. ``*buffer`` is then
|
|
updated to point to the allocated memory area. The caller
|
|
is responsible for calling ``PyMem_Free()`` to free the
|
|
allocated ``*buffer`` after usage.
|
|
|
|
In both cases ``*buffer_len`` is updated to the number of
|
|
characters written (excluding the trailing NULL-byte).
|
|
The output buffer is assured to be NULL-terminated.
|
|
|
|
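
The length caveat noted for "s#" can be seen from the Python level as
well. The following is only a minimal sketch: it assumes UTF-8 as a
stand-in for the <default encoding> and uses present-day Python string
and bytes objects rather than the C API described above.

```python
# Assumption: UTF-8 stands in for the <default encoding>.
text = "Gr\u00fc\u00dfe \u20ac"   # seven characters, three of them non-ASCII

encoded = text.encode("utf-8")

char_count = len(text)        # length of the Unicode object
byte_count = len(encoded)     # length of the default encoded string

print(char_count, byte_count)
```

The two lengths differ as soon as a character needs more than one byte
in the chosen encoding, which is why callers of "s#" should not treat
the reported length as a character count.
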
Examples:

Using "es#" with auto-allocation::

    static PyObject *
    test_parser(PyObject *self,
                PyObject *args)
    {
        PyObject *str;
        const char *encoding = "latin-1";
        char *buffer = NULL;
        int buffer_len = 0;

        if (!PyArg_ParseTuple(args, "es#:test_parser",
                              encoding, &buffer, &buffer_len))
            return NULL;
        if (!buffer) {
            PyErr_SetString(PyExc_SystemError,
                            "buffer is NULL");
            return NULL;
        }
        str = PyString_FromStringAndSize(buffer, buffer_len);
        PyMem_Free(buffer);
        return str;
    }

Using "es" with auto-allocation returning a NULL-terminated string::

    static PyObject *
    test_parser(PyObject *self,
                PyObject *args)
    {
        PyObject *str;
        const char *encoding = "latin-1";
        char *buffer = NULL;

        if (!PyArg_ParseTuple(args, "es:test_parser",
                              encoding, &buffer))
            return NULL;
        if (!buffer) {
            PyErr_SetString(PyExc_SystemError,
                            "buffer is NULL");
            return NULL;
        }
        str = PyString_FromString(buffer);
        PyMem_Free(buffer);
        return str;
    }

Using "es#" with a pre-allocated buffer::

    static PyObject *
    test_parser(PyObject *self,
                PyObject *args)
    {
        PyObject *str;
        const char *encoding = "latin-1";
        char _buffer[10];
        char *buffer = _buffer;
        int buffer_len = sizeof(_buffer);

        if (!PyArg_ParseTuple(args, "es#:test_parser",
                              encoding, &buffer, &buffer_len))
            return NULL;
        if (!buffer) {
            PyErr_SetString(PyExc_SystemError,
                            "buffer is NULL");
            return NULL;
        }
        str = PyString_FromStringAndSize(buffer, buffer_len);
        return str;
    }


File/Stream Output
==================

Since file.write(object) and most other stream writers use the
"s#" or "t#" argument parsing marker for querying the data to
write, the default encoded string version of the Unicode object
will be written to the streams (see Buffer Interface).

For explicit handling of files using Unicode, the standard stream
codecs as available through the codecs module should be used.

The codecs module should provide a short-cut
open(filename,mode,encoding) which also assures that
mode contains the 'b' character when needed.

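
Explicit handling through the stream codecs might look like the
following sketch. It is written against present-day Python and makes
two assumptions that are not part of this proposal: an in-memory
``io.BytesIO`` object stands in for a file opened in binary mode, and
UTF-8 is picked as the encoding.

```python
import codecs
import io

# Hypothetical in-memory stream standing in for a binary-mode file.
raw = io.BytesIO()

# Wrap the byte stream in a StreamWriter so text is encoded on write.
writer = codecs.getwriter("utf-8")(raw)
writer.write("na\u00efve caf\u00e9\n")

stored = raw.getvalue()           # the encoded bytes that reached the stream

# Wrap the same bytes in a StreamReader to decode them again on read.
reader = codecs.getreader("utf-8")(io.BytesIO(stored))
round_trip = reader.read()
```

Because the encoding is named at the point where the stream is wrapped,
the data written does not depend on the <default encoding>.
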
File/Stream Input
=================

Only the user knows what encoding the input data uses, so no
special magic is applied. The user will have to explicitly
convert the string data to Unicode objects as needed or use the
file wrappers defined in the codecs module (see File/Stream
Output).

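
As an illustration of the explicit conversion, the sketch below reads
undecoded data and converts it by hand. It uses present-day Python,
and both the in-memory input stream and the choice of Latin-1 are
assumptions made only for the example.

```python
import io

# Hypothetical input: bytes as they might arrive from a file whose
# encoding (here Latin-1) is known only to the caller.
source = io.BytesIO(b"caf\xe9\n")

data = source.read()               # undecoded byte string
text = data.decode("latin-1")      # explicit conversion chosen by the user

# Guessing the wrong encoding fails under the default "strict" policy
# instead of silently producing wrong characters.
try:
    data.decode("utf-8")
    wrong_guess_failed = False
except UnicodeDecodeError:
    wrong_guess_failed = True
```
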
Unicode Methods & Attributes
============================

All Python string methods, plus::

    .encode([encoding=<default encoding>][,errors="strict"])
       --> see Unicode Output

    .splitlines([include_breaks=0])
       --> breaks the Unicode string into a list of (Unicode) lines;
           returns the lines with line breaks included, if
           include_breaks is true. See Line Breaks for a
           specification of how line breaking is done.

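
A short sketch of both methods follows. It is written against
present-day Python, where the ``include_breaks`` flag is named
``keepends``; the sample strings are made up for the example.

```python
text = "one\ntwo\r\nthree\u2028four"

# Without line breaks (the default).
lines = text.splitlines()
# With line breaks kept; passed positionally so the flag name does not matter.
lines_with_breaks = text.splitlines(True)

# .encode() with an explicit encoding and with the "replace" error policy.
utf8 = "\u20ac".encode("utf-8")
ascii_replaced = "\u20ac".encode("ascii", "replace")

# The default "strict" policy raises for unencodable characters.
try:
    "\u20ac".encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```

Joining ``lines_with_breaks`` gives back the original string, which is
a convenient check when line ends must be preserved.
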
Code Base
=========

We should use Fredrik Lundh's Unicode object implementation as a
basis. It already implements most of the string methods needed
and provides a well-written code base which we can build upon.

The object sharing implemented in Fredrik's implementation should
be dropped.


Test Cases
==========

Test cases should follow those in Lib/test/test_string.py and
include additional checks for the Codec Registry and the Standard
Codecs.


References
==========

* Unicode Consortium: http://www.unicode.org/

* Unicode FAQ: http://www.unicode.org/unicode/faq/

* Unicode 3.0: http://www.unicode.org/unicode/standard/versions/Unicode3.0.html

* Unicode-TechReports: http://www.unicode.org/unicode/reports/techreports.html

* Unicode-Mappings: ftp://ftp.unicode.org/Public/MAPPINGS/

* Introduction to Unicode (a little outdated but still nice to read):
  http://www.nada.kth.se/i18n/ucs/unicode-iso10646-oview.html

* For comparison:
  Introducing Unicode to ECMAScript (aka JavaScript) --
  http://www-4.ibm.com/software/developer/library/internationalization-support.html

* IANA Character Set Names:
  ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets

* Discussion of UTF-8 and Unicode support for POSIX and Linux:
  http://www.cl.cam.ac.uk/~mgk25/unicode.html

* Encodings:

  * Overview: http://czyborra.com/utf/

  * UCS-2: http://www.uazone.org/multiling/unicode/ucs2.html

  * UTF-7: Defined in :rfc:`2152`

  * UTF-8: Defined in :rfc:`2279`

  * UTF-16: http://www.uazone.org/multiling/unicode/wg2n1035.html


History of this Proposal
========================

[ed. note: revisions prior to 1.7 are available in the CVS history
of Misc/unicode.txt from the standard Python distribution. All
subsequent history is available via the CVS revisions on this
file.]

1.7
---

* Added note about the changed behaviour of "s#".

1.6
---

* Changed <defencstr> to <defenc> since this is the name used in the
  implementation.
* Added notes about the usage of <defenc> in
  the buffer protocol implementation.

1.5
---

* Added notes about setting the <default encoding>.
* Fixed some typos (thanks to Andrew Kuchling).
* Changed <defencstr> to <utf8str>.

1.4
---

* Added note about mixed type comparisons and contains tests.
* Changed treating of Unicode objects in format strings (if
  used with ``'%s' % u`` they will now cause the format string to
  be coerced to Unicode, thus producing a Unicode object on
  return).
* Added link to IANA charset names (thanks to Lars
  Marius Garshol).
* Added new codec methods ``.readline()``,
  ``.readlines()`` and ``.writelines()``.

1.3
---

* Added new "es" and "es#" parser markers

1.2
---

* Removed POD about ``codecs.open()``

1.1
---

* Added note about comparisons and hash values.
* Added note about case mapping algorithms.
* Changed stream codecs ``.read()`` and ``.write()`` method
  to match the standard file-like object
  methods (bytes consumed information is no longer returned by
  the methods)

1.0
---

* changed encode Codec method to be symmetric to the decode method
  (they both return (object, data consumed) now and thus become
  interchangeable);
* removed ``__init__`` method of Codec class (the
  methods are stateless) and moved the errors argument down to
  the methods;
* made the Codec design more generic w/r to type
  of input and output objects;
* changed ``StreamWriter.flush`` to ``StreamWriter.reset`` in order to
  avoid overriding the stream's ``.flush()`` method;
* renamed ``.breaklines()`` to ``.splitlines()``;
* renamed the module unicodec to codecs;
* modified the File I/O section to refer to the stream codecs.

0.9
---

* changed errors keyword argument definition;
* added 'replace' error handling;
* changed the codec APIs to accept buffer like
  objects on input;
* some minor typo fixes;
* added Whitespace section and included references for Unicode characters that
  have the whitespace and the line break characteristic;
* added note that search functions can expect lower-case encoding names;
* dropped slicing and offsets in the codec APIs

0.8
---

* added encodings package and raw unicode escape encoding;
* untabified the proposal;
* added notes on Unicode format strings;
* added ``.breaklines()`` method

0.7
---

* added a whole new set of codec APIs;
* added a different encoder lookup scheme;
* fixed some names

0.6
---

* changed "s#" to "t#";
* changed <defencbuf> to <defencstr> holding
  a real Python string object;
* changed Buffer Interface to
  delegate requests to <defencstr>'s buffer interface;
* removed the explicit reference to the unicodec.codecs dictionary (the
  module can implement this in a way fit for the purpose);
* removed the settable default encoding;
* moved ``UnicodeError`` from unicodec to exceptions;
* "s#" now returns the internal data;
* passed the UCS-2/UTF-16 checking from the Unicode constructor
  to the Codecs

0.5
---

* moved ``sys.bom`` to ``unicodec.BOM``;
* added sections on case mapping, private use encodings and
  Unicode character properties

0.4
---

* added Codec interface, notes on %-formatting;
* changed some encoding details;
* added comments on stream wrappers;
* fixed some discussion points (most important: Internal Format);
* clarified the 'unicode-escape' encoding, added encoding
  references

0.3
---

* added references, comments on codec modules, the internal format,
  bf_getcharbuffer and the RE engine;
* added 'unicode-escape'
  encoding proposed by Tim Peters and fixed repr(u) accordingly

0.2
---

* integrated Guido's suggestions, added stream codecs and file wrapping

0.1
---

* first version