PEP: 261
Title: Support for "wide" Unicode characters
Version: $Revision$
Last-Modified: $Date$
Author: Paul Prescod <paul@prescod.net>
Status: Final
Type: Standards Track
Content-Type: text/x-rst
Created: 27-Jun-2001
Python-Version: 2.2
Post-History: 27-Jun-2001


Abstract
========

Python 2.1 Unicode characters can have ordinals only up to ``2**16 - 1``.
This range corresponds to a range in Unicode known as the Basic
Multilingual Plane. There are now characters in Unicode that live
on other "planes". The largest addressable character in Unicode
has the ordinal ``17 * 2**16 - 1`` (``0x10ffff``). For readability, we
will call this TOPCHAR and call characters above the Basic
Multilingual Plane, up to TOPCHAR, "wide characters".
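
The arithmetic behind TOPCHAR can be checked with a short sketch. It
assumes the reading that "wide characters" are the code points above
the Basic Multilingual Plane; the names ``PLANES``, ``PLANE_SIZE``,
``BMP_MAX`` and ``is_wide`` are illustrative only and are not part of
this proposal.

```python
# Unicode is organised as 17 planes of 2**16 code points each.
PLANES = 17
PLANE_SIZE = 2 ** 16

BMP_MAX = PLANE_SIZE - 1            # last ordinal of the Basic Multilingual Plane
TOPCHAR = PLANES * PLANE_SIZE - 1   # last addressable code point


def is_wide(code_point):
    """Return True for code points above the BMP, up to and including TOPCHAR."""
    return BMP_MAX < code_point <= TOPCHAR
```

Under these definitions ``TOPCHAR`` equals ``0x10ffff`` and ``0x10000``
is the first wide character.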


Glossary
========

Character
  Used by itself, means the addressable units of a Python
  Unicode string.

Code point
  A code point is an integer between 0 and TOPCHAR.
  If you imagine Unicode as a mapping from integers to
  characters, each integer is a code point. But the
  integers between 0 and TOPCHAR that do not map to
  characters are also code points. Some will someday
  be used for characters. Some are guaranteed never
  to be used for characters.

Codec
  A set of functions for translating between physical
  encodings (e.g. on disk or coming in from a network)
  and logical Python objects.

Encoding
  Mechanism for representing abstract characters in terms of
  physical bits and bytes. Encodings allow us to store
  Unicode characters on disk and transmit them over networks
  in a manner that is compatible with other Unicode software.

Surrogate pair
  Two physical characters that represent a single logical
  character. Part of a convention for representing 32-bit
  code points in terms of two 16-bit code points.

Unicode string
  A Python type representing a sequence of code points with
  "string semantics" (e.g. case conversions, regular
  expression compatibility, etc.) Constructed with the
  ``unicode()`` function.
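
The surrogate pair convention from the glossary can be sketched as
follows. The bit layout is the one used by UTF-16; the function names
``to_surrogate_pair`` and ``from_surrogate_pair`` are hypothetical
helpers, not an API defined by this PEP.

```python
def to_surrogate_pair(code_point):
    """Split a wide code point into a (high, low) pair of 16-bit values."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP need a surrogate pair")
    offset = code_point - 0x10000        # 20 significant bits remain
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a (high, low) surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

For example, ``to_surrogate_pair(0x10000)`` gives ``(0xD800, 0xDC00)``,
and ``from_surrogate_pair`` reverses the mapping.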


Proposed Solution
=================

One solution would be to merely increase the maximum ordinal
to a larger value. Unfortunately the only straightforward
implementation of this idea is to use 4 bytes per character.
This has the effect of doubling the size of most Unicode
strings. In order to avoid imposing this cost on every
user, Python 2.2 will allow the 4-byte implementation as a
build-time option. Users can choose whether they care about
wide characters or prefer to preserve memory.
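
The memory trade-off can be estimated with a rough sketch. It models
only the character buffer (number of units times unit size), ignores
any per-object overhead, and uses the UTF-16 encoded length as a
stand-in for the number of 2-byte units. It is written for Python 3;
``code_units`` and ``storage_bytes`` are hypothetical names.

```python
def code_units(text, unit_bytes):
    """Count the fixed-size units needed to hold text.

    With 4-byte units every code point takes one unit. With 2-byte
    units a code point above the BMP takes two units (a surrogate pair).
    """
    if unit_bytes == 4:
        return len(text)                           # one unit per code point
    if unit_bytes == 2:
        return len(text.encode("utf-16-le")) // 2  # pairs count twice
    raise ValueError("unit_bytes must be 2 or 4")


def storage_bytes(text, unit_bytes):
    """Approximate buffer size in bytes for the given unit width."""
    return code_units(text, unit_bytes) * unit_bytes
```

For text that stays inside the Basic Multilingual Plane, such as
``"hello"``, the 4-byte layout needs twice the space of the 2-byte
layout (20 bytes against 10).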

The 4-byte option is called "wide ``Py_UNICODE``". The 2-byte option
is called "narrow ``Py_UNICODE``".

Most things will behave identically in the wide and narrow worlds.

* ``unichr(i)`` for 0 <= i < ``2**16`` (``0x10000``) always returns a
  length-one string.

* ``unichr(i)`` for ``2**16`` <= i <= TOPCHAR will return a
  length-one string on wide Python builds. On narrow builds it will
  raise ``ValueError``.

  **ISSUE**

    Python currently allows ``\U`` literals that cannot be
    represented as a single Python character. It generates two
    Python characters known as a "surrogate pair". Should this
    be disallowed on future narrow Python builds?

    **Pro:**

      Python already allows the construction of a surrogate pair
      for a large Unicode literal character escape sequence.
      This is basically designed as a simple way to construct
      "wide characters" even in a narrow Python build. It is also
      somewhat logical considering that the Unicode-literal syntax
      is basically a short-form way of invoking the unicode-escape
      codec.

    **Con:**

      Surrogates could be easily created this way but the user
      still needs to be careful about slicing, indexing, printing
      etc. Therefore, some have suggested that Unicode
      literals should not support surrogates.


  **ISSUE**

    Should Python allow the construction of characters that do
    not correspond to Unicode code points? Unassigned Unicode
    code points should obviously be legal (because they could
    be assigned at any time). But code points above TOPCHAR are
    guaranteed never to be used by Unicode. Should we allow access
    to them anyhow?

    **Pro:**

      If a Python user thinks they know what they're doing, why
      should we try to prevent them from violating the Unicode
      spec? After all, we don't stop 8-bit strings from
      containing non-ASCII characters.

    **Con:**

      Codecs and other Unicode-consuming code will have to be
      careful of these characters, which are disallowed by the
      Unicode specification.

* ``ord()`` is always the inverse of ``unichr()``.

* There is an integer value in the ``sys`` module that describes the
  largest ordinal for a character in a Unicode string on the current
  interpreter. ``sys.maxunicode`` is ``2**16-1`` (``0xffff``) on narrow builds
  of Python and TOPCHAR on wide builds.

  **ISSUE**

    Should there be distinct constants for accessing
    TOPCHAR and the real upper bound for the domain of
    ``unichr`` (if they differ)? There has also been a
    suggestion of ``sys.unicodewidth`` which can take the
    values ``'wide'`` and ``'narrow'``.

* Every Python Unicode character represents exactly one Unicode code
  point (i.e. Python Unicode Character = Abstract Unicode character).

* Codecs will be upgraded to support "wide characters"
  (represented directly in UCS-4, and as variable-length sequences
  in UTF-8 and UTF-16). This is the main part of the implementation
  left to be done.

* There is a convention in the Unicode world for encoding a 32-bit
  code point in terms of two 16-bit code points. These are known
  as "surrogate pairs". Python's codecs will adopt this convention
  and encode 32-bit code points as surrogate pairs on narrow Python
  builds.

  **ISSUE**

    Should there be a way to tell codecs not to generate
    surrogates and instead treat wide characters as
    errors?

    **Pro:**

      I might want to write code that works only with
      fixed-width characters and does not have to worry about
      surrogates.

    **Con:**

      No clear proposal of how to communicate this to codecs.

* There are no restrictions on constructing strings that use
  code points "reserved for surrogates" improperly. These are
  called "isolated surrogates". The codecs should disallow reading
  these from files, but you could construct them using string
  literals or ``unichr()``.
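
The behaviour listed above can be sketched on a current Python 3
interpreter, with some loud assumptions: ``chr()`` stands in for
``unichr()``, and because ``sys.maxunicode`` is TOPCHAR on recent
Python 3 releases, a narrow build is only simulated through a
``maxunicode`` parameter. ``build_width``, ``make_char`` and
``codec_rejects`` are hypothetical helpers, not functions proposed by
this PEP.

```python
import sys

TOPCHAR = 0x10FFFF


def build_width(maxunicode=sys.maxunicode):
    """Classify a build as 'narrow' or 'wide' from its sys.maxunicode."""
    if maxunicode == 0xFFFF:
        return "narrow"
    if maxunicode == TOPCHAR:
        return "wide"
    raise ValueError("unexpected maxunicode value")


def make_char(i, maxunicode=sys.maxunicode):
    """Mimic unichr(): ValueError beyond the (possibly simulated) build range."""
    if not 0 <= i <= maxunicode:
        raise ValueError("ordinal not in range for this build")
    return chr(i)


def codec_rejects(data, encoding):
    """Return True if the codec refuses to decode the given bytes."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return True
    return False


wide = make_char(0x10000, TOPCHAR)
assert len(wide) == 1 and ord(wide) == 0x10000   # ord() inverts the constructor

utf16 = wide.encode("utf-16-be")   # b'\xd8\x00\xdc\x00', a surrogate pair
utf8 = wide.encode("utf-8")        # b'\xf0\x90\x80\x80', a 4-byte sequence
utf32 = wide.encode("utf-32-be")   # b'\x00\x01\x00\x00', represented directly

lone = chr(0xD800)                 # constructing an isolated surrogate is allowed
isolated_rejected = codec_rejects(b"\xd8\x00\x00A", "utf-16-be")
```

Here ``make_char(0x10000, 0xFFFF)`` raises ``ValueError``, which
mirrors the narrow-build rule, and the UTF-16 codec refuses a high
surrogate that is not followed by a low surrogate.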


Implementation
==============

There is a new define::

    #define Py_UNICODE_SIZE 2

To test whether UCS-2 or UCS-4 is in use, the derived macro
``Py_UNICODE_WIDE`` should be used, which is defined when UCS-4 is in
use.

There is a new configure option:

=====================  ============================================
--enable-unicode=ucs2  configures a narrow ``Py_UNICODE``, and uses
                       wchar_t if it fits
--enable-unicode=ucs4  configures a wide ``Py_UNICODE``, and uses
                       wchar_t if it fits
--enable-unicode       same as "=ucs2"
--disable-unicode      entirely remove the Unicode functionality.
=====================  ============================================

It is also proposed that one day ``--enable-unicode`` will just
default to the width of your platform's ``wchar_t``.

Windows builds will be narrow for a while, because there have been
few requests for wide characters, those requests are mostly from
hard-core programmers with the ability to buy their own Python, and
Windows itself is strongly biased towards 16-bit characters.


Notes
=====

This PEP does NOT imply that people using Unicode need to use a
4-byte encoding for their files on disk or sent over the network.
It only allows them to do so. For example, ASCII is still a
legitimate (7-bit) Unicode-encoding.

It has been proposed that there should be a module that handles
surrogates in narrow Python builds for programmers. If someone
wants to implement that, it will be another PEP. It might also be
combined with features that allow other kinds of character-,
word- and line-based indexing.
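
As a rough idea of what such a helper module might offer, the sketch
below walks a sequence of 16-bit units and yields whole code points.
It is purely illustrative: ``narrow_units`` and ``iter_code_points``
are hypothetical names, the narrow string is modelled as a list of
integers on Python 3, and isolated surrogates are passed through
unchanged rather than reported as errors.

```python
def narrow_units(text):
    """Model a narrow-build string as a list of 16-bit integer units."""
    data = text.encode("utf-16-le", "surrogatepass")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def iter_code_points(units):
    """Yield code points from 16-bit units, combining surrogate pairs."""
    units = list(units)
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        if is_high and i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # A high surrogate followed by a low surrogate: one wide character.
            yield 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            # BMP character or isolated surrogate: pass through as-is.
            yield unit
            i += 1
```

With this model, ``narrow_units("a\U0001F600")`` has three units while
``iter_code_points`` yields two code points, which is the kind of
character-based indexing the paragraph above alludes to.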


Rejected Suggestions
====================

More or less the status quo

  We could officially say that Python characters are 16-bit and
  require programmers to implement wide characters in their
  application logic by combining surrogate pairs. This is a heavy
  burden because emulating 32-bit characters is likely to be
  very inefficient if it is coded entirely in Python. Plus these
  abstracted pseudo-strings would not be legal as input to the
  regular expression engine.

"Space-efficient Unicode" type

  Another class of solution is to use some efficient storage
  internally but present an abstraction of wide characters to
  the programmer. Any of these would require a much more complex
  implementation than the accepted solution. For instance, consider
  the impact on the regular expression engine. In theory, we could
  move to this implementation in the future without breaking Python
  code. A future Python could "emulate" wide Python semantics on
  narrow Python. Guido is not willing to undertake the
  implementation right now.

Two types

  We could introduce a 32-bit Unicode type alongside the 16-bit
  type. There is a lot of code that expects there to be only a
  single Unicode type.

This PEP represents the least-effort solution. Over the next
several years, 32-bit Unicode characters will become more common
and that may either convince us that we need a more sophisticated
solution or (on the other hand) convince us that simply
mandating wide Unicode characters is an appropriate solution.
Right now the two options on the table are do nothing or do
this.


References
==========

Unicode Glossary: http://www.unicode.org/glossary/


Copyright
=========

This document has been placed in the public domain.


..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   End: