2001-06-27 18:47:09 -04:00
PEP: 261
Title: Support for "wide" Unicode characters
Version: $Revision$
Author: paulp@activestate.com (Paul Prescod)
Status: Draft
Type: Standards Track
Created: 27-Jun-2001
Python-Version: 2.2
Post-History: 27-Jun-2001
Abstract

Python 2.1 unicode characters can have ordinals only up to
2**16 - 1 (65535). These characters are known as Basic
Multilingual Plane characters. There are now characters in
Unicode that live on other "planes". The largest addressable
character in Unicode has the ordinal 2**20 + 2**16 - 1
(0x10FFFF). For readability, we will call this TOPCHAR.
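
As a quick check of the arithmetic above, the following Python sketch computes TOPCHAR and the number of planes it implies. The names BMP_SIZE, TOPCHAR and SUPPLEMENTARY_PLANES are labels chosen for this illustration only; they are not identifiers defined by Python.

```python
# Illustrative constants; these names are not part of any Python API.
BMP_SIZE = 2**16              # code points in the Basic Multilingual Plane
TOPCHAR = 2**20 + 2**16 - 1   # largest addressable Unicode ordinal

# The 2**20 code points beyond the BMP form planes of 2**16 code points each.
SUPPLEMENTARY_PLANES = 2**20 // BMP_SIZE

print(hex(TOPCHAR))               # 0x10ffff
print(SUPPLEMENTARY_PLANES)       # 16
print((TOPCHAR + 1) // BMP_SIZE)  # 17 planes in total, counting the BMP
```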
Proposed Solution

One solution would be to merely increase the maximum ordinal to a
larger value. Unfortunately the only straightforward
implementation of this idea is to widen the character code unit
from 2 bytes to 4 bytes. This has the effect of doubling the size
of most Unicode strings. In order to avoid imposing this cost on
every user, Python 2.2 will allow 4-byte Unicode characters as a
build-time option.
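
The storage cost can be illustrated by comparing fixed-width encodings of the same text. The sketch below uses Python's standard utf-16-le and utf-32-le codecs as stand-ins for 2-byte and 4-byte code units. It only approximates the size of the interpreter's internal Py_UNICODE buffer, and the helper name storage_bytes is hypothetical.

```python
def storage_bytes(text, unit_size):
    """Bytes needed to hold `text` in 2-byte or 4-byte code units."""
    # Assumption: the BOM-less little-endian codecs model raw code units.
    codec = {2: "utf-16-le", 4: "utf-32-le"}[unit_size]
    return len(text.encode(codec))

sample = "wide Unicode"           # 12 characters, all in the BMP
narrow = storage_bytes(sample, 2)
wide = storage_bytes(sample, 4)
print(narrow, wide)               # 24 48
```

For text that stays inside the Basic Multilingual Plane, the 4-byte form is exactly twice the size of the 2-byte form.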
The 4-byte option is called "wide Py_UNICODE". The 2-byte option
is called "narrow Py_UNICODE".

Most things will behave identically in the wide and narrow worlds.

* The \u and \U literal syntaxes will always generate the same
  data that the unichr function would. They are just different
  syntaxes for the same thing.

* unichr(i) for 0 <= i < 2**16 always returns a size-one string.

* unichr(i) for 2**16 <= i <= TOPCHAR will always return a
  string representing the character.

* BUT on narrow builds of Python, the string will actually be
  composed of two characters called a "surrogate pair".

* ord() will now accept surrogate pairs and return the ordinal of
  the "wide" character. Open question: should it accept surrogate
  pairs on wide Python builds?

* There is an integer value in the sys module that describes the
  largest ordinal for a Unicode character on the current
  interpreter. sys.maxunicode is 2**16-1 on narrow builds of
  Python. On wide builds it could be either TOPCHAR or 2**32-1.
  That's an open question.

* Note that ord() can in some cases return ordinals higher than
  sys.maxunicode because it accepts surrogate pairs on narrow
  Python builds.

* Codecs will be upgraded to support "wide characters". On narrow
  Python builds, the codecs will generate surrogate pairs; on wide
  Python builds they will generate a single character.

* New codecs will be written for 4-byte Unicode and older codecs
  will be updated to recognize surrogates and map them to wide
  characters on wide Pythons.

* There are no restrictions on constructing strings that use code
  points "reserved for surrogates" improperly. These are called
  "lone surrogates". The codecs should disallow reading these but
  you could construct them using string literals or unichr().
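
The surrogate-pair behaviour described in the list above can be sketched in modern Python. The helpers narrow_unichr and narrow_ord are hypothetical names for this illustration. They model a narrow build's strings as tuples of 16-bit code units and follow the standard UTF-16 surrogate arithmetic; they are not the C implementation this PEP proposes.

```python
TOPCHAR = 0x10FFFF  # 2**20 + 2**16 - 1

def narrow_unichr(i):
    """Return the 16-bit code units a narrow build would store for ordinal i."""
    if not 0 <= i <= TOPCHAR:
        raise ValueError("ordinal out of range: %r" % (i,))
    if i < 0x10000:
        return (i,)                      # BMP: a single code unit
    offset = i - 0x10000                 # 20-bit offset beyond the BMP
    high = 0xD800 + (offset >> 10)       # high surrogate carries the top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # low surrogate carries the bottom 10 bits
    return (high, low)

def narrow_ord(units):
    """Inverse of narrow_unichr: accept one code unit or a surrogate pair."""
    if len(units) == 1:
        return units[0]                  # lone surrogates pass through unchecked
    if len(units) == 2:
        high, low = units
        if 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF:
            return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
    raise TypeError("expected a single code unit or a surrogate pair")

pair = narrow_unichr(0x10000)
print([hex(u) for u in pair])            # ['0xd800', '0xdc00']
print(hex(narrow_ord(pair)))             # 0x10000
print(len(narrow_unichr(0xFFFF)))        # 1
```

Here narrow_ord(pair) is larger than 0xFFFF, which is the case noted in the list where ord() can return a value above sys.maxunicode on a narrow build.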
Implementation

There is a new (experimental) define in Include/unicodeobject.h:

    #undef USE_UCS4_STORAGE

If defined, Py_UNICODE is set to the same thing as Py_UCS4.

    USE_UCS4_STORAGE

There are new configure options:

    --enable-unicode=ucs2 configures a narrow Py_UNICODE, and uses
                          wchar_t if it fits
    --enable-unicode=ucs4 configures a wide Py_UNICODE, and likewise
                          uses wchar_t if it fits
    --enable-unicode      configures Py_UNICODE to wchar_t if available,
                          and to UCS-4 if not; this is the default

The intention is that --disable-unicode, or --enable-unicode=no,
removes the Unicode type altogether; this is not yet implemented.
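
Code that has to run on either kind of build could inspect sys.maxunicode at run time. The sketch below is one way to do that; the function name is_wide_build is hypothetical. The comparison holds whichever value a wide build ends up reporting (TOPCHAR or 2**32-1). On Python 3.3 and later sys.maxunicode is always 0x10FFFF, so the check always reports a wide range there.

```python
import sys

def is_wide_build(maxunicode=None):
    """True if the largest supported ordinal lies beyond the 16-bit range."""
    if maxunicode is None:
        maxunicode = sys.maxunicode      # value of the running interpreter
    return maxunicode > 0xFFFF

print(is_wide_build())          # True on Python 3.3 and later
print(is_wide_build(0xFFFF))    # False: the value a narrow build reports
```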
Notes

Note that len(unichr(i)) == 2 for i >= 0x10000 on narrow builds.

This means (for example) that the following code is not portable:

    x = 0x10000
    if unichr(x) in somestring:
        ...

On a narrow build unichr(x) is a two-character surrogate pair
rather than a single character, so the membership test does not
behave as it does on a wide build.

In general, you should be careful using "in" if the character that
is searched for could have been generated from unichr applied to a
number greater than or equal to 0x10000, or from a string literal
containing a \U escape for such a number.
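
One way to sidestep the portability problem is to compare whole runs of code units instead of relying on a single-character membership test. The sketch below is written for modern Python 3, where chr() replaces unichr(). It uses the utf-16-le codec to emulate the code units a narrow build would store. The helper names utf16_units and contains_code_point are hypothetical.

```python
import struct

def utf16_units(text):
    """Return the 16-bit code units a narrow build would hold for `text`."""
    data = text.encode("utf-16-le")
    return struct.unpack("<%dH" % (len(data) // 2), data)

def contains_code_point(code_point, text):
    """Search `text` for `code_point` by comparing whole code-unit runs."""
    needle = utf16_units(chr(code_point))   # 1 unit, or 2 for a surrogate pair
    haystack = utf16_units(text)
    width = len(needle)
    return any(haystack[k:k + width] == needle
               for k in range(len(haystack) - width + 1))

somestring = "a\U00010000b"
print(len(utf16_units(somestring)))               # 4
print(contains_code_point(0x10000, somestring))   # True
print(contains_code_point(0x10001, somestring))   # False
```

The three-character string occupies four code units, which is why a length or membership check written for one build type can give a different answer on the other.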
This PEP does NOT imply that people using Unicode need to use a
4-byte encoding. It only allows them to do so. For example,
ASCII is still a legitimate (7-bit) Unicode-encoding.
Open Questions

"Code points" above TOPCHAR cannot be expressed in two 16-bit
characters. These are not assigned to Unicode characters and
supposedly will never be. Should we allow them to be passed as
arguments to unichr() anyhow? We could allow knowledgeable
programmers to use these "unused" characters for whatever they
want, though Unicode does not address them.

"Lone surrogates" "should not" occur on wide platforms. Should
ord() still accept them?
Local Variables:
mode: indented-text
indent-tabs-mode: nil
End: