Updated terminology and format.
parent
ec6436a790
commit
bdff046670
296
pep-0261.txt
|
@@ -11,122 +11,268 @@ Post-History: 27-Jun-2001
|
|||
|
||||
Abstract
|
||||
|
||||
Python 2.1 unicode characters can have ordinals only up to 65536.
|
||||
These characters are known as Basic Multilinual Plane characters.
|
||||
There are now characters in Unicode that live on other "planes".
|
||||
The largest addressable character in Unicode has the ordinal
|
||||
2**20 + 2**16 - 1. For readability, we will call this TOPCHAR.
|
||||
Python 2.1 unicode characters can have ordinals only up to 2**16 -1.
|
||||
This range corresponds to a range in Unicode known as the Basic
|
||||
Multilingual Plane. There are now characters in Unicode that live
|
||||
on other "planes". The largest addressable character in Unicode
|
||||
has the ordinal 17 * 2**16 - 1 (0x10ffff). For readability, we
|
||||
will call this TOPCHAR and call characters in this range "wide
|
||||
characters".
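As a quick arithmetic check, the two expressions for TOPCHAR that appear in this change (2**20 + 2**16 - 1 in the old wording, 17 * 2**16 - 1 in the new) denote the same value. A minimal sketch follows; TOPCHAR is the name used in the text, while OLD_TOPCHAR and BMP_MAX are illustrative names introduced only for this example.

```python
# TOPCHAR as written in the revised text: 17 planes of 2**16 code points each.
TOPCHAR = 17 * 2**16 - 1

# The earlier wording used 2**20 + 2**16 - 1 (hypothetical name for comparison).
OLD_TOPCHAR = 2**20 + 2**16 - 1

# Largest ordinal in the Basic Multilingual Plane (hypothetical name).
BMP_MAX = 2**16 - 1

print(hex(TOPCHAR), TOPCHAR == OLD_TOPCHAR, hex(BMP_MAX))
```

Both forms evaluate to 0x10ffff, so the rewording changes the presentation and not the limit itself.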
|
||||
|
||||
|
||||
Glossary
|
||||
|
||||
Character
|
||||
|
||||
Used by itself, means the addressable units of a Python
|
||||
Unicode string.
|
||||
|
||||
Code point
|
||||
|
||||
A code point is an integer between 0 and TOPCHAR.
|
||||
If you imagine Unicode as a mapping from integers to
|
||||
characters, each integer is a code point. But the
|
||||
integers between 0 and TOPCHAR that do not map to
|
||||
characters are also code points. Some will someday
|
||||
be used for characters. Some are guaranteed never
|
||||
to be used for characters.
|
||||
|
||||
Codec
|
||||
|
||||
A set of functions for translating between physical
|
||||
encodings (e.g. on disk or coming in from a network)
|
||||
into logical Python objects.
|
||||
|
||||
Encoding
|
||||
|
||||
Mechanism for representing abstract characters in terms of
|
||||
physical bits and bytes. Encodings allow us to store
|
||||
Unicode characters on disk and transmit them over networks
|
||||
in a manner that is compatible with other Unicode software.
|
||||
|
||||
Surrogate pair
|
||||
|
||||
Two physical characters that represent a single logical
|
||||
character. Part of a convention for representing 32-bit
|
||||
code points in terms of two 16-bit code points.
|
||||
|
||||
Unicode string
|
||||
|
||||
A Python type representing a sequence of code points with
|
||||
"string semantics" (e.g. case conversions, regular
|
||||
expression compatibility, etc.) Constructed with the
|
||||
unicode() function.
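The "surrogate pair" entry above can be illustrated with the standard UTF-16 arithmetic. This is a sketch under stated assumptions, not code from the PEP: the function names to_surrogate_pair and from_surrogate_pair are hypothetical, and the code uses plain integers so that it runs the same way on any Python build.

```python
# Surrogate ranges defined by the Unicode standard.
HIGH_START, HIGH_END = 0xD800, 0xDBFF
LOW_START, LOW_END = 0xDC00, 0xDFFF


def to_surrogate_pair(code_point):
    """Split a code point above 0xFFFF into (high, low) 16-bit surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in the range 0x10000..0x10FFFF")
    offset = code_point - 0x10000           # 20-bit value
    high = HIGH_START + (offset >> 10)      # top 10 bits
    low = LOW_START + (offset & 0x3FF)      # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Combine a well-formed (high, low) surrogate pair into one code point."""
    if not (HIGH_START <= high <= HIGH_END and LOW_START <= low <= LOW_END):
        raise ValueError("not a well-formed surrogate pair")
    return 0x10000 + ((high - HIGH_START) << 10) + (low - LOW_START)


high, low = to_surrogate_pair(0x1D11E)
print(hex(high), hex(low), hex(from_surrogate_pair(high, low)))
```

The round trip shows why one logical character occupies two addressable units whenever the storage unit is 16 bits wide.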
|
||||
|
||||
|
||||
Proposed Solution
|
||||
|
||||
One solution would be to merely increase the maximum ordinal to a
|
||||
larger value. Unfortunately the only straightforward
|
||||
implementation of this idea is to increase the character code unit
|
||||
to 4 bytes. This has the effect of doubling the size of most
|
||||
Unicode strings. In order to avoid imposing this cost on every
|
||||
user, Python 2.2 will allow 4-byte Unicode characters as a
|
||||
build-time option.
|
||||
One solution would be to merely increase the maximum ordinal
|
||||
to a larger value. Unfortunately the only straightforward
|
||||
implementation of this idea is to use 4 bytes per character.
|
||||
This has the effect of doubling the size of most Unicode
|
||||
strings. In order to avoid imposing this cost on every
|
||||
user, Python 2.2 will allow the 4-byte implementation as a
|
||||
build-time option. Users can choose whether they care about
|
||||
wide characters or prefer to preserve memory.
|
||||
|
||||
The 4-byte option is called "wide Py_UNICODE". The 2-byte option
|
||||
The 4-byte option is called "wide Py_UNICODE". The 2-byte option
|
||||
is called "narrow Py_UNICODE".
|
||||
|
||||
Most things will behave identically in the wide and narrow worlds.
|
||||
|
||||
* the \u and \U literal syntaxes will always generate the same
|
||||
data that the unichr function would. They are just different
|
||||
syntaxes for the same thing.
|
||||
* unichr(i) for 0 <= i < 2**16 (0x10000) always returns a
|
||||
length-one string.
|
||||
|
||||
* unichr(i) for 0 <= i <= 2**16 always returns a size-one string.
|
||||
* unichr(i) for 2**16 <= i <= TOPCHAR will return a
|
||||
length-one string on wide Python builds. On narrow builds it will
|
||||
raise ValueError.
|
||||
|
||||
* unichr(i) for 2**16+1 <= i <= TOPCHAR will always return a
|
||||
string representing the character.
|
||||
ISSUE
|
||||
|
||||
* BUT on narrow builds of Python, the string will actually be
|
||||
composed of two characters called a "surrogate pair".
|
||||
Python currently allows \U literals that cannot be
|
||||
represented as a single Python character. It generates two
|
||||
Python characters known as a "surrogate pair". Should this
|
||||
be disallowed on future narrow Python builds?
|
||||
|
||||
* ord() will now accept surrogate pairs and return the ordinal of
|
||||
the "wide" character. Open question: should it accept surrogate
|
||||
pairs on wide Python builds?
|
||||
Pro:
|
||||
|
||||
Python already allows the construction of a surrogate pair
|
||||
for a large unicode literal character escape sequence.
|
||||
This is basically designed as a simple way to construct
|
||||
"wide characters" even in a narrow Python build. It is also
|
||||
somewhat logical considering that the Unicode-literal syntax
|
||||
is basically a short-form way of invoking the unicode-escape
|
||||
codec.
|
||||
|
||||
Con:
|
||||
|
||||
Surrogates could be easily created this way but the user
|
||||
still needs to be careful about slicing, indexing, printing
|
||||
etc. Therefore some have suggested that Unicode
|
||||
literals should not support surrogates.
|
||||
|
||||
|
||||
ISSUE
|
||||
|
||||
Should Python allow the construction of characters that do
|
||||
not correspond to Unicode code points? Unassigned Unicode
|
||||
code points should obviously be legal (because they could
|
||||
be assigned at any time). But code points above TOPCHAR are
|
||||
guaranteed never to be used by Unicode. Should we allow access
|
||||
to them anyhow?
|
||||
|
||||
Pro:
|
||||
|
||||
If a Python user thinks they know what they're doing why
|
||||
should we try to prevent them from violating the Unicode
|
||||
spec? After all, we don't stop 8-bit strings from
|
||||
containing non-ASCII characters.
|
||||
|
||||
Con:
|
||||
|
||||
Codecs and other Unicode-consuming code will have to be
|
||||
careful of these characters which are disallowed by the
|
||||
Unicode specification.
|
||||
|
||||
* ord() is always the inverse of unichr()
|
||||
|
||||
* There is an integer value in the sys module that describes the
|
||||
largest ordinal for a Unicode character on the current
|
||||
interpreter. sys.maxunicode is 2**16-1 on narrow builds of
|
||||
Python. On wide builds it could be either TOPCHAR or 2**32-1.
|
||||
That's an open question.
|
||||
largest ordinal for a character in a Unicode string on the current
|
||||
interpreter. sys.maxunicode is 2**16-1 (0xffff) on narrow builds
|
||||
of Python and TOPCHAR on wide builds.
|
||||
|
||||
* Note that ord() can in some cases return ordinals higher than
|
||||
sys.maxunicode because it accepts surrogate pairs on narrow
|
||||
Python builds.
|
||||
ISSUE: Should there be distinct constants for accessing
|
||||
TOPCHAR and the real upper bound for the domain of
|
||||
unichr (if they differ)? There has also been a
|
||||
suggestion of sys.unicodewidth which can take the
|
||||
values 'wide' and 'narrow'.
|
||||
|
||||
* codecs will be upgraded to support "wide characters". On narrow
|
||||
Python builds, the codecs will generate surrogate pairs, on wide
|
||||
Python builds they will generate a single character.
|
||||
* every Python Unicode character represents exactly one Unicode code
|
||||
point (i.e. Python Unicode Character = Abstract Unicode character).
|
||||
|
||||
* new codecs will be written for 4-byte Unicode and older codecs
|
||||
will be updated to recognize surrogates and map them to wide
|
||||
characters on wide Pythons.
|
||||
* codecs will be upgraded to support "wide characters"
|
||||
(represented directly in UCS-4, and as variable-length sequences
|
||||
in UTF-8 and UTF-16). This is the main part of the implementation
|
||||
left to be done.
|
||||
|
||||
* there are no restrictions on constructing strings that use code
|
||||
points "reserved for surrogates" improperly. These are called
|
||||
"lone surrogates". The codecs should disallow reading these but
|
||||
you could construct them using string literals or unichr().
|
||||
* There is a convention in the Unicode world for encoding a 32-bit
|
||||
code point in terms of two 16-bit code points. These are known
|
||||
as "surrogate pairs". Python's codecs will adopt this convention
|
||||
and encode 32-bit code points as surrogate pairs on narrow Python
|
||||
builds.
|
||||
|
||||
ISSUE
|
||||
|
||||
Should there be a way to tell codecs not to generate
|
||||
surrogates and instead treat wide characters as
|
||||
errors?
|
||||
|
||||
Pro:
|
||||
|
||||
I might want to write code that works only with
|
||||
fixed-width characters and does not have to worry about
|
||||
surrogates.
|
||||
|
||||
|
||||
Con:
|
||||
|
||||
No clear proposal of how to communicate this to codecs.
|
||||
|
||||
* there are no restrictions on constructing strings that use
|
||||
code points "reserved for surrogates" improperly. These are
|
||||
called "isolated surrogates". The codecs should disallow reading
|
||||
these from files, but you could construct them using string
|
||||
literals or unichr().
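Two of the points in the list above lend themselves to a short sketch: telling narrow from wide builds through sys.maxunicode, and spotting isolated surrogates in a sequence of 16-bit units. The helper names build_width and find_isolated_surrogates are hypothetical (the text only mentions a suggested sys.unicodewidth), and the input is modelled as a list of integers, not a real Unicode string.

```python
import sys


def build_width():
    """Return 'narrow' if sys.maxunicode is 0xFFFF, otherwise 'wide'."""
    return "narrow" if sys.maxunicode == 0xFFFF else "wide"


def find_isolated_surrogates(units):
    """Return indexes of surrogate units that are not part of a high+low pair."""
    isolated = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # A high surrogate must be followed directly by a low surrogate.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            isolated.append(i)
        elif 0xDC00 <= unit <= 0xDFFF:
            # A low surrogate with no preceding high surrogate.
            isolated.append(i)
        i += 1
    return isolated


print(build_width())
print(find_isolated_surrogates([0x41, 0xD834, 0xDD1E, 0xD834, 0x42]))
```

A codec that rejects isolated surrogates, as the text says codecs should, could apply a check of this kind before accepting its input.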
|
||||
|
||||
|
||||
Implementation
|
||||
|
||||
There is a new (experimental) define in Include/unicodeobject.h:
|
||||
There is a new (experimental) define:
|
||||
|
||||
#undef USE_UCS4_STORAGE
|
||||
#define PY_UNICODE_SIZE 2
|
||||
|
||||
if defined, Py_UNICODE is set to the same thing as Py_UCS4.
|
||||
|
||||
USE_UCS4_STORAGE
|
||||
|
||||
There is a new configure options:
|
||||
There is a new configure option:
|
||||
|
||||
--enable-unicode=ucs2 configures a narrow Py_UNICODE, and uses
|
||||
wchar_t if it fits
|
||||
--enable-unicode=ucs4 configures a wide Py_UNICODE likewise
|
||||
--enable-unicode configures Py_UNICODE to wchar_t if available,
|
||||
and to UCS-4 if not; this is the default
|
||||
--enable-unicode=ucs4 configures a wide Py_UNICODE, and uses
|
||||
wchar_t if it fits
|
||||
--enable-unicode same as "=ucs2"
|
||||
|
||||
The intention is that --disable-unicode, or --enable-unicode=no
|
||||
removes the Unicode type altogether; this is not yet implemented.
|
||||
|
||||
It is also proposed that one day --enable-unicode will just
|
||||
default to the width of your platform's wchar_t.
|
||||
|
||||
Windows builds will be narrow for a while based on the fact that
|
||||
there have been few requests for wide characters, those requests
|
||||
are mostly from hard-core programmers with the ability to buy
|
||||
their own Python and Windows itself is strongly biased towards
|
||||
16-bit characters.
|
||||
|
||||
|
||||
Notes
|
||||
|
||||
Note that len(unichr(i))==2 for i>=0x10000 on narrow machines.
|
||||
|
||||
This means (for example) that the following code is not portable:
|
||||
|
||||
x = 0x10000
|
||||
if unichr(x) in somestring:
|
||||
...
|
||||
|
||||
In general, you should be careful using "in" if the character that
|
||||
is searched for could have been generated from unichr applied to a
|
||||
number greater than 0x10000 or from a string literal greater than
|
||||
0x10000.
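The length difference described in this note can be modelled without a narrow build. The sketch below assumes a modern Python 3 interpreter, where chr() plays the role that unichr() plays in the text; narrow_len is a hypothetical helper that counts the 16-bit units a narrow build would need.

```python
def narrow_len(text):
    """Count the 16-bit units needed to store text on a narrow build."""
    # Code points above 0xFFFF need a surrogate pair, i.e. two units.
    return sum(2 if ord(ch) > 0xFFFF else 1 for ch in text)


x = 0x10000
print(narrow_len(chr(x)), narrow_len(chr(x - 1)), narrow_len("abc"))
```

Code that assumes a searched-for character has length one holds for ordinals below 0x10000 and breaks at or above it under this model.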
|
||||
|
||||
This PEP does NOT imply that people using Unicode need to use a
|
||||
4-byte encoding. It only allows them to do so. For example,
|
||||
ASCII is still a legitimate (7-bit) Unicode-encoding.
|
||||
4-byte encoding for their files on disk or sent over the network.
|
||||
It only allows them to do so. For example, ASCII is still a
|
||||
legitimate (7-bit) Unicode-encoding.
|
||||
|
||||
It has been proposed that there should be a module that handles
|
||||
surrogates in narrow Python builds for programmers. If someone
|
||||
wants to implement that, it will be another PEP. It might also be
|
||||
combined with features that allow other kinds of character-,
|
||||
word- and line- based indexing.
|
||||
|
||||
|
||||
Open Questions
|
||||
Rejected Suggestions
|
||||
|
||||
"Code points" above TOPCHAR cannot be expressed in two 16-bit
|
||||
characters. These are not assigned to Unicode characters and
|
||||
supposedly will never be. Should we allow them to be passed as
|
||||
arguments to unichr() anyhow? We could allow knowledgeable
|
||||
programmers to use these "unused" characters for whatever they
|
||||
want, though Unicode does not address them.
|
||||
More or less the status-quo
|
||||
|
||||
"Lone surrogates" "should not" occur on wide platforms. Should
|
||||
ord() still accept them?
|
||||
We could officially say that Python characters are 16-bit and
|
||||
require programmers to implement wide characters in their
|
||||
application logic by combining surrogate pairs. This is a heavy
|
||||
burden because emulating 32-bit characters is likely to be
|
||||
very inefficient if it is coded entirely in Python. Plus these
|
||||
abstracted pseudo-strings would not be legal as input to the
|
||||
regular expression engine.
|
||||
|
||||
"Space-efficient Unicode" type
|
||||
|
||||
Another class of solution is to use some efficient storage
|
||||
internally but present an abstraction of wide characters to
|
||||
the programmer. Any of these would require a much more complex
|
||||
implementation than the accepted solution. For instance consider
|
||||
the impact on the regular expression engine. In theory, we could
|
||||
move to this implementation in the future without breaking Python
|
||||
code. A future Python could "emulate" wide Python semantics on
|
||||
narrow Python. Guido is not willing to undertake the
|
||||
implementation right now.
|
||||
|
||||
Two types
|
||||
|
||||
We could introduce a 32-bit Unicode type alongside the 16-bit
|
||||
type. There is a lot of code that expects there to be only a
|
||||
single Unicode type.
|
||||
|
||||
This PEP represents the least-effort solution. Over the next
|
||||
several years, 32-bit Unicode characters will become more common
|
||||
and that may either convince us that we need a more sophisticated
|
||||
solution or (on the other hand) convince us that simply
|
||||
mandating wide Unicode characters is an appropriate solution.
|
||||
Right now the two options on the table are do nothing or do
|
||||
this.
|
||||
|
||||
|
||||
References
|
||||
|
||||
Unicode Glossary: http://www.unicode.org/glossary/
|
||||
|
||||
|
||||
Copyright
|
||||
|
||||
This document has been placed in the public domain.
|
||||
|
||||
|
||||
Local Variables:
|
||||
|
|