diff --git a/pep-0261.txt b/pep-0261.txt
index 4a3175f26..9e3c25c54 100644
--- a/pep-0261.txt
+++ b/pep-0261.txt
@@ -11,122 +11,268 @@ Post-History: 27-Jun-2001
 
 Abstract
 
-    Python 2.1 unicode characters can have ordinals only up to 65536.
-    These characters are known as Basic Multilinual Plane characters.
-    There are now characters in Unicode that live on other "planes".
-    The largest addressable character in Unicode has the ordinal
-    2**20 + 2**16 - 1.  For readability, we will call this TOPCHAR.
+    Python 2.1 unicode characters can have ordinals only up to 2**16 - 1.
+    This range corresponds to a range in Unicode known as the Basic
+    Multilingual Plane.  There are now characters in Unicode that live
+    on other "planes".  The largest addressable character in Unicode
+    has the ordinal 17 * 2**16 - 1 (0x10ffff).  For readability, we
+    will call this TOPCHAR and call characters in this range "wide
+    characters".
+
+
+Glossary
+
+    Character
+
+        Used by itself, means the addressable units of a Python
+        Unicode string.
+
+    Code point
+
+        A code point is an integer between 0 and TOPCHAR.
+        If you imagine Unicode as a mapping from integers to
+        characters, each integer is a code point.  But the
+        integers between 0 and TOPCHAR that do not map to
+        characters are also code points.  Some will someday
+        be used for characters.  Some are guaranteed never
+        to be used for characters.
+
+    Codec
+
+        A set of functions for translating between physical
+        encodings (e.g. on disk or coming in from a network)
+        and logical Python objects.
+
+    Encoding
+
+        Mechanism for representing abstract characters in terms of
+        physical bits and bytes.  Encodings allow us to store
+        Unicode characters on disk and transmit them over networks
+        in a manner that is compatible with other Unicode software.
+
+    Surrogate pair
+
+        Two physical characters that represent a single logical
+        character.  Part of a convention for representing 32-bit
+        code points in terms of two 16-bit code points.
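+
+        For illustration only, a minimal sketch in Python of how
+        the convention might pair one wide code point with two
+        16-bit code points (the authoritative rules are those of
+        the Unicode standard, not this sketch):
+
+            def surrogate_pair(cp):
+                # split a wide code point (0x10000..TOPCHAR) into a
+                # high surrogate (0xd800..0xdbff) and a low
+                # surrogate (0xdc00..0xdfff)
+                v = cp - 0x10000
+                return 0xd800 | (v >> 10), 0xdc00 | (v & 0x3ff)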
+
+    Unicode string
+
+        A Python type representing a sequence of code points with
+        "string semantics" (e.g. case conversions, regular
+        expression compatibility, etc.)  Constructed with the
+        unicode() function.
 
 
 Proposed Solution
 
-    One solution would be to merely increase the maximum ordinal to a
-    larger value.  Unfortunately the only straightforward
-    implementation of this idea is to increase the character code unit
-    to 4 bytes.  This has the effect of doubling the size of most
-    Unicode strings.  In order to avoid imposing this cost on every
-    user, Python 2.2 will allow 4-byte Unicode characters as a
-    build-time option.
+    One solution would be to merely increase the maximum ordinal
+    to a larger value.  Unfortunately the only straightforward
+    implementation of this idea is to use 4 bytes per character.
+    This has the effect of doubling the size of most Unicode
+    strings.  In order to avoid imposing this cost on every
+    user, Python 2.2 will allow the 4-byte implementation as a
+    build-time option.  Users can choose whether they care about
+    wide characters or prefer to preserve memory.
 
-    The 4-byte option is called "wide Py_UNICODE".  The 2-byte option
+    The 4-byte option is called "wide Py_UNICODE".  The 2-byte option
     is called "narrow Py_UNICODE".
 
     Most things will behave identically in the wide and narrow worlds.
 
-    * the \u and \U literal syntaxes will always generate the same
-      data that the unichr function would.  They are just different
-      syntaxes for the same thing.
+    * unichr(i) for 0 <= i < 2**16 (0x10000) always returns a
+      length-one string.
 
-    * unichr(i) for 0 <= i <= 2**16 always returns a size-one string.
+    * unichr(i) for 2**16 <= i <= TOPCHAR will return a
+      length-one string on wide Python builds.  On narrow builds it will
+      raise ValueError (see the example session after this list).
 
-    * unichr(i) for 2**16+1 <= i <= TOPCHAR will always return a
-      string representing the character.
+        ISSUE
 
-    * BUT on narrow builds of Python, the string will actually be
-      composed of two characters called a "surrogate pair".
+            Python currently allows \U literals that cannot be
+            represented as a single Python character.  It generates two
+            Python characters known as a "surrogate pair".  Should this
+            be disallowed on future narrow Python builds?
 
-    * ord() will now accept surrogate pairs and return the ordinal of
-      the "wide" character.  Open question: should it accept surrogate
-      pairs on wide Python builds?
+            Pro:
+
+                Python already allows the construction of a
+                surrogate pair for a large unicode literal character
+                escape sequence.  This is basically designed as a
+                simple way to construct "wide characters" even in a
+                narrow Python build.  It is also somewhat logical
+                considering that the Unicode-literal syntax is
+                basically a short-form way of invoking the
+                unicode-escape codec.
+
+            Con:
+
+                Surrogates could be easily created this way but the user
+                still needs to be careful about slicing, indexing, printing,
+                etc.  Therefore some have suggested that Unicode
+                literals should not support surrogates.
+
+
+        ISSUE
+
+            Should Python allow the construction of characters that do
+            not correspond to Unicode code points?  Unassigned Unicode
+            code points should obviously be legal (because they could
+            be assigned at any time).  But code points above TOPCHAR are
+            guaranteed never to be used by Unicode.  Should we allow access
+            to them anyhow?
+
+            Pro:
+
+                If a Python user thinks they know what they're doing why
+                should we try to prevent them from violating the Unicode
+                spec?  After all, we don't stop 8-bit strings from
+                containing non-ASCII characters.
+
+            Con:
+
+                Codecs and other Unicode-consuming code will have to be
+                careful of these characters which are disallowed by the
+                Unicode specification.
+
+    * ord() is always the inverse of unichr()
 
     * There is an integer value in the sys module that describes the
-      largest ordinal for a Unicode character on the current
-      interpreter.  sys.maxunicode is 2**16-1 on narrow builds of
-      Python.  On wide builds it could be either TOPCHAR or 2**32-1.
-      That's an open question.
+      largest ordinal for a character in a Unicode string on the current
+      interpreter.  sys.maxunicode is 2**16-1 (0xffff) on narrow builds
+      of Python and TOPCHAR on wide builds.
 
-    * Note that ord() can in some cases return ordinals higher than
-      sys.maxunicode because it accepts surrogate pairs on narrow
-      Python builds.
+        ISSUE: Should there be distinct constants for accessing
+               TOPCHAR and the real upper bound for the domain of
+               unichr (if they differ)?  There has also been a
+               suggestion of sys.unicodewidth which can take the
+               values 'wide' and 'narrow'.
 
-    * codecs will be upgraded to support "wide characters".  On narrow
-      Python builds, the codecs will generate surrogate pairs, on wide
-      Python builds they will generate a single character.
+    * every Python Unicode character represents exactly one Unicode code
+      point (i.e. Python Unicode Character = Abstract Unicode character).
 
-    * new codecs will be written for 4-byte Unicode and older codecs
-      will be updated to recognize surrogates and map them to wide
-      characters on wide Pythons.
+    * codecs will be upgraded to support "wide characters"
+      (represented directly in UCS-4, and as variable-length sequences
+      in UTF-8 and UTF-16).  This is the main part of the implementation
+      left to be done.
 
+    * There is a convention in the Unicode world for encoding a 32-bit
+      code point in terms of two 16-bit code points.  These are known
+      as "surrogate pairs".  Python's codecs will adopt this convention
+      and encode 32-bit code points as surrogate pairs on narrow Python
+      builds.
+
+        ISSUE
+
+            Should there be a way to tell codecs not to generate
+            surrogates and instead treat wide characters as
+            errors?
+
+            Pro:
+
+                I might want to write code that works only with
+                fixed-width characters and does not have to worry about
+                surrogates.
+
+            Con:
+
+                No clear proposal of how to communicate this to codecs.
+
+    * there are no restrictions on constructing strings that use
+      code points "reserved for surrogates" improperly.  These are
+      called "isolated surrogates".  The codecs should disallow reading
+      these from files, but you could construct them using string
+      literals or unichr().
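+
+    To make the narrow/wide split concrete, here is a hypothetical
+    interpreter session on a narrow build (the exact error message
+    is illustrative; this PEP does not specify it):
+
+        >>> import sys
+        >>> hex(sys.maxunicode)
+        '0xffff'
+        >>> unichr(0xffff)      # still representable
+        u'\uffff'
+        >>> unichr(0x10000)     # a wide character
+        Traceback (most recent call last):
+          ...
+        ValueError: unichr() arg not in range(0x10000)
+
+    On a wide build sys.maxunicode would be TOPCHAR (0x10ffff) and
+    the last call would return a length-one string.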
 
 
 Implementation
 
-    There is a new (experimental) define in Include/unicodeobject.h:
+    There is a new (experimental) define:
 
-        #undef USE_UCS4_STORAGE
+        #define PY_UNICODE_SIZE 2
 
-    if defined, Py_UNICODE is set to the same thing as Py_UCS4.
-
-    USE_UCS4_STORAGE
-
-    There is a new configure options:
+    There is a new configure option:
 
         --enable-unicode=ucs2 configures a narrow Py_UNICODE, and uses
                               wchar_t if it fits
-        --enable-unicode=ucs4 configures a wide Py_UNICODE likewise
-        --enable-unicode configures Py_UNICODE to wchar_t if available,
-                         and to UCS-4 if not; this is the default
+        --enable-unicode=ucs4 configures a wide Py_UNICODE, and uses
+                              wchar_t if it fits
+        --enable-unicode      same as "=ucs2"
 
     The intention is that --disable-unicode, or --enable-unicode=no
     removes the Unicode type altogether; this is not yet implemented.
 
+    It is also proposed that one day --enable-unicode will just
+    default to the width of your platform's wchar_t.
+
+    Windows builds will be narrow for a while based on the fact that
+    there have been few requests for wide characters; those requests
+    are mostly from hard-core programmers with the ability to buy
+    their own Python, and Windows itself is strongly biased towards
+    16-bit characters.
+
 
 Notes
 
-    Note that len(unichr(i))==2 for i>=0x10000 on narrow machines.
-
-    This means (for example) that the following code is not portable:
-
-        x = 0x10000
-        if unichr(x) in somestring:
-            ...
-
-    In general, you should be careful using "in" if the character that
-    is searched for could have been generated from unichr applied to a
-    number greater than 0x10000 or from a string literal greater than
-    0x10000.
-
     This PEP does NOT imply that people using Unicode need to use a
-    4-byte encoding.  It only allows them to do so.  For example,
-    ASCII is still a legitimate (7-bit) Unicode-encoding.
+    4-byte encoding for their files on disk or sent over the network.
+    It only allows them to do so.  For example, ASCII is still a
+    legitimate (7-bit) Unicode-encoding.
+
+    It has been proposed that there should be a module that handles
+    surrogates in narrow Python builds for programmers.  If someone
+    wants to implement that, it will be another PEP.  It might also be
+    combined with features that allow other kinds of character-,
+    word- and line- based indexing.
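+
+    As a sketch only (the function name and behaviour here are
+    hypothetical, not part of this PEP), such a module might offer
+    a way to walk a narrow Unicode string while combining each
+    surrogate pair into a single wide code point:
+
+        def code_points(s):
+            # return the code points of s as a list of integers,
+            # combining surrogate pairs; isolated surrogates are
+            # passed through unchanged
+            result = []
+            i, n = 0, len(s)
+            while i < n:
+                o = ord(s[i])
+                if 0xd800 <= o < 0xdc00 and i + 1 < n:
+                    low = ord(s[i + 1])
+                    if 0xdc00 <= low < 0xe000:
+                        o = 0x10000 + ((o - 0xd800) << 10) \
+                                    + (low - 0xdc00)
+                        i = i + 1
+                result.append(o)
+                i = i + 1
+            return result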
 
 
-Open Questions
+Rejected Suggestions
 
-    "Code points" above TOPCHAR cannot be expressed in two 16-bit
-    characters.  These are not assigned to Unicode characters and
-    supposedly will never be.  Should we allow them to be passed as
-    arguments to unichr() anyhow?  We could allow knowledgable
-    programmers to use these "unused" characters for whatever they
-    want, though Unicode does not address them.
+    More or less the status-quo
 
-    "Lone surrogates" "should not" occur on wide platforms.  Should
-    ord() still accept them?
+        We could officially say that Python characters are 16-bit and
+        require programmers to implement wide characters in their
+        application logic by combining surrogate pairs.  This is a heavy
+        burden because emulating 32-bit characters is likely to be
+        very inefficient if it is coded entirely in Python.  Plus these
+        abstracted pseudo-strings would not be legal as input to the
+        regular expression engine.
 
+    "Space-efficient Unicode" type
+
+        Another class of solution is to use some efficient storage
+        internally but present an abstraction of wide characters to
+        the programmer.  Any of these would require a much more complex
+        implementation than the accepted solution.  For instance, consider
+        the impact on the regular expression engine.  In theory, we could
+        move to this implementation in the future without breaking Python
+        code.  A future Python could "emulate" wide Python semantics on
+        narrow Python.  Guido is not willing to undertake the
+        implementation right now.
+
+    Two types
+
+        We could introduce a 32-bit Unicode type alongside the 16-bit
+        type.  There is a lot of code that expects there to be only a
+        single Unicode type.
+
+    This PEP represents the least-effort solution.  Over the next
+    several years, 32-bit Unicode characters will become more common
+    and that may either convince us that we need a more sophisticated
+    solution or (on the other hand) convince us that simply
+    mandating wide Unicode characters is an appropriate solution.
+    Right now the two options on the table are do nothing or do
+    this.
+
+
+References
+
+    Unicode Glossary: http://www.unicode.org/glossary/
+
+
+Copyright
+
+    This document has been placed in the public domain.
 
 Local Variables: