Updated terminology and format.

This commit is contained in:
Paul Prescod 2001-07-01 19:52:25 +00:00
parent ec6436a790
commit bdff046670
1 changed file with 221 additions and 75 deletions


@@ -11,122 +11,268 @@ Post-History: 27-Jun-2001
Abstract
Python 2.1 unicode characters can have ordinals only up to 2**16 -1.
This range corresponds to a range in Unicode known as the Basic
Multilingual Plane. There are now characters in Unicode that live
on other "planes". The largest addressable character in Unicode
has the ordinal 17 * 2**16 - 1 (0x10ffff). For readability, we
will call this TOPCHAR and call characters in this range "wide
characters".
Glossary
Character
Used by itself, means the addressable units of a Python
Unicode string.
Code point
A code point is an integer between 0 and TOPCHAR.
If you imagine Unicode as a mapping from integers to
characters, each integer is a code point. But the
integers between 0 and TOPCHAR that do not map to
characters are also code points. Some will someday
be used for characters. Some are guaranteed never
to be used for characters.
Codec
A set of functions for translating physical encodings
(e.g. on disk or coming in from a network) into logical
Python objects.
Encoding
Mechanism for representing abstract characters in terms of
physical bits and bytes. Encodings allow us to store
Unicode characters on disk and transmit them over networks
in a manner that is compatible with other Unicode software.
Surrogate pair
Two physical characters that represent a single logical
character. Part of a convention for representing 32-bit
code points in terms of two 16-bit code points.
Unicode string
A Python type representing a sequence of code points with
"string semantics" (e.g. case conversions, regular
expression compatibility, etc.) Constructed with the
unicode() function.
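To make the "Surrogate pair" entry concrete, here is a minimal
sketch (not part of this PEP's specification) of the arithmetic
that maps a wide code point to and from two 16-bit code points:

    def to_surrogate_pair(cp):
        # split a code point above 0xFFFF into high and low surrogates
        assert 0x10000 <= cp <= 0x10FFFF
        cp = cp - 0x10000
        return 0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)

    def from_surrogate_pair(high, low):
        # recombine a surrogate pair into the original code point
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

    assert to_surrogate_pair(0x10FFFF) == (0xDBFF, 0xDFFF)
    assert from_surrogate_pair(0xD800, 0xDC00) == 0x10000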
Proposed Solution
One solution would be to merely increase the maximum ordinal to a
larger value. Unfortunately the only straightforward
implementation of this idea is to increase the character code unit
to 4 bytes. This has the effect of doubling the size of most
Unicode strings. In order to avoid imposing this cost on every
user, Python 2.2 will allow 4-byte Unicode characters as a
build-time option.
One solution would be to merely increase the maximum ordinal
to a larger value. Unfortunately the only straightforward
implementation of this idea is to use 4 bytes per character.
This has the effect of doubling the size of most Unicode
strings. In order to avoid imposing this cost on every
user, Python 2.2 will allow the 4-byte implementation as a
build-time option. Users can choose whether they care about
wide characters or prefer to preserve memory.
The 4-byte option is called "wide Py_UNICODE". The 2-byte option
is called "narrow Py_UNICODE".
Most things will behave identically in the wide and narrow worlds.
* the \u and \U literal syntaxes will always generate the same
data that the unichr function would. They are just different
syntaxes for the same thing.
* unichr(i) for 0 <= i < 2**16 (0x10000) always returns a
length-one string.
* unichr(i) for 2**16 <= i <= TOPCHAR will always return a
string representing the character, BUT on narrow builds of
Python the string will actually be composed of two characters
called a "surrogate pair".
ISSUE
Python currently allows \U literals that cannot be
represented as a single Python character. It generates two
Python characters known as a "surrogate pair". Should this
be disallowed on future narrow Python builds?
Pro:
Python already allows the construction of a surrogate pair
for a large unicode literal character escape sequence.
This is basically designed as a simple way to construct
"wide characters" even in a narrow Python build. It is also
somewhat logical considering that the Unicode-literal syntax
is basically a short-form way of invoking the unicode-escape
codec.
Con:
Surrogates could be easily created this way but the user
still needs to be careful about slicing, indexing, printing
etc. Therefore some have suggested that Unicode
literals should not support surrogates.
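For concreteness, a session on a narrow build would be expected
to look something like this (hypothetical output, assuming the
surrogate-pair behavior described above):

    >>> s = unichr(0x10000)          # one logical character...
    >>> len(s)                       # ...but two storage units
    2
    >>> [hex(ord(c)) for c in s]
    ['0xd800', '0xdc00']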
ISSUE
Should Python allow the construction of characters that do
not correspond to Unicode code points? Unassigned Unicode
code points should obviously be legal (because they could
be assigned at any time). But code points above TOPCHAR are
guaranteed never to be used by Unicode. Should we allow access
to them anyhow?
Pro:
If a Python user thinks they know what they're doing why
should we try to prevent them from violating the Unicode
spec? After all, we don't stop 8-bit strings from
containing non-ASCII characters.
Con:
Codecs and other Unicode-consuming code will have to be
careful of these characters which are disallowed by the
Unicode specification.
* ord() is always the inverse of unichr()
* There is an integer value in the sys module that describes the
largest ordinal for a character in a Unicode string on the current
interpreter. sys.maxunicode is 2**16-1 (0xffff) on narrow builds
of Python and TOPCHAR on wide builds.
* Note that ord() can in some cases return ordinals higher than
sys.maxunicode because it accepts surrogate pairs on narrow
Python builds.
ISSUE: Should there be distinct constants for accessing
TOPCHAR and the real upper bound for the domain of
unichr (if they differ)? There has also been a
suggestion of sys.unicodewidth which can take the
values 'wide' and 'narrow'.
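For example, code that must adapt to the build width could test
sys.maxunicode directly. This sketch assumes only the
sys.maxunicode value described above (sys.unicodewidth is, so
far, just a suggestion):

    import sys

    # sys.maxunicode is 2**16 - 1 (0xffff) on narrow builds,
    # TOPCHAR on wide builds
    NARROW_BUILD = (sys.maxunicode == 0xffff)
    if NARROW_BUILD:
        # wide characters will show up as surrogate pairs here
        pass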
* codecs will be upgraded to support "wide characters". On narrow
Python builds, the codecs will generate surrogate pairs, on wide
Python builds they will generate a single character.
* every Python Unicode character represents exactly one Unicode code
point (i.e. Python Unicode Character = Abstract Unicode character).
* codecs will be upgraded to support "wide characters"
(represented directly in UCS-4, and as variable-length sequences
in UTF-8 and UTF-16). This is the main part of the implementation
left to be done.
* There is a convention in the Unicode world for encoding a 32-bit
code point in terms of two 16-bit code points. These are known
as "surrogate pairs". Python's codecs will adopt this convention
and encode 32-bit code points as surrogate pairs on narrow Python
builds.
ISSUE
Should there be a way to tell codecs not to generate
surrogates and instead treat wide characters as
errors?
Pro:
I might want to write code that works only with
fixed-width characters and does not have to worry about
surrogates.
Con:
No clear proposal of how to communicate this to codecs.
* there are no restrictions on constructing strings that use
code points "reserved for surrogates" improperly. These are
called "isolated surrogates". The codecs should disallow reading
these from files, but you could construct them using string
literals or unichr().
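To illustrate the last point, nothing stops a program from
building an isolated surrogate directly, even though a conforming
codec should never produce one when reading well-formed data
(a sketch):

    lone = unichr(0xD800)      # a code point reserved for surrogates
    assert len(lone) == 1      # legal to construct...
    # ...but codecs should refuse to read such data from files.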
Implementation
There is a new (experimental) define:
#define PY_UNICODE_SIZE 2
There is a new configure option:
--enable-unicode=ucs2 configures a narrow Py_UNICODE, and uses
wchar_t if it fits
--enable-unicode=ucs4 configures a wide Py_UNICODE, and uses
wchar_t if it fits
--enable-unicode same as "=ucs2"
The intention is that --disable-unicode, or --enable-unicode=no
removes the Unicode type altogether; this is not yet implemented.
It is also proposed that one day --enable-unicode will just
default to the width of your platform's wchar_t.
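For example, a wide interpreter would be built like this
(assuming the options land as described above):

    ./configure --enable-unicode=ucs4
    make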
Windows builds will be narrow for a while based on the fact that
there have been few requests for wide characters, those requests
are mostly from hard-core programmers with the ability to buy
their own Python, and Windows itself is strongly biased towards
16-bit characters.
Notes
Note that len(unichr(i))==2 for i>=0x10000 on narrow machines.
This means (for example) that the following code is not portable:
    x = 0x10000
    if unichr(x) in somestring:
        ...
In general, you should be careful using "in" if the character that
is searched for could have been generated from unichr applied to a
number greater than 0x10000 or from a string literal greater than
0x10000.
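A portable spelling is to use find() instead, since it does a
substring search and does not require a length-one search string
(a sketch):

    x = 0x10000
    if somestring.find(unichr(x)) >= 0:
        ...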
This PEP does NOT imply that people using Unicode need to use a
4-byte encoding for their files on disk or sent over the network.
It only allows them to do so. For example, ASCII is still a
legitimate (7-bit) Unicode-encoding.
It has been proposed that there should be a module that handles
surrogates in narrow Python builds for programmers. If someone
wants to implement that, it will be another PEP. It might also be
combined with features that allow other kinds of character-,
word- and line-based indexing.
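A sketch of what such a module might provide (hypothetical; this
PEP does not propose it): a function that walks a narrow-build
string and reports logical code points rather than storage units:

    def code_points(u):
        # combine each surrogate pair into a single code point value
        points = []
        i = 0
        while i < len(u):
            c = ord(u[i])
            if 0xD800 <= c < 0xDC00 and i + 1 < len(u) \
               and 0xDC00 <= ord(u[i + 1]) < 0xE000:
                points.append(0x10000 + ((c - 0xD800) << 10)
                              + (ord(u[i + 1]) - 0xDC00))
                i = i + 2
            else:
                points.append(c)
                i = i + 1
        return points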
Rejected Suggestions
"Code points" above TOPCHAR cannot be expressed in two 16-bit
characters. These are not assigned to Unicode characters and
supposedly will never be. Should we allow them to be passed as
arguments to unichr() anyhow? We could allow knowledgable
programmers to use these "unused" characters for whatever they
want, though Unicode does not address them.
More or less the status-quo
"Lone surrogates" "should not" occur on wide platforms. Should
ord() still accept them?
We could officially say that Python characters are 16-bit and
require programmers to implement wide characters in their
application logic by combining surrogate pairs. This is a heavy
burden because emulating 32-bit characters is likely to be
very inefficient if it is coded entirely in Python. Plus these
abstracted pseudo-strings would not be legal as input to the
regular expression engine.
"Space-efficient Unicode" type
Another class of solution is to use some efficient storage
internally but present an abstraction of wide characters to
the programmer. Any of these would require a much more complex
implementation than the accepted solution. For instance consider
the impact on the regular expression engine. In theory, we could
move to this implementation in the future without breaking Python
code. A future Python could "emulate" wide Python semantics on
narrow Python. Guido is not willing to undertake the
implementation right now.
Two types
We could introduce a 32-bit Unicode type alongside the 16-bit
type. There is a lot of code that expects there to be only a
single Unicode type.
This PEP represents the least-effort solution. Over the next
several years, 32-bit Unicode characters will become more common
and that may either convince us that we need a more sophisticated
solution or (on the other hand) convince us that simply
mandating wide Unicode characters is an appropriate solution.
Right now the two options on the table are do nothing or do
this.
References
Unicode Glossary: http://www.unicode.org/glossary/
Copyright
This document has been placed in the public domain.