PEP: 261
Title: Support for "wide" Unicode characters
Version: $Revision$
Last-Modified: $Date$
Author: Paul Prescod <paul@prescod.net>
Status: Final
Type: Standards Track
Content-Type: text/x-rst
Created: 27-Jun-2001
Python-Version: 2.2
Post-History: 27-Jun-2001


Abstract
========

Python 2.1 unicode characters can have ordinals only up to ``2**16 - 1``.
This range corresponds to a range in Unicode known as the Basic
Multilingual Plane. There are now characters in Unicode that live
on other "planes". The largest addressable character in Unicode
has the ordinal ``17 * 2**16 - 1`` (``0x10ffff``). For readability, we
will call this TOPCHAR and call characters above the Basic
Multilingual Plane, up to TOPCHAR, "wide characters".
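
As a quick check of the arithmetic above, the following sketch computes
both bounds. Only TOPCHAR is a name used by this PEP; ``BMP_MAX`` and
``is_wide`` are hypothetical names chosen for illustration, and reading
"wide character" as "ordinal above the Basic Multilingual Plane" is an
assumption.

.. code-block:: python

    # Illustrative constants; only TOPCHAR is a name taken from this PEP.
    BMP_MAX = 2**16 - 1        # last ordinal of the Basic Multilingual Plane
    TOPCHAR = 17 * 2**16 - 1   # 17 planes of 2**16 code points each


    def is_wide(ordinal):
        """Return True if ordinal lies above the BMP but not above TOPCHAR."""
        return BMP_MAX < ordinal <= TOPCHAR


    print(hex(BMP_MAX), hex(TOPCHAR))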


Glossary
========

Character
   Used by itself, means the addressable units of a Python
   Unicode string.

Code point
   A code point is an integer between 0 and TOPCHAR.
   If you imagine Unicode as a mapping from integers to
   characters, each integer is a code point. But the
   integers between 0 and TOPCHAR that do not map to
   characters are also code points. Some will someday
   be used for characters. Some are guaranteed never
   to be used for characters.

Codec
   A set of functions for translating between physical
   encodings (e.g. on disk or coming in from a network)
   and logical Python objects.

Encoding
   Mechanism for representing abstract characters in terms of
   physical bits and bytes. Encodings allow us to store
   Unicode characters on disk and transmit them over networks
   in a manner that is compatible with other Unicode software.

Surrogate pair
   Two physical characters that represent a single logical
   character. Part of a convention for representing 32-bit
   code points in terms of two 16-bit code points.

Unicode string
   A Python type representing a sequence of code points with
   "string semantics" (e.g. case conversions, regular
   expression compatibility, etc.). Constructed with the
   ``unicode()`` function.
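
The surrogate pair convention mentioned in the glossary can be sketched
as follows. This is the usual UTF-16 arithmetic, which this PEP refers to
but does not spell out; the function names ``to_surrogate_pair`` and
``from_surrogate_pair`` are hypothetical and not part of any Python API.

.. code-block:: python

    def to_surrogate_pair(code_point):
        """Split a code point above the BMP into (high, low) 16-bit values."""
        if not 0x10000 <= code_point <= 0x10FFFF:
            raise ValueError("code point is not above the BMP")
        offset = code_point - 0x10000        # fits in 20 bits
        high = 0xD800 + (offset >> 10)       # top 10 bits
        low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
        return high, low


    def from_surrogate_pair(high, low):
        """Combine a (high, low) surrogate pair back into one code point."""
        if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
            raise ValueError("not a valid surrogate pair")
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


    high, low = to_surrogate_pair(0x10400)
    print(hex(high), hex(low), hex(from_surrogate_pair(high, low)))

The two functions are inverses of each other over the whole range from
``0x10000`` to TOPCHAR.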


Proposed Solution
=================

One solution would be to merely increase the maximum ordinal
to a larger value. Unfortunately the only straightforward
implementation of this idea is to use 4 bytes per character.
This has the effect of doubling the size of most Unicode
strings. In order to avoid imposing this cost on every
user, Python 2.2 will allow the 4-byte implementation as a
build-time option. Users can choose whether they care about
wide characters or prefer to preserve memory.

The 4-byte option is called "wide ``Py_UNICODE``". The 2-byte option
is called "narrow ``Py_UNICODE``".

Most things will behave identically in the wide and narrow worlds.

* ``unichr(i)`` for 0 <= i < ``2**16`` (``0x10000``) always returns a
  length-one string.

* ``unichr(i)`` for ``2**16`` <= i <= TOPCHAR will return a
  length-one string on wide Python builds. On narrow builds it will
  raise ``ValueError``.

  **ISSUE**

     Python currently allows ``\U`` literals that cannot be
     represented as a single Python character. It generates two
     Python characters known as a "surrogate pair". Should this
     be disallowed on future narrow Python builds?

  **Pro:**

     Python already allows the construction of a surrogate pair
     for a large unicode literal character escape sequence.
     This is basically designed as a simple way to construct
     "wide characters" even in a narrow Python build. It is also
     somewhat logical considering that the Unicode-literal syntax
     is basically a short-form way of invoking the unicode-escape
     codec.

  **Con:**

     Surrogates could be easily created this way but the user
     still needs to be careful about slicing, indexing, printing
     etc. Therefore, some have suggested that Unicode
     literals should not support surrogates.


  **ISSUE**

     Should Python allow the construction of characters that do
     not correspond to Unicode code points?  Unassigned Unicode
     code points should obviously be legal (because they could
     be assigned at any time). But code points above TOPCHAR are
     guaranteed never to be used by Unicode. Should we allow access
     to them anyhow?

  **Pro:**

     If a Python user thinks they know what they're doing why
     should we try to prevent them from violating the Unicode
     spec? After all, we don't stop 8-bit strings from
     containing non-ASCII characters.

  **Con:**

     Codecs and other Unicode-consuming code will have to be
     careful of these characters which are disallowed by the
     Unicode specification.

* ``ord()`` is always the inverse of ``unichr()``

* There is an integer value in the sys module that describes the
  largest ordinal for a character in a Unicode string on the current
  interpreter. ``sys.maxunicode`` is ``2**16-1`` (``0xffff``) on narrow builds
  of Python and TOPCHAR on wide builds.

  **ISSUE**

     Should there be distinct constants for accessing
     TOPCHAR and the real upper bound for the domain of
     ``unichr`` (if they differ)? There has also been a
     suggestion of ``sys.unicodewidth`` which can take the
     values ``'wide'`` and ``'narrow'``.

* Every Python Unicode character represents exactly one Unicode code
  point (i.e. Python Unicode Character = Abstract Unicode character).

* Codecs will be upgraded to support "wide characters"
  (represented directly in UCS-4, and as variable-length sequences
  in UTF-8 and UTF-16). This is the main part of the implementation
  left to be done.

* There is a convention in the Unicode world for encoding a 32-bit
  code point in terms of two 16-bit code points. These are known
  as "surrogate pairs". Python's codecs will adopt this convention
  and encode 32-bit code points as surrogate pairs on narrow Python
  builds.

  **ISSUE**

     Should there be a way to tell codecs not to generate
     surrogates and instead treat wide characters as
     errors?

  **Pro:**

     I might want to write code that works only with
     fixed-width characters and does not have to worry about
     surrogates.

  **Con:**

     No clear proposal of how to communicate this to codecs.

* There are no restrictions on constructing strings that use
  code points "reserved for surrogates" improperly. These are
  called "isolated surrogates". The codecs should disallow reading
  these from files, but you could construct them using string
  literals or ``unichr()``.
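
Several of the properties listed above can be observed on a current
interpreter. The sketch below assumes Python 3.3 or later, where
``chr()`` takes the role of ``unichr()`` and every build behaves like a
wide build; it does not reproduce Python 2.2 itself, and it is not meant
to settle any of the issues raised above. The helper names
``beyond_topchar_is_rejected`` and ``codec_rejects`` are hypothetical.

.. code-block:: python

    import sys

    TOPCHAR = 0x10FFFF  # this PEP's name for the largest Unicode ordinal

    # Assumption: Python 3.3+, where sys.maxunicode equals TOPCHAR.
    print(hex(sys.maxunicode))

    # ord() is the inverse of chr(), and each result has length one.
    for ordinal in (0x41, 0xFFFF, 0x10000, TOPCHAR):
        char = chr(ordinal)
        assert len(char) == 1
        assert ord(char) == ordinal


    def beyond_topchar_is_rejected():
        """Return True if chr() refuses an ordinal above TOPCHAR."""
        try:
            chr(TOPCHAR + 1)
        except ValueError:
            return True
        return False


    # An isolated surrogate can still be constructed directly ...
    lone_surrogate = chr(0xD800)

    # ... but the UTF-8 codec refuses to read one from bytes.
    ISOLATED_SURROGATE_UTF8 = b"\xed\xa0\x80"  # would spell U+D800


    def codec_rejects(data, encoding="utf-8"):
        """Return True if decoding data with the codec raises an error."""
        try:
            data.decode(encoding)
        except UnicodeDecodeError:
            return True
        return False


    print(beyond_topchar_is_rejected(), codec_rejects(ISOLATED_SURROGATE_UTF8))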


Implementation
==============

There is a new define::

    #define Py_UNICODE_SIZE 2

To test whether UCS-2 or UCS-4 is in use, the derived macro
``Py_UNICODE_WIDE`` should be used, which is defined when UCS-4 is in
use.
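
Python code cannot see ``Py_UNICODE_WIDE``, but ``sys.maxunicode``
carries the same information. A minimal sketch of a run-time check
follows; the function name ``unicode_build_width`` is hypothetical, and
its return values simply borrow the ``'wide'`` and ``'narrow'`` spelling
suggested for ``sys.unicodewidth`` earlier in this PEP.

.. code-block:: python

    import sys


    def unicode_build_width():
        """Return 'wide' if ordinals above 0xffff fit in one character."""
        return "wide" if sys.maxunicode > 0xFFFF else "narrow"


    print(unicode_build_width())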

There is a new configure option:

=====================  ============================================
--enable-unicode=ucs2  configures a narrow ``Py_UNICODE``, and uses
                       wchar_t if it fits
--enable-unicode=ucs4  configures a wide ``Py_UNICODE``, and uses
                       wchar_t if it fits
--enable-unicode       same as "=ucs2"
--disable-unicode      entirely remove the Unicode functionality.
=====================  ============================================

It is also proposed that one day ``--enable-unicode`` will just
default to the width of your platform's ``wchar_t``.

Windows builds will be narrow for a while, because there have been
few requests for wide characters, those requests come mostly from
hard-core programmers with the ability to build their own Python,
and Windows itself is strongly biased towards 16-bit characters.


Notes
=====

This PEP does NOT imply that people using Unicode need to use a
4-byte encoding for their files on disk or sent over the network.
It only allows them to do so. For example, ASCII is still a
legitimate (7-bit) Unicode encoding.
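
To illustrate that the external encoding is independent of the internal
character width, the sketch below measures how many bytes several codecs
need for one BMP character and one wide character. It assumes a Python 3
interpreter; ``ENCODINGS`` and ``encoded_sizes`` are hypothetical names,
and the ``-le`` codec variants are used only to keep byte order marks out
of the counts.

.. code-block:: python

    ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")


    def encoded_sizes(text):
        """Map each encoding name to the number of bytes text occupies."""
        return {name: len(text.encode(name)) for name in ENCODINGS}


    bmp_char = "\u0041"        # LATIN CAPITAL LETTER A, inside the BMP
    wide_char = "\U00010400"   # a character above the BMP

    print(encoded_sizes(bmp_char))
    print(encoded_sizes(wide_char))

    # Pure ASCII text still round-trips through the 7-bit ASCII codec.
    ascii_bytes = "plain text".encode("ascii")
    print(ascii_bytes.decode("ascii"))

UTF-8 and UTF-16 grow only for the characters that need more room,
whereas UTF-32 spends four bytes on every character.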

It has been proposed that there should be a module that handles
surrogates in narrow Python builds for programmers. If someone
wants to implement that, it will be another PEP. It might also be
combined with features that allow other kinds of character-,
word- and line-based indexing.
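
As a rough idea of what such a helper module might contain, the sketch
below walks a sequence of 16-bit code units and yields whole code
points, combining surrogate pairs on the way. It is not a proposed API:
the name ``iter_code_points`` is hypothetical, and passing isolated
surrogates through unchanged is one possible policy among several.

.. code-block:: python

    def iter_code_points(units):
        """Yield code points from an iterable of 16-bit code units.

        Valid surrogate pairs are combined into a single code point.
        Isolated surrogates are yielded unchanged.
        """
        pending_high = None
        for unit in units:
            if pending_high is not None:
                if 0xDC00 <= unit <= 0xDFFF:
                    # A low surrogate completes the pending pair.
                    yield (0x10000
                           + ((pending_high - 0xD800) << 10)
                           + (unit - 0xDC00))
                    pending_high = None
                    continue
                # The pending high surrogate turned out to be isolated.
                yield pending_high
                pending_high = None
            if 0xD800 <= unit <= 0xDBFF:
                pending_high = unit   # wait for a possible low surrogate
            else:
                yield unit
        if pending_high is not None:
            yield pending_high        # trailing isolated high surrogate


    narrow_units = [0x41, 0xD801, 0xDC00, 0x42]
    print([hex(cp) for cp in iter_code_points(narrow_units)])

Indexing by code point then amounts to indexing the resulting sequence,
at the cost of a linear scan.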


Rejected Suggestions
====================

More or less the status-quo

   We could officially say that Python characters are 16-bit and
   require programmers to implement wide characters in their
   application logic by combining surrogate pairs. This is a heavy
   burden because emulating 32-bit characters is likely to be
   very inefficient if it is coded entirely in Python. Plus these
   abstracted pseudo-strings would not be legal as input to the
   regular expression engine.

"Space-efficient Unicode" type

   Another class of solution is to use some efficient storage
   internally but present an abstraction of wide characters to
   the programmer. Any of these would require a much more complex
   implementation than the accepted solution. For instance consider
   the impact on the regular expression engine. In theory, we could
   move to this implementation in the future without breaking Python
   code. A future Python could "emulate" wide Python semantics on
   narrow Python. Guido is not willing to undertake the
   implementation right now.

Two types

   We could introduce a 32-bit Unicode type alongside the 16-bit
   type. There is a lot of code that expects there to be only a
   single Unicode type.

This PEP represents the least-effort solution. Over the next
several years, 32-bit Unicode characters will become more common
and that may either convince us that we need a more sophisticated
solution or (on the other hand) convince us that simply
mandating wide Unicode characters is an appropriate solution.
Right now the two options on the table are do nothing or do
this.


References
==========

Unicode Glossary: http://www.unicode.org/glossary/


Copyright
=========

This document has been placed in the public domain.


..
  Local Variables:
  mode: indented-text
  indent-tabs-mode: nil
  End: