PEP: 393
Title: Flexible String Representation
Version: $Revision$
Last-Modified: $Date$
Author: Martin v. Löwis <martin@v.loewis.de>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 24-Jan-2010
Python-Version: 3.3
Post-History:

Abstract
========

The Unicode string type is changed to support multiple internal
representations, depending on the character with the largest Unicode
ordinal (1, 2, or 4 bytes). This will allow a space-efficient
representation in common cases, but give access to full UCS-4 on all
systems. For compatibility with existing APIs, several representations
may exist in parallel; over time, this compatibility should be phased
out.

Rationale
=========

There are two classes of complaints about the current implementation
of the unicode type: on systems only supporting UTF-16, users complain
that non-BMP characters are not properly supported. On systems using
UCS-4 internally (and also sometimes on systems using UCS-2), there is
a complaint that Unicode strings take up too much memory - especially
compared to Python 2.x, where the same code would often use ASCII
strings (i.e. ASCII-encoded byte strings). With the proposed approach,
ASCII-only Unicode strings will again use only one byte per character,
while still allowing efficient indexing of strings containing non-BMP
characters (as strings containing them will use 4 bytes per
character).

One problem with the approach is support for existing applications
(e.g. extension modules). For compatibility, redundant representations
may be computed. Applications are encouraged to phase out reliance on
a specific internal representation if possible. As interaction with
other libraries will often require some sort of internal
representation, the specification chooses UTF-8 as the recommended way
of exposing strings to C code.

For many strings (e.g. ASCII), multiple representations may actually
share memory (e.g. the shortest form may be shared with the UTF-8 form
if all characters are ASCII). With such sharing, the overhead of
compatibility representations is reduced.

Specification
=============

The Unicode object structure is changed to this definition::

  typedef struct {
    PyObject_HEAD
    Py_ssize_t length;
    union {
        void *any;
        Py_UCS1 *latin1;
        Py_UCS2 *ucs2;
        Py_UCS4 *ucs4;
    } data;
    Py_hash_t hash;
    int state;
    Py_ssize_t utf8_length;
    void *utf8;
    Py_ssize_t wstr_length;
    void *wstr;
  } PyUnicodeObject;

These fields have the following interpretations:

- length: number of code points in the string (result of sq_length)

- data: shortest-form representation of the unicode string.
  The string is null-terminated (in its respective representation).

- hash: same as in Python 3.2

- state:

  * lowest 2 bits (mask 0x03) - interned-state (SSTATE_*) as in 3.2
  * next 2 bits (mask 0x0C) - form of str:

    + 00 => str is not initialized (data are in wstr)
    + 01 => 1 byte (Latin-1)
    + 10 => 2 byte (UCS-2)
    + 11 => 4 byte (UCS-4)

  * next bit (mask 0x10): 1 if str memory follows PyUnicodeObject

- utf8_length, utf8: UTF-8 representation (null-terminated)

- wstr_length, wstr: representation in platform's wchar_t
  (null-terminated). If wchar_t is 16-bit, this form may use surrogate
  pairs (in which case wstr_length differs from length).

All three representations are optional, although the data form is
considered the canonical representation which can be absent only
while the string is being created. If the representation is absent,
the pointer is NULL, and the corresponding length field may contain
arbitrary data.
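
As an illustration of the state layout, the bits could be unpacked as
follows (a sketch only; unpack_state is not part of the proposed API)::

  /* Illustrative only: decode the packed "state" field according to
     the masks given above. */
  static void
  unpack_state(int state, int *interned, int *kind, int *compact)
  {
      *interned = state & 0x03;         /* SSTATE_* interning state */
      *kind     = (state & 0x0C) >> 2;  /* 0 = wstr only, 1/2/3 = 1/2/4 bytes */
      *compact  = (state & 0x10) != 0;  /* str memory follows the struct */
  }
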
The Py_UNICODE type is still supported but deprecated. It is always
defined as a typedef for wchar_t, so the wstr representation can double
as Py_UNICODE representation.
The data and utf8 pointers point to the same memory if the string uses
only ASCII characters (using only Latin-1 is not sufficient). The data
and wstr pointers point to the same memory if the string happens to
fit exactly to the wchar_t type of the platform (i.e. uses some
BMP-not-Latin-1 characters if sizeof(wchar_t) is 2, and uses some
non-BMP characters if sizeof(wchar_t) is 4).

If the string is created directly with the canonical representation
(see below), this representation doesn't take a separate memory block,
but is allocated right after the PyUnicodeObject struct.

String Creation
---------------

The recommended way to create a Unicode object is to use the function
PyUnicode_New::

  PyObject* PyUnicode_New(Py_ssize_t size, Py_UCS4 maxchar);

Both parameters must denote the eventual size/range of the string.
In particular, codecs using this API must compute both the number of
characters and the maximum character in advance. A string is
allocated according to the specified size and character range and is
null-terminated; the actual characters in it may be uninitialized.
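
For example, a codec that already knows its output contains only ASCII
could create and fill the target string as follows (a sketch only;
unicode_from_ascii is illustrative, and the PyUnicode_KIND,
PyUnicode_DATA and PyUnicode_WRITE macros are described under String
Access below)::

  /* Sketch: allocate an ASCII-only string of known length and fill it
     character by character. */
  static PyObject *
  unicode_from_ascii(const char *ascii, Py_ssize_t size)
  {
      PyObject *s = PyUnicode_New(size, 127);       /* maxchar <= 127 */
      if (s == NULL)
          return NULL;
      int kind = PyUnicode_KIND(s);
      void *data = PyUnicode_DATA(s);
      for (Py_ssize_t i = 0; i < size; i++)
          PyUnicode_WRITE(kind, data, i, (Py_UCS4)(unsigned char)ascii[i]);
      return s;
  }
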
PyUnicode_FromString and PyUnicode_FromStringAndSize remain supported
for processing UTF-8 input; the input is decoded, and the UTF-8
representation is not yet set for the string.

PyUnicode_FromUnicode remains supported but is deprecated. If the
Py_UNICODE pointer is non-null, the data representation is set. If the
pointer is NULL, a properly-sized wstr representation is allocated,
which can be modified until PyUnicode_Ready() is called (explicitly
or implicitly). Resizing a Unicode string remains possible until it
is finalized.

PyUnicode_Ready() converts a string containing only a wstr
representation into the canonical representation. Unless wstr and data
can share the memory, the wstr representation is discarded after the
conversion. PyUnicode_FAST_READY() is a wrapper that avoids the
function call if the string is already ready. Both APIs return 0
on success and -1 on failure.
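
For illustration, code following the deprecated construction path might
look like this (a sketch only; make_abc is illustrative, and
PyUnicode_Ready is assumed to take the string object as its argument)::

  /* Sketch: deprecated two-step construction -- allocate a wstr buffer,
     fill it, then convert to the canonical representation. */
  static PyObject *
  make_abc(void)
  {
      PyObject *s = PyUnicode_FromUnicode(NULL, 3);
      if (s == NULL)
          return NULL;
      Py_UNICODE *buf = PyUnicode_AS_UNICODE(s);
      buf[0] = 'a'; buf[1] = 'b'; buf[2] = 'c';
      if (PyUnicode_Ready(s) < 0) {     /* finalize into canonical form */
          Py_DECREF(s);
          return NULL;
      }
      return s;
  }
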

String Access
-------------

The canonical representation can be accessed using two macros
PyUnicode_KIND and PyUnicode_DATA. PyUnicode_KIND gives one of the
values PyUnicode_WCHAR_KIND (0), PyUnicode_1BYTE_KIND (1),
PyUnicode_2BYTE_KIND (2), or PyUnicode_4BYTE_KIND (3). PyUnicode_DATA
gives the void pointer to the data. Access to individual characters
should use PyUnicode_{READ|WRITE}[_CHAR]:

- PyUnicode_READ(kind, data, index)
- PyUnicode_WRITE(kind, data, index, value)
- PyUnicode_READ_CHAR(unicode, index)
- PyUnicode_WRITE_CHAR(unicode, index, value)

All these macros assume that the string is in canonical form;
callers need to ensure this by calling PyUnicode_FAST_READY.
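
For example, a function counting the ASCII characters of a string could
be written as follows (a sketch only; count_ascii is illustrative, and
PyUnicode_GET_LENGTH is assumed to be the length accessor accompanying
these macros)::

  /* Sketch: read characters through the canonical representation. */
  static Py_ssize_t
  count_ascii(PyObject *s)
  {
      if (PyUnicode_FAST_READY(s) < 0)      /* ensure canonical form */
          return -1;
      int kind = PyUnicode_KIND(s);
      void *data = PyUnicode_DATA(s);
      Py_ssize_t n = PyUnicode_GET_LENGTH(s);
      Py_ssize_t count = 0;
      for (Py_ssize_t i = 0; i < n; i++)
          if (PyUnicode_READ(kind, data, i) < 128)
              count++;
      return count;
  }
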
A new function PyUnicode_AsUTF8 is provided to access the UTF-8
representation. It is thus identical to the existing
_PyUnicode_AsString, which is removed. The function will compute the
utf8 representation when first called. Since this representation will
consume memory until the string object is released, applications
should use the existing PyUnicode_AsUTF8String where possible
(which generates a new string object every time). APIs that implicitly
convert a string to a char* (such as the ParseTuple functions) will
use PyUnicode_AsUTF8 to compute a conversion.
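
For example, code that needs the UTF-8 form only temporarily can go
through PyUnicode_AsUTF8String and release the copy afterwards (a sketch
only; utf8_byte_length is illustrative)::

  /* Sketch: obtain a temporary UTF-8 copy instead of attaching one to
     the unicode object for its whole lifetime. */
  static Py_ssize_t
  utf8_byte_length(PyObject *text)
  {
      PyObject *bytes = PyUnicode_AsUTF8String(text);   /* new reference */
      if (bytes == NULL)
          return -1;
      Py_ssize_t n = PyBytes_GET_SIZE(bytes);
      Py_DECREF(bytes);                 /* the UTF-8 copy is freed here */
      return n;
  }
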
PyUnicode_AsUnicode is deprecated; it computes the wstr representation
on first use.

Stable ABI
----------

None of the functions in this PEP become part of the stable ABI.

GDB Debugging Hooks
-------------------

Tools/gdb/libpython.py contains debugging hooks that embed knowledge
about the internals of CPython's data types, including PyUnicodeObject
instances. It will need to be slightly updated to track the change.

Open Issues
===========

- When an application uses the legacy API, it may hold onto
  the Py_UNICODE* representation, and yet start calling Unicode
  APIs, which would call PyUnicode_Ready, invalidating the
  Py_UNICODE* representation; this would be an incompatible change.
  The following solutions can be considered:

  * accept it as an incompatible change. Applications using the
    legacy API will have to fill out the Py_UNICODE buffer completely
    before calling any API on the string under construction.

  * require explicit PyUnicode_Ready calls in such applications;
    fail with a fatal error if a non-ready string is ever read.
    This would also be an incompatible change, but one that is
    more easily detected during testing.

  * as a compromise between these approaches, implicit PyUnicode_Ready
    calls (i.e. those not deliberately following the construction of
    a PyUnicode object) could produce a warning if they convert an
    object.

- Which of the APIs created during the development of the PEP should
  be public?

Discussion
==========

Several concerns have been raised about the approach presented here:

It makes the implementation more complex. That's true, but considered
worth it given the benefits.

The Py_UNICODE representation is not instantaneously available,
slowing down applications that request it. While this is also true,
applications that care about this problem can be rewritten to use the
data representation.

The question was raised whether the wchar_t representation is
discouraged, or scheduled for removal. This is not the intent of this
PEP; applications that use it will see a performance penalty,
though. Future versions of Python may consider removing it.

Performance
-----------

Performance of this patch must be considered for both memory
consumption and runtime efficiency. For memory consumption, the
expectation is that applications that have many large strings will see
a reduction in memory usage. For small strings, the effects depend on
the pointer size of the system, and the size of the Py_UNICODE/wchar_t
type. The following table demonstrates this for various small ASCII
string sizes and platforms (object sizes in bytes).

+-------+---------------------------------+----------------+
|string | Python 3.2 | This PEP |
|size +----------------+----------------+ |
| | 16-bit wchar_t | 32-bit wchar_t | |
| +---------+------+--------+-------+--------+-------+
| | 32-bit |64-bit| 32-bit |64-bit | 32-bit |64-bit |
+-------+---------+------+--------+-------+--------+-------+
|1 | 40 | 64 | 40 | 64 | 48 | 88 |
+-------+---------+------+--------+-------+--------+-------+
|2 | 40 | 64 | 48 | 72 | 48 | 88 |
+-------+---------+------+--------+-------+--------+-------+
|3 | 40 | 64 | 48 | 72 | 48 | 88 |
+-------+---------+------+--------+-------+--------+-------+
|4 | 48 | 72 | 56 | 80 | 48 | 88 |
+-------+---------+------+--------+-------+--------+-------+
|5 | 48 | 72 | 56 | 80 | 48 | 88 |
+-------+---------+------+--------+-------+--------+-------+
|6 | 48 | 72 | 64 | 88 | 48 | 88 |
+-------+---------+------+--------+-------+--------+-------+
|7 | 48 | 72 | 64 | 88 | 48 | 88 |
+-------+---------+------+--------+-------+--------+-------+
|8 | 56 | 80 | 72 | 96 | 56 | 88 |
+-------+---------+------+--------+-------+--------+-------+

The runtime effect is significantly affected by the API being
used. After porting the relevant pieces of code to the new API,
the iobench, stringbench, and json benchmarks typically see
slowdowns of 1% to 30%; for specific benchmarks, speedups may
occur, as may significantly larger slowdowns.

Porting Guidelines
==================

Only a small fraction of C code is affected by this PEP, namely code
that needs to look "inside" unicode strings. That code doesn't
necessarily need to be ported to this API, as the existing API will
continue to work correctly. In particular, modules that need to
support both Python 2 and Python 3 might get too complicated when
simultaneously supporting this new API and the old Unicode API.

In order to port modules to the new API, try to eliminate
the use of these API elements:

- the Py_UNICODE type,
- PyUnicode_AS_UNICODE and PyUnicode_AsUnicode,
- PyUnicode_GET_SIZE and PyUnicode_GetSize, and
- PyUnicode_FromUnicode.

When iterating over an existing string, or looking at specific
characters, use indexing operations rather than pointer arithmetic;
indexing works well for PyUnicode_READ(_CHAR) and PyUnicode_WRITE. Use
void* as the buffer type for characters to let the compiler detect
invalid dereferencing operations. If you do want to use pointer
arithmetic (e.g. when converting existing code), use (unsigned)
char* as the buffer type, and keep the element size (1, 2, or 4) in a
variable. Notice that (1<<(kind-1)) will produce the element size
given a buffer kind.
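
A sketch of this pointer-based style (walk_string is illustrative; the
string is assumed to already be in canonical form)::

  /* Sketch: walk a ready string with explicit pointer arithmetic; the
     element size (1, 2 or 4 bytes) is derived from the buffer kind. */
  static void
  walk_string(PyObject *s)
  {
      int kind = PyUnicode_KIND(s);
      unsigned char *p = PyUnicode_DATA(s);
      int elemsize = 1 << (kind - 1);
      Py_ssize_t n = PyUnicode_GET_LENGTH(s);
      for (Py_ssize_t i = 0; i < n; i++, p += elemsize) {
          Py_UCS4 ch = (elemsize == 1) ? *(Py_UCS1 *)p
                     : (elemsize == 2) ? *(Py_UCS2 *)p
                     : *(Py_UCS4 *)p;
          /* ... process ch ... */
          (void)ch;
      }
  }
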
When creating new strings, it was common in Python to start off with a
heuristic buffer size, and then grow or shrink if the heuristic
failed. With this PEP, this is now less practical, as you need not
only a heuristic for the length of the string, but also one for the
maximum character.

In order to avoid heuristics, you need to make two passes over the
input: once to determine the output length and the maximum character;
then allocate the target string with PyUnicode_New and iterate over
the input a second time to produce the final output. While this may
sound expensive, it could actually be cheaper than having to copy the
result again as in the following approach.
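
A sketch of this two-pass pattern, here for input that is already an
array of code points (unicode_from_ucs4_input is illustrative)::

  /* Sketch: the first pass finds the maximum character, the second pass
     fills the string allocated with PyUnicode_New. */
  static PyObject *
  unicode_from_ucs4_input(const Py_UCS4 *input, Py_ssize_t size)
  {
      Py_UCS4 maxchar = 0;
      for (Py_ssize_t i = 0; i < size; i++)
          if (input[i] > maxchar)
              maxchar = input[i];

      PyObject *result = PyUnicode_New(size, maxchar);
      if (result == NULL)
          return NULL;
      int kind = PyUnicode_KIND(result);
      void *data = PyUnicode_DATA(result);
      for (Py_ssize_t i = 0; i < size; i++)
          PyUnicode_WRITE(kind, data, i, input[i]);
      return result;
  }
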
If you take the heuristic route, avoid allocating a string meant to
be resized, as resizing strings won't work for their canonical
representation. Instead, allocate a separate buffer to collect the
characters, and then construct a unicode object from that using
PyUnicode_FromKindAndData. One option is to use Py_UCS4 as the buffer
element, assuming the worst case for character ordinals. This will
allow for pointer arithmetic, but may require a lot of memory.
Alternatively, start with a 1-byte buffer, and increase the element
size as you encounter larger characters. In any case,
PyUnicode_FromKindAndData will scan over the buffer to verify the
maximum character.
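
A sketch of the worst-case Py_UCS4 buffer variant (collect_chars is
illustrative; PyUnicode_FromKindAndData is assumed to take the kind, the
buffer and the number of characters)::

  /* Sketch: collect characters in a temporary UCS-4 buffer, then build
     the string in one step. */
  static PyObject *
  collect_chars(Py_ssize_t estimate)
  {
      Py_UCS4 *buf = PyMem_Malloc(estimate * sizeof(Py_UCS4));
      if (buf == NULL)
          return PyErr_NoMemory();
      Py_ssize_t n = 0;
      /* ... append code points to buf[n++], growing the buffer with
         PyMem_Realloc if the estimate turns out to be too small ... */
      PyObject *result = PyUnicode_FromKindAndData(PyUnicode_4BYTE_KIND,
                                                   buf, n);
      PyMem_Free(buf);
      return result;
  }
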
For common tasks, direct access to the string representation may not
be necessary: PyUnicode_Find, PyUnicode_FindChar, PyUnicode_Ord, and
PyUnicode_CopyCharacters help in analyzing and creating string
objects, operating on indices instead of data pointers.
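
For instance, copying a prefix of a string can be written entirely in
terms of indices (a sketch only; copy_prefix is illustrative, and the
exact signatures of PyUnicode_CopyCharacters and the
PyUnicode_MAX_CHAR_VALUE macro follow the implementation accompanying
this PEP rather than the text above)::

  /* Sketch: copy the first n characters of src (n must not exceed its
     length) into a new string sized for src's widest character. */
  static PyObject *
  copy_prefix(PyObject *src, Py_ssize_t n)
  {
      PyObject *dst = PyUnicode_New(n, PyUnicode_MAX_CHAR_VALUE(src));
      if (dst == NULL)
          return NULL;
      if (PyUnicode_CopyCharacters(dst, 0, src, 0, n) < 0) {
          Py_DECREF(dst);
          return NULL;
      }
      return dst;
  }
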

Copyright
=========

This document has been placed in the public domain.

..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End: