PEP: 393 Title: Flexible String Representation Version: $Revision$ Last-Modified: $Date$ Author: Martin v. Löwis Status: Accepted Type: Standards Track Content-Type: text/x-rst Created: 24-Jan-2010 Python-Version: 3.3 Post-History: Abstract ======== The Unicode string type is changed to support multiple internal representations, depending on the character with the largest Unicode ordinal (1, 2, or 4 bytes). This will allow a space-efficient representation in common cases, but give access to full UCS-4 on all systems. For compatibility with existing APIs, several representations may exist in parallel; over time, this compatibility should be phased out. The distinction between narrow and wide Unicode builds is dropped. An implementation of this PEP is available at [1]_. Rationale ========= There are two classes of complaints about the current implementation of the unicode type: on systems only supporting UTF-16, users complain that non-BMP characters are not properly supported. On systems using UCS-4 internally (and also sometimes on systems using UCS-2), there is a complaint that Unicode strings take up too much memory - especially compared to Python 2.x, where the same code would often use ASCII strings (i.e. ASCII-encoded byte strings). With the proposed approach, ASCII-only Unicode strings will again use only one byte per character; while still allowing efficient indexing of strings containing non-BMP characters (as strings containing them will use 4 bytes per character). One problem with the approach is support for existing applications (e.g. extension modules). For compatibility, redundant representations may be computed. Applications are encouraged to phase out reliance on a specific internal representation if possible. As interaction with other libraries will often require some sort of internal representation, the specification chooses UTF-8 as the recommended way of exposing strings to C code. For many strings (e.g. ASCII), multiple representations may actually share memory (e.g. the shortest form may be shared with the UTF-8 form if all characters are ASCII). With such sharing, the overhead of compatibility representations is reduced. If representations do share data, it is also possible to omit structure fields, reducing the base size of string objects. Specification ============= Unicode structures are now defined as a hierarchy of structures, namely:: typedef struct { PyObject_HEAD Py_ssize_t length; Py_hash_t hash; struct { unsigned int interned:2; unsigned int kind:2; unsigned int compact:1; unsigned int ascii:1; unsigned int ready:1; } state; wchar_t *wstr; } PyASCIIObject; typedef struct { PyASCIIObject _base; Py_ssize_t utf8_length; char *utf8; Py_ssize_t wstr_length; } PyCompactUnicodeObject; typedef struct { PyCompactUnicodeObject _base; union { void *any; Py_UCS1 *latin1; Py_UCS2 *ucs2; Py_UCS4 *ucs4; } data; } PyUnicodeObject; Objects for which both size and maximum character are known at creation time are called "compact" unicode objects; character data immediately follow the base structure. If the maximum character is less than 128, they use the PyASCIIObject structure, and the UTF-8 data, the UTF-8 length and the wstr length are the same as the length and the ASCII data. For non-ASCII strings, the PyCompactObject structure is used. Resizing compact objects is not supported. Objects for which the maximum character is not given at creation time are called "legacy" objects, created through PyUnicode_FromStringAndSize(NULL, length). They use the PyUnicodeObject structure. Initially, their data is only in the wstr pointer; when PyUnicode_READY is called, the data pointer (union) is allocated. Resizing is possible as long PyUnicode_READY has not been called. The fields have the following interpretations: - length: number of code points in the string (result of sq_length) - interned: interned-state (SSTATE_*) as in 3.2 - kind: form of string + 00 => str is not initialized (data are in wstr) + 01 => 1 byte (Latin-1) + 10 => 2 byte (UCS-2) + 11 => 4 byte (UCS-4); - compact: the object uses one of the compact representations (implies ready) - ascii: the object uses the PyASCIIObject representation (implies compact and ready) - ready: the canonical representation is ready to be accessed through PyUnicode_DATA and PyUnicode_GET_LENGTH. This is set either if the object is compact, or the data pointer and length have been initialized. - wstr_length, wstr: representation in platform's wchar_t (null-terminated). If wchar_t is 16-bit, this form may use surrogate pairs (in which cast wstr_length differs form length). wstr_length differs from length only if there are surrogate pairs in the representation. - utf8_length, utf8: UTF-8 representation (null-terminated). - data: shortest-form representation of the unicode string. The string is null-terminated (in its respective representation). All three representations are optional, although the data form is considered the canonical representation which can be absent only while the string is being created. If the representation is absent, the pointer is NULL, and the corresponding length field may contain arbitrary data. The Py_UNICODE type is still supported but deprecated. It is always defined as a typedef for wchar_t, so the wstr representation can double as Py_UNICODE representation. The data and utf8 pointers point to the same memory if the string uses only ASCII characters (using only Latin-1 is not sufficient). The data and wstr pointers point to the same memory if the string happens to fit exactly to the wchar_t type of the platform (i.e. uses some BMP-not-Latin-1 characters if sizeof(wchar_t) is 2, and uses some non-BMP characters if sizeof(wchar_t) is 4). String Creation --------------- The recommended way to create a Unicode object is to use the function PyUnicode_New:: PyObject* PyUnicode_New(Py_ssize_t size, Py_UCS4 maxchar); Both parameters must denote the eventual size/range of the strings. In particular, codecs using this API must compute both the number of characters and the maximum character in advance. An string is allocated according to the specified size and character range and is null-terminated; the actual characters in it may be uninitialized. PyUnicode_FromString and PyUnicode_FromStringAndSize remain supported for processing UTF-8 input; the input is decoded, and the UTF-8 representation is not yet set for the string. PyUnicode_FromUnicode remains supported but is deprecated. If the Py_UNICODE pointer is non-null, the data representation is set. If the pointer is NULL, a properly-sized wstr representation is allocated, which can be modified until PyUnicode_READY() is called (explicitly or implicitly). Resizing a Unicode string remains possible until it is finalized. PyUnicode_READY() converts a string containing only a wstr representation into the canonical representation. Unless wstr and data can share the memory, the wstr representation is discarded after the conversion. The macro returns 0 on success and -1 on failure, which happens in particular if the memory allocation fails. String Access ------------- The canonical representation can be accessed using two macros PyUnicode_Kind and PyUnicode_Data. PyUnicode_Kind gives one of the values PyUnicode_WCHAR_KIND (0), PyUnicode_1BYTE_KIND (1), PyUnicode_2BYTE_KIND (2), or PyUnicode_4BYTE_KIND (3). PyUnicode_DATA gives the void pointer to the data. Access to individual characters should use PyUnicode_{READ|WRITE}[_CHAR]: - PyUnciode_READ(kind, data, index) - PyUnicode_WRITE(kind, data, index, value) - PyUnicode_READ_CHAR(unicode, index) All these macros assume that the string is in canonical form; callers need to ensure this by calling PyUnicode_READY. A new function PyUnicode_AsUTF8 is provided to access the UTF-8 representation. It is thus identical to the existing _PyUnicode_AsString, which is removed. The function will compute the utf8 representation when first called. Since this representation will consume memory until the string object is released, applications should use the existing PyUnicode_AsUTF8String where possible (which generates a new string object every time). APIs that implicitly converts a string to a char* (such as the ParseTuple functions) will use PyUnicode_AsUTF8 to compute a conversion. New API ------- This section summarizes the API additions. Macros to access the internal representation of a Unicode object (read-only): - PyUnicode_IS_COMPACT_ASCII(o), PyUnicode_IS_COMPACT(o), PyUnicode_IS_READY(o) - PyUnicode_GET_LENGTH(o) - PyUnicode_KIND(o), PyUnicode_CHARACTER_SIZE(o), PyUnicode_MAX_CHAR_VALUE(o) - PyUnicode_DATA(o), PyUnicode_1BYTE_DATA(o), PyUnicode_2BYTE_DATA(o), PyUnicode_4BYTE_DATA(o) Character access macros: - PyUnicode_READ(kind, data, index), PyUnicode_READ_CHAR(o, index) - PyUnicode_WRITE(kind, data, index, value) Other macros: - PyUnicode_READY(o) - PyUnicode_CONVERT_BYTES(from_type, to_type, begin, end, to) String creation functions: - PyUnicode_New(size, maxchar) - PyUnicode_FromKindAndData(kind, data, size) - PyUnicode_Substring(o, start, end) Character access utility functions: - PyUnicode_GetLength(o), PyUnicode_ReadChar(o, index), PyUnicode_WriteChar(o, index, character) - PyUnicode_CopyCharacters(to, to_start, from, from_start, how_many) - PyUnicode_FindChar(str, ch, start, end, direction) Representation conversion: - PyUnicode_AsUCS4(o, buffer, buflen) - PyUnicode_AsUCS4Copy(o) - PyUnicode_AsUnicodeAndSize(o, size_out) - PyUnicode_AsUTF8(o) - PyUnicode_AsUTF8AndSize(o, size_out) UCS4 utility functions: - Py_UCS4_{strlen, strcpy, strcat, strncpy, strcmp, strncpy, strcmp, strncmp, strchr, strrchr} Stable ABI ---------- The following functions are added to the stable ABI (PEP 384), as they are independent of the actual representation of Unicode objects: PyUnicode_New, PyUnicode_Substring, PyUnicode_GetLength, PyUnicode_ReadChar, PyUnicode_WriteChar, PyUnicode_Find, PyUnicode_FindChar. GDB Debugging Hooks ------------------- Tools/gdb/libpython.py contains debugging hooks that embed knowledge about the internals of CPython's data types, include PyUnicodeObject instances. It has been updated to track the change. Deprecations, Removals, and Incompatibilities --------------------------------------------- While the Py_UNICODE representation and APIs are deprecated with this PEP, no removal of the respective APIs is scheduled. The APIs should remain available at least five years after the PEP is accepted; before they are removed, existing extension modules should be studied to find out whether a sufficient majority of the open-source code on PyPI has been ported to the new API. A reasonable motivation for using the deprecated API even in new code is for code that shall work both on Python 2 and Python 3. The following macros and functions are deprecated: - PyUnicode_FromUnicode - PyUnicode_GET_SIZE, PyUnicode_GetSize, PyUnicode_GET_DATA_SIZE, - PyUnicode_AS_UNICODE, PyUnicode_AsUnicode, PyUnicode_AsUnicodeAndSize - PyUnicode_COPY, PyUnicode_FILL, PyUnicode_MATCH - PyUnicode_Encode, PyUnicode_EncodeUTF7, PyUnicode_EncodeUTF8, PyUnicode_EncodeUTF16, PyUnicode_EncodeUTF32, PyUnicode_EncodeUnicodeEscape, PyUnicode_EncodeRawUnicodeEscape, PyUnicode_EncodeLatin1, PyUnicode_EncodeASCII, PyUnicode_EncodeCharmap, PyUnicode_TranslateCharmap, PyUnicode_EncodeMBCS, PyUnicode_EncodeDecimal, PyUnicode_TransformDecimalToASCII - Py_UNICODE_{strlen, strcat, strcpy, strcmp, strchr, strrchr} - PyUnicode_AsUnicodeCopy - PyUnicode_GetMax _PyUnicode_AsDefaultEncodedString is removed. It previously returned a borrowed reference to an UTF-8-encoded bytes object. Since the unicode object cannot anymore cache such a reference, implementing it without leaking memory is not possible. No deprecation phase is provided, since it was an API for internal use only. Extension modules using the legacy API may inadvertently call PyUnicode_READY, by calling some API that requires that the object is ready, and then continue accessing the (now invalid) Py_UNICODE pointer. Such code will break with this PEP. The code was already flawed in 3.2, as there is was no explicit guarantee that the PyUnicode_AS_UNICODE result would stay valid after an API call (due to the possibility of string resizing). Modules that face this issue need to re-fetch the Py_UNICODE pointer after API calls; doing so will continue to work correctly in earlier Python versions. Discussion ========== Several concerns have been raised about the approach presented here: It makes the implementation more complex. That's true, but considered worth it given the benefits. The Py_UNICODE representation is not instantaneously available, slowing down applications that request it. While this is also true, applications that care about this problem can be rewritten to use the data representation. Performance ----------- Performance of this patch must be considered for both memory consumption and runtime efficiency. For memory consumption, the expectation is that applications that have many large strings will see a reduction in memory usage. For small strings, the effects depend on the pointer size of the system, and the size of the Py_UNICODE/wchar_t type. The following table demonstrates this for various small ASCII and Latin-1 string sizes and platforms. +-------+---------------------------------+---------------------------------+ |string | Python 3.2 | This PEP | |size +----------------+----------------+----------------+----------------+ | | 16-bit wchar_t | 32-bit wchar_t | ASCII | Latin-1 | | +---------+------+--------+-------+--------+-------+--------+-------+ | | 32-bit |64-bit| 32-bit |64-bit | 32-bit |64-bit | 32-bit |64-bit | +-------+---------+------+--------+-------+--------+-------+--------+-------+ |1 | 32 | 64 | 40 | 64 | 32 | 56 | 40 | 80 | +-------+---------+------+--------+-------+--------+-------+--------+-------+ |2 | 40 | 64 | 40 | 72 | 32 | 56 | 40 | 80 | +-------+---------+------+--------+-------+--------+-------+--------+-------+ |3 | 40 | 64 | 48 | 72 | 32 | 56 | 40 | 80 | +-------+---------+------+--------+-------+--------+-------+--------+-------+ |4 | 40 | 72 | 48 | 80 | 32 | 56 | 48 | 80 | +-------+---------+------+--------+-------+--------+-------+--------+-------+ |5 | 40 | 72 | 56 | 80 | 32 | 56 | 48 | 80 | +-------+---------+------+--------+-------+--------+-------+--------+-------+ |6 | 48 | 72 | 56 | 88 | 32 | 56 | 48 | 80 | +-------+---------+------+--------+-------+--------+-------+--------+-------+ |7 | 48 | 72 | 64 | 88 | 32 | 56 | 48 | 80 | +-------+---------+------+--------+-------+--------+-------+--------+-------+ |8 | 48 | 80 | 64 | 96 | 40 | 64 | 48 | 88 | +-------+---------+------+--------+-------+--------+-------+--------+-------+ The runtime effect is significantly affected by the API being used. After porting the relevant pieces of code to the new API, the iobench, stringbench, and json benchmarks see typically slowdowns of 1% to 30%; for specific benchmarks, speedups may happen as may happen significantly larger slowdowns. Porting Guidelines ================== Only a small fraction of C code is affected by this PEP, namely code that needs to look "inside" unicode strings. That code doesn't necessarily need to be ported to this API, as the existing API will continue to work correctly. In particular, modules that need to support both Python 2 and Python 3 might get too complicated when simultaneously supporting this new API and the old Unicode API. In order to port modules to the new API, try to eliminate the use of these API elements: - the Py_UNICODE type, - PyUnicode_AS_UNICODE and PyUnicode_AsUnicode, - PyUnicode_GET_SIZE and PyUnicode_GetSize, and - PyUnicode_FromUnicode. When iterating over an existing string, or looking at specific characters, use indexing operations rather than pointer arithmetic; indexing works well for PyUnicode_READ(_CHAR) and PyUnicode_WRITE. Use void* as the buffer type for characters to let the compiler detect invalid dereferencing operations. If you do want to use pointer arithmetics (e.g. when converting existing code), use (unsigned) char* as the buffer type, and keep the element size (1, 2, or 4) in a variable. Notice that (1<<(kind-1)) will produce the element size given a buffer kind. When creating new strings, it was common in Python to start of with a heuristical buffer size, and then grow or shrink if the heuristics failed. With this PEP, this is now less practical, as you need not only a heuristics for the length of the string, but also for the maximum character. In order to avoid heuristics, you need to make two passes over the input: once to determine the output length, and the maximum character; then allocate the target string with PyUnicode_New and iterate over the input a second time to produce the final output. While this may sound expensive, it could actually be cheaper than having to copy the result again as in the following approach. If you take the heuristical route, avoid allocating a string meant to be resized, as resizing strings won't work for their canonical representation. Instead, allocate a separate buffer to collect the characters, and then construct a unicode object from that using PyUnicode_FromKindAndData. One option is to use Py_UCS4 as the buffer element, assuming for the worst case in character ordinals. This will allow for pointer arithmetics, but may require a lot of memory. Alternatively, start with a 1-byte buffer, and increase the element size as you encounter larger characters. In any case, PyUnicode_FromKindAndData will scan over the buffer to verify the maximum character. For common tasks, direct access to the string representation may not be necessary: PyUnicode_Find, PyUnicode_FindChar, PyUnicode_Ord, and PyUnicode_CopyCharacters help in analyzing and creating string objects, operating on indexes instead of data pointers. References ========== .. [1] PEP 393 branch https://bitbucket.org/t0rsten/pep-393 Copyright ========= This document has been placed in the public domain. .. Local Variables: mode: indented-text indent-tabs-mode: nil sentence-end-double-space: t fill-column: 70 coding: utf-8 End: