PEP: 393
Title: Flexible String Representation
Version: $Revision$
Last-Modified: $Date$
Author: Martin v. Löwis <martin@v.loewis.de>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 24-Jan-2010
Python-Version: 3.3
Post-History:

Abstract
========

The Unicode string type is changed to support multiple internal
representations, depending on the character with the largest Unicode
ordinal (1, 2, or 4 bytes). This will allow a space-efficient
representation in common cases, but give access to full UCS-4 on all
systems. For compatibility with existing APIs, several representations
may exist in parallel; over time, this compatibility should be phased
out.

Rationale
=========

There are two classes of complaints about the current implementation
of the unicode type: on systems only supporting UTF-16, users complain
that non-BMP characters are not properly supported. On systems using
UCS-4 internally (and also sometimes on systems using UCS-2), there is
a complaint that Unicode strings take up too much memory - especially
compared to Python 2.x, where the same code would often use ASCII
strings (i.e. ASCII-encoded byte strings). With the proposed approach,
ASCII-only Unicode strings will again use only one byte per character,
while still allowing efficient indexing of strings containing non-BMP
characters (as strings containing them will use 4 bytes per
character).

One problem with the approach is support for existing applications
(e.g. extension modules). For compatibility, redundant representations
may be computed. Applications are encouraged to phase out reliance on
a specific internal representation if possible. As interaction with
other libraries will often require some sort of internal
representation, the specification chooses UTF-8 as the recommended way
of exposing strings to C code.

For many strings (e.g. ASCII), multiple representations may actually
share memory (e.g. the shortest form may be shared with the UTF-8 form
if all characters are ASCII). With such sharing, the overhead of
compatibility representations is reduced.

Specification
=============

The Unicode object structure is changed to this definition::

  typedef struct {
    PyObject_HEAD
    Py_ssize_t length;
    void *str;
    Py_hash_t hash;
    int state;
    Py_ssize_t utf8_length;
    void *utf8;
    Py_ssize_t wstr_length;
    void *wstr;
  } PyUnicodeObject;

These fields have the following interpretations:

- length: number of code points in the string (result of sq_length)
- str: shortest-form representation of the unicode string.
  The string is null-terminated (in its respective representation).
- hash: same as in Python 3.2
- state:

  * lowest 2 bits (mask 0x03) - interned-state (SSTATE_*) as in 3.2
  * next 2 bits (mask 0x0C) - form of str:

    + 00 => reserved
    + 01 => 1 byte (Latin-1)
    + 10 => 2 byte (UCS-2)
    + 11 => 4 byte (UCS-4)

  * next bit (mask 0x10): 1 if str memory follows PyUnicodeObject

- utf8_length, utf8: UTF-8 representation (null-terminated)
- wstr_length, wstr: representation in platform's wchar_t
  (null-terminated). If wchar_t is 16-bit, this form may use surrogate
  pairs (in which case wstr_length differs from length).
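
As an illustration of the state layout described above, the following
sketch packs and unpacks the three bit fields. The macro and function
names are hypothetical; this PEP specifies only the masks and their
meaning, not any accessor names.

```c
#include <assert.h>

/* Hypothetical names; only the mask values come from the field list. */
#define STATE_INTERNED_MASK 0x03  /* lowest 2 bits: interned state */
#define STATE_KIND_MASK     0x0C  /* next 2 bits: form of str */
#define STATE_KIND_SHIFT    2
#define STATE_INLINE_MASK   0x10  /* 1 if str memory follows the object */

/* Combine the three fields into one state value. */
static int state_pack(int interned, int kind, int is_inline) {
    assert(interned >= 0 && interned <= 3);
    assert(kind >= 1 && kind <= 3);      /* 0 is reserved */
    return interned | (kind << STATE_KIND_SHIFT)
                    | (is_inline ? STATE_INLINE_MASK : 0);
}

/* Interned state (an SSTATE_* value), taken from bits 0-1. */
static int state_interned(int state) {
    return state & STATE_INTERNED_MASK;
}

/* Form of str from bits 2-3: 1 Latin-1, 2 UCS-2, 3 UCS-4, 0 reserved. */
static int state_kind(int state) {
    return (state & STATE_KIND_MASK) >> STATE_KIND_SHIFT;
}

/* Bytes per character implied by the kind bits; 0 for the reserved value. */
static int state_char_size(int state) {
    switch (state_kind(state)) {
    case 1: return 1;
    case 2: return 2;
    case 3: return 4;
    default: return 0;
    }
}

/* Nonzero if the str memory directly follows the object header. */
static int state_is_inline(int state) {
    return (state & STATE_INLINE_MASK) != 0;
}
```

For example, an interned-state value of 1 with a UCS-2 str stored
inline packs to 0x19 under this layout.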

All three representations are optional, although the str form is
considered the canonical representation, which can be absent only
while the string is being created. If a representation is absent,
the pointer is NULL, and the corresponding length field may contain
arbitrary data.

The Py_UNICODE type is still supported but deprecated. It is always
defined as a typedef for wchar_t, so the wstr representation can double
as Py_UNICODE representation.

The str and utf8 pointers point to the same memory if the string uses
only ASCII characters (using only Latin-1 is not sufficient). The str
and wstr pointers point to the same memory if the string happens to
fit exactly into the wchar_t type of the platform (i.e. uses some
BMP-not-Latin-1 characters if sizeof(wchar_t) is 2, and uses some
non-BMP characters if sizeof(wchar_t) is 4).
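
The sharing conditions above can be expressed as two predicates on the
largest character in the string. This is only a sketch: the function
names are hypothetical, and the wchar_t size is passed as a parameter
(rather than taken from sizeof(wchar_t)) so that both platform variants
can be exercised.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes per character of the shortest form able to hold max_char. */
static int shortest_width(uint32_t max_char) {
    if (max_char <= 0xFF) return 1;    /* Latin-1 */
    if (max_char <= 0xFFFF) return 2;  /* UCS-2 */
    return 4;                          /* UCS-4 */
}

/* str and utf8 can share memory only for pure-ASCII strings:
   Latin-1 characters above 0x7F take two bytes in UTF-8. */
static int can_share_utf8(uint32_t max_char) {
    return max_char < 0x80;
}

/* str and wstr can share memory when the shortest form has exactly
   the width of the platform's wchar_t. */
static int can_share_wstr(uint32_t max_char, size_t wchar_size) {
    assert(wchar_size == 2 || wchar_size == 4);
    return (size_t)shortest_width(max_char) == wchar_size;
}
```

Under these assumptions, a string whose largest character is U+20AC
shares str and wstr on a platform with a 2-byte wchar_t, but not on one
with a 4-byte wchar_t.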

If the string is created directly with the canonical representation
(see below), this representation doesn't take a separate memory block,
but is allocated right after the PyUnicodeObject struct.

String Creation
---------------

The recommended way to create a Unicode object is to use the function
PyUnicode_New::

  PyObject* PyUnicode_New(Py_ssize_t size, Py_UCS4 maxchar);

Both parameters must denote the eventual size/range of the string.
In particular, codecs using this API must compute both the number of
characters and the maximum character in advance. A string is
allocated according to the specified size and character range and is
null-terminated; the actual characters in it may be uninitialized.
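
The requirement to know both values in advance implies a two-pass
approach for codecs: scan the decoded input once to find the maximum
character, then allocate. The sketch below shows only the scanning and
sizing step on an already-decoded array of code points; the helper
names are hypothetical and no Python API is called.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* First pass: find the largest code point in the decoded input. */
static uint32_t scan_max_char(const uint32_t *chars, size_t count) {
    uint32_t max_char = 0;
    for (size_t i = 0; i < count; i++) {
        if (chars[i] > max_char)
            max_char = chars[i];
    }
    return max_char;
}

/* Bytes per character of the narrowest form that can hold max_char. */
static size_t char_width(uint32_t max_char) {
    assert(max_char <= 0x10FFFF);      /* upper bound of Unicode */
    if (max_char <= 0xFF) return 1;    /* Latin-1 */
    if (max_char <= 0xFFFF) return 2;  /* UCS-2 */
    return 4;                          /* UCS-4 */
}

/* Bytes needed for the character data, including the null terminator. */
static size_t data_bytes(size_t count, uint32_t max_char) {
    return (count + 1) * char_width(max_char);
}
```

A codec would pass the resulting count and maximum character to
PyUnicode_New and then fill in the characters in a second pass.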

PyUnicode_FromString and PyUnicode_FromStringAndSize remain supported
for processing UTF-8 input; the input is decoded, and the UTF-8
representation is not yet set for the string.

PyUnicode_FromUnicode remains supported but is deprecated. If the
Py_UNICODE pointer is non-null, the str representation is set. If the
pointer is NULL, a properly-sized wstr representation is allocated,
which can be modified until PyUnicode_Ready() is called (explicitly
or implicitly). Resizing a Unicode string remains possible until it
is finalized.

PyUnicode_Ready() converts a string containing only a wstr
representation into the canonical representation. Unless wstr and str
can share the memory, the wstr representation is discarded after the
conversion.

String Access
-------------

The canonical representation can be accessed using two macros,
PyUnicode_Kind and PyUnicode_Data. PyUnicode_Kind gives one of the
values PyUnicode_1BYTE (1), PyUnicode_2BYTE (2), or PyUnicode_4BYTE
(3). PyUnicode_Data gives the void pointer to the data, masking out
the pointer kind. Both macros call PyUnicode_Ready
in case the canonical representation hasn't been computed yet.
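
Given a kind and a data pointer, indexing a character amounts to
casting the pointer to the matching element type. The following sketch
assumes the kind values 1, 2, and 3 listed above; the enum and function
names are hypothetical and stand in for whatever accessors the final
API provides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Kind values as listed in the text; the names are illustrative. */
enum kind { KIND_1BYTE = 1, KIND_2BYTE = 2, KIND_4BYTE = 3 };

/* Read the code point at `index` from a canonical representation. */
static uint32_t read_char(int kind, const void *data, size_t index) {
    switch (kind) {
    case KIND_1BYTE:
        return ((const uint8_t *)data)[index];
    case KIND_2BYTE:
        return ((const uint16_t *)data)[index];
    case KIND_4BYTE:
        return ((const uint32_t *)data)[index];
    default:
        assert(!"invalid kind");  /* reserved or corrupt kind value */
        return 0;
    }
}
```

Because every character in a given string has the same width, the
lookup is a constant-time array access regardless of the kind.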

A new function PyUnicode_AsUTF8 is provided to access the UTF-8
representation. It is thus identical to the existing
_PyUnicode_AsString, which is removed. The function will compute the
utf8 representation when first called. Since this representation will
consume memory until the string object is released, applications
should use the existing PyUnicode_AsUTF8String where possible
(which generates a new string object every time). APIs that implicitly
convert a string to a char* (such as the ParseTuple functions) will
use PyUnicode_AsUTF8 to compute a conversion.

PyUnicode_AsUnicode is deprecated; it computes the wstr representation
on first use.

String Operations
-----------------

Various convenience functions will be provided to deal with the
canonical representation, in particular with respect to concatenation
and slicing.

Stable ABI
----------

None of the functions in this PEP become part of the stable ABI.

GDB Debugging Hooks
-------------------

Tools/gdb/libpython.py contains debugging hooks that embed knowledge
about the internals of CPython's data types, including PyUnicodeObject
instances. It will need to be slightly updated to track the change.

Discussion
==========

Several concerns have been raised about the approach presented here:

It makes the implementation more complex. That's true, but considered
worth it given the gains.

The Py_UNICODE representation is not instantaneously available,
slowing down applications that request it. While this is also true,
applications that care about this problem can be rewritten to use the
str representation.

Copyright
=========

This document has been placed in the public domain.


..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End: