PEP: 393
Title: Flexible String Representation
Version: $Revision$
Last-Modified: $Date$
Author: Martin v. Löwis <martin@v.loewis.de>
Status: Accepted
Type: Standards Track
Content-Type: text/x-rst
Created: 24-Jan-2010
Python-Version: 3.3
Post-History:

Abstract
========

The Unicode string type is changed to support multiple internal
representations, depending on the character with the largest Unicode
ordinal (1, 2, or 4 bytes). This will allow a space-efficient
representation in common cases, but give access to full UCS-4 on all
systems. For compatibility with existing APIs, several representations
may exist in parallel; over time, this compatibility should be phased
out. The distinction between narrow and wide Unicode builds is
dropped. An implementation of this PEP is available at [1]_.

Rationale
=========

There are two classes of complaints about the current implementation
of the unicode type: on systems only supporting UTF-16, users complain
that non-BMP characters are not properly supported. On systems using
UCS-4 internally (and also sometimes on systems using UCS-2), there is
a complaint that Unicode strings take up too much memory - especially
compared to Python 2.x, where the same code would often use ASCII
strings (i.e. ASCII-encoded byte strings). With the proposed approach,
ASCII-only Unicode strings will again use only one byte per character,
while still allowing efficient indexing of strings containing non-BMP
characters (as strings containing them will use 4 bytes per
character).

One problem with the approach is support for existing applications
(e.g. extension modules). For compatibility, redundant representations
may be computed. Applications are encouraged to phase out reliance on
a specific internal representation if possible. As interaction with
other libraries will often require some sort of internal
representation, the specification chooses UTF-8 as the recommended way
of exposing strings to C code.

For many strings (e.g. ASCII), multiple representations may actually
share memory (e.g. the shortest form may be shared with the UTF-8 form
if all characters are ASCII). With such sharing, the overhead of
compatibility representations is reduced. If representations do share
data, it is also possible to omit structure fields, reducing the base
size of string objects.

Specification
=============

Unicode structures are now defined as a hierarchy of structures,
namely::

  typedef struct {
    PyObject_HEAD
    Py_ssize_t length;
    Py_hash_t hash;
    struct {
        unsigned int interned:2;
        unsigned int kind:2;
        unsigned int compact:1;
        unsigned int ascii:1;
        unsigned int ready:1;
    } state;
    wchar_t *wstr;
  } PyASCIIObject;

  typedef struct {
    PyASCIIObject _base;
    Py_ssize_t utf8_length;
    char *utf8;
    Py_ssize_t wstr_length;
  } PyCompactUnicodeObject;

  typedef struct {
    PyCompactUnicodeObject _base;
    union {
        void *any;
        Py_UCS1 *latin1;
        Py_UCS2 *ucs2;
        Py_UCS4 *ucs4;
    } data;
  } PyUnicodeObject;

Objects for which both size and maximum character are known at
creation time are called "compact" unicode objects; character data
immediately follow the base structure. If the maximum character is
less than 128, they use the PyASCIIObject structure; the UTF-8 data
is then the same as the ASCII data, and the UTF-8 length and the wstr
length are the same as the length. For non-ASCII strings, the
PyCompactUnicodeObject structure is used. Resizing compact objects is
not supported.
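
The choice of structure and character width for compact objects can be
modelled in plain C. The following is a minimal sketch, not the
CPython implementation: the layout_t tags and all three function names
are hypothetical, and the thresholds (128, 256, 65536) are taken from
the ASCII, Latin-1 and UCS-2 ranges named in this section.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout tags; these names are not part of the PEP. */
typedef enum {
    LAYOUT_ASCII,     /* PyASCIIObject header, 1 byte per character */
    LAYOUT_COMPACT_1, /* PyCompactUnicodeObject header, 1 byte (Latin-1) */
    LAYOUT_COMPACT_2, /* PyCompactUnicodeObject header, 2 bytes (UCS-2) */
    LAYOUT_COMPACT_4  /* PyCompactUnicodeObject header, 4 bytes (UCS-4) */
} layout_t;

/* Pick the layout from the maximum character known at creation time. */
layout_t layout_for_maxchar(uint32_t maxchar)
{
    assert(maxchar <= 0x10FFFF);
    if (maxchar < 128)
        return LAYOUT_ASCII;
    if (maxchar < 256)
        return LAYOUT_COMPACT_1;
    if (maxchar < 65536)
        return LAYOUT_COMPACT_2;
    return LAYOUT_COMPACT_4;
}

/* Bytes used per character in the given layout. */
size_t char_size_for_layout(layout_t layout)
{
    switch (layout) {
    case LAYOUT_COMPACT_2: return 2;
    case LAYOUT_COMPACT_4: return 4;
    default:               return 1;
    }
}

/* Bytes of character data following the header, including the
   null terminator in the string's own character width. */
size_t data_bytes(size_t length, uint32_t maxchar)
{
    return (length + 1) * char_size_for_layout(layout_for_maxchar(maxchar));
}
```

Under this model, a three-character string whose maximum character is
U+1F600 needs (3 + 1) * 4 = 16 bytes of character data, while a
three-character ASCII string needs 4.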

Objects for which the maximum character is not given at creation time
are called "legacy" objects, created through
PyUnicode_FromStringAndSize(NULL, length). They use the
PyUnicodeObject structure. Initially, their data is only in the wstr
pointer; when PyUnicode_READY is called, the data pointer (union) is
allocated. Resizing is possible as long as PyUnicode_READY has not
been called.

The fields have the following interpretations:

- length: number of code points in the string (result of sq_length)
- interned: interned-state (SSTATE_*) as in 3.2
- kind: form of string

  + 00 => str is not initialized (data are in wstr)
  + 01 => 1 byte (Latin-1)
  + 10 => 2 byte (UCS-2)
  + 11 => 4 byte (UCS-4)

- compact: the object uses one of the compact representations
  (implies ready)
- ascii: the object uses the PyASCIIObject representation
  (implies compact and ready)
- ready: the canonical representation is ready to be accessed through
  PyUnicode_DATA and PyUnicode_GET_LENGTH. This is set either if the
  object is compact, or the data pointer and length have been
  initialized.
- wstr_length, wstr: representation in platform's wchar_t
  (null-terminated). If wchar_t is 16-bit, this form may use surrogate
  pairs (in which case wstr_length differs from length).
  wstr_length differs from length only if there are surrogate pairs
  in the representation.
- utf8_length, utf8: UTF-8 representation (null-terminated).
- data: shortest-form representation of the unicode string.
  The string is null-terminated (in its respective representation).
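
To illustrate when wstr_length differs from length, the sketch below
counts the wchar_t units a string would need on a platform where
wchar_t is 16 bits wide. It is a stand-alone model; utf16_units is a
hypothetical helper, not a function proposed by this PEP.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of 16-bit units needed to store `length` code points.
   Each non-BMP character (above U+FFFF) takes a surrogate pair. */
size_t utf16_units(const uint32_t *chars, size_t length)
{
    size_t units = 0;
    size_t i;

    for (i = 0; i < length; i++) {
        assert(chars[i] <= 0x10FFFF);
        units += (chars[i] > 0xFFFF) ? 2 : 1;
    }
    return units;
}
```

For the three code points U+0041, U+20AC and U+1F600 the function
returns 4, so in this model length is 3 while wstr_length is 4.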

All three representations are optional, although the data form is
considered the canonical representation which can be absent only
while the string is being created. If a representation is absent,
its pointer is NULL, and the corresponding length field may contain
arbitrary data.

The Py_UNICODE type is still supported but deprecated. It is always
defined as a typedef for wchar_t, so the wstr representation can double
as Py_UNICODE representation.

The data and utf8 pointers point to the same memory if the string uses
only ASCII characters (using only Latin-1 is not sufficient). The data
and wstr pointers point to the same memory if the string happens to
fit exactly into the wchar_t type of the platform (i.e. uses some
BMP-not-Latin-1 characters if sizeof(wchar_t) is 2, and uses some
non-BMP characters if sizeof(wchar_t) is 4).
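
The two sharing conditions can be written as predicates over the
maximum character. This is an illustrative sketch only; the function
names are hypothetical, and wchar_size stands for sizeof(wchar_t) on
the platform in question.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Width in bytes of the shortest form for a given maximum character. */
size_t shortest_width(uint32_t maxchar)
{
    if (maxchar < 256)
        return 1;
    if (maxchar < 65536)
        return 2;
    return 4;
}

/* data and utf8 can share memory only for pure ASCII strings. */
bool data_can_share_utf8(uint32_t maxchar)
{
    return maxchar < 128;
}

/* data and wstr can share memory only if the shortest form has
   exactly the width of wchar_t. */
bool data_can_share_wstr(uint32_t maxchar, size_t wchar_size)
{
    assert(wchar_size == 2 || wchar_size == 4);
    return shortest_width(maxchar) == wchar_size;
}
```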

String Creation
---------------

The recommended way to create a Unicode object is to use the function
PyUnicode_New::

  PyObject* PyUnicode_New(Py_ssize_t size, Py_UCS4 maxchar);

Both parameters must denote the eventual size/range of the string.
In particular, codecs using this API must compute both the number of
characters and the maximum character in advance. A string is
allocated according to the specified size and character range and is
null-terminated; the actual characters in it may be uninitialized.

PyUnicode_FromString and PyUnicode_FromStringAndSize remain supported
for processing UTF-8 input; the input is decoded, and the UTF-8
representation is not yet set for the string.

PyUnicode_FromUnicode remains supported but is deprecated. If the
Py_UNICODE pointer is non-null, the data representation is set. If the
pointer is NULL, a properly-sized wstr representation is allocated,
which can be modified until PyUnicode_READY() is called (explicitly
or implicitly). Resizing a Unicode string remains possible until it
is finalized.

PyUnicode_READY() converts a string containing only a wstr
representation into the canonical representation. Unless wstr and data
can share the memory, the wstr representation is discarded after the
conversion. The macro returns 0 on success and -1 on failure, which
happens in particular if the memory allocation fails.

String Access
-------------

The canonical representation can be accessed using two macros,
PyUnicode_KIND and PyUnicode_DATA. PyUnicode_KIND gives one of the
values PyUnicode_WCHAR_KIND (0), PyUnicode_1BYTE_KIND (1),
PyUnicode_2BYTE_KIND (2), or PyUnicode_4BYTE_KIND (3). PyUnicode_DATA
gives the void pointer to the data. Access to individual characters
should use PyUnicode_{READ|WRITE}[_CHAR]:

- PyUnicode_READ(kind, data, index)
- PyUnicode_WRITE(kind, data, index, value)
- PyUnicode_READ_CHAR(unicode, index)

All these macros assume that the string is in canonical form;
callers need to ensure this by calling PyUnicode_READY.
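
The kind-based dispatch behind these macros can be pictured as
follows. The sketch is a plain C model that assumes the kind values
listed above; model_read and model_write are hypothetical stand-ins
and make no claim about how the real macros are defined.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Kind values as listed above; the enumerator names are hypothetical. */
enum { KIND_1BYTE = 1, KIND_2BYTE = 2, KIND_4BYTE = 3 };

/* Read the character at `index` from a buffer of the given kind. */
uint32_t model_read(int kind, const void *data, size_t index)
{
    switch (kind) {
    case KIND_1BYTE: return ((const uint8_t *)data)[index];
    case KIND_2BYTE: return ((const uint16_t *)data)[index];
    case KIND_4BYTE: return ((const uint32_t *)data)[index];
    default:
        /* kind 0 means the string is not in canonical form yet */
        assert(0 && "string is not ready");
        return 0;
    }
}

/* Store `value` at `index`; the value must fit the buffer's width. */
void model_write(int kind, void *data, size_t index, uint32_t value)
{
    switch (kind) {
    case KIND_1BYTE:
        assert(value <= 0xFF);
        ((uint8_t *)data)[index] = (uint8_t)value;
        break;
    case KIND_2BYTE:
        assert(value <= 0xFFFF);
        ((uint16_t *)data)[index] = (uint16_t)value;
        break;
    case KIND_4BYTE:
        ((uint32_t *)data)[index] = value;
        break;
    default:
        assert(0 && "string is not ready");
    }
}
```

Because the caller passes kind and data explicitly, both can be
fetched once before a loop and reused for every index.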

A new function PyUnicode_AsUTF8 is provided to access the UTF-8
representation. It is thus identical to the existing
_PyUnicode_AsString, which is removed. The function will compute the
utf8 representation when first called. Since this representation will
consume memory until the string object is released, applications
should use the existing PyUnicode_AsUTF8String where possible
(which generates a new string object every time). APIs that implicitly
convert a string to a char* (such as the ParseTuple functions) will
use PyUnicode_AsUTF8 to compute a conversion.
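
The caching behaviour and its memory cost can be modelled in a few
lines. The sketch below is an illustration under simplifying
assumptions, not the CPython code: lazy_string, encode_utf8,
lazy_as_utf8 and lazy_release are hypothetical names, and the
canonical data is simplified to a UCS-4 array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    const uint32_t *chars; /* canonical data, modelled here as UCS-4 */
    size_t length;         /* number of code points */
    char *utf8;            /* NULL until first requested */
    size_t utf8_length;    /* valid only once utf8 is set */
} lazy_string;

/* Encode one code point as UTF-8; returns the bytes written (1..4). */
static size_t encode_utf8(uint32_t c, char *out)
{
    assert(c <= 0x10FFFF);
    if (c < 0x80) {
        out[0] = (char)c;
        return 1;
    }
    if (c < 0x800) {
        out[0] = (char)(0xC0 | (c >> 6));
        out[1] = (char)(0x80 | (c & 0x3F));
        return 2;
    }
    if (c < 0x10000) {
        out[0] = (char)(0xE0 | (c >> 12));
        out[1] = (char)(0x80 | ((c >> 6) & 0x3F));
        out[2] = (char)(0x80 | (c & 0x3F));
        return 3;
    }
    out[0] = (char)(0xF0 | (c >> 18));
    out[1] = (char)(0x80 | ((c >> 12) & 0x3F));
    out[2] = (char)(0x80 | ((c >> 6) & 0x3F));
    out[3] = (char)(0x80 | (c & 0x3F));
    return 4;
}

/* Compute the UTF-8 form on the first call and reuse it afterwards.
   The buffer stays allocated until lazy_release is called. */
const char *lazy_as_utf8(lazy_string *s)
{
    size_t i, n = 0;

    if (s->utf8 != NULL)
        return s->utf8;                  /* already cached */
    s->utf8 = malloc(4 * s->length + 1); /* worst case plus terminator */
    if (s->utf8 == NULL)
        return NULL;
    for (i = 0; i < s->length; i++)
        n += encode_utf8(s->chars[i], s->utf8 + n);
    s->utf8[n] = '\0';
    s->utf8_length = n;
    return s->utf8;
}

/* Models releasing the string object: the cached form goes with it. */
void lazy_release(lazy_string *s)
{
    free(s->utf8);
    s->utf8 = NULL;
}
```

The second call returns the same pointer without re-encoding, which
is convenient for callers but keeps the extra buffer alive for as
long as the string lives.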

New API
-------

This section summarizes the API additions.

Macros to access the internal representation of a Unicode object
(read-only):

- PyUnicode_IS_COMPACT_ASCII(o), PyUnicode_IS_COMPACT(o),
  PyUnicode_IS_READY(o)
- PyUnicode_GET_LENGTH(o)
- PyUnicode_KIND(o), PyUnicode_CHARACTER_SIZE(o),
  PyUnicode_MAX_CHAR_VALUE(o)
- PyUnicode_DATA(o), PyUnicode_1BYTE_DATA(o), PyUnicode_2BYTE_DATA(o),
  PyUnicode_4BYTE_DATA(o)

Character access macros:

- PyUnicode_READ(kind, data, index), PyUnicode_READ_CHAR(o, index)
- PyUnicode_WRITE(kind, data, index, value)

Other macros:

- PyUnicode_READY(o)
- PyUnicode_CONVERT_BYTES(from_type, to_type, begin, end, to)

String creation functions:

- PyUnicode_New(size, maxchar)
- PyUnicode_FromKindAndData(kind, data, size)
- PyUnicode_Substring(o, start, end)

Character access utility functions:

- PyUnicode_GetLength(o), PyUnicode_ReadChar(o, index),
  PyUnicode_WriteChar(o, index, character)
- PyUnicode_CopyCharacters(to, to_start, from, from_start, how_many)
- PyUnicode_FindChar(str, ch, start, end, direction)

Representation conversion:

- PyUnicode_AsUCS4(o, buffer, buflen)
- PyUnicode_AsUCS4Copy(o)
- PyUnicode_AsUnicodeAndSize(o, size_out)
- PyUnicode_AsUTF8(o)
- PyUnicode_AsUTF8AndSize(o, size_out)

UCS4 utility functions:

- Py_UCS4_{strlen, strcpy, strcat, strncpy, strcmp, strncmp,
  strchr, strrchr}

Stable ABI
----------

The following functions are added to the stable ABI (PEP 384), as they
are independent of the actual representation of Unicode objects:
PyUnicode_New, PyUnicode_Substring, PyUnicode_GetLength,
PyUnicode_ReadChar, PyUnicode_WriteChar, PyUnicode_Find,
PyUnicode_FindChar.

GDB Debugging Hooks
-------------------

Tools/gdb/libpython.py contains debugging hooks that embed knowledge
about the internals of CPython's data types, including PyUnicodeObject
instances. It has been updated to track the change.

Deprecations, Removals, and Incompatibilities
---------------------------------------------

While the Py_UNICODE representation and APIs are deprecated with this
PEP, no removal of the respective APIs is scheduled. The APIs should
remain available at least five years after the PEP is accepted; before
they are removed, existing extension modules should be studied to find
out whether a sufficient majority of the open-source code on PyPI has
been ported to the new API. A reasonable motivation for using the
deprecated API even in new code is the need for that code to work on
both Python 2 and Python 3.

The following macros and functions are deprecated:

- PyUnicode_FromUnicode
- PyUnicode_GET_SIZE, PyUnicode_GetSize, PyUnicode_GET_DATA_SIZE
- PyUnicode_AS_UNICODE, PyUnicode_AsUnicode, PyUnicode_AsUnicodeAndSize
- PyUnicode_COPY, PyUnicode_FILL, PyUnicode_MATCH
- PyUnicode_Encode, PyUnicode_EncodeUTF7, PyUnicode_EncodeUTF8,
  PyUnicode_EncodeUTF16, PyUnicode_EncodeUTF32,
  PyUnicode_EncodeUnicodeEscape, PyUnicode_EncodeRawUnicodeEscape,
  PyUnicode_EncodeLatin1, PyUnicode_EncodeASCII,
  PyUnicode_EncodeCharmap, PyUnicode_TranslateCharmap,
  PyUnicode_EncodeMBCS, PyUnicode_EncodeDecimal,
  PyUnicode_TransformDecimalToASCII
- Py_UNICODE_{strlen, strcat, strcpy, strcmp, strchr, strrchr}
- PyUnicode_AsUnicodeCopy
- PyUnicode_GetMax

_PyUnicode_AsDefaultEncodedString is removed. It previously returned a
borrowed reference to a UTF-8-encoded bytes object. Since the unicode
object can no longer cache such a reference, implementing it without
leaking memory is not possible. No deprecation phase is provided,
since it was an API for internal use only.

Extension modules using the legacy API may inadvertently call
PyUnicode_READY, by calling some API that requires that the object is
ready, and then continue accessing the (now invalid) Py_UNICODE
pointer. Such code will break with this PEP. The code was already
flawed in 3.2, as there was no explicit guarantee that the
PyUnicode_AS_UNICODE result would stay valid after an API call (due to
the possibility of string resizing). Modules that face this issue
need to re-fetch the Py_UNICODE pointer after API calls; doing
so will continue to work correctly in earlier Python versions.

Discussion
==========

Several concerns have been raised about the approach presented here:

It makes the implementation more complex. That's true, but considered
worth it given the benefits.

The Py_UNICODE representation is not instantaneously available,
slowing down applications that request it. While this is also true,
applications that care about this problem can be rewritten to use the
data representation.

Performance
-----------

Performance of this patch must be considered for both memory
consumption and runtime efficiency. For memory consumption, the
expectation is that applications that have many large strings will see
a reduction in memory usage. For small strings, the effects depend on
the pointer size of the system, and the size of the Py_UNICODE/wchar_t
type. The following table demonstrates this for various small ASCII
and Latin-1 string sizes and platforms.

+-------+---------------------------------+---------------------------------+
|string | Python 3.2                      | This PEP                        |
|size   +----------------+----------------+----------------+----------------+
|       | 16-bit wchar_t | 32-bit wchar_t | ASCII          | Latin-1        |
|       +---------+------+--------+-------+--------+-------+--------+-------+
|       | 32-bit  |64-bit| 32-bit |64-bit | 32-bit |64-bit | 32-bit |64-bit |
+-------+---------+------+--------+-------+--------+-------+--------+-------+
|1      | 32      | 64   | 40     | 64    | 32     | 56    | 40     | 80    |
+-------+---------+------+--------+-------+--------+-------+--------+-------+
|2      | 40      | 64   | 40     | 72    | 32     | 56    | 40     | 80    |
+-------+---------+------+--------+-------+--------+-------+--------+-------+
|3      | 40      | 64   | 48     | 72    | 32     | 56    | 40     | 80    |
+-------+---------+------+--------+-------+--------+-------+--------+-------+
|4      | 40      | 72   | 48     | 80    | 32     | 56    | 48     | 80    |
+-------+---------+------+--------+-------+--------+-------+--------+-------+
|5      | 40      | 72   | 56     | 80    | 32     | 56    | 48     | 80    |
+-------+---------+------+--------+-------+--------+-------+--------+-------+
|6      | 48      | 72   | 56     | 88    | 32     | 56    | 48     | 80    |
+-------+---------+------+--------+-------+--------+-------+--------+-------+
|7      | 48      | 72   | 64     | 88    | 32     | 56    | 48     | 80    |
+-------+---------+------+--------+-------+--------+-------+--------+-------+
|8      | 48      | 80   | 64     | 96    | 40     | 64    | 48     | 88    |
+-------+---------+------+--------+-------+--------+-------+--------+-------+

The runtime effect is significantly affected by the API being
used. After porting the relevant pieces of code to the new API,
the iobench, stringbench, and json benchmarks typically see
slowdowns of 1% to 30%; for specific benchmarks, speedups may
occur, as may significantly larger slowdowns.

Porting Guidelines
==================

Only a small fraction of C code is affected by this PEP, namely code
that needs to look "inside" unicode strings. That code doesn't
necessarily need to be ported to this API, as the existing API will
continue to work correctly. In particular, modules that need to
support both Python 2 and Python 3 might get too complicated when
simultaneously supporting this new API and the old Unicode API.

In order to port modules to the new API, try to eliminate
the use of these API elements:

- the Py_UNICODE type,
- PyUnicode_AS_UNICODE and PyUnicode_AsUnicode,
- PyUnicode_GET_SIZE and PyUnicode_GetSize, and
- PyUnicode_FromUnicode.

When iterating over an existing string, or looking at specific
characters, use indexing operations rather than pointer arithmetic;
indexing works well for PyUnicode_READ(_CHAR) and PyUnicode_WRITE. Use
void* as the buffer type for characters to let the compiler detect
invalid dereferencing operations. If you do want to use pointer
arithmetic (e.g. when converting existing code), use (unsigned)
char* as the buffer type, and keep the element size (1, 2, or 4) in a
variable. Notice that (1<<(kind-1)) will produce the element size
given a buffer kind.
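
A minimal sketch of the pointer-arithmetic variant follows. It is
stand-alone C rather than code against the real API: element_size and
count_char are hypothetical helpers, and the kind values 1, 2 and 3
are assumed to be those given under String Access.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Element size in bytes for a canonical kind (1, 2 or 3). */
size_t element_size(int kind)
{
    assert(kind >= 1 && kind <= 3);
    return (size_t)1 << (kind - 1);
}

/* Count occurrences of `ch`, stepping a byte pointer by the element
   size instead of indexing a typed array. */
size_t count_char(int kind, const void *data, size_t length, uint32_t ch)
{
    const unsigned char *p = (const unsigned char *)data;
    size_t size = element_size(kind);
    const unsigned char *end = p + length * size;
    size_t hits = 0;

    for (; p < end; p += size) {
        uint32_t value;
        if (size == 1) {
            value = *p;
        } else if (size == 2) {
            uint16_t v;
            memcpy(&v, p, sizeof v); /* load without aliasing issues */
            value = v;
        } else {
            memcpy(&value, p, sizeof value);
        }
        if (value == ch)
            hits++;
    }
    return hits;
}
```

Keeping the element size in one variable means the loop body is the
only place that has to know about the three widths.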

When creating new strings, it was common in Python to start off with a
heuristic buffer size, and then grow or shrink if the heuristic
failed. With this PEP, this is now less practical, as you need not
only a heuristic for the length of the string, but also for the
maximum character.

In order to avoid heuristics, you need to make two passes over the
input: once to determine the output length, and the maximum character;
then allocate the target string with PyUnicode_New and iterate over
the input a second time to produce the final output. While this may
sound expensive, it could actually be cheaper than having to copy the
result again, as the heuristic approach requires.
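
The two-pass scheme might look as follows for UTF-16 input. This is a
sketch under stated assumptions: decoded_string and both functions are
hypothetical, calloc stands in for PyUnicode_New, and unpaired
surrogates are passed through unchanged instead of being reported as
errors.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    size_t length;    /* number of code points */
    size_t char_size; /* 1, 2 or 4 bytes per character */
    void *data;       /* malloc'ed and null-terminated */
} decoded_string;

/* Decode one code point from UTF-16 and advance *pos past it. */
static uint32_t next_code_point(const uint16_t *in, size_t n, size_t *pos)
{
    uint32_t c = in[(*pos)++];

    if (c >= 0xD800 && c <= 0xDBFF && *pos < n) {
        uint32_t low = in[*pos];
        if (low >= 0xDC00 && low <= 0xDFFF) {
            (*pos)++;
            c = 0x10000 + ((c - 0xD800) << 10) + (low - 0xDC00);
        }
    }
    return c;
}

/* Returns 0 on success, -1 if the allocation fails. */
int decode_two_pass(const uint16_t *in, size_t n, decoded_string *out)
{
    size_t pos, i, length = 0;
    uint32_t maxchar = 0;

    /* Pass 1: determine the output length and the maximum character. */
    for (pos = 0; pos < n; length++) {
        uint32_t c = next_code_point(in, n, &pos);
        if (c > maxchar)
            maxchar = c;
    }

    /* Allocate exactly once, in the narrowest sufficient width.
       calloc zero-fills, which also provides the terminator. */
    out->length = length;
    out->char_size = maxchar < 256 ? 1 : maxchar < 65536 ? 2 : 4;
    out->data = calloc(length + 1, out->char_size);
    if (out->data == NULL)
        return -1;

    /* Pass 2: decode again, this time storing the characters. */
    for (pos = 0, i = 0; pos < n; i++) {
        uint32_t c = next_code_point(in, n, &pos);
        if (out->char_size == 1)
            ((uint8_t *)out->data)[i] = (uint8_t)c;
        else if (out->char_size == 2)
            ((uint16_t *)out->data)[i] = (uint16_t)c;
        else
            ((uint32_t *)out->data)[i] = c;
    }
    assert(i == length);
    return 0;
}
```

The input is decoded twice, but the output is written once and never
copied or resized.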

If you take the heuristic route, avoid allocating a string meant to
be resized, as resizing strings won't work for their canonical
representation. Instead, allocate a separate buffer to collect the
characters, and then construct a unicode object from that using
PyUnicode_FromKindAndData. One option is to use Py_UCS4 as the buffer
element, assuming the worst case in character ordinals. This will
allow for pointer arithmetic, but may require a lot of memory.
Alternatively, start with a 1-byte buffer, and increase the element
size as you encounter larger characters. In any case,
PyUnicode_FromKindAndData will scan over the buffer to verify the
maximum character.
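
The Py_UCS4 buffer option can be sketched as below. The names
ucs4_buffer, buffer_append and buffer_finish are hypothetical;
buffer_finish only models the scan-and-narrow step that the text
attributes to PyUnicode_FromKindAndData and is not that function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Growable scratch buffer with 4-byte elements (worst case). */
typedef struct {
    uint32_t *items;
    size_t used;
    size_t capacity;
} ucs4_buffer;

/* Append one character, doubling the buffer when it is full.
   Returns 0 on success, -1 if the allocation fails. */
int buffer_append(ucs4_buffer *b, uint32_t c)
{
    assert(c <= 0x10FFFF);
    if (b->used == b->capacity) {
        size_t cap = b->capacity ? b->capacity * 2 : 16;
        uint32_t *p = realloc(b->items, cap * sizeof *p);
        if (p == NULL)
            return -1;
        b->items = p;
        b->capacity = cap;
    }
    b->items[b->used++] = c;
    return 0;
}

/* Scan the buffer for its maximum character, then copy it into the
   narrowest sufficient width. Returns a malloc'ed, null-terminated
   array and stores the element width in *char_size. */
void *buffer_finish(const ucs4_buffer *b, size_t *char_size)
{
    uint32_t maxchar = 0;
    size_t i;
    void *data;

    for (i = 0; i < b->used; i++)
        if (b->items[i] > maxchar)
            maxchar = b->items[i];

    *char_size = maxchar < 256 ? 1 : maxchar < 65536 ? 2 : 4;
    data = calloc(b->used + 1, *char_size); /* zero-fill = terminator */
    if (data == NULL)
        return NULL;

    for (i = 0; i < b->used; i++) {
        if (*char_size == 1)
            ((uint8_t *)data)[i] = (uint8_t)b->items[i];
        else if (*char_size == 2)
            ((uint16_t *)data)[i] = (uint16_t)b->items[i];
        else
            ((uint32_t *)data)[i] = b->items[i];
    }
    return data;
}
```

The scratch buffer costs four bytes per character while it is being
filled, which is the memory trade-off mentioned above; the final copy
is as narrow as the content allows.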

For common tasks, direct access to the string representation may not
be necessary: PyUnicode_Find, PyUnicode_FindChar, PyUnicode_Ord, and
PyUnicode_CopyCharacters help in analyzing and creating string
objects, operating on indexes instead of data pointers.

References
==========

.. [1] PEP 393 branch
       https://bitbucket.org/t0rsten/pep-393

Copyright
=========

This document has been placed in the public domain.


..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End: