Changes from Nick Coghlan:

Clarify PyUnicode_AsUTF8 usage.
Rename PyUnicode_Finalize.
Store representation form in state.
Martin v. Löwis 2011-01-27 21:37:25 +00:00
parent 55c04efb71
commit d363de45dd
1 changed files with 18 additions and 10 deletions

@@ -69,13 +69,21 @@ The Unicode object structure is changed to this definition::
 These fields have the following interpretations:
 - length: number of code points in the string (result of sq_length)
-- str: shortest-form representation of the unicode string; the lower
-  two bits of the pointer indicate the specific form:
-  01 => 1 byte (Latin-1); 10 => 2 byte (UCS-2); 11 => 4 byte (UCS-4);
-  00 => null pointer
+- str: shortest-form representation of the unicode string
   The string is null-terminated (in its respective representation).
-- hash, state: same as in Python 3.2
+- hash: same as in Python 3.2
+- state:
+  * lowest 2 bits (mask 0x03) - interned-state (SSTATE_*) as in 3.2
+  * next 2 bits (mask 0x0C) - form of str:
+    + 00 => reserved
+    + 01 => 1 byte (Latin-1)
+    + 10 => 2 byte (UCS-2)
+    + 11 => 4 byte (UCS-4);
+  * next bit (mask 0x10): 1 if str memory follows PyUnicodeObject
 - utf8_length, utf8: UTF-8 representation (null-terminated)
 - wstr_length, wstr: representation in platform's wchar_t
   (null-terminated). If wchar_t is 16-bit, this form may use surrogate
@@ -123,11 +131,11 @@ representation is not yet set for the string.
 PyUnicode_FromUnicode remains supported but is deprecated. If the
 Py_UNICODE pointer is non-null, the str representation is set. If the
 pointer is NULL, a properly-sized wstr representation is allocated,
-which can be modified until PyUnicode_Finalize() is called (explicitly
+which can be modified until PyUnicode_Ready() is called (explicitly
 or implicitly). Resizing a Unicode string remains possible until it
 is finalized.

-PyUnicode_Finalize() converts a string containing only a wstr
+PyUnicode_Ready() converts a string containing only a wstr
 representation into the canonical representation. Unless wstr and str
 can share the memory, the wstr representation is discarded after the
 conversion.
@@ -139,7 +147,7 @@ The canonical representation can be accessed using two macros
 PyUnicode_Kind and PyUnicode_Data. PyUnicode_Kind gives one of the
 value PyUnicode_1BYTE (1), PyUnicode_2BYTE (2), or PyUnicode_4BYTE
 (3). PyUnicode_Data gives the void pointer to the data, masking out
-the pointer kind. All these functions call PyUnicode_Finalize
+the pointer kind. All these functions call PyUnicode_Ready
 in case the canonical representation hasn't been computed yet.

 A new function PyUnicode_AsUTF8 is provided to access the UTF-8
@@ -150,7 +158,7 @@ consume memory until the string object is released, applications
 should use the existing PyUnicode_AsUTF8String where possible
 (which generates a new string object every time). API that implicitly
 converts a string to a char* (such as the ParseTuple functions) will
-use this function to compute a conversion.
+use PyUnicode_AsUTF8 to compute a conversion.

 PyUnicode_AsUnicode is deprecated; it computes the wstr representation
 on first use.
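
Review note: the state layout introduced in the first hunk can be sketched in C as below. This is only an illustration of the bit masks given in the new text (0x03, 0x0C, 0x10) and of reading a code point from data of a given form. Every `sketch_*` / `SKETCH_*` name is hypothetical and is not part of the proposed API; the actual macros (PyUnicode_Kind, PyUnicode_Data) may be implemented differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; only the mask values come from the field description. */
#define SKETCH_INTERNED_MASK 0x03u /* lowest 2 bits: interned state (SSTATE_*) */
#define SKETCH_FORM_MASK     0x0Cu /* next 2 bits: form of str */
#define SKETCH_FORM_SHIFT    2
#define SKETCH_EMBEDDED_MASK 0x10u /* next bit: str memory follows the object */

/* Form code: 0 reserved, 1 Latin-1, 2 UCS-2, 3 UCS-4. */
static unsigned sketch_form(unsigned state) {
    return (state & SKETCH_FORM_MASK) >> SKETCH_FORM_SHIFT;
}

static unsigned sketch_interned(unsigned state) {
    return state & SKETCH_INTERNED_MASK;
}

static int sketch_is_embedded(unsigned state) {
    return (state & SKETCH_EMBEDDED_MASK) != 0;
}

/* Bytes per code unit for a form code; 0 for the reserved form. */
static size_t sketch_unit_size(unsigned form) {
    switch (form) {
    case 1: return 1;
    case 2: return 2;
    case 3: return 4;
    default: return 0;
    }
}

/* Pack the three pieces back into one state value. */
static unsigned sketch_pack(unsigned interned, unsigned form, int embedded) {
    assert(interned <= 3 && form <= 3);
    return interned | (form << SKETCH_FORM_SHIFT)
                    | (embedded ? SKETCH_EMBEDDED_MASK : 0u);
}

/* Read code point i from a buffer stored in the given form. */
static uint32_t sketch_read(unsigned form, const void *data, size_t i) {
    switch (form) {
    case 1: return ((const uint8_t *)data)[i];
    case 2: return ((const uint16_t *)data)[i];
    case 3: return ((const uint32_t *)data)[i];
    default: return 0; /* reserved form: nothing to read */
    }
}
```

Moving the form bits out of the low bits of the str pointer and into state means str can be dereferenced directly, without masking first; the sketch reflects that by taking the form as a separate argument.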