Changes from Nick Coghlan:
Clarify PyUnicode_AsUTF8 usage. Rename PyUnicode_Finalize. Store representation form in state.
parent 55c04efb71
commit d363de45dd
pep-0393.txt (28 lines changed)
@@ -69,13 +69,21 @@ The Unicode object structure is changed to this definition::
 These fields have the following interpretations:
 
 - length: number of code points in the string (result of sq_length)
-- str: shortest-form representation of the unicode string; the lower
-  two bits of the pointer indicate the specific form:
-  01 => 1 byte (Latin-1); 10 => 2 byte (UCS-2); 11 => 4 byte (UCS-4);
-  00 => null pointer
-
+- str: shortest-form representation of the unicode string
   The string is null-terminated (in its respective representation).
-- hash, state: same as in Python 3.2
+- hash: same as in Python 3.2
+- state:
+
+  * lowest 2 bits (mask 0x03) - interned-state (SSTATE_*) as in 3.2
+  * next 2 bits (mask 0x0C) - form of str:
+
+    + 00 => reserved
+    + 01 => 1 byte (Latin-1)
+    + 10 => 2 byte (UCS-2)
+    + 11 => 4 byte (UCS-4);
+
+  * next bit (mask 0x10): 1 if str memory follows PyUnicodeObject
+
 - utf8_length, utf8: UTF-8 representation (null-terminated)
 - wstr_length, wstr: representation in platform's wchar_t
   (null-terminated). If wchar_t is 16-bit, this form may use surrogate
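Editorial note: the state layout added in this hunk can be decoded with plain mask-and-shift operations. The following is a minimal sketch under stated assumptions, not code from the PEP or from CPython. The masks 0x03, 0x0C and 0x10 and the form codes come from the text above; every identifier (STATE_*, state_kind, and so on) is hypothetical and chosen only for illustration.

```c
#include <assert.h>

/* Hypothetical names; only the mask values are taken from the PEP text. */
#define STATE_INTERNED_MASK 0x03u  /* lowest 2 bits: interned-state */
#define STATE_KIND_MASK     0x0Cu  /* next 2 bits: form of str */
#define STATE_KIND_SHIFT    2
#define STATE_EMBEDDED_MASK 0x10u  /* next bit: str memory follows the object */

/* Interned-state value stored in the lowest two bits. */
static inline unsigned state_interned(unsigned state)
{
    return state & STATE_INTERNED_MASK;
}

/* Form of str: 0 = reserved, 1 = Latin-1, 2 = UCS-2, 3 = UCS-4. */
static inline unsigned state_kind(unsigned state)
{
    return (state & STATE_KIND_MASK) >> STATE_KIND_SHIFT;
}

/* Nonzero if the str memory directly follows the object structure. */
static inline int state_str_embedded(unsigned state)
{
    return (state & STATE_EMBEDDED_MASK) != 0;
}

/* Bytes per character for a form code; 0 for the reserved form. */
static inline unsigned state_char_size(unsigned kind)
{
    switch (kind) {
    case 1: return 1;  /* Latin-1 */
    case 2: return 2;  /* UCS-2 */
    case 3: return 4;  /* UCS-4 */
    default: return 0; /* 00 is reserved */
    }
}

/* Small self-check: 0x19 is binary 1 10 01. */
static inline void state_layout_selfcheck(void)
{
    assert(state_interned(0x19u) == 1);
    assert(state_kind(0x19u) == 2);
    assert(state_str_embedded(0x19u));
}
```

Keeping the form in state rather than in the low bits of the str pointer means the pointer can be used directly, without masking, which appears to be the motivation for this part of the change.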
@@ -123,11 +131,11 @@ representation is not yet set for the string.
 PyUnicode_FromUnicode remains supported but is deprecated. If the
 Py_UNICODE pointer is non-null, the str representation is set. If the
 pointer is NULL, a properly-sized wstr representation is allocated,
-which can be modified until PyUnicode_Finalize() is called (explicitly
+which can be modified until PyUnicode_Ready() is called (explicitly
 or implicitly). Resizing a Unicode string remains possible until it
 is finalized.
 
-PyUnicode_Finalize() converts a string containing only a wstr
+PyUnicode_Ready() converts a string containing only a wstr
 representation into the canonical representation. Unless wstr and str
 can share the memory, the wstr representation is discarded after the
 conversion.
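Editorial note: the conversion step described in this hunk has to pick the shortest form that can hold every character of the string. The sketch below illustrates only that selection step, assuming the input is already available as an array of 32-bit code points (so surrogate pairs from a 16-bit wchar_t are assumed to have been combined beforehand). It is not the PEP's or CPython's implementation of PyUnicode_Ready, and the function name smallest_kind is hypothetical. The return values 1, 2 and 3 follow the kind values the PEP text lists for PyUnicode_Kind.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return the smallest form able to hold every code point in cp[0..n-1]:
 *   1 = 1 byte per character (Latin-1, code points below 0x100)
 *   2 = 2 bytes per character (UCS-2, code points below 0x10000)
 *   3 = 4 bytes per character (UCS-4)
 * An empty string fits the 1-byte form.
 */
static inline int smallest_kind(const uint32_t *cp, size_t n)
{
    uint32_t max = 0;

    assert(cp != NULL || n == 0);

    /* One pass to find the largest code point. */
    for (size_t i = 0; i < n; i++) {
        if (cp[i] > max) {
            max = cp[i];
        }
    }

    if (max < 0x100) {
        return 1;
    }
    if (max < 0x10000) {
        return 2;
    }
    return 3;
}
```

A real conversion would then allocate a buffer of the chosen width and copy the characters into it, narrowing each one.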
@@ -139,7 +147,7 @@ The canonical representation can be accessed using two macros
 PyUnicode_Kind and PyUnicode_Data. PyUnicode_Kind gives one of the
 value PyUnicode_1BYTE (1), PyUnicode_2BYTE (2), or PyUnicode_4BYTE
 (3). PyUnicode_Data gives the void pointer to the data, masking out
-the pointer kind. All these functions call PyUnicode_Finalize
+the pointer kind. All these functions call PyUnicode_Ready
 in case the canonical representation hasn't been computed yet.
 
 A new function PyUnicode_AsUTF8 is provided to access the UTF-8
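Editorial note: given a kind value and a void pointer as described in this hunk, a caller has to cast the pointer to the matching element width before indexing. The sketch below shows that dispatch in standalone C. The enum names and read_code_point are hypothetical; only the numeric values 1, 2 and 3 are taken from the PEP text, and no real CPython macro is called here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; the values mirror those listed in the PEP text. */
enum {
    SKETCH_1BYTE = 1,
    SKETCH_2BYTE = 2,
    SKETCH_4BYTE = 3
};

/* Read the character at position index from a buffer of the given kind. */
static inline uint32_t read_code_point(int kind, const void *data, size_t index)
{
    assert(data != NULL);

    switch (kind) {
    case SKETCH_1BYTE:
        return ((const uint8_t *)data)[index];
    case SKETCH_2BYTE:
        return ((const uint16_t *)data)[index];
    case SKETCH_4BYTE:
        return ((const uint32_t *)data)[index];
    default:
        assert(!"unknown kind");
        return 0;
    }
}
```

The caller is responsible for keeping index below the string length; the sketch does no bounds checking.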
@@ -150,7 +158,7 @@ consume memory until the string object is released, applications
 should use the existing PyUnicode_AsUTF8String where possible
 (which generates a new string object every time). API that implicitly
 converts a string to a char* (such as the ParseTuple functions) will
-use this function to compute a conversion.
+use PyUnicode_AsUTF8 to compute a conversion.
 
 PyUnicode_AsUnicode is deprecated; it computes the wstr representation
 on first use.