Add porting guidelines.
parent a17354feff
commit 602c44f1d7

pep-0393.txt (58 lines changed)
@@ -272,6 +272,64 @@ the iobench, stringbench, and json benchmarks see typically
slowdowns of 1% to 30%; for specific benchmarks, speedups may
occur, as may significantly larger slowdowns.

Porting Guidelines
==================

Only a small fraction of C code is affected by this PEP, namely code
that needs to look "inside" unicode strings. That code doesn't
necessarily need to be ported to this API, as the existing API will
continue to work correctly. In particular, modules that need to
support both Python 2 and Python 3 might get too complicated when
simultaneously supporting this new API and the old Unicode API.

In order to port modules to the new API, try to eliminate
the use of these API elements:

- the Py_UNICODE type,
- PyUnicode_AS_UNICODE and PyUnicode_AsUnicode,
- PyUnicode_GET_SIZE and PyUnicode_GetSize, and
- PyUnicode_FromUnicode.

When iterating over an existing string, or looking at specific
characters, use indexing operations rather than pointer arithmetic;
indexing works well for PyUnicode_READ(_CHAR) and PyUnicode_WRITE. Use
void* as the buffer type for characters to let the compiler detect
invalid dereferencing operations. If you do want to use pointer
arithmetic (e.g. when converting existing code), use (unsigned)
char* as the buffer type, and keep the element size (1, 2, or 4) in a
variable. Notice that (1<<(kind-1)) will produce the element size
given a buffer kind.
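
As an illustration of the indexing style, here is a minimal,
self-contained sketch in plain C. It does not use the Python headers:
the KIND_* constants and the helpers element_size, read_char, and
count_char are hypothetical stand-ins (read_char plays the role that
PyUnicode_READ plays in the new API), and the kinds are assumed to be
numbered 1, 2, 3 so that (1<<(kind-1)) matches the text. Real code
should take the kind constants from the Python headers instead of
hard-coding values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the buffer kinds. They are numbered
   1, 2, 3 (an assumption) so that (1 << (kind - 1)) yields the
   element size, as the text describes. */
enum { KIND_1BYTE = 1, KIND_2BYTE = 2, KIND_4BYTE = 3 };

/* Element size in bytes for a given kind: 1, 2, or 4. */
size_t element_size(int kind)
{
    assert(kind >= KIND_1BYTE && kind <= KIND_4BYTE);
    return (size_t)1 << (kind - 1);
}

/* Read the character at `index` from an untyped buffer. Because the
   buffer is a void*, it cannot be dereferenced by accident; every
   access has to go through this helper. */
uint32_t read_char(int kind, const void *data, size_t index)
{
    switch (kind) {
    case KIND_1BYTE: return ((const uint8_t *)data)[index];
    case KIND_2BYTE: return ((const uint16_t *)data)[index];
    default:         return ((const uint32_t *)data)[index];
    }
}

/* Count occurrences of `ch` using indexing only, no pointer arithmetic. */
size_t count_char(int kind, const void *data, size_t length, uint32_t ch)
{
    size_t count = 0;
    for (size_t i = 0; i < length; i++) {
        if (read_char(kind, data, i) == ch)
            count++;
    }
    return count;
}
```

The loop body is identical for all three element widths; only
read_char knows how wide a character is.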

When creating new strings, it was common in Python to start off with a
heuristic buffer size, and then grow or shrink if the heuristic
failed. With this PEP, this is now less practical, as you need a
heuristic not only for the length of the string, but also for the
maximum character.

In order to avoid heuristics, you need to make two passes over the
input: a first pass to determine the output length and the maximum
character; then allocate the target string with PyUnicode_New and
iterate over the input a second time to produce the final output.
While this may sound expensive, it could actually be cheaper than
having to copy the result again as in the following approach.
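
The two-pass idea can be sketched in plain C without the Python
headers. Everything named here is hypothetical: packed_string stands
in for a string allocated with PyUnicode_New, size_for_maxchar for the
width choice that the allocation implies, and strip_spaces is just an
example transformation whose output length is not known in advance.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical result type, standing in for a string that real code
   would allocate with PyUnicode_New(length, maxchar). */
typedef struct {
    size_t length;     /* number of characters */
    size_t char_size;  /* bytes per character: 1, 2, or 4 */
    void *data;        /* length * char_size bytes, owned by the caller */
} packed_string;

/* Smallest element size that can hold every character up to maxchar. */
size_t size_for_maxchar(uint32_t maxchar)
{
    if (maxchar < 0x100) return 1;
    if (maxchar < 0x10000) return 2;
    return 4;
}

/* Copy `input` without spaces. Returns 0 on success, -1 if the
   allocation fails. */
int strip_spaces(const uint32_t *input, size_t n, packed_string *out)
{
    size_t length = 0;
    uint32_t maxchar = 0;

    /* Pass 1: determine the output length and the maximum character. */
    for (size_t i = 0; i < n; i++) {
        if (input[i] == ' ')
            continue;
        length++;
        if (input[i] > maxchar)
            maxchar = input[i];
    }

    /* Allocate once, with the exact size and element width. */
    out->length = length;
    out->char_size = size_for_maxchar(maxchar);
    out->data = malloc(length ? length * out->char_size : 1);
    if (out->data == NULL)
        return -1;

    /* Pass 2: produce the output by index. */
    size_t j = 0;
    for (size_t i = 0; i < n; i++) {
        if (input[i] == ' ')
            continue;
        switch (out->char_size) {
        case 1:  ((uint8_t *)out->data)[j] = (uint8_t)input[i]; break;
        case 2:  ((uint16_t *)out->data)[j] = (uint16_t)input[i]; break;
        default: ((uint32_t *)out->data)[j] = input[i]; break;
        }
        j++;
    }
    assert(j == length);
    return 0;
}
```

The result is never resized or copied after allocation, which is the
saving the paragraph above refers to.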

If you take the heuristic route, avoid allocating a string meant to
be resized, as resizing strings won't work for their canonical
representation. Instead, allocate a separate buffer to collect the
characters, and then construct a unicode object from that using
PyUnicode_FromKindAndData. One option is to use Py_UCS4 as the buffer
element, assuming the worst case for character ordinals. This will
allow for pointer arithmetic, but may require a lot of memory.
Alternatively, start with a 1-byte buffer, and increase the element
size as you encounter larger characters. In any case,
PyUnicode_FromKindAndData will scan over the buffer to verify the
maximum character.
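
The second option, a buffer that starts with 1-byte elements and
widens on demand, might look roughly like the following plain C
sketch. The char_buffer type and the buffer_* helpers are hypothetical
and do not exist in the Python API; real code would finish by passing
the collected data to PyUnicode_FromKindAndData. Overflow checks on
the allocation size are omitted for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical growable character buffer (not part of the Python API). */
typedef struct {
    size_t length;     /* characters stored */
    size_t capacity;   /* characters allocated */
    size_t char_size;  /* current element size: 1, 2, or 4 */
    void *data;
} char_buffer;

void buffer_init(char_buffer *b)
{
    b->length = 0;
    b->capacity = 0;
    b->char_size = 1;  /* start with 1-byte elements */
    b->data = NULL;
}

/* Read element i from a buffer with the given element size. */
uint32_t buffer_load(const void *data, size_t char_size, size_t i)
{
    switch (char_size) {
    case 1:  return ((const uint8_t *)data)[i];
    case 2:  return ((const uint16_t *)data)[i];
    default: return ((const uint32_t *)data)[i];
    }
}

/* Write element i of a buffer with the given element size. */
void buffer_store(void *data, size_t char_size, size_t i, uint32_t ch)
{
    switch (char_size) {
    case 1:  ((uint8_t *)data)[i] = (uint8_t)ch; break;
    case 2:  ((uint16_t *)data)[i] = (uint16_t)ch; break;
    default: ((uint32_t *)data)[i] = ch; break;
    }
}

/* Append one character, growing or widening the buffer when needed.
   Returns 0 on success, -1 if the allocation fails. */
int buffer_append(char_buffer *b, uint32_t ch)
{
    size_t need = ch < 0x100 ? 1 : ch < 0x10000 ? 2 : 4;
    size_t new_size = need > b->char_size ? need : b->char_size;
    size_t new_cap = b->capacity;

    assert(b->length <= b->capacity);
    if (b->length == b->capacity)
        new_cap = b->capacity ? b->capacity * 2 : 16;

    if (new_size != b->char_size || new_cap != b->capacity) {
        /* Re-encode the existing characters into the new buffer. */
        void *fresh = malloc(new_cap * new_size);
        if (fresh == NULL)
            return -1;
        for (size_t i = 0; i < b->length; i++)
            buffer_store(fresh, new_size, i,
                         buffer_load(b->data, b->char_size, i));
        free(b->data);
        b->data = fresh;
        b->char_size = new_size;
        b->capacity = new_cap;
    }
    buffer_store(b->data, b->char_size, b->length++, ch);
    return 0;
}
```

Each widening step copies everything collected so far, so this trades
extra copying for lower memory use compared with a Py_UCS4 buffer.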

For common tasks, direct access to the string representation may not
be necessary: PyUnicode_Find, PyUnicode_FindChar, PyUnicode_Ord, and
PyUnicode_CopyCharacters help in analyzing and creating string
objects, operating on indices instead of data pointers.

Copyright
=========