Add porting guidelines.

Martin v. Löwis 2011-09-15 17:29:41 +02:00
parent a17354feff
commit 602c44f1d7
1 changed files with 58 additions and 0 deletions


@@ -272,6 +272,64 @@ the iobench, stringbench, and json benchmarks see typically
slowdowns of 1% to 30%; for specific benchmarks, speedups may
occur, as may significantly larger slowdowns.
Porting Guidelines
==================
Only a small fraction of C code is affected by this PEP, namely code
that needs to look "inside" unicode strings. That code doesn't
necessarily need to be ported to this API, as the existing API will
continue to work correctly. In particular, modules that need to
support both Python 2 and Python 3 might get too complicated when
simultaneously supporting this new API and the old Unicode API.
In order to port modules to the new API, try to eliminate
the use of these API elements:
- the Py_UNICODE type,
- PyUnicode_AS_UNICODE and PyUnicode_AsUnicode,
- PyUnicode_GET_SIZE and PyUnicode_GetSize, and
- PyUnicode_FromUnicode.
When iterating over an existing string, or looking at specific
characters, use indexing operations rather than pointer arithmetic;
indexing works well for PyUnicode_READ(_CHAR) and PyUnicode_WRITE. Use
void* as the buffer type for characters to let the compiler detect
invalid dereferencing operations. If you do want to use pointer
arithmetic (e.g. when converting existing code), use (unsigned)
char* as the buffer type, and keep the element size (1, 2, or 4) in a
variable. Notice that (1<<(kind-1)) will produce the element size
given a buffer kind.
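The two access styles above can be sketched in plain C, without
Python.h, so that the example compiles on its own. The names
KIND_1BYTE, KIND_2BYTE, KIND_4BYTE, element_size, read_char, and
count_char are hypothetical and exist only for this illustration;
read_char merely mimics the role of PyUnicode_READ. The kind values
1, 2, 3 are an assumption made so that the (1<<(kind-1)) formula
applies. Released CPython headers define the 4-byte kind as 4, so
that the kind already equals the element size; check the constants
in your headers before relying on the formula.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical kind values, assumed to be 1, 2, 3 so that the formula
   from the text applies. These are not the constants from Python.h. */
enum { KIND_1BYTE = 1, KIND_2BYTE = 2, KIND_4BYTE = 3 };

/* Element size in bytes (1, 2, or 4) for a given kind. */
static size_t element_size(int kind)
{
    assert(kind >= KIND_1BYTE && kind <= KIND_4BYTE);
    return (size_t)1 << (kind - 1);
}

/* Index-based access through a void* buffer: the caller never
   dereferences the buffer directly. */
static uint32_t read_char(int kind, const void *data, size_t index)
{
    switch (kind) {
    case KIND_1BYTE:
        return ((const uint8_t *)data)[index];
    case KIND_2BYTE:
        return ((const uint16_t *)data)[index];
    default:
        assert(kind == KIND_4BYTE);
        return ((const uint32_t *)data)[index];
    }
}

/* Pointer arithmetic through unsigned char*, stepping by the element
   size kept in a variable. Counts occurrences of ch. */
static size_t count_char(int kind, const void *data, size_t length,
                         uint32_t ch)
{
    size_t step = element_size(kind);
    const unsigned char *p = (const unsigned char *)data;
    const unsigned char *end = p + length * step;
    size_t count = 0;

    for (; p < end; p += step) {
        uint32_t value = 0;
        if (step == 1) {
            value = *p;
        } else if (step == 2) {
            uint16_t v;
            memcpy(&v, p, sizeof v);
            value = v;
        } else {
            memcpy(&value, p, sizeof value);
        }
        if (value == ch)
            count++;
    }
    return count;
}
```

The index-based helper keeps the width handling in one place, whereas
the pointer version has to carry the step through every loop.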
When creating new strings, it was common in Python to start off with a
heuristic buffer size, and then grow or shrink if the heuristic
failed. With this PEP, this is now less practical, as you need not
only a heuristic for the length of the string, but also for the
maximum character.
In order to avoid heuristics, you need to make two passes over the
input: once to determine the output length, and the maximum character;
then allocate the target string with PyUnicode_New and iterate over
the input a second time to produce the final output. While this may
sound expensive, it could actually be cheaper than having to copy the
result again as in the following approach.
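The two-pass approach can be sketched as a plain C model that does
not use Python.h. ModelString, size_for_maxchar, write_char, and
strip_spaces are hypothetical names; the malloc call stands in for
PyUnicode_New and write_char for PyUnicode_WRITE. The task itself
(copying code points while dropping U+0020) is only an assumed
example of a transformation whose output length is not known up
front.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a string object with a fixed element width. */
typedef struct {
    size_t length;     /* number of characters */
    size_t elem_size;  /* 1, 2, or 4 bytes per character */
    void *data;        /* owned by the caller, release with free() */
} ModelString;

/* Narrowest element width that can hold maxchar. */
static size_t size_for_maxchar(uint32_t maxchar)
{
    if (maxchar < 0x100)
        return 1;
    if (maxchar < 0x10000)
        return 2;
    return 4;
}

/* Index-based write, in the role of PyUnicode_WRITE. */
static void write_char(ModelString *s, size_t index, uint32_t ch)
{
    assert(index < s->length);
    switch (s->elem_size) {
    case 1:  ((uint8_t *)s->data)[index] = (uint8_t)ch;   break;
    case 2:  ((uint16_t *)s->data)[index] = (uint16_t)ch; break;
    default: ((uint32_t *)s->data)[index] = ch;           break;
    }
}

/* Copy the input code points to out, dropping spaces.
   Returns 0 on success, -1 if allocation fails. */
static int strip_spaces(const uint32_t *in, size_t n, ModelString *out)
{
    size_t length = 0, i, j = 0;
    uint32_t maxchar = 0;

    /* Pass 1: determine output length and maximum character. */
    for (i = 0; i < n; i++) {
        if (in[i] == 0x20)
            continue;
        length++;
        if (in[i] > maxchar)
            maxchar = in[i];
    }

    /* Allocate exactly once, with the final size and width. */
    out->length = length;
    out->elem_size = size_for_maxchar(maxchar);
    out->data = malloc(length ? length * out->elem_size : 1);
    if (out->data == NULL)
        return -1;

    /* Pass 2: produce the output. */
    for (i = 0; i < n; i++) {
        if (in[i] != 0x20)
            write_char(out, j++, in[i]);
    }
    return 0;
}
```

The first pass touches only the input, so the single allocation that
follows never has to be resized or widened.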
If you take the heuristic route, avoid allocating a string meant to
be resized, as resizing strings won't work for their canonical
representation. Instead, allocate a separate buffer to collect the
characters, and then construct a unicode object from that using
PyUnicode_FromKindAndData. One option is to use Py_UCS4 as the buffer
element, assuming the worst case for character ordinals. This will
allow for pointer arithmetic, but may require a lot of memory.
Alternatively, start with a 1-byte buffer, and increase the element
size as you encounter larger characters. In any case,
PyUnicode_FromKindAndData will scan over the buffer to verify the
maximum character.
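The widening-buffer alternative might look like the following plain C
model, again without Python.h. CharBuffer, buf_load, buf_store,
buf_reserve, and buf_append are hypothetical names, and the growth
policy (start at 16 elements, then double) is an assumption. In a
real extension the collected data would finally be handed to
PyUnicode_FromKindAndData together with the kind matching elem_size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical collecting buffer; initialize as {0, 0, 1, NULL}. */
typedef struct {
    size_t length;       /* characters stored so far */
    size_t capacity;     /* characters that fit before regrowing */
    size_t elem_size;    /* current width: 1, 2, or 4 bytes */
    unsigned char *data;
} CharBuffer;

static uint32_t buf_load(const unsigned char *data, size_t elem_size,
                         size_t index)
{
    const unsigned char *p = data + index * elem_size;
    uint16_t v16;
    uint32_t v32;
    if (elem_size == 1)
        return *p;
    if (elem_size == 2) {
        memcpy(&v16, p, sizeof v16);
        return v16;
    }
    memcpy(&v32, p, sizeof v32);
    return v32;
}

static void buf_store(unsigned char *data, size_t elem_size,
                      size_t index, uint32_t ch)
{
    unsigned char *p = data + index * elem_size;
    if (elem_size == 1) {
        *p = (unsigned char)ch;
    } else if (elem_size == 2) {
        uint16_t v16 = (uint16_t)ch;
        memcpy(p, &v16, sizeof v16);
    } else {
        memcpy(p, &ch, sizeof ch);
    }
}

/* Move the contents into a new allocation with the given capacity
   and element width. Returns 0 on success, -1 on failure. */
static int buf_reserve(CharBuffer *b, size_t capacity, size_t elem_size)
{
    unsigned char *fresh;
    size_t i;

    assert(capacity >= b->length && elem_size >= b->elem_size);
    fresh = malloc(capacity * elem_size);
    if (fresh == NULL)
        return -1;
    for (i = 0; i < b->length; i++)
        buf_store(fresh, elem_size, i,
                  buf_load(b->data, b->elem_size, i));
    free(b->data);
    b->data = fresh;
    b->capacity = capacity;
    b->elem_size = elem_size;
    return 0;
}

/* Append one character, growing and/or widening as needed. */
static int buf_append(CharBuffer *b, uint32_t ch)
{
    size_t need = ch < 0x100 ? 1 : (ch < 0x10000 ? 2 : 4);
    size_t elem_size = b->elem_size > need ? b->elem_size : need;
    size_t capacity = b->capacity;

    if (b->length == capacity)
        capacity = capacity ? capacity * 2 : 16;
    if (elem_size != b->elem_size || capacity != b->capacity) {
        if (buf_reserve(b, capacity, elem_size) < 0)
            return -1;
    }
    buf_store(b->data, b->elem_size, b->length++, ch);
    return 0;
}
```

Each widening step copies everything collected so far, which is the
copying cost the two-pass approach avoids.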
For common tasks, direct access to the string representation may not
be necessary: PyUnicode_Find, PyUnicode_FindChar, PyUnicode_Ord, and
PyUnicode_CopyCharacters help in analyzing and creating string
objects, operating on indices instead of data pointers.
Copyright
=========