Add porting guidelines.
This commit is contained in:
parent a17354feff
commit 602c44f1d7

 pep-0393.txt | 58
@@ -272,6 +272,64 @@ the iobench, stringbench, and json benchmarks see typically

slowdowns of 1% to 30%; for specific benchmarks, speedups may
happen, as may significantly larger slowdowns.

Porting Guidelines
==================

Only a small fraction of C code is affected by this PEP, namely code
that needs to look "inside" unicode strings. That code doesn't
necessarily need to be ported to this API, as the existing API will
continue to work correctly. In particular, modules that need to
support both Python 2 and Python 3 might get too complicated when
simultaneously supporting this new API and the old Unicode API.

In order to port modules to the new API, try to eliminate
the use of these API elements:

- the Py_UNICODE type,
- PyUnicode_AS_UNICODE and PyUnicode_AsUnicode,
- PyUnicode_GET_SIZE and PyUnicode_GetSize, and
- PyUnicode_FromUnicode.

When iterating over an existing string, or looking at specific
|
||||||
|
characters, use indexing operations rather than pointer arithmetic;
|
||||||
|
indexing works well for PyUnicode_READ(_CHAR) and PyUnicode_WRITE. Use
|
||||||
|
void* as the buffer type for characters to let the compiler detect
|
||||||
|
invalid dereferencing operations. If you do want to use pointer
|
||||||
|
arithmentics (e.g. when converting existing code), use (unsigned)
|
||||||
|
char* as the buffer type, and keep the element size (1, 2, or 4) in a
|
||||||
|
variable. Notice that (1<<(kind-1)) will produce the element size
|
||||||
|
given a buffer kind.
|
||||||
|
|
||||||
|
When creating new strings, it was common in Python to start off with a
heuristic buffer size, and then grow or shrink if the heuristic
failed. With this PEP, this is now less practical, as you need not
only a heuristic for the length of the string, but also for the
maximum character.

In order to avoid heuristics, you need to make two passes over the
|
||||||
|
input: once to determine the output length, and the maximum character;
|
||||||
|
then allocate the target string with PyUnicode_New and iterate over
|
||||||
|
the input a second time to produce the final output. While this may
|
||||||
|
sound expensive, it could actually be cheaper than having to copy the
|
||||||
|
result again as in the following approach.
|
||||||
|
|
||||||
|
If you take the heuristical route, avoid allocating a string meant to
|
||||||
|
be resized, as resizing strings won't work for their canonical
|
||||||
|
representation. Instead, allocate a separate buffer to collect the
|
||||||
|
characters, and then construct a unicode object from that using
|
||||||
|
PyUnicode_FromKindAndData. One option is to use Py_UCS4 as the buffer
|
||||||
|
element, assuming for the worst case in character ordinals. This will
|
||||||
|
allow for pointer arithmetics, but may require a lot of memory.
|
||||||
|
Alternatively, start with a 1-byte buffer, and increase the element
|
||||||
|
size as you encounter larger characters. In any case,
|
||||||
|
PyUnicode_FromKindAndData will scan over the buffer to verify the
|
||||||
|
maximum character.
|
||||||
|
|
||||||
|
For common tasks, direct access to the string representation may not
|
||||||
|
be necessary: PyUnicode_Find, PyUnicode_FindChar, PyUnicode_Ord, and
|
||||||
|
PyUnicode_CopyCharacters help in analyzing and creating string
|
||||||
|
objects, operating on indices instead of data pointers.
|
||||||
|
|
||||||
Copyright
=========