Add porting guidelines.

2011-09-15 17:29:41 +02:00 · 2011-09-15 17:29:41 +02:00 · 602c44f1d7
parent a17354feff
commit 602c44f1d7
1 changed files with 58 additions and 0 deletions
--- a/pep-0393.txt
+++ b/pep-0393.txt
@@ -272,6 +272,64 @@ the iobench, stringbench, and json benchmarks see typically
 slowdowns of 1% to 30%; for specific benchmarks, speedups may
 happen as may happen significantly larger slowdowns.

+Porting Guidelines
+==================
+
+Only a small fraction of C code is affected by this PEP, namely code
+that needs to look "inside" unicode strings.  That code doesn't
+necessarily need to be ported to this API, as the existing API will
+continue to work correctly. In particular, modules that need to
+support both Python 2 and Python 3 might get too complicated when
+simultaneously supporting this new API and the old Unicode API.
+
+In order to port modules to the new API, try to eliminate
+the use of these API elements:
+
+- the Py_UNICODE type,
+- PyUnicode_AS_UNICODE and PyUnicode_AsUnicode,
+- PyUnicode_GET_SIZE and PyUnicode_GetSize, and
+- PyUnicode_FromUnicode.
+
+When iterating over an existing string, or looking at specific
+characters, use indexing operations rather than pointer arithmetic;
+indexing works well for PyUnicode_READ(_CHAR) and PyUnicode_WRITE. Use
+void* as the buffer type for characters to let the compiler detect
+invalid dereferencing operations. If you do want to use pointer
+arithmetic (e.g. when converting existing code), use (unsigned)
+char* as the buffer type, and keep the element size (1, 2, or 4) in a
+variable. Notice that (1<<(kind-1)) will produce the element size
+given a buffer kind.
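
The indexing advice above can be illustrated with a minimal, standalone sketch. It does not include Python.h: the KIND_* constants and the read_char, write_char, and count_char helpers are hypothetical stand-ins written in the spirit of PyUnicode_READ and PyUnicode_WRITE, and the kinds are assumed to be numbered 1, 2, 3, as the (1<<(kind-1)) formula implies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in kind constants (assumption of this sketch): numbered 1, 2, 3
   so that (1 << (kind - 1)) yields the element size. */
enum { KIND_1BYTE = 1, KIND_2BYTE = 2, KIND_4BYTE = 3 };

/* Element size in bytes for a buffer kind: 1, 2, or 4. */
size_t kind_size(int kind)
{
    assert(kind >= KIND_1BYTE && kind <= KIND_4BYTE);
    return (size_t)1 << (kind - 1);
}

/* Read the character at `index`. The buffer is a void*, so it cannot be
   dereferenced by accident; the cast is chosen from the kind. */
uint32_t read_char(int kind, const void *data, size_t index)
{
    switch (kind) {
    case KIND_1BYTE: return ((const uint8_t *)data)[index];
    case KIND_2BYTE: return ((const uint16_t *)data)[index];
    default:         return ((const uint32_t *)data)[index];
    }
}

/* Write `ch` at `index`; the caller guarantees that `ch` fits the kind. */
void write_char(int kind, void *data, size_t index, uint32_t ch)
{
    switch (kind) {
    case KIND_1BYTE: ((uint8_t *)data)[index] = (uint8_t)ch; break;
    case KIND_2BYTE: ((uint16_t *)data)[index] = (uint16_t)ch; break;
    default:         ((uint32_t *)data)[index] = ch; break;
    }
}

/* Count occurrences of `ch` using indexing only, no pointer arithmetic. */
size_t count_char(int kind, const void *data, size_t length, uint32_t ch)
{
    size_t i, n = 0;
    for (i = 0; i < length; i++) {
        if (read_char(kind, data, i) == ch)
            n++;
    }
    return n;
}
```

The loop in count_char never needs to know the element width, which is the main appeal of index-based access when porting code that used to walk a Py_UNICODE pointer.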
+
+When creating new strings, it was common in Python to start off with a
+heuristic buffer size, and then grow or shrink if the heuristic
+failed. With this PEP, this is now less practical, as you need not
+only a heuristic for the length of the string, but also for the
+maximum character.
+
+In order to avoid heuristics, you need to make two passes over the
+input: once to determine the output length, and the maximum character;
+then allocate the target string with PyUnicode_New and iterate over
+the input a second time to produce the final output. While this may
+sound expensive, it could actually be cheaper than having to copy the
+result again as in the following approach.
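
The two-pass approach might look roughly like the following standalone sketch. It is a model, not extension code: the two_pass_result type and the remove_char function are hypothetical, the input is assumed to be an array of 32-bit code points, and malloc stands in for the single allocation that real code would perform with PyUnicode_New(length, maxchar).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical result type standing in for a newly created string. */
typedef struct {
    size_t length;     /* number of characters */
    size_t char_size;  /* bytes per character: 1, 2, or 4 */
    void *data;        /* owned buffer; release with free() */
} two_pass_result;

/* Copy `input` without the character `skip`.
   Returns 0 on success, -1 if the allocation fails. */
int remove_char(const uint32_t *input, size_t n, uint32_t skip,
                two_pass_result *out)
{
    size_t i, j = 0, length = 0;
    uint32_t maxchar = 0;

    /* Pass 1: determine the output length and the maximum character. */
    for (i = 0; i < n; i++) {
        if (input[i] == skip)
            continue;
        length++;
        if (input[i] > maxchar)
            maxchar = input[i];
    }

    /* Allocate exactly once, with the narrowest element size that fits.
       Real code would call PyUnicode_New(length, maxchar) at this point. */
    out->length = length;
    out->char_size = maxchar < 0x100 ? 1 : maxchar < 0x10000 ? 2 : 4;
    out->data = malloc(length ? length * out->char_size : 1);
    if (out->data == NULL)
        return -1;

    /* Pass 2: produce the final output by index. */
    for (i = 0; i < n; i++) {
        if (input[i] == skip)
            continue;
        switch (out->char_size) {
        case 1:  ((uint8_t *)out->data)[j] = (uint8_t)input[i]; break;
        case 2:  ((uint16_t *)out->data)[j] = (uint16_t)input[i]; break;
        default: ((uint32_t *)out->data)[j] = input[i]; break;
        }
        j++;
    }
    assert(j == length);
    return 0;
}
```

Both passes apply the same filter, so the length and maximum character computed in the first pass are exact and no resize or copy is needed afterwards.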
+
+If you take the heuristic route, avoid allocating a string meant to
+be resized, as resizing strings won't work for their canonical
+representation.  Instead, allocate a separate buffer to collect the
+characters, and then construct a unicode object from that using
+PyUnicode_FromKindAndData. One option is to use Py_UCS4 as the buffer
+element, assuming the worst case in character ordinals. This will
+allow for pointer arithmetic, but may require a lot of memory.
+Alternatively, start with a 1-byte buffer, and increase the element
+size as you encounter larger characters. In any case,
+PyUnicode_FromKindAndData will scan over the buffer to verify the
+maximum character.
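
The second option, a separate buffer that starts with 1-byte elements and widens on demand, could be sketched as follows. The charbuf type and its functions are hypothetical and exist only for illustration; the growth policy (start at 16 characters, then double) is likewise an assumption. In real code the finished buffer would be passed to PyUnicode_FromKindAndData together with the matching kind and the length.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical growable character buffer; not part of any Python API. */
typedef struct {
    void *data;        /* owned storage; release with charbuf_free() */
    size_t length;     /* characters stored */
    size_t capacity;   /* characters that fit */
    size_t char_size;  /* bytes per character: 1, 2, or 4 */
} charbuf;

void charbuf_init(charbuf *b)
{
    b->data = NULL;
    b->length = 0;
    b->capacity = 0;
    b->char_size = 1;  /* start with the narrowest element */
}

void charbuf_free(charbuf *b)
{
    free(b->data);
    charbuf_init(b);
}

uint32_t charbuf_get(const charbuf *b, size_t i)
{
    assert(i < b->length);
    switch (b->char_size) {
    case 1:  return ((const uint8_t *)b->data)[i];
    case 2:  return ((const uint16_t *)b->data)[i];
    default: return ((const uint32_t *)b->data)[i];
    }
}

static void store(void *data, size_t char_size, size_t i, uint32_t ch)
{
    switch (char_size) {
    case 1:  ((uint8_t *)data)[i] = (uint8_t)ch; break;
    case 2:  ((uint16_t *)data)[i] = (uint16_t)ch; break;
    default: ((uint32_t *)data)[i] = ch; break;
    }
}

/* Move the contents into a new block with the given capacity and element
   size, converting every stored character. Returns 0, or -1 on failure. */
static int charbuf_reshape(charbuf *b, size_t capacity, size_t char_size)
{
    size_t i;
    void *fresh;

    if (capacity > SIZE_MAX / char_size)
        return -1;
    fresh = malloc(capacity * char_size);
    if (fresh == NULL)
        return -1;
    for (i = 0; i < b->length; i++)
        store(fresh, char_size, i, charbuf_get(b, i));
    free(b->data);
    b->data = fresh;
    b->capacity = capacity;
    b->char_size = char_size;
    return 0;
}

/* Append one character, growing and/or widening the buffer as needed. */
int charbuf_append(charbuf *b, uint32_t ch)
{
    size_t need = ch < 0x100 ? 1 : ch < 0x10000 ? 2 : 4;
    size_t char_size = need > b->char_size ? need : b->char_size;
    size_t capacity = b->capacity;

    if (b->length == capacity) {
        if (capacity > SIZE_MAX / 2)
            return -1;
        capacity = capacity ? capacity * 2 : 16;
    }
    if (capacity != b->capacity || char_size != b->char_size) {
        if (charbuf_reshape(b, capacity, char_size) < 0)
            return -1;
    }
    store(b->data, b->char_size, b->length++, ch);
    return 0;
}
```

Widening costs a full copy each time a larger character first appears, but that happens at most twice per buffer (1 to 2 bytes, then 2 to 4), whereas starting with 4-byte elements pays the memory cost up front for every string.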
+
+For common tasks, direct access to the string representation may not
+be necessary: PyUnicode_Find, PyUnicode_FindChar, PyUnicode_Ord, and
+PyUnicode_CopyCharacters help in analyzing and creating string
+objects, operating on indices instead of data pointers.
+
 Copyright
 =========