From 602c44f1d78d9a6d9de2bd36b85fa5b99f1acdc7 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Martin=20v=2E=20L=C3=B6wis?=
Date: Thu, 15 Sep 2011 17:29:41 +0200
Subject: [PATCH] Add porting guidelines.

---
 pep-0393.txt | 58 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 58 insertions(+)

diff --git a/pep-0393.txt b/pep-0393.txt
index d9fafab88..740f8e368 100644
--- a/pep-0393.txt
+++ b/pep-0393.txt
@@ -272,6 +272,64 @@ the iobench, stringbench, and json benchmarks see typically
 slowdowns of 1% to 30%; for specific benchmarks, speedups may happen
 as may happen significantly larger slowdowns.
 
+Porting Guidelines
+==================
+
+Only a small fraction of C code is affected by this PEP, namely code
+that needs to look "inside" unicode strings. That code doesn't
+necessarily need to be ported to this API, as the existing API will
+continue to work correctly. In particular, modules that need to
+support both Python 2 and Python 3 might get too complicated when
+simultaneously supporting this new API and the old Unicode API.
+
+In order to port modules to the new API, try to eliminate
+the use of these API elements:
+
+- the Py_UNICODE type,
+- PyUnicode_AS_UNICODE and PyUnicode_AsUnicode,
+- PyUnicode_GET_SIZE and PyUnicode_GetSize, and
+- PyUnicode_FromUnicode.
+
+When iterating over an existing string, or looking at specific
+characters, use indexing operations rather than pointer arithmetic;
+indexing works well with PyUnicode_READ(_CHAR) and PyUnicode_WRITE. Use
+void* as the buffer type for characters to let the compiler detect
+invalid dereferencing operations. If you do want to use pointer
+arithmetic (e.g. when converting existing code), use (unsigned)
+char* as the buffer type, and keep the element size (1, 2, or 4) in a
+variable. Notice that (1<<(kind-1)) will produce the element size
+given a buffer kind.
+
+When creating new strings, it was common in Python to start off with
+a heuristic buffer size, and then grow or shrink if the heuristic
+failed. With this PEP, this is now less practical, as you need a
+heuristic not only for the length of the string, but also for the
+maximum character.
+
+In order to avoid heuristics, you need to make two passes over the
+input: once to determine the output length and the maximum character;
+then allocate the target string with PyUnicode_New and iterate over
+the input a second time to produce the final output. While this may
+sound expensive, it could actually be cheaper than having to copy the
+result again as in the following approach.
+
+If you take the heuristic route, avoid allocating a string meant to
+be resized, as resizing strings won't work for their canonical
+representation. Instead, allocate a separate buffer to collect the
+characters, and then construct a unicode object from that using
+PyUnicode_FromKindAndData. One option is to use Py_UCS4 as the buffer
+element, assuming the worst case for character ordinals. This will
+allow for pointer arithmetic, but may require a lot of memory.
+Alternatively, start with a 1-byte buffer, and increase the element
+size as you encounter larger characters. In any case,
+PyUnicode_FromKindAndData will scan over the buffer to verify the
+maximum character.
+
+For common tasks, direct access to the string representation may not
+be necessary: PyUnicode_Find, PyUnicode_FindChar, PyUnicode_Ord, and
+PyUnicode_CopyCharacters help in analyzing and creating string
+objects, operating on indices instead of data pointers.
+
 Copyright
 =========
 
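To illustrate the index-based iteration recommended above, here is a
minimal sketch of a helper that counts space characters in a string;
the function name count_spaces is made up for the example, and error
handling is kept to a minimum::

    #include <Python.h>

    /* Count the spaces in a string using PyUnicode_READ with an index,
       rather than walking a Py_UNICODE pointer. */
    static Py_ssize_t
    count_spaces(PyObject *str)
    {
        int kind;
        void *data;                     /* void* catches stray dereferences */
        Py_ssize_t i, len, count = 0;

        if (PyUnicode_READY(str) < 0)   /* ensure the canonical form */
            return -1;
        kind = PyUnicode_KIND(str);
        data = PyUnicode_DATA(str);
        len = PyUnicode_GET_LENGTH(str);
        for (i = 0; i < len; i++) {
            if (PyUnicode_READ(kind, data, i) == ' ')
                count++;
        }
        return count;
    }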
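If existing code is converted and pointer arithmetic is kept, the
element size can live in a variable, as suggested above. The sketch
below returns the last character of a string; the chain of comparisons
against the kind constants is just one defensive way of obtaining the
element size, and last_char is again an invented name::

    /* Return the last character of str, stepping an unsigned char*
       by the element size instead of using Py_UNICODE*. */
    static Py_UCS4
    last_char(PyObject *str)
    {
        int kind;
        Py_ssize_t size, len;
        unsigned char *p;

        if (PyUnicode_READY(str) < 0)
            return 0;                       /* illustrative only */
        kind = PyUnicode_KIND(str);
        len = PyUnicode_GET_LENGTH(str);
        if (len == 0)
            return 0;
        if (kind == PyUnicode_1BYTE_KIND)
            size = 1;
        else if (kind == PyUnicode_2BYTE_KIND)
            size = 2;
        else
            size = 4;
        p = (unsigned char *)PyUnicode_DATA(str) + (len - 1) * size;
        if (size == 1)
            return *p;
        if (size == 2)
            return *(Py_UCS2 *)p;
        return *(Py_UCS4 *)p;
    }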
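The two-pass approach can look as follows; strip_spaces is a
hypothetical example that drops space characters, determining the
output length and the maximum character before allocating the result
with PyUnicode_New::

    static PyObject *
    strip_spaces(PyObject *str)
    {
        int kind, outkind;
        void *data, *outdata;
        Py_ssize_t i, j, len, outlen = 0;
        Py_UCS4 ch, maxchar = 0;
        PyObject *result;

        if (PyUnicode_READY(str) < 0)
            return NULL;
        kind = PyUnicode_KIND(str);
        data = PyUnicode_DATA(str);
        len = PyUnicode_GET_LENGTH(str);

        /* First pass: output length and maximum character. */
        for (i = 0; i < len; i++) {
            ch = PyUnicode_READ(kind, data, i);
            if (ch == ' ')
                continue;
            outlen++;
            if (ch > maxchar)
                maxchar = ch;
        }

        result = PyUnicode_New(outlen, maxchar);
        if (result == NULL)
            return NULL;
        outkind = PyUnicode_KIND(result);
        outdata = PyUnicode_DATA(result);

        /* Second pass: fill the exactly-sized result. */
        for (i = 0, j = 0; i < len; i++) {
            ch = PyUnicode_READ(kind, data, i);
            if (ch != ' ')
                PyUnicode_WRITE(outkind, outdata, j++, ch);
        }
        return result;
    }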
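For the buffer route, a Py_UCS4 scratch array sized for the worst case
can be grown with PyMem_Realloc and handed to PyUnicode_FromKindAndData
at the end. The "decoding" step below is a placeholder; only the buffer
handling is the point of the sketch::

    static PyObject *
    build_string(const unsigned char *input, Py_ssize_t inlen)
    {
        Py_ssize_t allocated = 16, outlen = 0, i;
        Py_UCS4 *buf;
        PyObject *result;

        buf = PyMem_Malloc(allocated * sizeof(Py_UCS4));
        if (buf == NULL)
            return PyErr_NoMemory();
        for (i = 0; i < inlen; i++) {
            if (outlen >= allocated) {      /* grow the scratch buffer */
                Py_UCS4 *newbuf;
                allocated *= 2;
                newbuf = PyMem_Realloc(buf, allocated * sizeof(Py_UCS4));
                if (newbuf == NULL) {
                    PyMem_Free(buf);
                    return PyErr_NoMemory();
                }
                buf = newbuf;
            }
            buf[outlen++] = input[i];       /* placeholder "decoding" step */
        }
        /* FromKindAndData rescans buf for the maximum character and
           copies it into the canonical representation. */
        result = PyUnicode_FromKindAndData(PyUnicode_4BYTE_KIND, buf, outlen);
        PyMem_Free(buf);
        return result;
    }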
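Where direct access is not needed, the index-based helpers mentioned in
the last paragraph of the guidelines are often enough. This sketch
copies everything after the first colon of a string into a new string;
PyUnicode_FindChar returns -1 when the character is not found and -2 on
error, and after_first_colon is again a made-up name::

    static PyObject *
    after_first_colon(PyObject *str)
    {
        Py_ssize_t len, pos;
        PyObject *result;

        if (PyUnicode_READY(str) < 0)
            return NULL;
        len = PyUnicode_GET_LENGTH(str);
        pos = PyUnicode_FindChar(str, ':', 0, len, 1);  /* search forward */
        if (pos == -2)
            return NULL;
        if (pos == -1)
            return PyUnicode_New(0, 0);     /* no colon: empty result */
        /* The result may end up wider than strictly necessary; a real
           implementation might compute the exact maximum character first. */
        result = PyUnicode_New(len - pos - 1, PyUnicode_MAX_CHAR_VALUE(str));
        if (result == NULL)
            return NULL;
        if (PyUnicode_CopyCharacters(result, 0, str, pos + 1,
                                     len - pos - 1) < 0) {
            Py_DECREF(result);
            return NULL;
        }
        return result;
    }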