Add porting guidelines.
This commit is contained in:
parent a17354feff
commit 602c44f1d7

 pep-0393.txt | 58
@@ -272,6 +272,64 @@ the iobench, stringbench, and json benchmarks see typically

slowdowns of 1% to 30%; for specific benchmarks, speedups may
happen, as may significantly larger slowdowns.

Porting Guidelines
==================

Only a small fraction of C code is affected by this PEP, namely code
that needs to look "inside" unicode strings. That code doesn't
necessarily need to be ported to this API, as the existing API will
continue to work correctly. In particular, modules that need to
support both Python 2 and Python 3 might get too complicated when
simultaneously supporting this new API and the old Unicode API.

In order to port modules to the new API, try to eliminate
the use of these API elements:

- the Py_UNICODE type,
- PyUnicode_AS_UNICODE and PyUnicode_AsUnicode,
- PyUnicode_GET_SIZE and PyUnicode_GetSize, and
- PyUnicode_FromUnicode.

When iterating over an existing string, or looking at specific
|
||||||
|
characters, use indexing operations rather than pointer arithmetic;
|
||||||
|
indexing works well for PyUnicode_READ(_CHAR) and PyUnicode_WRITE. Use
|
||||||
|
void* as the buffer type for characters to let the compiler detect
|
||||||
|
invalid dereferencing operations. If you do want to use pointer
|
||||||
|
arithmentics (e.g. when converting existing code), use (unsigned)
|
||||||
|
char* as the buffer type, and keep the element size (1, 2, or 4) in a
|
||||||
|
variable. Notice that (1<<(kind-1)) will produce the element size
|
||||||
|
given a buffer kind.
|
||||||
|
|
||||||
|
When creating new strings, it was common in Python to start off with a
heuristic buffer size, and then grow or shrink if the heuristic
failed. With this PEP, this is now less practical, as you need not
only a heuristic for the length of the string, but also for the
maximum character.

In order to avoid heuristics, you need to make two passes over the
|
||||||
|
input: once to determine the output length, and the maximum character;
|
||||||
|
then allocate the target string with PyUnicode_New and iterate over
|
||||||
|
the input a second time to produce the final output. While this may
|
||||||
|
sound expensive, it could actually be cheaper than having to copy the
|
||||||
|
result again as in the following approach.
|
||||||
|
|
||||||
|
If you take the heuristical route, avoid allocating a string meant to
|
||||||
|
be resized, as resizing strings won't work for their canonical
|
||||||
|
representation. Instead, allocate a separate buffer to collect the
|
||||||
|
characters, and then construct a unicode object from that using
|
||||||
|
PyUnicode_FromKindAndData. One option is to use Py_UCS4 as the buffer
|
||||||
|
element, assuming for the worst case in character ordinals. This will
|
||||||
|
allow for pointer arithmetics, but may require a lot of memory.
|
||||||
|
Alternatively, start with a 1-byte buffer, and increase the element
|
||||||
|
size as you encounter larger characters. In any case,
|
||||||
|
PyUnicode_FromKindAndData will scan over the buffer to verify the
|
||||||
|
maximum character.
|
||||||
|
|
||||||
|
For common tasks, direct access to the string representation may not
|
||||||
|
be necessary: PyUnicode_Find, PyUnicode_FindChar, PyUnicode_Ord, and
|
||||||
|
PyUnicode_CopyCharacters help in analyzing and creating string
|
||||||
|
objects, operating on indices instead of data pointers.
|
||||||
|
|
||||||
Copyright
=========