PEP: 756 Title: Add PyUnicode_Export() and PyUnicode_Import() C functions Author: Victor Stinner PEP-Delegate: C API Working Group Discussions-To: https://discuss.python.org/t/63891 Status: Draft Type: Standards Track Created: 13-Sep-2024 Python-Version: 3.14 Post-History: `14-Sep-2024 `__ .. highlight:: c Abstract ======== Add functions to the limited C API version 3.14: * ``PyUnicode_Export()``: export a Python str object as a ``Py_buffer`` view. * ``PyUnicode_Import()``: import a Python str object. In general, ``PyUnicode_Export()`` has an *O*\ (1) complexity: no memory copy is needed. See the :ref:`specification ` for cases when a copy is needed. Rationale ========= PEP 393 ------- :pep:`393` "Flexible String Representation" changed string internals in Python 3.3 to use three formats: * ``PyUnicode_1BYTE_KIND``: Unicode range [U+0000; U+00ff], UCS-1, 1 byte/character. * ``PyUnicode_2BYTE_KIND``: Unicode range [U+0000; U+ffff], UCS-2, 2 bytes/character. * ``PyUnicode_4BYTE_KIND``: Unicode range [U+0000; U+10ffff], UCS-4, 4 bytes/character. A Python ``str`` object must always use the most compact format. For example, a string which only contains ASCII characters must use the UCS-1 format. The ``PyUnicode_KIND()`` function can be used to know the format used by a string. One of the following functions can be used to access data: * ``PyUnicode_1BYTE_DATA()`` for ``PyUnicode_1BYTE_KIND``. * ``PyUnicode_2BYTE_DATA()`` for ``PyUnicode_2BYTE_KIND``. * ``PyUnicode_4BYTE_DATA()`` for ``PyUnicode_4BYTE_KIND``. To get the best performance, a C extension should have 3 code paths for each of these 3 string native formats. Limited C API ------------- :pep:`393` functions such as ``PyUnicode_KIND()`` and ``PyUnicode_1BYTE_DATA()`` are excluded from the limited C API. It's not possible to write code specialized for UCS formats. A C extension using the limited C API can only use less efficient code paths and string formats. For example, the MarkupSafe project has a C extension specialized for UCS formats for best performance, and so cannot use the limited C API. Specification ============= API --- Add the following API to the limited C API version 3.14:: int32_t PyUnicode_Export( PyObject *unicode, int32_t requested_formats, Py_buffer *view); PyObject* PyUnicode_Import( const void *data, Py_ssize_t nbytes, int32_t format); #define PyUnicode_FORMAT_UCS1 0x01 // Py_UCS1* #define PyUnicode_FORMAT_UCS2 0x02 // Py_UCS2* #define PyUnicode_FORMAT_UCS4 0x04 // Py_UCS4* #define PyUnicode_FORMAT_UTF8 0x08 // char* #define PyUnicode_FORMAT_ASCII 0x10 // char* (ASCII string) The ``int32_t`` type is used instead of ``int`` to have a well defined type size and not depend on the platform or the compiler. See `Avoid C-specific Types `_ for the longer rationale. PyUnicode_Export() ------------------ API:: int32_t PyUnicode_Export( PyObject *unicode, int32_t requested_formats, Py_buffer *view) Export the contents of the *unicode* string in one of the *requested_formats*. * On success, fill *view*, and return a format (greater than ``0``). * On error, set an exception, and return ``-1``. *view* is left unchanged. After a successful call to ``PyUnicode_Export()``, the *view* buffer must be released by ``PyBuffer_Release()``. The contents of the buffer are valid until they are released. The buffer is read-only and must not be modified. The ``view->len`` member must be used to get the string length. The buffer should end with a trailing NUL character, but it's not recommended to rely on that because of embedded NUL characters. *unicode* and *view* must not be NULL. Available formats: =================================== ======== =========================== Constant Identifier Value Description =================================== ======== =========================== ``PyUnicode_FORMAT_UCS1`` ``0x01`` UCS-1 string (``Py_UCS1*``) ``PyUnicode_FORMAT_UCS2`` ``0x02`` UCS-2 string (``Py_UCS2*``) ``PyUnicode_FORMAT_UCS4`` ``0x04`` UCS-4 string (``Py_UCS4*``) ``PyUnicode_FORMAT_UTF8`` ``0x08`` UTF-8 string (``char*``) ``PyUnicode_FORMAT_ASCII`` ``0x10`` ASCII string (``Py_UCS1*``) =================================== ======== =========================== UCS-2 and UCS-4 use the native byte order. *requested_formats* can be a single format or a bitwise combination of the formats in the table above. On success, the returned format will be set to a single one of the requested flags. Note that future versions of Python may introduce additional formats. .. _export-complexity: Export complexity ----------------- In general, an export has a complexity of *O*\ (1): no memory copy is needed. There are cases when a copy is needed, *O*\ (*n*) complexity: * If only UCS-2 is requested and the native format is UCS-1. * If only UCS-4 is requested and the native format is UCS-1 or UCS-2. * If only UTF-8 is requested: the string is encoded to UTF-8 at the first call, and then the encoded UTF-8 string is cached. To get the best performance on CPython and PyPy, it's recommended to support these 4 formats:: (PyUnicode_FORMAT_UCS1 \ | PyUnicode_FORMAT_UCS2 \ | PyUnicode_FORMAT_UCS4 \ | PyUnicode_FORMAT_UTF8) PyPy uses UTF-8 natively and so the ``PyUnicode_FORMAT_UTF8`` format is recommended. It requires a memory copy, since PyPy ``str`` objects can be moved in memory (PyPy uses a moving garbage collector). Py_buffer format and item size ------------------------------ ``Py_buffer`` uses the following format and item size depending on the export format: ========================== ================== ============ Export format Buffer format Item size ========================== ================== ============ ``PyUnicode_FORMAT_UCS1`` ``"B"`` 1 byte ``PyUnicode_FORMAT_UCS2`` ``"=H"`` 2 bytes ``PyUnicode_FORMAT_UCS4`` ``"=I"`` 4 bytes ``PyUnicode_FORMAT_UTF8`` ``"B"`` 1 byte ``PyUnicode_FORMAT_ASCII`` ``"B"`` 1 byte ========================== ================== ============ PyUnicode_Import() ------------------ API:: PyObject* PyUnicode_Import( const void *data, Py_ssize_t nbytes, int32_t format) Create a Unicode string object from a buffer in a supported format. * Return a reference to a new string object on success. * Set an exception and return ``NULL`` on error. *data* must not be NULL. *nbytes* must be positive or zero. See ``PyUnicode_Export()`` for the available formats. UTF-8 format ------------ CPython 3.14 doesn't use the UTF-8 format internally. The format is provided for compatibility with PyPy which uses UTF-8 natively for strings. However, in CPython, the encoded UTF-8 string is cached which makes it convenient to be exported. On CPython, the UTF-8 format has the lowest priority: ASCII and UCS formats are preferred. ASCII format ------------ When the ``PyUnicode_FORMAT_ASCII`` format is request for export, the ``PyUnicode_FORMAT_UCS1`` export format is used for ASCII and Latin-1 strings. The ``PyUnicode_FORMAT_ASCII`` format is mostly useful for ``PyUnicode_Import()`` to validate that the string only contains ASCII characters. Surrogate characters and NUL characters --------------------------------------- Surrogate characters are allowed: they can be imported and exported. For example, the UTF-8 format uses the ``surrogatepass`` error handler. Embedded NUL characters are allowed: they can be imported and exported. Implementation ============== https://github.com/python/cpython/pull/123738 Backwards Compatibility ======================= There is no impact on the backward compatibility, only new C API functions are added. Usage of PEP 393 C APIs ======================= A code search on PyPI top 7,500 projects (in March 2024) shows that there are many projects importing and exporting UCS formats with the regular C API. PyUnicode_FromKindAndData() --------------------------- 25 projects call ``PyUnicode_FromKindAndData()``: * **Cython** (3.0.9) * Levenshtein (0.25.0) * PyICU (2.12) * PyICU-binary (2.7.4) * PyQt5 (5.15.10) * PyQt6 (6.6.1) * aiocsv (1.3.1) * asyncpg (0.29.0) * biopython (1.83) * catboost (1.2.3) * cffi (1.16.0) * mojimoji (0.0.13) * mwparserfromhell (0.6.6) * numba (0.59.0) * **numpy** (1.26.4) * orjson (3.9.15) * pemja (0.4.1) * pyahocorasick (2.0.0) * pyjson5 (1.6.6) * rapidfuzz (3.6.2) * regex (2023.12.25) * srsly (2.4.8) * tokenizers (0.15.2) * ujson (5.9.0) * unicodedata2 (15.1.0) PyUnicode_4BYTE_DATA() ---------------------- 21 projects call ``PyUnicode_2BYTE_DATA()`` and/or ``PyUnicode_4BYTE_DATA()``: * **Cython** (3.0.9) * **MarkupSafe** (2.1.5) * Nuitka (2.1.2) * PyICU (2.12) * PyICU-binary (2.7.4) * PyQt5_sip (12.13.0) * PyQt6_sip (13.6.0) * biopython (1.83) * catboost (1.2.3) * cement (3.0.10) * cffi (1.16.0) * duckdb (0.10.0) * **mypy** (1.9.0) * **numpy** (1.26.4) * orjson (3.9.15) * pemja (0.4.1) * pyahocorasick (2.0.0) * pyjson5 (1.6.6) * pyobjc-core (10.2) * sip (6.8.3) * wxPython (4.2.1) Rejected Ideas ============== Reject embedded NUL characters and require trailing NUL character ----------------------------------------------------------------- In C, it's convenient to have a trailing NUL character. For example, the ``for (; *str != 0; str++)`` loop can be used to iterate on characters and ``strlen()`` can be used to get a string length. The problem is that a Python ``str`` object can embed NUL characters. Example: ``"ab\0c"``. If a string contains an embedded NUL character, code relying on the NUL character to find the string end truncates the string. It can lead to bugs, or even security vulnerabilities. See a previous discussion in the issue `Change PyUnicode_AsUTF8() to return NULL on embedded null characters `_. Rejecting embedded NUL characters require to scan the string which has an *O*\ (*n*) complexity. Reject surrogate characters --------------------------- Surrogate characters are characters in the Unicode range [U+D800; U+DFFF]. They are disallowed by UTF codecs such as UTF-8. A Python ``str`` object can contain arbitrary lone surrogate characters. Example: ``"\uDC80"``. Rejecting surrogate characters prevents exporting a string which contains such a character. It can be surprising and annoying since the ``PyUnicode_Export()`` caller doesn't control the string contents. Allowing surrogate characters allows to export any string and so avoid this issue. For example, the UTF-8 codec can be used with the ``surrogatepass`` error handler to encode and decode surrogate characters. Discussions =========== * https://discuss.python.org/t/63891 * https://github.com/capi-workgroup/decisions/issues/33 * https://github.com/python/cpython/issues/119609 Copyright ========= This document is placed in the public domain or under the CC0-1.0-Universal license, whichever is more permissive.