From e82e1e3f1a030c69ae9aaf98e35be996d7e0a2df Mon Sep 17 00:00:00 2001
From: Victor Stinner
Date: Sat, 14 Sep 2024 11:03:39 +0200
Subject: [PATCH] PEP 756: PyUnicode_Export() (#3960)

---
 .github/CODEOWNERS |   1 +
 peps/pep-0756.rst  | 377 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 378 insertions(+)
 create mode 100644 peps/pep-0756.rst

diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS
index f3cd9391a..97c17b2b0 100644
--- a/.github/CODEOWNERS
+++ b/.github/CODEOWNERS
@@ -636,6 +636,7 @@ peps/pep-0753.rst @warsaw
 # ...
 # peps/pep-0754.rst
 # ...
+peps/pep-0756.rst @vstinner
 peps/pep-0789.rst @njsmith
 # ...
 peps/pep-0801.rst @warsaw
diff --git a/peps/pep-0756.rst b/peps/pep-0756.rst
new file mode 100644
index 000000000..68942a04a
--- /dev/null
+++ b/peps/pep-0756.rst
@@ -0,0 +1,377 @@
+PEP: 756
+Title: Add PyUnicode_Export() and PyUnicode_Import() C functions
+Author: Victor Stinner
+PEP-Delegate: C API Working Group
+Status: Draft
+Type: Standards Track
+Created: 13-Sep-2024
+Python-Version: 3.14
+
+.. highlight:: c
+
+
+Abstract
+========
+
+Add functions to the limited C API version 3.14:
+
+* ``PyUnicode_Export()``: export a Python str object as a ``Py_buffer``
+  view.
+* ``PyUnicode_Import()``: import a Python str object.
+
+In general, ``PyUnicode_Export()`` has an *O*\ (1) complexity: no memory
+copy is needed. See the :ref:`specification <export-complexity>` for
+cases when a copy is needed.
+
+
+Rationale
+=========
+
+PEP 393
+-------
+
+:pep:`393` "Flexible String Representation" changed string internals in
+Python 3.3 to use three formats:
+
+* ``PyUnicode_1BYTE_KIND``: Unicode range [U+0000; U+00ff],
+  UCS-1, 1 byte/character.
+* ``PyUnicode_2BYTE_KIND``: Unicode range [U+0000; U+ffff],
+  UCS-2, 2 bytes/character.
+* ``PyUnicode_4BYTE_KIND``: Unicode range [U+0000; U+10ffff],
+  UCS-4, 4 bytes/character.
+
+A Python ``str`` object must always use the most compact format.
+For example, a string which only contains ASCII characters must use the
+UCS-1 format.
+
+The ``PyUnicode_KIND()`` function can be used to know the format used by
+a string.
+
+One of the following functions can be used to access data:
+
+* ``PyUnicode_1BYTE_DATA()`` for ``PyUnicode_1BYTE_KIND``.
+* ``PyUnicode_2BYTE_DATA()`` for ``PyUnicode_2BYTE_KIND``.
+* ``PyUnicode_4BYTE_DATA()`` for ``PyUnicode_4BYTE_KIND``.
+
+To get the best performance, a C extension should have three code
+paths, one for each of the three native string formats.
+
+Limited C API
+-------------
+
+:pep:`393` functions such as ``PyUnicode_KIND()`` and
+``PyUnicode_1BYTE_DATA()`` are excluded from the limited C API. It's not
+possible to write code specialized for UCS formats. A C extension using
+the limited C API can only use less efficient code paths and string
+formats.
+
+For example, the MarkupSafe project has a C extension specialized for
+UCS formats for best performance, and so cannot use the limited C
+API.
+
+
+Specification
+=============
+
+API
+---
+
+Add the following API to the limited C API version 3.14::
+
+    int32_t PyUnicode_Export(
+        PyObject *unicode,
+        int32_t requested_formats,
+        Py_buffer *view);
+    PyObject* PyUnicode_Import(
+        const void *data,
+        Py_ssize_t nbytes,
+        int32_t format);
+
+    #define PyUnicode_FORMAT_UCS1  0x01  // Py_UCS1*
+    #define PyUnicode_FORMAT_UCS2  0x02  // Py_UCS2*
+    #define PyUnicode_FORMAT_UCS4  0x04  // Py_UCS4*
+    #define PyUnicode_FORMAT_UTF8  0x08  // char*
+    #define PyUnicode_FORMAT_ASCII 0x10  // char* (ASCII string)
+
+The ``int32_t`` type is used instead of ``int`` to have a well defined
+type size and not depend on the platform or the compiler.
+See `Avoid C-specific Types
+`_ for the
+longer rationale.
+
+PyUnicode_Export()
+------------------
+
+API: ``int32_t PyUnicode_Export(PyObject *unicode, int32_t requested_formats, Py_buffer *view)``.
+
+Export the contents of the *unicode* string in one of the *requested_formats*.
+
+* On success, fill *view*, and return a format (greater than ``0``).
+* On error, set an exception, and return ``-1``.
+  *view* is left unchanged.
+
+After a successful call to ``PyUnicode_Export()``,
+the *view* buffer must be released by ``PyBuffer_Release()``.
+The contents of the buffer are valid until the buffer is released.
+
+The buffer is read-only and must not be modified.
+
+*unicode* and *view* must not be NULL.
+
+Available formats:
+
+=================================== ======== ===========================
+Constant Identifier                 Value    Description
+=================================== ======== ===========================
+``PyUnicode_FORMAT_UCS1``           ``0x01`` UCS-1 string (``Py_UCS1*``)
+``PyUnicode_FORMAT_UCS2``           ``0x02`` UCS-2 string (``Py_UCS2*``)
+``PyUnicode_FORMAT_UCS4``           ``0x04`` UCS-4 string (``Py_UCS4*``)
+``PyUnicode_FORMAT_UTF8``           ``0x08`` UTF-8 string (``char*``)
+``PyUnicode_FORMAT_ASCII``          ``0x10`` ASCII string (``Py_UCS1*``)
+=================================== ======== ===========================
+
+UCS-2 and UCS-4 use the native byte order.
+
+*requested_formats* can be a single format or a bitwise combination of the
+formats in the table above.
+On success, the returned format is exactly one of the requested
+formats.
+
+Note that future versions of Python may introduce additional formats.
+
+.. _export-complexity:
+
+Export complexity
+-----------------
+
+In general, an export has a complexity of *O*\ (1): no memory copy is
+needed. There are cases when a copy is needed, *O*\ (*n*) complexity:
+
+* If only UCS-2 is requested and the native format is UCS-1.
+* If only UCS-4 is requested and the native format is UCS-1 or UCS-2.
+* If only UTF-8 is requested: the string is encoded to UTF-8 at the
+  first call, and then the encoded UTF-8 string is cached.
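
The copy rules listed above can be modelled with a single bit test. The sketch below is a simplified illustration, not CPython's implementation: ``export_is_zero_copy()`` is a hypothetical helper, the native format is assumed to be one of the three UCS formats, and the UTF-8 cache and the ASCII format are ignored. The constant values are the ones given in the specification.

```c
#include <stdint.h>

/* Format constants, with the values given in the specification. */
#define PyUnicode_FORMAT_UCS1 0x01
#define PyUnicode_FORMAT_UCS2 0x02
#define PyUnicode_FORMAT_UCS4 0x04
#define PyUnicode_FORMAT_UTF8 0x08

/* Hypothetical helper (not part of the proposed API).
 *
 * native_format:     the single UCS format in which the string is stored.
 * requested_formats: bitwise OR of the formats the caller accepts.
 *
 * Returns 1 if the native buffer can be exported as is, in O(1).
 * Returns 0 if the native format was not requested, in which case the
 * export needs a conversion or fails. */
static int
export_is_zero_copy(int32_t native_format, int32_t requested_formats)
{
    return (requested_formats & native_format) != 0;
}
```

Under this model, a caller that accepts UCS-1, UCS-2 and UCS-4 gets a zero-copy export for every native format, whereas a caller that accepts only UCS-4 gets one only for strings already stored as UCS-4.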
+
+To have an *O*\ (1) complexity on CPython and PyPy, it's recommended to
+support these 4 formats::
+
+    (PyUnicode_FORMAT_UCS1 \
+     | PyUnicode_FORMAT_UCS2 \
+     | PyUnicode_FORMAT_UCS4 \
+     | PyUnicode_FORMAT_UTF8)
+
+
+Py_buffer format and item size
+------------------------------
+
+``Py_buffer`` uses the following format and item size depending on the
+export format:
+
+========================== ================== ============
+Export format              Buffer format      Item size
+========================== ================== ============
+``PyUnicode_FORMAT_UCS1``  ``"B"``            1 byte
+``PyUnicode_FORMAT_UCS2``  ``"H"``            2 bytes
+``PyUnicode_FORMAT_UCS4``  ``"I"`` or ``"L"`` 4 bytes
+``PyUnicode_FORMAT_UTF8``  ``"B"``            1 byte
+``PyUnicode_FORMAT_ASCII`` ``"B"``            1 byte
+========================== ================== ============
+
+
+PyUnicode_Import()
+------------------
+
+API: ``PyObject* PyUnicode_Import(const void *data, Py_ssize_t nbytes, int32_t format)``.
+
+Create a Unicode string object from a buffer in a supported format.
+
+* Return a reference to a new string object on success.
+* Set an exception and return ``NULL`` on error.
+
+*data* must not be NULL. *nbytes* must be positive or zero.
+
+See ``PyUnicode_Export()`` for the available formats.
+
+
+UTF-8 format
+------------
+
+CPython 3.14 doesn't use the UTF-8 format internally. The format is
+provided for compatibility with PyPy which uses UTF-8 natively for
+strings. However, in CPython, the encoded UTF-8 string is cached, which
+makes it cheap to export after the first call.
+
+On CPython, the UTF-8 format has the lowest priority: ASCII and UCS
+formats are preferred.
+
+ASCII format
+------------
+
+When the ``PyUnicode_FORMAT_ASCII`` format is requested for export, the
+``PyUnicode_FORMAT_UCS1`` export format is used for ASCII strings.
+
+The ``PyUnicode_FORMAT_ASCII`` format is mostly useful for
+``PyUnicode_Import()`` to validate that the string only contains ASCII
+characters.
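
Validating ASCII input amounts to checking that every byte is below ``0x80``. The following is a minimal sketch of such a check in plain C. ``buffer_is_ascii()`` is a hypothetical helper used only for illustration; it says nothing about how ``PyUnicode_Import()`` would actually perform the validation.

```c
#include <stddef.h>

/* Hypothetical helper (not part of the proposed API).
 * Returns 1 if the nbytes bytes starting at data are all in the ASCII
 * range [0x00; 0x7F], 0 otherwise. An empty buffer is valid ASCII.
 * Embedded NUL bytes are accepted because the length is explicit. */
static int
buffer_is_ascii(const void *data, size_t nbytes)
{
    const unsigned char *bytes = data;
    for (size_t i = 0; i < nbytes; i++) {
        if (bytes[i] > 0x7F) {
            return 0;
        }
    }
    return 1;
}
```

The scan is *O*\ (*n*) in the buffer size, which is the cost a caller accepts in exchange for the ASCII guarantee.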
+
+
+Surrogate characters and NUL characters
+---------------------------------------
+
+Surrogate characters are allowed: they can be imported and exported. For
+example, the UTF-8 format uses the ``surrogatepass`` error handler.
+
+Embedded NUL characters are allowed: they can be imported and exported.
+
+An exported string does not end with a trailing NUL character: the
+``PyUnicode_Export()`` caller must use ``Py_buffer.len`` to get the
+string length in bytes.
+
+
+Implementation
+==============
+
+https://github.com/python/cpython/pull/123738
+
+
+Backwards Compatibility
+=======================
+
+There is no impact on backward compatibility: only new C API
+functions are added.
+
+
+Open Questions
+==============
+
+* Should we guarantee that the exported buffer always ends with a NUL
+  character? Is it possible to implement it in *O*\ (1) complexity
+  in all Python implementations?
+* Is it ok to allow surrogate characters?
+* Should we add a flag to disallow embedded NUL characters? It would
+  have an *O*\ (*n*) complexity.
+* Should we add a flag to disallow surrogate characters? It would
+  have an *O*\ (*n*) complexity.
+
+
+Usage of PEP 393 C APIs
+=======================
+
+A code search on PyPI top 7,500 projects (in March 2024) shows that
+there are many projects importing and exporting UCS formats with the
+regular C API.
+
+PyUnicode_FromKindAndData()
+---------------------------
+
+25 projects call ``PyUnicode_FromKindAndData()``:
+
+* **Cython** (3.0.9)
+* Levenshtein (0.25.0)
+* PyICU (2.12)
+* PyICU-binary (2.7.4)
+* PyQt5 (5.15.10)
+* PyQt6 (6.6.1)
+* aiocsv (1.3.1)
+* asyncpg (0.29.0)
+* biopython (1.83)
+* catboost (1.2.3)
+* cffi (1.16.0)
+* mojimoji (0.0.13)
+* mwparserfromhell (0.6.6)
+* numba (0.59.0)
+* **numpy** (1.26.4)
+* orjson (3.9.15)
+* pemja (0.4.1)
+* pyahocorasick (2.0.0)
+* pyjson5 (1.6.6)
+* rapidfuzz (3.6.2)
+* regex (2023.12.25)
+* srsly (2.4.8)
+* tokenizers (0.15.2)
+* ujson (5.9.0)
+* unicodedata2 (15.1.0)
+
+
+PyUnicode_4BYTE_DATA()
+----------------------
+
+21 projects call ``PyUnicode_2BYTE_DATA()`` and/or
+``PyUnicode_4BYTE_DATA()``:
+
+* **Cython** (3.0.9)
+* **MarkupSafe** (2.1.5)
+* Nuitka (2.1.2)
+* PyICU (2.12)
+* PyICU-binary (2.7.4)
+* PyQt5_sip (12.13.0)
+* PyQt6_sip (13.6.0)
+* biopython (1.83)
+* catboost (1.2.3)
+* cement (3.0.10)
+* cffi (1.16.0)
+* duckdb (0.10.0)
+* **mypy** (1.9.0)
+* **numpy** (1.26.4)
+* orjson (3.9.15)
+* pemja (0.4.1)
+* pyahocorasick (2.0.0)
+* pyjson5 (1.6.6)
+* pyobjc-core (10.2)
+* sip (6.8.3)
+* wxPython (4.2.1)
+
+
+Rejected Ideas
+==============
+
+Reject embedded NUL characters and require trailing NUL character
+-----------------------------------------------------------------
+
+In C, it's convenient to have a trailing NUL character. For example,
+the ``for (; *str != 0; str++)`` loop can be used to iterate on
+characters and ``strlen()`` can be used to get a string length.
+
+The problem is that a Python ``str`` object can embed NUL characters.
+Example: ``"ab\0c"``. If a string contains an embedded NUL character,
+code relying on the NUL character to find the string end truncates the
+string. It can lead to bugs, or even security vulnerabilities.
+See a previous discussion in the issue `Change PyUnicode_AsUTF8()
+to return NULL on embedded null characters
+`_.
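
The truncation described above can be reproduced in plain C with the ``"ab\0c"`` example. This is only a sketch: ``length_until_nul()`` and ``count_byte()`` are hypothetical helpers written for this illustration. The first relies on the terminating NUL, the second takes an explicit length, as a ``PyUnicode_Export()`` caller would do with ``Py_buffer.len``.

```c
#include <stddef.h>
#include <string.h>

/* Length as seen by code that stops at the first NUL byte. */
static size_t
length_until_nul(const char *data)
{
    return strlen(data);
}

/* Count occurrences of ch in the first len bytes of data.
 * The explicit length means embedded NUL bytes do not end the scan. */
static size_t
count_byte(const char *data, size_t len, char ch)
{
    size_t count = 0;
    for (size_t i = 0; i < len; i++) {
        if (data[i] == ch) {
            count++;
        }
    }
    return count;
}
```

For the 4-byte string ``"ab\0c"``, ``length_until_nul()`` reports 2, so the final ``c`` is silently dropped, while ``count_byte(data, 4, 'c')`` still finds it.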
+
+Rejecting embedded NUL characters requires scanning the string, which
+has an *O*\ (*n*) complexity.
+
+Reject surrogate characters
+---------------------------
+
+Surrogate characters are characters in the Unicode range [U+D800;
+U+DFFF]. They are disallowed by UTF codecs such as UTF-8. A Python
+``str`` object can contain arbitrary lone surrogate characters. Example:
+``"\uDC80"``.
+
+Rejecting surrogate characters prevents exporting a string which contains
+such a character. It can be surprising and annoying since the
+``PyUnicode_Export()`` caller doesn't control the string contents.
+
+Allowing surrogate characters makes it possible to export any string
+and so avoids this issue. For example, the UTF-8 codec can be used with
+the ``surrogatepass`` error handler to encode and decode surrogate
+characters.
+
+
+Discussions
+===========
+
+* https://github.com/capi-workgroup/decisions/issues/33
+* https://github.com/python/cpython/issues/119609
+
+Copyright
+=========
+
+This document is placed in the public domain or under the
+CC0-1.0-Universal license, whichever is more permissive.
+