PEP 756: PyUnicode_Export() (#3960)

2024-09-14 11:03:39 +02:00 · 2024-09-14 11:03:39 +02:00 · e82e1e3f1a
parent e660248bea
commit e82e1e3f1a
2 changed files with 378 additions and 0 deletions
--- a/.github/CODEOWNERS
+++ b/.github/CODEOWNERS
@@ -636,6 +636,7 @@ peps/pep-0753.rst  @warsaw
 # ...
 # peps/pep-0754.rst
 # ...
+peps/pep-0756.rst  @vstinner
 peps/pep-0789.rst  @njsmith
 # ...
 peps/pep-0801.rst  @warsaw
--- a/peps/pep-0756.rst
+++ b/peps/pep-0756.rst
@@ -0,0 +1,377 @@
+PEP: 756
+Title: Add PyUnicode_Export() and PyUnicode_Import() C functions
+Author: Victor Stinner <vstinner@python.org>
+PEP-Delegate: C API Working Group
+Status: Draft
+Type: Standards Track
+Created: 13-Sep-2024
+Python-Version: 3.14
+
+.. highlight:: c
+
+
+Abstract
+========
+
+Add functions to the limited C API version 3.14:
+
+* ``PyUnicode_Export()``: export a Python str object as a ``Py_buffer``
+  view.
+* ``PyUnicode_Import()``: import a Python str object.
+
+In general, ``PyUnicode_Export()`` has an *O*\ (1) complexity: no memory
+copy is needed. See the :ref:`specification <export-complexity>` for
+cases when a copy is needed.
+
+
+Rationale
+=========
+
+PEP 393
+-------
+
+:pep:`393` "Flexible String Representation" changed string internals in
+Python 3.3 to use three formats:
+
+* ``PyUnicode_1BYTE_KIND``: Unicode range [U+0000; U+00ff],
+  UCS-1, 1 byte/character.
+* ``PyUnicode_2BYTE_KIND``: Unicode range [U+0000; U+ffff],
+  UCS-2, 2 bytes/character.
+* ``PyUnicode_4BYTE_KIND``: Unicode range [U+0000; U+10ffff],
+  UCS-4, 4 bytes/character.
+
+A Python ``str`` object must always use the most compact format. For
+example, a string which only contains ASCII characters must use the
+UCS-1 format.
+
+The ``PyUnicode_KIND()`` function returns the format used by a
+string.
+
+One of the following functions can be used to access data:
+
+* ``PyUnicode_1BYTE_DATA()`` for ``PyUnicode_1BYTE_KIND``.
+* ``PyUnicode_2BYTE_DATA()`` for ``PyUnicode_2BYTE_KIND``.
+* ``PyUnicode_4BYTE_DATA()`` for ``PyUnicode_4BYTE_KIND``.
+
+To get the best performance, a C extension should have 3 code paths,
+one for each of these 3 native string formats.
+
+Limited C API
+-------------
+
+:pep:`393` functions such as ``PyUnicode_KIND()`` and
+``PyUnicode_1BYTE_DATA()`` are excluded from the limited C API. It's not
+possible to write code specialized for UCS formats. A C extension using
+the limited C API can only use less efficient code paths and string
+formats.
+
+For example, the MarkupSafe project has a C extension specialized for
+UCS formats for best performance, and so cannot use the limited C
+API.
+
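A minimal sketch of what "code specialized for UCS formats" can look like: one loop per item width, selected by a kind value. It is self-contained and deliberately avoids ``Python.h``. The ``KIND_*`` constants and the ``count_char()`` helper are hypothetical names for illustration only and belong neither to CPython nor to this PEP.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical kind values for this sketch: the item size in bytes. */
enum { KIND_1BYTE = 1, KIND_2BYTE = 2, KIND_4BYTE = 4 };

/* One specialized loop per storage width, so each loop works on a
   fixed item size. */
static size_t count_ucs1(const uint8_t *s, size_t n, uint32_t ch)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        count += ((uint32_t)s[i] == ch);
    }
    return count;
}

static size_t count_ucs2(const uint16_t *s, size_t n, uint32_t ch)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        count += ((uint32_t)s[i] == ch);
    }
    return count;
}

static size_t count_ucs4(const uint32_t *s, size_t n, uint32_t ch)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        count += (s[i] == ch);
    }
    return count;
}

/* Count occurrences of the code point `ch` in `n` items of `data`.
   Returns 0 for an unknown kind. */
size_t count_char(const void *data, size_t n, int kind, uint32_t ch)
{
    assert(data != NULL || n == 0);
    switch (kind) {
    case KIND_1BYTE: return count_ucs1(data, n, ch);
    case KIND_2BYTE: return count_ucs2(data, n, ch);
    case KIND_4BYTE: return count_ucs4(data, n, ch);
    default:         return 0;
    }
}
```

With the regular C API, the kind and data pointer would come from ``PyUnicode_KIND()`` and the ``PyUnicode_*BYTE_DATA()`` functions. An extension using the limited C API has no equivalent way to obtain them.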
+
+Specification
+=============
+
+API
+---
+
+Add the following API to the limited C API version 3.14::
+
+    int32_t PyUnicode_Export(
+        PyObject *unicode,
+        int32_t requested_formats,
+        Py_buffer *view);
+    PyObject* PyUnicode_Import(
+        const void *data,
+        Py_ssize_t nbytes,
+        int32_t format);
+
+    #define PyUnicode_FORMAT_UCS1  0x01   // Py_UCS1*
+    #define PyUnicode_FORMAT_UCS2  0x02   // Py_UCS2*
+    #define PyUnicode_FORMAT_UCS4  0x04   // Py_UCS4*
+    #define PyUnicode_FORMAT_UTF8  0x08   // char*
+    #define PyUnicode_FORMAT_ASCII 0x10   // char* (ASCII string)
+
+The ``int32_t`` type is used instead of ``int`` to have a well-defined
+type size that does not depend on the platform or the compiler.
+See `Avoid C-specific Types
+<https://github.com/capi-workgroup/api-evolution/issues/10>`_ for the
+longer rationale.
+
+PyUnicode_Export()
+------------------
+
+API: ``int32_t PyUnicode_Export(PyObject *unicode, int32_t requested_formats, Py_buffer *view)``.
+
+Export the contents of the *unicode* string in one of the *requested_formats*.
+
+* On success, fill *view*, and return a format (greater than ``0``).
+* On error, set an exception, and return ``-1``.
+  *view* is left unchanged.
+
+After a successful call to ``PyUnicode_Export()``,
+the *view* buffer must be released by ``PyBuffer_Release()``.
+The contents of the buffer are valid until the buffer is released.
+
+The buffer is read-only and must not be modified.
+
+*unicode* and *view* must not be NULL.
+
+Available formats:
+
+===================================  ========  ===========================
+Constant Identifier                  Value     Description
+===================================  ========  ===========================
+``PyUnicode_FORMAT_UCS1``            ``0x01``  UCS-1 string (``Py_UCS1*``)
+``PyUnicode_FORMAT_UCS2``            ``0x02``  UCS-2 string (``Py_UCS2*``)
+``PyUnicode_FORMAT_UCS4``            ``0x04``  UCS-4 string (``Py_UCS4*``)
+``PyUnicode_FORMAT_UTF8``            ``0x08``  UTF-8 string (``char*``)
+``PyUnicode_FORMAT_ASCII``           ``0x10``  ASCII string (``Py_UCS1*``)
+===================================  ========  ===========================
+
+UCS-2 and UCS-4 use the native byte order.
+
+*requested_formats* can be a single format or a bitwise combination of the
+formats in the table above.
+On success, the returned format is exactly one of the requested
+formats.
+
+Note that future versions of Python may introduce additional formats.
+
+.. _export-complexity:
+
+Export complexity
+-----------------
+
+In general, an export has a complexity of *O*\ (1): no memory copy is
+needed. There are cases when a copy is needed, *O*\ (*n*) complexity:
+
+* If only UCS-2 is requested and the native format is UCS-1.
+* If only UCS-4 is requested and the native format is UCS-1 or UCS-2.
+* If only UTF-8 is requested: the string is encoded to UTF-8 at the
+  first call, and then the encoded UTF-8 string is cached.
+
+To have an *O*\ (1) complexity on CPython and PyPy, it's recommended to
+support these 4 formats::
+
+    (PyUnicode_FORMAT_UCS1 \
+     | PyUnicode_FORMAT_UCS2 \
+     | PyUnicode_FORMAT_UCS4 \
+     | PyUnicode_FORMAT_UTF8)
+
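The copy rules above can be modelled in a few lines of plain C. This is a hypothetical model for illustration, not the CPython implementation: ``model_export_format()`` and its ``needs_conversion`` output are invented names, the ASCII format is left out, and picking the narrowest wider UCS format first is an assumption. The sketch also assumes that a narrower format than the native one can never be used, since the native format is already the most compact one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Format constants copied from the API section of this PEP. */
#define PyUnicode_FORMAT_UCS1  0x01
#define PyUnicode_FORMAT_UCS2  0x02
#define PyUnicode_FORMAT_UCS4  0x04
#define PyUnicode_FORMAT_UTF8  0x08

/* Hypothetical model of the format selection.
   `native` is the single UCS format the string uses internally.
   Returns the selected format, or -1 if no requested format fits.
   *needs_conversion is set to 1 when the selected format differs from
   the native one (O(n) work), and to 0 otherwise. */
int32_t model_export_format(int32_t native, int32_t requested,
                            int *needs_conversion)
{
    assert(native == PyUnicode_FORMAT_UCS1
           || native == PyUnicode_FORMAT_UCS2
           || native == PyUnicode_FORMAT_UCS4);
    assert(needs_conversion != NULL);

    *needs_conversion = 0;
    if (requested & native) {
        return native;              /* O(1): no memory copy */
    }

    /* Widening copy: try the UCS formats wider than the native one,
       narrowest first (assumption of this sketch). */
    for (int32_t fmt = native << 1; fmt <= PyUnicode_FORMAT_UCS4; fmt <<= 1) {
        if (requested & fmt) {
            *needs_conversion = 1;
            return fmt;
        }
    }

    /* UTF-8 has the lowest priority: the string must be encoded. */
    if (requested & PyUnicode_FORMAT_UTF8) {
        *needs_conversion = 1;
        return PyUnicode_FORMAT_UTF8;
    }
    return -1;
}
```

In this model, a caller that requests all four formats always gets the native format back without conversion, which matches the recommendation above.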
+
+Py_buffer format and item size
+------------------------------
+
+``Py_buffer`` uses the following format and item size depending on the
+export format:
+
+==========================  ==================  ============
+Export format               Buffer format       Item size
+==========================  ==================  ============
+``PyUnicode_FORMAT_UCS1``   ``"B"``             1 byte
+``PyUnicode_FORMAT_UCS2``   ``"H"``             2 bytes
+``PyUnicode_FORMAT_UCS4``   ``"I"`` or ``"L"``  4 bytes
+``PyUnicode_FORMAT_UTF8``   ``"B"``             1 byte
+``PyUnicode_FORMAT_ASCII``  ``"B"``             1 byte
+==========================  ==================  ============
+
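A small sketch of how a caller might turn the table above into an item count. It assumes that the buffer length is expressed in bytes, as ``Py_buffer.len`` is in the buffer protocol. ``export_itemsize()`` and ``export_item_count()`` are hypothetical helper names that are not part of the proposed API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Format constants copied from the API section of this PEP. */
#define PyUnicode_FORMAT_UCS1  0x01
#define PyUnicode_FORMAT_UCS2  0x02
#define PyUnicode_FORMAT_UCS4  0x04
#define PyUnicode_FORMAT_UTF8  0x08
#define PyUnicode_FORMAT_ASCII 0x10

/* Item size in bytes for a single export format, following the table
   above. Returns 0 for an unknown format or a combination of formats. */
size_t export_itemsize(int32_t format)
{
    switch (format) {
    case PyUnicode_FORMAT_UCS1:
    case PyUnicode_FORMAT_UTF8:
    case PyUnicode_FORMAT_ASCII:
        return 1;
    case PyUnicode_FORMAT_UCS2:
        return 2;
    case PyUnicode_FORMAT_UCS4:
        return 4;
    default:
        return 0;
    }
}

/* Number of items in a buffer of `len_bytes` bytes exported in `format`. */
size_t export_item_count(int32_t format, size_t len_bytes)
{
    size_t itemsize = export_itemsize(format);
    assert(itemsize != 0);
    assert(len_bytes % itemsize == 0);
    return len_bytes / itemsize;
}
```

For the UCS formats the item count equals the number of code points. For UTF-8 it is the number of encoded bytes, which can be larger than the number of code points.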
+
+PyUnicode_Import()
+------------------
+
+API: ``PyObject* PyUnicode_Import(const void *data, Py_ssize_t nbytes, int32_t format)``.
+
+Create a Unicode string object from a buffer in a supported format.
+
+* Return a reference to a new string object on success.
+* Set an exception and return ``NULL`` on error.
+
+*data* must not be NULL. *nbytes* must be positive or zero.
+
+See ``PyUnicode_Export()`` for the available formats.
+
+
+UTF-8 format
+------------
+
+CPython 3.14 doesn't use the UTF-8 format internally. The format is
+provided for compatibility with PyPy which uses UTF-8 natively for
+strings. However, in CPython, the encoded UTF-8 string is cached which
+makes it convenient to be exported.
+
+On CPython, the UTF-8 format has the lowest priority: ASCII and UCS
+formats are preferred.
+
+ASCII format
+------------
+
+When the ``PyUnicode_FORMAT_ASCII`` format is requested for export, the
+``PyUnicode_FORMAT_UCS1`` export format is used for ASCII strings.
+
+The ``PyUnicode_FORMAT_ASCII`` format is mostly useful for
+``PyUnicode_Import()`` to validate that the string only contains ASCII
+characters.
+
+
+Surrogate characters and NUL characters
+---------------------------------------
+
+Surrogate characters are allowed: they can be imported and exported. For
+example, the UTF-8 format uses the ``surrogatepass`` error handler.
+
+Embedded NUL characters are allowed: they can be imported and exported.
+
+An exported string does not end with a trailing NUL character: the
+``PyUnicode_Export()`` caller must use ``Py_buffer.len`` to get the
+string length.
+
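The following self-contained sketch illustrates why the explicit length matters for 1-byte data. Both helper functions are hypothetical names for illustration. The NUL-terminated variant is only safe here because it is given a C string literal; an exported buffer carries no such guarantee.

```c
#include <assert.h>
#include <stddef.h>

/* Count bytes equal to `ch` using an explicit length: embedded NUL
   bytes do not end the scan. */
size_t count_byte_len(const char *data, size_t len, char ch)
{
    assert(data != NULL || len == 0);
    size_t count = 0;
    for (size_t i = 0; i < len; i++) {
        count += (data[i] == ch);
    }
    return count;
}

/* Same count, but stop at the first NUL byte, as code written for
   NUL-terminated C strings does. `data` must be NUL-terminated. */
size_t count_byte_nul(const char *data, char ch)
{
    assert(data != NULL);
    size_t count = 0;
    for (; *data != 0; data++) {
        count += (*data == ch);
    }
    return count;
}
```

For the 5 bytes ``"ab\0cb"``, the length-based helper sees both ``b`` bytes, while the NUL-terminated helper stops after ``"ab"`` and sees only one.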
+
+Implementation
+==============
+
+https://github.com/python/cpython/pull/123738
+
+
+Backwards Compatibility
+=======================
+
+There is no impact on backward compatibility: only new C API
+functions are added.
+
+
+Open Questions
+==============
+
+* Should we guarantee that the exported buffer always ends with a NUL
+  character? Is it possible to implement it in *O*\ (1) complexity
+  in all Python implementations?
+* Is it ok to allow surrogate characters?
+* Should we add a flag to disallow embedded NUL characters? It would
+  have an *O*\ (*n*) complexity.
+* Should we add a flag to disallow surrogate characters? It would
+  have an *O*\ (*n*) complexity.
+
+
+Usage of PEP 393 C APIs
+=======================
+
+A code search on PyPI top 7,500 projects (in March 2024) shows that
+there are many projects importing and exporting UCS formats with the
+regular C API.
+
+PyUnicode_FromKindAndData()
+---------------------------
+
+25 projects call ``PyUnicode_FromKindAndData()``:
+
+* **Cython** (3.0.9)
+* Levenshtein (0.25.0)
+* PyICU (2.12)
+* PyICU-binary (2.7.4)
+* PyQt5 (5.15.10)
+* PyQt6 (6.6.1)
+* aiocsv (1.3.1)
+* asyncpg (0.29.0)
+* biopython (1.83)
+* catboost (1.2.3)
+* cffi (1.16.0)
+* mojimoji (0.0.13)
+* mwparserfromhell (0.6.6)
+* numba (0.59.0)
+* **numpy** (1.26.4)
+* orjson (3.9.15)
+* pemja (0.4.1)
+* pyahocorasick (2.0.0)
+* pyjson5 (1.6.6)
+* rapidfuzz (3.6.2)
+* regex (2023.12.25)
+* srsly (2.4.8)
+* tokenizers (0.15.2)
+* ujson (5.9.0)
+* unicodedata2 (15.1.0)
+
+
+PyUnicode_4BYTE_DATA()
+----------------------
+
+21 projects call ``PyUnicode_2BYTE_DATA()`` and/or
+``PyUnicode_4BYTE_DATA()``:
+
+* **Cython** (3.0.9)
+* **MarkupSafe** (2.1.5)
+* Nuitka (2.1.2)
+* PyICU (2.12)
+* PyICU-binary (2.7.4)
+* PyQt5_sip (12.13.0)
+* PyQt6_sip (13.6.0)
+* biopython (1.83)
+* catboost (1.2.3)
+* cement (3.0.10)
+* cffi (1.16.0)
+* duckdb (0.10.0)
+* **mypy** (1.9.0)
+* **numpy** (1.26.4)
+* orjson (3.9.15)
+* pemja (0.4.1)
+* pyahocorasick (2.0.0)
+* pyjson5 (1.6.6)
+* pyobjc-core (10.2)
+* sip (6.8.3)
+* wxPython (4.2.1)
+
+
+Rejected Ideas
+==============
+
+Reject embedded NUL characters and require trailing NUL character
+-----------------------------------------------------------------
+
+In C, it's convenient to have a trailing NUL character. For example,
+the ``for (; *str != 0; str++)`` loop can be used to iterate over
+characters and ``strlen()`` can be used to get a string length.
+
+The problem is that a Python ``str`` object can embed NUL characters.
+Example: ``"ab\0c"``. If a string contains an embedded NUL character,
+code relying on the NUL character to find the string end truncates the
+string. It can lead to bugs, or even security vulnerabilities.
+See a previous discussion in the issue `Change PyUnicode_AsUTF8()
+to return NULL on embedded null characters
+<https://github.com/python/cpython/issues/111089>`_.
+
+Rejecting embedded NUL characters requires scanning the string, which
+has an *O*\ (*n*) complexity.
+
+Reject surrogate characters
+---------------------------
+
+Surrogate characters are characters in the Unicode range [U+D800;
+U+DFFF].  They are disallowed by UTF codecs such as UTF-8. A Python
+``str`` object can contain arbitrary lone surrogate characters. Example:
+``"\uDC80"``.
+
+Rejecting surrogate characters prevents exporting a string which contains
+such a character. It can be surprising and annoying since the
+``PyUnicode_Export()`` caller doesn't control the string contents.
+
+Allowing surrogate characters makes it possible to export any string
+and so avoids this issue. For example, the UTF-8 codec can be used with
+the ``surrogatepass`` error handler to encode and decode surrogate
+characters.
+
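For reference, a check that rejects surrogate characters has to visit every code point. The sketch below shows such a linear scan over UCS-4 data. ``is_surrogate()`` and ``has_surrogate()`` are hypothetical helpers, not functions proposed by this PEP.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A code point is a surrogate if it lies in [U+D800; U+DFFF]. */
static int is_surrogate(uint32_t cp)
{
    return cp >= 0xD800 && cp <= 0xDFFF;
}

/* Return 1 if any of the `n` code points is a surrogate, else 0.
   The scan is O(n): no result is known without reading the data. */
int has_surrogate(const uint32_t *cps, size_t n)
{
    assert(cps != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (is_surrogate(cps[i])) {
            return 1;
        }
    }
    return 0;
}
```

Allowing surrogate characters on export avoids paying for this scan on every call.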
+
+Discussions
+===========
+
+* https://github.com/capi-workgroup/decisions/issues/33
+* https://github.com/python/cpython/issues/119609
+
+Copyright
+=========
+
+This document is placed in the public domain or under the
+CC0-1.0-Universal license, whichever is more permissive.
+