PEP: 756
Title: Add PyUnicode_Export() and PyUnicode_Import() C functions
Author: Victor Stinner <vstinner@python.org>
PEP-Delegate: C API Working Group
Status: Draft
Type: Standards Track
Created: 13-Sep-2024
Python-Version: 3.14

.. highlight:: c


Abstract
========

Add functions to the limited C API version 3.14:

* ``PyUnicode_Export()``: export a Python str object as a ``Py_buffer``
  view.
* ``PyUnicode_Import()``: create a Python str object from a buffer.

In general, ``PyUnicode_Export()`` has an *O*\ (1) complexity: no memory
copy is needed. See the :ref:`specification <export-complexity>` for
cases when a copy is needed.


Rationale
=========

PEP 393
-------

:pep:`393` "Flexible String Representation" changed string internals in
Python 3.3 to use three formats:

* ``PyUnicode_1BYTE_KIND``: Unicode range [U+0000; U+00ff],
  UCS-1, 1 byte/character.
* ``PyUnicode_2BYTE_KIND``: Unicode range [U+0000; U+ffff],
  UCS-2, 2 bytes/character.
* ``PyUnicode_4BYTE_KIND``: Unicode range [U+0000; U+10ffff],
  UCS-4, 4 bytes/character.

A Python ``str`` object must always use the most compact format. For
example, a string which only contains ASCII characters must use the
UCS-1 format.

The ``PyUnicode_KIND()`` function returns the format used by a string.

One of the following functions can be used to access data:

* ``PyUnicode_1BYTE_DATA()`` for ``PyUnicode_1BYTE_KIND``.
* ``PyUnicode_2BYTE_DATA()`` for ``PyUnicode_2BYTE_KIND``.
* ``PyUnicode_4BYTE_DATA()`` for ``PyUnicode_4BYTE_KIND``.

To get the best performance, a C extension should have three code paths,
one for each of these three native string formats.
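
The "most compact format" rule can be illustrated without any Python
header. The following standalone sketch picks the number of bytes per
character from the largest code point of a buffer. ``compact_kind()`` is
a hypothetical helper written for this illustration, not a CPython
function; its return values 1, 2 and 4 only mirror the item sizes of the
three kinds listed above.

.. code-block:: c

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical helper: return the bytes per character (1, 2 or 4) of
       the most compact PEP 393 format able to store every code point. */
    int compact_kind(const uint32_t *chars, size_t n)
    {
        uint32_t max_char = 0;
        for (size_t i = 0; i < n; i++) {
            if (chars[i] > max_char) {
                max_char = chars[i];
            }
        }
        if (max_char <= 0xff) {
            return 1;   /* UCS-1: [U+0000; U+00ff] */
        }
        if (max_char <= 0xffff) {
            return 2;   /* UCS-2: [U+0000; U+ffff] */
        }
        return 4;       /* UCS-4: [U+0000; U+10ffff] */
    }

A single character above U+00ff is enough to move a whole string to a
wider format, which is why specialized extensions keep one code path per
format.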

Limited C API
-------------

:pep:`393` functions such as ``PyUnicode_KIND()`` and
``PyUnicode_1BYTE_DATA()`` are excluded from the limited C API. It's not
possible to write code specialized for UCS formats. A C extension using
the limited C API can only use less efficient code paths and string
formats.

For example, the MarkupSafe project has a C extension specialized for
UCS formats for best performance, and so cannot use the limited C
API.


Specification
=============

API
---

Add the following API to the limited C API version 3.14::

    int32_t PyUnicode_Export(
        PyObject *unicode,
        int32_t requested_formats,
        Py_buffer *view);
    PyObject* PyUnicode_Import(
        const void *data,
        Py_ssize_t nbytes,
        int32_t format);

    #define PyUnicode_FORMAT_UCS1  0x01   // Py_UCS1*
    #define PyUnicode_FORMAT_UCS2  0x02   // Py_UCS2*
    #define PyUnicode_FORMAT_UCS4  0x04   // Py_UCS4*
    #define PyUnicode_FORMAT_UTF8  0x08   // char*
    #define PyUnicode_FORMAT_ASCII 0x10   // char* (ASCII string)

The ``int32_t`` type is used instead of ``int`` to have a well defined
type size and not depend on the platform or the compiler.
See `Avoid C-specific Types
<https://github.com/capi-workgroup/api-evolution/issues/10>`_ for the
longer rationale.

PyUnicode_Export()
------------------

API: ``int32_t PyUnicode_Export(PyObject *unicode, int32_t requested_formats, Py_buffer *view)``.

Export the contents of the *unicode* string in one of the *requested_formats*.

* On success, fill *view*, and return a format (greater than ``0``).
* On error, set an exception, and return ``-1``.
  *view* is left unchanged.

After a successful call to ``PyUnicode_Export()``,
the *view* buffer must be released by ``PyBuffer_Release()``.
The contents of the buffer are valid until it is released.

The buffer is read-only and must not be modified.

*unicode* and *view* must not be NULL.

Available formats:

==========================  ========  ===========================
Constant Identifier         Value     Description
==========================  ========  ===========================
``PyUnicode_FORMAT_UCS1``   ``0x01``  UCS-1 string (``Py_UCS1*``)
``PyUnicode_FORMAT_UCS2``   ``0x02``  UCS-2 string (``Py_UCS2*``)
``PyUnicode_FORMAT_UCS4``   ``0x04``  UCS-4 string (``Py_UCS4*``)
``PyUnicode_FORMAT_UTF8``   ``0x08``  UTF-8 string (``char*``)
``PyUnicode_FORMAT_ASCII``  ``0x10``  ASCII string (``Py_UCS1*``)
==========================  ========  ===========================

UCS-2 and UCS-4 use the native byte order.

*requested_formats* can be a single format or a bitwise combination of the
formats in the table above.
On success, the returned format is exactly one of the requested formats.

Note that future versions of Python may introduce additional formats.
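
Since the return value is a single flag taken from *requested_formats*,
a caller may want to validate it before dispatching on it. The following
standalone sketch shows such a check. ``is_single_requested_format()`` is
a hypothetical helper, and the ``FMT_*`` constants are local copies of
the values from the table above, defined here only so that the example
compiles without ``Python.h``.

.. code-block:: c

    #include <stdbool.h>
    #include <stdint.h>

    /* Local copies of the values proposed for PyUnicode_FORMAT_*. */
    #define FMT_UCS1  0x01
    #define FMT_UCS2  0x02
    #define FMT_UCS4  0x04
    #define FMT_UTF8  0x08
    #define FMT_ASCII 0x10

    /* Hypothetical helper: true if `returned` has exactly one bit set
       and that bit is part of the `requested` bit mask. */
    bool is_single_requested_format(int32_t requested, int32_t returned)
    {
        if (returned <= 0) {
            return false;   /* -1 signals an error, 0 is not a format */
        }
        bool one_bit = (returned & (returned - 1)) == 0;
        return one_bit && (requested & returned) == returned;
    }

A caller that requests ``FMT_UCS1 | FMT_UCS2`` can then treat any other
result as unexpected instead of reading the buffer with the wrong item
size.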

.. _export-complexity:

Export complexity
-----------------

In general, an export has a complexity of *O*\ (1): no memory copy is
needed. There are cases when a copy is needed, *O*\ (*n*) complexity:

* If only UCS-2 is requested and the native format is UCS-1.
* If only UCS-4 is requested and the native format is UCS-1 or UCS-2.
* If only UTF-8 is requested: the string is encoded to UTF-8 at the
  first call, and then the encoded UTF-8 string is cached.

To have an *O*\ (1) complexity on CPython and PyPy, it's recommended to
support these 4 formats::

    (PyUnicode_FORMAT_UCS1 \
     | PyUnicode_FORMAT_UCS2 \
     | PyUnicode_FORMAT_UCS4 \
     | PyUnicode_FORMAT_UTF8)
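
The rules above can be modelled by a small standalone function. This is
a simplified sketch, not the CPython implementation:
``pick_export_format()`` and the ``FMT_*`` constants are hypothetical
names, the ASCII format is ignored, and the choice of the narrowest
wider UCS format when several are requested is an assumption of this
example.

.. code-block:: c

    #include <stdbool.h>
    #include <stdint.h>

    /* Local copies of the values proposed for PyUnicode_FORMAT_*. */
    #define FMT_UCS1 0x01
    #define FMT_UCS2 0x02
    #define FMT_UCS4 0x04
    #define FMT_UTF8 0x08

    /* Hypothetical model. `native` is FMT_UCS1, FMT_UCS2 or FMT_UCS4.
       Return the selected format, or -1 if no requested format can
       represent the string. *needs_copy is set to true when O(n) work
       is needed (for UTF-8: on the first call, before caching). */
    int32_t pick_export_format(int32_t native, int32_t requested,
                               bool *needs_copy)
    {
        *needs_copy = false;
        if (requested & native) {
            return native;              /* zero-copy export */
        }
        /* Try wider UCS formats: the characters are widened in a copy. */
        for (int32_t f = native << 1; f <= FMT_UCS4; f <<= 1) {
            if (requested & f) {
                *needs_copy = true;
                return f;
            }
        }
        if (requested & FMT_UTF8) {
            *needs_copy = true;         /* encode to UTF-8 */
            return FMT_UTF8;
        }
        return -1;                      /* e.g. UCS-2 string, UCS-1 only */
    }

With the recommended 4-format mask, the first branch is always taken in
this model, whatever the native format is.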


Py_buffer format and item size
------------------------------

``Py_buffer`` uses the following format and item size depending on the
export format:

==========================  ==================  ============
Export format               Buffer format       Item size
==========================  ==================  ============
``PyUnicode_FORMAT_UCS1``   ``"B"``             1 byte
``PyUnicode_FORMAT_UCS2``   ``"H"``             2 bytes
``PyUnicode_FORMAT_UCS4``   ``"I"`` or ``"L"``  4 bytes
``PyUnicode_FORMAT_UTF8``   ``"B"``             1 byte
``PyUnicode_FORMAT_ASCII``  ``"B"``             1 byte
==========================  ==================  ============
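
A consumer typically turns a byte length into a number of items using
the item size from the table above. The sketch below does this in plain
C. ``format_item_size()`` and ``item_count()`` are hypothetical helpers,
and the ``FMT_*`` constants are local copies of the proposed values.

.. code-block:: c

    #include <stddef.h>
    #include <stdint.h>

    /* Local copies of the values proposed for PyUnicode_FORMAT_*. */
    #define FMT_UCS1  0x01
    #define FMT_UCS2  0x02
    #define FMT_UCS4  0x04
    #define FMT_UTF8  0x08
    #define FMT_ASCII 0x10

    /* Item size in bytes of an export format, or 0 if unknown. */
    size_t format_item_size(int32_t format)
    {
        switch (format) {
        case FMT_UCS1:
        case FMT_UTF8:
        case FMT_ASCII:
            return 1;
        case FMT_UCS2:
            return 2;
        case FMT_UCS4:
            return 4;
        default:
            return 0;
        }
    }

    /* Number of items in a buffer of `nbytes` bytes, or -1 if the format
       is unknown or the length is not a multiple of the item size. */
    ptrdiff_t item_count(size_t nbytes, int32_t format)
    {
        size_t size = format_item_size(format);
        if (size == 0 || nbytes % size != 0) {
            return -1;
        }
        return (ptrdiff_t)(nbytes / size);
    }

For the UCS formats the item count is the number of code points. For
UTF-8 it is a number of bytes, since one code point can take several
bytes.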


PyUnicode_Import()
------------------

API: ``PyObject* PyUnicode_Import(const void *data, Py_ssize_t nbytes, int32_t format)``.

Create a Unicode string object from a buffer in a supported format.

* Return a reference to a new string object on success.
* Set an exception and return ``NULL`` on error.

*data* must not be NULL. *nbytes* must be positive or zero.

See ``PyUnicode_Export()`` for the available formats.


UTF-8 format
------------

CPython 3.14 doesn't use the UTF-8 format internally. The format is
provided for compatibility with PyPy, which uses UTF-8 natively for
strings. However, in CPython, the encoded UTF-8 string is cached, which
makes it convenient to export.

On CPython, the UTF-8 format has the lowest priority: ASCII and UCS
formats are preferred.

ASCII format
------------

When the ``PyUnicode_FORMAT_ASCII`` format is requested for export, the
``PyUnicode_FORMAT_UCS1`` export format is used for ASCII and Latin-1
strings.

The ``PyUnicode_FORMAT_ASCII`` format is mostly useful for
``PyUnicode_Import()`` to validate that the string only contains ASCII
characters.
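
The validation mentioned above amounts to checking that no byte is
greater than 0x7f. The following standalone sketch shows such an
*O*\ (*n*) check. ``buffer_is_ascii()`` is a hypothetical helper; how
``PyUnicode_Import()`` performs the check internally is not specified
here.

.. code-block:: c

    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical helper: true if all bytes are in [0x00; 0x7f]. */
    bool buffer_is_ascii(const unsigned char *data, size_t nbytes)
    {
        for (size_t i = 0; i < nbytes; i++) {
            if (data[i] > 0x7f) {
                return false;   /* non-ASCII byte, e.g. Latin-1 0xe9 */
            }
        }
        return true;
    }

An empty buffer is considered ASCII by this sketch.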


Surrogate characters and NUL characters
---------------------------------------

Surrogate characters are allowed: they can be imported and exported. For
example, the UTF-8 format uses the ``surrogatepass`` error handler.

Embedded NUL characters are allowed: they can be imported and exported.

An exported string does not end with a trailing NUL character: the
``PyUnicode_Export()`` caller must use ``Py_buffer.len`` to get the
string length.
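
To make the ``surrogatepass`` behaviour concrete, the sketch below
encodes a single code point of the range [U+0000; U+FFFF] to UTF-8
without rejecting surrogates. It is an illustration only, not the
CPython encoder, and ``utf8_encode_bmp_surrogatepass()`` is a
hypothetical name.

.. code-block:: c

    #include <stddef.h>
    #include <stdint.h>

    /* Encode one code point in [U+0000; U+FFFF] as UTF-8 and accept
       surrogates. Return the number of bytes written to `out` (1 to 3),
       or 0 if `cp` is outside the supported range. */
    size_t utf8_encode_bmp_surrogatepass(uint32_t cp, unsigned char out[3])
    {
        if (cp <= 0x7f) {
            out[0] = (unsigned char)cp;
            return 1;
        }
        if (cp <= 0x7ff) {
            out[0] = (unsigned char)(0xc0 | (cp >> 6));
            out[1] = (unsigned char)(0x80 | (cp & 0x3f));
            return 2;
        }
        if (cp <= 0xffff) {
            /* A strict encoder would reject [U+D800; U+DFFF] here. */
            out[0] = (unsigned char)(0xe0 | (cp >> 12));
            out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
            out[2] = (unsigned char)(0x80 | (cp & 0x3f));
            return 3;
        }
        return 0;
    }

With this sketch, the lone surrogate U+DC80 is encoded as the three
bytes ``ED B2 80``.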


Implementation
==============

https://github.com/python/cpython/pull/123738


Backwards Compatibility
=======================

There is no impact on backward compatibility: only new C API
functions are added.


Open Questions
==============

* Should we guarantee that the exported buffer always ends with a NUL
  character? Is it possible to implement it in *O*\ (1) complexity
  in all Python implementations?
* Is it ok to allow surrogate characters?
* Should we add a flag to disallow embedded NUL characters? It would
  have an *O*\ (*n*) complexity.
* Should we add a flag to disallow surrogate characters? It would
  have an *O*\ (*n*) complexity.


Usage of PEP 393 C APIs
=======================

A code search of the top 7,500 PyPI projects (in March 2024) shows that
many projects import and export UCS formats with the regular C API.

PyUnicode_FromKindAndData()
---------------------------

25 projects call ``PyUnicode_FromKindAndData()``:

* **Cython** (3.0.9)
* Levenshtein (0.25.0)
* PyICU (2.12)
* PyICU-binary (2.7.4)
* PyQt5 (5.15.10)
* PyQt6 (6.6.1)
* aiocsv (1.3.1)
* asyncpg (0.29.0)
* biopython (1.83)
* catboost (1.2.3)
* cffi (1.16.0)
* mojimoji (0.0.13)
* mwparserfromhell (0.6.6)
* numba (0.59.0)
* **numpy** (1.26.4)
* orjson (3.9.15)
* pemja (0.4.1)
* pyahocorasick (2.0.0)
* pyjson5 (1.6.6)
* rapidfuzz (3.6.2)
* regex (2023.12.25)
* srsly (2.4.8)
* tokenizers (0.15.2)
* ujson (5.9.0)
* unicodedata2 (15.1.0)


PyUnicode_4BYTE_DATA()
----------------------

21 projects call ``PyUnicode_2BYTE_DATA()`` and/or
``PyUnicode_4BYTE_DATA()``:

* **Cython** (3.0.9)
* **MarkupSafe** (2.1.5)
* Nuitka (2.1.2)
* PyICU (2.12)
* PyICU-binary (2.7.4)
* PyQt5_sip (12.13.0)
* PyQt6_sip (13.6.0)
* biopython (1.83)
* catboost (1.2.3)
* cement (3.0.10)
* cffi (1.16.0)
* duckdb (0.10.0)
* **mypy** (1.9.0)
* **numpy** (1.26.4)
* orjson (3.9.15)
* pemja (0.4.1)
* pyahocorasick (2.0.0)
* pyjson5 (1.6.6)
* pyobjc-core (10.2)
* sip (6.8.3)
* wxPython (4.2.1)


Rejected Ideas
==============

Reject embedded NUL characters and require trailing NUL character
-----------------------------------------------------------------

In C, it's convenient to have a trailing NUL character. For example,
the ``for (; *str != 0; str++)`` loop can be used to iterate over
characters and ``strlen()`` can be used to get a string length.

The problem is that a Python ``str`` object can embed NUL characters.
Example: ``"ab\0c"``. If a string contains an embedded NUL character,
code relying on the NUL character to find the string end truncates the
string. It can lead to bugs, or even security vulnerabilities.
See a previous discussion in the issue `Change PyUnicode_AsUTF8()
to return NULL on embedded null characters
<https://github.com/python/cpython/issues/111089>`_.

Rejecting embedded NUL characters requires scanning the string, which has
an *O*\ (*n*) complexity.
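
The truncation problem and the cost of the check can both be shown in
plain C. In the sketch below, ``has_embedded_nul()`` is a hypothetical
helper which scans the buffer with the standard ``memchr()`` function.

.. code-block:: c

    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    /* Hypothetical helper: O(n) scan for a NUL byte in the first
       `nbytes` bytes of `data`. */
    bool has_embedded_nul(const char *data, size_t nbytes)
    {
        return memchr(data, '\0', nbytes) != NULL;
    }

For the array ``const char s[] = "ab\0c"``, ``strlen(s)`` is 2 whereas
the string has 4 characters (``sizeof(s) - 1``): code relying on
``strlen()`` silently drops ``"\0c"``.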

Reject surrogate characters
---------------------------

Surrogate characters are characters in the Unicode range [U+D800;
U+DFFF]. They are disallowed by UTF codecs such as UTF-8. A Python
``str`` object can contain arbitrary lone surrogate characters. Example:
``"\uDC80"``.

Rejecting surrogate characters prevents exporting a string which contains
such a character. It can be surprising and annoying since the
``PyUnicode_Export()`` caller doesn't control the string contents.

Allowing surrogate characters makes it possible to export any string and
so avoids this issue. For example, the UTF-8 codec can be used with the
``surrogatepass`` error handler to encode and decode surrogate
characters.
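
A caller which needs to reject surrogate characters can do the
*O*\ (*n*) check itself on a UCS-4 buffer. The following standalone
sketch assumes the code points are available as ``uint32_t`` values.
``is_surrogate()`` and ``contains_surrogate()`` are hypothetical helpers.

.. code-block:: c

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* True if `ch` is in the surrogate range [U+D800; U+DFFF]. */
    bool is_surrogate(uint32_t ch)
    {
        return ch >= 0xd800 && ch <= 0xdfff;
    }

    /* O(n) scan: true if at least one code point is a surrogate. */
    bool contains_surrogate(const uint32_t *chars, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            if (is_surrogate(chars[i])) {
                return true;
            }
        }
        return false;
    }

Keeping this check on the caller side leaves the export itself free of
any scan.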


Discussions
===========

* https://github.com/capi-workgroup/decisions/issues/33
* https://github.com/python/cpython/issues/119609

Copyright
=========

This document is placed in the public domain or under the
CC0-1.0-Universal license, whichever is more permissive.