PEP: 756
Title: Add PyUnicode_Export() and PyUnicode_Import() C functions
Author: Victor Stinner <vstinner@python.org>
PEP-Delegate: C API Working Group
Discussions-To: https://discuss.python.org/t/63891
Status: Draft
Type: Standards Track
Created: 13-Sep-2024
Python-Version: 3.14
Post-History: `14-Sep-2024 <https://discuss.python.org/t/63891>`__

.. highlight:: c


Abstract
========

Add functions to the limited C API version 3.14:

* ``PyUnicode_Export()``: export a Python str object as a ``Py_buffer``
  view.
* ``PyUnicode_Import()``: import a Python str object.

By default, ``PyUnicode_Export()`` has an *O*\ (1) complexity: no memory
is copied. See the :ref:`specification <export-complexity>` for cases
when a copy is needed.


Rationale
=========

PEP 393
-------

:pep:`393` "Flexible String Representation" changed string internals in
Python 3.3 to use three formats:

* ``PyUnicode_1BYTE_KIND``: Unicode range [U+0000; U+00ff],
  UCS-1, 1 byte/character.
* ``PyUnicode_2BYTE_KIND``: Unicode range [U+0000; U+ffff],
  UCS-2, 2 bytes/character.
* ``PyUnicode_4BYTE_KIND``: Unicode range [U+0000; U+10ffff],
  UCS-4, 4 bytes/character.

A Python ``str`` object must always use the most compact format. For
example, a string which only contains ASCII characters must use the
UCS-1 format.
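
The "most compact format" rule can be illustrated with a small standalone
sketch. It is not CPython code: ``compact_char_size()`` is a hypothetical
helper that only mirrors the three ranges listed above and returns the
number of bytes per character that a string would need::

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical helper (not part of any Python API): return the number
       of bytes per character of the most compact PEP 393 format able to
       store all n code points. */
    int compact_char_size(const uint32_t *code_points, size_t n)
    {
        uint32_t max_cp = 0;
        for (size_t i = 0; i < n; i++) {
            if (code_points[i] > max_cp) {
                max_cp = code_points[i];
            }
        }
        if (max_cp <= 0xFF) {
            return 1;   /* fits in UCS-1 */
        }
        if (max_cp <= 0xFFFF) {
            return 2;   /* fits in UCS-2 */
        }
        return 4;       /* needs UCS-4 */
    }

Under this sketch, a string made only of ASCII characters needs 1 byte per
character, a string containing U+20AC needs 2, and a string containing
U+1F600 needs 4.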

The ``PyUnicode_KIND()`` function returns the format used by a string.

One of the following functions can be used to access data:

* ``PyUnicode_1BYTE_DATA()`` for ``PyUnicode_1BYTE_KIND``.
* ``PyUnicode_2BYTE_DATA()`` for ``PyUnicode_2BYTE_KIND``.
* ``PyUnicode_4BYTE_DATA()`` for ``PyUnicode_4BYTE_KIND``.

To get the best performance, a C extension should have 3 code paths, one
for each of these 3 native string formats.

Limited C API
-------------

:pep:`393` functions such as ``PyUnicode_KIND()`` and
``PyUnicode_1BYTE_DATA()`` are excluded from the limited C API. As a
result, code using the limited C API cannot be specialized for UCS
formats: a C extension using the limited C API can only use less
efficient code paths and string formats.

For example, the MarkupSafe project has a C extension specialized for
UCS formats for best performance, and so cannot use the limited C
API.


Specification
=============

API
---

Add the following API to the limited C API version 3.14::

    int32_t PyUnicode_Export(
        PyObject *unicode,
        int32_t requested_formats,
        Py_buffer *view);
    PyObject* PyUnicode_Import(
        const void *data,
        Py_ssize_t nbytes,
        int32_t format);

    #define PyUnicode_FORMAT_UCS1  0x01   // Py_UCS1*
    #define PyUnicode_FORMAT_UCS2  0x02   // Py_UCS2*
    #define PyUnicode_FORMAT_UCS4  0x04   // Py_UCS4*
    #define PyUnicode_FORMAT_UTF8  0x08   // char*
    #define PyUnicode_FORMAT_ASCII 0x10   // char* (ASCII string)

    #define PyUnicode_EXPORT_ALLOW_COPY 0x10000

The ``int32_t`` type is used instead of ``int`` to have a well-defined
type size that does not depend on the platform or the compiler.
See `Avoid C-specific Types
<https://github.com/capi-workgroup/api-evolution/issues/10>`_ for the
longer rationale.

PyUnicode_Export()
------------------

API::

    int32_t PyUnicode_Export(
        PyObject *unicode,
        int32_t requested_formats,
        Py_buffer *view)

Export the contents of the *unicode* string in one of the *requested_formats*.

* On success, fill *view*, and return a format (greater than ``0``).
* On error, set an exception, and return ``-1``.
  *view* is left unchanged.

After a successful call to ``PyUnicode_Export()``,
the *view* buffer must be released by ``PyBuffer_Release()``.
The contents of the buffer are valid until the buffer is released.

The buffer is read-only and must not be modified.

The ``view->len`` member must be used to get the string length. The
buffer should end with a trailing NUL character, but it's not
recommended to rely on that because of embedded NUL characters.

*unicode* and *view* must not be NULL.

Available formats:

===================================  ========  ===========================
Constant Identifier                  Value     Description
===================================  ========  ===========================
``PyUnicode_FORMAT_UCS1``            ``0x01``  UCS-1 string (``Py_UCS1*``)
``PyUnicode_FORMAT_UCS2``            ``0x02``  UCS-2 string (``Py_UCS2*``)
``PyUnicode_FORMAT_UCS4``            ``0x04``  UCS-4 string (``Py_UCS4*``)
``PyUnicode_FORMAT_UTF8``            ``0x08``  UTF-8 string (``char*``)
``PyUnicode_FORMAT_ASCII``           ``0x10``  ASCII string (``Py_UCS1*``)
===================================  ========  ===========================

UCS-2 and UCS-4 use the native byte order.

*requested_formats* can be a single format or a bitwise combination of the
formats in the table above.
On success, the returned format is exactly one of the requested
formats.
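
The bit manipulation implied by this rule can be sketched in standalone C.
The macros below are local copies of the constants proposed above, so that
the sketch compiles without ``Python.h``; ``is_single_requested_format()``
is a hypothetical helper, not part of the proposed API. It only models the
rule stated above: the result is a single format bit, and that bit was
requested::

    #include <stdint.h>

    /* Local copies of the constants proposed in this PEP, so that the
       sketch is self-contained. */
    #define PyUnicode_FORMAT_UCS1  0x01
    #define PyUnicode_FORMAT_UCS2  0x02
    #define PyUnicode_FORMAT_UCS4  0x04
    #define PyUnicode_FORMAT_UTF8  0x08
    #define PyUnicode_FORMAT_ASCII 0x10
    #define PyUnicode_EXPORT_ALLOW_COPY 0x10000

    /* Hypothetical helper: return 1 if `returned` is exactly one format
       bit that is present in `requested_formats`, 0 otherwise. */
    int is_single_requested_format(int32_t requested_formats,
                                   int32_t returned)
    {
        /* The copy flag is not a format: mask it out. */
        int32_t formats = requested_formats & ~PyUnicode_EXPORT_ALLOW_COPY;
        if (returned <= 0) {
            return 0;   /* -1 signals an error */
        }
        if ((returned & (returned - 1)) != 0) {
            return 0;   /* more than one bit is set */
        }
        return (formats & returned) != 0;
    }

For example, with ``PyUnicode_FORMAT_UCS1 | PyUnicode_FORMAT_UCS2`` as the
request, ``PyUnicode_FORMAT_UCS2`` passes this check, whereas ``0x03``
(two bits) and ``PyUnicode_FORMAT_UCS4`` (not requested) do not.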

Note that future versions of Python may introduce additional formats.

By default, no memory is copied and no conversion is done.

If the ``PyUnicode_EXPORT_ALLOW_COPY`` flag is set in
*requested_formats*, the function can copy memory to provide the
requested format and convert from one format to another.

The ``PyUnicode_EXPORT_ALLOW_COPY`` flag is required to export a string
containing surrogate characters as ``PyUnicode_FORMAT_UTF8``.

Available flags:

===============================  ===========  ===================================
Flag                             Value        Description
===============================  ===========  ===================================
``PyUnicode_EXPORT_ALLOW_COPY``  ``0x10000``  Allow memory copies and conversions
===============================  ===========  ===================================


.. _export-complexity:

Export complexity
-----------------

By default, an export has a complexity of *O*\ (1): no memory is copied
and no conversion is done. There is an exception: if only UTF-8 is
requested and the UTF-8 cache is not filled, the string is encoded to
UTF-8 to fill the cache.

If the ``PyUnicode_EXPORT_ALLOW_COPY`` flag is set, a copy is needed in
the following cases, with an *O*\ (*n*) complexity:

* If only UCS-2 is requested and the native format is UCS-1.
* If only UCS-4 is requested and the native format is UCS-1 or UCS-2.
* If only UTF-8 is requested and the string contains surrogate
  characters.
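
To see why such a conversion is *O*\ (*n*), consider the first case,
widening UCS-1 to UCS-2. The sketch below is a standalone illustration,
not the CPython implementation: ``widen_ucs1_to_ucs2()`` is a hypothetical
helper, and the caller is assumed to provide a destination array with room
for *n* items::

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical helper: copy n UCS-1 characters into a UCS-2 array in
       native byte order. Every character is read and written once, so
       the cost grows linearly with the string length. */
    void widen_ucs1_to_ucs2(const uint8_t *src, size_t n, uint16_t *dst)
    {
        for (size_t i = 0; i < n; i++) {
            dst[i] = src[i];
        }
    }

A zero-copy export, by contrast, only hands out a pointer to memory that
already exists, regardless of the string length.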

To get the best performance on CPython and PyPy, it's recommended to
support these 4 formats::

    (PyUnicode_FORMAT_UCS1 \
     | PyUnicode_FORMAT_UCS2 \
     | PyUnicode_FORMAT_UCS4 \
     | PyUnicode_FORMAT_UTF8)

PyPy uses UTF-8 natively and so the ``PyUnicode_FORMAT_UTF8`` format is
recommended. On PyPy, the export requires a memory copy, since PyPy
``str`` objects can be moved in memory (PyPy uses a moving garbage
collector).


Py_buffer format and item size
------------------------------

``Py_buffer`` uses the following format and item size depending on the
export format:

==========================  ==================  ============
Export format               Buffer format       Item size
==========================  ==================  ============
``PyUnicode_FORMAT_UCS1``   ``"B"``             1 byte
``PyUnicode_FORMAT_UCS2``   ``"=H"``            2 bytes
``PyUnicode_FORMAT_UCS4``   ``"=I"``            4 bytes
``PyUnicode_FORMAT_UTF8``   ``"B"``             1 byte
``PyUnicode_FORMAT_ASCII``  ``"B"``             1 byte
==========================  ==================  ============
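
The table can be turned into a small lookup, sketched below in standalone
C. The macros are local copies of the proposed constants, and both
functions are hypothetical helpers rather than part of the proposed API.
The item count assumes that the length is a byte count, as the ``len``
member of ``Py_buffer`` is documented for the buffer protocol; for UTF-8,
an item is a byte, not a character::

    #include <stddef.h>
    #include <stdint.h>

    /* Local copies of the constants proposed in this PEP, so that the
       sketch is self-contained. */
    #define PyUnicode_FORMAT_UCS1  0x01
    #define PyUnicode_FORMAT_UCS2  0x02
    #define PyUnicode_FORMAT_UCS4  0x04
    #define PyUnicode_FORMAT_UTF8  0x08
    #define PyUnicode_FORMAT_ASCII 0x10

    /* Hypothetical helper: item size in bytes for an export format,
       following the table above; -1 if the value is not a single known
       format. */
    int export_item_size(int32_t format)
    {
        switch (format) {
        case PyUnicode_FORMAT_UCS1:
        case PyUnicode_FORMAT_UTF8:
        case PyUnicode_FORMAT_ASCII:
            return 1;
        case PyUnicode_FORMAT_UCS2:
            return 2;
        case PyUnicode_FORMAT_UCS4:
            return 4;
        default:
            return -1;
        }
    }

    /* Hypothetical helper: number of items stored in len_bytes bytes, or
       -1 if the format is unknown or the length is negative. */
    ptrdiff_t export_item_count(ptrdiff_t len_bytes, int32_t format)
    {
        int item_size = export_item_size(format);
        if (item_size < 0 || len_bytes < 0) {
            return -1;
        }
        return len_bytes / item_size;
    }

Under these assumptions, an 8-byte UCS-4 buffer holds 2 items and a 5-byte
UTF-8 buffer holds 5 items.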


PyUnicode_Import()
------------------

API::

    PyObject* PyUnicode_Import(
        const void *data,
        Py_ssize_t nbytes,
        int32_t format)

Create a Unicode string object from a buffer in a supported format.

* Return a reference to a new string object on success.
* Set an exception and return ``NULL`` on error.

*data* must not be NULL. *nbytes* must be positive or zero.

See ``PyUnicode_Export()`` for the available formats.


UTF-8 format
------------

CPython 3.14 doesn't use the UTF-8 format internally. The format is
provided for compatibility with PyPy, which uses UTF-8 natively for
strings. However, in CPython, the encoded UTF-8 string is cached, which
makes it convenient to export.

On CPython, the UTF-8 format has the lowest priority: ASCII and UCS
formats are preferred.

ASCII format
------------

When the ``PyUnicode_FORMAT_ASCII`` format is requested for export, the
``PyUnicode_FORMAT_UCS1`` export format is used for ASCII and Latin-1
strings.

The ``PyUnicode_FORMAT_ASCII`` format is mostly useful for
``PyUnicode_Import()`` to validate that the string only contains ASCII
characters.
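
The kind of validation meant here can be sketched as a simple scan. This
is a standalone illustration with a hypothetical helper name,
``is_ascii_buffer()``, and not the check performed by CPython::

    #include <stddef.h>

    /* Hypothetical helper: return 1 if every byte is in the ASCII range
       [0x00; 0x7F], 0 otherwise. An empty buffer is considered ASCII. */
    int is_ascii_buffer(const unsigned char *data, size_t nbytes)
    {
        for (size_t i = 0; i < nbytes; i++) {
            if (data[i] > 0x7F) {
                return 0;
            }
        }
        return 1;
    }

Note that a NUL byte is a valid ASCII character under this check, whereas
a Latin-1 byte such as ``0xE9`` is not.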


Surrogate characters and embedded NUL characters
------------------------------------------------

Surrogate characters are allowed: they can be imported and exported. For
example, the UTF-8 format uses the ``surrogatepass`` error handler.

Embedded NUL characters are allowed: they can be imported and exported.


Implementation
==============

https://github.com/python/cpython/pull/123738


Backwards Compatibility
=======================

There is no impact on backward compatibility: only new C API functions
are added.


Usage of PEP 393 C APIs
=======================

A code search of the top 7,500 PyPI projects (in March 2024) shows that
many projects import and export UCS formats with the regular C API.

PyUnicode_FromKindAndData()
---------------------------

25 projects call ``PyUnicode_FromKindAndData()``:

* **Cython** (3.0.9)
* Levenshtein (0.25.0)
* PyICU (2.12)
* PyICU-binary (2.7.4)
* PyQt5 (5.15.10)
* PyQt6 (6.6.1)
* aiocsv (1.3.1)
* asyncpg (0.29.0)
* biopython (1.83)
* catboost (1.2.3)
* cffi (1.16.0)
* mojimoji (0.0.13)
* mwparserfromhell (0.6.6)
* numba (0.59.0)
* **numpy** (1.26.4)
* orjson (3.9.15)
* pemja (0.4.1)
* pyahocorasick (2.0.0)
* pyjson5 (1.6.6)
* rapidfuzz (3.6.2)
* regex (2023.12.25)
* srsly (2.4.8)
* tokenizers (0.15.2)
* ujson (5.9.0)
* unicodedata2 (15.1.0)


PyUnicode_4BYTE_DATA()
----------------------

21 projects call ``PyUnicode_2BYTE_DATA()`` and/or
``PyUnicode_4BYTE_DATA()``:

* **Cython** (3.0.9)
* **MarkupSafe** (2.1.5)
* Nuitka (2.1.2)
* PyICU (2.12)
* PyICU-binary (2.7.4)
* PyQt5_sip (12.13.0)
* PyQt6_sip (13.6.0)
* biopython (1.83)
* catboost (1.2.3)
* cement (3.0.10)
* cffi (1.16.0)
* duckdb (0.10.0)
* **mypy** (1.9.0)
* **numpy** (1.26.4)
* orjson (3.9.15)
* pemja (0.4.1)
* pyahocorasick (2.0.0)
* pyjson5 (1.6.6)
* pyobjc-core (10.2)
* sip (6.8.3)
* wxPython (4.2.1)


Rejected Ideas
==============

Reject embedded NUL characters and require trailing NUL character
-----------------------------------------------------------------

In C, it's convenient to have a trailing NUL character. For example,
the ``for (; *str != 0; str++)`` loop can be used to iterate over
characters and ``strlen()`` can be used to get a string length.

The problem is that a Python ``str`` object can embed NUL characters.
Example: ``"ab\0c"``. If a string contains an embedded NUL character,
code relying on the NUL character to find the string end truncates the
string. It can lead to bugs, or even security vulnerabilities.
See a previous discussion in the issue `Change PyUnicode_AsUTF8()
to return NULL on embedded null characters
<https://github.com/python/cpython/issues/111089>`_.

Rejecting embedded NUL characters requires scanning the string, which has
an *O*\ (*n*) complexity.
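
Both points can be illustrated with standard C only. In the sketch below,
``has_embedded_nul()`` is a hypothetical helper; it relies on an explicit
length, in the same way that ``view->len`` gives the length of an exported
buffer::

    #include <stddef.h>
    #include <string.h>

    /* Hypothetical helper: return 1 if the first len bytes of data
       contain a NUL byte, 0 otherwise. memchr() may have to inspect
       every byte, hence the O(n) cost mentioned above. */
    int has_embedded_nul(const char *data, size_t len)
    {
        return memchr(data, '\0', len) != NULL;
    }

For the 4-character string ``"ab\0c"``, ``strlen()`` returns 2 and so
silently drops ``"\0c"``, whereas ``has_embedded_nul(data, 4)`` reports
the embedded NUL.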


Reject surrogate characters
---------------------------

Surrogate characters are characters in the Unicode range [U+D800;
U+DFFF].  They are disallowed by UTF codecs such as UTF-8. A Python
``str`` object can contain arbitrary lone surrogate characters. Example:
``"\uDC80"``.

Rejecting surrogate characters prevents exporting a string which contains
such a character. It can be surprising and annoying since the
``PyUnicode_Export()`` caller doesn't control the string contents.

Allowing surrogate characters makes it possible to export any string,
which avoids this issue. For example, the UTF-8 codec can be used with
the ``surrogatepass`` error handler to encode and decode surrogate
characters.
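
What ``surrogatepass`` means for encoding can be sketched in standalone C.
The two functions below are hypothetical helpers, not the CPython codec:
the encoder applies the generic UTF-8 bit layout to every code point up to
U+10FFFF and deliberately does not reject the surrogate range, which is
the behaviour that ``surrogatepass`` enables::

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical helper: return 1 if cp is in [U+D800; U+DFFF]. */
    int is_surrogate(uint32_t cp)
    {
        return cp >= 0xD800 && cp <= 0xDFFF;
    }

    /* Hypothetical helper: encode one code point to UTF-8 without
       rejecting surrogates. Writes 1 to 4 bytes to out and returns the
       number of bytes written, or 0 if cp is above U+10FFFF. */
    size_t encode_utf8_surrogatepass(uint32_t cp, unsigned char out[4])
    {
        if (cp < 0x80) {
            out[0] = (unsigned char)cp;
            return 1;
        }
        if (cp < 0x800) {
            out[0] = (unsigned char)(0xC0 | (cp >> 6));
            out[1] = (unsigned char)(0x80 | (cp & 0x3F));
            return 2;
        }
        if (cp < 0x10000) {
            /* Surrogates fall in this branch and are encoded as-is. */
            out[0] = (unsigned char)(0xE0 | (cp >> 12));
            out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[2] = (unsigned char)(0x80 | (cp & 0x3F));
            return 3;
        }
        if (cp <= 0x10FFFF) {
            out[0] = (unsigned char)(0xF0 | (cp >> 18));
            out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[3] = (unsigned char)(0x80 | (cp & 0x3F));
            return 4;
        }
        return 0;
    }

With this sketch, the lone surrogate U+DC80 is encoded as the 3 bytes
``ED B2 80``. A strict UTF-8 encoder would call ``is_surrogate()`` first
and fail instead.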


Discussions
===========

* https://discuss.python.org/t/63891
* https://github.com/capi-workgroup/decisions/issues/33
* https://github.com/python/cpython/issues/119609

Copyright
=========

This document is placed in the public domain or under the
CC0-1.0-Universal license, whichever is more permissive.