PEP 756: Add PyUnicode_EXPORT_ALLOW_COPY flag (#3988)

Victor Stinner 2024-09-24 23:03:10 +02:00 committed by GitHub
parent 680c8b1c13
commit f085d19db9
1 changed files with 35 additions and 9 deletions

@@ -21,9 +21,9 @@ Add functions to the limited C API version 3.14:
 view.
 * ``PyUnicode_Import()``: import a Python str object.
-In general, ``PyUnicode_Export()`` has an *O*\ (1) complexity: no memory
-copy is needed. See the :ref:`specification <export-complexity>` for
-cases when a copy is needed.
+By default, ``PyUnicode_Export()`` has an *O*\ (1) complexity: no memory
+is copied. See the :ref:`specification <export-complexity>` for cases
+when a copy is needed.
 Rationale
@@ -95,6 +95,8 @@ Add the following API to the limited C API version 3.14::
 #define PyUnicode_FORMAT_UTF8 0x08 // char*
 #define PyUnicode_FORMAT_ASCII 0x10 // char* (ASCII string)
+#define PyUnicode_EXPORT_ALLOW_COPY 0x10000
 The ``int32_t`` type is used instead of ``int`` to have a well defined
 type size and not depend on the platform or the compiler.
 See `Avoid C-specific Types
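The hunk above places ``PyUnicode_EXPORT_ALLOW_COPY`` at bit 16, above the format bits shown, so a caller can OR it into *requested_formats* and an implementation can mask it back out. A minimal sketch of that bit arithmetic, assuming only the three constant values quoted in this hunk. The two helper functions are hypothetical names used for illustration and are not part of the PEP or of ``Python.h``:

```c
#include <assert.h>
#include <stdint.h>

/* Constant values copied from the PEP text quoted in this diff. */
#define PyUnicode_FORMAT_UTF8       0x08
#define PyUnicode_FORMAT_ASCII      0x10
#define PyUnicode_EXPORT_ALLOW_COPY 0x10000

/* Hypothetical helper: did the caller permit copies and conversions? */
static int export_allows_copy(int32_t requested_formats)
{
    return (requested_formats & PyUnicode_EXPORT_ALLOW_COPY) != 0;
}

/* Hypothetical helper: the requested format bits with the flag removed. */
static int32_t export_format_bits(int32_t requested_formats)
{
    int32_t bits = requested_formats & ~(int32_t)PyUnicode_EXPORT_ALLOW_COPY;
    /* The flag must never leak into the format bits. */
    assert((bits & PyUnicode_EXPORT_ALLOW_COPY) == 0);
    return bits;
}
```

A caller that accepts UTF-8 and tolerates a copy would pass ``PyUnicode_FORMAT_UTF8 | PyUnicode_EXPORT_ALLOW_COPY``. Without the flag, the same request stays on the no-copy path described in the next hunk.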
@@ -150,18 +152,41 @@ flags.
 Note that future versions of Python may introduce additional formats.
+By default, no memory is copied and no conversion is done.
+If the ``PyUnicode_EXPORT_ALLOW_COPY`` flag is set in
+*requested_formats*, the function can copy memory to provide the
+requested format and convert from a format to another.
+The ``PyUnicode_EXPORT_ALLOW_COPY`` flag is needed to export to
+``PyUnicode_FORMAT_UTF8`` a string containing surrogate characters.
+Available flags:
+=============================== =========== ===================================
+Flag                            Value       Description
+=============================== =========== ===================================
+``PyUnicode_EXPORT_ALLOW_COPY`` ``0x10000`` Allow memory copies and conversions
+=============================== =========== ===================================
 .. _export-complexity:
 Export complexity
 -----------------
-In general, an export has a complexity of *O*\ (1): no memory copy is
-needed. There are cases when a copy is needed, *O*\ (*n*) complexity:
+By default, an export has a complexity of *O*\ (1): no memory is copied
+and no conversion is done. There is an exception: if only UTF-8 is
+requested and the UTF-8 cache is not filled, the string is encoded to
+UTF-8 to fill the cache.
+If the ``PyUnicode_EXPORT_ALLOW_COPY`` flag is set, there are cases when a
+copy is needed, *O*\ (*n*) complexity:
 * If only UCS-2 is requested and the native format is UCS-1.
 * If only UCS-4 is requested and the native format is UCS-1 or UCS-2.
-* If only UTF-8 is requested: the string is encoded to UTF-8 at the
-first call, and then the encoded UTF-8 string is cached.
+* If only UTF-8 is requested and the string contains surrogate
+characters.
 To get the best performance on CPython and PyPy, it's recommended to
 support these 4 formats::
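The three copy cases listed in this hunk can be modeled as a small predicate. This is only a sketch of the rules as the new text states them, assuming the ``PyUnicode_EXPORT_ALLOW_COPY`` flag is set and exactly one format is requested. The enums and the function name are hypothetical and do not appear in the PEP:

```c
#include <stdbool.h>

/* Hypothetical model of the string's native storage. */
enum native_kind { NATIVE_UCS1, NATIVE_UCS2, NATIVE_UCS4 };

/* Hypothetical model of "only this format is requested". */
enum sole_request { ONLY_UCS2, ONLY_UCS4, ONLY_UTF8 };

/* Returns true when the request matches one of the three O(n) copy
 * cases listed above. Assumes the caller set the ALLOW_COPY flag. */
static bool is_listed_copy_case(enum native_kind native,
                                enum sole_request only,
                                bool has_surrogates)
{
    switch (only) {
    case ONLY_UCS2:
        /* UCS-1 data must be widened to UCS-2. */
        return native == NATIVE_UCS1;
    case ONLY_UCS4:
        /* UCS-1 or UCS-2 data must be widened to UCS-4. */
        return native == NATIVE_UCS1 || native == NATIVE_UCS2;
    case ONLY_UTF8:
        /* Strings with surrogate characters need the flag for UTF-8. */
        return has_surrogates;
    }
    return false;
}
```

The model deliberately leaves out the default-mode exception, where a UTF-8-only request fills the UTF-8 cache on first use, and it says nothing about requests that fall outside the three listed cases.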
@@ -236,8 +261,8 @@ The ``PyUnicode_FORMAT_ASCII`` format is mostly useful for
 characters.
-Surrogate characters and NUL characters
----------------------------------------
+Surrogate characters and embedded NUL characters
+------------------------------------------------
 Surrogate characters are allowed: they can be imported and exported. For
 example, the UTF-8 format uses the ``surrogatepass`` error handler.
@@ -347,6 +372,7 @@ to return NULL on embedded null characters
 Rejecting embedded NUL characters require to scan the string which has
 an *O*\ (*n*) complexity.
 Reject surrogate characters
 ---------------------------