PEP 756: Give up on copying memory (#3999)

Victor Stinner 2024-09-26 20:37:52 +02:00 committed by GitHub
parent 2a3dfe0c88
commit aced24fc35
GPG Key ID: B5690EEEBB952194
1 changed file with 47 additions and 47 deletions

@ -21,9 +21,8 @@ Add functions to the limited C API version 3.14:
view.
* ``PyUnicode_Import()``: import a Python str object.
By default, ``PyUnicode_Export()`` has an *O*\ (1) complexity: no memory
is copied. See the :ref:`specification <export-complexity>` for cases
when a copy is needed.
On CPython, ``PyUnicode_Export()`` has an *O*\ (1) complexity: no memory
is copied and no conversion is done.
Rationale
@ -67,9 +66,10 @@ possible to write code specialized for UCS formats. A C extension using
the limited C API can only use less efficient code paths and string
formats.
For example, the MarkupSafe project has a C extension specialized for
UCS formats for best performance, and so cannot use the limited C
API.
For example, the `MarkupSafe project
<https://markupsafe.palletsprojects.com/>`_ has a C extension
specialized for UCS formats for best performance, and so cannot use the
limited C API.
Specification
@ -95,8 +95,6 @@ Add the following API to the limited C API version 3.14::
#define PyUnicode_FORMAT_UTF8 0x08 // char*
#define PyUnicode_FORMAT_ASCII 0x10 // char* (ASCII string)
#define PyUnicode_EXPORT_ALLOW_COPY 0x10000
The ``int32_t`` type is used instead of ``int`` to have a well defined
type size and not depend on the platform or the compiler.
See `Avoid C-specific Types
@ -148,26 +146,12 @@ UCS-2 and UCS-4 use the native byte order.
*requested_formats* can be a single format or a bitwise combination of the
formats in the table above.
On success, the returned format will be set to a single one of the requested
flags.
formats.
Note that future versions of Python may introduce additional formats.
By default, no memory is copied and no conversion is done.
No memory is copied and no conversion is done.
If the ``PyUnicode_EXPORT_ALLOW_COPY`` flag is set in
*requested_formats*, the function can copy memory to provide the
requested format and convert from a format to another.
The ``PyUnicode_EXPORT_ALLOW_COPY`` flag is needed to export to
``PyUnicode_FORMAT_UTF8`` a string containing surrogate characters.
Available flags:
=============================== =========== ===================================
Flag Value Description
=============================== =========== ===================================
``PyUnicode_EXPORT_ALLOW_COPY`` ``0x10000`` Allow memory copies and conversions
=============================== =========== ===================================
.. _export-complexity:
@ -175,18 +159,8 @@ Flag Value Description
Export complexity
-----------------
By default, an export has a complexity of *O*\ (1): no memory is copied
and no conversion is done. There is an exception: if only UTF-8 is
requested and the UTF-8 cache is not filled, the string is encoded to
UTF-8 to fill the cache.
If the ``PyUnicode_EXPORT_ALLOW_COPY`` flag is set, there are cases when a
copy is needed, *O*\ (*n*) complexity:
* If only UCS-2 is requested and the native format is UCS-1.
* If only UCS-4 is requested and the native format is UCS-1 or UCS-2.
* If only UTF-8 is requested and the string contains surrogate
characters.
On CPython, an export has a complexity of *O*\ (1): no memory is copied
and no conversion is done.
To get the best performance on CPython and PyPy, it's recommended to
support these 4 formats::
@ -241,31 +215,29 @@ See ``PyUnicode_Export()`` for the available formats.
UTF-8 format
------------
CPython 3.14 doesn't use the UTF-8 format internally. The format is
provided for compatibility with PyPy which uses UTF-8 natively for
strings. However, in CPython, the encoded UTF-8 string is cached which
makes it convenient to be exported.
CPython 3.14 doesn't use the UTF-8 format internally and doesn't support
exporting a string as UTF-8. The ``PyUnicode_AsUTF8AndSize()`` function
can be used instead.
The ``PyUnicode_FORMAT_UTF8`` format is provided for compatibility with
alternate implementations which may use UTF-8 natively for strings.
On CPython, the UTF-8 format has the lowest priority: ASCII and UCS
formats are preferred.
ASCII format
------------
When the ``PyUnicode_FORMAT_ASCII`` format is requested for export, the
``PyUnicode_FORMAT_UCS1`` export format is used for ASCII and Latin-1
strings.
``PyUnicode_FORMAT_UCS1`` export format is used for ASCII strings.
The ``PyUnicode_FORMAT_ASCII`` format is mostly useful for
``PyUnicode_Import()`` to validate that the string only contains ASCII
``PyUnicode_Import()`` to validate that a string only contains ASCII
characters.
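The validation implied by ``PyUnicode_FORMAT_ASCII`` on import can be pictured as a scan of the buffer for bytes above ``0x7F``. The following is a minimal sketch in plain C that does not use the Python C API. The helper name ``is_ascii_buffer`` is hypothetical, and this is not claimed to be how CPython implements the check.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of the Python C API): return 1 if every
   byte of the buffer is in the ASCII range 0x00-0x7F, 0 otherwise.
   Embedded NUL bytes are accepted, since they are valid ASCII. */
static int
is_ascii_buffer(const uint8_t *data, size_t nbytes)
{
    for (size_t i = 0; i < nbytes; i++) {
        if (data[i] > 0x7F) {
            return 0;  /* non-ASCII byte found */
        }
    }
    return 1;
}
```

An empty buffer is treated as valid ASCII, and the scan costs *O*\ (*n*) in the buffer length.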
Surrogate characters and embedded NUL characters
------------------------------------------------
Surrogate characters are allowed: they can be imported and exported. For
example, the UTF-8 format uses the ``surrogatepass`` error handler.
Surrogate characters are allowed: they can be imported and exported.
Embedded NUL characters are allowed: they can be imported and exported.
@ -391,6 +363,34 @@ this issue. For example, the UTF-8 codec can be used with the
characters.
Conversions on demand
---------------------
It would be convenient to convert formats on demand. For example,
convert UCS-1 and UCS-2 to UCS-4 if an export to only UCS-4 is
requested.
The problem is that most users expect an export to require no memory
copy and no conversion: an *O*\ (1) complexity. It is better to have an
API where all operations have an *O*\ (1) complexity.
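A caller that really needs a single format can still do the conversion itself after an *O*\ (1) export, which keeps the *O*\ (*n*) cost explicit in the caller's code. The following is a minimal sketch in plain C, assuming the caller already holds a UCS-1 buffer. The helper name ``widen_ucs1_to_ucs4`` is hypothetical and is not part of the proposed API.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper (not part of the Python C API): widen a UCS-1
   buffer of `len` code points into a newly allocated UCS-4 buffer.
   O(n) time and memory. The caller must free() the result.
   Returns NULL on allocation failure or size overflow. */
static uint32_t *
widen_ucs1_to_ucs4(const uint8_t *src, size_t len)
{
    if (len > SIZE_MAX / sizeof(uint32_t)) {
        return NULL;  /* len * 4 would overflow size_t */
    }
    /* Allocate at least one element so that len == 0 still yields a
       valid pointer that can be passed to free(). */
    uint32_t *dst = malloc((len ? len : 1) * sizeof(uint32_t));
    if (dst == NULL) {
        return NULL;
    }
    for (size_t i = 0; i < len; i++) {
        dst[i] = src[i];  /* each UCS-1 code unit is a code point <= 0xFF */
    }
    return dst;
}
```

Because the allocation and the copy happen in the caller, the export itself keeps its *O*\ (1) guarantee.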
Export to UTF-8
---------------
CPython 3.14 has a cache to encode a string to UTF-8. It is tempting to
allow exporting to UTF-8.
The problem is that the UTF-8 cache doesn't support surrogate
characters. An export is expected to provide the whole string content,
including embedded NUL characters and surrogate characters. To export
surrogate characters, a different code path using the ``surrogatepass``
error handler is needed and each export operation has to allocate a
temporary buffer: *O*\ (*n*) complexity.
An export is expected to have an *O*\ (1) complexity, so the idea of
exporting to UTF-8 in CPython was abandoned.
Discussions
===========