PEP 756: Give up on copying memory (#3999)
parent 2a3dfe0c88
commit aced24fc35
@@ -21,9 +21,8 @@ Add functions to the limited C API version 3.14:
 view.
 * ``PyUnicode_Import()``: import a Python str object.

-By default, ``PyUnicode_Export()`` has an *O*\ (1) complexity: no memory
-is copied. See the :ref:`specification <export-complexity>` for cases
-when a copy is needed.
+On CPython, ``PyUnicode_Export()`` has an *O*\ (1) complexity: no memory
+is copied and no conversion is done.


 Rationale
@@ -67,9 +66,10 @@ possible to write code specialized for UCS formats. A C extension using
 the limited C API can only use less efficient code paths and string
 formats.

-For example, the MarkupSafe project has a C extension specialized for
-UCS formats for best performance, and so cannot use the limited C
-API.
+For example, the `MarkupSafe project
+<https://markupsafe.palletsprojects.com/>`_ has a C extension
+specialized for UCS formats for best performance, and so cannot use the
+limited C API.


 Specification
@@ -95,8 +95,6 @@ Add the following API to the limited C API version 3.14::
 #define PyUnicode_FORMAT_UTF8 0x08 // char*
 #define PyUnicode_FORMAT_ASCII 0x10 // char* (ASCII string)
-
-#define PyUnicode_EXPORT_ALLOW_COPY 0x10000

 The ``int32_t`` type is used instead of ``int`` to have a well defined
 type size and not depend on the platform or the compiler.
 See `Avoid C-specific Types
@@ -148,26 +146,12 @@ UCS-2 and UCS-4 use the native byte order.
 *requested_formats* can be a single format or a bitwise combination of the
 formats in the table above.
 On success, the returned format will be set to a single one of the requested
-flags.
+formats.

 Note that future versions of Python may introduce additional formats.

-By default, no memory is copied and no conversion is done.
+No memory is copied and no conversion is done.

-If the ``PyUnicode_EXPORT_ALLOW_COPY`` flag is set in
-*requested_formats*, the function can copy memory to provide the
-requested format and convert from one format to another.
-
-The ``PyUnicode_EXPORT_ALLOW_COPY`` flag is needed to export to
-``PyUnicode_FORMAT_UTF8`` a string containing surrogate characters.
-
-Available flags:
-
-===============================  ===========  ===================================
-Flag                             Value        Description
-===============================  ===========  ===================================
-``PyUnicode_EXPORT_ALLOW_COPY``  ``0x10000``  Allow memory copies and conversions
-===============================  ===========  ===================================


 .. _export-complexity:
@@ -175,18 +159,8 @@ Flag Value Description
 Export complexity
 -----------------

-By default, an export has a complexity of *O*\ (1): no memory is copied
-and no conversion is done. There is an exception: if only UTF-8 is
-requested and the UTF-8 cache is not filled, the string is encoded to
-UTF-8 to fill the cache.
-
-If the ``PyUnicode_EXPORT_ALLOW_COPY`` flag is set, there are cases when a
-copy is needed, *O*\ (*n*) complexity:
-
-* If only UCS-2 is requested and the native format is UCS-1.
-* If only UCS-4 is requested and the native format is UCS-1 or UCS-2.
-* If only UTF-8 is requested and the string contains surrogate
-  characters.
+On CPython, an export has a complexity of *O*\ (1): no memory is copied
+and no conversion is done.

 To get the best performance on CPython and PyPy, it's recommended to
 support these 4 formats::
@@ -241,31 +215,29 @@ See ``PyUnicode_Export()`` for the available formats.
 UTF-8 format
 ------------

-CPython 3.14 doesn't use the UTF-8 format internally. The format is
-provided for compatibility with PyPy which uses UTF-8 natively for
-strings. However, in CPython, the encoded UTF-8 string is cached which
-makes it convenient to be exported.
+CPython 3.14 doesn't use the UTF-8 format internally and doesn't support
+exporting a string as UTF-8. The ``PyUnicode_AsUTF8AndSize()`` function
+can be used instead.
+
+The ``PyUnicode_FORMAT_UTF8`` format is provided for compatibility with
+alternate implementations which may use UTF-8 natively for strings.

-On CPython, the UTF-8 format has the lowest priority: ASCII and UCS
-formats are preferred.

 ASCII format
 ------------

 When the ``PyUnicode_FORMAT_ASCII`` format is requested for export, the
-``PyUnicode_FORMAT_UCS1`` export format is used for ASCII and Latin-1
-strings.
+``PyUnicode_FORMAT_UCS1`` export format is used for ASCII strings.

 The ``PyUnicode_FORMAT_ASCII`` format is mostly useful for
-``PyUnicode_Import()`` to validate that the string only contains ASCII
+``PyUnicode_Import()`` to validate that a string only contains ASCII
 characters.


 Surrogate characters and embedded NUL characters
 ------------------------------------------------

-Surrogate characters are allowed: they can be imported and exported. For
-example, the UTF-8 format uses the ``surrogatepass`` error handler.
+Surrogate characters are allowed: they can be imported and exported.

 Embedded NUL characters are allowed: they can be imported and exported.

@@ -391,6 +363,34 @@ this issue. For example, the UTF-8 codec can be used with the
 characters.


+Conversions on demand
+---------------------
+
+It would be convenient to convert formats on demand. For example,
+convert UCS-1 and UCS-2 to UCS-4 if an export to only UCS-4 is
+requested.
+
+The problem is that most users expect an export to require no memory
+copy and no conversion: an *O*\ (1) complexity. It is better to have an
+API where all operations have an *O*\ (1) complexity.
+
+Export to UTF-8
+---------------
+
+CPython 3.14 has a cache to encode a string to UTF-8. It is tempting to
+allow exporting to UTF-8.
+
+The problem is that the UTF-8 cache doesn't support surrogate
+characters. An export is expected to provide the whole string content,
+including embedded NUL characters and surrogate characters. To export
+surrogate characters, a different code path using the ``surrogatepass``
+error handler is needed and each export operation has to allocate a
+temporary buffer: *O*\ (*n*) complexity.
+
+An export is expected to have an *O*\ (1) complexity, so the idea to
+export UTF-8 in CPython was abandoned.
+
+
 Discussions
 ===========

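The surrogate problem described under "Export to UTF-8" can be observed from Python itself. The following is only a minimal sketch and not part of the commit: ``export_utf8`` is a hypothetical helper name, and it stands in for the rejected code path rather than any real C API function. It shows that the strict UTF-8 codec rejects a lone surrogate, while ``surrogatepass`` keeps the whole string content at the cost of building a new buffer on every call.

```python
def export_utf8(text: str) -> bytes:
    """Hypothetical helper (not a CPython API): encode every code point,
    lone surrogates and embedded NUL characters included."""
    # O(n): a new bytes object is allocated and filled on every call.
    return text.encode("utf-8", "surrogatepass")


# A string with a lone surrogate (U+DC80) and an embedded NUL character.
text = "a\udc80\0b"

try:
    # The default "strict" error handler rejects lone surrogates.
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

exported = export_utf8(text)
# Decoding with the same error handler restores the original string.
round_trip = exported.decode("utf-8", "surrogatepass")
```

Under these assumptions, the strict encoding fails while the ``surrogatepass`` encoding round-trips the full content, which is why such an export cannot stay within *O*\ (1).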