PEP 756: Add PyUnicode_EXPORT_ALLOW_COPY flag (#3988)

2024-09-24 23:03:10 +02:00 · 2024-09-24 23:03:10 +02:00 · f085d19db9
parent 680c8b1c13
commit f085d19db9
1 changed file with 35 additions and 9 deletions
--- a/peps/pep-0756.rst
+++ b/peps/pep-0756.rst
@@ -21,9 +21,9 @@ Add functions to the limited C API version 3.14:
  view.
 * ``PyUnicode_Import()``: import a Python str object.

-In general, ``PyUnicode_Export()`` has an *O*\ (1) complexity: no memory
-copy is needed. See the :ref:`specification <export-complexity>` for
-cases when a copy is needed.
+By default, ``PyUnicode_Export()`` has an *O*\ (1) complexity: no memory
+is copied. See the :ref:`specification <export-complexity>` for cases
+when a copy is needed.


 Rationale
@@ -95,6 +95,8 @@ Add the following API to the limited C API version 3.14::
    #define PyUnicode_FORMAT_UTF8  0x08   // char*
    #define PyUnicode_FORMAT_ASCII 0x10   // char* (ASCII string)

+    #define PyUnicode_EXPORT_ALLOW_COPY 0x10000
+
 The ``int32_t`` type is used instead of ``int`` to have a well defined
 type size and not depend on the platform or the compiler.
 See `Avoid C-specific Types
@@ -150,18 +152,41 @@ flags.

 Note that future versions of Python may introduce additional formats.

+By default, no memory is copied and no conversion is done.
+
+If the ``PyUnicode_EXPORT_ALLOW_COPY`` flag is set in
+*requested_formats*, the function can copy memory to provide the
+requested format and convert from one format to another.
+
+The ``PyUnicode_EXPORT_ALLOW_COPY`` flag is needed to export to
+``PyUnicode_FORMAT_UTF8`` a string containing surrogate characters.
+
+Available flags:
+
+===============================  ===========  ===================================
+Flag                             Value        Description
+===============================  ===========  ===================================
+``PyUnicode_EXPORT_ALLOW_COPY``  ``0x10000``  Allow memory copies and conversions
+===============================  ===========  ===================================
+
+
 .. _export-complexity:

 Export complexity
 -----------------

-In general, an export has a complexity of *O*\ (1): no memory copy is
-needed. There are cases when a copy is needed, *O*\ (*n*) complexity:
+By default, an export has a complexity of *O*\ (1): no memory is copied
+and no conversion is done. There is an exception: if only UTF-8 is
+requested and the UTF-8 cache is not filled, the string is encoded to
+UTF-8 to fill the cache.
+
+If the ``PyUnicode_EXPORT_ALLOW_COPY`` flag is set, there are cases when a
+copy is needed, *O*\ (*n*) complexity:

 * If only UCS-2 is requested and the native format is UCS-1.
 * If only UCS-4 is requested and the native format is UCS-1 or UCS-2.
-* If only UTF-8 is requested: the string is encoded to UTF-8 at the
-  first call, and then the encoded UTF-8 string is cached.
+* If only UTF-8 is requested and the string contains surrogate
+  characters.

 To get the best performance on CPython and PyPy, it's recommended to
 support these 4 formats::
@@ -236,8 +261,8 @@ The ``PyUnicode_FORMAT_ASCII`` format is mostly useful for
 characters.


-Surrogate characters and NUL characters
---------------------------------------
+Surrogate characters and embedded NUL characters
+------------------------------------------------

 Surrogate characters are allowed: they can be imported and exported. For
 example, the UTF-8 format uses the ``surrogatepass`` error handler.
@@ -347,6 +372,7 @@ to return NULL on embedded null characters
 Rejecting embedded NUL characters requires scanning the string, which has
 *O*\ (*n*) complexity.

+
 Reject surrogate characters
 ---------------------------