PEP 624: Update alternative ideas (#1793)

Add a note that we can avoid creating a temporary Unicode object in deprecated APIs for some codecs.
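
For illustration only, here is a minimal C sketch of what "avoiding a temporary Unicode object" could look like for UTF-8: a UCS-4 buffer is encoded straight into bytes. This is not CPython's encoder. The function name `encode_ucs4_to_utf8`, its signature, and its error convention are assumptions made up for this example, and it uses `uint32_t` as a stand-in for `Py_UCS4`.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a CPython API): encode `len` UCS-4 code points
 * from `in` directly into UTF-8 bytes in `out`, with no intermediate
 * string object. The caller must provide room for 4 * len bytes.
 * Returns the number of bytes written, or (size_t)-1 if a code point is
 * a surrogate or is greater than U+10FFFF. */
static size_t
encode_ucs4_to_utf8(const uint32_t *in, size_t len, unsigned char *out)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        uint32_t ch = in[i];
        if (ch < 0x80) {
            /* 1 byte: 0xxxxxxx */
            out[n++] = (unsigned char)ch;
        }
        else if (ch < 0x800) {
            /* 2 bytes: 110xxxxx 10xxxxxx */
            out[n++] = (unsigned char)(0xC0 | (ch >> 6));
            out[n++] = (unsigned char)(0x80 | (ch & 0x3F));
        }
        else if (ch < 0x10000) {
            if (ch >= 0xD800 && ch <= 0xDFFF) {
                return (size_t)-1;  /* lone surrogates are not encodable */
            }
            /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
            out[n++] = (unsigned char)(0xE0 | (ch >> 12));
            out[n++] = (unsigned char)(0x80 | ((ch >> 6) & 0x3F));
            out[n++] = (unsigned char)(0x80 | (ch & 0x3F));
        }
        else if (ch <= 0x10FFFF) {
            /* 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
            out[n++] = (unsigned char)(0xF0 | (ch >> 18));
            out[n++] = (unsigned char)(0x80 | ((ch >> 12) & 0x3F));
            out[n++] = (unsigned char)(0x80 | ((ch >> 6) & 0x3F));
            out[n++] = (unsigned char)(0x80 | (ch & 0x3F));
        }
        else {
            return (size_t)-1;  /* beyond the Unicode range */
        }
    }
    return n;
}
```

The same direct loop is not possible when the input is 2-byte `wchar_t` (UTF-16), because surrogate pairs would first have to be combined into code points.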
2021-02-04 17:39:05 +09:00 · 2021-02-04 17:39:05 +09:00 · 814daa8aea
parent d3f48ed58f
commit 814daa8aea
1 changed file with 69 additions and 48 deletions
--- a/pep-0624.rst
+++ b/pep-0624.rst
@ -33,7 +33,7 @@ This PEP proposes to remove deprecated ``Py_UNICODE`` encoder APIs in Python 3.1
   `PEP 623 <https://www.python.org/dev/peps/pep-0623/>`_ proposes to remove
   Unicode object APIs relating to ``Py_UNICODE``. On the other hand, this PEP
   is not related to the Unicode object. These PEPs are split because they have
-   different motivation and need different discussion.
+   different motivations and need different discussions.


 Motivation
@ -51,7 +51,7 @@ Rationale
 Deprecated since Python 3.3
 ---------------------------

-``Py_UNICODE`` and APIs using it have been deprecated since Python 3.3.
+``Py_UNICODE`` and the APIs using it have been deprecated since Python 3.3.


 Inefficient
@ -65,7 +65,7 @@ object.
 Not used widely
 ---------------

-When searching from top 4000 PyPI packages [1]_, only pyodbc use
+When searching the top 4000 PyPI packages [1]_, only pyodbc uses
 these APIs.

 * ``PyUnicode_EncodeUTF8()``
@ -139,23 +139,22 @@ Remove these APIs in Python 3.11. They have been deprecated already.
 * ``PyUnicode_TransformDecimalToASCII()``


-Alternative ideas
+Alternative Ideas
 =================

-Instead of just removing deprecated APIs, we may be able to use their
-names with different signature.
+Replace ``Py_UNICODE*`` with ``PyObject*``
+------------------------------------------

+As described in the "Alternative APIs" section, some APIs don't have
+public alternatives that accept ``PyObject *unicode`` input, and some
+public alternatives have restrictions such as missing ``errors`` and
+``byteorder`` parameters.

-Make some private APIs public
------------------------------
+Instead of removing deprecated APIs, we can reuse their names for
+alternative public APIs.

-``PyUnicode_EncodeUTF7()`` doesn't have public alternative APIs.
-
-Some APIs have alternative public APIs. But they are missing
-``const char *errors`` or ``int byteorder`` parameters.
-
-We can rename some private APIs and make them public to cover missing
-APIs and parameters.
+Since private alternative APIs already exist, this only requires
+renaming them from their private names to the public, deprecated names.

 ============================= ================================
 Rename to                     Rename from
@ -170,11 +169,12 @@ APIs and parameters.

 Pros:

-* We have more consistent API set.
+* We have a more consistent API set.

 Cons:

-* We have more public APIs to maintain.
+* Backward incompatible.
+* We have more public APIs to maintain for rare use cases.
 * Existing public APIs are enough for most use cases, and
  ``PyUnicode_AsEncodedString()`` can be used in other cases.

@ -182,51 +182,71 @@ Cons:
 Replace ``Py_UNICODE*`` with ``Py_UCS4*``
 -----------------------------------------

-We can replace ``Py_UNICODE`` (typedef of ``wchar_t``) with
-``Py_UCS4``. Since builtin codecs support UCS-4, we don't need to
-convert ``Py_UCS4*`` string to Unicode object.
+We can replace ``Py_UNICODE`` with ``Py_UCS4`` and undeprecate
+these APIs.
+
+The UTF-8, UTF-16, and UTF-32 encoders support ``Py_UCS4`` internally,
+so ``PyUnicode_EncodeUTF8()``, ``PyUnicode_EncodeUTF16()``, and
+``PyUnicode_EncodeUTF32()`` can avoid creating a temporary Unicode
+object.


 Pros:

-* We have more consistent API set.
-* User can encode UCS-4 string in C without creating Unicode object.
+* We can avoid creating a temporary Unicode object when encoding from
+  ``Py_UCS4*`` into a bytes object with the UTF-8, UTF-16, and UTF-32
+  codecs.

 Cons:

-* We have more public APIs to maintain.
-* Applications which uses UTF-8 or UTF-16 can not use these APIs
-  anyway.
-* Other Python implementations may not have builtin codec for UCS-4.
-* If we change the Unicode internal representation to UTF-8, we need
-  to keep UCS-4 support only for these APIs.
+* Backward incompatible.
+* We have more public APIs to maintain for rare use cases.
+* Other Python implementations that want to support the Python/C API
+  need to support these APIs too.
+* If we change the Unicode internal representation to UTF-8 in the
+  future, we will need to keep UCS-4 support only for these APIs.


 Replace ``Py_UNICODE*`` with ``wchar_t*``
 -----------------------------------------

-We can replace ``Py_UNICODE`` to ``wchar_t``.
+We can replace ``Py_UNICODE`` with ``wchar_t``. Since ``Py_UNICODE``
+is already a typedef of ``wchar_t``, this is the status quo.
+
+On platforms where ``sizeof(wchar_t) == 4``, we can avoid creating a
+temporary Unicode object when encoding from ``wchar_t*`` to bytes
+objects using the UTF-8, UTF-16, and UTF-32 codecs, as in the "Replace
+``Py_UNICODE*`` with ``Py_UCS4*``" idea.
+

 Pros:

-* We have more consistent API set.
 * Backward compatible.
+* We can avoid creating a temporary Unicode object when encoding from
+  ``wchar_t*`` into a bytes object with the UTF-8, UTF-16, and UTF-32
+  codecs on platforms where ``sizeof(wchar_t) == 4``.

 Cons:

-* We have more public APIs to maintain.
-* They are inefficient on platforms ``wchar_t*`` is UTF-16. It is
-  because built-in codecs supports only UCS-1, UCS-2, and UCS-4
-  input.
+* Although Windows is the major platform that uses ``wchar_t``
+  heavily, these APIs always need to create a temporary Unicode object
+  there because ``sizeof(wchar_t) == 2`` on Windows.
+* We have more public APIs to maintain for rare use cases.
+* Other Python implementations that want to support the Python/C API
+  need to support these APIs too.
+* If we change the Unicode internal representation to UTF-8 in the
+  future, we will need to keep UCS-4 support only for these APIs.


-Rejected ideas
+Rejected Ideas
 ==============

-Using runtime warning
---------------------
+Emit runtime warning
+--------------------

-These APIs doesn't release GIL for now. Emitting a warning from
+In addition to the existing compiler warning, emitting a runtime
+``DeprecationWarning`` has been suggested.
+
+But these APIs don't release the GIL for now. Emitting a warning from
 such APIs is not safe. See this example.

 .. code-block::
@ -244,7 +264,6 @@ filters and other threads may change the ``list`` and ``u`` can be
 a dangling reference after ``PyUnicode_EncodeUTF8()`` returned.


-
 Discussions
 ===========

@ -256,22 +275,24 @@ Discussions
 Objections
 ----------

-* Removing these APIs removes ability to use codec without temporary Unicode.
+* Removing these APIs removes the ability to use a codec without a
+  temporary Unicode object.

-  * Codecs can not encode Unicode buffer directly without temporary Unicode
-    object since Python 3.3. All these APIs creates temporary Unicode object
-    for now. So removing them doesn't reduce any abilities.
+  * Codecs cannot encode a Unicode buffer directly without a temporary
+    Unicode object since Python 3.3. All these APIs create a temporary
+    Unicode object for now, so removing them doesn't reduce any
+    abilities.

 * Why not remove decoder APIs too?

  * They are part of stable ABI.

-  * ``PyUnicode_DecodeASCII()`` and ``PyUnicode_DecodeUTF8()`` are used
-    very widely. Deprecating them is not worth enough.
+  * ``PyUnicode_DecodeASCII()`` and ``PyUnicode_DecodeUTF8()`` are
+    used very widely. Deprecating them is not worthwhile.

-  * Decoder APIs can decode from byte buffer directly, without creating
-    temporary bytes object. On the other hand, encoder APIs can not avoid
-    temporary Unicode object.
+  * Decoder APIs can decode from a byte buffer directly, without
+    creating a temporary bytes object. On the other hand, encoder APIs
+    cannot avoid a temporary Unicode object.


 References