PEP 624: Update alternative ideas (#1793)

Add note about we can avoid creating a temporary Unicode object
in deprecated APIs for some codecs.
Inada Naoki 2021-02-04 17:39:05 +09:00 committed by GitHub
parent d3f48ed58f
commit 814daa8aea
1 changed files with 69 additions and 48 deletions


@@ -33,7 +33,7 @@ This PEP proposes to remove deprecated ``Py_UNICODE`` encoder APIs in Python 3.1
 `PEP 623 <https://www.python.org/dev/peps/pep-0623/>`_ propose to remove
 Unicode object APIs relating to ``Py_UNICODE``. On the other hand, this PEP
 is not relating to Unicode object. These PEPs are split because they have
-different motivation and need different discussion.
+different motivations and need different discussions.
 Motivation
@@ -51,7 +51,7 @@ Rationale
 Deprecated since Python 3.3
 ---------------------------
-``Py_UNICODE`` and APIs using it have been deprecated since Python 3.3.
+``Py_UNICODE`` and APIs using it has been deprecated since Python 3.3.
 Inefficient
@@ -65,7 +65,7 @@ object.
 Not used widely
 ---------------
-When searching from top 4000 PyPI packages [1]_, only pyodbc use
+When searching from the top 4000 PyPI packages [1]_, only pyodbc use
 these APIs.
 * ``PyUnicode_EncodeUTF8()``
@@ -139,23 +139,22 @@ Remove these APIs in Python 3.11. They have been deprecated already.
 * ``PyUnicode_TransformDecimalToASCII()``
-Alternative ideas
+Alternative Ideas
 =================
-Instead of just removing deprecated APIs, we may be able to use their
-names with different signature.
-Make some private APIs public
-------------------------------
-``PyUnicode_EncodeUTF7()`` doesn't have public alternative APIs.
-Some APIs have alternative public APIs. But they are missing
-``const char *errors`` or ``int byteorder`` parameters.
-We can rename some private APIs and make them public to cover missing
-APIs and parameters.
+Replace ``Py_UNICODE*`` with ``PyObjct*``
+-----------------------------------------
+As described in the "Alternative APIs" section, some APIs don't have
+public alternative APIs accepting ``PyObject *unicode`` input.
+And some public alternative APIs have restrictions like missing
+``errors`` and ``byteorder`` parameters.
+Instead of removing deprecated APIs, we can reuse their names for
+alternative public APIs.
+Since we have private alternative APIs already, it is just renaming
+from private name to public and deprecated names.
 ============================= ================================
 Rename to                     Rename from
@@ -170,11 +169,12 @@ APIs and parameters.
 Pros:
-* We have more consistent API set.
+* We have a more consistent API set.
 Cons:
-* We have more public APIs to maintain.
+* Backward incompatible.
+* We have more public APIs to maintain for rare use cases.
 * Existing public APIs are enough for most use cases, and
 ``PyUnicode_AsEncodedString()`` can be used in other cases.
@@ -182,51 +182,71 @@ Cons:
 Replace ``Py_UNICODE*`` with ``Py_UCS4*``
 -----------------------------------------
-We can replace ``Py_UNICODE`` (typedef of ``wchar_t``) with
-``Py_UCS4``. Since builtin codecs support UCS-4, we don't need to
-convert ``Py_UCS4*`` string to Unicode object.
+We can replace ``Py_UNICODE`` with ``Py_UCS4`` and undeprecate
+these APIs.
+UTF-8, UTF-16, UTF-32 encoders support ``Py_UCS4`` internally.
+So ``PyUnicode_EncodeUTF8()``, ``PyUnicode_EncodeUTF16()``, and
+``PyUnicode_EncodeUTF32()`` can avoid to create a temporary Unicode
+object.
 Pros:
-* We have more consistent API set.
-* User can encode UCS-4 string in C without creating Unicode object.
+* We can avoid creating temporary Unicode object when encoding from
+``Py_UCS4*`` into bytes object with UTF-8, UTF-16, UTF-32 codecs.
 Cons:
-* We have more public APIs to maintain.
-* Applications which uses UTF-8 or UTF-16 can not use these APIs
-anyway.
-* Other Python implementations may not have builtin codec for UCS-4.
-* If we change the Unicode internal representation to UTF-8, we need
-to keep UCS-4 support only for these APIs.
+* Backward incompatible.
+* We have more public APIs to maintain for rare use cases.
+* Other Python implementations that want to support Python/C API need
+to support these APIs too.
+* If we change the Unicode internal representation to UTF-8 in the
+future, we need to keep UCS-4 support only for these APIs.
 Replace ``Py_UNICODE*`` with ``wchar_t*``
 -----------------------------------------
-We can replace ``Py_UNICODE`` to ``wchar_t``.
+We can replace ``Py_UNICODE`` with ``wchar_t``. Since ``Py_UNICODE``
+is typedef of ``wchar_t`` already, this is status quo.
+On platforms where ``sizeof(wchar_t) == 4``, we can avoid to create a
+temporary Unicode object when encoding from ``wchar_t*`` to bytes
+objects using UTF-8, UTF-16, and UTF-32 codec, like the "Replace
+``Py_UNICODE*`` with ``Py_UCS4*``" idea.
 Pros:
-* We have more consistent API set.
 * Backward compatible.
+* We can avoid creating temporary Unicode object when encode from
+``Py_UCS4*`` into bytes object with UTF-8, UTF-16, UTF-32 codecs
+on platform where ``sizeof(wchar_t) == 4``.
 Cons:
-* We have more public APIs to maintain.
-* They are inefficient on platforms ``wchar_t*`` is UTF-16. It is
-because built-in codecs supports only UCS-1, UCS-2, and UCS-4
-input.
+* Although Windows is the most major platform that uses ``wchar_t``
+heavily, these APIs need to create a temporary Unicode object
+always because ``sizeof(wchar_t) == 2`` on Windows.
+* We have more public APIs to maintain for rare use cases.
+* Other Python implementations that want to support Python/C API need
+to support these APIs too.
+* If we change the Unicode internal representation to UTF-8 in the
+future, we need to keep UCS-4 support only for these APIs.
-Rejected ideas
+Rejected Ideas
 ==============
-Using runtime warning
----------------------
-These APIs doesn't release GIL for now. Emitting a warning from
+Emit runtime warning
+--------------------
+In addition to existing compiler warning, emitting runtime
+``DeprecationWarning`` is suggested.
+But these APIs doesn't release GIL for now. Emitting a warning from
 such APIs is not safe. See this example.
 .. code-block::
@@ -244,7 +264,6 @@ filters and other threads may change the ``list`` and ``u`` can be
 a dangling reference after ``PyUnicode_EncodeUTF8()`` returned.
 Discussions
 ===========
@@ -256,22 +275,24 @@ Discussions
 Objections
 ----------
-* Removing these APIs removes ability to use codec without temporary Unicode.
+* Removing these APIs removes ability to use codec without temporary
+Unicode.
-* Codecs can not encode Unicode buffer directly without temporary Unicode
-object since Python 3.3. All these APIs creates temporary Unicode object
-for now. So removing them doesn't reduce any abilities.
+* Codecs can not encode Unicode buffer directly without temporary
+Unicode object since Python 3.3. All these APIs creates temporary
+Unicode object for now. So removing them doesn't reduce any
+abilities.
 * Why not remove decoder APIs too?
 * They are part of stable ABI.
-* ``PyUnicode_DecodeASCII()`` and ``PyUnicode_DecodeUTF8()`` are used
-very widely. Deprecating them is not worth enough.
+* ``PyUnicode_DecodeASCII()`` and ``PyUnicode_DecodeUTF8()`` are
+used very widely. Deprecating them is not worth enough.
-* Decoder APIs can decode from byte buffer directly, without creating
-temporary bytes object. On the other hand, encoder APIs can not avoid
-temporary Unicode object.
+* Decoder APIs can decode from byte buffer directly, without
+creating temporary bytes object. On the other hand, encoder APIs
+can not avoid temporary Unicode object.
 References
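
Note on the "Replace ``Py_UNICODE*`` with ``Py_UCS4*``" alternative in the diff above: its stated benefit is that a UTF-8 encoder can read a UCS-4 buffer directly, with no temporary Unicode object in between. The sketch below only illustrates that idea in plain C. It is not CPython's encoder and not a Python/C API: ``ucs4_to_utf8`` is a hypothetical helper, ``uint32_t`` stands in for ``Py_UCS4``, and error handling is reduced to returning -1 (there is no ``errors`` parameter).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (an assumption for illustration, not a CPython API):
   encode `len` UCS-4 code points from `in` straight into UTF-8 bytes in
   `out`, which has room for `cap` bytes. Returns the number of bytes
   written, or -1 if `out` is too small or a code point cannot be encoded
   (a surrogate, or a value above 0x10FFFF). */
static ptrdiff_t
ucs4_to_utf8(const uint32_t *in, size_t len, unsigned char *out, size_t cap)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        uint32_t c = in[i];
        /* Number of UTF-8 bytes this code point needs. */
        size_t need = c < 0x80 ? 1 : c < 0x800 ? 2 : c < 0x10000 ? 3 : 4;
        if (c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF) || cap - n < need)
            return -1;
        switch (need) {
        case 1:
            out[n++] = (unsigned char)c;
            break;
        case 2:
            out[n++] = (unsigned char)(0xC0 | (c >> 6));
            out[n++] = (unsigned char)(0x80 | (c & 0x3F));
            break;
        case 3:
            out[n++] = (unsigned char)(0xE0 | (c >> 12));
            out[n++] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
            out[n++] = (unsigned char)(0x80 | (c & 0x3F));
            break;
        default:
            out[n++] = (unsigned char)(0xF0 | (c >> 18));
            out[n++] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
            out[n++] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
            out[n++] = (unsigned char)(0x80 | (c & 0x3F));
            break;
        }
    }
    return (ptrdiff_t)n;
}
```

The same loop would not work unchanged on a ``wchar_t*`` buffer where ``sizeof(wchar_t) == 2``, because the input would then contain UTF-16 surrogate pairs that must be combined first. That matches the Windows drawback listed under the ``wchar_t*`` alternative.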