PEP 624: Update alternative ideas (#1793)

Add note about we can avoid creating a temporary Unicode object
in deprecated APIs for some codecs.
Inada Naoki 2021-02-04 17:39:05 +09:00 committed by GitHub
parent d3f48ed58f
commit 814daa8aea
1 changed files with 69 additions and 48 deletions

@ -33,7 +33,7 @@ This PEP proposes to remove deprecated ``Py_UNICODE`` encoder APIs in Python 3.1
`PEP 623 <https://www.python.org/dev/peps/pep-0623/>`_ propose to remove
Unicode object APIs relating to ``Py_UNICODE``. On the other hand, this PEP
is not relating to Unicode object. These PEPs are split because they have
different motivation and need different discussion.
different motivations and need different discussions.
Motivation
@ -51,7 +51,7 @@ Rationale
Deprecated since Python 3.3
---------------------------
``Py_UNICODE`` and APIs using it have been deprecated since Python 3.3.
``Py_UNICODE`` and APIs using it has been deprecated since Python 3.3.
Inefficient
@ -65,7 +65,7 @@ object.
Not used widely
---------------
When searching from top 4000 PyPI packages [1]_, only pyodbc use
When searching from the top 4000 PyPI packages [1]_, only pyodbc use
these APIs.
* ``PyUnicode_EncodeUTF8()``
@ -139,23 +139,22 @@ Remove these APIs in Python 3.11. They have been deprecated already.
* ``PyUnicode_TransformDecimalToASCII()``
Alternative ideas
Alternative Ideas
=================
Instead of just removing deprecated APIs, we may be able to use their
names with different signature.
Replace ``Py_UNICODE*`` with ``PyObject*``
------------------------------------------
As described in the "Alternative APIs" section, some APIs don't have
public alternative APIs accepting ``PyObject *unicode`` input.
And some public alternative APIs have restrictions like missing
``errors`` and ``byteorder`` parameters.
Make some private APIs public
------------------------------
Instead of removing deprecated APIs, we can reuse their names for
alternative public APIs.
``PyUnicode_EncodeUTF7()`` doesn't have public alternative APIs.
Some APIs have alternative public APIs. But they are missing
``const char *errors`` or ``int byteorder`` parameters.
We can rename some private APIs and make them public to cover missing
APIs and parameters.
Since we have private alternative APIs already, it is just renaming
from private name to public and deprecated names.
============================= ================================
Rename to Rename from
@ -170,11 +169,12 @@ APIs and parameters.
Pros:
* We have more consistent API set.
* We have a more consistent API set.
Cons:
* We have more public APIs to maintain.
* Backward incompatible.
* We have more public APIs to maintain for rare use cases.
* Existing public APIs are enough for most use cases, and
``PyUnicode_AsEncodedString()`` can be used in other cases.
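
As a rough illustration of the last point (not part of the PEP text): ``PyUnicode_AsEncodedString()`` takes an encoding name and an error handler, the same pair that ``str.encode()`` accepts at the Python level. The sketch below uses ``str.encode()`` as a stand-in and assumes that byte order is selected through the codec name (``utf-16-le``, ``utf-16-be``) instead of an ``int byteorder`` argument. The sample strings and variable names are arbitrary.

```python
# Python-level stand-in for PyUnicode_AsEncodedString(unicode, encoding, errors).
text = "a\u20ac"  # 'a' followed by EURO SIGN (U+20AC)

# Byte order is chosen by the codec name rather than an ``int byteorder`` argument.
utf16_le = text.encode("utf-16-le")
utf16_be = text.encode("utf-16-be")

# UTF-7 is reachable by codec name as well.
utf7 = "a+b".encode("utf-7")

# The error handler is passed as the second argument.
ascii_replaced = "caf\u00e9".encode("ascii", "replace")
```

In C, the corresponding call would pass the same encoding and ``errors`` strings along with the ``PyObject *unicode`` input.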
@ -182,51 +182,71 @@ Cons:
Replace ``Py_UNICODE*`` with ``Py_UCS4*``
-----------------------------------------
We can replace ``Py_UNICODE`` (typedef of ``wchar_t``) with
``Py_UCS4``. Since builtin codecs support UCS-4, we don't need to
convert ``Py_UCS4*`` string to Unicode object.
We can replace ``Py_UNICODE`` with ``Py_UCS4`` and undeprecate
these APIs.
UTF-8, UTF-16, UTF-32 encoders support ``Py_UCS4`` internally.
So ``PyUnicode_EncodeUTF8()``, ``PyUnicode_EncodeUTF16()``, and
``PyUnicode_EncodeUTF32()`` can avoid to create a temporary Unicode
object.
Pros:
* We have more consistent API set.
* User can encode UCS-4 string in C without creating Unicode object.
* We can avoid creating temporary Unicode object when encoding from
``Py_UCS4*`` into bytes object with UTF-8, UTF-16, UTF-32 codecs.
Cons:
* We have more public APIs to maintain.
* Applications which uses UTF-8 or UTF-16 can not use these APIs
anyway.
* Other Python implementations may not have builtin codec for UCS-4.
* If we change the Unicode internal representation to UTF-8, we need
to keep UCS-4 support only for these APIs.
* Backward incompatible.
* We have more public APIs to maintain for rare use cases.
* Other Python implementations that want to support Python/C API need
to support these APIs too.
* If we change the Unicode internal representation to UTF-8 in the
future, we need to keep UCS-4 support only for these APIs.
Replace ``Py_UNICODE*`` with ``wchar_t*``
-----------------------------------------
We can replace ``Py_UNICODE`` to ``wchar_t``.
We can replace ``Py_UNICODE`` with ``wchar_t``. Since ``Py_UNICODE``
is typedef of ``wchar_t`` already, this is status quo.
On platforms where ``sizeof(wchar_t) == 4``, we can avoid to create a
temporary Unicode object when encoding from ``wchar_t*`` to bytes
objects using UTF-8, UTF-16, and UTF-32 codec, like the "Replace
``Py_UNICODE*`` with ``Py_UCS4*``" idea.
Pros:
* We have more consistent API set.
* Backward compatible.
* We can avoid creating temporary Unicode object when encode from
``Py_UCS4*`` into bytes object with UTF-8, UTF-16, UTF-32 codecs
on platform where ``sizeof(wchar_t) == 4``.
Cons:
* We have more public APIs to maintain.
* They are inefficient on platforms ``wchar_t*`` is UTF-16. It is
because built-in codecs supports only UCS-1, UCS-2, and UCS-4
input.
* Although Windows is the most major platform that uses ``wchar_t``
heavily, these APIs need to create a temporary Unicode object
always because ``sizeof(wchar_t) == 2`` on Windows.
* We have more public APIs to maintain for rare use cases.
* Other Python implementations that want to support Python/C API need
to support these APIs too.
* If we change the Unicode internal representation to UTF-8 in the
future, we need to keep UCS-4 support only for these APIs.
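
The difference between the 2-byte and 4-byte ``wchar_t`` cases can be seen from Python with a small sketch. It assumes ``ctypes`` is available and uses U+1F600 as an arbitrary non-BMP character. It only inspects sizes and does not call any of the APIs discussed here.

```python
import ctypes

# Size of the platform's wchar_t: 2 on Windows, 4 on most Unix-like systems.
wchar_size = ctypes.sizeof(ctypes.c_wchar)

ch = "\U0001F600"  # one code point outside the Basic Multilingual Plane

# As UTF-16 the code point needs a surrogate pair (two 16-bit units) ...
utf16_units = len(ch.encode("utf-16-le")) // 2
# ... while as UTF-32/UCS-4 it is a single 32-bit unit.
utf32_units = len(ch.encode("utf-32-le")) // 4

# Only a 4-byte wchar_t buffer maps one unit to one code point.
fixed_width = wchar_size == 4
```

When ``wchar_size`` is 2, a ``wchar_t*`` buffer may contain surrogate pairs, so it is UTF-16 and not UCS-2, which is the case where a temporary Unicode object is still needed.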
Rejected ideas
Rejected Ideas
==============
Using runtime warning
---------------------
Emit runtime warning
--------------------
These APIs doesn't release GIL for now. Emitting a warning from
In addition to the existing compiler warning, emitting a runtime
``DeprecationWarning`` has been suggested.
But these APIs don't release the GIL for now. Emitting a warning from
such APIs is not safe. See this example.
.. code-block::
@ -244,7 +264,6 @@ filters and other threads may change the ``list`` and ``u`` can be
a dangling reference after ``PyUnicode_EncodeUTF8()`` returned.
Discussions
===========
@ -256,22 +275,24 @@ Discussions
Objections
----------
* Removing these APIs removes ability to use codec without temporary Unicode.
* Removing these APIs removes ability to use codec without temporary
Unicode.
* Codecs can not encode Unicode buffer directly without temporary Unicode
object since Python 3.3. All these APIs creates temporary Unicode object
for now. So removing them doesn't reduce any abilities.
* Codecs can not encode Unicode buffer directly without temporary
Unicode object since Python 3.3. All these APIs creates temporary
Unicode object for now. So removing them doesn't reduce any
abilities.
* Why not remove decoder APIs too?
* They are part of stable ABI.
* ``PyUnicode_DecodeASCII()`` and ``PyUnicode_DecodeUTF8()`` are used
very widely. Deprecating them is not worth enough.
* ``PyUnicode_DecodeASCII()`` and ``PyUnicode_DecodeUTF8()`` are
used very widely. Deprecating them is not worth enough.
* Decoder APIs can decode from byte buffer directly, without creating
temporary bytes object. On the other hand, encoder APIs can not avoid
temporary Unicode object.
* Decoder APIs can decode from byte buffer directly, without
creating temporary bytes object. On the other hand, encoder APIs
can not avoid temporary Unicode object.
References