PEP 624: Remove Py_UNICODE encoder APIs (#1497)
Co-authored-by: Victor Stinner <vstinner@python.org>
This commit is contained in:
parent
f1de4f169d
commit
d51eaeec78
|
@ -0,0 +1,299 @@
|
|||
PEP: 624
|
||||
Title: Remove Py_UNICODE encoder APIs
|
||||
Author: Inada Naoki <songofacandy@gmail.com>
|
||||
Status: Draft
|
||||
Type: Standards Track
|
||||
Content-Type: text/x-rst
|
||||
Created: 06-Jul-2020
|
||||
Python-Version: 3.11
|
||||
|
||||
|
||||
Abstract
|
||||
========
|
||||
|
||||
This PEP proposes to remove deprecated ``Py_UNICODE`` encoder APIs in Python 3.11:
|
||||
|
||||
* ``PyUnicode_Encode()``
|
||||
* ``PyUnicode_EncodeASCII()``
|
||||
* ``PyUnicode_EncodeLatin1()``
|
||||
* ``PyUnicode_EncodeUTF7()``
|
||||
* ``PyUnicode_EncodeUTF8()``
|
||||
* ``PyUnicode_EncodeUTF16()``
|
||||
* ``PyUnicode_EncodeUTF32()``
|
||||
* ``PyUnicode_EncodeUnicodeEscape()``
|
||||
* ``PyUnicode_EncodeRawUnicodeEscape()``
|
||||
* ``PyUnicode_EncodeCharmap()``
|
||||
* ``PyUnicode_TranslateCharmap()``
|
||||
* ``PyUnicode_EncodeDecimal()``
|
||||
* ``PyUnicode_TransformDecimalToASCII()``
|
||||
|
||||
.. note::
|
||||
|
||||
`PEP 623 <https://www.python.org/dev/peps/pep-0623/>`_ propose to remove
|
||||
Unicode object APIs relating to ``Py_UNICODE``. On the other hand, this PEP
|
||||
is not relating to Unicode object. These PEPs are split because they have
|
||||
different motivation and need different discussion.
|
||||
|
||||
|
||||
Motivation
|
||||
==========
|
||||
|
||||
In general, reducing the number of APIs that have been deprecated for
|
||||
a long time and have few users is a good idea for not only it
|
||||
improves the maintainability of CPython, but it also helps API users
|
||||
and other Python implementations.
|
||||
|
||||
|
||||
Rationale
|
||||
=========
|
||||
|
||||
Deprecated since Python 3.3
|
||||
---------------------------
|
||||
|
||||
``Py_UNICODE`` and APIs using it are deprecated since Python 3.3.
|
||||
|
||||
|
||||
Inefficient
|
||||
-----------
|
||||
|
||||
All of these APIs are implemented using ``PyUnicode_FromWideChar``.
|
||||
So these APIs are inefficient when user want to encode Unicode
|
||||
object.
|
||||
|
||||
|
||||
Not used widely
|
||||
---------------
|
||||
|
||||
When searching from top 4000 PyPI packages [1]_, only pyodbc use
|
||||
these APIs.
|
||||
|
||||
* ``PyUnicode_EncodeUTF8()``
|
||||
* ``PyUnicode_EncodeUTF16()``
|
||||
|
||||
pyodbc uses these APIs to encode Unicode object into bytes object.
|
||||
So it is easy to fix it. [2]_
|
||||
|
||||
|
||||
Alternative APIs
|
||||
================
|
||||
|
||||
There are alternative APIs to accept ``PyObject *unicode`` instead of
|
||||
``Py_UNICODE *``. Users can migrate to them.
|
||||
|
||||
|
||||
========================================= ==========================================
|
||||
Deprecated API Alternative APIs
|
||||
========================================= ==========================================
|
||||
``PyUnicode_Encode()`` ``PyUnicode_AsEncodedString()``
|
||||
``PyUnicode_EncodeASCII()`` ``PyUnicode_AsASCIIString()`` \(1)
|
||||
``PyUnicode_EncodeLatin1()`` ``PyUnicode_AsLatin1String()`` \(1)
|
||||
``PyUnicode_EncodeUTF7()`` \(2)
|
||||
``PyUnicode_EncodeUTF8()`` ``PyUnicode_AsUTF8String()`` \(1)
|
||||
``PyUnicode_EncodeUTF16()`` ``PyUnicode_AsUTF16String()`` \(3)
|
||||
``PyUnicode_EncodeUTF32()`` ``PyUnicode_AsUTF32String()`` \(3)
|
||||
``PyUnicode_EncodeUnicodeEscape()`` ``PyUnicode_AsUnicodeEscapeString()``
|
||||
``PyUnicode_EncodeRawUnicodeEscape()`` ``PyUnicode_AsRawUnicodeEscapeString()``
|
||||
``PyUnicode_EncodeCharmap()`` ``PyUnicode_AsCharmapString()`` \(1)
|
||||
``PyUnicode_TranslateCharmap()`` ``PyUnicode_Translate()``
|
||||
``PyUnicode_EncodeDecimal()`` \(4)
|
||||
``PyUnicode_TransformDecimalToASCII()`` \(4)
|
||||
========================================= ==========================================
|
||||
|
||||
Notes:
|
||||
|
||||
(1)
|
||||
``const char *errors`` parameter is missing.
|
||||
|
||||
(2)
|
||||
There is no public alternative API. But user can use generic
|
||||
``PyUnicode_AsEncodedString()`` instead.
|
||||
|
||||
(3)
|
||||
``const char *errors, int byteorder`` parameters are missing.
|
||||
|
||||
(4)
|
||||
There is no direct replacement. But ``Py_UNICODE_TODECIMAL``
|
||||
can be used instead. CPython uses
|
||||
``_PyUnicode_TransformDecimalAndSpaceToASCII`` for converting
|
||||
from Unicode to numbers instead.
|
||||
|
||||
|
||||
Plan
|
||||
====
|
||||
|
||||
Python 3.9
|
||||
----------
|
||||
|
||||
Add ``Py_DEPRECATED(3.3)`` to following APIs. This change is committed
|
||||
already [3]_. All other APIs have been marked ``Py_DEPRECATED(3.3)``
|
||||
already.
|
||||
|
||||
* ``PyUnicode_EncodeDecimal()``
|
||||
* ``PyUnicode_TransformDecimalToASCII()``.
|
||||
|
||||
Document all APIs as "will be removed in version 3.11".
|
||||
|
||||
|
||||
Python 3.11
|
||||
-----------
|
||||
|
||||
These APIs are removed.
|
||||
|
||||
* ``PyUnicode_Encode()``
|
||||
* ``PyUnicode_EncodeASCII()``
|
||||
* ``PyUnicode_EncodeLatin1()``
|
||||
* ``PyUnicode_EncodeUTF7()``
|
||||
* ``PyUnicode_EncodeUTF8()``
|
||||
* ``PyUnicode_EncodeUTF16()``
|
||||
* ``PyUnicode_EncodeUTF32()``
|
||||
* ``PyUnicode_EncodeUnicodeEscape()``
|
||||
* ``PyUnicode_EncodeRawUnicodeEscape()``
|
||||
* ``PyUnicode_EncodeCharmap()``
|
||||
* ``PyUnicode_TranslateCharmap()``
|
||||
* ``PyUnicode_EncodeDecimal()``
|
||||
* ``PyUnicode_TransformDecimalToASCII()``
|
||||
|
||||
|
||||
Alternative ideas
|
||||
=================
|
||||
|
||||
Instead of just removing deprecated APIs, we may be able to use thier
|
||||
names with different signature.
|
||||
|
||||
|
||||
Make some private APIs public
|
||||
------------------------------
|
||||
|
||||
``PyUnicode_EncodeUTF7()`` doesn't have public alternative APIs.
|
||||
|
||||
Some APIs have alternative public APIs. But they are missing
|
||||
``const char *errors`` or ``int byteorder`` parameters.
|
||||
|
||||
We can rename some private APIs and make them public to cover missing
|
||||
APIs and parameters.
|
||||
|
||||
============================= ================================
|
||||
Rename to Rename from
|
||||
============================= ================================
|
||||
``PyUnicode_EncodeASCII()`` ``_PyUnicode_AsASCIIString()``
|
||||
``PyUnicode_EncodeLatin1()`` ``_PyUnicode_AsLatin1String()``
|
||||
``PyUnicode_EncodeUTF7()`` ``_PyUnicode_EncodeUTF7()``
|
||||
``PyUnicode_EncodeUTF8()`` ``_PyUnicode_AsUTF8String()``
|
||||
``PyUnicode_EncodeUTF16()`` ``_PyUnicode_EncodeUTF16()``
|
||||
``PyUnicode_EncodeUTF32()`` ``_PyUnicode_EncodeUTF32()``
|
||||
============================= ================================
|
||||
|
||||
Pros:
|
||||
|
||||
* We have more consistent API set.
|
||||
|
||||
Cons:
|
||||
|
||||
* We have more public APIs to maintain.
|
||||
* Existing public APIs are enough for most use cases, and
|
||||
``PyUnicode_AsEncodedString()`` can be used in other cases.
|
||||
|
||||
|
||||
Replace ``Py_UNICODE*`` with ``Py_UCS4*``
|
||||
-----------------------------------------
|
||||
|
||||
We can replace ``Py_UNICODE`` (typedef of ``wchar_t``) with
|
||||
``Py_UCS4``. Since builtin codecs support UCS-4, we don't need to
|
||||
convert ``Py_UCS4*`` string to Unicode object.
|
||||
|
||||
|
||||
Pros:
|
||||
|
||||
* We have more consistent API set.
|
||||
* User can encode UCS-4 string in C without creating Unicode object.
|
||||
|
||||
Cons:
|
||||
|
||||
* We have more public APIs to maintain.
|
||||
* Applications which uses UTF-8 or UTF-32 can not use these APIs
|
||||
anyway.
|
||||
* Other Python implementations may not have builtin codec for UCS-4.
|
||||
* If we change the Unicode internal representation to UTF-8, we need
|
||||
to keep UCS-4 support only for these APIs.
|
||||
|
||||
|
||||
Replace ``Py_UNICODE*`` with ``wchar_t*``
|
||||
-----------------------------------------
|
||||
|
||||
We can replace ``Py_UNICODE`` to ``wchar_t``.
|
||||
|
||||
Pros:
|
||||
|
||||
* We have more consistent API set.
|
||||
* Backward compatible.
|
||||
|
||||
Cons:
|
||||
|
||||
* We have more public APIs to maintain.
|
||||
* They are inefficient on platforms ``wchar_t*`` is UTF-16. It is
|
||||
because built-in codecs supports only UCS-1, UCS-2, and UCS-4
|
||||
input.
|
||||
|
||||
|
||||
Rejected ideas
|
||||
==============
|
||||
|
||||
Using runtime warning
|
||||
---------------------
|
||||
|
||||
These APIs doesn't release GIL for now. Emitting a warning from
|
||||
such APIs is not safe. See this example.
|
||||
|
||||
.. code-block::
|
||||
|
||||
PyObject *u = PyList_GET_ITEM(list, i); // u is borrowed reference.
|
||||
PyObject *b = PyUnicode_EncodeUTF8(PyUnicode_AS_UNICODE(u),
|
||||
PyUnicode_GET_SIZE(u), NULL);
|
||||
// Assumes u is still living reference.
|
||||
PyObject *t = PyTuple_Pack(2, u, b);
|
||||
Py_DECREF(b);
|
||||
return t;
|
||||
|
||||
If we emit Python warning from ``PyUnicode_EncodeUTF8()``, warning
|
||||
filters and other threads may change the ``list`` and ``u`` can be
|
||||
a dangling reference after ``PyUnicode_EncodeUTF8()`` returned.
|
||||
|
||||
Additionally, since we are not changing behavior but removing C APIs,
|
||||
runtime ``DeprecationWarning`` might not helpful for Python
|
||||
developers. We should warn to extension developers instead.
|
||||
|
||||
|
||||
Discussions
|
||||
===========
|
||||
|
||||
* `Plan to remove Py_UNICODE APis except PEP 623
|
||||
<https://mail.python.org/archives/list/python-dev@python.org/thread/S7KW2U6IGXZFBMGS6WSJB26NZIBW4OLE/#S7KW2U6IGXZFBMGS6WSJB26NZIBW4OLE>`_
|
||||
* `bpo-41123: Remove Py_UNICODE APIs except PEP 623: <https://bugs.python.org/issue41123>`_
|
||||
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
.. [1] Source package list chosen from top 4000 PyPI packages.
|
||||
(https://github.com/methane/notes/blob/master/2020/wchar-cache/package_list.txt)
|
||||
|
||||
.. [2] pyodbc -- Don't use PyUnicode_Encode API #792
|
||||
(https://github.com/mkleehammer/pyodbc/pull/792)
|
||||
|
||||
.. [3] Uncomment Py_DEPRECATED for Py_UNICODE APIs (GH-21318)
|
||||
(https://github.com/python/cpython/commit/9c3840870814493fed62e140cfa43c2883e12181)
|
||||
|
||||
|
||||
Copyright
|
||||
=========
|
||||
|
||||
This document has been placed in the public domain.
|
||||
|
||||
..
|
||||
Local Variables:
|
||||
mode: indented-text
|
||||
indent-tabs-mode: nil
|
||||
sentence-end-double-space: t
|
||||
fill-column: 70
|
||||
coding: utf-8
|
||||
End:
|
Loading…
Reference in New Issue