From d51eaeec7865fe53f57710a4ee9a518613a5f9e8 Mon Sep 17 00:00:00 2001
From: Inada Naoki
Date: Wed, 8 Jul 2020 00:08:47 +0900
Subject: [PATCH] PEP 624: Remove Py_UNICODE encoder APIs (#1497)

Co-authored-by: Victor Stinner
---
 pep-0624.rst | 299 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 299 insertions(+)
 create mode 100644 pep-0624.rst

diff --git a/pep-0624.rst b/pep-0624.rst
new file mode 100644
index 000000000..5b9455e2a
--- /dev/null
+++ b/pep-0624.rst
@@ -0,0 +1,299 @@
+PEP: 624
+Title: Remove Py_UNICODE encoder APIs
+Author: Inada Naoki
+Status: Draft
+Type: Standards Track
+Content-Type: text/x-rst
+Created: 06-Jul-2020
+Python-Version: 3.11
+
+
+Abstract
+========
+
+This PEP proposes to remove deprecated ``Py_UNICODE`` encoder APIs in Python 3.11:
+
+* ``PyUnicode_Encode()``
+* ``PyUnicode_EncodeASCII()``
+* ``PyUnicode_EncodeLatin1()``
+* ``PyUnicode_EncodeUTF7()``
+* ``PyUnicode_EncodeUTF8()``
+* ``PyUnicode_EncodeUTF16()``
+* ``PyUnicode_EncodeUTF32()``
+* ``PyUnicode_EncodeUnicodeEscape()``
+* ``PyUnicode_EncodeRawUnicodeEscape()``
+* ``PyUnicode_EncodeCharmap()``
+* ``PyUnicode_TranslateCharmap()``
+* ``PyUnicode_EncodeDecimal()``
+* ``PyUnicode_TransformDecimalToASCII()``
+
+.. note::
+
+   PEP 623 proposes to remove the Unicode object APIs relating to
+   ``Py_UNICODE``. This PEP, on the other hand, does not concern the
+   Unicode object. The two PEPs are split because they have different
+   motivations and need separate discussion.
+
+
+Motivation
+==========
+
+In general, removing APIs that have been deprecated for a long time
+and have few users is a good idea: it not only improves the
+maintainability of CPython, but also helps API users and other
+Python implementations.
+
+
+Rationale
+=========
+
+Deprecated since Python 3.3
+---------------------------
+
+``Py_UNICODE`` and the APIs using it have been deprecated since Python 3.3.
+
+
+Inefficient
+-----------
+
+All of these APIs are implemented using ``PyUnicode_FromWideChar``:
+they first build a temporary Unicode object from the ``Py_UNICODE*``
+buffer and then encode it. So these APIs are inefficient when the
+user already has a Unicode object to encode.
+
+
+Not used widely
+---------------
+
+A search of the top 4000 PyPI packages [1]_ found that only pyodbc
+uses these APIs:
+
+* ``PyUnicode_EncodeUTF8()``
+* ``PyUnicode_EncodeUTF16()``
+
+pyodbc uses these APIs to encode a Unicode object into a bytes object,
+so it is easy to fix. [2]_
+
+
+Alternative APIs
+================
+
+There are alternative APIs that accept ``PyObject *unicode`` instead of
+``Py_UNICODE *``. Users can migrate to them.
+
+
+========================================= ==========================================
+Deprecated API                            Alternative APIs
+========================================= ==========================================
+``PyUnicode_Encode()``                    ``PyUnicode_AsEncodedString()``
+``PyUnicode_EncodeASCII()``               ``PyUnicode_AsASCIIString()`` \(1)
+``PyUnicode_EncodeLatin1()``              ``PyUnicode_AsLatin1String()`` \(1)
+``PyUnicode_EncodeUTF7()``                \(2)
+``PyUnicode_EncodeUTF8()``                ``PyUnicode_AsUTF8String()`` \(1)
+``PyUnicode_EncodeUTF16()``               ``PyUnicode_AsUTF16String()`` \(3)
+``PyUnicode_EncodeUTF32()``               ``PyUnicode_AsUTF32String()`` \(3)
+``PyUnicode_EncodeUnicodeEscape()``       ``PyUnicode_AsUnicodeEscapeString()``
+``PyUnicode_EncodeRawUnicodeEscape()``    ``PyUnicode_AsRawUnicodeEscapeString()``
+``PyUnicode_EncodeCharmap()``             ``PyUnicode_AsCharmapString()`` \(1)
+``PyUnicode_TranslateCharmap()``          ``PyUnicode_Translate()``
+``PyUnicode_EncodeDecimal()``             \(4)
+``PyUnicode_TransformDecimalToASCII()``   \(4)
+========================================= ==========================================
+
+Notes:
+
+(1)
+   The ``const char *errors`` parameter is missing.
+
+(2)
+   There is no public alternative API. But users can use the generic
+   ``PyUnicode_AsEncodedString()`` instead.
+
+(3)
+   The ``const char *errors, int byteorder`` parameters are missing.
+
+(4)
+   There is no direct replacement.
+   But ``Py_UNICODE_TODECIMAL``
+   can be used instead. CPython uses
+   ``_PyUnicode_TransformDecimalAndSpaceToASCII`` for converting
+   from Unicode to numbers.
+
+
+Plan
+====
+
+Python 3.9
+----------
+
+Add ``Py_DEPRECATED(3.3)`` to the following APIs. This change has
+already been committed [3]_. All other APIs have already been marked
+``Py_DEPRECATED(3.3)``.
+
+* ``PyUnicode_EncodeDecimal()``
+* ``PyUnicode_TransformDecimalToASCII()``
+
+Document all APIs as "will be removed in version 3.11".
+
+
+Python 3.11
+-----------
+
+These APIs are removed.
+
+* ``PyUnicode_Encode()``
+* ``PyUnicode_EncodeASCII()``
+* ``PyUnicode_EncodeLatin1()``
+* ``PyUnicode_EncodeUTF7()``
+* ``PyUnicode_EncodeUTF8()``
+* ``PyUnicode_EncodeUTF16()``
+* ``PyUnicode_EncodeUTF32()``
+* ``PyUnicode_EncodeUnicodeEscape()``
+* ``PyUnicode_EncodeRawUnicodeEscape()``
+* ``PyUnicode_EncodeCharmap()``
+* ``PyUnicode_TranslateCharmap()``
+* ``PyUnicode_EncodeDecimal()``
+* ``PyUnicode_TransformDecimalToASCII()``
+
+
+Alternative ideas
+=================
+
+Instead of just removing deprecated APIs, we may be able to use their
+names with different signatures.
+
+
+Make some private APIs public
+------------------------------
+
+``PyUnicode_EncodeUTF7()`` doesn't have a public alternative API.
+
+Some APIs have alternative public APIs, but those are missing the
+``const char *errors`` or ``int byteorder`` parameters.
+
+We can rename some private APIs and make them public to cover missing
+APIs and parameters.
+
+=============================  ================================
+Rename to                      Rename from
+=============================  ================================
+``PyUnicode_EncodeASCII()``    ``_PyUnicode_AsASCIIString()``
+``PyUnicode_EncodeLatin1()``   ``_PyUnicode_AsLatin1String()``
+``PyUnicode_EncodeUTF7()``     ``_PyUnicode_EncodeUTF7()``
+``PyUnicode_EncodeUTF8()``     ``_PyUnicode_AsUTF8String()``
+``PyUnicode_EncodeUTF16()``    ``_PyUnicode_EncodeUTF16()``
+``PyUnicode_EncodeUTF32()``    ``_PyUnicode_EncodeUTF32()``
+=============================  ================================
+
+Pros:
+
+* We have a more consistent API set.
+
+Cons:
+
+* We have more public APIs to maintain.
+* Existing public APIs are enough for most use cases, and
+  ``PyUnicode_AsEncodedString()`` can be used in other cases.
+
+
+Replace ``Py_UNICODE*`` with ``Py_UCS4*``
+-----------------------------------------
+
+We can replace ``Py_UNICODE`` (typedef of ``wchar_t``) with
+``Py_UCS4``. Since builtin codecs support UCS-4, we don't need to
+convert a ``Py_UCS4*`` string to a Unicode object.
+
+
+Pros:
+
+* We have a more consistent API set.
+* Users can encode a UCS-4 string in C without creating a Unicode object.
+
+Cons:
+
+* We have more public APIs to maintain.
+* Applications which use UTF-8 or UTF-32 cannot use these APIs
+  anyway.
+* Other Python implementations may not have a builtin codec for UCS-4.
+* If we change the Unicode internal representation to UTF-8, we need
+  to keep UCS-4 support only for these APIs.
+
+
+Replace ``Py_UNICODE*`` with ``wchar_t*``
+-----------------------------------------
+
+We can replace ``Py_UNICODE`` with ``wchar_t``.
+
+Pros:
+
+* We have a more consistent API set.
+* Backward compatible.
+
+Cons:
+
+* We have more public APIs to maintain.
+* They are inefficient on platforms where ``wchar_t*`` is UTF-16,
+  because built-in codecs support only UCS-1, UCS-2, and UCS-4
+  input.
+
+
+Rejected ideas
+==============
+
+Using runtime warning
+---------------------
+
+These APIs don't release the GIL for now. Emitting a warning from
+such APIs is not safe. See this example.
+
+.. code-block::
+
+   PyObject *u = PyList_GET_ITEM(list, i);  // u is a borrowed reference.
+   PyObject *b = PyUnicode_EncodeUTF8(PyUnicode_AS_UNICODE(u),
+           PyUnicode_GET_SIZE(u), NULL);
+   // Assumes u is still a living reference.
+   PyObject *t = PyTuple_Pack(2, u, b);
+   Py_DECREF(b);
+   return t;
+
+If we emit a Python warning from ``PyUnicode_EncodeUTF8()``, warning
+filters and other threads may change the ``list``, and ``u`` can be
+a dangling reference after ``PyUnicode_EncodeUTF8()`` returns.
+
+Additionally, since we are not changing behavior but removing C APIs,
+a runtime ``DeprecationWarning`` might not be helpful for Python
+developers. We should warn extension developers instead.
+
+
+Discussions
+===========
+
+* Plan to remove Py_UNICODE APis except PEP 623
+* bpo-41123: Remove Py_UNICODE APIs except PEP 623
+
+
+References
+==========
+
+.. [1] Source package list chosen from top 4000 PyPI packages.
+       (https://github.com/methane/notes/blob/master/2020/wchar-cache/package_list.txt)
+
+.. [2] pyodbc -- Don't use PyUnicode_Encode API #792
+       (https://github.com/mkleehammer/pyodbc/pull/792)
+
+.. [3] Uncomment Py_DEPRECATED for Py_UNICODE APIs (GH-21318)
+       (https://github.com/python/cpython/commit/9c3840870814493fed62e140cfa43c2883e12181)
+
+
+Copyright
+=========
+
+This document has been placed in the public domain.
+
+..
+   Local Variables:
+   mode: indented-text
+   indent-tabs-mode: nil
+   sentence-end-double-space: t
+   fill-column: 70
+   coding: utf-8
+   End: