PEP: 624 Title: Remove Py_UNICODE encoder APIs Author: Inada Naoki Status: Draft Type: Standards Track Content-Type: text/x-rst Created: 06-Jul-2020 Python-Version: 3.11 Post-History: 08-Jul-2020 Abstract ======== This PEP proposes to remove deprecated ``Py_UNICODE`` encoder APIs in Python 3.11: * ``PyUnicode_Encode()`` * ``PyUnicode_EncodeASCII()`` * ``PyUnicode_EncodeLatin1()`` * ``PyUnicode_EncodeUTF7()`` * ``PyUnicode_EncodeUTF8()`` * ``PyUnicode_EncodeUTF16()`` * ``PyUnicode_EncodeUTF32()`` * ``PyUnicode_EncodeUnicodeEscape()`` * ``PyUnicode_EncodeRawUnicodeEscape()`` * ``PyUnicode_EncodeCharmap()`` * ``PyUnicode_TranslateCharmap()`` * ``PyUnicode_EncodeDecimal()`` * ``PyUnicode_TransformDecimalToASCII()`` .. note:: `PEP 623 `_ propose to remove Unicode object APIs relating to ``Py_UNICODE``. On the other hand, this PEP is not relating to Unicode object. These PEPs are split because they have different motivation and need different discussion. Motivation ========== In general, reducing the number of APIs that have been deprecated for a long time and have few users is a good idea for not only it improves the maintainability of CPython, but it also helps API users and other Python implementations. Rationale ========= Deprecated since Python 3.3 --------------------------- ``Py_UNICODE`` and APIs using it have been deprecated since Python 3.3. Inefficient ----------- All of these APIs are implemented using ``PyUnicode_FromWideChar``. So these APIs are inefficient when user want to encode Unicode object. Not used widely --------------- When searching from top 4000 PyPI packages [1]_, only pyodbc use these APIs. * ``PyUnicode_EncodeUTF8()`` * ``PyUnicode_EncodeUTF16()`` pyodbc uses these APIs to encode Unicode object into bytes object. So it is easy to fix it. [2]_ Alternative APIs ================ There are alternative APIs to accept ``PyObject *unicode`` instead of ``Py_UNICODE *``. Users can migrate to them. ========================================= ========================================== Deprecated API Alternative APIs ========================================= ========================================== ``PyUnicode_Encode()`` ``PyUnicode_AsEncodedString()`` ``PyUnicode_EncodeASCII()`` ``PyUnicode_AsASCIIString()`` \(1) ``PyUnicode_EncodeLatin1()`` ``PyUnicode_AsLatin1String()`` \(1) ``PyUnicode_EncodeUTF7()`` \(2) ``PyUnicode_EncodeUTF8()`` ``PyUnicode_AsUTF8String()`` \(1) ``PyUnicode_EncodeUTF16()`` ``PyUnicode_AsUTF16String()`` \(3) ``PyUnicode_EncodeUTF32()`` ``PyUnicode_AsUTF32String()`` \(3) ``PyUnicode_EncodeUnicodeEscape()`` ``PyUnicode_AsUnicodeEscapeString()`` ``PyUnicode_EncodeRawUnicodeEscape()`` ``PyUnicode_AsRawUnicodeEscapeString()`` ``PyUnicode_EncodeCharmap()`` ``PyUnicode_AsCharmapString()`` \(1) ``PyUnicode_TranslateCharmap()`` ``PyUnicode_Translate()`` ``PyUnicode_EncodeDecimal()`` \(4) ``PyUnicode_TransformDecimalToASCII()`` \(4) ========================================= ========================================== Notes: (1) ``const char *errors`` parameter is missing. (2) There is no public alternative API. But user can use generic ``PyUnicode_AsEncodedString()`` instead. (3) ``const char *errors, int byteorder`` parameters are missing. (4) There is no direct replacement. But ``Py_UNICODE_TODECIMAL`` can be used instead. CPython uses ``_PyUnicode_TransformDecimalAndSpaceToASCII`` for converting from Unicode to numbers instead. Plan ==== Remove these APIs in Python 3.11. They have been deprecated already. * ``PyUnicode_Encode()`` * ``PyUnicode_EncodeASCII()`` * ``PyUnicode_EncodeLatin1()`` * ``PyUnicode_EncodeUTF7()`` * ``PyUnicode_EncodeUTF8()`` * ``PyUnicode_EncodeUTF16()`` * ``PyUnicode_EncodeUTF32()`` * ``PyUnicode_EncodeUnicodeEscape()`` * ``PyUnicode_EncodeRawUnicodeEscape()`` * ``PyUnicode_EncodeCharmap()`` * ``PyUnicode_TranslateCharmap()`` * ``PyUnicode_EncodeDecimal()`` * ``PyUnicode_TransformDecimalToASCII()`` Alternative ideas ================= Instead of just removing deprecated APIs, we may be able to use thier names with different signature. Make some private APIs public ------------------------------ ``PyUnicode_EncodeUTF7()`` doesn't have public alternative APIs. Some APIs have alternative public APIs. But they are missing ``const char *errors`` or ``int byteorder`` parameters. We can rename some private APIs and make them public to cover missing APIs and parameters. ============================= ================================ Rename to Rename from ============================= ================================ ``PyUnicode_EncodeASCII()`` ``_PyUnicode_AsASCIIString()`` ``PyUnicode_EncodeLatin1()`` ``_PyUnicode_AsLatin1String()`` ``PyUnicode_EncodeUTF7()`` ``_PyUnicode_EncodeUTF7()`` ``PyUnicode_EncodeUTF8()`` ``_PyUnicode_AsUTF8String()`` ``PyUnicode_EncodeUTF16()`` ``_PyUnicode_EncodeUTF16()`` ``PyUnicode_EncodeUTF32()`` ``_PyUnicode_EncodeUTF32()`` ============================= ================================ Pros: * We have more consistent API set. Cons: * We have more public APIs to maintain. * Existing public APIs are enough for most use cases, and ``PyUnicode_AsEncodedString()`` can be used in other cases. Replace ``Py_UNICODE*`` with ``Py_UCS4*`` ----------------------------------------- We can replace ``Py_UNICODE`` (typedef of ``wchar_t``) with ``Py_UCS4``. Since builtin codecs support UCS-4, we don't need to convert ``Py_UCS4*`` string to Unicode object. Pros: * We have more consistent API set. * User can encode UCS-4 string in C without creating Unicode object. Cons: * We have more public APIs to maintain. * Applications which uses UTF-8 or UTF-16 can not use these APIs anyway. * Other Python implementations may not have builtin codec for UCS-4. * If we change the Unicode internal representation to UTF-8, we need to keep UCS-4 support only for these APIs. Replace ``Py_UNICODE*`` with ``wchar_t*`` ----------------------------------------- We can replace ``Py_UNICODE`` to ``wchar_t``. Pros: * We have more consistent API set. * Backward compatible. Cons: * We have more public APIs to maintain. * They are inefficient on platforms ``wchar_t*`` is UTF-16. It is because built-in codecs supports only UCS-1, UCS-2, and UCS-4 input. Rejected ideas ============== Using runtime warning --------------------- These APIs doesn't release GIL for now. Emitting a warning from such APIs is not safe. See this example. .. code-block:: PyObject *u = PyList_GET_ITEM(list, i); // u is borrowed reference. PyObject *b = PyUnicode_EncodeUTF8(PyUnicode_AS_UNICODE(u), PyUnicode_GET_SIZE(u), NULL); // Assumes u is still living reference. PyObject *t = PyTuple_Pack(2, u, b); Py_DECREF(b); return t; If we emit Python warning from ``PyUnicode_EncodeUTF8()``, warning filters and other threads may change the ``list`` and ``u`` can be a dangling reference after ``PyUnicode_EncodeUTF8()`` returned. Discussions =========== * `Plan to remove Py_UNICODE APis except PEP 623 `_ * `bpo-41123: Remove Py_UNICODE APIs except PEP 623: `_ Objections ---------- * Removing these APIs removes ability to use codec without temporary Unicode. * Codecs can not encode Unicode buffer directly without temporary Unicode object since Python 3.3. All these APIs creates temporary Unicode object for now. So removing them doesn't reduce any abilities. * Why not remove decoder APIs too? * They are part of stable ABI. * ``PyUnicode_DecodeASCII()`` and ``PyUnicode_DecodeUTF8()`` are used very widely. Deprecating them is not worth enough. * Decoder APIs can decode from byte buffer directly, without creating temporary bytes object. On the other hand, encoder APIs can not avoid temporary Unicode object. References ========== .. [1] Source package list chosen from top 4000 PyPI packages. (https://github.com/methane/notes/blob/master/2020/wchar-cache/package_list.txt) .. [2] pyodbc -- Don't use PyUnicode_Encode API #792 (https://github.com/mkleehammer/pyodbc/pull/792) .. [3] Uncomment Py_DEPRECATED for Py_UNICODE APIs (GH-21318) (https://github.com/python/cpython/commit/9c3840870814493fed62e140cfa43c2883e12181) Copyright ========= This document has been placed in the public domain. .. Local Variables: mode: indented-text indent-tabs-mode: nil sentence-end-double-space: t fill-column: 70 coding: utf-8 End: