PEP: 624
Title: Remove Py_UNICODE encoder APIs
Author: Inada Naoki <songofacandy@gmail.com>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 06-Jul-2020
Python-Version: 3.11
Post-History: 08-Jul-2020


Abstract
========

This PEP proposes to remove deprecated ``Py_UNICODE`` encoder APIs in Python 3.11:

* ``PyUnicode_Encode()``
* ``PyUnicode_EncodeASCII()``
* ``PyUnicode_EncodeLatin1()``
* ``PyUnicode_EncodeUTF7()``
* ``PyUnicode_EncodeUTF8()``
* ``PyUnicode_EncodeUTF16()``
* ``PyUnicode_EncodeUTF32()``
* ``PyUnicode_EncodeUnicodeEscape()``
* ``PyUnicode_EncodeRawUnicodeEscape()``
* ``PyUnicode_EncodeCharmap()``
* ``PyUnicode_TranslateCharmap()``
* ``PyUnicode_EncodeDecimal()``
* ``PyUnicode_TransformDecimalToASCII()``

.. note::

   `PEP 623 <https://www.python.org/dev/peps/pep-0623/>`_ proposes to remove
   Unicode object APIs related to ``Py_UNICODE``. This PEP, on the other
   hand, is not related to the Unicode object. The two PEPs are split
   because they have different motivations and need separate discussions.


Motivation
==========

In general, removing APIs that have been deprecated for a long time
and have few users is a good idea: it not only improves the
maintainability of CPython, but also helps API users and other
Python implementations.


Rationale
=========

Deprecated since Python 3.3
---------------------------

``Py_UNICODE`` and APIs using it have been deprecated since Python 3.3.


Inefficient
-----------

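All of these APIs are implemented using ``PyUnicode_FromWideChar``:
they first create a temporary Unicode object from the ``Py_UNICODE*``
buffer and then encode it. They are therefore inefficient when the
user wants to encode an existing Unicode object.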
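
As a rough illustration of this point (not CPython's actual source),
a caller that really does start from a ``wchar_t*`` buffer can take
the same two steps explicitly with public APIs. The helper name
``encode_utf8_from_wchar()`` is hypothetical; the temporary Unicode
object it creates is the overhead discussed above.

.. code-block:: c

   #define PY_SSIZE_T_CLEAN
   #include <Python.h>
   #include <wchar.h>

   /* Hypothetical helper: encode a wchar_t buffer to UTF-8 bytes.
      Returns a new reference, or NULL with an exception set. */
   static PyObject *
   encode_utf8_from_wchar(const wchar_t *s, Py_ssize_t size)
   {
       /* Step 1: build a temporary Unicode object from the buffer. */
       PyObject *unicode = PyUnicode_FromWideChar(s, size);
       if (unicode == NULL) {
           return NULL;
       }
       /* Step 2: encode the Unicode object, then drop the temporary. */
       PyObject *bytes = PyUnicode_AsUTF8String(unicode);
       Py_DECREF(unicode);
       return bytes;
   }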


Not used widely
---------------

A search of the top 4000 PyPI packages [1]_ found that only pyodbc
uses these APIs, and only the following two:

* ``PyUnicode_EncodeUTF8()``
* ``PyUnicode_EncodeUTF16()``

pyodbc uses these APIs to encode a Unicode object into a bytes object,
so it is easy to fix. [2]_


Alternative APIs
================

There are alternative APIs that accept ``PyObject *unicode`` instead
of ``Py_UNICODE *``. Users can migrate to them.


========================================= ==========================================
Deprecated API                            Alternative APIs
========================================= ==========================================
``PyUnicode_Encode()``                    ``PyUnicode_AsEncodedString()``
``PyUnicode_EncodeASCII()``               ``PyUnicode_AsASCIIString()`` \(1)
``PyUnicode_EncodeLatin1()``              ``PyUnicode_AsLatin1String()`` \(1)
``PyUnicode_EncodeUTF7()``                \(2)
``PyUnicode_EncodeUTF8()``                ``PyUnicode_AsUTF8String()`` \(1)
``PyUnicode_EncodeUTF16()``               ``PyUnicode_AsUTF16String()`` \(3)
``PyUnicode_EncodeUTF32()``               ``PyUnicode_AsUTF32String()`` \(3)
``PyUnicode_EncodeUnicodeEscape()``       ``PyUnicode_AsUnicodeEscapeString()``
``PyUnicode_EncodeRawUnicodeEscape()``    ``PyUnicode_AsRawUnicodeEscapeString()``
``PyUnicode_EncodeCharmap()``             ``PyUnicode_AsCharmapString()`` \(1)
``PyUnicode_TranslateCharmap()``          ``PyUnicode_Translate()``
``PyUnicode_EncodeDecimal()``             \(4)
``PyUnicode_TransformDecimalToASCII()``   \(4)
========================================= ==========================================

Notes:

(1)
   The ``const char *errors`` parameter is missing.

(2)
   There is no public alternative API, but users can use the generic
   ``PyUnicode_AsEncodedString()`` instead.

(3)
   The ``const char *errors`` and ``int byteorder`` parameters are
   missing.

(4)
   There is no direct replacement, but ``Py_UNICODE_TODECIMAL``
   can be used instead. CPython itself uses
   ``_PyUnicode_TransformDecimalAndSpaceToASCII`` for converting
   from Unicode to numbers.
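
The migration described in the table and notes can be sketched as
follows. The helper names ``encode_utf8()`` and ``encode_utf16_le()``
are hypothetical; only the ``PyUnicode_*`` calls are public C API.
Since ``PyUnicode_AsUTF16String()`` takes neither ``errors`` nor
``byteorder``, the sketch assumes that the generic
``PyUnicode_AsEncodedString()`` with an explicit codec name is an
acceptable substitute when a specific byte order is needed.

.. code-block:: c

   #define PY_SSIZE_T_CLEAN
   #include <Python.h>

   /* Hypothetical helper: UTF-8 with the default "strict" handler.
      Returns a new bytes reference, or NULL with an exception set. */
   static PyObject *
   encode_utf8(PyObject *unicode)
   {
       return PyUnicode_AsUTF8String(unicode);
   }

   /* Hypothetical helper: little-endian UTF-16 without a BOM.
      The codec name selects the byte order, and the error handler
      is passed explicitly (NULL means "strict"). */
   static PyObject *
   encode_utf16_le(PyObject *unicode, const char *errors)
   {
       return PyUnicode_AsEncodedString(unicode, "utf-16-le", errors);
   }

The same pattern with the codec name ``"utf-7"`` covers the case in
note (2).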


Plan
====

Python 3.9
----------

Add ``Py_DEPRECATED(3.3)`` to the following APIs. This change has
already been committed [3]_. All other APIs have already been marked
``Py_DEPRECATED(3.3)``.

* ``PyUnicode_EncodeDecimal()``
* ``PyUnicode_TransformDecimalToASCII()``

Document all APIs as "will be removed in version 3.11".


Python 3.11
-----------

The following APIs are removed:

* ``PyUnicode_Encode()``
* ``PyUnicode_EncodeASCII()``
* ``PyUnicode_EncodeLatin1()``
* ``PyUnicode_EncodeUTF7()``
* ``PyUnicode_EncodeUTF8()``
* ``PyUnicode_EncodeUTF16()``
* ``PyUnicode_EncodeUTF32()``
* ``PyUnicode_EncodeUnicodeEscape()``
* ``PyUnicode_EncodeRawUnicodeEscape()``
* ``PyUnicode_EncodeCharmap()``
* ``PyUnicode_TranslateCharmap()``
* ``PyUnicode_EncodeDecimal()``
* ``PyUnicode_TransformDecimalToASCII()``


Alternative ideas
=================

Instead of just removing deprecated APIs, we may be able to reuse
their names with different signatures.


Make some private APIs public
------------------------------

``PyUnicode_EncodeUTF7()`` has no public alternative API.

Some APIs have public alternatives, but these lack the
``const char *errors`` or ``int byteorder`` parameters.

We can rename some private APIs and make them public to cover missing
APIs and parameters.

============================= ================================
 Rename to                     Rename from
============================= ================================
``PyUnicode_EncodeASCII()``    ``_PyUnicode_AsASCIIString()``
``PyUnicode_EncodeLatin1()``   ``_PyUnicode_AsLatin1String()``
``PyUnicode_EncodeUTF7()``     ``_PyUnicode_EncodeUTF7()``
``PyUnicode_EncodeUTF8()``     ``_PyUnicode_AsUTF8String()``
``PyUnicode_EncodeUTF16()``    ``_PyUnicode_EncodeUTF16()``
``PyUnicode_EncodeUTF32()``    ``_PyUnicode_EncodeUTF32()``
============================= ================================

Pros:

* We have a more consistent API set.

Cons:

* We have more public APIs to maintain.
* Existing public APIs are enough for most use cases, and
  ``PyUnicode_AsEncodedString()`` can be used in other cases.


Replace ``Py_UNICODE*`` with ``Py_UCS4*``
-----------------------------------------

We can replace ``Py_UNICODE`` (a typedef of ``wchar_t``) with
``Py_UCS4``. Since the built-in codecs support UCS-4, we don't need
to convert a ``Py_UCS4*`` string to a Unicode object.


Pros:

* We have a more consistent API set.
* Users can encode a UCS-4 string in C without creating a Unicode
  object.

Cons:

* We have more public APIs to maintain.
* Applications that use UTF-8 or UTF-16 cannot use these APIs
  anyway.
* Other Python implementations may not have a built-in codec for UCS-4.
* If we change the Unicode internal representation to UTF-8, we need
  to keep UCS-4 support only for these APIs.


Replace ``Py_UNICODE*`` with ``wchar_t*``
-----------------------------------------

We can replace ``Py_UNICODE`` with ``wchar_t``.

Pros:

* We have a more consistent API set.
* Backward compatible.

Cons:

* We have more public APIs to maintain.
* They are inefficient on platforms where ``wchar_t*`` is UTF-16,
  because the built-in codecs support only UCS-1, UCS-2, and UCS-4
  input.
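
To illustrate the last point: when ``wchar_t`` is 16 bits wide, a
buffer may contain surrogate pairs, and these have to be combined into
full code points before a codec that expects UCS-1, UCS-2, or UCS-4
input can consume it. The following is a minimal standalone sketch of
that extra conversion step, not CPython's implementation. The function
name ``utf16_to_ucs4()`` is hypothetical, and lone surrogates are
simply copied through.

.. code-block:: c

   #include <stddef.h>
   #include <stdint.h>

   /* Hypothetical helper: convert `len` UTF-16 code units in `in` to
      UCS-4 code points in `out`, which must have room for `len`
      entries. Returns the number of code points written. */
   static size_t
   utf16_to_ucs4(const uint16_t *in, size_t len, uint32_t *out)
   {
       size_t n = 0;
       for (size_t i = 0; i < len; i++) {
           uint32_t ch = in[i];
           /* A high surrogate followed by a low surrogate encodes
              one code point above U+FFFF. */
           if (ch >= 0xD800 && ch <= 0xDBFF && i + 1 < len
               && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
               ch = 0x10000 + ((ch - 0xD800) << 10)
                    + (in[i + 1] - 0xDC00);
               i++;  /* skip the low surrogate just consumed */
           }
           out[n++] = ch;  /* lone surrogates are copied unchanged */
       }
       return n;
   }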


Rejected ideas
==============

Using runtime warning
---------------------

These APIs currently don't release the GIL. Emitting a warning from
such APIs is not safe. See this example.

.. code-block:: c

   PyObject *u = PyList_GET_ITEM(list, i);  // u is a borrowed reference.
   PyObject *b = PyUnicode_EncodeUTF8(PyUnicode_AS_UNICODE(u),
           PyUnicode_GET_SIZE(u), NULL);
   // Assumes u is still a live reference.
   PyObject *t = PyTuple_Pack(2, u, b);
   Py_DECREF(b);
   return t;

If we emit a Python warning from ``PyUnicode_EncodeUTF8()``, warning
filters and other threads may change ``list``, and ``u`` can be
a dangling reference after ``PyUnicode_EncodeUTF8()`` returns.

Additionally, since we are not changing behavior but removing C APIs,
a runtime ``DeprecationWarning`` might not be helpful for Python
developers. We should warn extension developers instead.
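
For comparison, calling code can avoid the dangling-reference hazard
in the example above by holding a strong reference for the duration
of the call. The following sketch uses the public replacement
``PyUnicode_AsUTF8String()``. The helper name ``pack_with_utf8()`` is
hypothetical, and the code is an illustration rather than something
taken from CPython or pyodbc.

.. code-block:: c

   #define PY_SSIZE_T_CLEAN
   #include <Python.h>

   /* Hypothetical helper: return the tuple (item, item as UTF-8 bytes)
      for list[i]. Assumes the caller has checked that `list` is a
      list and that `i` is in range. */
   static PyObject *
   pack_with_utf8(PyObject *list, Py_ssize_t i)
   {
       PyObject *u = PyList_GET_ITEM(list, i);  /* borrowed reference */
       Py_INCREF(u);              /* keep u alive across the call */
       PyObject *b = PyUnicode_AsUTF8String(u);
       if (b == NULL) {
           Py_DECREF(u);
           return NULL;
       }
       PyObject *t = PyTuple_Pack(2, u, b);  /* takes its own references */
       Py_DECREF(b);
       Py_DECREF(u);
       return t;                  /* NULL if PyTuple_Pack() failed */
   }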


Deprecate ``PyUnicode_Decode*`` APIs too
----------------------------------------

In addition to removing the ``PyUnicode_Encode*()`` APIs, also
deprecate the following APIs, for symmetry and to reduce the number
of APIs.

* ``PyUnicode_DecodeASCII()``
* ``PyUnicode_DecodeLatin1()``
* ``PyUnicode_DecodeUTF7()``
* ``PyUnicode_DecodeUTF8()``
* ``PyUnicode_DecodeUTF16()``
* ``PyUnicode_DecodeUTF32()``
* ``PyUnicode_DecodeUnicodeEscape()``
* ``PyUnicode_DecodeRawUnicodeEscape()``
* ``PyUnicode_DecodeCharmap()``
* ``PyUnicode_DecodeMBCS()``

This idea is excluded from this PEP for several reasons:

* We cannot remove them anytime soon because they are part of the
  stable ABI.

* ``PyUnicode_DecodeASCII()`` and ``PyUnicode_DecodeUTF8()`` are used
  very widely. Deprecating them is not worth the cost.

* Decoding from ``const char*`` is independent of the Unicode
  representation.

  * ``PyUnicode_Decode*()`` APIs are useful for applications and
    extensions using UTF-8 or Python Unicode objects to store Unicode
    strings. But ``PyUnicode_Encode*()`` APIs are not useful for them.

  * Python implementations using UTF-8 for the Unicode internal
    representation (e.g. PyPy and MicroPython) may not have encoders
    with ``wchar_t*`` or UCS-4 input. But decoding from ``char*``
    is very natural for them too.


Discussions
===========

* `Plan to remove Py_UNICODE APis except PEP 623
  <https://mail.python.org/archives/list/python-dev@python.org/thread/S7KW2U6IGXZFBMGS6WSJB26NZIBW4OLE/#S7KW2U6IGXZFBMGS6WSJB26NZIBW4OLE>`_
* `bpo-41123: Remove Py_UNICODE APIs except PEP 623: <https://bugs.python.org/issue41123>`_


References
==========

.. [1] Source package list chosen from top 4000 PyPI packages.
   (https://github.com/methane/notes/blob/master/2020/wchar-cache/package_list.txt)

.. [2] pyodbc -- Don't use PyUnicode_Encode API #792
   (https://github.com/mkleehammer/pyodbc/pull/792)

.. [3] Uncomment Py_DEPRECATED for Py_UNICODE APIs (GH-21318)
   (https://github.com/python/cpython/commit/9c3840870814493fed62e140cfa43c2883e12181)


Copyright
=========

This document has been placed in the public domain.

..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End: