PEP: 624
Title: Remove Py_UNICODE encoder APIs
Author: Inada Naoki <songofacandy@gmail.com>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 06-Jul-2020
Python-Version: 3.11
Post-History: 08-Jul-2020


Abstract
========

This PEP proposes to remove the deprecated ``Py_UNICODE`` encoder APIs in Python 3.11:

* ``PyUnicode_Encode()``
* ``PyUnicode_EncodeASCII()``
* ``PyUnicode_EncodeLatin1()``
* ``PyUnicode_EncodeUTF7()``
* ``PyUnicode_EncodeUTF8()``
* ``PyUnicode_EncodeUTF16()``
* ``PyUnicode_EncodeUTF32()``
* ``PyUnicode_EncodeUnicodeEscape()``
* ``PyUnicode_EncodeRawUnicodeEscape()``
* ``PyUnicode_EncodeCharmap()``
* ``PyUnicode_TranslateCharmap()``
* ``PyUnicode_EncodeDecimal()``
* ``PyUnicode_TransformDecimalToASCII()``

.. note::

   `PEP 623 <https://www.python.org/dev/peps/pep-0623/>`_ proposes to remove
   Unicode object APIs relating to ``Py_UNICODE``.  This PEP, on the other
   hand, does not concern the Unicode object itself.  The two PEPs are split
   because they have different motivations and need different discussions.


Motivation
==========

In general, reducing the number of APIs that have been deprecated for
a long time and have few users is a good idea: it not only improves
the maintainability of CPython, but also helps API users and other
Python implementations.


Rationale
=========

Deprecated since Python 3.3
---------------------------

``Py_UNICODE`` and the APIs using it have been deprecated since Python 3.3.


Inefficient
-----------

All of these APIs are implemented using ``PyUnicode_FromWideChar``:
they first build a temporary Unicode object from the ``Py_UNICODE``
buffer and then encode that object.  So these APIs are inefficient
when the user wants to encode a Unicode object they already have.


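For callers that really do start from a ``wchar_t`` buffer, the same
two steps can be spelled out with APIs that are not deprecated.  The
following is a minimal sketch, not CPython's implementation; the helper
name ``encode_wide_as_utf8`` is hypothetical.

.. code-block:: c

   #include <Python.h>

   /* Hypothetical helper: encode a wchar_t buffer as a UTF-8 bytes object.
    * Returns a new reference, or NULL with an exception set. */
   static PyObject *
   encode_wide_as_utf8(const wchar_t *buf, Py_ssize_t len)
   {
       /* Step 1: build the temporary Unicode object explicitly. */
       PyObject *unicode = PyUnicode_FromWideChar(buf, len);
       if (unicode == NULL) {
           return NULL;
       }
       /* Step 2: encode it with the PyObject-based API. */
       PyObject *bytes = PyUnicode_AsUTF8String(unicode);
       Py_DECREF(unicode);
       return bytes;
   }

A caller that already holds a Unicode object can skip step 1 and call
``PyUnicode_AsUTF8String()`` directly, which avoids the extra copy.
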
Not used widely
---------------

A search of the top 4000 PyPI packages [1]_ found that only pyodbc
uses these APIs:

* ``PyUnicode_EncodeUTF8()``
* ``PyUnicode_EncodeUTF16()``

pyodbc uses these APIs to encode a Unicode object into a bytes object,
so it is easy to fix. [2]_


Alternative APIs
================

Alternative APIs exist that accept ``PyObject *unicode`` instead of
``Py_UNICODE *``.  Users can migrate to them.


========================================= ==========================================
Deprecated API                            Alternative APIs
========================================= ==========================================
``PyUnicode_Encode()``                    ``PyUnicode_AsEncodedString()``
``PyUnicode_EncodeASCII()``               ``PyUnicode_AsASCIIString()`` \(1)
``PyUnicode_EncodeLatin1()``              ``PyUnicode_AsLatin1String()`` \(1)
``PyUnicode_EncodeUTF7()``                \(2)
``PyUnicode_EncodeUTF8()``                ``PyUnicode_AsUTF8String()`` \(1)
``PyUnicode_EncodeUTF16()``               ``PyUnicode_AsUTF16String()`` \(3)
``PyUnicode_EncodeUTF32()``               ``PyUnicode_AsUTF32String()`` \(3)
``PyUnicode_EncodeUnicodeEscape()``       ``PyUnicode_AsUnicodeEscapeString()``
``PyUnicode_EncodeRawUnicodeEscape()``    ``PyUnicode_AsRawUnicodeEscapeString()``
``PyUnicode_EncodeCharmap()``             ``PyUnicode_AsCharmapString()`` \(1)
``PyUnicode_TranslateCharmap()``          ``PyUnicode_Translate()``
``PyUnicode_EncodeDecimal()``             \(4)
``PyUnicode_TransformDecimalToASCII()``   \(4)
========================================= ==========================================

Notes:

(1)
   The ``const char *errors`` parameter is missing.

(2)
   There is no public alternative API, but users can use the generic
   ``PyUnicode_AsEncodedString()`` instead.

(3)
   The ``const char *errors, int byteorder`` parameters are missing.

(4)
   There is no direct replacement, but ``Py_UNICODE_TODECIMAL``
   can be used instead.  CPython uses
   ``_PyUnicode_TransformDecimalAndSpaceToASCII`` for converting
   from Unicode to numbers.


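As an illustration of the migration, the sketch below shows two
hypothetical helpers (``to_utf8_bytes`` and ``to_utf16le_bytes`` are
names invented here).  The first covers the common case where the
default error handling is enough.  The second assumes that a missing
``errors`` or ``byteorder`` parameter can be recovered by going
through the generic ``PyUnicode_AsEncodedString()`` and selecting the
byte order with the codec name.

.. code-block:: c

   #include <Python.h>

   /* Hypothetical helper: stands in for a call such as
    *   PyUnicode_EncodeUTF8(PyUnicode_AS_UNICODE(u),
    *                        PyUnicode_GET_SIZE(u), NULL)
    * when no custom error handler is needed.
    * Returns a new reference, or NULL with an exception set. */
   static PyObject *
   to_utf8_bytes(PyObject *unicode)
   {
       return PyUnicode_AsUTF8String(unicode);
   }

   /* Hypothetical helper: little-endian UTF-16 without a BOM, with a
    * caller-chosen error handler ("strict", "replace", ...).  The byte
    * order is chosen by the codec name rather than an int argument. */
   static PyObject *
   to_utf16le_bytes(PyObject *unicode, const char *errors)
   {
       return PyUnicode_AsEncodedString(unicode, "utf-16-le", errors);
   }

Both helpers take the Unicode object itself, so no ``Py_UNICODE``
buffer is ever requested from it.
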
Plan
====

Remove these APIs in Python 3.11.  They have been deprecated already.

* ``PyUnicode_Encode()``
* ``PyUnicode_EncodeASCII()``
* ``PyUnicode_EncodeLatin1()``
* ``PyUnicode_EncodeUTF7()``
* ``PyUnicode_EncodeUTF8()``
* ``PyUnicode_EncodeUTF16()``
* ``PyUnicode_EncodeUTF32()``
* ``PyUnicode_EncodeUnicodeEscape()``
* ``PyUnicode_EncodeRawUnicodeEscape()``
* ``PyUnicode_EncodeCharmap()``
* ``PyUnicode_TranslateCharmap()``
* ``PyUnicode_EncodeDecimal()``
* ``PyUnicode_TransformDecimalToASCII()``


Alternative Ideas
=================

Replace ``Py_UNICODE*`` with ``PyObject*``
------------------------------------------

As described in the "Alternative APIs" section, some APIs don't have
public alternative APIs accepting ``PyObject *unicode`` input.
And some public alternative APIs have restrictions, such as missing
``errors`` and ``byteorder`` parameters.

Instead of removing the deprecated APIs, we can reuse their names for
alternative public APIs.

Since we already have private alternative APIs, this is just a matter
of renaming them from their private names to the public, deprecated
names.

============================= ================================
Rename to                     Rename from
============================= ================================
``PyUnicode_EncodeASCII()``   ``_PyUnicode_AsASCIIString()``
``PyUnicode_EncodeLatin1()``  ``_PyUnicode_AsLatin1String()``
``PyUnicode_EncodeUTF7()``    ``_PyUnicode_EncodeUTF7()``
``PyUnicode_EncodeUTF8()``    ``_PyUnicode_AsUTF8String()``
``PyUnicode_EncodeUTF16()``   ``_PyUnicode_EncodeUTF16()``
``PyUnicode_EncodeUTF32()``   ``_PyUnicode_EncodeUTF32()``
============================= ================================

Pros:

* We have a more consistent API set.

Cons:

* Backward incompatible.
* We have more public APIs to maintain for rare use cases.
* Existing public APIs are enough for most use cases, and
  ``PyUnicode_AsEncodedString()`` can be used in other cases.


Replace ``Py_UNICODE*`` with ``Py_UCS4*``
-----------------------------------------

We can replace ``Py_UNICODE`` with ``Py_UCS4`` and undeprecate
these APIs.

The UTF-8, UTF-16, and UTF-32 encoders support ``Py_UCS4`` internally.
So ``PyUnicode_EncodeUTF8()``, ``PyUnicode_EncodeUTF16()``, and
``PyUnicode_EncodeUTF32()`` can avoid creating a temporary Unicode
object.

Pros:

* We can avoid creating a temporary Unicode object when encoding from
  ``Py_UCS4*`` into a bytes object with the UTF-8, UTF-16, and UTF-32
  codecs.

Cons:

* Backward incompatible.
* We have more public APIs to maintain for rare use cases.
* Other Python implementations that want to support Python/C API need
  to support these APIs too.
* If we change the Unicode internal representation to UTF-8 in the
  future, we need to keep UCS-4 support only for these APIs.


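To make "encoding directly from a ``Py_UCS4`` buffer" concrete, the
following standalone sketch walks a buffer of code points and writes
UTF-8 bytes with no intermediate object.  It is an illustration only
and not CPython's encoder: ``ucs4_t`` is a hypothetical stand-in for
``Py_UCS4`` so the code builds without ``Python.h``, the function name
``ucs4_to_utf8`` is invented here, and error handling is reduced to
rejecting surrogates and out-of-range values.

.. code-block:: c

   #include <stddef.h>
   #include <stdint.h>

   /* Hypothetical stand-in for Py_UCS4. */
   typedef uint32_t ucs4_t;

   /* Encode `len` code points from `src` into `dst` as UTF-8.
    * `dst` must have room for 4 * len bytes.
    * Returns the number of bytes written, or (size_t)-1 if a
    * surrogate or a value above U+10FFFF is found. */
   static size_t
   ucs4_to_utf8(const ucs4_t *src, size_t len, unsigned char *dst)
   {
       size_t n = 0;
       for (size_t i = 0; i < len; i++) {
           ucs4_t ch = src[i];
           if (ch < 0x80) {                      /* 1 byte */
               dst[n++] = (unsigned char)ch;
           }
           else if (ch < 0x800) {                /* 2 bytes */
               dst[n++] = (unsigned char)(0xC0 | (ch >> 6));
               dst[n++] = (unsigned char)(0x80 | (ch & 0x3F));
           }
           else if (ch < 0x10000) {              /* 3 bytes */
               if (ch >= 0xD800 && ch <= 0xDFFF) {
                   return (size_t)-1;            /* lone surrogate */
               }
               dst[n++] = (unsigned char)(0xE0 | (ch >> 12));
               dst[n++] = (unsigned char)(0x80 | ((ch >> 6) & 0x3F));
               dst[n++] = (unsigned char)(0x80 | (ch & 0x3F));
           }
           else if (ch <= 0x10FFFF) {            /* 4 bytes */
               dst[n++] = (unsigned char)(0xF0 | (ch >> 18));
               dst[n++] = (unsigned char)(0x80 | ((ch >> 12) & 0x3F));
               dst[n++] = (unsigned char)(0x80 | ((ch >> 6) & 0x3F));
               dst[n++] = (unsigned char)(0x80 | (ch & 0x3F));
           }
           else {
               return (size_t)-1;                /* not a code point */
           }
       }
       return n;
   }

The loop needs only the fixed-width input buffer, which is why an API
taking ``Py_UCS4*`` could in principle skip the temporary Unicode
object.
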
Replace ``Py_UNICODE*`` with ``wchar_t*``
-----------------------------------------

We can replace ``Py_UNICODE`` with ``wchar_t``.  Since ``Py_UNICODE``
is already a typedef of ``wchar_t``, this is the status quo.

On platforms where ``sizeof(wchar_t) == 4``, we can avoid creating a
temporary Unicode object when encoding from ``wchar_t*`` to bytes
objects using the UTF-8, UTF-16, and UTF-32 codecs, like the "Replace
``Py_UNICODE*`` with ``Py_UCS4*``" idea.

Pros:

* Backward compatible.
* We can avoid creating a temporary Unicode object when encoding from
  ``wchar_t*`` into a bytes object with the UTF-8, UTF-16, and UTF-32
  codecs on platforms where ``sizeof(wchar_t) == 4``.

Cons:

* Although Windows is the major platform that uses ``wchar_t``
  heavily, these APIs always need to create a temporary Unicode object
  there because ``sizeof(wchar_t) == 2`` on Windows.
* We have more public APIs to maintain for rare use cases.
* Other Python implementations that want to support Python/C API need
  to support these APIs too.
* If we change the Unicode internal representation to UTF-8 in the
  future, we need to keep UCS-4 support only for these APIs.


Rejected Ideas
==============

Emit runtime warning
--------------------

In addition to the existing compiler warning, emitting a runtime
``DeprecationWarning`` was suggested.

But these APIs do not release the GIL for now, and emitting a warning
from such APIs is not safe.  See this example.

.. code-block:: c

   PyObject *u = PyList_GET_ITEM(list, i);  // u is a borrowed reference.
   PyObject *b = PyUnicode_EncodeUTF8(PyUnicode_AS_UNICODE(u),
           PyUnicode_GET_SIZE(u), NULL);
   // Assumes u is still a live reference.
   PyObject *t = PyTuple_Pack(2, u, b);
   Py_DECREF(b);
   return t;

If we emit a Python warning from ``PyUnicode_EncodeUTF8()``, warning
filters and other threads may change the ``list``, and ``u`` can be
a dangling reference after ``PyUnicode_EncodeUTF8()`` returns.


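For comparison, code that does not rely on the callee never running
Python code usually takes a strong reference before the call.  The
sketch below is one way to write that; the helper name
``pair_with_utf8`` is hypothetical, and it uses the non-deprecated
``PyUnicode_AsUTF8String()`` rather than ``PyUnicode_EncodeUTF8()``.
Existing callers of the deprecated APIs cannot be assumed to follow
this pattern.

.. code-block:: c

   #include <Python.h>

   /* Hypothetical helper: return a (str, bytes) tuple for list[i].
    * Returns a new reference, or NULL with an exception set. */
   static PyObject *
   pair_with_utf8(PyObject *list, Py_ssize_t i)
   {
       PyObject *u = PyList_GetItem(list, i);  /* borrowed reference */
       if (u == NULL) {
           return NULL;
       }
       /* Own the item before calling anything that might run Python
        * code and mutate the list. */
       Py_INCREF(u);

       PyObject *b = PyUnicode_AsUTF8String(u);
       if (b == NULL) {
           Py_DECREF(u);
           return NULL;
       }
       PyObject *t = PyTuple_Pack(2, u, b);  /* tuple takes its own refs */
       Py_DECREF(b);
       Py_DECREF(u);
       return t;
   }

With the extra reference held, ``u`` stays alive even if the list
drops its own reference during the call.
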
Discussions
===========

* `[python-dev] Plan to remove Py_UNICODE APis except PEP 623
  <https://mail.python.org/archives/list/python-dev@python.org/thread/S7KW2U6IGXZFBMGS6WSJB26NZIBW4OLE/#S7KW2U6IGXZFBMGS6WSJB26NZIBW4OLE>`_
* `bpo-41123: Remove Py_UNICODE APIs except PEP 623
  <https://bugs.python.org/issue41123>`_
* `[python-dev] PEP 624: Remove Py_UNICODE encoder APIs
  <https://mail.python.org/archives/list/python-dev@python.org/thread/THXVM7FZVT56B7CPEDIYKJG6VMAYIEK5/#QUGBVLQNBFVNX25AEIL77WSFOHQES6LJ>`_


Objections
----------

* Removing these APIs removes the ability to use codecs without a
  temporary Unicode object.

  * Since Python 3.3, codecs cannot encode a Unicode buffer directly
    without a temporary Unicode object.  All of these APIs currently
    create a temporary Unicode object, so removing them does not
    reduce any abilities.

* Why not remove decoder APIs too?

  * They are part of the stable ABI.
  * ``PyUnicode_DecodeASCII()`` and ``PyUnicode_DecodeUTF8()`` are
    used very widely.  Deprecating them is not worthwhile.
  * Decoder APIs can decode from a byte buffer directly, without
    creating a temporary bytes object.  Encoder APIs, on the other
    hand, cannot avoid a temporary Unicode object.


References
==========

.. [1] Source package list chosen from top 4000 PyPI packages.
       (https://github.com/methane/notes/blob/master/2020/wchar-cache/package_list.txt)

.. [2] pyodbc -- Don't use PyUnicode_Encode API #792
       (https://github.com/mkleehammer/pyodbc/pull/792)


Copyright
=========

This document is placed in the public domain or under the
CC0-1.0-Universal license, whichever is more permissive.


..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End: