PEP: 623
Title: Remove wstr from Unicode
Author: Inada Naoki <songofacandy@gmail.com>
BDFL-Delegate: Victor Stinner <vstinner@python.org>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 25-Jun-2020
Python-Version: 3.10

Abstract
========

PEP 393 deprecated some Unicode APIs and introduced ``wchar_t *wstr``
and ``Py_ssize_t wstr_length`` in the Unicode structure to support
these deprecated APIs. [1]_

This PEP plans the removal of ``wstr`` and ``wstr_length``, together with
the deprecated APIs that use these members, by Python 3.12.

Deprecated APIs that do not use these members are out of scope because
they can be removed independently.


Motivation
==========

Memory usage
------------

``str`` is one of the most used types in Python. Even the simplest ASCII
strings have a ``wstr`` member. It consumes 8 bytes per string on 64-bit
systems.

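The per-string cost adds up over a large number of strings. The following
is a rough sketch of that arithmetic, not a measurement. It assumes the
compact layout described in PEP 393, where every ``str`` object carries the
``wstr`` pointer and non-ASCII strings additionally carry ``wstr_length``,
and it assumes ``Py_ssize_t`` has the same size as a pointer. The helper
name ``estimate_wstr_overhead`` is hypothetical.

```python
import struct

# Pointer size of the running build (8 on typical 64-bit systems).
# Assumption: Py_ssize_t has the same size as a pointer.
WORD_SIZE = struct.calcsize("P")


def estimate_wstr_overhead(strings, word_size=WORD_SIZE):
    """Estimate the bytes spent on wstr and wstr_length for these strings."""
    total = 0
    for s in strings:
        total += word_size      # wstr pointer, present in every str object
        if not s.isascii():
            total += word_size  # wstr_length, assumed only in non-ASCII strings
    return total


# One million short ASCII strings on a 64-bit build:
print(estimate_wstr_overhead(["x"] * 1_000_000, word_size=8))
```

The estimate ignores allocator rounding, so the real saving for a given
program may differ.
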
Runtime overhead
----------------

To support legacy Unicode objects, many Unicode APIs must call
``PyUnicode_READY()``.

We can remove this overhead too by dropping support for legacy Unicode
objects.

Simplicity
----------

Supporting legacy Unicode objects makes the Unicode implementation more
complex.
Until we drop legacy Unicode objects, it is very hard to try other
Unicode implementations, such as the UTF-8 based implementation in PyPy.


Rationale
=========

Python 4.0 is not scheduled yet
-------------------------------

PEP 393 introduced an efficient internal representation of Unicode and
removed the border between "narrow" and "wide" builds of Python.

PEP 393 was implemented in Python 3.3, which was released in 2012. The old
APIs have been deprecated since then, and their removal was scheduled for
Python 4.0.

Python 4.0 was expected to be the version after Python 3.9 when PEP 393
was accepted. But the version after Python 3.9 is Python 3.10,
not 4.0. This is why this PEP schedules the removal again.

Python 2 reached EOL
--------------------

Since Python 2 didn't have the PEP 393 Unicode implementation, the legacy
APIs might have helped C extension modules support both Python 2 and 3.

But Python 2 reached EOL in 2020. We can remove the legacy APIs kept
for compatibility with Python 2.


Plan
====

Python 3.9
----------

These macros and functions are marked as deprecated, using the
``Py_DEPRECATED`` macro.

* ``Py_UNICODE_WSTR_LENGTH()``
* ``PyUnicode_GET_SIZE()``
* ``PyUnicode_GetSize()``
* ``PyUnicode_GET_DATA_SIZE()``
* ``PyUnicode_AS_UNICODE()``
* ``PyUnicode_AS_DATA()``
* ``PyUnicode_AsUnicode()``
* ``_PyUnicode_AsUnicode()``
* ``PyUnicode_AsUnicodeAndSize()``
* ``PyUnicode_FromUnicode()``

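Code that uses the functions above can usually move to APIs that do not
depend on the ``wstr`` cache, such as ``PyUnicode_GetLength()``,
``PyUnicode_AsWideCharString()`` and ``PyUnicode_FromWideChar()``. The
following sketch exercises those three functions from Python through
``ctypes.pythonapi`` so that it can be run without building an extension;
a C extension would call the same functions directly. It assumes CPython,
and the helper name ``roundtrip_through_wchar`` is hypothetical.

```python
import ctypes

# The running interpreter's C API; calls are made with the GIL held.
api = ctypes.pythonapi

# Py_ssize_t PyUnicode_GetLength(PyObject *unicode)
api.PyUnicode_GetLength.argtypes = [ctypes.py_object]
api.PyUnicode_GetLength.restype = ctypes.c_ssize_t

# wchar_t *PyUnicode_AsWideCharString(PyObject *unicode, Py_ssize_t *size)
api.PyUnicode_AsWideCharString.argtypes = [
    ctypes.py_object,
    ctypes.POINTER(ctypes.c_ssize_t),
]
api.PyUnicode_AsWideCharString.restype = ctypes.c_void_p

# PyObject *PyUnicode_FromWideChar(const wchar_t *wstr, Py_ssize_t size)
api.PyUnicode_FromWideChar.argtypes = [ctypes.c_void_p, ctypes.c_ssize_t]
api.PyUnicode_FromWideChar.restype = ctypes.py_object

# void PyMem_Free(void *p)
api.PyMem_Free.argtypes = [ctypes.c_void_p]
api.PyMem_Free.restype = None


def roundtrip_through_wchar(text):
    """Copy text into a new wchar_t buffer and build a str from it again."""
    size = ctypes.c_ssize_t()
    # The buffer is freshly allocated and owned by the caller; it is not
    # cached on the string object.
    buf = api.PyUnicode_AsWideCharString(text, ctypes.byref(size))
    try:
        return api.PyUnicode_FromWideChar(buf, size.value)
    finally:
        api.PyMem_Free(buf)  # the caller must release the buffer
```

Unlike ``PyUnicode_AsUnicode()``, the buffer returned by
``PyUnicode_AsWideCharString()`` belongs to the caller, which is why the
sketch releases it with ``PyMem_Free()``.
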
Python 3.10
-----------

* The following macros and enum members are marked as deprecated.
  The ``Py_DEPRECATED(3.10)`` macro is used where possible. Where the
  macro cannot be used easily, they are deprecated only in comments
  and documentation.

  * ``PyUnicode_WCHAR_KIND``
  * ``PyUnicode_READY()``
  * ``PyUnicode_IS_READY()``
  * ``PyUnicode_IS_COMPACT()``

* ``PyUnicode_FromUnicode(NULL, size)`` and
  ``PyUnicode_FromStringAndSize(NULL, size)`` emit
  ``DeprecationWarning`` when ``size > 0``.

* ``PyArg_ParseTuple()`` and ``PyArg_ParseTupleAndKeywords()`` emit
  ``DeprecationWarning`` when the ``u``, ``u#``, ``Z``, and ``Z#`` formats
  are used.

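``DeprecationWarning`` is easy to miss, so a test suite may want to turn
it into an error while exercising an extension module. The sketch below
shows one way to do that with the standard ``warnings`` module. The names
``call_with_deprecations_as_errors`` and ``legacy_stand_in`` are
hypothetical; ``legacy_stand_in`` only imitates an extension function that
triggers one of the warnings listed above.

```python
import warnings


def call_with_deprecations_as_errors(func, *args, **kwargs):
    """Call func, raising DeprecationWarning as an exception if it is emitted."""
    with warnings.catch_warnings():
        warnings.simplefilter("error", DeprecationWarning)
        return func(*args, **kwargs)


def legacy_stand_in():
    """Hypothetical stand-in for an extension function using a deprecated API."""
    warnings.warn("deprecated Unicode API used", DeprecationWarning)
    return "ok"


try:
    call_with_deprecations_as_errors(legacy_stand_in)
except DeprecationWarning as exc:
    print("found deprecated usage:", exc)
```

The filter is restored when the ``with`` block exits, so the rest of the
program keeps its normal warning settings.
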
Python 3.12
-----------

* The following members are removed from the Unicode structures:

  * ``wstr``
  * ``wstr_length``
  * ``state.compact``
  * ``state.ready``

* The ``PyUnicodeObject`` structure is removed.

* The following macros, functions, and enum members are removed:

  * ``Py_UNICODE_WSTR_LENGTH()``
  * ``PyUnicode_GET_SIZE()``
  * ``PyUnicode_GetSize()``
  * ``PyUnicode_GET_DATA_SIZE()``
  * ``PyUnicode_AS_UNICODE()``
  * ``PyUnicode_AS_DATA()``
  * ``PyUnicode_AsUnicode()``
  * ``_PyUnicode_AsUnicode()``
  * ``PyUnicode_AsUnicodeAndSize()``
  * ``PyUnicode_FromUnicode()``
  * ``PyUnicode_WCHAR_KIND``
  * ``PyUnicode_READY()``
  * ``PyUnicode_IS_READY()``
  * ``PyUnicode_IS_COMPACT()``

* ``PyUnicode_FromStringAndSize(NULL, size)`` raises
  ``RuntimeError`` when ``size > 0``.

* ``PyArg_ParseTuple()`` and ``PyArg_ParseTupleAndKeywords()`` raise
  ``SystemError`` when the ``u``, ``u#``, ``Z``, and ``Z#`` formats are
  used, as they do for any other unsupported format character.


Discussion
==========

* `Draft PEP: Remove wstr from Unicode
  <https://mail.python.org/archives/list/python-dev@python.org/thread/BO2TQHSXWL2RJMINWQQRBF5LANDDJNHH/#BO2TQHSXWL2RJMINWQQRBF5LANDDJNHH>`_
* `When can we remove wchar_t* cache from string?
  <https://mail.python.org/archives/list/python-dev@python.org/thread/7JVC3IKS2V73K36ISEJAAWMRFN2T4KKR/#7JVC3IKS2V73K36ISEJAAWMRFN2T4KKR>`_
* `PEP 623: Remove wstr from Unicode object #1462
  <https://github.com/python/peps/pull/1462>`_


References
==========

* `bpo-38604: Schedule Py_UNICODE API removal
  <https://bugs.python.org/issue38604>`_
* `bpo-36346: Prepare for removing the legacy Unicode C API
  <https://bugs.python.org/issue36346>`_
* `bpo-30863: Rewrite PyUnicode_AsWideChar() and
  PyUnicode_AsWideCharString() <https://bugs.python.org/issue30863>`_:
  They no longer cache the ``wchar_t*`` representation of string
  objects.

.. [1] PEP 393 -- Flexible String Representation
   (https://www.python.org/dev/peps/pep-0393/)


Copyright
=========

This document has been placed in the public domain.