PEP: 623 Title: Remove wstr from Unicode Author: Inada Naoki BDFL-Delegate: Victor Stinner Status: Accepted Type: Standards Track Content-Type: text/x-rst Created: 25-Jun-2020 Python-Version: 3.10 Abstract ======== PEP 393 deprecated some unicode APIs, and introduced ``wchar_t *wstr``, and ``Py_ssize_t wstr_length`` in the Unicode structure to support these deprecated APIs. [1]_ This PEP is planning removal of ``wstr``, and ``wstr_length`` with deprecated APIs using these members by Python 3.12. Deprecated APIs which doesn't use the members are out of scope because they can be removed independently. Motivation ========== Memory usage ------------ ``str`` is one of the most used types in Python. Even most simple ASCII strings have a ``wstr`` member. It consumes 8 bytes per string on 64-bit systems. Runtime overhead ---------------- To support legacy Unicode object, many Unicode APIs must call ``PyUnicode_READY()``. We can remove this overhead too by dropping support of legacy Unicode object. Simplicity ---------- Supporting legacy Unicode object makes the Unicode implementation more complex. Until we drop legacy Unicode object, it is very hard to try other Unicode implementation like UTF-8 based implementation in PyPy. Rationale ========= Python 4.0 is not scheduled yet ------------------------------- PEP 393 introduced efficient internal representation of Unicode and removed border between "narrow" and "wide" build of Python. PEP 393 was implemented in Python 3.3 which is released in 2012. Old APIs were deprecated since then, and the removal was scheduled in Python 4.0. Python 4.0 was expected as next version of Python 3.9 when PEP 393 was accepted. But the next version of Python 3.9 is Python 3.10, not 4.0. This is why this PEP schedule the removal plan again. Python 2 reached EOL -------------------- Since Python 2 didn't have PEP 393 Unicode implementation, legacy APIs might help C extensiom modules supporting both of Python 2 and 3. But Python 2 reached the EOL in 2020. We can remove legacy APIs kept for compatibility with Python 2. Plan ==== Python 3.9 ---------- These macros and functions are marked as deprecated, using ``Py_DEPRECATED`` macro. * ``Py_UNICODE_WSTR_LENGTH()`` * ``PyUnicode_GET_SIZE()`` * ``PyUnicode_GetSize()`` * ``PyUnicode_GET_DATA_SIZE()`` * ``PyUnicode_AS_UNICODE()`` * ``PyUnicode_AS_DATA()`` * ``PyUnicode_AsUnicode()`` * ``_PyUnicode_AsUnicode()`` * ``PyUnicode_AsUnicodeAndSize()`` * ``PyUnicode_FromUnicode()`` Python 3.10 ----------- * Following macros, enum members are marked as deprecated. ``Py_DEPRECATED(3.10)`` macro are used as possible. But they are deprecated only in comment and document if the macro can not be used easily. * ``PyUnicode_WCHAR_KIND`` * ``PyUnicode_READY()`` * ``PyUnicode_IS_READY()`` * ``PyUnicode_IS_COMPACT()`` * ``PyUnicode_FromUnicode(NULL, size)`` and ``PyUnicode_FromStringAndSize(NULL, size)`` emit ``DeprecationWarning`` when ``size > 0``. * ``PyArg_ParseTuple()`` and ``PyArg_ParseTupleAndKeywords()`` emit ``DeprecationWarning`` when ``u``, ``u#``, ``Z``, and ``Z#`` formats are used. Python 3.12 ----------- * Following members are removed from the Unicode structures: * ``wstr`` * ``wstr_length`` * ``state.compact`` * ``state.ready`` * The ``PyUnicodeObject`` structure is removed. * Following macros and functions, and enum members are removed: * ``Py_UNICODE_WSTR_LENGTH()`` * ``PyUnicode_GET_SIZE()`` * ``PyUnicode_GetSize()`` * ``PyUnicode_GET_DATA_SIZE()`` * ``PyUnicode_AS_UNICODE()`` * ``PyUnicode_AS_DATA()`` * ``PyUnicode_AsUnicode()`` * ``_PyUnicode_AsUnicode()`` * ``PyUnicode_AsUnicodeAndSize()`` * ``PyUnicode_FromUnicode()`` * ``PyUnicode_WCHAR_KIND`` * ``PyUnicode_READY()`` * ``PyUnicode_IS_READY()`` * ``PyUnicode_IS_COMPACT()`` * ``PyUnicode_FromStringAndSize(NULL, size))`` raises ``RuntimeError`` when ``size > 0``. * ``PyArg_ParseTuple()`` and ``PyArg_ParseTupleAndKeywords()`` raise ``SystemError`` when ``u``, ``u#``, ``Z``, and ``Z#`` formats are used, as other unsupported format character. Discussion ========== * `Draft PEP: Remove wstr from Unicode `_ * `When can we remove wchar_t* cache from string? `_ * `PEP 623: Remove wstr from Unicode object #1462 `_ References ========== * `bpo-38604: Schedule Py_UNICODE API removal `_ * `bpo-36346: Prepare for removing the legacy Unicode C API `_ * `bpo-30863: Rewrite PyUnicode_AsWideChar() and PyUnicode_AsWideCharString() `_: They no longer cache the ``wchar_t*`` representation of string objects. .. [1] PEP 393 -- Flexible String Representation (https://www.python.org/dev/peps/pep-0393/) Copyright ========= This document has been placed in the public domain.