python-peps/pep-0623.rst

PEP: 623
Title: Remove wstr from Unicode
Author: Inada Naoki <songofacandy@gmail.com>
BDFL-Delegate: Victor Stinner <vstinner@python.org>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 25-Jun-2020
Python-Version: 3.10


Abstract
========

PEP 393 deprecated some unicode APIs, and introduced ``wchar_t *wstr``,
and ``Py_ssize_t wstr_length`` in the Unicode structure to support
these deprecated APIs. [1]_

This PEP is planning removal of ``wstr``, and ``wstr_length`` with
deprecated APIs using these members by Python 3.12.

Deprecated APIs which doesn't use the members are out of scope because
they can be removed independently.


Motivation
==========

Memory usage
------------

``str`` is one of the most used types in Python. Even most simple ASCII
strings have a ``wstr`` member. It consumes 8 bytes on 64bit systems.


Runtime overhead
----------------

To support legacy Unicode object created by
``PyUnicode_FromUnicode(NULL, length)``, many Unicode APIs has
``PyUnicode_READY()`` check.

When we drop support of legacy unicode object, We can reduce this
overhead too.


Simplicity
----------

Support of legacy Unicode object makes Unicode implementation complex.
Until we drop legacy Unicode object, it is very hard to try other
Unicode implementation like UTF-8 based implementation in PyPy.


Rationale
=========

Python 4.0 is not scheduled yet
-------------------------------

PEP 393 introduced efficient internal representation of Unicode and
removed border between "narrow" and "wide" build of Python.

PEP 393 was implemented in Python 3.3 which is released in 2012. Old
APIs were deprecated since then, and the removal was scheduled in
Python 4.0.

Python 4.0 was expected as next version of Python 3.9 when PEP 393
was accepted. But the next version of Python 3.9 is Python 3.10,
not 4.0. This is why this PEP schedule the removal plan again.


Python 2 reached EOL
--------------------

Since Python 2 didn't have PEP 393 Unicode implementation, legacy
APIs might help C extensiom modules supporting both of Python 2 and 3.

But Python 2 reached the EOL in 2020. We can remove legacy APIs kept
for compatibility with Python 2.


Plan
====

Python 3.9 (current)
--------------------

These macros and functions are marked as deprecated, using
``Py_DEPRECATED`` macro.

* ``Py_UNICODE_WSTR_LENGTH()``
* ``PyUnicode_GET_SIZE()``
* ``PyUnicode_GetSize()``
* ``PyUnicode_GET_DATA_SIZE()``
* ``PyUnicode_AS_UNICODE()``
* ``PyUnicode_AS_DATA()``
* ``PyUnicode_AsUnicode()``
* ``_PyUnicode_AsUnicode()``
* ``PyUnicode_AsUnicodeAndSize()``
* ``PyUnicode_FromUnicode()``


Python 3.10
-----------

* Following macros, enum members will be marked as deprecated.
  ``Py_DEPRECATED(3.10)`` macro will be used as possible. But they
  will be deprecated only in comment and document if the macro can
  not be used easily.

   * ``PyUnicode_WCHAR_KIND``
   * ``PyUnicode_READY()``
   * ``PyUnicode_IS_READY()``
   * ``PyUnicode_IS_COMPACT()``

* ``PyUnicode_FromUnicode(NULL, size)`` and
  ``PyUnicode_FromStringAndSize(NULL, size)`` will emit
  ``DeprecationWarning`` when ``size > 0``.

* ``PyArg_ParseTuple()`` and ``PyArg_ParseTupleAndKeywords()`` will emit
  ``DeprecationWarning`` when ``u``, ``u#``, ``Z``, and ``Z#`` formats are used.


Python 3.12
-----------

* Following members will be removed from the Unicode strucutres:

   * ``wstr``
   * ``wstr_length``
   * ``state.compact``
   * ``state.ready``

* The ``PyUnicodeObject`` struct will be removed.

* Following macros and functions, and enum members will be removed:

   * ``Py_UNICODE_WSTR_LENGTH()``
   * ``PyUnicode_GET_SIZE()``
   * ``PyUnicode_GetSize()``
   * ``PyUnicode_GET_DATA_SIZE()``
   * ``PyUnicode_AS_UNICODE()``
   * ``PyUnicode_AS_DATA()``
   * ``PyUnicode_AsUnicode()``
   * ``_PyUnicode_AsUnicode()``
   * ``PyUnicode_AsUnicodeAndSize()``
   * ``PyUnicode_FromUnicode()``
   * ``PyUnicode_WCHAR_KIND``
   * ``PyUnicode_READY()``
   * ``PyUnicode_IS_READY()``
   * ``PyUnicode_IS_COMPACT()``

* ``PyUnicode_FromStringAndSize(NULL, size))`` will raise
  ``RuntimeError`` when ``size > 0``.

* ``PyArg_ParseTuple()`` and ``PyArg_ParseTupleAndKeywords()`` will raise
  ``SystemError`` when ``u``, ``u#``, ``Z``, and ``Z#`` formats are used,
  as other unsupported format character.


References
==========
A collection of URLs used as references through the PEP.

.. [1] PEP 393 -- Flexible String Representation
       (https://www.python.org/dev/peps/pep-0393/)


Copyright
=========

This document has been placed in the public domain.
PEP 623: Remove wstr from Unicode object (#1462) 2020-06-25 07:16:25 -04:00			`PEP: 623`
			`Title: Remove wstr from Unicode`
			`Author: Inada Naoki <songofacandy@gmail.com>`
PEP 623: Set the BDFL-Delegate (#1468) 2020-06-25 19:02:01 -04:00			`BDFL-Delegate: Victor Stinner <vstinner@python.org>`
PEP 623: Remove wstr from Unicode object (#1462) 2020-06-25 07:16:25 -04:00			`Status: Draft`
			`Type: Standards Track`
			`Content-Type: text/x-rst`
			`Created: 25-Jun-2020`
			`Python-Version: 3.10`


			`Abstract`
			`========`

			PEP 393 deprecated some unicode APIs, and introduced ``wchar_t *wstr``,
			and ``Py_ssize_t wstr_length`` in the Unicode structure to support
			`these deprecated APIs. [1]_`

			This PEP is planning removal of ``wstr``, and ``wstr_length`` with
			`deprecated APIs using these members by Python 3.12.`

			`Deprecated APIs which doesn't use the members are out of scope because`
			`they can be removed independently.`


			`Motivation`
			`==========`

			`Memory usage`
			`------------`

			``str`` is one of the most used types in Python. Even most simple ASCII
			strings have a ``wstr`` member. It consumes 8 bytes on 64bit systems.


			`Runtime overhead`
			`----------------`

			`To support legacy Unicode object created by`
			``PyUnicode_FromUnicode(NULL, length)``, many Unicode APIs has
			``PyUnicode_READY()`` check.

			`When we drop support of legacy unicode object, We can reduce this`
			`overhead too.`


			`Simplicity`
			`----------`

			`Support of legacy Unicode object makes Unicode implementation complex.`
			`Until we drop legacy Unicode object, it is very hard to try other`
			`Unicode implementation like UTF-8 based implementation in PyPy.`


			`Rationale`
			`=========`

			`Python 4.0 is not scheduled yet`
			`-------------------------------`

			`PEP 393 introduced efficient internal representation of Unicode and`
			`removed border between "narrow" and "wide" build of Python.`

			`PEP 393 was implemented in Python 3.3 which is released in 2012. Old`
			`APIs were deprecated since then, and the removal was scheduled in`
			`Python 4.0.`

			`Python 4.0 was expected as next version of Python 3.9 when PEP 393`
			`was accepted. But the next version of Python 3.9 is Python 3.10,`
			`not 4.0. This is why this PEP schedule the removal plan again.`


			`Python 2 reached EOL`
			`--------------------`

			`Since Python 2 didn't have PEP 393 Unicode implementation, legacy`
			`APIs might help C extensiom modules supporting both of Python 2 and 3.`

			`But Python 2 reached the EOL in 2020. We can remove legacy APIs kept`
			`for compatibility with Python 2.`


			`Plan`
			`====`

			`Python 3.9 (current)`
			`--------------------`

			`These macros and functions are marked as deprecated, using`
			``Py_DEPRECATED`` macro.

			* ``Py_UNICODE_WSTR_LENGTH()``
			* ``PyUnicode_GET_SIZE()``
			* ``PyUnicode_GetSize()``
			* ``PyUnicode_GET_DATA_SIZE()``
			* ``PyUnicode_AS_UNICODE()``
			* ``PyUnicode_AS_DATA()``
			* ``PyUnicode_AsUnicode()``
			* ``_PyUnicode_AsUnicode()``
			* ``PyUnicode_AsUnicodeAndSize()``
			* ``PyUnicode_FromUnicode()``


			`Python 3.10`
			`-----------`

			`* Following macros, enum members will be marked as deprecated.`
			``Py_DEPRECATED(3.10)`` macro will be used as possible. But they
			`will be deprecated only in comment and document if the macro can`
			`not be used easily.`

			* ``PyUnicode_WCHAR_KIND``
			* ``PyUnicode_READY()``
			* ``PyUnicode_IS_READY()``
			* ``PyUnicode_IS_COMPACT()``

			* ``PyUnicode_FromUnicode(NULL, size)`` and
			``PyUnicode_FromStringAndSize(NULL, size)`` will emit
			``DeprecationWarning`` when ``size > 0``.

			* ``PyArg_ParseTuple()`` and ``PyArg_ParseTupleAndKeywords()`` will emit
			``DeprecationWarning`` when ``u``, ``u#``, ``Z``, and ``Z#`` formats are used.


			`Python 3.12`
			`-----------`

			`* Following members will be removed from the Unicode strucutres:`

			* ``wstr``
			* ``wstr_length``
			* ``state.compact``
			* ``state.ready``

			* The ``PyUnicodeObject`` struct will be removed.

			`* Following macros and functions, and enum members will be removed:`

			* ``Py_UNICODE_WSTR_LENGTH()``
			* ``PyUnicode_GET_SIZE()``
			* ``PyUnicode_GetSize()``
			* ``PyUnicode_GET_DATA_SIZE()``
			* ``PyUnicode_AS_UNICODE()``
			* ``PyUnicode_AS_DATA()``
			* ``PyUnicode_AsUnicode()``
			* ``_PyUnicode_AsUnicode()``
			* ``PyUnicode_AsUnicodeAndSize()``
			* ``PyUnicode_FromUnicode()``
			* ``PyUnicode_WCHAR_KIND``
			* ``PyUnicode_READY()``
			* ``PyUnicode_IS_READY()``
			* ``PyUnicode_IS_COMPACT()``

			* ``PyUnicode_FromStringAndSize(NULL, size))`` will raise
			``RuntimeError`` when ``size > 0``.

			* ``PyArg_ParseTuple()`` and ``PyArg_ParseTupleAndKeywords()`` will raise
			``SystemError`` when ``u``, ``u#``, ``Z``, and ``Z#`` formats are used,
			`as other unsupported format character.`


			`References`
			`==========`
			`A collection of URLs used as references through the PEP.`

			`.. [1] PEP 393 -- Flexible String Representation`
			`(https://www.python.org/dev/peps/pep-0393/)`


			`Copyright`
			`=========`

			`This document has been placed in the public domain.`