PEP: 623
Title: Remove wstr from Unicode
Author: Inada Naoki <songofacandy@gmail.com>
BDFL-Delegate: Victor Stinner <vstinner@python.org>
Discussions-To: https://mail.python.org/archives/list/python-dev@python.org/thread/BO2TQHSXWL2RJMINWQQRBF5LANDDJNHH/
Status: Final
Type: Standards Track
Content-Type: text/x-rst
Created: 25-Jun-2020
Python-Version: 3.10
Resolution: https://mail.python.org/archives/list/python-dev@python.org/thread/VQKDIZLZ6HF2MLTNCUFURK2IFTXVQEYA/


Abstract
========

:pep:`393` deprecated some Unicode APIs and introduced the
``wchar_t *wstr`` and ``Py_ssize_t wstr_length`` members in the Unicode
structure to support these deprecated APIs.

This PEP plans the removal of ``wstr`` and ``wstr_length``, together
with the deprecated APIs that use these members, by Python 3.12.

Deprecated APIs that don't use these members are out of scope because
they can be removed independently.


Motivation
==========

Memory usage
------------

``str`` is one of the most used types in Python. Even the simplest ASCII
strings have a ``wstr`` member. It consumes 8 bytes per string on 64-bit
systems.
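
The saving can be estimated with a simplified ``ctypes`` model of the
object header. This is only a sketch: the class names
``LegacyASCIIObject`` and ``SlimASCIIObject`` are made up for
illustration, and the field list is an approximation of the real C
structure, not a copy of it.

```python
import ctypes


class LegacyASCIIObject(ctypes.Structure):
    # Approximate layout of an ASCII string header that still has wstr.
    _fields_ = [
        ("ob_refcnt", ctypes.c_ssize_t),
        ("ob_type", ctypes.c_void_p),
        ("length", ctypes.c_ssize_t),
        ("hash", ctypes.c_ssize_t),
        ("state", ctypes.c_uint),    # bit flags packed into one unsigned int
        ("wstr", ctypes.c_void_p),   # the member this PEP removes
    ]


class SlimASCIIObject(ctypes.Structure):
    # The same header without the wstr pointer.
    _fields_ = [
        ("ob_refcnt", ctypes.c_ssize_t),
        ("ob_type", ctypes.c_void_p),
        ("length", ctypes.c_ssize_t),
        ("hash", ctypes.c_ssize_t),
        ("state", ctypes.c_uint),
    ]


# Bytes saved per string in this model: one pointer.
saved_bytes = ctypes.sizeof(LegacyASCIIObject) - ctypes.sizeof(SlimASCIIObject)
print(saved_bytes)
```

In this model the difference is the size of one pointer, which matches
the 8 bytes quoted above for 64-bit systems.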


Runtime overhead
----------------

To support legacy Unicode objects, many Unicode APIs must call
``PyUnicode_READY()``.

We can remove this overhead too by dropping support for legacy Unicode
objects.


Simplicity
----------

Supporting legacy Unicode objects makes the Unicode implementation more
complex.
Until we drop legacy Unicode objects, it is very hard to try other
Unicode implementations, such as the UTF-8 based implementation in PyPy.


Rationale
=========

Python 4.0 is not scheduled yet
-------------------------------

:pep:`393` introduced an efficient internal representation of Unicode
and removed the border between "narrow" and "wide" builds of Python.

:pep:`393` was implemented in Python 3.3, which was released in 2012.
The old APIs have been deprecated since then, and their removal was
scheduled for Python 4.0.

Python 4.0 was expected to be the version after Python 3.9 when
:pep:`393` was accepted. But the version after Python 3.9 is Python 3.10,
not 4.0. This is why this PEP reschedules the removal.


Python 2 reached EOL
--------------------

Since Python 2 didn't have the :pep:`393` Unicode implementation, the
legacy APIs might have helped C extension modules support both Python 2
and 3.

But Python 2 reached EOL in 2020. We can remove the legacy APIs kept
for compatibility with Python 2.


Plan
====

Python 3.9
----------

These macros and functions are marked as deprecated, using the
``Py_DEPRECATED`` macro.

* ``Py_UNICODE_WSTR_LENGTH()``
* ``PyUnicode_GET_SIZE()``
* ``PyUnicode_GetSize()``
* ``PyUnicode_GET_DATA_SIZE()``
* ``PyUnicode_AS_UNICODE()``
* ``PyUnicode_AS_DATA()``
* ``PyUnicode_AsUnicode()``
* ``_PyUnicode_AsUnicode()``
* ``PyUnicode_AsUnicodeAndSize()``
* ``PyUnicode_FromUnicode()``
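
Code that relies on the APIs above can usually move to functions that do
not touch ``wstr``, such as ``PyUnicode_GetLength()`` and
``PyUnicode_AsWideCharString()``. The sketch below calls both through
``ctypes.pythonapi`` so that it runs without compiling an extension. The
helper names ``code_point_length`` and ``wide_copy`` are made up for
illustration, and a real C extension would call the functions directly.

```python
import ctypes

api = ctypes.pythonapi

# Declare the C signatures before calling into the interpreter.
api.PyUnicode_GetLength.argtypes = [ctypes.py_object]
api.PyUnicode_GetLength.restype = ctypes.c_ssize_t

api.PyUnicode_AsWideCharString.argtypes = [
    ctypes.py_object,
    ctypes.POINTER(ctypes.c_ssize_t),
]
api.PyUnicode_AsWideCharString.restype = ctypes.c_void_p

api.PyMem_Free.argtypes = [ctypes.c_void_p]
api.PyMem_Free.restype = None


def code_point_length(text):
    """Length in code points, without the deprecated size macros."""
    return api.PyUnicode_GetLength(text)


def wide_copy(text):
    """Round-trip text through a freshly allocated wchar_t buffer."""
    size = ctypes.c_ssize_t(0)
    buf = api.PyUnicode_AsWideCharString(text, ctypes.byref(size))
    if not buf:
        raise MemoryError("PyUnicode_AsWideCharString returned NULL")
    try:
        # The buffer belongs to the caller; nothing is cached on the string.
        return ctypes.wstring_at(buf, size.value)
    finally:
        api.PyMem_Free(buf)
```

Unlike ``PyUnicode_AsUnicode()``, the buffer returned by
``PyUnicode_AsWideCharString()`` is owned by the caller and must be
released with ``PyMem_Free()``.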


Python 3.10
-----------

* The following macros and enum members are marked as deprecated.
  The ``Py_DEPRECATED(3.10)`` macro is used where possible. Where the
  macro cannot be used easily, they are deprecated only in comments
  and documentation.

  * ``PyUnicode_WCHAR_KIND``
  * ``PyUnicode_READY()``
  * ``PyUnicode_IS_READY()``
  * ``PyUnicode_IS_COMPACT()``

* ``PyUnicode_FromUnicode(NULL, size)`` and
  ``PyUnicode_FromStringAndSize(NULL, size)`` emit
  ``DeprecationWarning`` when ``size > 0``.

* ``PyArg_ParseTuple()`` and ``PyArg_ParseTupleAndKeywords()`` emit
  ``DeprecationWarning`` when ``u``, ``u#``, ``Z``, and ``Z#`` formats are used.
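
Because ``DeprecationWarning`` is hidden by default in most contexts,
extension authors may want to surface it in their test suites, for
example by running with ``-W error::DeprecationWarning`` or by recording
warnings explicitly. The following is a minimal sketch of the second
approach. ``collect_deprecations`` is a made-up helper, and
``legacy_parse`` is a hypothetical stand-in for an extension function
that uses one of the deprecated formats.

```python
import warnings


def collect_deprecations(func, *args, **kwargs):
    """Call func and return (result, list of DeprecationWarning messages)."""
    with warnings.catch_warnings(record=True) as caught:
        # Make sure no DeprecationWarning is suppressed or deduplicated.
        warnings.simplefilter("always", DeprecationWarning)
        result = func(*args, **kwargs)
    messages = [
        str(w.message)
        for w in caught
        if issubclass(w.category, DeprecationWarning)
    ]
    return result, messages


def legacy_parse(value):
    # Hypothetical stand-in: a real extension would trigger the warning
    # from C, inside PyArg_ParseTuple().
    warnings.warn("'u' format is deprecated", DeprecationWarning, stacklevel=2)
    return value


result, messages = collect_deprecations(legacy_parse, "text")
```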


Python 3.12
-----------

* The following members are removed from the Unicode structures:

  * ``wstr``
  * ``wstr_length``
  * ``state.compact``
  * ``state.ready``

* The ``PyUnicodeObject`` structure is removed.

* The following macros, functions, and enum members are removed:

  * ``Py_UNICODE_WSTR_LENGTH()``
  * ``PyUnicode_GET_SIZE()``
  * ``PyUnicode_GetSize()``
  * ``PyUnicode_GET_DATA_SIZE()``
  * ``PyUnicode_AS_UNICODE()``
  * ``PyUnicode_AS_DATA()``
  * ``PyUnicode_AsUnicode()``
  * ``_PyUnicode_AsUnicode()``
  * ``PyUnicode_AsUnicodeAndSize()``
  * ``PyUnicode_FromUnicode()``
  * ``PyUnicode_WCHAR_KIND``
  * ``PyUnicode_READY()``
  * ``PyUnicode_IS_READY()``
  * ``PyUnicode_IS_COMPACT()``

* ``PyUnicode_FromStringAndSize(NULL, size)`` raises
  ``RuntimeError`` when ``size > 0``.

* ``PyArg_ParseTuple()`` and ``PyArg_ParseTupleAndKeywords()`` raise
  ``SystemError`` when the ``u``, ``u#``, ``Z``, and ``Z#`` formats are
  used, as with any other unsupported format character.
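
One rough way to see which of the removed functions a given interpreter
still exports is to probe ``ctypes.pythonapi`` for the symbol. This is
only a sketch: ``capi_exports`` is a made-up helper, it can detect
exported functions only (macros and enum members leave no symbol), and
the result depends on how the interpreter was built.

```python
import ctypes


def capi_exports(name):
    """Return True if the running interpreter exports a C symbol `name`."""
    # Attribute lookup on the library object fails for unknown symbols,
    # so hasattr() reports False for them.
    return hasattr(ctypes.pythonapi, name)


# Functions (not macros) from the removal list above.
REMOVED_FUNCTIONS = [
    "PyUnicode_GetSize",
    "PyUnicode_AsUnicode",
    "PyUnicode_AsUnicodeAndSize",
    "PyUnicode_FromUnicode",
]

report = {name: capi_exports(name) for name in REMOVED_FUNCTIONS}
for name, present in report.items():
    print(f"{name}: {'still exported' if present else 'not exported'}")
```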


Discussion
==========

* `Draft PEP: Remove wstr from Unicode
  <https://mail.python.org/archives/list/python-dev@python.org/thread/BO2TQHSXWL2RJMINWQQRBF5LANDDJNHH/#BO2TQHSXWL2RJMINWQQRBF5LANDDJNHH>`_
* `When can we remove wchar_t* cache from string?
  <https://mail.python.org/archives/list/python-dev@python.org/thread/7JVC3IKS2V73K36ISEJAAWMRFN2T4KKR/#7JVC3IKS2V73K36ISEJAAWMRFN2T4KKR>`_
* `PEP 623: Remove wstr from Unicode object #1462
  <https://github.com/python/peps/pull/1462>`_


References
==========

* `bpo-38604: Schedule Py_UNICODE API removal
  <https://bugs.python.org/issue38604>`_
* `bpo-36346: Prepare for removing the legacy Unicode C API
  <https://bugs.python.org/issue36346>`_
* `bpo-30863: Rewrite PyUnicode_AsWideChar() and
  PyUnicode_AsWideCharString() <https://bugs.python.org/issue30863>`_:
  They no longer cache the ``wchar_t*`` representation of string
  objects.


Copyright
=========

This document has been placed in the public domain.