<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta name="color-scheme" content="light dark">
<title>PEP 261 Support for “wide” Unicode characters | peps.python.org</title>
<link rel="shortcut icon" href="../_static/py.png">
<link rel="canonical" href="https://peps.python.org/pep-0261/">
<link rel="stylesheet" href="../_static/style.css" type="text/css">
<link rel="stylesheet" href="../_static/mq.css" type="text/css">
<link rel="stylesheet" href="../_static/pygments.css" type="text/css" media="(prefers-color-scheme: light)" id="pyg-light">
<link rel="stylesheet" href="../_static/pygments_dark.css" type="text/css" media="(prefers-color-scheme: dark)" id="pyg-dark">
<link rel="alternate" type="application/rss+xml" title="Latest PEPs" href="https://peps.python.org/peps.rss">
<meta property="og:title" content='PEP 261 Support for “wide” Unicode characters | peps.python.org'>
<meta property="og:description" content="Python 2.1 unicode characters can have ordinals only up to 2**16 - 1. This range corresponds to a range in Unicode known as the Basic Multilingual Plane. There are now characters in Unicode that live on other “planes”. The largest addressable character ...">
<meta property="og:type" content="website">
<meta property="og:url" content="https://peps.python.org/pep-0261/">
<meta property="og:site_name" content="Python Enhancement Proposals (PEPs)">
<meta property="og:image" content="https://peps.python.org/_static/og-image.png">
<meta property="og:image:alt" content="Python PEPs">
<meta property="og:image:width" content="200">
<meta property="og:image:height" content="200">
<meta name="description" content="Python 2.1 unicode characters can have ordinals only up to 2**16 - 1. This range corresponds to a range in Unicode known as the Basic Multilingual Plane. There are now characters in Unicode that live on other “planes”. The largest addressable character ...">
<meta name="theme-color" content="#3776ab">
</head>
<body>
<svg xmlns="http://www.w3.org/2000/svg" style="display: none;">
<symbol id="svg-sun-half" viewBox="0 0 24 24" pointer-events="all">
<title>Following system colour scheme</title>
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none"
stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round">
<circle cx="12" cy="12" r="9"></circle>
<path d="M12 3v18m0-12l4.65-4.65M12 14.3l7.37-7.37M12 19.6l8.85-8.85"></path>
</svg>
</symbol>
<symbol id="svg-moon" viewBox="0 0 24 24" pointer-events="all">
<title>Selected dark colour scheme</title>
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none"
stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round">
<path stroke="none" d="M0 0h24v24H0z" fill="none"></path>
<path d="M12 3c.132 0 .263 0 .393 0a7.5 7.5 0 0 0 7.92 12.446a9 9 0 1 1 -8.313 -12.454z"></path>
</svg>
</symbol>
<symbol id="svg-sun" viewBox="0 0 24 24" pointer-events="all">
<title>Selected light colour scheme</title>
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" fill="none"
stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round">
<circle cx="12" cy="12" r="5"></circle>
<line x1="12" y1="1" x2="12" y2="3"></line>
<line x1="12" y1="21" x2="12" y2="23"></line>
<line x1="4.22" y1="4.22" x2="5.64" y2="5.64"></line>
<line x1="18.36" y1="18.36" x2="19.78" y2="19.78"></line>
<line x1="1" y1="12" x2="3" y2="12"></line>
<line x1="21" y1="12" x2="23" y2="12"></line>
<line x1="4.22" y1="19.78" x2="5.64" y2="18.36"></line>
<line x1="18.36" y1="5.64" x2="19.78" y2="4.22"></line>
</svg>
</symbol>
</svg>
<script>
document.documentElement.dataset.colour_scheme = localStorage.getItem("colour_scheme") || "auto"
</script>
<section id="pep-page-section">
<header>
<h1>Python Enhancement Proposals</h1>
<ul class="breadcrumbs">
<li><a href="https://www.python.org/" title="The Python Programming Language">Python</a> &raquo; </li>
<li><a href="../pep-0000/">PEP Index</a> &raquo; </li>
<li>PEP 261</li>
</ul>
<button id="colour-scheme-cycler" onClick="setColourScheme(nextColourScheme())">
<svg aria-hidden="true" class="colour-scheme-icon-when-auto"><use href="#svg-sun-half"></use></svg>
<svg aria-hidden="true" class="colour-scheme-icon-when-dark"><use href="#svg-moon"></use></svg>
<svg aria-hidden="true" class="colour-scheme-icon-when-light"><use href="#svg-sun"></use></svg>
<span class="visually-hidden">Toggle light / dark / auto colour theme</span>
</button>
</header>
<article>
<section id="pep-content">
<h1 class="page-title">PEP 261 Support for “wide” Unicode characters</h1>
<dl class="rfc2822 field-list simple">
<dt class="field-odd">Author<span class="colon">:</span></dt>
<dd class="field-odd">Paul Prescod &lt;paul&#32;&#97;t&#32;prescod.net&gt;</dd>
<dt class="field-even">Status<span class="colon">:</span></dt>
<dd class="field-even"><abbr title="Accepted and implementation complete, or no longer active">Final</abbr></dd>
<dt class="field-odd">Type<span class="colon">:</span></dt>
<dd class="field-odd"><abbr title="Normative PEP with a new feature for Python, implementation change for CPython or interoperability standard for the ecosystem">Standards Track</abbr></dd>
<dt class="field-even">Created<span class="colon">:</span></dt>
<dd class="field-even">27-Jun-2001</dd>
<dt class="field-odd">Python-Version<span class="colon">:</span></dt>
<dd class="field-odd">2.2</dd>
<dt class="field-even">Post-History<span class="colon">:</span></dt>
<dd class="field-even">27-Jun-2001</dd>
</dl>
<hr class="docutils" />
<section id="contents">
<details><summary>Table of Contents</summary><ul class="simple">
<li><a class="reference internal" href="#abstract">Abstract</a></li>
<li><a class="reference internal" href="#glossary">Glossary</a></li>
<li><a class="reference internal" href="#proposed-solution">Proposed Solution</a></li>
<li><a class="reference internal" href="#implementation">Implementation</a></li>
<li><a class="reference internal" href="#notes">Notes</a></li>
<li><a class="reference internal" href="#rejected-suggestions">Rejected Suggestions</a></li>
<li><a class="reference internal" href="#references">References</a></li>
<li><a class="reference internal" href="#copyright">Copyright</a></li>
</ul>
</details></section>
<section id="abstract">
<h2><a class="toc-backref" href="#abstract" role="doc-backlink">Abstract</a></h2>
<p>Python 2.1 unicode characters can have ordinals only up to <code class="docutils literal notranslate"><span class="pre">2**16</span> <span class="pre">-</span> <span class="pre">1</span></code>.
This range corresponds to a range in Unicode known as the Basic
Multilingual Plane. There are now characters in Unicode that live
on other “planes”. The largest addressable character in Unicode
has the ordinal <code class="docutils literal notranslate"><span class="pre">17</span> <span class="pre">*</span> <span class="pre">2**16</span> <span class="pre">-</span> <span class="pre">1</span></code> (<code class="docutils literal notranslate"><span class="pre">0x10ffff</span></code>). For readability, we
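will call this TOPCHAR and call characters with ordinals above
<code class="docutils literal notranslate"><span class="pre">2**16</span> <span class="pre">-</span> <span class="pre">1</span></code>, up to TOPCHAR, “wide characters”.</p>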
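<p>As a quick check of the arithmetic above, the following sketch recomputes
TOPCHAR from 17 planes of <code class="docutils literal notranslate"><span class="pre">2**16</span></code> code points each. The names
<code class="docutils literal notranslate"><span class="pre">PLANE_SIZE</span></code>, <code class="docutils literal notranslate"><span class="pre">NUM_PLANES</span></code>, <code class="docutils literal notranslate"><span class="pre">BMP_MAX</span></code> and <code class="docutils literal notranslate"><span class="pre">is_wide</span></code> are illustrative and
do not appear in the PEP, and the code is written for Python 3 rather than
the Python 2.x discussed here.</p>

```python
# Illustrative constants; only TOPCHAR is a name used by the PEP.
PLANE_SIZE = 2**16                      # code points per Unicode plane
NUM_PLANES = 17                         # the BMP plus 16 supplementary planes
TOPCHAR = NUM_PLANES * PLANE_SIZE - 1   # largest addressable ordinal
BMP_MAX = PLANE_SIZE - 1                # largest ordinal in the BMP


def is_wide(code_point):
    """Return True for ordinals above the BMP, up to and including TOPCHAR."""
    return code_point in range(PLANE_SIZE, TOPCHAR + 1)


print(hex(BMP_MAX))   # 0xffff
print(hex(TOPCHAR))   # 0x10ffff
```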
</section>
<section id="glossary">
<h2><a class="toc-backref" href="#glossary" role="doc-backlink">Glossary</a></h2>
<dl class="simple">
<dt>Character</dt><dd>Used by itself, means the addressable units of a Python
Unicode string.</dd>
<dt>Code point</dt><dd>A code point is an integer between 0 and TOPCHAR.
If you imagine Unicode as a mapping from integers to
characters, each integer is a code point. But the
integers between 0 and TOPCHAR that do not map to
characters are also code points. Some will someday
be used for characters. Some are guaranteed never
to be used for characters.</dd>
<dt>Codec</dt><dd>A set of functions for translating between physical
encodings (e.g. on disk or coming in from a network)
and logical Python objects.</dd>
<dt>Encoding</dt><dd>Mechanism for representing abstract characters in terms of
physical bits and bytes. Encodings allow us to store
Unicode characters on disk and transmit them over networks
in a manner that is compatible with other Unicode software.</dd>
<dt>Surrogate pair</dt><dd>Two physical characters that represent a single logical
character. Part of a convention for representing 32-bit
code points in terms of two 16-bit code points.</dd>
<dt>Unicode string</dt><dd>A Python type representing a sequence of code points with
“string semantics” (e.g. case conversions, regular
expression compatibility, etc.) Constructed with the
<code class="docutils literal notranslate"><span class="pre">unicode()</span></code> function.</dd>
</dl>
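<p>The surrogate pair convention mentioned in the glossary can be sketched as
plain integer arithmetic. This is a minimal illustration of the standard
UTF-16 scheme, not code from the PEP; the function names
<code class="docutils literal notranslate"><span class="pre">to_surrogate_pair</span></code> and <code class="docutils literal notranslate"><span class="pre">from_surrogate_pair</span></code> are hypothetical, and the
example is written for Python 3.</p>

```python
HIGH_BASE = 0xD800  # first high (lead) surrogate
LOW_BASE = 0xDC00   # first low (trail) surrogate


def to_surrogate_pair(code_point):
    """Split a code point in 0x10000..0x10FFFF into (high, low) 16-bit units."""
    if code_point not in range(0x10000, 0x110000):
        raise ValueError("not a wide character: %#x" % code_point)
    # The 20-bit offset is split into two 10-bit halves (2**10 == 0x400).
    high_bits, low_bits = divmod(code_point - 0x10000, 0x400)
    return HIGH_BASE + high_bits, LOW_BASE + low_bits


def from_surrogate_pair(high, low):
    """Combine a (high, low) surrogate pair back into one code point."""
    if high not in range(0xD800, 0xDC00) or low not in range(0xDC00, 0xE000):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + (high - HIGH_BASE) * 0x400 + (low - LOW_BASE)


pair = to_surrogate_pair(0x1D11E)       # U+1D11E MUSICAL SYMBOL G CLEF
print([hex(unit) for unit in pair])     # ['0xd834', '0xdd1e']
print(hex(from_surrogate_pair(*pair)))  # 0x1d11e
```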
</section>
<section id="proposed-solution">
<h2><a class="toc-backref" href="#proposed-solution" role="doc-backlink">Proposed Solution</a></h2>
<p>One solution would be to merely increase the maximum ordinal
to a larger value. Unfortunately the only straightforward
implementation of this idea is to use 4 bytes per character.
This has the effect of doubling the size of most Unicode
strings. In order to avoid imposing this cost on every
user, Python 2.2 will allow the 4-byte implementation as a
build-time option. Users can choose whether they care about
wide characters or prefer to preserve memory.</p>
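<p>To make the memory cost concrete, the sketch below uses the fixed-width
<code class="docutils literal notranslate"><span class="pre">utf-16-le</span></code> and <code class="docutils literal notranslate"><span class="pre">utf-32-le</span></code> codecs as stand-ins for the 2-byte and 4-byte
internal layouts. This is only an approximation under that assumption: it
counts character data and ignores per-object overhead, and the helper
<code class="docutils literal notranslate"><span class="pre">fixed_width_size</span></code> is a hypothetical name. It is written for Python 3.</p>

```python
def fixed_width_size(text, codec):
    """Number of bytes the text occupies in the given codec."""
    return len(text.encode(codec))


sample = "Python"  # hypothetical sample containing only BMP characters
narrow_bytes = fixed_width_size(sample, "utf-16-le")  # 2 bytes per character
wide_bytes = fixed_width_size(sample, "utf-32-le")    # 4 bytes per character
print(narrow_bytes, wide_bytes)  # 12 24
```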
<p>The 4-byte option is called “wide <code class="docutils literal notranslate"><span class="pre">Py_UNICODE</span></code>”. The 2-byte option
is called “narrow <code class="docutils literal notranslate"><span class="pre">Py_UNICODE</span></code>”.</p>
<p>Most things will behave identically in the wide and narrow worlds.</p>
<ul>
<li><code class="docutils literal notranslate"><span class="pre">unichr(i)</span></code> for 0 &lt;= i &lt; <code class="docutils literal notranslate"><span class="pre">2**16</span></code> (<code class="docutils literal notranslate"><span class="pre">0x10000</span></code>) always returns a
length-one string.</li>
<li><code class="docutils literal notranslate"><span class="pre">unichr(i)</span></code> for <code class="docutils literal notranslate"><span class="pre">2**16</span></code> &lt;= i &lt;= TOPCHAR will return a
length-one string on wide Python builds. On narrow builds it will
raise <code class="docutils literal notranslate"><span class="pre">ValueError</span></code>.<p><strong>ISSUE</strong></p>
<blockquote>
<div>Python currently allows <code class="docutils literal notranslate"><span class="pre">\U</span></code> literals that cannot be
represented as a single Python character. It generates two
Python characters known as a “surrogate pair”. Should this
be disallowed on future narrow Python builds?</div></blockquote>
<p><strong>Pro:</strong></p>
<blockquote>
<div>Python already allows the construction of a surrogate pair
for a large unicode literal character escape sequence.
This is basically designed as a simple way to construct
“wide characters” even in a narrow Python build. It is also
somewhat logical considering that the Unicode-literal syntax
is basically a short-form way of invoking the unicode-escape
codec.</div></blockquote>
<p><strong>Con:</strong></p>
<blockquote>
<div>Surrogates could be easily created this way but the user
still needs to be careful about slicing, indexing, printing
etc. Therefore, some have suggested that Unicode
literals should not support surrogates.</div></blockquote>
<p><strong>ISSUE</strong></p>
<blockquote>
<div>Should Python allow the construction of characters that do
not correspond to Unicode code points? Unassigned Unicode
code points should obviously be legal (because they could
be assigned at any time). But code points above TOPCHAR are
guaranteed never to be used by Unicode. Should we allow access
to them anyhow?</div></blockquote>
<p><strong>Pro:</strong></p>
<blockquote>
<div>If a Python user thinks they know what they’re doing, why
should we try to prevent them from violating the Unicode
spec? After all, we don’t stop 8-bit strings from
containing non-ASCII characters.</div></blockquote>
<p><strong>Con:</strong></p>
<blockquote>
<div>Codecs and other Unicode-consuming code will have to be
careful of these characters which are disallowed by the
Unicode specification.</div></blockquote>
</li>
<li><code class="docutils literal notranslate"><span class="pre">ord()</span></code> is always the inverse of <code class="docutils literal notranslate"><span class="pre">unichr()</span></code>.</li>
<li>There is an integer value in the sys module that describes the
largest ordinal for a character in a Unicode string on the current
interpreter. <code class="docutils literal notranslate"><span class="pre">sys.maxunicode</span></code> is <code class="docutils literal notranslate"><span class="pre">2**16-1</span></code> (<code class="docutils literal notranslate"><span class="pre">0xffff</span></code>) on narrow builds
of Python and TOPCHAR on wide builds.<p><strong>ISSUE:</strong></p>
<blockquote>
<div>Should there be distinct constants for accessing
TOPCHAR and the real upper bound for the domain of
<code class="docutils literal notranslate"><span class="pre">unichr</span></code> (if they differ)? There has also been a
suggestion of <code class="docutils literal notranslate"><span class="pre">sys.unicodewidth</span></code> which can take the
values <code class="docutils literal notranslate"><span class="pre">'wide'</span></code> and <code class="docutils literal notranslate"><span class="pre">'narrow'</span></code>.</div></blockquote>
</li>
<li>Every Python Unicode character represents exactly one Unicode code
point (i.e. Python Unicode Character = Abstract Unicode character).</li>
<li>Codecs will be upgraded to support “wide characters”
(represented directly in UCS-4, and as variable-length sequences
in UTF-8 and UTF-16). This is the main part of the implementation
left to be done.</li>
<li>There is a convention in the Unicode world for encoding a 32-bit
code point in terms of two 16-bit code points. These are known
as “surrogate pairs”. Python’s codecs will adopt this convention
and encode 32-bit code points as surrogate pairs on narrow Python
builds.<p><strong>ISSUE</strong></p>
<blockquote>
<div>Should there be a way to tell codecs not to generate
surrogates and instead treat wide characters as
errors?</div></blockquote>
<p><strong>Pro:</strong></p>
<blockquote>
<div>I might want to write code that works only with
fixed-width characters and does not have to worry about
surrogates.</div></blockquote>
<p><strong>Con:</strong></p>
<blockquote>
<div>No clear proposal of how to communicate this to codecs.</div></blockquote>
</li>
<li>There are no restrictions on constructing strings that use
code points “reserved for surrogates” improperly. These are
called “isolated surrogates”. The codecs should disallow reading
these from files, but you could construct them using string
literals or <code class="docutils literal notranslate"><span class="pre">unichr()</span></code>.</li>
</ul>
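<p>Several of the behaviours listed above can be observed from Python code.
The sketch below is written for Python 3.3 or later, where
<code class="docutils literal notranslate"><span class="pre">chr()</span></code> takes the place of <code class="docutils literal notranslate"><span class="pre">unichr()</span></code> and every build reports
<code class="docutils literal notranslate"><span class="pre">sys.maxunicode</span></code> equal to TOPCHAR, so it shows the wide behaviour only. The
helper names <code class="docutils literal notranslate"><span class="pre">build_width</span></code>, <code class="docutils literal notranslate"><span class="pre">rejects_ordinal</span></code> and
<code class="docutils literal notranslate"><span class="pre">codec_rejects_isolated_surrogate</span></code> are hypothetical.</p>

```python
import sys

TOPCHAR = 0x10FFFF  # the PEP's name for the largest Unicode ordinal


def build_width():
    """Classify the running interpreter using sys.maxunicode."""
    return "wide" if sys.maxunicode == TOPCHAR else "narrow"


def rejects_ordinal(code_point):
    """Return True if chr() raises ValueError for this ordinal."""
    try:
        chr(code_point)
    except ValueError:
        return True
    return False


def codec_rejects_isolated_surrogate():
    """Return True if strict UTF-8 decoding refuses an encoded U+D800."""
    try:
        b"\xed\xa0\x80".decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


print(build_width(), hex(sys.maxunicode))

# ord() inverts chr() across the supported range.
for code_point in (0x41, 0xFFFF, 0x10000, sys.maxunicode):
    assert ord(chr(code_point)) == code_point

print(rejects_ordinal(sys.maxunicode + 1))   # True
isolated = chr(0xD800)                       # constructing one is allowed
print(len(isolated))                         # 1
print(codec_rejects_isolated_surrogate())    # True
```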
</section>
<section id="implementation">
<h2><a class="toc-backref" href="#implementation" role="doc-backlink">Implementation</a></h2>
<p>There is a new define:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="c1">#define Py_UNICODE_SIZE 2</span>
</pre></div>
</div>
<p>To test whether UCS-2 or UCS-4 is in use, the derived macro
<code class="docutils literal notranslate"><span class="pre">Py_UNICODE_WIDE</span></code> should be used, which is defined when UCS-4 is in
use.</p>
<p>There is a new configure option:</p>
<table class="docutils align-default">
<tbody>
<tr class="row-odd"><td>--enable-unicode=ucs2</td>
<td>configures a narrow <code class="docutils literal notranslate"><span class="pre">Py_UNICODE</span></code>, and uses
wchar_t if it fits</td>
</tr>
<tr class="row-even"><td>--enable-unicode=ucs4</td>
<td>configures a wide <code class="docutils literal notranslate"><span class="pre">Py_UNICODE</span></code>, and uses
wchar_t if it fits</td>
</tr>
<tr class="row-odd"><td>--enable-unicode</td>
<td>same as “=ucs2”</td>
</tr>
<tr class="row-even"><td>--disable-unicode</td>
<td>entirely removes the Unicode functionality.</td>
</tr>
</tbody>
</table>
<p>It is also proposed that one day <code class="docutils literal notranslate"><span class="pre">--enable-unicode</span></code> will just
default to the width of your platform’s <code class="docutils literal notranslate"><span class="pre">wchar_t</span></code>.</p>
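<p>The width of the platform <code class="docutils literal notranslate"><span class="pre">wchar_t</span></code> can be inspected from Python through
the standard <code class="docutils literal notranslate"><span class="pre">ctypes</span></code> module. The sketch below is only an illustration of
which option such a default would correspond to; the mapping from size to
option value is an assumption based on the table above, not something
configure itself does, and the names <code class="docutils literal notranslate"><span class="pre">wchar_bytes</span></code> and <code class="docutils literal notranslate"><span class="pre">matching_option</span></code>
are hypothetical.</p>

```python
import ctypes

# Size in bytes of the C wchar_t type on the current platform.
wchar_bytes = ctypes.sizeof(ctypes.c_wchar)

# Assumed correspondence: 2-byte wchar_t matches ucs2, 4-byte matches ucs4.
matching_option = {2: "ucs2", 4: "ucs4"}.get(wchar_bytes, "unknown")

print("wchar_t is %d bytes; matching option: --enable-unicode=%s"
      % (wchar_bytes, matching_option))
```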
<p>Windows builds will be narrow for a while, for three reasons: there
have been few requests for wide characters; those requests are mostly
from hard-core programmers with the ability to build their own Python;
and Windows itself is strongly biased towards 16-bit characters.</p>
</section>
<section id="notes">
<h2><a class="toc-backref" href="#notes" role="doc-backlink">Notes</a></h2>
<p>This PEP does NOT imply that people using Unicode need to use a
4-byte encoding for their files on disk or sent over the network.
It only allows them to do so. For example, ASCII is still a
legitimate (7-bit) Unicode encoding.</p>
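<p>The independence of the on-disk encoding from the internal character width
can be seen by encoding the same text with several codecs. This is a small
Python 3 sketch; <code class="docutils literal notranslate"><span class="pre">encoded_sizes</span></code> is a hypothetical helper and the sample
strings are arbitrary.</p>

```python
def encoded_sizes(text, codecs=("utf-8", "utf-16-le", "utf-32-le")):
    """Map each codec name to the number of bytes it needs for the text."""
    return {codec: len(text.encode(codec)) for codec in codecs}


ascii_text = "PEP 261"

# ASCII text is byte-for-byte identical in the ASCII and UTF-8 encodings.
assert ascii_text.encode("ascii") == ascii_text.encode("utf-8")

print(encoded_sizes(ascii_text))     # {'utf-8': 7, 'utf-16-le': 14, 'utf-32-le': 28}
print(encoded_sizes(chr(0x1D11E)))   # {'utf-8': 4, 'utf-16-le': 4, 'utf-32-le': 4}
```

<p>Only the 4-byte codec pays four bytes for every character; the
variable-length codecs spend four bytes only on the wide character.</p>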
<p>It has been proposed that there should be a module that handles
surrogates in narrow Python builds for programmers. If someone
wants to implement that, it will be another PEP. It might also be
combined with features that allow other kinds of character-,
word- and line- based indexing.</p>
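<p>As a rough idea of what such a helper module might offer, the sketch below
walks a sequence of 16-bit code units, as a narrow build would store them,
and yields whole code points. It is an assumption about one possible design,
not a proposed API; <code class="docutils literal notranslate"><span class="pre">iter_code_points</span></code> is a hypothetical name, and the
narrow string is modelled as a plain list of integers so that the example
runs on Python 3.</p>

```python
def iter_code_points(units):
    """Yield code points from a sequence of 16-bit code units.

    A high surrogate followed by a low surrogate is combined into one
    code point; any other unit, including an isolated surrogate, is
    passed through unchanged.
    """
    index = 0
    count = len(units)
    while index != count:
        unit = units[index]
        index += 1
        if unit in range(0xD800, 0xDC00) and index != count:
            following = units[index]
            if following in range(0xDC00, 0xE000):
                index += 1
                yield 0x10000 + (unit - 0xD800) * 0x400 + (following - 0xDC00)
                continue
        yield unit


# "A", U+1D11E stored as a surrogate pair, then "B".
narrow_string = [0x41, 0xD834, 0xDD1E, 0x42]
print(len(narrow_string))                                    # 4
print([hex(cp) for cp in iter_code_points(narrow_string)])   # ['0x41', '0x1d11e', '0x42']
```

<p>The length of the narrow sequence counts code units, while the iterator
counts logical characters, which is the gap such a module would need to
bridge for slicing and indexing.</p>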
</section>
<section id="rejected-suggestions">
<h2><a class="toc-backref" href="#rejected-suggestions" role="doc-backlink">Rejected Suggestions</a></h2>
<p>More or less the status-quo</p>
<blockquote>
<div>We could officially say that Python characters are 16-bit and
require programmers to implement wide characters in their
application logic by combining surrogate pairs. This is a heavy
burden because emulating 32-bit characters is likely to be
very inefficient if it is coded entirely in Python. Plus these
abstracted pseudo-strings would not be legal as input to the
regular expression engine.</div></blockquote>
<p>“Space-efficient Unicode” type</p>
<blockquote>
<div>Another class of solution is to use some efficient storage
internally but present an abstraction of wide characters to
the programmer. Any of these would require a much more complex
implementation than the accepted solution. For instance consider
the impact on the regular expression engine. In theory, we could
move to this implementation in the future without breaking Python
code. A future Python could “emulate” wide Python semantics on
narrow Python. Guido is not willing to undertake the
implementation right now.</div></blockquote>
<p>Two types</p>
<blockquote>
<div>We could introduce a 32-bit Unicode type alongside the 16-bit
type. There is a lot of code that expects there to be only a
single Unicode type.</div></blockquote>
<p>This PEP represents the least-effort solution. Over the next
several years, 32-bit Unicode characters will become more common
and that may either convince us that we need a more sophisticated
solution or (on the other hand) convince us that simply
mandating wide Unicode characters is an appropriate solution.
Right now the two options on the table are do nothing or do
this.</p>
</section>
<section id="references">
<h2><a class="toc-backref" href="#references" role="doc-backlink">References</a></h2>
<p>Unicode Glossary: <a class="reference external" href="http://www.unicode.org/glossary/">http://www.unicode.org/glossary/</a></p>
</section>
<section id="copyright">
<h2><a class="toc-backref" href="#copyright" role="doc-backlink">Copyright</a></h2>
<p>This document has been placed in the public domain.</p>
</section>
</section>
<hr class="docutils" />
<p>Source: <a class="reference external" href="https://github.com/python/peps/blob/main/peps/pep-0261.rst">https://github.com/python/peps/blob/main/peps/pep-0261.rst</a></p>
<p>Last modified: <a class="reference external" href="https://github.com/python/peps/commits/main/peps/pep-0261.rst">2023-09-09 17:39:29 GMT</a></p>
</article>
<nav id="pep-sidebar">
<h2>Contents</h2>
<ul>
<li><a class="reference internal" href="#abstract">Abstract</a></li>
<li><a class="reference internal" href="#glossary">Glossary</a></li>
<li><a class="reference internal" href="#proposed-solution">Proposed Solution</a></li>
<li><a class="reference internal" href="#implementation">Implementation</a></li>
<li><a class="reference internal" href="#notes">Notes</a></li>
<li><a class="reference internal" href="#rejected-suggestions">Rejected Suggestions</a></li>
<li><a class="reference internal" href="#references">References</a></li>
<li><a class="reference internal" href="#copyright">Copyright</a></li>
</ul>
<br>
<a id="source" href="https://github.com/python/peps/blob/main/peps/pep-0261.rst">Page Source (GitHub)</a>
</nav>
</section>
<script src="../_static/colour_scheme.js"></script>
<script src="../_static/wrap_tables.js"></script>
<script src="../_static/sticky_banner.js"></script>
</body>
</html>