PEP: 414
Title: Explicit Unicode Literal for Python 3.3
Version: $Revision$
Last-Modified: $Date$
Author: Armin Ronacher <armin.ronacher@active-4.com>,
        Alyssa Coghlan <ncoghlan@gmail.com>
Status: Final
Type: Standards Track
Content-Type: text/x-rst
Created: 15-Feb-2012
Python-Version: 3.3
Post-History: 28-Feb-2012, 04-Mar-2012
Resolution: https://mail.python.org/pipermail/python-dev/2012-February/116995.html


Abstract
========

This document proposes the reintegration of an explicit unicode literal
from Python 2.x to the Python 3.x language specification, in order to
reduce the volume of changes needed when porting Unicode-aware
Python 2 applications to Python 3.


BDFL Pronouncement
==================

This PEP has been formally accepted for Python 3.3:

    I'm accepting the PEP. It's about as harmless as they come. Make it so.


Proposal
========

This PEP proposes that Python 3.3 restore support for Python 2's Unicode
literal syntax, substantially increasing the number of lines of existing
Python 2 code in Unicode-aware applications that will run without modification
on Python 3.

Specifically, the Python 3 definition for string literal prefixes will be
expanded to allow::

    "u" | "U"

in addition to the currently supported::

    "r" | "R"

The following will all denote ordinary Python 3 strings::

    'text'
    "text"
    '''text'''
    """text"""
    u'text'
    u"text"
    u'''text'''
    u"""text"""
    U'text'
    U"text"
    U'''text'''
    U"""text"""

No changes are proposed to Python 3's actual Unicode handling, only to the
acceptable forms for string literals.


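As a minimal illustration of the proposal (a sketch that assumes a Python 3.3
or later interpreter; the variable names are purely illustrative), the
prefixed and unprefixed spellings produce equal objects of the same type::

    # Illustrative names only; the u/U prefixes require Python 3.3+.
    plain = 'text'
    lower_prefixed = u'text'
    upper_prefixed = U"""text"""

    # The prefix changes only the spelling of the literal, not the object.
    all_same_type = (type(plain) is type(lower_prefixed)
                     is type(upper_prefixed) is str)
    all_equal = plain == lower_prefixed == upper_prefixed

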
Exclusion of "Raw" Unicode Literals
===================================

Python 2 supports a concept of "raw" Unicode literals that don't meet the
conventional definition of a raw string: ``\uXXXX`` and ``\UXXXXXXXX`` escape
sequences are still processed by the compiler and converted to the
appropriate Unicode code points when creating the associated Unicode objects.

Python 3 has no corresponding concept - the compiler performs *no*
preprocessing of the contents of raw string literals. This matches the
behaviour of 8-bit raw string literals in Python 2.

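The Python 3 behaviour can be checked directly. The following sketch (the
variable names are illustrative) compares a raw literal with an ordinary
literal written with the same characters in the source::

    # In Python 3, a raw literal keeps the backslash sequence verbatim.
    raw = r'\u20ac'      # six characters: backslash, 'u', '2', '0', 'a', 'c'
    cooked = '\u20ac'    # one character: U+20AC EURO SIGN

    raw_length = len(raw)
    cooked_length = len(cooked)
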
Since such strings are rarely used and would be interpreted differently in
Python 3 if permitted, it was decided that leaving them out entirely was
a better choice. Code which uses them will thus still fail immediately on
Python 3 (with a ``SyntaxError``), rather than potentially producing different
output.

To get equivalent behaviour that will run on both Python 2 and Python 3,
either an ordinary Unicode literal can be used (with appropriate additional
escaping within the string), or else string concatenation or string
formatting can be used to combine the raw portions of the string with those
that require the use of Unicode escape sequences.

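As a sketch of the first two alternatives (the pattern is only an illustrative
example, and a Python 3.3+ interpreter is assumed for the ``u`` prefix), a
string that mixes a literal backslash sequence with a Unicode escape can be
spelled either way::

    # Option 1: ordinary Unicode literal, doubling the backslash that must
    # survive as a literal character.
    escaped = u'\u20ac\\d+'

    # Option 2: concatenate a Unicode literal with a raw literal. On Python 2
    # the raw part must be ASCII so that the implicit promotion to unicode
    # succeeds.
    concatenated = u'\u20ac' + r'\d+'
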
Note that when using ``from __future__ import unicode_literals`` in Python 2,
the nominally "raw" Unicode string literals will process ``\uXXXX`` and
``\UXXXXXXXX`` escape sequences, just like Python 2 strings explicitly marked
with the "raw Unicode" prefix.


Author's Note
=============

This PEP was originally written by Armin Ronacher, and Guido's approval was
given based on that version.

The currently published version has been rewritten by Alyssa Coghlan to
include additional historical details and rationale that were taken into
account when Guido made his decision, but were not explicitly documented in
Armin's version of the PEP.

Readers should be aware that many of the arguments in this PEP are *not*
technical ones. Instead, they relate heavily to the *social* and *personal*
aspects of software development.


Rationale
=========

With the release of a Python 3 compatible version of the Web Server Gateway
Interface (WSGI) specification (:pep:`3333`) for Python 3.2, many parts of the
Python web ecosystem have been making a concerted effort to support Python 3
without adversely affecting their existing developer and user communities.

One major item of feedback from key developers in those communities, including
Chris McDonough (WebOb, Pyramid), Armin Ronacher (Flask, Werkzeug), Jacob
Kaplan-Moss (Django) and Kenneth Reitz (``requests``), is that the requirement
to change the spelling of *every* Unicode literal in an application
(regardless of how that is accomplished) is a key stumbling block for porting
efforts.

In particular, unlike many of the other Python 3 changes, it isn't one that
framework and library authors can easily handle on behalf of their users. Most
of those users couldn't care less about the "purity" of the Python language
specification; they just want their websites and applications to work as well
as possible.

While it is the Python web community that has been most vocal in highlighting
this concern, it is expected that other highly Unicode-aware domains (such as
GUI development) may run into similar issues as they (and their communities)
start making concerted efforts to support Python 3.


Common Objections
=================


Complaint: This PEP may harm adoption of Python 3.2
---------------------------------------------------

This complaint is interesting, as it carries within it a tacit admission that
this PEP *will* make it easier to port Unicode-aware Python 2 applications to
Python 3.

There are many existing Python communities that are prepared to put up with
the constraints imposed by the existing suite of porting tools, or to update
their Python 2 code bases sufficiently that the problems are minimised.

This PEP is not for those communities. Instead, it is designed specifically to
help people that *don't* want to put up with those difficulties.

However, since the proposal is for a comparatively small tweak to the language
syntax with no semantic changes, it is feasible to support it as a third
party import hook. While such an import hook imposes some import time
overhead, and requires additional steps from each application that needs it
to get the hook in place, it allows applications that target Python 3.2
to use libraries and frameworks that would otherwise only run on Python 3.3+
due to their use of unicode literal prefixes.

One such import hook project is Vinay Sajip's ``uprefix`` [4]_.

For those that prefer to translate their code in advance rather than
converting on the fly at import time, Armin Ronacher is working on a hook
that runs at install time rather than during import [5]_.

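Whether it runs at import time or at install time, such a hook has to turn
prefixed literals into a form that Python 3.2 accepts. The following is a
minimal sketch of that translation step only, built on the standard library
``tokenize`` module. It is not taken from ``uprefix`` or from the install
hook, and the function name ``strip_unicode_prefixes`` is hypothetical::

    import io
    import tokenize


    def strip_unicode_prefixes(source):
        """Return source with the u/U prefix removed from string literals."""
        result = []
        reader = io.StringIO(source).readline
        for tok in tokenize.generate_tokens(reader):
            text = tok.string
            if tok.type == tokenize.STRING and text[:1] in ('u', 'U'):
                # Drop only the prefix; quotes and contents stay untouched.
                text = text[1:]
            # Keep the original positions so untokenize preserves the layout.
            result.append((tok.type, text, tok.start, tok.end, tok.line))
        return tokenize.untokenize(result)


    converted = strip_unicode_prefixes("greeting = u'hello' + U\"!\"\n")

A real hook would additionally need to plug this step into the import system
or the installation process, which is outside the scope of this sketch.
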
Combining the two approaches is of course also possible. For example, the
import hook could be used for rapid edit-test cycles during local
development, but the install hook for continuous integration tasks and
deployment on Python 3.2.

The approaches described in this section may prove useful, for example, for
applications that wish to target Python 3 on the Ubuntu 12.04 LTS release,
which will ship with Python 2.7 and 3.2 as officially supported Python
versions.

Complaint: Python 3 shouldn't be made worse just to support porting from Python 2
---------------------------------------------------------------------------------

This is indeed one of the key design principles of Python 3. However, one of
the key design principles of Python as a whole is that "practicality beats
purity". If we're going to impose a significant burden on third party
developers, we should have a solid rationale for doing so.

In most cases, the rationale for backwards incompatible Python 3 changes is
either to improve code correctness (for example, stricter default separation
of binary and text data and integer division upgrading to floats when
necessary), to reduce typical memory usage (for example, increased usage of
iterators and views over concrete lists), or to remove distracting nuisances
that make Python code harder to read without increasing its expressiveness
(for example, the comma based syntax for naming caught exceptions). Changes
backed by such reasoning are *not* going to be reverted, regardless of
objections from Python 2 developers attempting to make the transition to
Python 3.

In many cases, Python 2 offered two ways of doing things for historical reasons.
For example, inequality could be tested with both ``!=`` and ``<>`` and integer
literals could be specified with an optional ``L`` suffix. Such redundancies
have been eliminated in Python 3, which reduces the overall size of the
language and improves consistency across developers.

In the original Python 3 design (up to and including Python 3.2), the explicit
prefix syntax for unicode literals was deemed to fall into this category, as it
is completely unnecessary in Python 3. However, the difference between those
other cases and unicode literals is that the unicode literal prefix is *not*
redundant in Python 2 code: it is a programmatically significant distinction
that needs to be preserved in some fashion to avoid losing information.

While porting tools were created to help with the transition (see next section),
removing the prefix still creates an additional burden on heavy users of unicode
strings in Python 2, solely so that future developers learning Python 3 don't
need to be told "For historical reasons, string literals may have an optional
``u`` or ``U`` prefix. Never use this yourselves, it's just there to help with
porting from an earlier version of the language."

Plenty of students learning Python 2 received similar warnings regarding string
exceptions without being confused or irreparably stunted in their growth as
Python developers. It will be the same with this feature.

This point is further reinforced by the fact that Python 3 *still* allows the
uppercase variants of the ``B`` and ``R`` prefixes for bytes literals and raw
bytes and string literals. If the potential for confusion due to string prefix
variants is that significant, where was the outcry asking that these
redundant prefixes be removed along with all the other redundancies that were
eliminated in Python 3?

Just as support for string exceptions was eliminated from Python 2 using the
normal deprecation process, support for redundant string prefix characters
(specifically, ``B``, ``R``, ``u``, ``U``) may eventually be eliminated
from Python 3, regardless of the current acceptance of this PEP. However,
such a change will likely only occur once third party libraries supporting
Python 2.7 are about as common as libraries supporting Python 2.2 or 2.3 are
today.


Complaint: The WSGI "native strings" concept is an ugly hack
------------------------------------------------------------

One reason the removal of unicode literals has provoked such concern amongst
the web development community is that the updated WSGI specification had to
make a few compromises to minimise the disruption for existing web servers
that provide a WSGI-compatible interface (this was deemed necessary in order
to make the updated standard a viable target for web application authors and
web framework developers).

One of those compromises is the concept of a "native string". WSGI defines
three different kinds of string:

* text strings: handled as ``unicode`` in Python 2 and ``str`` in Python 3
* native strings: handled as ``str`` in both Python 2 and Python 3
* binary data: handled as ``str`` in Python 2 and ``bytes`` in Python 3

Some developers consider WSGI's "native strings" to be an ugly hack, as they
are *explicitly* documented as being used solely for ``latin-1`` decoded
"text", regardless of the actual encoding of the underlying data. Using this
approach bypasses many of the updates to Python 3's data model that are
designed to encourage correct handling of text encodings. However, it
generally works due to the specific details of the problem domain - web server
and web framework developers are some of the individuals *most* aware of how
blurry the line can get between binary data and text when working with HTTP
and related protocols, and how important it is to understand the implications
of the encodings in use when manipulating encoded text data. At the
*application* level most of these details are hidden from the developer by
the web frameworks and support libraries (both in Python 2 *and* in Python 3).

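The ``latin-1`` convention can be illustrated with a short round trip. This is
a sketch that assumes Python 3 and a UTF-8 encoded request path; the variable
names are illustrative and are not part of any WSGI API::

    # Bytes as they might arrive on the wire (UTF-8 is an assumption here).
    wire_bytes = '/caf\u00e9'.encode('utf-8')

    # A server exposes them as a "native string" by decoding as latin-1,
    # which maps every byte to the code point with the same value and
    # therefore never fails.
    native = wire_bytes.decode('latin-1')

    # An application that knows the real encoding reverses that step first.
    recovered_bytes = native.encode('latin-1')
    text = recovered_bytes.decode('utf-8')

The native string is lossless with respect to the original bytes, but it is
not the correctly decoded text until the second step has been applied.
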
In practice, native strings are a useful concept because there are some APIs
(both in the standard library and in third party frameworks and packages) and
some internal interpreter details that are designed primarily to work with
``str``. These components often don't support ``unicode`` in Python 2
or ``bytes`` in Python 3, or, if they do, require additional encoding details
and/or impose constraints that don't apply to the ``str`` variants.

Some examples of interfaces that are best handled by using actual ``str``
instances are:

* Python identifiers (as attributes, dict keys, class names, module names,
  import references, etc)
* URLs for the most part as well as HTTP headers in urllib/http servers
* WSGI environment keys and CGI-inherited values
* Python source code for dynamic compilation and AST hacks
* Exception messages
* ``__repr__`` return value
* preferred filesystem paths
* preferred OS environment

In Python 2.6 and 2.7, these distinctions are most naturally expressed as
follows:

* ``u""``: text string (``unicode``)
* ``""``: native string (``str``)
* ``b""``: binary data (``str``, also aliased as ``bytes``)

In Python 3, the ``latin-1`` decoded native strings are not distinguished
from any other text strings:

* ``""``: text string (``str``)
* ``""``: native string (``str``)
* ``b""``: binary data (``bytes``)

If ``from __future__ import unicode_literals`` is used to modify the behaviour
of Python 2, then, along with an appropriate definition of ``n()``, the
distinction can be expressed as:

* ``""``: text string
* ``n("")``: native string
* ``b""``: binary data

(While ``n=str`` works for simple cases, it can sometimes have problems
due to non-ASCII source encodings.)

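One possible definition of ``n()`` is sketched below. It is an assumption made
for illustration rather than a definition taken from any particular library,
and the choice of ``latin-1`` simply follows the WSGI convention for native
strings::

    import sys

    if sys.version_info[0] >= 3:
        def n(text):
            # Python 3: text strings already are native strings.
            return text
    else:
        def n(text):
            # Python 2 with unicode_literals: a bare literal is unicode,
            # so encode it explicitly to obtain a native 8-bit str.
            return text.encode('latin-1')

    header_name = n('Content-Type')
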
In the common subset of Python 2 and Python 3 (with appropriate
specification of a source encoding and definitions of the ``u()`` and ``b()``
helper functions), they can be expressed as:

* ``u("")``: text string
* ``""``: native string
* ``b("")``: binary data

That last approach is the only variant that supports Python 2.5 and earlier.

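The ``u()`` and ``b()`` helpers are not defined by this PEP. A minimal sketch,
loosely modelled on the approach taken by compatibility libraries such as
``six`` but simplified here (so the details should be treated as
assumptions), might look like this::

    import sys

    if sys.version_info[0] >= 3:
        def u(literal):
            # Bare literals are already text on Python 3.
            return literal

        def b(literal):
            # Bare literals are text, so encode them to obtain binary data.
            return literal.encode('latin-1')
    else:
        def u(literal):
            # Bare literals are 8-bit strings on Python 2; process Unicode
            # escape sequences while decoding to unicode.
            return literal.decode('unicode_escape')

        def b(literal):
            # Bare literals are already binary data on Python 2.
            return literal

    text = u('text')
    data = b('\x00\x01')
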
Of all the alternatives, the format currently supported in Python 2.6 and 2.7
is by far the cleanest approach that clearly distinguishes the three desired
kinds of behaviour. With this PEP, that format will also be supported in
Python 3.3+. It will also be supported in Python 3.1 and 3.2 through the use
of import and install hooks. While it is significantly less likely, it is
also conceivable that the hooks could be adapted to allow the use of the
``b`` prefix on Python 2.5.


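For illustration, the three kinds of string can be spelled as follows in a
single source file intended to run unchanged on Python 2.6, 2.7 and 3.3+ (the
variable names and values are illustrative only)::

    text = u'caf\u00e9'        # text string: unicode on 2.x, str on 3.x
    native = 'REQUEST_METHOD'  # native string: str on both
    data = b'\x89PNG'          # binary data: str on 2.x, bytes on 3.x

    # type('') is the native string type on whichever interpreter runs this.
    native_type = type('')

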
Complaint: The existing tools should be good enough for everyone
----------------------------------------------------------------

A commonly expressed sentiment from developers that have already successfully
ported applications to Python 3 is along the lines of "if you think it's hard,
you're doing it wrong" or "it's not that hard, just try it!". While it is no
doubt unintentional, these responses all have the effect of telling the
people that are pointing out inadequacies in the current porting toolset
"there's nothing wrong with the porting tools, you just suck and don't know
how to use them properly".

These responses are a case of completely missing the point of what people are
complaining about. The feedback that resulted in this PEP isn't due to people
complaining that ports aren't possible. Instead, the feedback is coming from
people that have successfully *completed* ports and are objecting that they
found the experience thoroughly *unpleasant* for the class of application that
they needed to port (specifically, Unicode-aware web frameworks and support
libraries).

This is a subjective appraisal, and it's the reason why the Python 3
porting tools ecosystem is a case where the "one obvious way to do it"
philosophy emphatically does *not* apply. While it was originally intended that
"develop in Python 2, convert with ``2to3``, test both" would be the standard
way to develop for both versions in parallel, in practice, the needs of
different projects and developer communities have proven to be sufficiently
diverse that a variety of approaches have been devised, allowing each group
to select an approach that best fits their needs.

Lennart Regebro has produced an excellent overview of the available migration
strategies [2]_, and a similar review is provided in the official porting
guide [3]_. (Note that the official guidance has softened to "it depends on
your specific situation" since Lennart wrote his overview.)

However, both of those guides are written from the founding assumption that
all of the developers involved are *already* committed to the idea of
supporting Python 3. They make no allowance for the *social* aspects of such a
change when you're interacting with a user base that may not be especially
tolerant of disruptions without a clear benefit, or are trying to persuade
Python 2 focused upstream developers to accept patches that are solely about
improving Python 3 forward compatibility.

With the current porting toolset, *every* migration strategy will result in
changes to *every* Unicode literal in a project. No exceptions. They will
be converted to either an unprefixed string literal (if the project decides to
adopt the ``unicode_literals`` import) or else to a converter call like
``u("text")``.

If the ``unicode_literals`` import approach is employed, but is not adopted
across the entire project at the same time, then the meaning of a bare string
literal may become annoyingly ambiguous. This problem can be particularly
pernicious for *aggregated* software, like a Django site - in such a situation,
some files may end up using the ``unicode_literals`` import and others may not,
creating definite potential for confusion.

While these problems are clearly solvable at a technical level, they're a
completely unnecessary distraction at the social level. Developer energy should
be reserved for addressing *real* technical difficulties associated with the
Python 3 transition (like distinguishing their 8-bit text strings from their
binary data). They shouldn't be punished with additional code changes (even
automated ones) solely due to the fact that they have *already* explicitly
identified their Unicode strings in Python 2.

Armin Ronacher has created an experimental extension to 2to3 which only
modernizes Python code to the extent that it runs on Python 2.7 or later with
support from the cross-version compatibility ``six`` library. This tool is
available as ``python-modernize`` [1]_. Currently, the deltas generated by
this tool will affect every Unicode literal in the converted source. This
will create legitimate concerns amongst upstream developers asked to accept
such changes, and amongst framework *users* being asked to change their
applications.

However, by eliminating the noise from changes to the Unicode literal syntax,
many projects could be cleanly and (comparatively) non-controversially made
forward compatible with Python 3.3+ just by running ``python-modernize`` and
applying the recommended changes.


References
==========

.. [1] Python-Modernize
   (http://github.com/mitsuhiko/python-modernize)

.. [2] Porting to Python 3: Migration Strategies
   (http://python3porting.com/strategies.html)

.. [3] Porting Python 2 Code to Python 3
   (http://docs.python.org/howto/pyporting.html)

.. [4] uprefix import hook project
   (https://bitbucket.org/vinay.sajip/uprefix)

.. [5] install hook to remove unicode string prefix characters
   (https://github.com/mitsuhiko/unicode-literals-pep/tree/master/install-hook)

Copyright
=========

This document has been placed in the public domain.


..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   End: