Rewrite PEP 414 to be less passionate in its tone and better address the common objections

Nick Coghlan 2012-03-04 17:24:43 +10:00
parent 6e7b0815b4
commit 6f949e069a
1 changed files with 302 additions and 190 deletions

@ -2,7 +2,8 @@ PEP: 414
Title: Explicit Unicode Literal for Python 3.3
Version: $Revision$
Last-Modified: $Date$
Author: Armin Ronacher <armin.ronacher@active-4.com>
Author: Armin Ronacher <armin.ronacher@active-4.com>,
        Nick Coghlan <ncoghlan@gmail.com>
Status: Accepted
Type: Standards Track
Content-Type: text/x-rst
@ -16,231 +17,339 @@ Abstract
This document proposes the reintegration of an explicit unicode literal
from Python 2.x to the Python 3.x language specification, in order to
enable side-by-side support of libraries for both Python 2 and Python 3
without the need for an explicit 2to3 run.
reduce the volume of changes needed when porting Unicode-aware
Python 2 applications to Python 3.
BDFL Pronouncement
==================
This PEP has been formally accepted for Python 3.3.
This PEP has been formally accepted for Python 3.3:
I'm accepting the PEP. It's about as harmless as they come. Make it so.
Rationale and Goals
===================
Proposal
========
Python 3 is a major new revision of the language, and it was decided very
early on that breaking backwards compatibility was part of the design. The
migration from a Python 2.x to a Python 3 codebase is to be accomplished
with the aid of a separate translation tool that converts the Python 2.x
sourcecode to Python 3 syntax. With more and more libraries supporting
Python 3, however, it has become clear that 2to3 as a tool is
insufficient, and people are now attempting to find ways to make the same
source work in both Python 2.x and Python 3.x, with varying levels of
success.
This PEP proposes that Python 3.3 restore support for Python 2's Unicode
literal syntax, substantially increasing the number of lines of existing
Python 2 code in Unicode aware applications that will run without modification
on Python 3.
Python 2.6 and Python 2.7 support syntax features from Python 3 which for
the most part make a unified code base possible. Many thought that the
``unicode_literals`` future import might make a common source possible,
but it turns out that it's doing more harm than good.
Specifically, the Python 3 definition for string literal prefixes will be
expanded to allow::
With the design of the updated WSGI specification a few new terms for
strings were loosely defined: unicode strings, byte strings and native
strings. In Python 3 the native string type is unicode, in Python 2 the
native string type is a bytestring. These native string types are used in
a couple of places. The native string type can be interned and is
preferably used for identifier names, filenames, source code and a few
other low level interpreter operations such as the return value of a
``__repr__`` or exception messages.
    "u" | "U" | "ur" | "UR" | "Ur" | "uR"
In Python 2.7 these string types can be defined explicitly. Without any
future imports ``b'foo'`` means bytestring, ``u'foo'`` declares a unicode
string and ``'foo'`` a native string which in Python 2.x means bytes.
With the ``unicode_literals`` import the native string type is no longer
available by syntax and has to be incorrectly labeled as bytestring. If
such a codebase is then used in Python 3, the interpreter will start using
byte objects in places where they are no longer accepted (such as
identifiers). This can be solved by a module that detects 2.x and 3.x and
provides wrapper functions that transcode literals at runtime (either by
having a ``u`` function that marks things as unicode without future
imports or the inverse by having a ``n`` function that marks strings as
native). Unfortunately, this has the side effect of slowing down the
runtime performance of Python and makes for less beautiful code.
Considering that Python 2 and Python 3 support for most libraries will
have to continue side by side for several more years to come, this means
that such modules lose one of Python's key properties: easily readable and
understandable code.
in addition to the currently supported::
Additionally, the vast majority of people who maintain Python 2.x
codebases are more familiar with Python 2.x semantics, and a per-file
difference in literal meanings will be very annoying for them in the long
run. A quick poll on Twitter about the use of the division future import
supported my suspicions that people opt out of behaviour-changing future
imports because they are a maintenance burden. Every time you review code
you have to check the top of the file to see if the behaviour was changed.
Obviously that was an unscientific informal poll, but it might be
something worth considering.
    "r" | "R"
Proposed Solution
=================
The following will all denote ordinary Python 3 strings::
The idea is to support (with Python 3.3) an explicit ``u`` and ``U``
prefix for native strings in addition to the prefix-less variants. These
would stick around for the entirety of the Python 3 lifetime but might at
some point yield deprecation warnings if deemed appropriate. This could
be something for pyflakes or other similar libraries to support.
    'text'
    "text"
    '''text'''
    """text"""
    u'text'
    u"text"
    u'''text'''
    u"""text"""
    U'text'
    U"text"
    U'''text'''
    U"""text"""
Python 3.2 and earlier
======================
Combination of the unicode prefix with the raw string prefix will also be
supported, just as it was in Python 2.
An argument against this proposal was made on the Python-Dev mailing list,
mentioning that Ubuntu LTS will ship Python 3.2 and 2.7 for only 5 years.
The counterargument is that Python 2.7 is currently the Python version of
choice for users who want LTS support. As it stands, when choosing between
2.7 and Python 3.2, Python 3 is currently not the best choice for certain
long-term investments, since the ecosystem is not yet properly developed,
and libraries are still fighting with their API decisions for Python 3.
No changes are proposed to Python 3's actual Unicode handling, only to the
acceptable forms for string literals.
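As a minimal illustration of the statement above, the following sketch assumes it is run on Python 3.3 or later, where the prefix is accepted again. The variable names are illustrative only. It shows that the prefix changes the spelling of the literal and nothing else.

```python
# Assumes Python 3.3 or later, where the u/U prefixes are accepted again.
plain = 'text'
prefixed_lower = u'text'
prefixed_upper = U"text"

# The prefix affects only how the literal is written; the result is an
# ordinary Python 3 str in every case.
assert type(prefixed_lower) is str
assert type(prefixed_upper) is str
assert plain == prefixed_lower == prefixed_upper
```

Because the objects are identical in type and value, code written this way behaves the same whether or not the prefix is present.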
A valid point is that this would encourage people to become dependent on
Python 3.3 for their ports. Fortunately that is not a big problem since
that could be fixed at installation time similar to how many projects are
currently invoking 2to3 as part of their installation process.
For Python 3.1 and Python 3.2 (even 3.0 if necessary) a simple
on-installation hook could be provided that tokenizes all source files and
strips away the otherwise unnecessary ``u`` prefix at installation time.
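The install-time idea described above could look roughly like the following sketch. The function name ``strip_u_prefixes`` is hypothetical, and the sketch relies on the running interpreter's ``tokenize`` module reporting a prefixed literal as a single ``STRING`` token, which holds on Python 3.3 and later. A real hook for Python 3.1 or 3.2 would likely need its own handling, since the tokenizer there may not recognise the prefix.

```python
import io
import tokenize


def strip_u_prefixes(source):
    """Return ``source`` with a leading u/U removed from each string literal.

    Hypothetical helper: it edits the text in place by position, so comments,
    spacing and everything else in the source are left untouched.
    """
    # Use the same line splitting that the tokenizer sees.
    lines = io.StringIO(source).readlines()
    positions = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.STRING and tok.string[:1] in ("u", "U"):
            positions.append(tok.start)  # (1-based row, 0-based column)
    # Work backwards so earlier columns on the same line stay valid.
    for row, col in reversed(positions):
        line = lines[row - 1]
        lines[row - 1] = line[:col] + line[col + 1:]
    return "".join(lines)


example = "x = u'text'\ny = U\"more\"\nz = 'plain'\nname = u_value\n"
stripped = strip_u_prefixes(example)
```

Only tokens that the tokenizer classifies as string literals are touched, so an identifier such as ``u_value`` is left alone.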
Who Benefits?
Author's Note
=============
There are a couple of places where decisions have to be made for or
against unicode support almost arbitrarily. This is mostly the case for
protocols that do not support unicode all the way down, or hide it behind
transport encodings that might or might not be unicode themselves. HTTP,
Email and WSGI are good examples of that. For certain ambiguous cases it
would be possible to apply the same logic for unicode that Python 3
applies to the Python 2 versions of the library as well but, if those
details were exposed to the user of the API, it would mean breaking
compatibility for existing users of the Python 2 API which is a no-go for
many situations. The automatic upgrading of binary strings to unicode
strings that would be enabled by this proposal would make it much easier
to port such libraries over.
This PEP was originally written by Armin Ronacher, and directly reflected his
feelings regarding his personal experiences porting Unicode aware Python
applications to Python 3. Guido's approval was given based on Armin's version
of the PEP.
Not only the libraries but also the users of these APIs would benefit from
that. For instance, the urllib module in Python 2 is using byte strings,
and the one in Python 3 is using unicode strings. By leveraging a native
string, users can avoid having to adjust for that.
The currently published version has been rewritten by Nick Coghlan to address
the concerns of those who felt that Armin's experience did not accurately
reflect the *typical* experience of porting to Python 3, but rather only
related to a specific subset of porting activities that were not well served
by the existing set of porting tools.
Problems with 2to3
==================
In practice 2to3 currently suffers from a few problems which make it
unnecessarily difficult and/or unpleasant to use:
- Bad overall performance. In many cases 2to3 runs 20 times slower than
the testsuite for the library or application it's testing. (This for
instance is the case for the Jinja2 library).
- Slightly different behaviour in 2to3 between different versions of
Python cause different outcomes when paired with custom fixers.
- Line numbers from error messages do not match up with the real source
lines due to added/rewritten imports.
- extending 2to3 with custom fixers is nontrivial without using
distribute. By default 2to3 works acceptably well for upgrading
byte-based APIs to unicode based APIs but it fails to upgrade APIs
which already support unicode to Python 3::
    --- test.py (original)
    +++ test.py (refactored)
    @@ -1,5 +1,5 @@
     class Foo(object):
         def __unicode__(self):
    -        return u'test'
    +        return 'test'
         def __str__(self):
    -        return unicode(self).encode('utf-8')
    +        return str(self).encode('utf-8')
Readers should be aware that many of the arguments in this PEP are *not*
technical ones. Instead, they relate heavily to the *social* and *personal*
aspects of software development. After all, developers are people first,
coders second.
APIs and Concepts Using Native Strings
======================================
Rationale
=========
The following is an incomplete list of APIs and general concepts that use
native strings and need implicit upgrading to unicode in Python 3, and
which would directly benefit from this support:
With the release of a Python 3 compatible version of the Web Services Gateway
Interface (WSGI) specification (PEP 3333) for Python 3.2, many parts of the
Python web ecosystem have been making a concerted effort to support Python 3
without adversely affecting their existing developer and user communities.
- Python identifiers (dict keys, class names, module names, import
paths)
- URLs for the most part as well as HTTP headers in urllib/http servers
- WSGI environment keys and CGI-inherited values
- Python source code for dynamic compilation and AST hacks
- Exception messages
- ``__repr__`` return value
- preferred filesystem paths
- preferred OS environment
One major item of feedback from key developers in those communities, including
Chris McDonough (WebOb, Pyramid), Armin Ronacher (Flask, Werkzeug), Jacob
Kaplan-Moss (Django) and Kenneth Reitz (``requests``) is that the requirement
to change the spelling of *every* Unicode literal in an application
(regardless of how that is accomplished) is a key stumbling block for porting
efforts.
In particular, unlike many of the other Python 3 changes, it isn't one that
framework and library authors can easily handle on behalf of their users. Most
of those users couldn't care less about the "purity" of the Python language
specification, they just want their websites and applications to work as well
as possible.
While it is the Python web community that has been most vocal in highlighting
this concern, it is expected that other highly Unicode aware domains (such as
GUI development) may run into similar issues as they (and their communities)
start making concerted efforts to support Python 3.
Modernizing Code
================
Common Objections
=================
The 2to3 tool can be easily adjusted to generate code that runs on both
Python 2 and Python 3. An experimental extension to 2to3 which only
modernizes Python code to the extent that it runs on Python 2.7 or later
with support for the ``six`` library is available as python-modernize
[1]_. For most cases the runtime impact of ``six`` can be neglected (like
a function that calls ``iteritems()`` on a passed dictionary under 2.x or
``items()`` under 3.x), but to make strings cheap for both 2.x and 3.x it
is nearly impossible. The way it currently works is by abusing the
``unicode-escape`` codec on Python 2.x native strings. This is especially
ugly if such a string literal is used in a tight loop.
This proposal would fix this. The modernize module could easily be
adjusted to simply not translate unicode strings, and the runtime overhead
would disappear.
This PEP may harm adoption of Python 3.2
----------------------------------------
Possible Downsides
==================
This complaint is interesting, as it carries within it a tacit admission that
this PEP *will* make it easier to port Unicode aware Python 2 applications to
Python 3.
The obvious downside for this is that potential Python 3 users would have
to be aware of the fact that ``u`` is an optional prefix for strings.
This is something that Python 3 in general tried to avoid. The second
inequality comparison operator was removed, the ``L`` prefix for long
integers etc. This PEP would propose a slight revert on that practice by
reintroducing redundant syntax. On the other hand, Python already has
multiple literals for strings with mostly the same behavior (single
quoted, double quoted, single triple quoted, double triple quoted).
There are many existing Python communities that are prepared to put up with
the constraints imposed by the existing suite of porting tools, or to update
their Python 2 code bases sufficiently that the problems are minimised.
Runtime Overhead of Wrappers
============================
This PEP is not for those communities. Instead, it is designed specifically to
help people that *don't* want to put up with those difficulties.
I did some basic timings on the performance of a ``u()`` wrapper function
as used by the ``six`` library. The implementation of ``u()`` is as
follows::
However, since the proposal is for a comparatively small tweak to the language
syntax with no semantic changes, it may be feasible to support it as a third
party import hook. While such an import hook will impose a small import time
overhead, and will require additional steps from each application that needs it
to get the hook in place, it would allow applications that target Python 3.2
to use libraries and frameworks that may otherwise only run on Python 3.3+.
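    import sys

    if sys.version_info >= (3, 0):
        # Python 3: literals are already unicode, so return them unchanged.
        def u(value):
            return value
    else:
        # Python 2: decode the native (byte) string at runtime.
        def u(value):
            return unicode(value, 'unicode-escape')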
This approach may prove useful, for example, for applications that wish to
target Python 3 for the Ubuntu LTS release that ships with Python 2.7 and 3.2.
The intention is that ``u'foo'`` can be turned to ``u('foo')`` and that on
Python 2.x an implicit decoding happens. In this case the wrapper will
have a decoding overhead for Python 2.x. I did some basic timings [2]_ to
see how bad the performance loss would be. The following examples measure
the execution time over 10000 iterations::
If such an import hook becomes available, this PEP will be updated to include
a reference to it.
    u'\N{SNOWMAN}barbaz'      1000 loops, best of 3: 295 usec per loop
    u('\N{SNOWMAN}barbaz')    10 loops, best of 3: 18.5 msec per loop
    u'foobarbaz_%d' % x       100 loops, best of 3: 8.32 msec per loop
    u('foobarbaz_%d') % x     10 loops, best of 3: 25.6 msec per loop
    u'fööbarbaz'              1000 loops, best of 3: 289 usec per loop
    u('fööbarbaz')            100 loops, best of 3: 15.1 msec per loop
    u'foobarbaz'              1000 loops, best of 3: 294 usec per loop
    u('foobarbaz')            100 loops, best of 3: 14.3 msec per loop
The overhead of the wrapper function in Python 3 is the price of a
function call since the function only has to return the argument
unchanged.
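A comparison of this kind can be reproduced on Python 3 with a sketch along these lines. It is not the original benchmark script: the statement strings and iteration count are assumptions chosen to mirror the figures above, and absolute timings will vary by machine and interpreter version.

```python
import timeit


def u(value):
    # On Python 3 the wrapper only returns its argument unchanged.
    return value


# Time a bare literal against the same literal passed through the wrapper.
literal_time = timeit.timeit("u'foobarbaz'", number=10000)
wrapper_time = timeit.timeit("u('foobarbaz')", globals={"u": u}, number=10000)

print("literal: %.6f s, wrapper: %.6f s" % (literal_time, wrapper_time))
```

Any difference between the two numbers on Python 3 reflects the cost of the extra function call, since no decoding takes place.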
Python 3 shouldn't be made worse just to support porting from Python 2
----------------------------------------------------------------------
This is indeed one of the key design principles of Python 3. However, one of
the key design principles of Python as a whole is that "practicality beats
purity". If we're going to impose a significant burden on third party
developers, we should have a solid rationale for doing so.
In most cases, the rationale for backwards incompatible Python 3 changes is
either to improve code correctness (for example, stricter separation of binary
and text data and integer division upgrading to floats when necessary), reduce
typical memory usage (for example, increased usage of iterators and views over
concrete lists), or to remove distracting nuisances that make Python code
harder to read without increasing its expressiveness (for example, the comma
based syntax for naming caught exceptions). Changes backed by such reasoning
are *not* going to be reverted, regardless of objections from Python 2
developers attempting to make the transition to Python 3.
In many cases, Python 2 offered two ways of doing things for historical reasons.
For example, inequality could be tested with both ``!=`` and ``<>`` and integer
literals could be specified with an optional ``L`` suffix. Such redundancies
have been eliminated in Python 3, which reduces the overall size of the
language and improves consistency across developers.
In the original Python 3 design (up to and including Python 3.2), the explicit
prefix syntax for unicode literals was deemed to fall into this category, as it
is completely unnecessary in Python 3. However, the difference between those
other cases and unicode literals is that the unicode literal prefix is *not*
redundant in Python 2 code: it is a programmatically significant distinction
that needs to be preserved in some fashion to avoid losing information.
While porting tools were created to help with the transition (see next section)
it still creates an additional burden on heavy users of unicode strings in
Python 2, solely so that future developers learning Python 3 don't need to be
told "For historical reasons, string literals may have an optional ``u`` or
``U`` prefix. Never use this yourselves, it's just there to help with porting
from an earlier version of the language."
Plenty of students learning Python 2 received similar warnings regarding string
exceptions without being confused or irreparably stunted in their growth as
Python developers. It will be the same with this feature.
This point is further reinforced by the fact that Python 3 *still* allows the
uppercase variants of the ``B`` and ``R`` prefixes for bytes literals and raw
bytes and string literals. If the potential for confusion due to string prefix
variants is that significant, where was the outcry asking that these
redundant prefixes be removed along with all the other redundancies that were
eliminated in Python 3?
Just as support for string exceptions was eliminated from Python 2 using the
normal deprecation process, support for redundant string prefix characters
(specifically, ``B``, ``R``, ``u``, ``U``) may be eventually eliminated
from Python 3, regardless of the current acceptance of this PEP.
The WSGI "native strings" concept is an ugly hack, anyway
---------------------------------------------------------
One reason the removal of unicode literals has provoked such concern amongst
the web development community is that the updated WSGI specification had to
make a few compromises to minimise the disruption for existing web servers
that provide a WSGI-compatible interface (this was deemed necessary in order
to make the updated standard a viable target for web application authors and
web framework developers).
One of those compromises is the concept of a "native string". WSGI defines
three different kinds of string:
* text strings: handled as ``unicode`` in Python 2 and ``str`` in Python 3
* native strings: handled as ``str`` in both Python 2 and Python 3
* binary data: handled as ``str`` in Python 2 and ``bytes`` in Python 3
Native strings are a useful concept because there are some APIs and internal
operations that are designed primarily to work with native strings. They often
don't support ``unicode`` in Python 2 and don't support ``bytes`` in Python 3
(at least, not without needing additional encoding information and/or imposing
constraints that don't apply to the native string variants).
Some examples of such interfaces are:
* Python identifiers (dict keys, class names, module names, import paths)
* URLs for the most part as well as HTTP headers in urllib/http servers
* WSGI environment keys and CGI-inherited values
* Python source code for dynamic compilation and AST hacks
* Exception messages
* ``__repr__`` return value
* preferred filesystem paths
* preferred OS environment
In Python 2.6 and 2.7, these distinctions are most naturally expressed as
follows:
* ``u""``: text string
* ``""``: native string
* ``b""``: binary data
In Python 3, the native strings are not distinguished from any other text
strings:
* ``""``: text string
* ``""``: native string
* ``b""``: binary data
If ``from __future__ import unicode_literals`` is used to modify the behaviour
of Python 2, then, along with an appropriate definition of ``n()``, the
distinction can be expressed as:
* ``""``: text string
* ``n("")``: native string
* ``b""``: binary data
(While ``n=str`` works for simple cases, it can sometimes have problems
due to non-ASCII source encodings)
In the common subset of Python 2 and Python 3 (with appropriate
specification of a source encoding and definitions of the ``u()`` and ``b()``
helper functions), they can be expressed as:
* ``u("")``: text string
* ``""``: native string
* ``b("")``: binary data
That last approach is the only variant that supports Python 2.5 and earlier.
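One possible shape for the ``u()`` and ``b()`` helpers mentioned above is sketched below. These definitions are assumptions for illustration and are not taken from any particular library: the Python 2 branch decodes with ``unicode-escape``, and the Python 3 branch encodes with ``latin-1`` so that each character maps to exactly one byte.

```python
import sys

if sys.version_info >= (3, 0):
    def u(value):
        # Native strings are already text on Python 3.
        return value

    def b(value):
        # Turn a native (text) string into bytes, one byte per character.
        return value.encode('latin-1')
else:
    def u(value):
        # Decode a native (byte) string into text on Python 2.
        return unicode(value, 'unicode-escape')

    def b(value):
        # Native strings are already binary data on Python 2.
        return value

text = u("text")
data = b("data")
```

With these in place, a bare ``""`` literal remains the native string on both versions, while ``u("")`` and ``b("")`` mark text and binary data explicitly.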
Of all the alternatives, the format currently supported in Python 2.6 and 2.7
is by far the cleanest. With this PEP, that format will also be supported in
Python 3.3+. If the import hook approach works out as planned, it may even be
supported in Python 3.1 and 3.2. A bit more effort could likely adapt the hook
to allow the use of the ``b`` prefix on Python 2.5.
The existing tools should be good enough for everyone
-----------------------------------------------------
A commonly expressed sentiment from developers that have already successfully
ported applications to Python 3 is along the lines of "if you think it's hard,
you're doing it wrong" or "it's not that hard, just try it!". While it is no
doubt unintentional, these responses all have the effect of telling the
people that are pointing out inadequacies in the current porting toolset
"there's nothing wrong with the porting tools, you just suck and don't know
how to use them properly".
These responses are a case of completely missing the point of what people are
complaining about. The feedback that resulted in this PEP isn't due to people
complaining that ports aren't possible. Instead, the feedback is coming from
people that have successfully *completed* ports and are objecting that they
found the experience thoroughly *unpleasant* for the class of application that
they needed to port (specifically, Unicode aware web frameworks and support
libraries).
This is a subjective appraisal, and it's the reason why the Python 3
porting tools ecosystem is a case where the "one obvious way to do it"
philosophy emphatically does *not* apply. While it was originally intended that
"develop in Python 2, convert with ``2to3``, test both" would be the standard
way to develop for both versions in parallel, in practice, the needs of
different projects and developer communities have proven to be sufficiently
diverse that a variety of approaches have been devised, allowing each group
to select an approach that best fits their needs.
Lennart Regebro has produced an excellent overview of the available migration
strategies [2]_, and a similar review is provided in the official porting
guide [3]_. (Note that the official guidance has softened to "it depends on
your specific situation" since Lennart wrote his overview).
However, both of those guides are written from the founding assumption that
all of the developers involved are *already* committed to the idea of
supporting Python 3. They make no allowance for the *social* aspects of such a
change when you're interacting with a user base that may not be especially
tolerant of disruptions without a clear benefit, or are trying to persuade
Python 2 focused upstream developers to accept patches that are solely about
improving Python 3 forward compatibility.
With the current porting toolset, *every* migration strategy will result in
changes to *every* Unicode literal in a project. No exceptions. They will
be converted to either an unprefixed string literal (if the project decides to
adopt the ``unicode_literals`` import) or else to a converter call like
``u("text")``.
If the ``unicode_literals`` import approach is employed, but is not adopted
across the entire project at the same time, then the meaning of a bare string
literal may become annoyingly ambiguous. This problem can be particularly
pernicious for *aggregated* software, like a Django site - in such a situation,
some files may end up using the unicode literals import and others may not,
creating definite potential for confusion.
While these problems are clearly solvable at a technical level, they're a
completely unnecessary distraction at the social level. Developer energy should
be reserved for addressing *real* technical difficulties associated with the
Python 3 transition (like distinguishing their 8-bit text strings from their
binary data). They shouldn't be punished with additional code changes (even
automated ones) solely due to the fact that they have *already* explicitly
identified their Unicode strings in Python 2.
Armin Ronacher has created an experimental extension to 2to3 which only
modernizes Python code to the extent that it runs on Python 2.7 or later with
support from the cross-version compatibility ``six`` library. This tool is
available as ``python-modernize`` [1]_. Currently, the deltas generated by this tool will
affect every Unicode literal in the converted source. This will create
legitimate concerns amongst upstream developers asked to accept such changes.
However, by eliminating the noise from changes to the Unicode literal syntax,
many projects could be cleanly and (relatively) non-controversially made
forward compatible with Python 3.3+ just by running ``python-modernize`` and
applying the recommended changes.
References
@ -248,9 +357,12 @@ References
.. [1] Python-Modernize
(http://github.com/mitsuhiko/python-modernize)
.. [2] Benchmark
(https://github.com/mitsuhiko/unicode-literals-pep/blob/master/timing.py)
.. [2] Porting to Python 3: Migration Strategies
(http://python3porting.com/strategies.html)
.. [3] Porting Python 2 Code to Python 3
(http://docs.python.org/howto/pyporting.html)
Copyright
=========