Rewrite PEP 414 to be less passionate in its tone and better address the common objections

2012-03-04 17:24:43 +10:00 · 2012-03-04 17:24:43 +10:00 · 6f949e069a
parent 6e7b0815b4
commit 6f949e069a
1 changed files with 302 additions and 190 deletions
--- a/pep-0414.txt
+++ b/pep-0414.txt
@@ -2,7 +2,8 @@ PEP: 414
 Title: Explicit Unicode Literal for Python 3.3
 Version: $Revision$
 Last-Modified: $Date$
-Author: Armin Ronacher <armin.ronacher@active-4.com>
+Author: Armin Ronacher <armin.ronacher@active-4.com>,
+        Nick Coghlan <ncoghlan@gmail.com>
 Status: Accepted
 Type: Standards Track
 Content-Type: text/x-rst
@@ -16,231 +17,339 @@ Abstract

 This document proposes the reintegration of an explicit unicode literal
 from Python 2.x to the Python 3.x language specification, in order to
-enable side-by-side support of libraries for both Python 2 and Python 3
-without the need for an explicit 2to3 run.
+reduce the volume of changes needed when porting Unicode-aware
+Python 2 applications to Python 3.


 BDFL Pronouncement
 ==================

-This PEP has been formally accepted for Python 3.3.
+This PEP has been formally accepted for Python 3.3:
+
+    I'm accepting the PEP. It's about as harmless as they come. Make it so.


-Rationale and Goals
-===================
+Proposal
+========

-Python 3 is a major new revision of the language, and it was decided very
-early on that breaking backwards compatibility was part of the design. The
-migration from a Python 2.x to a Python 3 codebase is to be accomplished
-with the aid of a separate translation tool that converts the Python 2.x
-sourcecode to Python 3 syntax.  With more and more libraries supporting
-Python 3, however, it has become clear that 2to3 as a tool is
-insufficient, and people are now attempting to find ways to make the same
-source work in both Python 2.x and Python 3.x, with varying levels of
-success.
+This PEP proposes that Python 3.3 restore support for Python 2's Unicode
+literal syntax, substantially increasing the number of lines of existing
+Python 2 code in Unicode aware applications that will run without modification
+on Python 3.

-Python 2.6 and Python 2.7 support syntax features from Python 3 which for
-the most part make a unified code base possible.  Many thought that the
-``unicode_literals`` future import might make a common source possible,
-but it turns out that it's doing more harm than good.
+Specifically, the Python 3 definition for string literal prefixes will be
+expanded to allow::

-With the design of the updated WSGI specification a few new terms for
-strings were loosely defined: unicode strings, byte strings and native
-strings.  In Python 3 the native string type is unicode, in Python 2 the
-native string type is a bytestring.  These native string types are used in
-a couple of places.  The native string type can be interned and is
-preferably used for identifier names, filenames, source code and a few
-other low level interpreter operations such as the return value of a
-``__repr__`` or exception messages.
+    "u" | "U" | "ur" | "UR" | "Ur" | "uR"

-In Python 2.7 these string types can be defined explicitly.  Without any
-future imports ``b'foo'`` means bytestring, ``u'foo'`` declares a unicode
-string and ``'foo'`` a native string which in Python 2.x means bytes.
-With the ``unicode_literals`` import the native string type is no longer
-available by syntax and has to be incorrectly labeled as bytestring.  If
-such a codebase is then used in Python 3, the interpreter will start using
-byte objects in places where they are no longer accepted (such as
-identifiers).  This can be solved by a module that detects 2.x and 3.x and
-provides wrapper functions that transcode literals at runtime (either by
-having a ``u`` function that marks things as unicode without future
-imports or the inverse by having a ``n`` function that marks strings as
-native).  Unfortunately, this has the side effect of slowing down the
-runtime performance of Python and makes for less beautiful code.
-Considering that Python 2 and Python 3 support for most libraries will
-have to continue side by side for several more years to come, this means
-that such modules lose one of Python's key properties: easily readable and
-understandable code.
+in addition to the currently supported::

-Additionally, the vast majority of people who maintain Python 2.x
-codebases are more familiar with Python 2.x semantics, and a per-file
-difference in literal meanings will be very annoying for them in the long
-run.  A quick poll on Twitter about the use of the division future import
-supported my suspicions that people opt out of behaviour-changing future
-imports because they are a maintenance burden.  Every time you review code
-you have to check the top of the file to see if the behaviour was changed.
-Obviously that was an unscientific informal poll, but it might be
-something worth considering.
+    "r" | "R"

-Proposed Solution
-=================
+The following will all denote ordinary Python 3 strings::

-The idea is to support (with Python 3.3) an explicit ``u`` and ``U``
-prefix for native strings in addition to the prefix-less variants.  These
-would stick around for the entirety of the Python 3 lifetime but might at
-some point yield deprecation warnings if deemed appropriate.  This could
-be something for pyflakes or other similar libraries to support.
+    'text'
+    "text"
+    '''text'''
+    """text"""
+    u'text'
+    u"text"
+    u'''text'''
+    u"""text"""
+    U'text'
+    U"text"
+    U'''text'''
+    U"""text"""

-Python 3.2 and earlier
-======================
+Combination of the unicode prefix with the raw string prefix will also be
+supported, just as it was in Python 2.

-An argument against this proposal was made on the Python-Dev mailinglist,
-mentioning that Ubuntu LTS will ship Python 3.2 and 2.7 for only 5 years.
-The counterargument is that Python 2.7 is currently the Python version of
-choice for users who want LTS support.  As it stands, when chosing between
-2.7 and Python 3.2, Python 3 is currently not the best choice for certain
-long-term investments, since the ecosystem is not yet properly developed,
-and libraries are still fighting with their API decisions for Python 3.
+No changes are proposed to Python 3's actual Unicode handling, only to the
+acceptable forms for string literals.
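
As a quick illustration of the literal forms listed in the proposal, the following is a minimal sketch, assuming a Python 3.3 or later interpreter. The variable names are illustrative only and do not come from the PEP. Each prefixed spelling should yield the same ``str`` value as the unprefixed one:

```python
# Assumes Python 3.3+, where the u/U prefix is accepted again.
plain = 'text'
prefixed = [u'text', u"text", u'''text''', u"""text""",
            U'text', U"text", U'''text''', U"""text"""]

# Every prefixed literal is an ordinary str equal to the unprefixed form.
all_plain_str = all(type(s) is str and s == plain for s in prefixed)
print(all_plain_str)
```

The raw-plus-unicode combinations are deliberately left out of this sketch, since it only exercises the plain ``u`` and ``U`` forms.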

-A valid point is that this would encourage people to become dependent on
-Python 3.3 for their ports.  Fortunately that is not a big problem since
-that could be fixed at installation time similar to how many projects are
-currently invoking 2to3 as part of their installation process.

-For Python 3.1 and Python 3.2 (even 3.0 if necessary) a simple
-on-installation hook could be provided that tokenizes all source files and
-strips away the otherwise unnecessary ``u`` prefix at installation time.
-
-Who Benefits?
+Author's Note
 =============

-There are a couple of places where decisions have to be made for or
-against unicode support almost arbitrarily.  This is mostly the case for
-protocols that do not support unicode all the way down, or hide it behind
-transport encodings that might or might not be unicode themselves.  HTTP,
-Email and WSGI are good examples of that.  For certain ambiguous cases it
-would be possible to apply the same logic for unicode that Python 3
-applies to the Python 2 versions of the library as well but, if those
-details were exposed to the user of the API, it would mean breaking
-compatibility for existing users of the Python 2 API which is a no-go for
-many situations.  The automatic upgrading of binary strings to unicode
-strings that would be enabled by this proposal would make it much easier
-to port such libraries over.
+This PEP was originally written by Armin Ronacher, and directly reflected his
+feelings regarding his personal experiences porting Unicode aware Python
+applications to Python 3. Guido's approval was given based on Armin's version
+of the PEP.

-Not only the libraries but also the users of these APIs would benefit from
-that.  For instance, the urllib module in Python 2 is using byte strings,
-and the one in Python 3 is using unicode strings.  By leveraging a native
-string, users can avoid having to adjust for that.
+The currently published version has been rewritten by Nick Coghlan to address
+the concerns of those who felt that Armin's experience did not accurately
+reflect the *typical* experience of porting to Python 3, but rather only
+related to a specific subset of porting activities that were not well served
+by the existing set of porting tools.

-Problems with 2to3
-==================
-
-In practice 2to3 currently suffers from a few problems which make it
-unnecessarily difficult and/or unpleasant to use:
-
-   Bad overall performance.  In many cases 2to3 runs 20 times slower than
-    the testsuite for the library or application it's testing.  (This for
-    instance is the case for the Jinja2 library).
-   Slightly different behaviour in 2to3 between different versions of
-    Python cause different outcomes when paired with custom fixers.
-   Line numbers from error messages do not match up with the real source
-    lines due to added/rewritten imports.
-   extending 2to3 with custom fixers is nontrivial without using
-    distribute.  By default 2to3 works acceptably well for upgrading
-    byte-based APIs to unicode based APIs but it fails to upgrade APIs
-    which already support unicode to Python 3::
-
-        --- test.py (original)
-        +++ test.py (refactored)
-        @@ -1,5 +1,5 @@
-         class Foo(object):
-             def __unicode__(self):
-        -        return u'test'
-        +        return 'test'
-             def __str__(self):
-        -        return unicode(self).encode('utf-8')
-        +        return str(self).encode('utf-8')
+Readers should be aware that many of the arguments in this PEP are *not*
+technical ones. Instead, they relate heavily to the *social* and *personal*
+aspects of software development. After all, developers are people first,
+coders second.


-APIs and Concepts Using Native Strings
-======================================
+Rationale
+=========

-The following is an incomplete list of APIs and general concepts that use
-native strings and need implicit upgrading to unicode in Python 3, and
-which would directly benefit from this support:
+With the release of a Python 3 compatible version of the Web Services Gateway
+Interface (WSGI) specification (PEP 3333) for Python 3.2, many parts of the
+Python web ecosystem have been making a concerted effort to support Python 3
+without adversely affecting their existing developer and user communities.

-   Python identifiers (dict keys, class names, module names, import
-    paths)
-   URLs for the most part as well as HTTP headers in urllib/http servers
-   WSGI environment keys and CGI-inherited values
-   Python source code for dynamic compilation and AST hacks
-   Exception messages
-   ``__repr__`` return value
-   preferred filesystem paths
-   preferred OS environment
+One major item of feedback from key developers in those communities, including
+Chris McDonough (WebOb, Pyramid), Armin Ronacher (Flask, Werkzeug), Jacob
+Kaplan-Moss (Django) and Kenneth Reitz (``requests``) is that the requirement
+to change the spelling of *every* Unicode literal in an application
+(regardless of how that is accomplished) is a key stumbling block for porting
+efforts.
+
+In particular, unlike many of the other Python 3 changes, it isn't one that
+framework and library authors can easily handle on behalf of their users. Most
+of those users couldn't care less about the "purity" of the Python language
+specification; they just want their websites and applications to work as well
+as possible.
+
+While it is the Python web community that has been most vocal in highlighting
+this concern, it is expected that other highly Unicode aware domains (such as
+GUI development) may run into similar issues as they (and their communities)
+start making concerted efforts to support Python 3.


-Modernizing Code
-================
+Common Objections
+=================

-The 2to3 tool can be easily adjusted to generate code that runs on both
-Python 2 and Python 3.  An experimental extension to 2to3 which only
-modernizes Python code to the extent that it runs on Python 2.7 or later
-with support for the ``six`` library is available as python-modernize
-[1]_. For most cases the runtime impact of ``six`` can be neglected (like
-a function that calls ``iteritems()`` on a passed dictionary under 2.x or
-``items()`` under 3.x), but to make strings cheap for both 2.x and 3.x it
-is nearly impossible.  The way it currently works is by abusing the
-``unicode-escape`` codec on Python 2.x native strings.  This is especially
-ugly if such a string literal is used in a tight loop.

-This proposal would fix this.  The modernize module could easily be
-adjusted to simply not translate unicode strings, and the runtime overhead
-would disappear.
+This PEP may harm adoption of Python 3.2
+----------------------------------------

-Possible Downsides
-==================
+This complaint is interesting, as it carries within it a tacit admission that
+this PEP *will* make it easier to port Unicode aware Python 2 applications to
+Python 3.

-The obvious downside for this is that potential Python 3 users would have
-to be aware of the fact that ``u`` is an optional prefix for strings.
-This is something that Python 3 in general tried to avoid.  The second
-inequality comparison operator was removed, the ``L`` prefix for long
-integers etc.  This PEP would propose a slight revert on that practice by
-reintroducing redundant syntax.  On the other hand, Python already has
-multiple literals for strings with mostly the same behavior (single
-quoted, double quoted, single triple quoted, double triple quoted).
+There are many existing Python communities that are prepared to put up with
+the constraints imposed by the existing suite of porting tools, or to update
+their Python 2 code bases sufficiently that the problems are minimised.

-Runtime Overhead of Wrappers
-============================
+This PEP is not for those communities. Instead, it is designed specifically to
+help people that *don't* want to put up with those difficulties.

-I did some basic timings on the performance of a ``u()`` wrapper function
-as used by the ``six`` library.  The implementation of ``u()`` is as
-follows::
+However, since the proposal is for a comparatively small tweak to the language
+syntax with no semantic changes, it may be feasible to support it as a third
+party import hook. While such an import hook will impose a small import time
+overhead, and will require additional steps from each application that needs it
+to get the hook in place, it would allow applications that target Python 3.2
+to use libraries and frameworks that may otherwise only run on Python 3.3+.

-    if sys.version_info >= (3, 0):
-        def u(value):
-            return value
-    else:
-        def u(value):
-            return unicode(value, 'unicode-escape')
+This approach may prove useful, for example, for applications that wish to
+target Python 3 for the Ubuntu LTS release that ships with Python 2.7 and 3.2.

-The intention is that ``u'foo'`` can be turned to ``u('foo')`` and that on
-Python 2.x an implicit decoding happens.  In this case the wrapper will
-have a decoding overhead for Python 2.x.  I did some basic timings [2]_ to
-see how bad the performance loss would be.  The following examples measure
-the execution time over 10000 iterations::
+If such an import hook becomes available, this PEP will be updated to include
+a reference to it.
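
The prefix-stripping idea behind such a hook can be sketched with the standard ``tokenize`` module. This is only an illustration under stated assumptions, not the hook itself: the function name ``strip_u_prefix`` is hypothetical, and the sketch assumes it runs on an interpreter whose tokenizer already treats ``u'...'`` as a single string token (Python 3.3+). A real hook for Python 3.2 would probably have to cope with the prefix arriving as a separate name token, and would also need the import machinery around it.

```python
import io
import tokenize


def strip_u_prefix(source):
    """Return source text with a leading u/U removed from string literals.

    Hypothetical helper for illustration; assumes the running tokenizer
    reports u'...' as one STRING token.
    """
    tokens = []
    readline = io.StringIO(source).readline
    for tok in tokenize.generate_tokens(readline):
        if tok.type == tokenize.STRING and tok.string[:1] in ('u', 'U'):
            # Drop only the prefix character; token positions are kept so
            # that untokenize preserves the original spacing between tokens.
            tok = tok._replace(string=tok.string[1:])
        tokens.append(tok)
    return tokenize.untokenize(tokens)


converted = strip_u_prefix("x = u'text'\ny = b'u data'\n")
print(converted)
```

Bytes literals and unprefixed strings pass through unchanged, because only tokens that start with ``u`` or ``U`` are rewritten.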

-    u'\N{SNOWMAN}barbaz'            1000 loops, best of 3: 295 usec per loop
-    u('\N{SNOWMAN}barbaz')          10 loops, best of 3: 18.5 msec per loop
-    u'foobarbaz_%d' % x             100 loops, best of 3: 8.32 msec per loop
-    u('foobarbaz_%d') % x           10 loops, best of 3: 25.6 msec per loop
-    u'fööbarbaz'                    1000 loops, best of 3: 289 usec per loop
-    u('fööbarbaz')                  100 loops, best of 3: 15.1 msec per loop
-    u'foobarbaz'                    1000 loops, best of 3: 294 usec per loop
-    u('foobarbaz')                  100 loops, best of 3: 14.3 msec per loop

-The overhead of the wrapper function in Python 3 is the price of a
-function call since the function only has to return the argument
-unchanged.
+Python 3 shouldn't be made worse just to support porting from Python 2
+----------------------------------------------------------------------
+
+This is indeed one of the key design principles of Python 3. However, one of
+the key design principles of Python as a whole is that "practicality beats
+purity". If we're going to impose a significant burden on third party
+developers, we should have a solid rationale for doing so.
+
+In most cases, the rationale for backwards incompatible Python 3 changes is
+either to improve code correctness (for example, stricter separation of binary
+and text data and integer division upgrading to floats when necessary), reduce
+typical memory usage (for example, increased usage of iterators and views over
+concrete lists), or to remove distracting nuisances that make Python code
+harder to read without increasing its expressiveness (for example, the comma
+based syntax for naming caught exceptions). Changes backed by such reasoning
+are *not* going to be reverted, regardless of objections from Python 2
+developers attempting to make the transition to Python 3.
+
+In many cases, Python 2 offered two ways of doing things for historical reasons.
+For example, inequality could be tested with both ``!=`` and ``<>`` and integer
+literals could be specified with an optional ``L`` suffix. Such redundancies
+have been eliminated in Python 3, which reduces the overall size of the
+language and improves consistency across developers.
+
+In the original Python 3 design (up to and including Python 3.2), the explicit
+prefix syntax for unicode literals was deemed to fall into this category, as it
+is completely unnecessary in Python 3. However, the difference between those
+other cases and unicode literals is that the unicode literal prefix is *not*
+redundant in Python 2 code: it is a programmatically significant distinction
+that needs to be preserved in some fashion to avoid losing information.
+
+While porting tools were created to help with the transition (see next section),
+removing the prefix still creates an additional burden on heavy users of unicode
+strings in Python 2, solely so that future developers learning Python 3 don't
+need to be told "For historical reasons, string literals may have an optional
+``u`` or ``U`` prefix. Never use this yourselves, it's just there to help with
+porting from an earlier version of the language."
+
+Plenty of students learning Python 2 received similar warnings regarding string
+exceptions without being confused or irreparably stunted in their growth as
+Python developers. It will be the same with this feature.
+
+This point is further reinforced by the fact that Python 3 *still* allows the
+uppercase variants of the ``B`` and ``R`` prefixes for bytes literals and raw
+bytes and string literals. If the potential for confusion due to string prefix
+variants is that significant, where was the outcry asking that these
+redundant prefixes be removed along with all the other redundancies that were
+eliminated in Python 3?
+
+Just as support for string exceptions was eliminated from Python 2 using the
+normal deprecation process, support for redundant string prefix characters
+(specifically, ``B``, ``R``, ``u``, ``U``) may be eventually eliminated
+from Python 3, regardless of the current acceptance of this PEP.
+
+
+The WSGI "native strings" concept is an ugly hack, anyway
+---------------------------------------------------------
+
+One reason the removal of unicode literals has provoked such concern amongst
+the web development community is that the updated WSGI specification had to
+make a few compromises to minimise the disruption for existing web servers
+that provide a WSGI-compatible interface (this was deemed necessary in order
+to make the updated standard a viable target for web application authors and
+web framework developers).
+
+One of those compromises is the concept of a "native string". WSGI defines
+three different kinds of string:
+
+* text strings: handled as ``unicode`` in Python 2 and ``str`` in Python 3
+* native strings: handled as ``str`` in both Python 2 and Python 3
+* binary data: handled as ``str`` in Python 2 and ``bytes`` in Python 3
+
+Native strings are a useful concept because there are some APIs and internal
+operations that are designed primarily to work with native strings. They often
+don't support ``unicode`` in Python 2 and don't support ``bytes`` in Python 3
+(at least, not without needing additional encoding information and/or imposing
+constraints that don't apply to the native string variants).
+
+Some examples of such interfaces are:
+
+* Python identifiers (dict keys, class names, module names, import paths)
+* URLs for the most part as well as HTTP headers in urllib/http servers
+* WSGI environment keys and CGI-inherited values
+* Python source code for dynamic compilation and AST hacks
+* Exception messages
+* ``__repr__`` return value
+* preferred filesystem paths
+* preferred OS environment
+
+In Python 2.6 and 2.7, these distinctions are most naturally expressed as
+follows:
+
+* ``u""``: text string
+* ``""``: native string
+* ``b""``: binary data
+
+In Python 3, the native strings are not distinguished from any other text
+strings:
+
+* ``""``: text string
+* ``""``: native string
+* ``b""``: binary data
+
+If ``from __future__ import unicode_literals`` is used to modify the behaviour
+of Python 2, then, along with an appropriate definition of ``n()``, the
+distinction can be expressed as:
+
+* ``""``: text string
+* ``n("")``: native string
+* ``b""``: binary data
+
+(While ``n=str`` works for simple cases, it can sometimes have problems
+due to non-ASCII source encodings.)
+
+In the common subset of Python 2 and Python 3 (with appropriate
+specification of a source encoding and definitions of the ``u()`` and ``b()``
+helper functions), they can be expressed as:
+
+* ``u("")``: text string
+* ``""``: native string
+* ``b("")``: binary data
+
+That last approach is the only variant that supports Python 2.5 and earlier.
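
The ``u()`` and ``b()`` helpers mentioned for the common subset might look roughly like the sketch below. The ``u()`` definition follows the wrapper quoted in the earlier version of this PEP. The ``b()`` definition, and in particular its use of the ``latin-1`` codec, is an assumption made for illustration rather than something the PEP specifies.

```python
import sys

if sys.version_info >= (3, 0):
    def u(value):
        # Native strings are already text on Python 3.
        return value

    def b(value):
        # Assumption: literals passed to b() contain only latin-1 characters.
        return value.encode('latin-1')
else:
    def u(value):
        # Decode escapes so u('\\u2603') behaves like a u'' literal on Python 2.
        return unicode(value, 'unicode-escape')  # noqa: F821 (Python 2 only)

    def b(value):
        # Native strings are already binary data on Python 2.
        return value


text = u("text")
data = b("data")
print(type(text).__name__, type(data).__name__)
```

With helpers like these, every text and binary literal becomes a function call, which is the per-literal rewrite that this PEP aims to make unnecessary for text strings.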
+
+Of all the alternatives, the format currently supported in Python 2.6 and 2.7
+is by far the cleanest. With this PEP, that format will also be supported in
+Python 3.3+. If the import hook approach works out as planned, it may even be
+supported in Python 3.1 and 3.2. A bit more effort could likely adapt the hook
+to allow the use of the ``b`` prefix on Python 2.5.
+
+
+The existing tools should be good enough for everyone
+-----------------------------------------------------
+
+A commonly expressed sentiment from developers that have already successfully
+ported applications to Python 3 is along the lines of "if you think it's hard,
+you're doing it wrong" or "it's not that hard, just try it!". While it is no
+doubt unintentional, these responses all have the effect of telling the
+people that are pointing out inadequacies in the current porting toolset
+"there's nothing wrong with the porting tools, you just suck and don't know
+how to use them properly".
+
+These responses are a case of completely missing the point of what people are
+complaining about. The feedback that resulted in this PEP isn't due to people complaining that ports aren't possible. Instead, the feedback is coming from
+people that have successfully *completed* ports and are objecting that they
+found the experience thoroughly *unpleasant* for the class of application that
+they needed to port (specifically, Unicode aware web frameworks and support
+libraries).
+
+This is a subjective appraisal, and it's the reason why the Python 3
+porting tools ecosystem is a case where the "one obvious way to do it"
+philosophy emphatically does *not* apply. While it was originally intended that
+"develop in Python 2, convert with ``2to3``, test both" would be the standard
+way to develop for both versions in parallel, in practice, the needs of
+different projects and developer communities have proven to be sufficiently
+diverse that a variety of approaches have been devised, allowing each group
+to select an approach that best fits their needs.
+
+Lennart Regebro has produced an excellent overview of the available migration
+strategies [2]_, and a similar review is provided in the official porting
+guide [3]_. (Note that the official guidance has softened to "it depends on
+your specific situation" since Lennart wrote his overview).
+
+However, both of those guides are written from the founding assumption that
+all of the developers involved are *already* committed to the idea of
+supporting Python 3. They make no allowance for the *social* aspects of such a
+change when you're interacting with a user base that may not be especially
+tolerant of disruptions without a clear benefit, or are trying to persuade
+Python 2 focused upstream developers to accept patches that are solely about
+improving Python 3 forward compatibility.
+
+With the current porting toolset, *every* migration strategy will result in
+changes to *every* Unicode literal in a project. No exceptions. They will
+be converted to either an unprefixed string literal (if the project decides to
+adopt the ``unicode_literals`` import) or else to a converter call like
+``u("text")``.
+
+If the ``unicode_literals`` import approach is employed, but is not adopted
+across the entire project at the same time, then the meaning of a bare string
+literal may become annoyingly ambiguous. This problem can be particularly
+pernicious for *aggregated* software, like a Django site - in such a situation,
+some files may end up using the unicode literals import and others may not,
+creating definite potential for confusion.
+
+While these problems are clearly solvable at a technical level, they're a
+completely unnecessary distraction at the social level. Developer energy should
+be reserved for addressing *real* technical difficulties associated with the
+Python 3 transition (like distinguishing their 8-bit text strings from their
+binary data). They shouldn't be punished with additional code changes (even
+automated ones) solely due to the fact that they have *already* explicitly
+identified their Unicode strings in Python 2.
+
+Armin Ronacher has created an experimental extension to 2to3, available as
+``python-modernize`` [1]_, which only modernizes Python code to the extent
+that it runs on Python 2.7 or later with support from the cross-version
+compatibility library ``six``. Currently, the deltas generated by this tool will
+affect every Unicode literal in the converted source. This will create
+legitimate concerns amongst upstream developers asked to accept such changes.
+
+However, by eliminating the noise from changes to the Unicode literal syntax,
+many projects could be cleanly and (relatively) non-controversially made
+forward compatible with Python 3.3+ just by running ``python-modernize`` and
+applying the recommended changes.


 References
@@ -248,9 +357,12 @@ References

 .. [1] Python-Modernize
   (http://github.com/mitsuhiko/python-modernize)
-.. [2] Benchmark
-   (https://github.com/mitsuhiko/unicode-literals-pep/blob/master/timing.py)

+.. [2] Porting to Python 3: Migration Strategies
+   (http://python3porting.com/strategies.html)
+
+.. [3] Porting Python 2 Code to Python 3
+   (http://docs.python.org/howto/pyporting.html)

 Copyright
 =========