PEP: 414
Title: Explicit Unicode Literal for Python 3.3
Version: $Revision$
Last-Modified: $Date$
Author: Armin Ronacher <armin.ronacher@active-4.com>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 15-Feb-2012


Abstract
========

This document proposes the reintegration of an explicit unicode literal
from Python 2.x to the Python 3.x language specification, in order to
enable side-by-side support of libraries for both Python 2 and Python 3
without the need for an explicit 2to3 run.


Rationale and Goals
===================

Python 3 is a major new revision of the language, and it was decided very
early on that breaking backwards compatibility was part of the design. The
migration from a Python 2.x to a Python 3 codebase is to be accomplished
with the aid of a separate translation tool that converts the Python 2.x
source code to Python 3 syntax.  With more and more libraries supporting
Python 3, however, it has become clear that 2to3 as a tool is
insufficient, and people are now attempting to find ways to make the same
source work in both Python 2.x and Python 3.x, with varying levels of
success.

Python 2.6 and Python 2.7 support syntax features from Python 3 which for
the most part make a unified code base possible.  Many thought that the
``unicode_literals`` future import might make a common source possible,
but it turns out that it's doing more harm than good.

With the design of the updated WSGI specification a few new terms for
strings were loosely defined: unicode strings, byte strings and native
strings.  In Python 3 the native string type is unicode; in Python 2 the
native string type is a bytestring.  These native string types are used in
a couple of places.  The native string type can be interned and is
preferably used for identifier names, filenames, source code and a few
other low level interpreter operations such as the return value of a
``__repr__`` or exception messages.

In Python 2.7 these string types can be defined explicitly.  Without any
future imports ``b'foo'`` means bytestring, ``u'foo'`` declares a unicode
string and ``'foo'`` a native string which in Python 2.x means bytes.
With the ``unicode_literals`` import the native string type is no longer
available and has to be incorrectly labeled as bytestring.  If such a
codebase is then used in Python 3, the interpreter will start using byte
objects in places where they are no longer accepted (such as identifiers).
This can be solved by a module that detects 2.x and 3.x and provides
wrapper functions that transcode literals at runtime.  Unfortunately, this
slows down runtime performance and makes for less beautiful code.
Considering that Python 2 and Python 3
support for most libraries will have to continue side by side for several
more years to come, this means that such modules lose one of Python's key
properties: easily readable and understandable code.
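
A rough sketch of such a compatibility module follows.  The helper name
``native()`` is hypothetical and not part of any library, and the sketch
assumes that, on Python 2, the calling module uses the
``unicode_literals`` future import, so that every bare literal arrives as
a unicode object::

    import sys

    PY2 = sys.version_info[0] == 2

    def native(literal):
        """Return literal as the interpreter's native str type."""
        if PY2:
            # Python 2: the native type is bytes, so the literal has to
            # be encoded again on every call.
            return literal.encode('ascii')
        # Python 3: the native type is already unicode.
        return literal

    class Config(object):
        pass

    # The wrapper is needed wherever a native string is expected,
    # for example for attribute names.
    setattr(Config, native('debug'), True)

Every call site pays for the extra function call, and on Python 2 for the
encoding step as well.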

Additionally, the vast majority of people who maintain Python 2.x
codebases are more familiar with Python 2.x semantics, and a per-file
difference in literal meanings will be very annoying for them in the long
run.  A quick poll on Twitter about the use of the division future import
supported my suspicions that people opt out of behaviour-changing future
imports because they are a maintenance burden.  Every time you review code
you have to check the top of the file to see if the behaviour was changed.
Obviously that was an unscientific informal poll, but it might be
something worth considering.

Proposed Solution
=================

The idea is to support (with Python 3.3) an explicit ``u`` and ``U``
prefix for native strings in addition to the prefix-less variants.  These
would stick around for the entirety of the Python 3 lifetime but might at
some point yield deprecation warnings if deemed appropriate.  This could
be something for pyflakes or other similar libraries to support.
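
As a minimal sketch of the intended behaviour, assuming an interpreter
that implements this proposal (Python 3.3 or later), prefixed and
prefix-less literals would denote the same type and the same value::

    text = u'caf\xe9'       # explicit unicode literal, lowercase prefix
    shouted = U'caf\xe9'    # uppercase prefix
    plain = 'caf\xe9'       # prefix-less literal

    # All three spellings produce an ordinary str object.
    assert type(text) is str
    assert type(shouted) is str
    assert text == shouted == plain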

Python 3.2 and earlier
======================

An argument against this proposal was made on the Python-Dev mailing list,
mentioning that Ubuntu LTS will ship Python 3.2 and 2.7 for only 5 years.
The counterargument is that Python 2.7 is currently the Python version of
choice for users who want LTS support.  As it stands, Python 3 is
currently a bad choice for long-term investments, since the ecosystem is
not yet properly developed, and libraries are still fighting with their
API decisions for Python 3.

A valid point is that this would encourage people to become dependent on
Python 3.3 for their ports.  Fortunately that is not a big problem, since
it could be fixed at installation time, similar to how many projects
currently invoke 2to3 as part of their installation process.

For Python 3.1 and Python 3.2 (even 3.0 if necessary) a simple
on-installation hook could be provided that tokenizes all source files and
strips away the otherwise unnecessary ``u`` prefix at installation time.
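
Such a hook could be sketched roughly as follows.  The function name
``strip_u_prefixes()`` is made up for illustration, and the sketch assumes
a ``tokenize`` module that reports a prefixed literal as a single
``STRING`` token, as Python 3.3 and later do; a hook meant to run on 3.1
or 3.2 would likely need its own handling for the prefix::

    import io
    import tokenize

    def strip_u_prefixes(source):
        """Return source with the u/U prefix removed from string literals."""
        lines = source.splitlines(keepends=True)
        positions = []
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            if tok.type == tokenize.STRING and tok.string[:1] in ('u', 'U'):
                # Remember where the prefix character sits (row is 1-based).
                positions.append(tok.start)
        # Delete from the end so that earlier column offsets stay valid.
        for row, col in reversed(positions):
            line = lines[row - 1]
            lines[row - 1] = line[:col] + line[col + 1:]
        return ''.join(lines)

    print(strip_u_prefixes("name = u'value'\n"))

Working on tokens rather than on raw text means that a ``u`` inside a
string body or an identifier is left alone.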

Who Benefits?
=============

There are a couple of places where decisions have to be made for or
against unicode support almost arbitrarily.  This is mostly the case for
protocols that do not support unicode all the way down, or hide it behind
transport encodings that might or might not be unicode themselves.  HTTP,
Email and WSGI are good examples of that.  For certain ambiguous cases it
would be possible to apply the unicode logic that Python 3 uses to the
Python 2 versions of the library as well.  If those details were exposed
to the user of the API, however, it would mean breaking compatibility for
existing users of the Python 2 API, which is a no-go in many situations.
The automatic upgrading of binary strings to unicode
strings that would be enabled by this proposal would make it much easier
to port such libraries over.

Not only the libraries but also the users of these APIs would benefit from
that.  For instance, the urllib module in Python 2 is using byte strings,
and the one in Python 3 is using unicode strings.  By leveraging a native
string, users can avoid having to adjust for that.

Problems with 2to3
==================

In practice 2to3 currently suffers from a few problems which make it
unnecessarily difficult and/or unpleasant to use:

-   Bad overall performance.  In many cases 2to3 runs one or two orders of
    magnitude slower than the test suite for the library or application
    it's testing.
-   Slightly different behaviour in 2to3 between different versions of
    Python causes different outcomes when paired with custom fixers.
-   Line numbers from error messages do not match up with the real source
    lines due to added/rewritten imports.
-   Extending 2to3 with custom fixers is nontrivial without using
    distribute.

By default 2to3 works acceptably well for upgrading byte-based APIs to
unicode-based APIs, but it fails to upgrade APIs which already support
unicode to Python 3::

        --- test.py (original)
        +++ test.py (refactored)
        @@ -1,5 +1,5 @@
         class Foo(object):
             def __unicode__(self):
        -        return u'test'
        +        return 'test'
             def __str__(self):
        -        return unicode(self).encode('utf-8')
        +        return str(self).encode('utf-8')
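
For comparison, a hand-written port of the same class that keeps the
``u`` prefix might look like the sketch below.  It assumes an interpreter
that accepts the prefix (Python 2.x, or Python 3.3 or later under this
proposal), and the version check is only one possible way to arrange the
methods::

    import sys

    class Foo(object):
        def __unicode__(self):
            return u'test'

        if sys.version_info[0] >= 3:
            # Python 3: str is the unicode type, so reuse the method.
            __str__ = __unicode__
        else:
            def __str__(self):
                # Python 2: str() has to return encoded bytes.
                return self.__unicode__().encode('utf-8')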


APIs and Concepts Using Native Strings
======================================

The following is an incomplete list of APIs and general concepts that use
native strings and need implicit upgrading to unicode in Python 3, and
which would directly benefit from this support:

-   Python identifiers (dict keys, class names, module names, import
    paths)
-   URLs for the most part as well as HTTP headers in urllib/http servers
-   WSGI environment keys and CGI-inherited values
-   Python source code for dynamic compilation and AST hacks
-   Exception messages
-   ``__repr__`` return value
-   Preferred filesystem paths
-   Preferred OS environment


Modernizing Code
================

The 2to3 tool can be easily adjusted to generate code that runs on both
Python 2 and Python 3.  An experimental extension to 2to3 which only
modernizes Python code to the extent that it runs on Python 2.7 or later
with support for the ``six`` library is available as python-modernize
[1]_.  For most cases the runtime impact of ``six`` can be neglected (like
a function that calls ``iteritems()`` on a passed dictionary under 2.x or
``items()`` under 3.x), but making strings cheap on both 2.x and 3.x is
nearly impossible.  The way it currently works is by abusing the
``unicode-escape`` codec on Python 2.x native strings.  This is especially
ugly if such a string literal is used in a tight loop.

This proposal would fix this.  The modernize module could easily be
adjusted to simply not translate unicode strings, and the runtime overhead
would disappear.

Possible Downsides
==================

The obvious downside for this is that potential Python 3 users would have
to be aware of the fact that ``u`` is an optional prefix for strings.
This is something that Python 3 in general tried to avoid; for example,
the second inequality comparison operator (``<>``) and the ``L`` prefix
for long integers were removed.  This PEP proposes a slight reversal of
that practice by reintroducing redundant syntax.  On the other hand,
Python already has
multiple literals for strings with mostly the same behavior (single
quoted, double quoted, single triple quoted, double triple quoted).

Runtime Overhead of Wrappers
============================

I did some basic timings on the performance of a ``u()`` wrapper function
as used by the ``six`` library.  The implementation of ``u()`` is as
follows::

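    import sys

    if sys.version_info >= (3, 0):
        def u(value):
            # Python 3: literals are already unicode, nothing to do.
            return value
    else:
        def u(value):
            # Python 2: decode the native (byte) literal at runtime.
            return unicode(value, 'unicode-escape')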

The intention is that ``u'foo'`` can be turned to ``u('foo')`` and that on
Python 2.x an implicit decoding happens.  In this case the wrapper will
have a decoding overhead for Python 2.x.  I did some basic timings [2]_ to
see how bad the performance loss would be.  The following examples measure
the execution time over 10000 iterations::

    u'\N{SNOWMAN}barbaz'            1000 loops, best of 3: 295 usec per loop
    u('\N{SNOWMAN}barbaz')          10 loops, best of 3: 18.5 msec per loop
    u'foobarbaz_%d' % x             100 loops, best of 3: 8.32 msec per loop
    u('foobarbaz_%d') % x           10 loops, best of 3: 25.6 msec per loop
    u'fööbarbaz'                    1000 loops, best of 3: 289 usec per loop
    u('fööbarbaz')                  100 loops, best of 3: 15.1 msec per loop
    u'foobarbaz'                    1000 loops, best of 3: 294 usec per loop
    u('foobarbaz')                  100 loops, best of 3: 14.3 msec per loop

The overhead of the wrapper function in Python 3 is the price of a
function call since the function only has to return the argument
unchanged.
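
A comparable measurement for Python 3 can be sketched with the standard
``timeit`` module.  This is not the benchmark script referenced above; the
statements and the iteration count are illustrative assumptions, it needs
an interpreter that accepts the ``u`` prefix, and the absolute numbers
depend on the machine::

    import timeit

    def u(value):
        # Python 3 branch of the wrapper: return the argument unchanged.
        return value

    # Time 10000 evaluations of each expression and keep the best of 3 runs.
    literal_time = min(timeit.repeat("u'foobarbaz'",
                                     number=10000, repeat=3))
    wrapper_time = min(timeit.repeat("u('foobarbaz')", globals={'u': u},
                                     number=10000, repeat=3))

    print('literal: %.6f s, wrapper: %.6f s' % (literal_time, wrapper_time))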


References
==========

.. [1] Python-Modernize
   (http://github.com/mitsuhiko/python-modernize)
.. [2] Benchmark
   (https://github.com/mitsuhiko/unicode-literals-pep/blob/master/timing.py)


Copyright
=========

This document has been placed in the public domain.


..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   End: