PEP: 510
Title: Specialized functions with guards
Version: $Revision$
Last-Modified: $Date$
Author: Victor Stinner <victor.stinner@gmail.com>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 4-January-2016
Python-Version: 3.6


Abstract
========

Add an API to attach specialized functions with guards to functions, to
support static optimizers that respect the Python semantics.


Rationale
=========

Python semantics
----------------

Python is hard to optimize because almost everything is mutable: builtin
functions, function code, global variables, local variables, ... can be
modified at runtime. Implementing optimizations that respect the Python
semantics requires detecting when "something changes". We will call these
checks "guards".

This PEP proposes to add a ``specialize()`` method to functions to add
specialized functions with guards. When the function is called, the
specialized function is used if nothing changed; otherwise, the
original bytecode is used.

Writing an optimizer is out of the scope of this PEP.


Why not a JIT compiler?
-----------------------

Multiple JIT compilers for Python are actively developed:

* `PyPy <http://pypy.org/>`_
* `Pyston <https://github.com/dropbox/pyston>`_
* `Numba <http://numba.pydata.org/>`_
* `Pyjion <https://github.com/microsoft/pyjion>`_

Numba is specific to numerical computation.  Pyston and Pyjion are still
young.  PyPy is the most complete Python interpreter: it is much faster
than CPython and has very good compatibility with CPython (it respects
the Python semantics). There are still issues with Python JIT compilers
which prevent them from being widely used instead of CPython.

Many popular libraries like numpy, PyGTK, PyQt, PySide and wxPython are
implemented in C or C++ and use the Python C API. To have a small memory
footprint and better performance, Python JIT compilers replace
reference counting with a faster garbage collector, do not use the C
structures of CPython objects and manage memory allocations differently.
PyPy has a ``cpyext`` module which emulates the Python C API, but it has
worse performance than CPython and does not support the full Python C
API.

New features are first developed in CPython. In January 2016, the
latest CPython stable version is 3.5, whereas PyPy only supports Python
2.7 and 3.2, and Pyston only supports Python 2.7.

Even if PyPy has very good compatibility with Python, some modules are
still not compatible with PyPy: see `PyPy Compatibility Wiki
<https://bitbucket.org/pypy/compatibility/wiki/Home>`_. The incomplete
support of the Python C API is part of this problem. There are also
subtle differences between PyPy and CPython like reference counting:
object destructors are always called in PyPy, but can be called "later"
than in CPython. Using context managers helps to control when resources
are released.

Even if PyPy is much faster than CPython in a wide range of benchmarks,
some users still report worse performance than CPython on some specific
use cases, or unstable performance.

When Python is used as a scripting language for programs running less
than 1 minute, JIT compilers can be slower because their startup time is
higher and the JIT compiler takes time to optimize the code. For
example, most Mercurial commands take a few seconds.

Numba now supports ahead-of-time compilation, but it requires a decorator
to specify argument types and it only supports numerical types.

CPython 3.5 has almost no optimization: the peephole optimizer only
implements basic optimizations. A static compiler is a compromise
between CPython 3.5 and PyPy.

.. note::
   There was also the Unladen Swallow project, but it was abandoned in
   2011.


Example
=======

Using bytecode
--------------

Replace ``chr(65)`` with ``"A"``::

    import myoptimizer

    def func():
        return chr(65)

    def fast_func():
        return "A"

    func.specialize(fast_func.__code__, [myoptimizer.GuardBuiltins("chr")])
    del fast_func

    print("func(): %s" % func())
    print("#specialized: %s" % len(func.get_specialized()))
    print()

    import builtins
    builtins.chr = lambda obj: "mock"

    print("func(): %s" % func())
    print("#specialized: %s" % len(func.get_specialized()))

Output::

    func(): A
    #specialized: 1

    func(): mock
    #specialized: 0

The hypothetical ``myoptimizer.GuardBuiltins("chr")`` is a guard on the
builtin ``chr()`` function and the ``chr`` name in the global namespace.
The guard fails if the builtin function is replaced or if a ``chr`` name
is defined in the global namespace.
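
The behaviour described above can be sketched in pure Python. This is only
an illustration, not the API of any real ``myoptimizer`` module: the
``func_globals`` parameter and the argument-less ``check()`` method are
assumptions made to keep the example self-contained. The return values
follow the convention used later in this PEP (1 on success, -1 if the
guard will always fail).

```python
import builtins


class GuardBuiltins:
    """Hypothetical guard: succeeds while a builtin name is unchanged."""

    def __init__(self, name, func_globals):
        self.name = name
        # Assumption: the globals of the guarded function are passed in.
        self.func_globals = func_globals
        # Remember the builtin object seen when the guard was created.
        self.expected = getattr(builtins, name)

    def check(self):
        # A global with the same name shadows the builtin.
        if self.name in self.func_globals:
            return -1
        # The builtin itself was replaced.
        if getattr(builtins, self.name, None) is not self.expected:
            return -1
        return 1


namespace = {}  # stands in for the globals of the optimized function
guard = GuardBuiltins("chr", namespace)
status_before = guard.check()

# Defining a global named "chr" invalidates the guard for good.
namespace["chr"] = lambda obj: "mock"
status_after = guard.check()
```

Both failure cases return -1 rather than 0 because, once the name has been
overridden, the optimizer can no longer assume that the specialized code is
equivalent to the original one.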

The first call directly returns the string ``"A"``. The second call
removes the specialized function because the builtin ``chr()`` function
was replaced, and executes the original bytecode.

On a microbenchmark, calling the specialized function takes 88 ns,
whereas the original bytecode takes 145 ns (+57 ns): 1.6 times as fast.
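
A rough way to get a feeling for the difference is to time the two plain
functions with ``timeit``. This sketch is an approximation made under a
clear assumption: it runs on an unpatched Python, so it does not include
the cost of checking guards and will not reproduce the exact numbers above.

```python
import timeit


def func():
    return chr(65)


def fast_func():
    return "A"


number = 100_000
# Best of 3 runs, converted to nanoseconds per call.
original_ns = min(timeit.repeat(func, number=number, repeat=3)) / number * 1e9
specialized_ns = min(timeit.repeat(fast_func, number=number, repeat=3)) / number * 1e9

print("original:    %.0f ns" % original_ns)
print("specialized: %.0f ns" % specialized_ns)
```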


Using builtin function
----------------------

Replace a slow Python function calling ``chr(obj)`` with a direct call
to the builtin ``chr()`` function::

    import myoptimizer

    def func(arg):
        return chr(arg)

    func.specialize(chr, [myoptimizer.GuardBuiltins("chr")])

    print("func(65): %s" % func(65))
    print("#specialized: %s" % len(func.get_specialized()))
    print()

    import builtins
    builtins.chr = lambda obj: "mock"

    print("func(65): %s" % func(65))
    print("#specialized: %s" % len(func.get_specialized()))

Output::

    func(65): A
    #specialized: 1

    func(65): mock
    #specialized: 0

The first call directly calls the builtin ``chr()`` function (without
creating a Python frame). The second call removes the specialized
function because the builtin ``chr()`` function was replaced, and
executes the original bytecode.

On a microbenchmark, calling the specialized function takes 95 ns,
whereas the original bytecode takes 155 ns (+60 ns): 1.6 times as fast.
Calling ``chr(65)`` directly takes 76 ns.


Python Function Call
====================

Pseudo-code to call a Python function having specialized functions with
guards::

    def call_func(func, *args, **kwargs):
        # by default, call the regular bytecode
        code = func.__code__
        specialized = func.get_specialized()
        nspecialized = len(specialized)

        index = 0
        while index < nspecialized:
            guard = specialized[index].guard
            # pass arguments, some guards need them
            check = guard(args, kwargs)
            if check == 1:
                # guard succeeded: we can use the specialized function
                code = specialized[index].code
                break
            elif check == -1:
                # guard will always fail: remove the specialized function
                del specialized[index]
                nspecialized -= 1
            elif check == 0:
                # guard failed temporarily
                index += 1

        # code can be a code object or any callable object
        return execute_code(code, args, kwargs)
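
The dispatch logic above can be emulated in pure Python to experiment with
it. The following is a minimal sketch, not the proposed implementation: the
``SpecializedFunction`` wrapper class is hypothetical, guards are assumed to
be plain callables taking ``(args, kwargs)``, specialized code is assumed to
be a callable, and when several guards are attached the worst result is
assumed to decide.

```python
class SpecializedFunction:
    """Hypothetical pure-Python stand-in for a specializable function."""

    def __init__(self, func):
        self.func = func
        self.specialized = []  # list of (callable, guards) pairs

    def specialize(self, code, guards):
        self.specialized.append((code, list(guards)))

    def __call__(self, *args, **kwargs):
        index = 0
        while index < len(self.specialized):
            code, guards = self.specialized[index]
            # Worst result wins: -1 (always fails) < 0 (fails now) < 1 (ok).
            check = min((guard(args, kwargs) for guard in guards), default=1)
            if check == 1:
                return code(*args, **kwargs)
            if check == -1:
                # Guard will always fail: drop this specialization.
                del self.specialized[index]
            else:
                # Guard failed temporarily: try the next specialization.
                index += 1
        # No usable specialization: fall back to the original function.
        return self.func(*args, **kwargs)


def only_ints(args, kwargs):
    # Temporary failure (0) when an argument is not an int.
    return 1 if all(isinstance(arg, int) for arg in args) else 0


state = {"valid": True}


def still_valid(args, kwargs):
    # Permanent failure (-1) once the assumption is invalidated.
    return 1 if state["valid"] else -1


def add(a, b):
    return a + b


fast_calls = []


def fast_add(a, b):
    fast_calls.append((a, b))
    return a + b


func = SpecializedFunction(add)
func.specialize(fast_add, [only_ints, still_valid])

r1 = func(1, 2)        # both guards succeed: fast_add() is used
r2 = func("a", "b")    # only_ints fails temporarily: add() is used
kept = len(func.specialized)
state["valid"] = False
r3 = func(1, 2)        # still_valid fails forever: specialization removed
remaining = len(func.specialized)
```

A temporary failure keeps the specialization for later calls, whereas a
permanent failure removes it so that its guards are never checked again.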


Changes
=======

* Add two new methods to functions:

  - ``specialize(code, guards: list)``: add a specialized
    function with guards. ``code`` is a code object (e.g.
    ``func2.__code__``) or any callable object (e.g. ``len``).
    The specialization can be ignored if a guard already fails.
  - ``get_specialized()``: get the list of specialized functions with
    guards

* Base ``Guard`` type which can be used as a parent type to implement
  guards. A subclass must implement a ``check()`` function and may
  implement an optional ``first_check()`` function. API:

  * ``int first_check(PyObject *guard, PyObject *func)``: return 0 on
    success, -1 if the guard will always fail
  * ``int check(PyObject *guard, PyObject **stack, int na, int nk)``:
    return 1 on success, 0 if the guard failed temporarily, -1 if the
    guard will always fail
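
The return-value protocol of the two functions can be illustrated with a
Python analogue of the C API. This is a sketch under assumptions: the
Python-level ``Guard`` class, its method signatures and the ``GuardArgType``
subclass are hypothetical and only mirror the return codes listed above.

```python
class Guard:
    """Hypothetical Python analogue of the base ``Guard`` type."""

    def first_check(self, func):
        # 0: the guard can be attached; -1: it will always fail.
        return 0

    def check(self, args, kwargs):
        # 1: success; 0: temporary failure; -1: permanent failure.
        raise NotImplementedError


class GuardArgType(Guard):
    """Hypothetical guard on the exact type of the first argument."""

    def __init__(self, arg_type):
        self.arg_type = arg_type

    def first_check(self, func):
        # A function without positional parameters can never match.
        return 0 if func.__code__.co_argcount >= 1 else -1

    def check(self, args, kwargs):
        if args and type(args[0]) is self.arg_type:
            return 1
        # Another call may pass a matching argument: fail only for now.
        return 0


def double(x):
    return x * 2


def no_args():
    return None


guard = GuardArgType(int)
results = (
    guard.first_check(double),   # 0: guard accepted
    guard.first_check(no_args),  # -1: guard can never succeed
    guard.check((3,), {}),       # 1: argument is an int
    guard.check(("3",), {}),     # 0: wrong type for this call only
)
```

Separating ``first_check()`` from ``check()`` lets an implementation reject
a useless specialization once, when it is registered, instead of at every
call.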

Microbenchmark on ``python3.6 -m timeit -s 'def f(): pass' 'f()'`` (best
of 3 runs):

* Original Python: 79 ns
* Patched Python: 79 ns

According to this microbenchmark, the changes have no overhead on calling
a Python function without specialization.


Behaviour
=========

When the code of a function is replaced (``func.__code__ = new_code``),
all specialized functions are removed.
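
This rule can be modelled with a property setter. The sketch below is only
an emulation under assumptions: ``FunctionWithSpecializations`` is a
hypothetical wrapper class, whereas the PEP applies the rule to function
objects themselves.

```python
class FunctionWithSpecializations:
    """Hypothetical wrapper emulating the ``__code__`` replacement rule."""

    def __init__(self, func):
        self._func = func
        self._specialized = []

    def specialize(self, code, guards):
        self._specialized.append((code, list(guards)))

    def get_specialized(self):
        return list(self._specialized)

    @property
    def __code__(self):
        return self._func.__code__

    @__code__.setter
    def __code__(self, new_code):
        self._func.__code__ = new_code
        # Specialized code was derived from the old code: discard it.
        self._specialized.clear()


def func():
    return chr(65)


def other():
    return "B"


wrapped = FunctionWithSpecializations(func)
wrapped.specialize(lambda: "A", [])
before = len(wrapped.get_specialized())

wrapped.__code__ = other.__code__
after = len(wrapped.get_specialized())
result = func()  # the underlying function now runs the new code
```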

When a function is serialized (by ``marshal`` or ``pickle`` for
example), specialized functions and guards are ignored (not serialized).


Discussion
==========

Thread on the python-ideas mailing list: `RFC: PEP: Specialized
functions with guards
<https://mail.python.org/pipermail/python-ideas/2016-January/037703.html>`_.


Copyright
=========

This document has been placed in the public domain.