PEP: 510
Title: Specialized functions with guards
Version: $Revision$
Last-Modified: $Date$
Author: Victor Stinner
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 4-January-2016
Python-Version: 3.6


Abstract
========

Add an API to attach specialized functions with guards to functions, to
support static optimizers respecting the Python semantics.


Rationale
=========

Python semantics
----------------

Python is hard to optimize because almost everything is mutable: builtin
functions, function code, global variables, local variables, ... can be
modified at runtime. Implementing optimizations respecting the Python
semantics requires detecting when "something changes"; we will call these
checks "guards".

This PEP proposes to add a ``specialize()`` method to functions to attach
specialized functions with guards. When the function is called, the
specialized function is used if nothing changed; otherwise the original
bytecode is executed.

Writing an optimizer is out of the scope of this PEP.


Why not a JIT compiler?
-----------------------

There are multiple JIT compilers for Python actively developed:

* PyPy
* Pyston
* Numba
* Pyjion

Numba is specific to numerical computation. Pyston and Pyjion are still
young. PyPy is the most complete Python interpreter: it is much faster than
CPython and has very good compatibility with CPython (it respects the
Python semantics).

There are still issues with Python JIT compilers which prevent them from
being widely used instead of CPython.

Many popular libraries like numpy, PyGTK, PyQt, PySide and wxPython are
implemented in C or C++ and use the Python C API. To have a small memory
footprint and better performance, Python JIT compilers do not use reference
counting (in order to use a faster garbage collector), do not use the C
structures of CPython objects, and manage memory allocations differently.
PyPy has a ``cpyext`` module which emulates the Python C API, but it has
worse performance than CPython and does not support the full Python C API.

New features are first developed in CPython. In January 2016, the latest
CPython stable version is 3.5, whereas PyPy only supports Python 2.7 and
3.2, and Pyston only supports Python 2.7.

Even if PyPy has very good compatibility with Python, some modules are
still not compatible with PyPy: see the PyPy Compatibility Wiki. The
incomplete support of the Python C API is part of this problem. There are
also subtle differences between PyPy and CPython like reference counting:
object destructors are always called in PyPy, but can be called "later"
than in CPython. Using context managers helps to control when resources are
released.

Even if PyPy is much faster than CPython in a wide range of benchmarks,
some users still report worse performance than CPython on some specific use
cases, or unstable performance.

When Python is used as a scripting language for programs running less than
1 minute, JIT compilers can be slower because their startup time is higher
and the JIT compiler takes time to optimize the code. For example, most
Mercurial commands take a few seconds.

Numba now supports ahead-of-time compilation, but it requires a decorator
to specify argument types and it only supports numerical types.

CPython 3.5 has almost no optimization: the peephole optimizer only
implements basic optimizations. A static compiler is a compromise between
CPython 3.5 and PyPy.

.. note::
   There was also the Unladen Swallow project, but it was abandoned in 2011.
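To make the idea of a guard more concrete before the examples below, here
is a rough pure-Python emulation of the mechanism; the names and the manual
dispatch are purely illustrative and are not part of the proposed API::

    import builtins

    # Captured when the function is "specialized": the guard compares
    # against this exact object later.
    _expected_chr = builtins.chr

    def func():
        # Guard check: is the builtin chr() still the object we
        # specialized against? If yes, use the precomputed chr(65).
        if builtins.chr is _expected_chr:
            return "A"
        # Otherwise fall back to the original code, which preserves the
        # Python semantics.
        return chr(65)

    print(func())                      # A: specialized path taken
    builtins.chr = lambda obj: "mock"
    print(func())                      # mock: guard fails, original code runs

With the proposed API, an optimizer would instead attach the fast path with
``func.specialize()`` and a guard object, and the guard check and the
fallback to the original bytecode would be performed by the function call
itself.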
Example
=======

Using bytecode
--------------

Replace ``chr(65)`` with ``"A"``::

    import myoptimizer

    def func():
        return chr(65)

    def fast_func():
        return "A"

    func.specialize(fast_func.__code__, [myoptimizer.GuardBuiltins("chr")])
    del fast_func

    print("func(): %s" % func())
    print("#specialized: %s" % len(func.get_specialized()))
    print()

    import builtins
    builtins.chr = lambda obj: "mock"

    print("func(): %s" % func())
    print("#specialized: %s" % len(func.get_specialized()))

Output::

    func(): A
    #specialized: 1

    func(): mock
    #specialized: 0

The hypothetical ``myoptimizer.GuardBuiltins("chr")`` is a guard on the
builtin ``chr()`` function and the ``chr`` name in the global namespace.
The guard fails if the builtin function is replaced or if a ``chr`` name is
defined in the global namespace.

The first call directly returns the string ``"A"``. The second call removes
the specialized function because the builtin ``chr()`` function was
replaced, and executes the original bytecode.

On a microbenchmark, calling the specialized function takes 88 ns, whereas
the original bytecode takes 145 ns (+57 ns): 1.6 times as fast.


Using a builtin function
------------------------

Replace a slow Python function calling ``chr(obj)`` with a direct call to
the builtin ``chr()`` function::

    import myoptimizer

    def func(arg):
        return chr(arg)

    func.specialize(chr, [myoptimizer.GuardBuiltins("chr")])

    print("func(65): %s" % func(65))
    print("#specialized: %s" % len(func.get_specialized()))
    print()

    import builtins
    builtins.chr = lambda obj: "mock"

    print("func(65): %s" % func(65))
    print("#specialized: %s" % len(func.get_specialized()))

Output::

    func(65): A
    #specialized: 1

    func(65): mock
    #specialized: 0

The first call directly calls the builtin ``chr()`` function (without
creating a Python frame). The second call removes the specialized function
because the builtin ``chr()`` function was replaced, and executes the
original bytecode.

On a microbenchmark, calling the specialized function takes 95 ns, whereas
the original bytecode takes 155 ns (+60 ns): 1.6 times as fast. Calling
``chr(65)`` directly takes 76 ns.


Python Function Call
====================

Pseudo-code to call a Python function having specialized functions with
guards::

    def call_func(func, *args, **kwargs):
        # by default, call the regular bytecode
        code = func.__code__.co_code

        specialized = func.get_specialized()
        nspecialized = len(specialized)

        index = 0
        while index < nspecialized:
            guard = specialized[index].guard
            # pass arguments, some guards need them
            check = guard(args, kwargs)
            if check == 1:
                # guard succeeded: we can use the specialized function
                code = specialized[index].code
                break
            elif check == -1:
                # guard will always fail: remove the specialized function
                del specialized[index]
                nspecialized -= 1
            elif check == 0:
                # guard failed temporarily
                index += 1

        # code can be a code object or any callable object
        return execute_code(code, args, kwargs)


Changes
=======

* Add two new methods to functions:

  - ``specialize(code, guards: list)``: add a specialized function with
    guards. ``code`` is a code object (ex: ``func2.__code__``) or any
    callable object (ex: ``len``). The specialization can be ignored if a
    guard already fails.

  - ``get_specialized()``: get the list of specialized functions with
    guards.

* Add a base ``Guard`` type which can be used as parent type to implement
  guards. It requires implementing a ``check()`` function; a
  ``first_check()`` function is optional (a pure-Python sketch is given
  below). API:

  - ``int first_check(PyObject *guard, PyObject *func)``: return 0 on
    success, -1 if the guard will always fail

  - ``int check(PyObject *guard, PyObject **stack, int na, int nk)``:
    return 1 on success, 0 if the guard failed temporarily, -1 if the
    guard will always fail
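The Python-level spelling of guards is not specified by this PEP; the
following pure-Python sketch only illustrates the intended contract,
reusing the return codes of the C-level API above. The ``Guard`` base class
shown here and the method signatures are assumptions made for
illustration::

    import builtins

    class Guard:
        """Illustrative stand-in for the proposed base Guard type."""

        def first_check(self, func):
            # 0: the specialization can be attached; -1: reject it.
            return 0

        def check(self, args, kwargs):
            # 1: use the specialized code; 0: guard failed temporarily;
            # -1: guard will always fail, drop the specialization.
            raise NotImplementedError

    class GuardBuiltins(Guard):
        """Sketch of the hypothetical myoptimizer.GuardBuiltins guard
        used in the examples above."""

        def __init__(self, name):
            self.name = name
            self.expected = getattr(builtins, name)
            self.globals = {}

        def first_check(self, func):
            # Reject the specialization if the builtin is already
            # shadowed in the function's global namespace.
            self.globals = func.__globals__
            if self.name in self.globals:
                return -1
            return 0

        def check(self, args, kwargs):
            if (getattr(builtins, self.name, None) is not self.expected
                    or self.name in self.globals):
                # The builtin was replaced or shadowed: using the
                # specialized code would change the Python semantics.
                return -1
            return 1

In the proposed design these checks are performed in C during the function
call, so they add very little overhead; the sketch above is only meant to
show which conditions a guard like ``GuardBuiltins`` has to verify.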
Microbenchmark on ``python3.6 -m timeit -s 'def f(): pass' 'f()'`` (best of
3 runs):

* Original Python: 79 ns
* Patched Python: 79 ns

According to this microbenchmark, the changes have no overhead on calling a
Python function without specialization.


Behaviour
=========

When a function code is replaced (``func.__code__ = new_code``), all
specialized functions are removed.

When a function is serialized by ``pickle``, specialized functions and
guards are ignored (not serialized).

Specialized functions and guards are not stored in ``.pyc`` files; they are
created and registered at runtime, when a module is loaded.


Discussion
==========

Thread on the python-ideas mailing list: "RFC: PEP: Specialized functions
with guards".


Copyright
=========

This document has been placed in the public domain.