380 lines
15 KiB
ReStructuredText
380 lines
15 KiB
ReStructuredText
PEP: 659
|
|
Title: Specializing Adaptive Interpreter
|
|
Author: Mark Shannon <mark@hotpy.org>
|
|
Status: Final
|
|
Type: Informational
|
|
Created: 13-Apr-2021
|
|
Post-History: 11-May-2021
|
|
|
|
.. canonical-doc:: :ref:`Specializing Adaptive Interpreter <whatsnew311-pep659>`
|
|
|
|
|
|
Abstract
|
|
========
|
|
|
|
In order to perform well, virtual machines for dynamic languages must
|
|
specialize the code that they execute to the types and values in the
|
|
program being run. This specialization is often associated with "JIT"
|
|
compilers, but is beneficial even without machine code generation.
|
|
|
|
A specializing, adaptive interpreter is one that speculatively specializes
|
|
on the types or values it is currently operating on, and adapts to changes
|
|
in those types and values.
|
|
|
|
Specialization gives us improved performance, and adaptation allows the
|
|
interpreter to rapidly change when the pattern of usage in a program alters,
|
|
limiting the amount of additional work caused by mis-specialization.
|
|
|
|
This PEP proposes using a specializing, adaptive interpreter that specializes
|
|
code aggressively, but over a very small region, and is able to adjust to
|
|
mis-specialization rapidly and at low cost.
|
|
|
|
Adding a specializing, adaptive interpreter to CPython will bring significant
|
|
performance improvements. It is hard to come up with meaningful numbers,
|
|
as it depends very much on the benchmarks and on work that has not yet happened.
|
|
Extensive experimentation suggests speedups of up to 50%.
|
|
Even if the speedup were only 25%, this would still be a worthwhile enhancement.
|
|
|
|
Motivation
|
|
==========
|
|
|
|
Python is widely acknowledged as slow.
|
|
Whilst Python will never attain the performance of low-level languages like C,
|
|
Fortran, or even Java, we would like it to be competitive with fast
|
|
implementations of scripting languages, like V8 for Javascript or luajit for
|
|
lua.
|
|
Specifically, we want to achieve these performance goals with CPython to
|
|
benefit all users of Python including those unable to use PyPy or
|
|
other alternative virtual machines.
|
|
|
|
Achieving these performance goals is a long way off, and will require a lot of
|
|
engineering effort, but we can make a significant step towards those goals by
|
|
speeding up the interpreter.
|
|
Both academic research and practical implementations have shown that a fast
|
|
interpreter is a key part of a fast virtual machine.
|
|
|
|
Typical optimizations for virtual machines are expensive, so a long "warm up"
|
|
time is required to gain confidence that the cost of optimization is justified.
|
|
In order to get speed-ups rapidly, without noticeable warmup times,
|
|
the VM should speculate that specialization is justified even after a few
|
|
executions of a function. To do that effectively, the interpreter must be able
|
|
to optimize and de-optimize continually and very cheaply.
|
|
|
|
By using adaptive and speculative specialization at the granularity of
|
|
individual virtual machine instructions,
|
|
we get a faster interpreter that also generates profiling information
|
|
for more sophisticated optimizations in the future.
|
|
|
|
Rationale
|
|
=========
|
|
|
|
There are many practical ways to speed-up a virtual machine for a dynamic
|
|
language.
|
|
However, specialization is the most important, both in itself and as an
|
|
enabler of other optimizations.
|
|
Therefore it makes sense to focus our efforts on specialization first,
|
|
if we want to improve the performance of CPython.
|
|
|
|
Specialization is typically done in the context of a JIT compiler,
|
|
but research shows specialization in an interpreter can boost performance
|
|
significantly, even outperforming a naive compiler [1]_.
|
|
|
|
There have been several ways of doing this proposed in the academic
|
|
literature, but most attempt to optimize regions larger than a
|
|
single bytecode [1]_ [2]_.
|
|
Using larger regions than a single instruction requires code to handle
|
|
de-optimization in the middle of a region.
|
|
Specialization at the level of individual bytecodes makes de-optimization
|
|
trivial, as it cannot occur in the middle of a region.
|
|
|
|
By speculatively specializing individual bytecodes, we can gain significant
|
|
performance improvements without anything but the most local,
|
|
and trivial to implement, de-optimizations.
|
|
|
|
The closest approach to this PEP in the literature is
|
|
"Inline Caching meets Quickening" [3]_.
|
|
This PEP has the advantages of inline caching,
|
|
but adds the ability to quickly de-optimize making the performance
|
|
more robust in cases where specialization fails or is not stable.
|
|
|
|
Performance
|
|
-----------
|
|
|
|
The speedup from specialization is hard to determine, as many specializations
|
|
depend on other optimizations. Speedups seem to be in the range 10% - 60%.
|
|
|
|
* Most of the speedup comes directly from specialization. The largest
|
|
contributors are speedups to attribute lookup, global variables, and calls.
|
|
* A small, but useful, fraction is from improved dispatch such as
|
|
super-instructions and other optimizations enabled by quickening.
|
|
|
|
Implementation
|
|
==============
|
|
|
|
Overview
|
|
--------
|
|
|
|
Any instruction that would benefit from specialization will be replaced by an
|
|
"adaptive" form of that instruction. When executed, the adaptive instructions
|
|
will specialize themselves in response to the types and values that they see.
|
|
This process is known as "quickening".
|
|
|
|
Once an instruction in a code object has executed enough times,
|
|
that instruction will be "specialized" by replacing it with a new instruction
|
|
that is expected to execute faster for that operation.
|
|
|
|
Quickening
|
|
----------
|
|
|
|
Quickening is the process of replacing slow instructions with faster variants.
|
|
|
|
Quickened code has a number of advantages over immutable bytecode:
|
|
|
|
* It can be changed at runtime.
|
|
* It can use super-instructions that span lines and take multiple operands.
|
|
* It does not need to handle tracing as it can fallback to the original
|
|
bytecode for that.
|
|
|
|
In order that tracing can be supported, the quickened instruction format
|
|
should match the immutable, user visible, bytecode format:
|
|
16-bit instructions of 8-bit opcode followed by 8-bit operand.
|
|
|
|
Adaptive instructions
|
|
---------------------
|
|
|
|
Each instruction that would benefit from specialization is replaced by an
|
|
adaptive version during quickening. For example,
|
|
the ``LOAD_ATTR`` instruction would be replaced with ``LOAD_ATTR_ADAPTIVE``.
|
|
|
|
Each adaptive instruction periodically attempts to specialize itself.
|
|
|
|
Specialization
|
|
--------------
|
|
|
|
CPython bytecode contains many instructions that represent high-level
|
|
operations, and would benefit from specialization. Examples include ``CALL``,
|
|
``LOAD_ATTR``, ``LOAD_GLOBAL`` and ``BINARY_ADD``.
|
|
|
|
By introducing a "family" of specialized instructions for each of these
|
|
instructions allows effective specialization,
|
|
since each new instruction is specialized to a single task.
|
|
Each family will include an "adaptive" instruction, that maintains a counter
|
|
and attempts to specialize itself when that counter reaches zero.
|
|
|
|
Each family will also include one or more specialized instructions that
|
|
perform the equivalent of the generic operation much faster provided their
|
|
inputs are as expected.
|
|
Each specialized instruction will maintain a saturating counter which will
|
|
be incremented whenever the inputs are as expected. Should the inputs not
|
|
be as expected, the counter will be decremented and the generic operation
|
|
will be performed.
|
|
If the counter reaches the minimum value, the instruction is de-optimized by
|
|
simply replacing its opcode with the adaptive version.
|
|
|
|
Ancillary data
|
|
--------------
|
|
|
|
Most families of specialized instructions will require more information than
|
|
can fit in an 8-bit operand. To do this, a number of 16 bit entries immediately
|
|
following the instruction are used to store this data. This is a form of inline
|
|
cache, an "inline data cache". Unspecialized, or adaptive, instructions will
|
|
use the first entry of this cache as a counter, and simply skip over the others.
|
|
|
|
Example families of instructions
|
|
--------------------------------
|
|
|
|
LOAD_ATTR
|
|
'''''''''
|
|
|
|
The ``LOAD_ATTR`` instruction loads the named attribute of the object on top of the stack,
|
|
then replaces the object on top of the stack with the attribute.
|
|
|
|
This is an obvious candidate for specialization. Attributes might belong to
|
|
a normal instance, a class, a module, or one of many other special cases.
|
|
|
|
``LOAD_ATTR`` would initially be quickened to ``LOAD_ATTR_ADAPTIVE`` which
|
|
would track how often it is executed, and call the ``_Py_Specialize_LoadAttr``
|
|
internal function when executed enough times, or jump to the original
|
|
``LOAD_ATTR`` instruction to perform the load. When optimizing, the kind
|
|
of the attribute would be examined, and if a suitable specialized instruction
|
|
was found, it would replace ``LOAD_ATTR_ADAPTIVE`` in place.
|
|
|
|
Specialization for ``LOAD_ATTR`` might include:
|
|
|
|
* ``LOAD_ATTR_INSTANCE_VALUE`` A common case where the attribute is stored in
|
|
the object's value array, and not shadowed by an overriding descriptor.
|
|
* ``LOAD_ATTR_MODULE`` Load an attribute from a module.
|
|
* ``LOAD_ATTR_SLOT`` Load an attribute from an object whose
|
|
class defines ``__slots__``.
|
|
|
|
Note how this allows optimizations that complement other optimizations.
|
|
The ``LOAD_ATTR_INSTANCE_VALUE`` works well with the "lazy dictionary" used for
|
|
many objects.
|
|
|
|
LOAD_GLOBAL
|
|
'''''''''''
|
|
|
|
The ``LOAD_GLOBAL`` instruction looks up a name in the global namespace
|
|
and then, if not present in the global namespace,
|
|
looks it up in the builtins namespace.
|
|
In 3.9 the C code for the ``LOAD_GLOBAL`` includes code to check to see
|
|
whether the whole code object should be modified to add a cache,
|
|
whether either the global or builtins namespace,
|
|
code to lookup the value in a cache, and fallback code.
|
|
This makes it complicated and bulky.
|
|
It also performs many redundant operations even when supposedly optimized.
|
|
|
|
Using a family of instructions makes the code more maintainable and faster,
|
|
as each instruction only needs to handle one concern.
|
|
|
|
Specializations would include:
|
|
|
|
* ``LOAD_GLOBAL_ADAPTIVE`` would operate like ``LOAD_ATTR_ADAPTIVE`` above.
|
|
* ``LOAD_GLOBAL_MODULE`` can be specialized for the case where the value is in
|
|
the globals namespace. After checking that the keys of the namespace have
|
|
not changed, it can load the value from the stored index.
|
|
* ``LOAD_GLOBAL_BUILTIN`` can be specialized for the case where the value is
|
|
in the builtins namespace. It needs to check that the keys of the global
|
|
namespace have not been added to, and that the builtins namespace has not
|
|
changed. Note that we don't care if the values of the global namespace
|
|
have changed, just the keys.
|
|
|
|
See [4]_ for a full implementation.
|
|
|
|
.. note::
|
|
|
|
This PEP outlines the mechanisms for managing specialization, and does not
|
|
specify the particular optimizations to be applied.
|
|
It is likely that details, or even the entire implementation, may change
|
|
as the code is further developed.
|
|
|
|
Compatibility
|
|
=============
|
|
|
|
There will be no change to the language, library or API.
|
|
|
|
The only way that users will be able to detect the presence of the new
|
|
interpreter is through timing execution, the use of debugging tools,
|
|
or measuring memory use.
|
|
|
|
Costs
|
|
=====
|
|
|
|
Memory use
|
|
----------
|
|
|
|
An obvious concern with any scheme that performs any sort of caching is
|
|
"how much more memory does it use?".
|
|
The short answer is "not that much".
|
|
|
|
Comparing memory use to 3.10
|
|
''''''''''''''''''''''''''''
|
|
|
|
CPython 3.10 used 2 bytes per instruction, until the execution count
|
|
reached ~2000 when it allocates another byte per instruction and
|
|
32 bytes per instruction with a cache (``LOAD_GLOBAL`` and ``LOAD_ATTR``).
|
|
|
|
The following table shows the additional bytes per instruction to support the
|
|
3.10 opcache or the proposed adaptive interpreter, on a 64 bit machine.
|
|
|
|
================ ========== ========== ======
|
|
Version 3.10 cold 3.10 hot 3.11
|
|
Specialised 0% ~15% ~25%
|
|
---------------- ---------- ---------- ------
|
|
code 2 2 2
|
|
opcache_map 0 1 0
|
|
opcache/data 0 4.8 4
|
|
---------------- ---------- ---------- ------
|
|
Total 2 7.8 6
|
|
================ ========== ========== ======
|
|
|
|
``3.10 cold`` is before the code has reached the ~2000 limit.
|
|
``3.10 hot`` shows the cache use once the threshold is reached.
|
|
|
|
The relative memory use depends on how much code is "hot" enough to trigger
|
|
creation of the cache in 3.10. The break even point, where the memory used
|
|
by 3.10 is the same as for 3.11 is ~70%.
|
|
|
|
It is also worth noting that the actual bytecode is only part of a code
|
|
object. Code objects also include names, constants and quite a lot of
|
|
debugging information.
|
|
|
|
In summary, for most applications where many of the functions are relatively
|
|
unused, 3.11 will consume more memory than 3.10, but not by much.
|
|
|
|
|
|
Security Implications
|
|
=====================
|
|
|
|
None
|
|
|
|
|
|
Rejected Ideas
|
|
==============
|
|
|
|
By implementing a specializing adaptive interpreter with inline data caches,
|
|
we are implicitly rejecting many alternative ways to optimize CPython.
|
|
However, it is worth emphasizing that some ideas, such as just-in-time
|
|
compilation, have not been rejected, merely deferred.
|
|
|
|
Storing data caches before the bytecode.
|
|
----------------------------------------
|
|
|
|
An earlier implementation of this PEP for 3.11 alpha used a different caching
|
|
scheme as described below:
|
|
|
|
|
|
Quickened instructions will be stored in an array (it is neither necessary not
|
|
desirable to store them in a Python object) with the same format as the
|
|
original bytecode. Ancillary data will be stored in a separate array.
|
|
|
|
Each instruction will use 0 or more data entries.
|
|
Each instruction within a family must have the same amount of data allocated,
|
|
although some instructions may not use all of it.
|
|
Instructions that cannot be specialized, e.g. ``POP_TOP``,
|
|
do not need any entries.
|
|
Experiments show that 25% to 30% of instructions can be usefully specialized.
|
|
Different families will need different amounts of data,
|
|
but most need 2 entries (16 bytes on a 64 bit machine).
|
|
|
|
In order to support larger functions than 256 instructions,
|
|
we compute the offset of the first data entry for instructions
|
|
as ``(instruction offset)//2 + (quickened operand)``.
|
|
|
|
Compared to the opcache in Python 3.10, this design:
|
|
|
|
* is faster; it requires no memory reads to compute the offset.
|
|
3.10 requires two reads, which are dependent.
|
|
* uses much less memory, as the data can be different sizes for different
|
|
instruction families, and doesn't need an additional array of offsets.
|
|
can support much larger functions, up to about 5000 instructions
|
|
per function. 3.10 can support about 1000.
|
|
|
|
We rejected this scheme as the inline cache approach is both faster
|
|
and simpler.
|
|
|
|
References
|
|
==========
|
|
|
|
.. [1] The construction of high-performance virtual machines for
|
|
dynamic languages, Mark Shannon 2011.
|
|
https://theses.gla.ac.uk/2975/1/2011shannonphd.pdf
|
|
|
|
.. [2] Dynamic Interpretation for Dynamic Scripting Languages
|
|
https://www.scss.tcd.ie/publications/tech-reports/reports.09/TCD-CS-2009-37.pdf
|
|
|
|
.. [3] Inline Caching meets Quickening
|
|
https://www.unibw.de/ucsrl/pubs/ecoop10.pdf/view
|
|
|
|
.. [4] The adaptive and specialized instructions are implemented in
|
|
https://github.com/python/cpython/blob/main/Python/ceval.c
|
|
|
|
The optimizations are implemented in:
|
|
https://github.com/python/cpython/blob/main/Python/specialize.c
|
|
|
|
Copyright
|
|
=========
|
|
|
|
This document is placed in the public domain or under the
|
|
CC0-1.0-Universal license, whichever is more permissive.
|