PEP 659: Update to describe inline caches (#2462)
This commit is contained in: parent a5edad1428 · commit f3aad219e9

pep-0659.rst | 237
@@ -99,16 +99,13 @@ more robust in cases where specialization fails or is not stable.

 Performance
 -----------

-The expected speedup of 50% can be broken roughly down as follows:
+The speedup from specialization is hard to determine, as many specializations
+depend on other optimizations. Speedups seem to be in the range 10% - 60%.

-* In the region of 30% from specialization. Much of that is from
-  specialization of calls, with improvements in instructions that are already
-  specialized such as ``LOAD_ATTR`` and ``LOAD_GLOBAL`` contributing much of
-  the remainder. Specialization of operations adds a small amount.
-* About 10% from improved dispatch such as super-instructions
-  and other optimizations enabled by quickening.
-* Further increases in the benefits of other optimizations,
-  as they can exploit, or be exploited by, specialization.
+* Most of the speedup comes directly from specialization. The largest
+  contributors are speedups to attribute lookup, global variables, and calls.
+* A small, but useful, fraction is from improved dispatch such as
+  super-instructions and other optimizations enabled by quickening.

 Implementation
 ==============
@@ -116,30 +113,29 @@ Implementation

 Overview
 --------

-Once any instruction in a code object has executed a few times,
-that code object will be "quickened" by allocating a new array for the
-bytecode that can be modified at runtime, and is not constrained as the
-``code.co_code`` object is. From that point onwards, whenever any
-instruction in that code object is executed, it will use the quickened form.
+Any instruction that would benefit from specialization will be replaced by an
+"adaptive" form of that instruction. When executed, the adaptive instructions
+will specialize themselves in response to the types and values that they see.
+This process is known as "quickening".
+
+Once an instruction in a code object has executed enough times,
+that instruction will be "specialized" by replacing it with a new instruction
+that is expected to execute faster for that operation.

 Quickening
 ----------

 Quickening is the process of replacing slow instructions with faster variants.

-Quickened code has a number of advantages over the normal bytecode:
+Quickened code has a number of advantages over immutable bytecode:

 * It can be changed at runtime.
 * It can use super-instructions that span lines and take multiple operands.
-* It does not need to handle tracing as it can fall back to the normal
+* It does not need to handle tracing as it can fall back to the original
   bytecode for that.

-In order that tracing can be supported, and quickening performed quickly,
-the quickened instruction format should match the normal bytecode format:
+In order that tracing can be supported, the quickened instruction format
+should match the immutable, user-visible, bytecode format:
 16-bit instructions of 8-bit opcode followed by 8-bit operand.

 Adaptive instructions
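
To make the format concrete, here is a minimal sketch of the 16-bit
instruction layout and a toy quickening pass. The opcode numbers and the
adaptive mapping are invented for illustration; CPython's real tables differ::

    # 16-bit instruction: 8-bit opcode in the high byte, 8-bit operand low.
    LOAD_ATTR, LOAD_ATTR_ADAPTIVE = 106, 200   # numbers invented for sketch
    ADAPTIVE_FORM = {LOAD_ATTR: LOAD_ATTR_ADAPTIVE}

    def encode(opcode, operand):
        return (opcode << 8) | operand

    def decode(word):
        return word >> 8, word & 0xFF

    def quicken(bytecode):
        """Copy the immutable bytecode into a mutable array, swapping in
        adaptive forms for instructions that can specialize."""
        out = []
        for word in bytecode:
            opcode, operand = decode(word)
            out.append(encode(ADAPTIVE_FORM.get(opcode, opcode), operand))
        return out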
@@ -149,21 +145,21 @@ Each instruction that would benefit from specialization is replaced by an

 adaptive version during quickening. For example,
 the ``LOAD_ATTR`` instruction would be replaced with ``LOAD_ATTR_ADAPTIVE``.

-Each adaptive instruction maintains a counter,
-and periodically attempts to specialize itself.
+Each adaptive instruction periodically attempts to specialize itself.

 Specialization
 --------------

-CPython bytecode contains many bytecodes that represent high-level operations,
-and would benefit from specialization. Examples include ``CALL_FUNCTION``,
+CPython bytecode contains many instructions that represent high-level
+operations, and would benefit from specialization. Examples include ``CALL``,
 ``LOAD_ATTR``, ``LOAD_GLOBAL`` and ``BINARY_ADD``.

 Introducing a "family" of specialized instructions for each of these
 instructions allows effective specialization,
 since each new instruction is specialized to a single task.
-Each family will include an "adaptive" instruction,
-that maintains a counter and periodically attempts to specialize itself.
+Each family will include an "adaptive" instruction, that maintains a counter
+and attempts to specialize itself when that counter reaches zero.

 Each family will also include one or more specialized instructions that
 perform the equivalent of the generic operation much faster provided their
 inputs are as expected.
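
A sketch of the family mechanism, in illustrative Python with an invented
threshold (CPython's counters and thresholds are tuned separately)::

    # A toy instruction family: an adaptive form plus one specialization.
    # The adaptive form counts down and specializes when it reaches zero;
    # the specialized form de-optimizes if its assumptions stop holding.
    THRESHOLD = 8   # invented value for this sketch

    class BinaryAddAdaptive:
        def __init__(self):
            self.counter = THRESHOLD

        def execute(self, slot, left, right):
            self.counter -= 1
            if self.counter == 0:
                if type(left) is int and type(right) is int:
                    slot[0] = BinaryAddInt()   # specialize in place
                else:
                    self.counter = THRESHOLD   # try again later
            return left + right                # generic operation

    class BinaryAddInt:
        def execute(self, slot, left, right):
            if type(left) is not int or type(right) is not int:
                slot[0] = BinaryAddAdaptive()  # deoptimize
            return left + right

    slot = [BinaryAddAdaptive()]
    for _ in range(10):
        slot[0].execute(slot, 2, 3)            # slot ends up as BinaryAddInt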
@@ -178,85 +174,41 @@ Ancillary data
 --------------

 Most families of specialized instructions will require more information than
-can fit in an 8-bit operand. To do this, an array of specialization data entries
-will be maintained alongside the new instruction array. For instructions that
-need specialization data, the operand in the quickened array will serve as a
-partial index, along with the offset of the instruction, to find the first
-specialization data entry for that instruction.
-Each entry will be 8 bytes (for a 64 bit machine). The data in an entry,
-and the number of entries needed, will vary from instruction to instruction.
-
-Data layout
------------
-
-Quickened instructions will be stored in an array (it is neither necessary
-nor desirable to store them in a Python object) with the same format as the
-original bytecode. Ancillary data will be stored in a separate array.
-
-Each instruction will use 0 or more data entries.
-Each instruction within a family must have the same amount of data allocated,
-although some instructions may not use all of it.
-Instructions that cannot be specialized, e.g. ``POP_TOP``,
-do not need any entries.
-Experiments show that 25% to 30% of instructions can be usefully specialized.
-Different families will need different amounts of data,
-but most need 2 entries (16 bytes on a 64 bit machine).
-
-In order to support functions larger than 256 instructions,
-we compute the offset of the first data entry for instructions
-as ``(instruction offset)//2 + (quickened operand)``.
-
-Compared to the opcache in Python 3.10, this design:
-
-* is faster; it requires no memory reads to compute the offset.
-  3.10 requires two reads, which are dependent.
-* uses much less memory, as the data can be different sizes for different
-  instruction families, and doesn't need an additional array of offsets.
-* can support much larger functions, up to about 5000 instructions
-  per function. 3.10 can support about 1000.
+can fit in an 8-bit operand. To do this, a number of 16-bit entries immediately
+following the instruction are used to store this data. This is a form of inline
+cache, an "inline data cache". Unspecialized, or adaptive, instructions will
+use the first entry of this cache as a counter, and simply skip over the others.
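
As an illustrative sketch of this layout, instructions and their inline cache
entries share one array, and dispatch simply skips the cache. The entry counts
below are invented; the real per-family counts live in the interpreter::

    # Toy stream of 16-bit units: each instruction is followed by a fixed
    # number of cache entries (0 for unspecializable opcodes like POP_TOP).
    CACHE_ENTRIES = {"LOAD_ATTR": 4, "LOAD_GLOBAL": 5, "POP_TOP": 0}

    def walk(stream):
        """Yield (index, opcode, operand, cache), skipping cache entries."""
        i = 0
        while i < len(stream):
            opcode, operand = stream[i]
            n = CACHE_ENTRIES.get(opcode, 0)
            cache = stream[i + 1:i + 1 + n]  # first entry is the counter
            yield i, opcode, operand, cache  # when the form is adaptive
            i += 1 + n

    stream = [("LOAD_ATTR", 0), 0, 0, 0, 0, ("POP_TOP", 0)]
    for item in walk(stream):
        print(item)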

 Example families of instructions
 --------------------------------

-CALL_FUNCTION
-'''''''''''''
+LOAD_ATTR
+'''''''''

-The ``CALL_FUNCTION`` instruction calls the (N+1)th item on the stack with
-the top N items on the stack as arguments.
+The ``LOAD_ATTR`` instruction loads the named attribute of the object on top
+of the stack, then replaces the object on top of the stack with the attribute.
-This is an obvious candidate for specialization. For example, the call in
-``len(x)`` is represented as the bytecode ``CALL_FUNCTION 1``.
-In this case we would always expect the object ``len`` to be the function.
-We probably don't want to specialize for ``len``
-(although we might for ``type`` and ``isinstance``), but it would be beneficial
-to specialize for builtin functions taking a single argument.
-A fast check that the underlying function is a builtin function taking a single
-argument (``METH_O``) would allow us to avoid a sequence of checks for the
-number of parameters and keyword arguments.
+This is an obvious candidate for specialization. Attributes might belong to
+a normal instance, a class, a module, or one of many other special cases.

-``CALL_FUNCTION_ADAPTIVE`` would track how often it is executed, and call
-``call_function_optimize`` when executed enough times, or jump to
-``CALL_FUNCTION`` otherwise. When optimizing, the kind of the function would
-be checked and, if a suitable specialized instruction was found,
-it would replace ``CALL_FUNCTION_ADAPTIVE`` in place.
+``LOAD_ATTR`` would initially be quickened to ``LOAD_ATTR_ADAPTIVE``, which
+would track how often it is executed, and call the ``_Py_Specialize_LoadAttr``
+internal function when executed enough times, or jump to the original
+``LOAD_ATTR`` instruction to perform the load. When optimizing, the kind
+of the attribute would be examined and, if a suitable specialized instruction
+was found, it would replace ``LOAD_ATTR_ADAPTIVE`` in place.
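
``_Py_Specialize_LoadAttr`` is the real helper's name, but the decision logic
below is only a simplified Python stand-in for how the kind of attribute
might select a specialization::

    import types

    def specialize_load_attr(owner, name):
        """Return a specialized opcode name for this (owner, name) pair,
        or None to stay adaptive. Simplified stand-in, not CPython's logic."""
        if isinstance(owner, types.ModuleType):
            return "LOAD_ATTR_MODULE"
        cls = type(owner)
        if name in getattr(cls, "__slots__", ()):
            return "LOAD_ATTR_SLOT"
        defined_on_class = any(name in vars(c) for c in cls.__mro__)
        if not defined_on_class and hasattr(owner, "__dict__"):
            return "LOAD_ATTR_INSTANCE_VALUE"  # plain instance attribute
        return None                            # unsupported: stay adaptive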

-Specializations might include:
+Specializations for ``LOAD_ATTR`` might include:

-* ``CALL_FUNCTION_PY_SIMPLE``: Calls to Python functions with
-  exactly matching parameters.
-* ``CALL_FUNCTION_PY_DEFAULTS``: Calls to Python functions with more
-  parameters and default values. Since the exact number of defaults needed
-  is known, the instruction needs to do no additional checking or
-  computation; it just copies some defaults.
-* ``CALL_BUILTIN_O``: The example given above for calling builtin methods
-  taking exactly one argument.
-* ``CALL_BUILTIN_VECTOR``: For calling builtin functions taking
-  vector arguments.
+* ``LOAD_ATTR_INSTANCE_VALUE``: A common case, where the attribute is stored
+  in the object's value array and not shadowed by an overriding descriptor.
+* ``LOAD_ATTR_MODULE``: Load an attribute from a module.
+* ``LOAD_ATTR_SLOT``: Load an attribute from an object whose
+  class defines ``__slots__``.
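
On CPython 3.11 the inline caches and the chosen specialization can be
observed directly with the ``dis`` module; the exact output varies by version
and may show, for example, ``LOAD_ATTR_INSTANCE_VALUE``::

    import dis

    class C:
        def __init__(self):
            self.x = 1

    def f(o):
        return o.x

    for _ in range(100):   # warm up so the adaptive LOAD_ATTR can specialize
        f(C())

    # show_caches displays the inline CACHE entries; adaptive shows the
    # specialized form of each instruction (both parameters added in 3.11).
    dis.dis(f, show_caches=True, adaptive=True)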

-Note how this allows optimizations that complement other optimizations.
-For example, if the Python and C call stacks were decoupled and the data
-stack were contiguous, then Python-to-Python calls could be made very fast.
+``LOAD_ATTR_INSTANCE_VALUE`` works well with the "lazy dictionary" used for
+many objects.

 LOAD_GLOBAL
 '''''''''''
@@ -276,7 +228,7 @@ as each instruction only needs to handle one concern.

 Specializations would include:

-* ``LOAD_GLOBAL_ADAPTIVE`` would operate like ``CALL_FUNCTION_ADAPTIVE`` above.
+* ``LOAD_GLOBAL_ADAPTIVE`` would operate like ``LOAD_ATTR_ADAPTIVE`` above.
 * ``LOAD_GLOBAL_MODULE`` can be specialized for the case where the value is in
   the globals namespace. After checking that the keys of the namespace have
   not changed, it can load the value from the stored index.
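
A sketch of the guard ``LOAD_GLOBAL_MODULE`` relies on, in illustrative
Python; CPython uses an internal dict-keys version tag rather than the key
snapshot used here::

    class LoadGlobalModule:
        """Fast path for a global: valid while the dict's keys are unchanged."""
        def __init__(self, globals_dict, name):
            self.keys = tuple(globals_dict)      # snapshot standing in for
            self.index = self.keys.index(name)   # the keys version; index cached

        def execute(self, globals_dict):
            if tuple(globals_dict) != self.keys: # guard: did the keys change?
                return None                      # deoptimize to the adaptive form
            # CPython loads straight from the stored index; a dict lookup
            # stands in for that here.
            return globals_dict[self.keys[self.index]]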
@@ -292,8 +244,8 @@ See [4]_ for a full implementation.

 This PEP outlines the mechanisms for managing specialization, and does not
 specify the particular optimizations to be applied.
-The above scheme is just one possible scheme.
-Many others are possible and may well be better.
+It is likely that details, or even the entire implementation, may change
+as the code is further developed.

 Compatibility
 =============
@@ -312,32 +264,42 @@ Memory use

 An obvious concern with any scheme that performs any sort of caching is
 "how much more memory does it use?".
-The short answer is "none".
+The short answer is "not that much".

 Comparing memory use to 3.10
 ''''''''''''''''''''''''''''

+CPython 3.10 uses 2 bytes per instruction, until the execution count
+reaches ~2000, when it allocates another byte per instruction and
+32 bytes per instruction with a cache (``LOAD_GLOBAL`` and ``LOAD_ATTR``).
+
 The following table shows the additional bytes per instruction to support the
 3.10 opcache or the proposed adaptive interpreter, on a 64 bit machine.

-================ ===== ======== ===== =====
-Version          3.10  3.10 opt 3.11  3.11
-Specialised      20%   20%      25%   33%
----------------- ----- -------- ----- -----
-quickened code   0     0        2     2
-opcache_map      1     1        0     0
-opcache/data     6.4   4.8      4     5.3
----------------- ----- -------- ----- -----
-Total            7.4   5.8      6     7.3
-================ ===== ======== ===== =====
+================ ========== ========== ======
+Version          3.10 cold  3.10 hot   3.11
+Specialised      0%         ~15%       ~25%
+---------------- ---------- ---------- ------
+code             2          2          2
+opcache_map      0          1          0
+opcache/data     0          4.8        4
+---------------- ---------- ---------- ------
+Total            2          7.8        6
+================ ========== ========== ======

-``3.10`` is the current version of 3.10, which uses 32 bytes per entry.
-``3.10 opt`` is a hypothetical improved version of 3.10 that uses 24 bytes
-per entry.
+``3.10 cold`` is before the code has reached the ~2000 limit.
+``3.10 hot`` shows the cache use once the threshold is reached.

-Even if one third of all instructions were specialized (a high proportion),
-then the memory use is still less than that of 3.10.
-With a more realistic 25%, memory use is basically the same as the
-hypothetical improved version of 3.10.
+The relative memory use depends on how much code is "hot" enough to trigger
+creation of the cache in 3.10. The break-even point, where the memory used
+by 3.10 is the same as for 3.11, is ~70%.
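
The table's totals can be sanity-checked arithmetically. The 32 bytes per
3.10 cache entry is stated above; the ~16 bytes of inline cache per
specialized 3.11 instruction is an assumption chosen to match the table::

    # Back-of-envelope check of bytes per instruction ("Total" row).
    hot_310 = 2 + 1 + 0.15 * 32  # bytecode + opcache_map + 15% * 32-byte entry
    v3_11 = 2 + 0.25 * 16        # bytecode + 25% * ~16 bytes of inline cache
    print(hot_310, v3_11)        # 7.8 and 6.0, matching the table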

 It is also worth noting that the actual bytecode is only part of a code
 object. Code objects also include names, constants and quite a lot of
 debugging information.

+In summary, for most applications where many of the functions are relatively
+unused, 3.11 will consume more memory than 3.10, but not by much.
+

 Security Implications
@@ -349,8 +311,46 @@ None

 Rejected Ideas
 ==============

-Too many to list.
+By implementing a specializing adaptive interpreter with inline data caches,
+we are implicitly rejecting many alternative ways to optimize CPython.
+However, it is worth emphasizing that some ideas, such as just-in-time
+compilation, have not been rejected, merely deferred.
+
+Storing data caches before the bytecode
+---------------------------------------
+
+An earlier implementation of this PEP for the 3.11 alpha used a different
+caching scheme, described below:
+
+Quickened instructions will be stored in an array (it is neither necessary
+nor desirable to store them in a Python object) with the same format as the
+original bytecode. Ancillary data will be stored in a separate array.
+
+Each instruction will use 0 or more data entries.
+Each instruction within a family must have the same amount of data allocated,
+although some instructions may not use all of it.
+Instructions that cannot be specialized, e.g. ``POP_TOP``,
+do not need any entries.
+Experiments show that 25% to 30% of instructions can be usefully specialized.
+Different families will need different amounts of data,
+but most need 2 entries (16 bytes on a 64 bit machine).
+
+In order to support functions larger than 256 instructions,
+we compute the offset of the first data entry for instructions
+as ``(instruction offset)//2 + (quickened operand)``.
+
+Compared to the opcache in Python 3.10, this design:
+
+* is faster; it requires no memory reads to compute the offset.
+  3.10 requires two reads, which are dependent.
+* uses much less memory, as the data can be different sizes for different
+  instruction families, and doesn't need an additional array of offsets.
+* can support much larger functions, up to about 5000 instructions
+  per function. 3.10 can support about 1000.
+
+We rejected this scheme as the inline cache approach is both faster
+and simpler.

 References
 ==========
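
For reference, the rejected scheme's indexing formula is easy to make
concrete. A sketch, where ``offset`` is taken here to count instructions::

    def first_entry_index(offset: int, operand: int) -> int:
        """Locate an instruction's first ancillary-data entry under the
        rejected scheme."""
        assert 0 <= operand < 256  # the quickened operand is a single byte
        return offset // 2 + operand

    # With ~25-30% of instructions using ~2 entries, the needed index grows
    # at roughly offset//2, so the 8-bit operand only has to absorb a drift
    # of up to 255 entries, which is what caps functions at ~5000
    # instructions under this scheme.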
@@ -365,10 +365,11 @@ References

 .. [3] Inline Caching meets Quickening
    https://www.unibw.de/ucsrl/pubs/ecoop10.pdf/view

-.. [4] Adaptive specializing examples
-   (This will be moved to a more permanent location, once this PEP is accepted)
-   https://gist.github.com/markshannon/556ccc0e99517c25a70e2fe551917c03
+.. [4] The adaptive and specialized instructions are implemented in
+   https://github.com/python/cpython/blob/main/Python/ceval.c
+
+   The optimizations are implemented in:
+   https://github.com/python/cpython/blob/main/Python/specialize.c

 Copyright
 =========