Reformat PEP 659 to obey 80 column limit. (#2458)

This commit is contained in:
Mark Shannon 2022-03-22 10:27:32 +00:00 committed by GitHub
parent 04eb44995d
commit 293dd4e107
1 changed file with 192 additions and 112 deletions


@@ -11,21 +11,26 @@ Post-History: 11-May-2021

Abstract
========

In order to perform well, virtual machines for dynamic languages must
specialize the code that they execute to the types and values in the
program being run. This specialization is often associated with "JIT"
compilers, but is beneficial even without machine code generation.

A specializing, adaptive interpreter is one that speculatively specializes
on the types or values it is currently operating on, and adapts to changes
in those types and values.

Specialization gives us improved performance, and adaptation allows the
interpreter to rapidly change when the pattern of usage in a program alters,
limiting the amount of additional work caused by mis-specialization.

This PEP proposes using a specializing, adaptive interpreter that specializes
code aggressively, but over a very small region, and is able to adjust to
mis-specialization rapidly and at low cost.

Adding a specializing, adaptive interpreter to CPython will bring significant
performance improvements. It is hard to come up with meaningful numbers,
as it depends very much on the benchmarks and on work that has not yet
happened. Extensive experimentation suggests speedups of up to 50%.
Even if the speedup were only 25%, this would still be a worthwhile enhancement.
@@ -33,44 +38,62 @@ Motivation
==========

Python is widely acknowledged as slow.
Whilst Python will never attain the performance of low-level languages like C,
Fortran, or even Java, we would like it to be competitive with fast
implementations of scripting languages, like V8 for JavaScript or LuaJIT for
Lua.
Specifically, we want to achieve these performance goals with CPython to
benefit all users of Python including those unable to use PyPy or
other alternative virtual machines.

Achieving these performance goals is a long way off, and will require a lot of
engineering effort, but we can make a significant step towards those goals by
speeding up the interpreter.
Both academic research and practical implementations have shown that a fast
interpreter is a key part of a fast virtual machine.

Typical optimizations for virtual machines are expensive, so a long "warm up"
time is required to gain confidence that the cost of optimization is justified.
In order to get speed-ups rapidly, without noticeable warmup times,
the VM should speculate that specialization is justified even after a few
executions of a function. To do that effectively, the interpreter must be able
to optimize and de-optimize continually and very cheaply.

By using adaptive and speculative specialization at the granularity of
individual virtual machine instructions,
we get a faster interpreter that also generates profiling information
for more sophisticated optimizations in the future.


Rationale
=========

There are many practical ways to speed up a virtual machine for a dynamic
language.
However, specialization is the most important, both in itself and as an
enabler of other optimizations.
Therefore it makes sense to focus our efforts on specialization first,
if we want to improve the performance of CPython.

Specialization is typically done in the context of a JIT compiler,
but research shows specialization in an interpreter can boost performance
significantly, even outperforming a naive compiler [1]_.

There have been several ways of doing this proposed in the academic
literature, but most attempt to optimize regions larger than a
single bytecode [1]_ [2]_.
Using larger regions than a single instruction requires code to handle
de-optimization in the middle of a region.
Specialization at the level of individual bytecodes makes de-optimization
trivial, as it cannot occur in the middle of a region.

By speculatively specializing individual bytecodes, we can gain significant
performance improvements without anything but the most local,
and trivial to implement, de-optimizations.

The closest approach to this PEP in the literature is
"Inline Caching meets Quickening" [3]_.
This PEP has the advantages of inline caching,
but adds the ability to quickly de-optimize, making the performance
more robust in cases where specialization fails or is not stable.


Performance
@@ -78,11 +101,14 @@ Performance

The expected speedup of 50% can be roughly broken down as follows:

* In the region of 30% from specialization. Much of that is from
  specialization of calls, with improvements in instructions that are already
  specialized such as ``LOAD_ATTR`` and ``LOAD_GLOBAL`` contributing much of
  the remainder. Specialization of operations adds a small amount.
* About 10% from improved dispatch such as super-instructions
  and other optimizations enabled by quickening.
* Further increases in the benefits of other optimizations,
  as they can exploit, or be exploited by, specialization.


Implementation
==============
@@ -90,12 +116,15 @@ Implementation

Overview
--------

Once any instruction in a code object has executed a few times,
that code object will be "quickened" by allocating a new array for the
bytecode that can be modified at runtime, and is not constrained as the
``code.co_code`` object is. From that point onwards, whenever any
instruction in that code object is executed, it will use the quickened form.
Any instruction that would benefit from specialization will be replaced by an
"adaptive" form of that instruction. When executed, the adaptive instructions
will specialize themselves in response to the types and values that they see.


Quickening
----------
@@ -106,62 +135,85 @@ Quickened code has a number of advantages over the normal bytecode:

* It can be changed at runtime.
* It can use super-instructions that span lines and take multiple operands.
* It does not need to handle tracing as it can fall back to the normal
  bytecode for that.

In order that tracing can be supported, and quickening performed quickly,
the quickened instruction format should match the normal bytecode format:
16-bit instructions of 8-bit opcode followed by 8-bit operand.


Adaptive instructions
---------------------

Each instruction that would benefit from specialization is replaced by an
adaptive version during quickening. For example,
the ``LOAD_ATTR`` instruction would be replaced with ``LOAD_ATTR_ADAPTIVE``.

Each adaptive instruction maintains a counter,
and periodically attempts to specialize itself.
Specialization
--------------

CPython bytecode contains many bytecodes that represent high-level operations,
and would benefit from specialization. Examples include ``CALL_FUNCTION``,
``LOAD_ATTR``, ``LOAD_GLOBAL`` and ``BINARY_ADD``.

Introducing a "family" of specialized instructions for each of these
instructions allows effective specialization,
since each new instruction is specialized to a single task.
Each family will include an "adaptive" instruction that maintains a counter
and periodically attempts to specialize itself.
Each family will also include one or more specialized instructions that
perform the equivalent of the generic operation much faster provided their
inputs are as expected.
Each specialized instruction will maintain a saturating counter which will
be incremented whenever the inputs are as expected. Should the inputs not
be as expected, the counter will be decremented and the generic operation
will be performed.
If the counter reaches the minimum value, the instruction is de-optimized by
simply replacing its opcode with the adaptive version.
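A minimal C sketch of that saturating counter follows. The bounds and the
``adaptive_version_of`` helper are hypothetical; the point is that
de-optimization is a single in-place opcode write::

    #include <stdint.h>

    enum { COUNTER_MIN = 0, COUNTER_MAX = 255 };

    typedef struct {
        uint8_t opcode;
        uint8_t counter;    /* saturates at COUNTER_MIN/COUNTER_MAX */
    } instruction;

    uint8_t adaptive_version_of(uint8_t opcode);   /* hypothetical */

    void
    inputs_as_expected(instruction *instr)
    {
        if (instr->counter < COUNTER_MAX) {
            instr->counter++;
        }
    }

    void
    inputs_not_as_expected(instruction *instr)
    {
        if (instr->counter > COUNTER_MIN) {
            instr->counter--;    /* generic operation is performed */
        }
        else {
            /* De-optimize: one opcode write restores the adaptive form. */
            instr->opcode = adaptive_version_of(instr->opcode);
        }
    }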
Ancillary data
--------------

Most families of specialized instructions will require more information than
can fit in an 8-bit operand. To do this, an array of specialization data
entries will be maintained alongside the new instruction array. For
instructions that need specialization data, the operand in the quickened
array will serve as a partial index, along with the offset of the
instruction, to find the first specialization data entry for that instruction.
Each entry will be 8 bytes (for a 64 bit machine). The data in an entry,
and the number of entries needed, will vary from instruction to instruction.


Data layout
-----------

Quickened instructions will be stored in an array (it is neither necessary
nor desirable to store them in a Python object) with the same format as the
original bytecode. Ancillary data will be stored in a separate array.

Each instruction will use 0 or more data entries.
Each instruction within a family must have the same amount of data allocated,
although some instructions may not use all of it.
Instructions that cannot be specialized, e.g. ``POP_TOP``,
do not need any entries.
Experiments show that 25% to 30% of instructions can be usefully specialized.
Different families will need different amounts of data,
but most need 2 entries (16 bytes on a 64 bit machine).

In order to support functions larger than 256 instructions,
we compute the offset of the first data entry for instructions
as ``(instruction offset)//2 + (quickened operand)``.
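As a worked example of this formula (the function name is ours; the
arithmetic is from the text above): because the instruction offset grows with
position in the code, the reachable entry index is not capped at the
operand's maximum of 255::

    #include <stddef.h>
    #include <stdint.h>

    /* Index of the first specialization data entry for an instruction. */
    static inline size_t
    first_entry_index(size_t instruction_offset, uint8_t quickened_operand)
    {
        return instruction_offset / 2 + quickened_operand;
    }

    /* e.g. the instruction at offset 3000 with operand 37 uses entries
       starting at 3000/2 + 37 == 1537, well beyond operand range alone. */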
Compared to the opcache in Python 3.10, this design:

* is faster; it requires no memory reads to compute the offset.
  3.10 requires two reads, which are dependent.
* uses much less memory, as the data can be different sizes for different
  instruction families, and doesn't need an additional array of offsets.
* can support much larger functions, up to about 5000 instructions
  per function. 3.10 can support about 1000.


Example families of instructions
@@ -170,64 +222,86 @@ Example families of instructions

CALL_FUNCTION
'''''''''''''

The ``CALL_FUNCTION`` instruction calls the (N+1)th item on the stack with
the top N items on the stack as arguments.

This is an obvious candidate for specialization. For example, the call in
``len(x)`` is represented as the bytecode ``CALL_FUNCTION 1``.
In this case we would always expect the object ``len`` to be the function.
We probably don't want to specialize for ``len``
(although we might for ``type`` and ``isinstance``), but it would be beneficial
to specialize for builtin functions taking a single argument.
A fast check that the underlying function is a builtin function taking a single
argument (``METH_O``) would allow us to avoid a sequence of checks for the
number of parameters and keyword arguments.
``CALL_FUNCTION_ADAPTIVE`` would track how often it is executed, and call
``call_function_optimize`` when executed enough times, or jump to
``CALL_FUNCTION`` otherwise. When optimizing, the kind of the function would
be checked and, if a suitable specialized instruction was found,
it would replace ``CALL_FUNCTION_ADAPTIVE`` in place.

Specializations might include:

* ``CALL_FUNCTION_PY_SIMPLE``: Calls to Python functions with
  exactly matching parameters.
* ``CALL_FUNCTION_PY_DEFAULTS``: Calls to Python functions with more
  parameters and default values. Since the exact number of defaults needed is
  known, the instruction needs to do no additional checking or computation;
  just copy some defaults.
* ``CALL_BUILTIN_O``: The example given above for calling builtin methods
  taking exactly one argument.
* ``CALL_BUILTIN_VECTOR``: For calling builtin functions taking
  vector arguments.

Note how this allows optimizations that complement other optimizations.
For example, if the Python and C call stacks were decoupled and the data stack
were contiguous, then Python-to-Python calls could be made very fast.


LOAD_GLOBAL
'''''''''''

The ``LOAD_GLOBAL`` instruction looks up a name in the global namespace
and then, if not present in the global namespace,
looks it up in the builtins namespace.

In 3.9 the C code for ``LOAD_GLOBAL`` includes code to check whether the
whole code object should be modified to add a cache, checks on whether the
global or builtins namespace should be consulted, code to look up the value
in a cache, and fallback code.
This makes it complicated and bulky.
It also performs many redundant operations even when supposedly optimized.

Using a family of instructions makes the code more maintainable and faster,
as each instruction only needs to handle one concern.

Specializations would include:

* ``LOAD_GLOBAL_ADAPTIVE`` would operate like ``CALL_FUNCTION_ADAPTIVE`` above.
* ``LOAD_GLOBAL_MODULE`` can be specialized for the case where the value is in
  the globals namespace. After checking that the keys of the namespace have
  not changed, it can load the value from the stored index.
* ``LOAD_GLOBAL_BUILTIN`` can be specialized for the case where the value is
  in the builtins namespace. It needs to check that the keys of the global
  namespace have not been added to, and that the builtins namespace has not
  changed. Note that we don't care if the values of the global namespace
  have changed, just the keys. A sketch of these guards follows the list.
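The following C sketch shows the shape of the ``LOAD_GLOBAL_MODULE`` guard.
CPython's dictionaries do maintain a keys version internally, but the
``globals_keys_version`` and ``value_at_slot`` helpers and the cache struct
here are hypothetical stand-ins, not public API::

    #include <Python.h>
    #include <stdint.h>

    typedef struct {
        uint32_t keys_version;  /* globals-dict keys version when cached */
        uint16_t index;         /* slot where the value was found */
    } load_global_cache;

    /* Hypothetical helpers. */
    uint32_t globals_keys_version(PyObject *globals);
    PyObject *value_at_slot(PyObject *globals, uint16_t index);

    static PyObject *
    load_global_module(PyObject *globals, load_global_cache *cache)
    {
        if (globals_keys_version(globals) != cache->keys_version) {
            return NULL;   /* miss: decrement counter, run generic path */
        }
        /* The keys are unchanged, so the cached slot is still valid,
           even if the value stored there has been reassigned since. */
        return value_at_slot(globals, cache->index);
    }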
See [4]_ for a full implementation.

.. note::

    This PEP outlines the mechanisms for managing specialization, and does not
    specify the particular optimizations to be applied.
    The above scheme is just one possible scheme.
    Many others are possible and may well be better.


Compatibility
=============

There will be no change to the language, library or API.

The only way that users will be able to detect the presence of the new
interpreter is through timing execution, the use of debugging tools,
or measuring memory use.


Costs
@@ -236,13 +310,14 @@ Costs

Memory use
----------

An obvious concern with any scheme that performs any sort of caching is
"how much more memory does it use?".
The short answer is "none".

Comparing memory use to 3.10
''''''''''''''''''''''''''''

The following table shows the additional bytes per instruction to support the
3.10 opcache or the proposed adaptive interpreter, on a 64 bit machine.

================ ===== ======== ===== =====
Version          3.10  3.10 opt 3.11  3.11
@@ -256,10 +331,13 @@ or the proposed adaptive interpreter, on a 64 bit machine.
================ ===== ======== ===== =====

``3.10`` is the current version of 3.10, which uses 32 bytes per entry.
``3.10 opt`` is a hypothetical improved version of 3.10 that uses 24 bytes
per entry.

Even if one third of all instructions were specialized (a high proportion),
the memory use is still less than that of 3.10.
With a more realistic 25%, memory use is basically the same as the
hypothetical improved version of 3.10.
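As a back-of-the-envelope check (our arithmetic, assuming the typical two
8-byte entries per specialized instruction stated above)::

    quickened copy:        2 bytes per instruction (16-bit instructions)
    ancillary data (25%):  0.25 * 16 = 4 bytes per instruction on average
    total:                 ~6 bytes per instruction, versus the 32 bytes
                           per cached entry used by the 3.10 opcache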

Security Implications
@@ -277,7 +355,8 @@ Too many to list.

References
==========

.. [1] The construction of high-performance virtual machines for
       dynamic languages, Mark Shannon 2010.
       http://theses.gla.ac.uk/2975/1/2011shannonphd.pdf

.. [2] Dynamic Interpretation for Dynamic Scripting Languages
@@ -286,7 +365,8 @@ References

.. [3] Inline Caching meets Quickening
       https://www.unibw.de/ucsrl/pubs/ecoop10.pdf/view

.. [4] Adaptive specializing examples
       (This will be moved to a more permanent location, once this PEP is
       accepted)
       https://gist.github.com/markshannon/556ccc0e99517c25a70e2fe551917c03