diff --git a/pep-0659.rst b/pep-0659.rst
index 06abec878..1e8e938b7 100644
--- a/pep-0659.rst
+++ b/pep-0659.rst
@@ -11,21 +11,26 @@ Post-History: 11-May-2021
 Abstract
 ========
 
-In order to perform well, virtual machines for dynamic languages must specialize the code that they execute
-to the types and values in the program being run.
-This specialization is often associated with "JIT" compilers, but is beneficial even without machine code generation.
+In order to perform well, virtual machines for dynamic languages must
+specialize the code that they execute to the types and values in the
+program being run. This specialization is often associated with "JIT"
+compilers, but is beneficial even without machine code generation.
 
-A specializing, adaptive interpreter is one that speculatively specializes on the types or values it is currently operating on,
-and adapts to changes in those types and values.
+A specializing, adaptive interpreter is one that speculatively specializes
+on the types or values it is currently operating on, and adapts to changes
+in those types and values.
 
-Specialization gives us improved performance, and adaptation allows the interpreter to rapidly change when the pattern of usage in a program alters,
+Specialization gives us improved performance, and adaptation allows the
+interpreter to rapidly change when the pattern of usage in a program alters,
 limiting the amount of additional work caused by mis-specialization.
 
-This PEP proposes using a specializing, adaptive interpreter that specializes code aggressively, but over a very small region,
-and is able to adjust to mis-specialization rapidly and at low cost.
+This PEP proposes using a specializing, adaptive interpreter that specializes
+code aggressively, but over a very small region, and is able to adjust to
+mis-specialization rapidly and at low cost.
 
-Adding a specializing, adaptive interpreter to CPython will bring significant performance improvements.
-It is hard to come up with meaningful numbers, as it depends very much on the benchmarks and on work that has not yet happened.
+Adding a specializing, adaptive interpreter to CPython will bring significant
+performance improvements. It is hard to come up with meaningful numbers,
+as it depends very much on the benchmarks and on work that has not yet happened.
 Extensive experimentation suggests speedups of up to 50%.
 Even if the speedup were only 25%, this would still be a worthwhile enhancement.
 
@@ -33,44 +38,62 @@ Motivation
 ==========
 
 Python is widely acknowledged as slow.
-Whilst Python will never attain the performance of low-level languages like C, Fortran, or even Java,
-we would like it to be competitive with fast implementations of scripting languages, like V8 for Javascript or luajit for lua.
-Specifically, we want to achieve these performance goals with CPython to benefit all users of Python
-including those unable to use PyPy or other alternative virtual machines.
+Whilst Python will never attain the performance of low-level languages like C,
+Fortran, or even Java, we would like it to be competitive with fast
+implementations of scripting languages, like V8 for JavaScript or LuaJIT for
+Lua.
+Specifically, we want to achieve these performance goals with CPython to
+benefit all users of Python including those unable to use PyPy or
+other alternative virtual machines.
 
-Achieving these performance goals is a long way off, and will require a lot of engineering effort,
-but we can make a significant step towards those goals by speeding up the interpreter.
-Both academic research and practical implementations have shown that a fast interpreter is a key part of a fast virtual machine.
+Achieving these performance goals is a long way off, and will require a lot of
+engineering effort, but we can make a significant step towards those goals by
+speeding up the interpreter.
+Both academic research and practical implementations have shown that a fast
+interpreter is a key part of a fast virtual machine.
 
-Typical optimizations for virtual machines are expensive, so a long "warm up" time is required
-to gain confidence that the cost of optimization is justified.
+Typical optimizations for virtual machines are expensive, so a long "warm up"
+time is required to gain confidence that the cost of optimization is justified.
 In order to get speed-ups rapidly, without noticeable warmup times,
-the VM should speculate that specialization is justified even after a few executions of a function.
-To do that effectively, the interpreter must be able to optimize and deoptimize continually and very cheaply.
+the VM should speculate that specialization is justified even after a few
+executions of a function. To do that effectively, the interpreter must be able
+to optimize and de-optimize continually and very cheaply.
 
-By using adaptive and speculative specialization at the granularity of individual virtual machine instructions, we get a faster
-interpreter that also generates profiling information for more sophisticated optimizations in the future.
+By using adaptive and speculative specialization at the granularity of
+individual virtual machine instructions,
+we get a faster interpreter that also generates profiling information
+for more sophisticated optimizations in the future.
 
 Rationale
 =========
 
-There are many practical ways to speed-up a virtual machine for a dynamic language.
-However, specialization is the most important, both in itself and as an enabler of other optimizations.
-Therefore it makes sense to focus our efforts on specialization first, if we want to improve the performance of CPython.
+There are many practical ways to speed up a virtual machine for a dynamic
+language.
+However, specialization is the most important, both in itself and as an
+enabler of other optimizations.
+Therefore it makes sense to focus our efforts on specialization first,
+if we want to improve the performance of CPython.
 
-Specialization is typically done in the context of a JIT compiler, but research shows specialization in an interpreter
-can boost performance significantly, even outperforming a naive compiler [1]_.
+Specialization is typically done in the context of a JIT compiler,
+but research shows specialization in an interpreter can boost performance
+significantly, even outperforming a naive compiler [1]_.
 
-There have been several ways of doing this proposed in the academic literature,
-but most attempt to optimize regions larger than a single bytecode [1]_ [2]_.
-Using larger regions than a single instruction, requires code to handle deoptimization in the middle of a region.
-Specialization at the level of individual bytecodes makes deoptimization trivial, as it cannot occur in the middle of a region.
+Several ways of doing this have been proposed in the academic
+literature, but most attempt to optimize regions larger than a
+single bytecode [1]_ [2]_.
+Using larger regions than a single instruction requires code to handle
+de-optimization in the middle of a region.
+Specialization at the level of individual bytecodes makes de-optimization
+trivial, as it cannot occur in the middle of a region.
 
-By speculatively specializing individual bytecodes, we can gain significant performance improvements without anything but the most local,
-and trivial to implement, deoptimizations.
+By speculatively specializing individual bytecodes, we can gain significant
+performance improvements without anything but the most local,
+and trivial to implement, de-optimizations.
 
-The closest approach to this PEP in the literature is "Inline Caching meets Quickening" [3]_.
-This PEP has the advantages of inline caching, but adds the ability to quickly deoptimize making the performance
+The closest approach to this PEP in the literature is
+"Inline Caching meets Quickening" [3]_.
+This PEP has the advantages of inline caching,
+but adds the ability to quickly de-optimize, making the performance
 more robust in cases where specialization fails or is not stable.
 
 Performance
@@ -78,11 +101,14 @@ Performance
 
 The expected speedup of 50% can be broken roughly down as follows:
 
-* In the region of 30% from specialization. Much of that is from specialization of calls,
-  with improvements in instructions that are already specialized such as ``LOAD_ATTR`` and ``LOAD_GLOBAL``
-  contributing much of the remainder. Specialization of operations adds a small amount.
-* About 10% from improved dispatch such as super-instructions and other optimizations enabled by quickening.
-* Further increases in the benefits of other optimizations, as they can exploit, or be exploited by specialization.
+* In the region of 30% from specialization. Much of that is from
+  specialization of calls, with improvements in instructions that are already
+  specialized such as ``LOAD_ATTR`` and ``LOAD_GLOBAL`` contributing much of
+  the remainder. Specialization of operations adds a small amount.
+* About 10% from improved dispatch such as super-instructions
+  and other optimizations enabled by quickening.
+* Further increases in the benefits of other optimizations,
+  as they can exploit, or be exploited by, specialization.
 
 Implementation
 ==============
 
@@ -90,12 +116,15 @@ Implementation
 
 Overview
 --------
 
-Once any instruction in a code object has executed a few times, that code object will be "quickened" by allocating a new array
-for the bytecode that can be modified at runtime, and is not constrained as the ``code.co_code`` object is.
-From that point onwards, whenever any instruction in that code object is executed, it will use the quickened form.
+Once any instruction in a code object has executed a few times,
+that code object will be "quickened" by allocating a new array for the
+bytecode that can be modified at runtime, and is not constrained as the
+``code.co_code`` object is. From that point onwards, whenever any
+instruction in that code object is executed, it will use the quickened form.
 
-Any instruction that would benefit from specialization will be replaced by an "adaptive" form of that instruction.
-When executed, the adaptive instructions will specialize themselves in response to the types and values that they see.
+Any instruction that would benefit from specialization will be replaced by an
+"adaptive" form of that instruction. When executed, the adaptive instructions
+will specialize themselves in response to the types and values that they see.
 
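+As an illustration only, the quickening step can be sketched in C.
+All names, the opcode values, and the layout below are invented for
+this sketch; they are not the proposed implementation::
+
+    #include <stdint.h>
+    #include <stdlib.h>
+    #include <string.h>
+
+    /* 16-bit instruction: 8-bit opcode followed by 8-bit operand. */
+    typedef struct { uint8_t opcode; uint8_t operand; } instr;
+
+    /* Hypothetical opcode numbers, for illustration only. */
+    enum { LOAD_ATTR = 1, LOAD_ATTR_ADAPTIVE = 2 };
+
+    /* Copy the immutable bytecode into a new, mutable array and
+       replace specializable instructions with their adaptive forms. */
+    static instr *quicken(const instr *co_code, size_t n)
+    {
+        instr *quickened = malloc(n * sizeof(instr));
+        if (quickened == NULL)
+            return NULL;
+        memcpy(quickened, co_code, n * sizeof(instr));
+        for (size_t i = 0; i < n; i++) {
+            if (quickened[i].opcode == LOAD_ATTR)
+                quickened[i].opcode = LOAD_ATTR_ADAPTIVE;
+        }
+        return quickened;
+    }
+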
 Quickening
 ----------
 
@@ -106,62 +135,85 @@ Quickened code has number of advantages over the normal bytecode:
 
 * It can be changed at runtime
 * It can use super-instructions that span lines and take multiple operands.
-* It does not need to handle tracing as it can fallback to the normal bytecode for that.
+* It does not need to handle tracing as it can fall back to the normal
+  bytecode for that.
 
-In order that tracing can be supported, and quickening performed quickly, the quickened instruction format should match the normal
-bytecode format: 16-bit instructions of 8-bit opcode followed by 8-bit operand.
+In order that tracing can be supported, and quickening performed quickly,
+the quickened instruction format should match the normal bytecode format:
+16-bit instructions of 8-bit opcode followed by 8-bit operand.
 
 Adaptive instructions
 ---------------------
 
-Each instruction that would benefit from specialization is replaced by an adaptive version during quickening.
-For example, the ``LOAD_ATTR`` instruction would be replaced with ``LOAD_ATTR_ADAPTIVE``.
+Each instruction that would benefit from specialization is replaced by an
+adaptive version during quickening. For example,
+the ``LOAD_ATTR`` instruction would be replaced with ``LOAD_ATTR_ADAPTIVE``.
 
-Each adaptive instruction maintains a counter, and periodically attempts to specialize itself.
+Each adaptive instruction maintains a counter,
+and periodically attempts to specialize itself.
 
 Specialization
 --------------
 
-CPython bytecode contains many bytecodes that represent high-level operations, and would benefit from specialization.
-Examples include ``CALL_FUNCTION``, ``LOAD_ATTR``, ``LOAD_GLOBAL`` and ``BINARY_ADD``.
+CPython bytecode contains many bytecodes that represent high-level operations,
+and would benefit from specialization. Examples include ``CALL_FUNCTION``,
+``LOAD_ATTR``, ``LOAD_GLOBAL`` and ``BINARY_ADD``.
 
-By introducing a "family" of specialized instructions for each of these instructions allows effective specialization,
+Introducing a "family" of specialized instructions for each of these
+instructions allows effective specialization,
 since each new instruction is specialized to a single task.
-Each family will include an "adaptive" instruction, that maintains a counter and periodically attempts to specialize itself.
-Each family will also include one or more specialized instructions that perform the equivalent
-of the generic operation much faster provided their inputs are as expected.
-Each specialized instruction will maintain a saturating counter which will be incremented whenever the inputs are as expected.
-Should the inputs not be as expected, the counter will be decremented and the generic operation will be performed.
-If the counter reaches the minimum value, the instruction is deoptimized by simply replacing its opcode with the adaptive version.
+Each family will include an "adaptive" instruction that maintains a counter
+and periodically attempts to specialize itself.
+Each family will also include one or more specialized instructions that
+perform the equivalent of the generic operation much faster provided their
+inputs are as expected.
+Each specialized instruction will maintain a saturating counter which will
+be incremented whenever the inputs are as expected. Should the inputs not
+be as expected, the counter will be decremented and the generic operation
+will be performed.
+If the counter reaches the minimum value, the instruction is de-optimized by
+simply replacing its opcode with the adaptive version.
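+
+The counter discipline can be made concrete with a short sketch.
+The threshold value and all names here are illustrative assumptions,
+not those of the implementation::
+
+    #include <stdbool.h>
+    #include <stdint.h>
+
+    #define COUNTER_MAX 255  /* saturation limit (illustrative) */
+
+    /* Returns true if the specialized fast path may be used.
+       On a miss the counter decays; when it reaches zero the
+       caller rewrites the opcode back to the adaptive form. */
+    static bool counter_update(uint8_t *counter, bool inputs_as_expected)
+    {
+        if (inputs_as_expected) {
+            if (*counter < COUNTER_MAX)
+                (*counter)++;   /* saturate upwards */
+            return true;        /* take the fast path */
+        }
+        if (*counter > 0)
+            (*counter)--;       /* decay towards de-optimization */
+        return false;           /* perform the generic operation */
+    }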
 
 Ancillary data
 --------------
 
-Most families of specialized instructions will require more information than can fit in an 8-bit operand.
-To do this, an array of specialization data entries will be maintained alongside the new instruction array.
-For instructions that need specialization data, the operand in the quickened array will serve as a partial index,
-along with the offset of the instruction, to find the first specialization data entry for that instruction.
-Each entry will be 8 bytes (for a 64 bit machine). The data in an entry, and the number of entries needed, will vary from instruction to instruction.
+Most families of specialized instructions will require more information than
+can fit in an 8-bit operand. To do this, an array of specialization data entries
+will be maintained alongside the new instruction array. For instructions that
+need specialization data, the operand in the quickened array will serve as a
+partial index, along with the offset of the instruction, to find the first
+specialization data entry for that instruction.
+Each entry will be 8 bytes (for a 64 bit machine). The data in an entry,
+and the number of entries needed, will vary from instruction to instruction.
 
 Data layout
 -----------
 
-Quickened instructions will be stored in an array (it is neither necessary not desirable to store them in a Python object) with the same
-format as the original bytecode. Ancillary data will be stored in a separate array.
+Quickened instructions will be stored in an array (it is neither necessary nor
+desirable to store them in a Python object) with the same format as the
+original bytecode. Ancillary data will be stored in a separate array.
 
-Each instruction will use 0 or more data entries. Each instruction within a family must have the same amount of data allocated, although some
-instructions may not use all of it. Instructions that cannot be specialized, e.g. ``POP_TOP``, do not need any entries.
+Each instruction will use 0 or more data entries.
+Each instruction within a family must have the same amount of data allocated,
+although some instructions may not use all of it.
+Instructions that cannot be specialized, e.g. ``POP_TOP``,
+do not need any entries.
 Experiments show that 25% to 30% of instructions can be usefully specialized.
-Different families will need different amounts of data, but most need 2 entries (16 bytes on a 64 bit machine).
+Different families will need different amounts of data,
+but most need 2 entries (16 bytes on a 64 bit machine).
 
-In order to support larger functions than 256 instructions, we compute the offset of the first data entry for instructions
+In order to support functions larger than 256 instructions,
+we compute the offset of the first data entry for instructions
 as ``(instruction offset)//2 + (quickened operand)``.
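+
+Since the formula is central to the design, here it is as code; a minimal
+sketch, with an invented function name::
+
+    #include <stddef.h>
+    #include <stdint.h>
+
+    /* First ancillary-data entry for the instruction at a given
+       offset, computed as (instruction offset)//2 + (quickened operand). */
+    static size_t first_data_entry(size_t instruction_offset,
+                                   uint8_t quickened_operand)
+    {
+        return instruction_offset / 2 + quickened_operand;
+    }
+
+An instruction at offset ``i`` can thus reach entries ``i//2`` through
+``i//2 + 255``; as long as the running total of entries allocated stays
+inside that window, the indexing works, which is roughly where the
+5000-instruction limit mentioned below comes from.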
 
 Compared to the opcache in Python 3.10, this design:
 
-* is faster; it requires no memory reads to compute the offset. 3.10 requires two reads, which are dependent.
-* uses much less memory, as the data can be different sizes for different instruction families, and doesn't need an additional array of offsets.
-* can support much larger functions, up to about 5000 instructions per function. 3.10 can support about 1000.
+* is faster; it requires no memory reads to compute the offset.
+  3.10 requires two reads, which are dependent.
+* uses much less memory, as the data can be different sizes for different
+  instruction families, and doesn't need an additional array of offsets.
+* can support much larger functions, up to about 5000 instructions
+  per function. 3.10 can support about 1000.
 
 
 Example families of instructions
@@ -170,64 +222,86 @@ Example families of instructions
 
 CALL_FUNCTION
 '''''''''''''
 
-The ``CALL_FUNCTION`` instruction calls the (N+1)th item on the stack with top N items on the stack as arguments.
+The ``CALL_FUNCTION`` instruction calls the (N+1)th item on the stack with
+the top N items on the stack as arguments.
 
-This is an obvious candidate for specialization. For example, the call in ``len(x)`` is represented as the bytecode ``CALL_FUNCTION 1``.
-In this case we would always expect the object ``len`` to be the function. We probably don't want to specialize for ``len``
-(although we might for ``type`` and ``isinstance``), but it would be beneficial to specialize for builtin functions taking a single argument.
-A fast check that the underlying function is a builtin function taking a single argument (``METHOD_O``) would allow us to avoid a
-sequence of checks for number of parameters and keyword arguments.
+This is an obvious candidate for specialization. For example, the call in
+``len(x)`` is represented as the bytecode ``CALL_FUNCTION 1``.
+In this case we would always expect the object ``len`` to be the function.
+We probably don't want to specialize for ``len``
+(although we might for ``type`` and ``isinstance``), but it would be beneficial
+to specialize for builtin functions taking a single argument.
+A fast check that the underlying function is a builtin function taking a single
+argument (``METH_O``) would allow us to avoid a sequence of checks for the
+number of parameters and keyword arguments.
 
-``CALL_FUNCTION_ADAPTIVE`` would track how often it is executed, and call the ``call_function_optimize`` when executed enough times, or jump
-to ``CALL_FUNCTION`` otherwise.
-When optimizing, the kind of the function would be checked and if a suitable specialized instruction was found,
+``CALL_FUNCTION_ADAPTIVE`` would track how often it is executed, and call
+``call_function_optimize`` when executed enough times, or jump to ``CALL_FUNCTION``
+otherwise. When optimizing, the kind of the function would be checked and, if a
+suitable specialized instruction was found,
 it would replace ``CALL_FUNCTION_ADAPTIVE`` in place.
 
 Specializations might include:
 
-* ``CALL_FUNCTION_PY_SIMPLE``: Calls to Python functions with exactly matching parameters.
-* ``CALL_FUNCTION_PY_DEFAULTS``: Calls to Python functions with more parameters and default values.
-  Since the exact number of defaults needed is known, the instruction needs to do no additional checking or computation; just copy some defaults.
-* ``CALL_BUILTIN_O``: The example given above for calling builtin methods taking exactly one argument.
-* ``CALL_BUILTIN_VECTOR``: For calling builtin function taking vector arguments.
+* ``CALL_FUNCTION_PY_SIMPLE``: Calls to Python functions with
+  exactly matching parameters.
+* ``CALL_FUNCTION_PY_DEFAULTS``: Calls to Python functions with more
+  parameters and default values. Since the exact number of defaults needed is
+  known, the instruction needs to do no additional checking or computation;
+  just copy some defaults.
+* ``CALL_BUILTIN_O``: The example given above for calling builtin methods
+  taking exactly one argument.
+* ``CALL_BUILTIN_VECTOR``: For calling builtin functions taking
+  vector arguments (see the sketch after this list).
 
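+A sketch of that specialization step, in C; the ``function_kind`` test
+and all names are invented for illustration, not the real code::
+
+    #include <stdint.h>
+
+    /* Hypothetical opcode numbers for the CALL_FUNCTION family. */
+    enum {
+        CALL_FUNCTION = 1,
+        CALL_FUNCTION_ADAPTIVE,
+        CALL_FUNCTION_PY_SIMPLE,
+        CALL_FUNCTION_PY_DEFAULTS,
+        CALL_BUILTIN_O,
+        CALL_BUILTIN_VECTOR,
+    };
+
+    /* Stand-in for the real checks on the callable. */
+    typedef enum {
+        KIND_PY_SIMPLE,    /* Python function, parameters match exactly */
+        KIND_PY_DEFAULTS,  /* Python function, defaults must be copied */
+        KIND_BUILTIN,      /* builtin function */
+        KIND_OTHER,
+    } function_kind;
+
+    /* Pick a specialized opcode for this call site, or fall back to
+       the generic CALL_FUNCTION if nothing fits. The caller writes
+       the result over CALL_FUNCTION_ADAPTIVE in the quickened array. */
+    static uint8_t specialize_call(function_kind kind, int nargs, int meth_o)
+    {
+        switch (kind) {
+        case KIND_PY_SIMPLE:
+            return CALL_FUNCTION_PY_SIMPLE;
+        case KIND_PY_DEFAULTS:
+            return CALL_FUNCTION_PY_DEFAULTS;
+        case KIND_BUILTIN:
+            if (nargs == 1 && meth_o)
+                return CALL_BUILTIN_O;     /* e.g. len(x) */
+            return CALL_BUILTIN_VECTOR;
+        default:
+            return CALL_FUNCTION;          /* no specialization applies */
+        }
+    }
+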
 Note how this allows optimizations that complement other optimizations.
-For example, if the Python and C call stacks were decoupled and the data stack were contiguous,
-then Python-to-Python calls could be made very fast.
+For example, if the Python and C call stacks were decoupled and the data stack
+were contiguous, then Python-to-Python calls could be made very fast.
 
 LOAD_GLOBAL
 '''''''''''
 
-The ``LOAD_GLOBAL`` instruction looks up a name in the global namespace and then, if not present in the global namespace,
+The ``LOAD_GLOBAL`` instruction looks up a name in the global namespace
+and then, if not present in the global namespace,
 looks it up in the builtins namespace.
 
-In 3.9 the C code for the ``LOAD_GLOBAL`` includes code to check to see whether the whole code object should be modified to add a cache,
-whether either the global or builtins namespace, code to lookup the value in a cache, and fallback code.
-This makes it complicated and bulky. It also performs many redundant operations even when supposedly optimized.
+In 3.9 the C code for the ``LOAD_GLOBAL`` instruction includes code to check
+whether the whole code object should be modified to add a cache,
+code to check whether either the global or builtins namespace has changed,
+code to look up the value in a cache, and fallback code.
+This makes it complicated and bulky.
+It also performs many redundant operations even when supposedly optimized.
 
-Using a family of instructions makes the code more maintainable and faster, as each instruction only needs to handle one concern.
+Using a family of instructions makes the code more maintainable and faster,
+as each instruction only needs to handle one concern.
 
 Specializations would include:
 
 * ``LOAD_GLOBAL_ADAPTIVE`` would operate like ``CALL_FUNCTION_ADAPTIVE`` above.
-* ``LOAD_GLOBAL_MODULE`` can be specialized for the case where the value is in the globals namespace.
-  After checking that the keys of the namespace have not changed, it can load the value from the stored index.
-* ``LOAD_GLOBAL_BUILTIN`` can be specialized for the case where the value is in the builtins namespace.
-  It needs to check that the keys of the global namespace have not been added to, and that the builtins namespace has not changed.
-  Note that we don't care if the values of the global namespace have changed, just the keys.
+* ``LOAD_GLOBAL_MODULE`` can be specialized for the case where the value is in
+  the globals namespace. After checking that the keys of the namespace have
+  not changed, it can load the value from the stored index (see the sketch
+  after this list).
+* ``LOAD_GLOBAL_BUILTIN`` can be specialized for the case where the value is
+  in the builtins namespace. It needs to check that the keys of the global
+  namespace have not been added to, and that the builtins namespace has not
+  changed. Note that we don't care if the values of the global namespace
+  have changed, just the keys.
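+
+As a sketch of the ``LOAD_GLOBAL_MODULE`` fast path: the versioned-dict
+model and every name below are invented for illustration; [4]_ has the
+real code::
+
+    #include <stdbool.h>
+    #include <stdint.h>
+
+    /* Toy model of a namespace whose set of keys has a version tag. */
+    typedef struct {
+        uint64_t keys_version;
+        void   **values;
+    } namespace_t;
+
+    /* Ancillary data recorded when the instruction was specialized. */
+    typedef struct {
+        uint64_t expected_keys_version;
+        uint32_t index;             /* slot the value was found in */
+    } global_cache;
+
+    /* Fast path: if the keys have not changed since specialization,
+       the value is still at the cached index. On a miss, the counter
+       mechanism above eventually de-optimizes the instruction. */
+    static bool load_global_module(const namespace_t *globals,
+                                   const global_cache *cache,
+                                   void **result)
+    {
+        if (globals->keys_version != cache->expected_keys_version)
+            return false;           /* miss: take the generic path */
+        *result = globals->values[cache->index];
+        return true;
+    }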
 
 See [4]_ for a full implementation.
 
 .. note::
 
-   This PEP outlines the mechanisms for managing specialization, and does not specify the particular optimizations to be applied.
-   The above scheme is just one possible scheme. Many others are possible and may well be better.
+   This PEP outlines the mechanisms for managing specialization, and does not
+   specify the particular optimizations to be applied.
+   The above scheme is just one possible scheme.
+   Many others are possible and may well be better.
 
 Compatibility
 =============
 
 There will be no change to the language, library or API.
 
-The only way that users will be able to detect the presence of the new interpreter is through timing execution, the use of debugging tools,
+The only way that users will be able to detect the presence of the new
+interpreter is by timing execution, using debugging tools,
 or measuring memory use.
 
 Costs
@@ -236,13 +310,14 @@ Costs
 
 Memory use
 ----------
 
-An obvious concern with any scheme that performs any sort of caching is "how much more memory does it use?".
+An obvious concern with any scheme that performs any sort of caching is
+"how much more memory does it use?".
 The short answer is "none".
 
 Comparing memory use to 3.10
 ''''''''''''''''''''''''''''
 
-The following table shows the additional bytes per instruction to support the 3.10 opcache
-or the proposed adaptive interpreter, on a 64 bit machine.
+The following table shows the additional bytes per instruction to support the
+3.10 opcache or the proposed adaptive interpreter, on a 64 bit machine.
 
 ================ ===== ======== ===== =====
  Version          3.10  3.10 opt 3.11  3.11
@@ -256,10 +331,13 @@ or the proposed adaptive interpreter, on a 64 bit machine.
 ================ ===== ======== ===== =====
 
 ``3.10`` is the current version of 3.10 which uses 32 bytes per entry.
-``3.10 opt`` is a hypothetical improved version of 3.10 that uses 24 bytes per entry.
+``3.10 opt`` is a hypothetical improved version of 3.10 that uses 24 bytes
+per entry.
 
-Even if one third of all instructions were specialized (a high proportion), then the memory use is still less than
-that of 3.10. With a more realistic 25%, then memory use is basically the same as the hypothetical improved version of 3.10.
+Even if one third of all instructions were specialized (a high proportion),
+the memory use would still be less than that of 3.10.
+With a more realistic 25%, memory use is basically the same as in the
+hypothetical improved version of 3.10.
 
 
 Security Implications
@@ -277,7 +355,8 @@ Too many to list.
 
 References
 ==========
 
-.. [1] The construction of high-performance virtual machines for dynamic languages, Mark Shannon 2010.
+.. [1] The construction of high-performance virtual machines for
+   dynamic languages, Mark Shannon 2010.
    http://theses.gla.ac.uk/2975/1/2011shannonphd.pdf
 
 .. [2] Dynamic Interpretation for Dynamic Scripting Languages
@@ -286,7 +365,8 @@ References
 .. [3] Inline Caching meets Quickening
    https://www.unibw.de/ucsrl/pubs/ecoop10.pdf/view
 
-.. [4] Adaptive specializing examples (This will be moved to a more permanent location, once this PEP is accepted)
+.. [4] Adaptive specializing examples
+   (This will be moved to a more permanent location, once this PEP is accepted)
    https://gist.github.com/markshannon/556ccc0e99517c25a70e2fe551917c03