Reformat PEP 659 to obey 80 column limit. (#2458)

2022-03-22 10:27:32 +00:00 · 2022-03-22 10:27:32 +00:00 · 293dd4e107
parent 04eb44995d
commit 293dd4e107
1 changed files with 192 additions and 112 deletions
--- a/pep-0659.rst
+++ b/pep-0659.rst
@@ -11,21 +11,26 @@ Post-History: 11-May-2021
 Abstract
 ========

-In order to perform well, virtual machines for dynamic languages must specialize the code that they execute 
-to the types and values in the program being run.
-This specialization is often associated with "JIT" compilers, but is beneficial even without machine code generation.
+In order to perform well, virtual machines for dynamic languages must
+specialize the code that they execute to the types and values in the
+program being run. This specialization is often associated with "JIT"
+compilers, but is beneficial even without machine code generation.

-A specializing, adaptive interpreter is one that speculatively specializes on the types or values it is currently operating on,
-and adapts to changes in those types and values.
+A specializing, adaptive interpreter is one that speculatively specializes
+on the types or values it is currently operating on, and adapts to changes
+in those types and values.

-Specialization gives us improved performance, and adaptation allows the interpreter to rapidly change when the pattern of usage in a program alters,
+Specialization gives us improved performance, and adaptation allows the
+interpreter to rapidly change when the pattern of usage in a program alters,
 limiting the amount of additional work caused by mis-specialization.

-This PEP proposes using a specializing, adaptive interpreter that specializes code aggressively, but over a very small region,
-and is able to adjust to mis-specialization rapidly and at low cost.
+This PEP proposes using a specializing, adaptive interpreter that specializes
+code aggressively, but over a very small region, and is able to adjust to
+mis-specialization rapidly and at low cost.

-Adding a specializing, adaptive interpreter to CPython will bring significant performance improvements.
-It is hard to come up with meaningful numbers, as it depends very much on the benchmarks and on work that has not yet happened.
+Adding a specializing, adaptive interpreter to CPython will bring significant
+performance improvements. It is hard to come up with meaningful numbers,
+as it depends very much on the benchmarks and on work that has not yet happened.
 Extensive experimentation suggests speedups of up to 50%.
 Even if the speedup were only 25%, this would still be a worthwhile enhancement.

@@ -33,44 +38,62 @@ Motivation
 ==========

 Python is widely acknowledged as slow.
-Whilst Python will never attain the performance of low-level languages like C, Fortran, or even Java,
-we would like it to be competitive with fast implementations of scripting languages, like V8 for Javascript or luajit for lua.
-Specifically, we want to achieve these performance goals with CPython to benefit all users of Python
-including those unable to use PyPy or other alternative virtual machines.
+Whilst Python will never attain the performance of low-level languages like C,
+Fortran, or even Java, we would like it to be competitive with fast
+implementations of scripting languages, like V8 for JavaScript or LuaJIT for
+Lua.
+Specifically, we want to achieve these performance goals with CPython to
+benefit all users of Python including those unable to use PyPy or
+other alternative virtual machines.

-Achieving these performance goals is a long way off, and will require a lot of engineering effort,
-but we can make a significant step towards those goals by speeding up the interpreter.
-Both academic research and practical implementations have shown that a fast interpreter is a key part of a fast virtual machine.
+Achieving these performance goals is a long way off, and will require a lot of
+engineering effort, but we can make a significant step towards those goals by
+speeding up the interpreter.
+Both academic research and practical implementations have shown that a fast
+interpreter is a key part of a fast virtual machine.

-Typical optimizations for virtual machines are expensive, so a long "warm up" time is required 
-to gain confidence that the cost of optimization is justified.
+Typical optimizations for virtual machines are expensive, so a long "warm up"
+time is required to gain confidence that the cost of optimization is justified.
 In order to get speed-ups rapidly, without noticeable warmup times,
-the VM should speculate that specialization is justified even after a few executions of a function.
-To do that effectively, the interpreter must be able to optimize and deoptimize continually and very cheaply.
+the VM should speculate that specialization is justified even after a few
+executions of a function. To do that effectively, the interpreter must be able
+to optimize and de-optimize continually and very cheaply.

-By using adaptive and speculative specialization at the granularity of individual virtual machine instructions, we get a faster
-interpreter that also generates profiling information for more sophisticated optimizations in the future.
+By using adaptive and speculative specialization at the granularity of
+individual virtual machine instructions,
+we get a faster interpreter that also generates profiling information
+for more sophisticated optimizations in the future.

 Rationale
 =========

-There are many practical ways to speed-up a virtual machine for a dynamic language.
-However, specialization is the most important, both in itself and as an enabler of other optimizations.
-Therefore it makes sense to focus our efforts on specialization first, if we want to improve the performance of CPython.
+There are many practical ways to speed up a virtual machine for a dynamic
+language.
+However, specialization is the most important, both in itself and as an
+enabler of other optimizations.
+Therefore it makes sense to focus our efforts on specialization first,
+if we want to improve the performance of CPython.

-Specialization is typically done in the context of a JIT compiler, but research shows specialization in an interpreter
-can boost performance significantly, even outperforming a naive compiler [1]_.
+Specialization is typically done in the context of a JIT compiler,
+but research shows specialization in an interpreter can boost performance
+significantly, even outperforming a naive compiler [1]_.

-There have been several ways of doing this proposed in the academic literature,
-but most attempt to optimize regions larger than a single bytecode [1]_ [2]_.
-Using larger regions than a single instruction, requires code to handle deoptimization in the middle of a region.
-Specialization at the level of individual bytecodes makes deoptimization trivial, as it cannot occur in the middle of a region.
+There have been several ways of doing this proposed in the academic
+literature, but most attempt to optimize regions larger than a
+single bytecode [1]_ [2]_.
+Using larger regions than a single instruction requires code to handle
+de-optimization in the middle of a region.
+Specialization at the level of individual bytecodes makes de-optimization
+trivial, as it cannot occur in the middle of a region.

-By speculatively specializing individual bytecodes, we can gain significant performance improvements without anything but the most local,
-and trivial to implement, deoptimizations.
+By speculatively specializing individual bytecodes, we can gain significant
+performance improvements without anything but the most local,
+and trivial to implement, de-optimizations.

-The closest approach to this PEP in the literature is "Inline Caching meets Quickening" [3]_.
-This PEP has the advantages of inline caching, but adds the ability to quickly deoptimize making the performance
+The closest approach to this PEP in the literature is
+"Inline Caching meets Quickening" [3]_.
+This PEP has the advantages of inline caching,
+but adds the ability to quickly de-optimize, making the performance
 more robust in cases where specialization fails or is not stable.

 Performance
@@ -78,11 +101,14 @@ Performance

 The expected speedup of 50% can be roughly broken down as follows:

-* In the region of 30% from specialization. Much of that is from specialization of calls,
-  with improvements in instructions that are already specialized such as ``LOAD_ATTR`` and ``LOAD_GLOBAL``
-  contributing much of the remainder. Specialization of operations adds a small amount.
-* About 10% from improved dispatch such as super-instructions and other optimizations enabled by quickening.
-* Further increases in the benefits of other optimizations, as they can exploit, or be exploited by specialization.
+* In the region of 30% from specialization. Much of that is from
+  specialization of calls, with improvements in instructions that are already
+  specialized such as ``LOAD_ATTR`` and ``LOAD_GLOBAL`` contributing much of
+  the remainder. Specialization of operations adds a small amount.
+* About 10% from improved dispatch such as super-instructions
+  and other optimizations enabled by quickening.
+* Further increases in the benefits of other optimizations,
+  as they can exploit, or be exploited by, specialization.

 Implementation
 ==============
@@ -90,12 +116,15 @@ Implementation
 Overview
 --------

-Once any instruction in a code object has executed a few times, that code object will be "quickened" by allocating a new array
-for the bytecode that can be modified at runtime, and is not constrained as the ``code.co_code`` object is.
-From that point onwards, whenever any instruction in that code object is executed, it will use the quickened form.
+Once any instruction in a code object has executed a few times,
+that code object will be "quickened" by allocating a new array for the
+bytecode that can be modified at runtime, and is not constrained as the
+``code.co_code`` object is. From that point onwards, whenever any
+instruction in that code object is executed, it will use the quickened form.

-Any instruction that would benefit from specialization will be replaced by an "adaptive" form of that instruction.
-When executed, the adaptive instructions will specialize themselves in response to the types and values that they see.
+Any instruction that would benefit from specialization will be replaced by an
+"adaptive" form of that instruction. When executed, the adaptive instructions
+will specialize themselves in response to the types and values that they see.

 Quickening
 ----------
@@ -106,62 +135,85 @@ Quickened code has number of advantages over the normal bytecode:

 * It can be changed at runtime
 * It can use super-instructions that span lines and take multiple operands.
-* It does not need to handle tracing as it can fallback to the normal bytecode for that.
+* It does not need to handle tracing as it can fall back to the normal
+  bytecode for that.

-In order that tracing can be supported, and quickening performed quickly, the quickened instruction format should match the normal
-bytecode format: 16-bit instructions of 8-bit opcode followed by 8-bit operand.
+In order that tracing can be supported, and quickening performed quickly,
+the quickened instruction format should match the normal bytecode format:
+16-bit instructions of 8-bit opcode followed by 8-bit operand.
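
The instruction format described above can be sketched in a few lines of
Python. This is only an illustration, not CPython's implementation: the helper
names ``pack_instruction``, ``unpack_instruction`` and ``quicken`` are
hypothetical, and the opcode numbers in the usage note are arbitrary.

```python
def pack_instruction(opcode: int, operand: int) -> bytes:
    """Encode one 16-bit instruction: 8-bit opcode followed by 8-bit operand."""
    if not (0 <= opcode <= 0xFF and 0 <= operand <= 0xFF):
        raise ValueError("opcode and operand must each fit in 8 bits")
    return bytes((opcode, operand))


def unpack_instruction(code, index: int) -> tuple:
    """Decode the instruction at instruction index `index` (not a byte offset)."""
    return code[2 * index], code[2 * index + 1]


def quicken(co_code: bytes) -> bytearray:
    """Copy the immutable bytecode into a mutable array of the same format.

    Assumption: a real implementation would also rewrite selected opcodes;
    here we only show that the copy can be modified while the original cannot.
    """
    return bytearray(co_code)
```

Because the quickened array keeps the same layout, ``unpack_instruction`` works
unchanged on both the original ``bytes`` object and the mutable copy, and
overwriting ``quickened[0]`` changes an opcode without touching the original.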

 Adaptive instructions
 ---------------------

-Each instruction that would benefit from specialization is replaced by an adaptive version during quickening.
-For example, the ``LOAD_ATTR`` instruction would be replaced with ``LOAD_ATTR_ADAPTIVE``.
+Each instruction that would benefit from specialization is replaced by an
+adaptive version during quickening. For example,
+the ``LOAD_ATTR`` instruction would be replaced with ``LOAD_ATTR_ADAPTIVE``.

-Each adaptive instruction maintains a counter, and periodically attempts to specialize itself.
+Each adaptive instruction maintains a counter,
+and periodically attempts to specialize itself.

 Specialization
 --------------

-CPython bytecode contains many bytecodes that represent high-level operations, and would benefit from specialization.
-Examples include ``CALL_FUNCTION``, ``LOAD_ATTR``, ``LOAD_GLOBAL`` and ``BINARY_ADD``.
+CPython bytecode contains many bytecodes that represent high-level operations,
+and would benefit from specialization. Examples include ``CALL_FUNCTION``,
+``LOAD_ATTR``, ``LOAD_GLOBAL`` and ``BINARY_ADD``.

-By introducing a "family" of specialized instructions for each of these instructions allows effective specialization,
+Introducing a "family" of specialized instructions for each of these
+instructions allows effective specialization,
 since each new instruction is specialized to a single task.
-Each family will include an "adaptive" instruction, that maintains a counter and periodically attempts to specialize itself.
-Each family will also include one or more specialized instructions that perform the equivalent
-of the generic operation much faster provided their inputs are as expected.
-Each specialized instruction will maintain a saturating counter which will be incremented whenever the inputs are as expected.
-Should the inputs not be as expected, the counter will be decremented and the generic operation will be performed.
-If the counter reaches the minimum value, the instruction is deoptimized by simply replacing its opcode with the adaptive version.
+Each family will include an "adaptive" instruction,
+that maintains a counter and periodically attempts to specialize itself.
+Each family will also include one or more specialized instructions that
+perform the equivalent of the generic operation much faster provided their
+inputs are as expected.
+Each specialized instruction will maintain a saturating counter which will
+be incremented whenever the inputs are as expected. Should the inputs not
+be as expected, the counter will be decremented and the generic operation
+will be performed.
+If the counter reaches the minimum value, the instruction is de-optimized by
+simply replacing its opcode with the adaptive version.
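
The counter behaviour described above can be sketched as follows. This is a
minimal model under stated assumptions: the limits ``COUNTER_MIN`` and
``COUNTER_MAX``, the initial counter value, and the class and opcode names are
all hypothetical and are not taken from the PEP.

```python
COUNTER_MIN = 0  # hypothetical: reaching this value triggers de-optimization
COUNTER_MAX = 3  # hypothetical: the counter saturates here


class SpecializedInstruction:
    """Toy model of one specialized instruction and its saturating counter."""

    def __init__(self, specialized_opcode: str, adaptive_opcode: str):
        self.opcode = specialized_opcode
        self.adaptive_opcode = adaptive_opcode
        self.counter = COUNTER_MAX  # assumption: start fully confident

    def execute(self, inputs_as_expected: bool) -> str:
        """Return which path ran: the fast path or the generic operation."""
        if inputs_as_expected:
            # Saturating increment: never exceeds COUNTER_MAX.
            self.counter = min(self.counter + 1, COUNTER_MAX)
            return "fast"
        # Miss: decrement, then perform the generic operation.
        self.counter = max(self.counter - 1, COUNTER_MIN)
        if self.counter == COUNTER_MIN:
            # De-optimize by swapping the opcode back to the adaptive form.
            self.opcode = self.adaptive_opcode
        return "generic"
```

With these limits, a single miss leaves the instruction specialized, while
three consecutive misses from a saturated counter replace the opcode with the
adaptive version.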

 Ancillary data
 --------------

-Most families of specialized instructions will require more information than can fit in an 8-bit operand.
-To do this, an array of specialization data entries will be maintained alongside the new instruction array.
-For instructions that need specialization data, the operand in the quickened array will serve as a partial index, 
-along with the offset of the instruction, to find the first specialization data entry for that instruction.
-Each entry will be 8 bytes (for a 64 bit machine). The data in an entry, and the number of entries needed, will vary from instruction to instruction.
+Most families of specialized instructions will require more information than
+can fit in an 8-bit operand. To do this, an array of specialization data entries
+will be maintained alongside the new instruction array. For instructions that
+need specialization data, the operand in the quickened array will serve as a
+partial index, along with the offset of the instruction, to find the first
+specialization data entry for that instruction.
+Each entry will be 8 bytes (for a 64 bit machine). The data in an entry,
+and the number of entries needed, will vary from instruction to instruction.

 Data layout
 -----------

-Quickened instructions will be stored in an array (it is neither necessary not desirable to store them in a Python object) with the same
-format as the original bytecode. Ancillary data will be stored in a separate array.
+Quickened instructions will be stored in an array (it is neither necessary nor
+desirable to store them in a Python object) with the same format as the
+original bytecode. Ancillary data will be stored in a separate array.

-Each instruction will use 0 or more data entries. Each instruction within a family must have the same amount of data allocated, although some
-instructions may not use all of it. Instructions that cannot be specialized, e.g. ``POP_TOP``, do not need any entries.
+Each instruction will use 0 or more data entries.
+Each instruction within a family must have the same amount of data allocated,
+although some instructions may not use all of it.
+Instructions that cannot be specialized, e.g. ``POP_TOP``,
+do not need any entries.
 Experiments show that 25% to 30% of instructions can be usefully specialized.
-Different families will need different amounts of data, but most need 2 entries (16 bytes on a 64 bit machine).
+Different families will need different amounts of data,
+but most need 2 entries (16 bytes on a 64 bit machine).

-In order to support larger functions than 256 instructions, we compute the offset of the first data entry for instructions
+In order to support larger functions than 256 instructions,
+we compute the offset of the first data entry for instructions
 as ``(instruction offset)//2 + (quickened operand)``.
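
As a worked example of the formula above, the following sketch computes the
index of the first data entry. The function name ``first_entry_index`` is
hypothetical, and the instruction offset is taken exactly as the formula states
it, without assuming more about its units.

```python
def first_entry_index(instruction_offset: int, quickened_operand: int) -> int:
    """Compute (instruction offset)//2 + (quickened operand).

    The operand is limited to 8 bits, so on its own it could address only
    256 entries; adding half the instruction offset extends the reach.
    """
    if not 0 <= quickened_operand <= 0xFF:
        raise ValueError("quickened operand must fit in 8 bits")
    if instruction_offset < 0:
        raise ValueError("instruction offset must be non-negative")
    return instruction_offset // 2 + quickened_operand
```

For instance, an instruction at offset 1000 with quickened operand 12 would use
the entry at index 512. No memory read is needed to find it, only integer
arithmetic on values the interpreter already holds.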

 Compared to the opcache in Python 3.10, this design:

-* is faster; it requires no memory reads to compute the offset. 3.10 requires two reads, which are dependent.
-* uses much less memory, as the data can be different sizes for different instruction families, and doesn't need an additional array of offsets.
-* can support much larger functions, up to about 5000 instructions per function. 3.10 can support about 1000.
+* is faster; it requires no memory reads to compute the offset.
+  3.10 requires two reads, which are dependent.
+* uses much less memory, as the data can be different sizes for different
+  instruction families, and doesn't need an additional array of offsets.
+* can support much larger functions, up to about 5000 instructions
+  per function. 3.10 can support about 1000.


 Example families of instructions
@@ -170,64 +222,86 @@ Example families of instructions
 CALL_FUNCTION
 '''''''''''''

-The ``CALL_FUNCTION`` instruction calls the (N+1)th item on the stack with top N items on the stack as arguments.
+The ``CALL_FUNCTION`` instruction calls the (N+1)th item on the stack with
+top N items on the stack as arguments.

-This is an obvious candidate for specialization. For example, the call in ``len(x)`` is represented as the bytecode ``CALL_FUNCTION 1``.
-In this case we would always expect the object ``len`` to be the function. We probably don't want to specialize for ``len``
-(although we might for ``type`` and ``isinstance``), but it would be beneficial to specialize for builtin functions taking a single argument.
-A fast check that the underlying function is a builtin function taking a single argument (``METHOD_O``) would allow us to avoid a
-sequence of checks for number of parameters and keyword arguments.
+This is an obvious candidate for specialization. For example, the call in
+``len(x)`` is represented as the bytecode ``CALL_FUNCTION 1``.
+In this case we would always expect the object ``len`` to be the function.
+We probably don't want to specialize for ``len``
+(although we might for ``type`` and ``isinstance``), but it would be beneficial
+to specialize for builtin functions taking a single argument.
+A fast check that the underlying function is a builtin function taking a single
+argument (``METH_O``) would allow us to avoid a sequence of checks for number
+of parameters and keyword arguments.

-``CALL_FUNCTION_ADAPTIVE`` would track how often it is executed, and call the ``call_function_optimize`` when executed enough times, or jump
-to ``CALL_FUNCTION`` otherwise.
-When optimizing, the kind of the function would be checked and if a suitable specialized instruction was found,
+``CALL_FUNCTION_ADAPTIVE`` would track how often it is executed, and call the
+``call_function_optimize`` function when executed enough times, or jump to
+``CALL_FUNCTION`` otherwise. When optimizing, the kind of the function would be
+checked and, if a suitable specialized instruction was found,
 it would replace ``CALL_FUNCTION_ADAPTIVE`` in place.

 Specializations might include:

-* ``CALL_FUNCTION_PY_SIMPLE``: Calls to Python functions with exactly matching parameters.
-* ``CALL_FUNCTION_PY_DEFAULTS``: Calls to Python functions with more parameters and default values.
-  Since the exact number of defaults needed is known, the instruction needs to do no additional checking or computation; just copy some defaults.
-* ``CALL_BUILTIN_O``: The example given above for calling builtin methods taking exactly one argument.
-* ``CALL_BUILTIN_VECTOR``: For calling builtin function taking vector arguments.
+* ``CALL_FUNCTION_PY_SIMPLE``: Calls to Python functions with
+  exactly matching parameters.
+* ``CALL_FUNCTION_PY_DEFAULTS``: Calls to Python functions with more
+  parameters and default values. Since the exact number of defaults needed is
+  known, the instruction needs to do no additional checking or computation;
+  just copy some defaults.
+* ``CALL_BUILTIN_O``: The example given above for calling builtin methods
+  taking exactly one argument.
+* ``CALL_BUILTIN_VECTOR``: For calling builtin functions taking
+  vector arguments.

 Note how this allows optimizations that complement other optimizations.
-For example, if the Python and C call stacks were decoupled and the data stack were contiguous,
-then Python-to-Python calls could be made very fast.
+For example, if the Python and C call stacks were decoupled and the data stack
+were contiguous, then Python-to-Python calls could be made very fast.

 LOAD_GLOBAL
 '''''''''''

-The ``LOAD_GLOBAL`` instruction looks up a name in the global namespace and then, if not present in the global namespace,
+The ``LOAD_GLOBAL`` instruction looks up a name in the global namespace
+and then, if not present in the global namespace,
 looks it up in the builtins namespace.
-In 3.9 the C code for the ``LOAD_GLOBAL`` includes code to check to see whether the whole code object should be modified to add a cache,
-whether either the global or builtins namespace, code to lookup the value in a cache, and fallback code.
-This makes it complicated and bulky. It also performs many redundant operations even when supposedly optimized.
+In 3.9 the C code for ``LOAD_GLOBAL`` includes code to check whether the
+whole code object should be modified to add a cache, whether either the
+global or builtins namespace has changed, code to look up the value in a
+cache, and fallback code.
+This makes it complicated and bulky.
+It also performs many redundant operations even when supposedly optimized.

-Using a family of instructions makes the code more maintainable and faster, as each instruction only needs to handle one concern.
+Using a family of instructions makes the code more maintainable and faster,
+as each instruction only needs to handle one concern.

 Specializations would include:

 * ``LOAD_GLOBAL_ADAPTIVE`` would operate like ``CALL_FUNCTION_ADAPTIVE`` above.
-* ``LOAD_GLOBAL_MODULE`` can be specialized for the case where the value is in the globals namespace.
-  After checking that the keys of the namespace have not changed, it can load the value from the stored index.
-* ``LOAD_GLOBAL_BUILTIN``  can be specialized for the case where the value is in the builtins namespace.
-  It needs to check that the keys of the global namespace have not been added to, and that the builtins namespace has not changed.
-  Note that we don't care if the values of the global namespace have changed, just the keys.
+* ``LOAD_GLOBAL_MODULE`` can be specialized for the case where the value is in
+  the globals namespace. After checking that the keys of the namespace have
+  not changed, it can load the value from the stored index.
+* ``LOAD_GLOBAL_BUILTIN`` can be specialized for the case where the value is
+  in the builtins namespace. It needs to check that the keys of the global
+  namespace have not been added to, and that the builtins namespace has not
+  changed. Note that we don't care if the values of the global namespace
+  have changed, just the keys.
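
The guard used by ``LOAD_GLOBAL_MODULE`` can be modelled roughly as below. This
is a sketch under assumptions, not the PEP's implementation: the ``Namespace``
class, its ``keys_version`` field and the ``(hit, value)`` return convention
are hypothetical stand-ins for whatever mechanism detects a change of keys.

```python
class Namespace:
    """Toy namespace: values live in a list; a version tracks key changes."""

    def __init__(self):
        self.index_of = {}     # name -> position in self.values
        self.values = []
        self.keys_version = 0  # bumped only when the set of keys changes

    def set(self, name, value):
        if name in self.index_of:
            # Existing key: only the value changes, the version does not.
            self.values[self.index_of[name]] = value
        else:
            self.index_of[name] = len(self.values)
            self.values.append(value)
            self.keys_version += 1


class LoadGlobalModule:
    """Specialized load: remembers the keys version and the value's index."""

    def __init__(self, namespace: Namespace, name: str):
        self.expected_version = namespace.keys_version
        self.index = namespace.index_of[name]

    def load(self, namespace: Namespace):
        """Return (True, value) on a hit, or (False, None) if the guard fails."""
        if namespace.keys_version != self.expected_version:
            return (False, None)  # caller would fall back to the generic lookup
        return (True, namespace.values[self.index])
```

Rebinding an existing global keeps the fast path valid, since only the value
changed. Adding a new key bumps the version, so the next load fails the guard
and would take the generic path.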

 See [4]_ for a full implementation.

 .. note::

-  This PEP outlines the mechanisms for managing specialization, and does not specify the particular optimizations to be applied.
-  The above scheme is just one possible scheme. Many others are possible and may well be better.
+  This PEP outlines the mechanisms for managing specialization, and does not
+  specify the particular optimizations to be applied.
+  The above scheme is just one possible scheme.
+  Many others are possible and may well be better.

 Compatibility
 =============

 There will be no change to the language, library or API.

-The only way that users will be able to detect the presence of the new interpreter is through timing execution, the use of debugging tools,
+The only way that users will be able to detect the presence of the new
+interpreter is through timing execution, the use of debugging tools,
 or measuring memory use.

 Costs
@@ -236,13 +310,14 @@ Costs
 Memory use
 ----------

-An obvious concern with any scheme that performs any sort of caching is "how much more memory does it use?".
+An obvious concern with any scheme that performs any sort of caching is
+"how much more memory does it use?".
 The short answer is "none".

 Comparing memory use to 3.10
 ''''''''''''''''''''''''''''
-The following table shows the additional bytes per instruction to support the 3.10 opcache
-or the proposed adaptive interpreter, on a 64 bit machine.
+The following table shows the additional bytes per instruction to support the
+3.10 opcache or the proposed adaptive interpreter, on a 64 bit machine.

 ================   =====  ========  =====  =====
 Version           3.10   3.10 opt   3.11   3.11
@@ -256,10 +331,13 @@ or the proposed adaptive interpreter, on a 64 bit machine.
 ================   =====  ========  =====  =====

 ``3.10`` is the current version of 3.10 which uses 32 bytes per entry.
-``3.10 opt`` is a hypothetical improved version of 3.10 that uses 24 bytes per entry.
+``3.10 opt`` is a hypothetical improved version of 3.10 that uses 24 bytes
+per entry.

-Even if one third of all instructions were specialized (a high proportion), then the memory use is still less than
-that of 3.10. With a more realistic 25%, then memory use is basically the same as the hypothetical improved version of 3.10.
+Even if one third of all instructions were specialized (a high proportion),
+then the memory use is still less than that of 3.10.
+With a more realistic 25%, memory use is basically the same as the
+hypothetical improved version of 3.10.


 Security Implications
@@ -277,7 +355,8 @@ Too many to list.
 References
 ==========

-.. [1] The construction of high-performance virtual machines for dynamic languages, Mark Shannon 2010.
+.. [1] The construction of high-performance virtual machines for
+  dynamic languages, Mark Shannon 2010.
  http://theses.gla.ac.uk/2975/1/2011shannonphd.pdf

 .. [2] Dynamic Interpretation for Dynamic Scripting Languages
@@ -286,7 +365,8 @@ References
 .. [3] Inline Caching meets Quickening
  https://www.unibw.de/ucsrl/pubs/ecoop10.pdf/view

-.. [4] Adaptive specializing examples (This will be moved to a more permanent location, once this PEP is accepted)
+.. [4] Adaptive specializing examples
+  (This will be moved to a more permanent location, once this PEP is accepted)
  https://gist.github.com/markshannon/556ccc0e99517c25a70e2fe551917c03