PEP 659: Update to describe inline caches (#2462)

Mark Shannon 2022-03-25 13:09:22 +00:00 committed by GitHub
parent a5edad1428
commit f3aad219e9
1 changed file with 119 additions and 118 deletions


@ -99,16 +99,13 @@ more robust in cases where specialization fails or is not stable.
Performance
-----------

The speedup from specialization is hard to determine, as many specializations
depend on other optimizations. Speedups seem to be in the range 10% - 60%.

* Most of the speedup comes directly from specialization. The largest
  contributors are speedups to attribute lookup, global variables, and calls.
* A small, but useful, fraction is from improved dispatch such as
  super-instructions and other optimizations enabled by quickening.

Implementation
==============
@ -116,30 +113,29 @@ Implementation
Overview
--------

Any instruction that would benefit from specialization will be replaced by an
"adaptive" form of that instruction. When executed, the adaptive instructions
will specialize themselves in response to the types and values that they see.
This process is known as "quickening".

Once an instruction in a code object has executed enough times,
that instruction will be "specialized" by replacing it with a new instruction
that is expected to execute faster for that operation.
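
For illustration only: on CPython 3.11 and later, the ``dis`` module can show
the specialized instructions in warmed-up code. The ``adaptive`` argument
exists in 3.11+, but the exact instruction names printed are implementation
details, not part of this PEP::

    import dis

    def add(x, y):
        return x + y

    # Warm the function so its adaptive instructions can specialize.
    for _ in range(1000):
        add(1, 2)

    # adaptive=True shows the quickened, specialized instructions
    # (e.g. a specialized form of BINARY_OP) instead of the generic ones.
    dis.dis(add, adaptive=True)
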
Quickening
----------
Quickening is the process of replacing slow instructions with faster variants.

Quickened code has a number of advantages over immutable bytecode:

* It can be changed at runtime.
* It can use super-instructions that span lines and take multiple operands.
* It does not need to handle tracing as it can fall back to the original
  bytecode for that.

In order that tracing can be supported, the quickened instruction format
should match the immutable, user-visible bytecode format:
16-bit instructions of 8-bit opcode followed by 8-bit operand.
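
This format is visible from Python. A minimal sketch, using only the stable
``dis`` module, that prints a function's bytecode as (opcode, operand) pairs::

    import dis

    def f(o):
        return o.attr

    # co_code is a bytes object; each instruction occupies 16 bits:
    # an 8-bit opcode followed by an 8-bit operand.
    code = f.__code__.co_code
    for i in range(0, len(code), 2):
        opcode, operand = code[i], code[i + 1]
        print(f"{dis.opname[opcode]:<20} {operand}")
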
Adaptive instructions
---------------------
@ -149,21 +145,21 @@ Each instruction that would benefit from specialization is replaced by an
adaptive version during quickening. For example,
the ``LOAD_ATTR`` instruction would be replaced with ``LOAD_ATTR_ADAPTIVE``.

Each adaptive instruction periodically attempts to specialize itself.

Specialization
--------------

CPython bytecode contains many instructions that represent high-level
operations, and would benefit from specialization. Examples include ``CALL``,
``LOAD_ATTR``, ``LOAD_GLOBAL`` and ``BINARY_ADD``.

Introducing a "family" of specialized instructions for each of these
instructions allows effective specialization,
since each new instruction is specialized to a single task.

Each family will include an "adaptive" instruction that maintains a counter
and attempts to specialize itself when that counter reaches zero.

Each family will also include one or more specialized instructions that
perform the equivalent of the generic operation much faster provided their
inputs are as expected.
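
A toy model of this counter-driven behaviour, written in Python purely for
illustration (the real interpreter is C; the threshold and names here are
hypothetical)::

    class AdaptiveInstruction:
        """Counts down, then asks a specializer for a faster variant;
        until one is found it keeps executing the generic operation."""

        THRESHOLD = 53  # hypothetical; CPython tunes its own counters

        def __init__(self, generic, specializer):
            self.counter = self.THRESHOLD
            self.generic = generic
            self.specializer = specializer
            self.execute = self._adaptive

        def _adaptive(self, *args):
            if self.counter == 0:
                specialized = self.specializer(*args)
                if specialized is not None:
                    self.execute = specialized   # "replace in place"
                self.counter = self.THRESHOLD
            else:
                self.counter -= 1
            return self.generic(*args)
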
@ -178,85 +174,41 @@ Ancillary data
--------------
Most families of specialized instructions will require more information than
can fit in an 8-bit operand. To do this, a number of 16-bit entries immediately
following the instruction are used to store this data. This is a form of inline
cache, an "inline data cache". Unspecialized, or adaptive, instructions will
use the first entry of this cache as a counter, and simply skip over the others.
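
On CPython 3.11 and later this layout can be observed from Python:
``show_caches=True`` makes ``dis`` print the otherwise hidden ``CACHE``
entries that sit inline after instructions such as ``LOAD_ATTR``
(the output format is an implementation detail)::

    import dis

    def f(o):
        return o.attr

    # Each CACHE line is one 16-bit inline entry reserved for
    # specialization data (counter, version tags, index, ...).
    dis.dis(f, show_caches=True)
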
Example families of instructions
--------------------------------

LOAD_ATTR
'''''''''

The ``LOAD_ATTR`` instruction loads the named attribute of the object on top
of the stack, then replaces the object on top of the stack with the attribute.

This is an obvious candidate for specialization. Attributes might belong to
a normal instance, a class, a module, or one of many other special cases.

``LOAD_ATTR`` would initially be quickened to ``LOAD_ATTR_ADAPTIVE`` which
would track how often it is executed, and call the ``_Py_Specialize_LoadAttr``
internal function when executed enough times, or jump to the original
``LOAD_ATTR`` instruction to perform the load. When optimizing, the kind
of the attribute would be examined, and if a suitable specialized instruction
was found, it would replace ``LOAD_ATTR_ADAPTIVE`` in place.

Specializations for ``LOAD_ATTR`` might include:

* ``LOAD_ATTR_INSTANCE_VALUE``: A common case where the attribute is stored in
  the object's value array, and not shadowed by an overriding descriptor.
* ``LOAD_ATTR_MODULE``: Load an attribute from a module.
* ``LOAD_ATTR_SLOT``: Load an attribute from an object whose
  class defines ``__slots__``.

The ``LOAD_ATTR_INSTANCE_VALUE`` specialization works well with the "lazy
dictionary" used for many objects.
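
For illustration, attribute loads that plausibly match each specialization
above (whether a given load actually specializes is an implementation
detail)::

    import math

    class Point:
        # Plain instance attribute: a LOAD_ATTR_INSTANCE_VALUE candidate.
        def __init__(self):
            self.x = 1.0

    class SlottedPoint:
        # ``__slots__`` attribute: a LOAD_ATTR_SLOT candidate.
        __slots__ = ("x",)

        def __init__(self):
            self.x = 1.0

    def use(p, q):
        # math.pi is a module attribute: a LOAD_ATTR_MODULE candidate.
        return p.x + q.x + math.pi
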
LOAD_GLOBAL
'''''''''''
@ -276,7 +228,7 @@ as each instruction only needs to handle one concern.
Specializations would include:

* ``LOAD_GLOBAL_ADAPTIVE`` would operate like ``LOAD_ATTR_ADAPTIVE`` above.
* ``LOAD_GLOBAL_MODULE`` can be specialized for the case where the value is in
the globals namespace. After checking that the keys of the namespace have
not changed, it can load the value from the stored index.
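
A toy sketch, in Python and for illustration only, of the guard-then-index
pattern described for ``LOAD_GLOBAL_MODULE`` above. CPython uses a cheap
dictionary "keys version" tag rather than re-reading the keys; the names
below are hypothetical::

    GLOBALS = {"limit": 10, "step": 2}

    class Deoptimize(Exception):
        """Raised when a guard fails and the generic path must run."""

    def specialize_load_global(namespace, name):
        keys_snapshot = tuple(namespace)   # stands in for the version tag
        index = keys_snapshot.index(name)

        def load_global_module():
            # Guard: if the keys changed, fall back to the generic lookup.
            if tuple(namespace) != keys_snapshot:
                raise Deoptimize
            # Fast path: load the value for the remembered key.
            return namespace[keys_snapshot[index]]

        return load_global_module

    load_limit = specialize_load_global(GLOBALS, "limit")
    print(load_limit())  # -> 10
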
@ -292,8 +244,8 @@ See [4]_ for a full implementation.
This PEP outlines the mechanisms for managing specialization, and does not
specify the particular optimizations to be applied.

It is likely that details, or even the entire implementation, may change
as the code is further developed.

Compatibility
=============
@ -312,32 +264,42 @@ Memory use
An obvious concern with any scheme that performs any sort of caching is
"how much more memory does it use?".
The short answer is "none".
The short answer is "not that much".
Comparing memory use to 3.10
''''''''''''''''''''''''''''

CPython 3.10 uses 2 bytes per instruction, until the execution count
reaches ~2000, when it allocates another byte per instruction and
32 bytes per instruction with a cache (``LOAD_GLOBAL`` and ``LOAD_ATTR``).

The following table shows the additional bytes per instruction to support the
3.10 opcache or the proposed adaptive interpreter, on a 64 bit machine.

================ ========== ========== ======
Version          3.10 cold  3.10 hot   3.11
Specialised      0%         ~15%       ~25%
---------------- ---------- ---------- ------
code             2          2          2
opcache_map      0          1          0
opcache/data     0          4.8        4
---------------- ---------- ---------- ------
Total            2          7.8        6
================ ========== ========== ======

``3.10 cold`` is before the code has reached the ~2000 limit.
``3.10 hot`` shows the cache use once the threshold is reached.

The relative memory use depends on how much code is "hot" enough to trigger
creation of the cache in 3.10. The break even point, where the memory used
by 3.10 is the same as for 3.11, is when ~70% of instructions are in hot
code: 3.10 then uses about 2 + 0.7 × 5.8 ≈ 6 bytes per instruction,
matching 3.11's constant 6.

It is also worth noting that the actual bytecode is only part of a code
object. Code objects also include names, constants and quite a lot of
debugging information.

In summary, for most applications where many of the functions are relatively
unused, 3.11 will consume more memory than 3.10, but not by much.

Security Implications
=====================
@ -349,8 +311,46 @@ None
Rejected Ideas
==============

By implementing a specializing adaptive interpreter with inline data caches,
we are implicitly rejecting many alternative ways to optimize CPython.
However, it is worth emphasizing that some ideas, such as just-in-time
compilation, have not been rejected, merely deferred.

Storing data caches before the bytecode.
----------------------------------------
An earlier implementation of this PEP for 3.11 alpha used a different caching
scheme as described below:

Quickened instructions will be stored in an array (it is neither necessary nor
desirable to store them in a Python object) with the same format as the
original bytecode. Ancillary data will be stored in a separate array.

Each instruction will use 0 or more data entries.
Each instruction within a family must have the same amount of data allocated,
although some instructions may not use all of it.
Instructions that cannot be specialized, e.g. ``POP_TOP``,
do not need any entries.
Experiments show that 25% to 30% of instructions can be usefully specialized.
Different families will need different amounts of data,
but most need 2 entries (16 bytes on a 64 bit machine).

In order to support functions larger than 256 instructions,
we compute the offset of the first data entry for instructions
as ``(instruction offset)//2 + (quickened operand)``.

Compared to the opcache in Python 3.10, this design:

* is faster; it requires no memory reads to compute the offset.
  3.10 requires two reads, which are dependent.
* uses much less memory, as the data can be different sizes for different
  instruction families, and doesn't need an additional array of offsets.
* can support much larger functions, up to about 5000 instructions
  per function. 3.10 can support about 1000.

We rejected this scheme as the inline cache approach is both faster
and simpler.
References
==========
@ -365,10 +365,11 @@ References
.. [3] Inline Caching meets Quickening
https://www.unibw.de/ucsrl/pubs/ecoop10.pdf/view

.. [4] The adaptive and specialized instructions are implemented in
   https://github.com/python/cpython/blob/main/Python/ceval.c

   The optimizations are implemented in:
   https://github.com/python/cpython/blob/main/Python/specialize.c

Copyright
=========