From cf5741c181735d8bd62ec687693c8cd510c7ef15 Mon Sep 17 00:00:00 2001
From: Carl Meyer
Date: Sun, 26 Feb 2023 18:11:03 -0700
Subject: [PATCH] PEP 709: Inlined comprehensions (#3029)

Co-authored-by: C.A.M. Gerlach
Co-authored-by: Jelle Zijlstra
---
 .github/CODEOWNERS | 1 +
 pep-0709.rst | 320 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 321 insertions(+)
 create mode 100644 pep-0709.rst

diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS
index 5d26afea5..d39689cc1 100644
--- a/.github/CODEOWNERS
+++ b/.github/CODEOWNERS
@@ -588,6 +588,7 @@ pep-0704.rst @brettcannon @pradyunsg
 # pep-0705.rst
 pep-0706.rst @encukou
 pep-0708.rst @dstufft
+pep-0709.rst @carljm
 # ...
 # pep-0754.txt
 # ...
diff --git a/pep-0709.rst b/pep-0709.rst
new file mode 100644
index 000000000..bf9caebd1
--- /dev/null
+++ b/pep-0709.rst
@@ -0,0 +1,320 @@
+PEP: 709
+Title: Inlined comprehensions
+Author: Carl Meyer
+Sponsor: Guido van Rossum
+Discussions-To: https://discuss.python.org/t/pep-709-inlined-comprehensions/24240
+Status: Draft
+Type: Standards Track
+Content-Type: text/x-rst
+Created: 24-Feb-2023
+Python-Version: 3.12
+Post-History: `25-Feb-2023 `__
+
+
+Abstract
+========
+
+Comprehensions are currently compiled as nested functions, which provides
+isolation of the comprehension's iteration variable, but is inefficient at
+runtime. This PEP proposes to inline list, dictionary, and set comprehensions
+into the function where they are defined, and provide the expected isolation by
+pushing/popping clashing locals on the stack. This change makes comprehensions
+much faster: up to 2x faster for a microbenchmark of a comprehension alone,
+translating to an 11% speedup for one sample benchmark derived from real-world
+code that makes heavy use of comprehensions in the context of doing actual
+work.
+
+
+Motivation
+==========
+
+Comprehensions are a popular and widely-used feature of the Python language.
+The nested-function compilation of comprehensions optimizes for compiler
+simplicity at the expense of performance of user code. It is possible to
+provide near-identical semantics (see `Backwards Compatibility`_) with much
+better runtime performance for all users of comprehensions, with only a small
+increase in compiler complexity.
+
+
+Rationale
+=========
+
+Inlining is a common compiler optimization in many languages. Generalized
+inlining of function calls at compile time in Python is near-impossible, since
+call targets may be patched at runtime. Comprehensions are a special case,
+where we have a call target known statically in the compiler that can neither
+be patched (barring undocumented and unsupported fiddling with bytecode
+directly) nor escape.
+
+Inlining also permits other compiler optimizations of bytecode to be more
+effective, because they can now "see through" the comprehension bytecode,
+instead of it being an opaque call.
+
+Normally a performance improvement would not require a PEP. In this case, the
+simplest and most efficient implementation results in some user-visible effects,
+so this is not just a performance improvement, it is a (small) change to the
+language.
+
+
+Specification
+=============
+
+Given a simple comprehension::
+
+    def f(lst):
+        return [x for x in lst]
+
+The compiler currently emits the following bytecode for the function ``f``:
+
+.. code-block:: text
+
+      1           0 RESUME                   0
+
+      2           2 LOAD_CONST               1 (<code object <listcomp> at 0x...>)
+                  4 MAKE_FUNCTION            0
+                  6 LOAD_FAST                0 (lst)
+                  8 GET_ITER
+                 10 CALL                     0
+                 20 RETURN_VALUE
+
+    Disassembly of <code object <listcomp> at 0x...>:
+      2           0 RESUME                   0
+                  2 BUILD_LIST               0
+                  4 LOAD_FAST                0 (.0)
+            >>    6 FOR_ITER                 4 (to 18)
+                 10 STORE_FAST               1 (x)
+                 12 LOAD_FAST                1 (x)
+                 14 LIST_APPEND              2
+                 16 JUMP_BACKWARD            6 (to 6)
+            >>   18 END_FOR
+                 20 RETURN_VALUE
+
+The bytecode for the comprehension is in a separate code object.
+Each time ``f()`` is called, a new single-use function object is allocated
+(by ``MAKE_FUNCTION``), called (allocating and then destroying a new frame on
+the Python stack), and then immediately thrown away.
+
+Under this PEP, the compiler will emit the following bytecode for ``f()``
+instead:
+
+.. code-block:: text
+
+      1           0 RESUME                   0
+
+      2           2 LOAD_FAST                0 (lst)
+                  4 GET_ITER
+                  6 LOAD_FAST_AND_CLEAR      1 (x)
+                  8 SWAP                     2
+                 10 BUILD_LIST               0
+                 12 SWAP                     2
+            >>   14 FOR_ITER                 4 (to 26)
+                 18 STORE_FAST               1 (x)
+                 20 LOAD_FAST                1 (x)
+                 22 LIST_APPEND              2
+                 24 JUMP_BACKWARD            6 (to 14)
+            >>   26 END_FOR
+                 28 SWAP                     2
+                 30 STORE_FAST               1 (x)
+                 32 RETURN_VALUE
+
+There is no longer a separate code object, nor creation of a single-use function
+object, nor any need to create and destroy a Python frame.
+
+Isolation of the ``x`` iteration variable is achieved by the combination of the
+new ``LOAD_FAST_AND_CLEAR`` opcode at offset ``6``, which saves any outer value
+of ``x`` on the stack before running the comprehension, and ``30 STORE_FAST``,
+which restores the outer value of ``x`` (if any) after running the
+comprehension.
+
+If the comprehension accesses variables from the outer scope, inlining avoids
+the need to place these variables in a cell, allowing the comprehension (and all
+other code in the outer function) to access them as normal fast locals instead.
+This provides further performance gains.
+
+Only comprehensions occurring inside functions, where fast-locals
+(``LOAD_FAST/STORE_FAST``) are used, will be inlined. Module-level
+comprehensions will continue to create and call a function.
+
+Generator expressions are currently never inlined in the reference
+implementation of this PEP. In the future, some generator expressions may be
+inlined, where the returned generator object does not leak.
+
+
+Backwards Compatibility
+=======================
+
+Comprehension inlining will cause the following visible behavior changes.
+No changes in the standard library or test suite were necessary to adapt to
+these changes in the implementation, suggesting the impact in user code is
+likely to be minimal.
+
+Specialized tools depending on undocumented details of compiler bytecode output
+may of course be affected in ways beyond the below, but these tools already must
+adapt to bytecode changes in each Python version.
+
+locals() includes outer variables
+---------------------------------
+
+Calling ``locals()`` within a comprehension will include all locals of the
+function containing the comprehension. E.g. given the following function::
+
+    def f(lst):
+        return [locals() for x in lst]
+
+Calling ``f([1])`` in current Python will return::
+
+    [{'.0': <list_iterator object at 0x...>, 'x': 1}]
+
+where ``.0`` is an internal implementation detail: the synthetic sole argument
+to the comprehension "function".
+
+Under this PEP, it will instead return::
+
+    [{'lst': [1], 'x': 1}]
+
+This now includes the outer ``lst`` variable as a local, and eliminates the
+synthetic ``.0``.
+
+No comprehension frame in tracebacks
+------------------------------------
+
+Under this PEP, a comprehension will no longer have its own dedicated frame in
+a stack trace. For example, given this function::
+
+    def g():
+        raise RuntimeError("boom")
+
+    def f():
+        return [g() for x in [1]]
+
+Currently, calling ``f()`` results in the following traceback:
+
+.. code-block:: text
+
+    Traceback (most recent call last):
+      File "<stdin>", line 1, in <module>
+      File "<stdin>", line 5, in f
+      File "<stdin>", line 5, in <listcomp>
+      File "<stdin>", line 2, in g
+    RuntimeError: boom
+
+Note the dedicated frame for ``<listcomp>``.
+
+Under this PEP, the traceback looks like this instead:
+
+.. code-block:: text
+
+    Traceback (most recent call last):
+      File "<stdin>", line 1, in <module>
+      File "<stdin>", line 5, in f
+      File "<stdin>", line 2, in g
+    RuntimeError: boom
+
+There is no longer an extra frame for the list comprehension.
+The frame for the ``f`` function has the correct line number for the
+comprehension, however, so this simply makes the traceback more compact
+without losing any useful information.
+
+It is theoretically possible that code using warnings with the ``stacklevel``
+argument could observe a behavior change due to the frame stack change. In
+practice, however, this seems unlikely. It would require a warning raised in
+library code that is always called through a comprehension in that same
+library, where the warning is using a ``stacklevel`` of 3+ to bypass the
+comprehension and its containing function and point to a calling frame outside
+the library. In such a scenario it would usually be simpler and more reliable
+to raise the warning closer to the calling code and bypass fewer frames.
+
+
+UnboundLocalError instead of NameError
+--------------------------------------
+
+Although the value of the comprehension iteration variable is saved and
+restored to provide isolation, it still becomes a local variable of the outer
+function under this PEP. This implies a small behavior change in a function
+where the comprehension iteration variable is accessed outside the
+comprehension without ever being set outside the comprehension::
+
+    def f(lst):
+        items = [x for x in lst]
+        return x
+
+Under this PEP, calling ``f()`` will raise ``UnboundLocalError``, where
+currently it raises ``NameError``. ``UnboundLocalError`` is a subclass of
+``NameError``, so this should not impact code catching ``NameError``.
+
+
+How to Teach This
+=================
+
+It is not intuitively obvious that comprehension syntax will or should result
+in creation and call of a nested function. For new users not already accustomed
+to the prior behavior, I suspect the new behavior in this PEP will be more
+intuitive and require less explanation. ("Why is there a ``<listcomp>`` line in
+my traceback when I didn't define any such function?
+What is this ``.0`` variable I see in ``locals()``?")
+
+
+Security Implications
+=====================
+
+None known.
+
+
+Reference Implementation
+========================
+
+This PEP has a reference implementation in the form of `a PR against the CPython main
+branch `_ which passes all tests.
+
+The reference implementation performs the micro-benchmark ``./python -m pyperf
+timeit -s 'l = [1]' '[x for x in l]'`` 1.96x faster than the ``main`` branch (in a
+build compiled with ``--enable-optimizations``.)
+
+The reference implementation performs the ``comprehensions`` benchmark in the
+`pyperformance `_ benchmark suite
+(which is not a micro-benchmark of comprehensions alone, but tests
+real-world-derived code doing realistic work using comprehensions) 11% faster
+than ``main`` branch (again in optimized builds). Other benchmarks in
+pyperformance (none of which use comprehensions heavily) don't show any impact
+outside the noise.
+
+The implementation has no impact on non-comprehension code.
+
+
+Rejected Ideas
+==============
+
+More efficient comprehension calling, without inlining
+------------------------------------------------------
+
+An `alternate approach `_
+introduces a new opcode for "calling" a comprehension in streamlined fashion
+without the need to create a throwaway function object, but still creating a new
+Python frame. This avoids all of the visible effects listed under `Backwards
+Compatibility`_, and provides roughly half of the performance benefit (1.5x
+improvement on the microbenchmark, 4% improvement on ``comprehensions``
+benchmark in pyperformance.) It also requires adding a new pointer to the
+``_PyInterpreterFrame`` struct and a new ``Py_INCREF`` on each frame
+construction, meaning (unlike this PEP) it has a (very small) performance cost
+for all code. It also provides less scope for future optimizations.
+This PEP takes the position that full inlining offers sufficient additional
+performance to more than justify the behavior changes.
+
+Inlining module-level comprehensions
+------------------------------------
+
+Module-level comprehensions are generally called only once (when the module is
+imported), so optimizing their performance is low priority. Inlining them would
+require separate code paths in the compiler to handle a module global namespace
+dictionary instead of fast-locals. It would be difficult or impossible to avoid
+breaking semantics, since the comprehension iteration variable itself would be
+a module global which might be referenced inside other functions that in turn
+could be called within the comprehension.
+
+
+Copyright
+=========
+
+This document is placed in the public domain or under the
+CC0-1.0-Universal license, whichever is more permissive.