PEP 709: Inlined comprehensions (#3029)

Co-authored-by: C.A.M. Gerlach <CAM.Gerlach@Gerlach.CAM>
Co-authored-by: Jelle Zijlstra <jelle.zijlstra@gmail.com>
2023-02-26 18:11:03 -07:00 · cf5741c181
parent 6d182da522
commit cf5741c181
2 changed files with 321 additions and 0 deletions
--- a/.github/CODEOWNERS
+++ b/.github/CODEOWNERS
@@ -588,6 +588,7 @@ pep-0704.rst  @brettcannon @pradyunsg
 # pep-0705.rst
 pep-0706.rst  @encukou
 pep-0708.rst  @dstufft
+pep-0709.rst  @carljm
 # ...
 # pep-0754.txt
 # ...
--- a/pep-0709.rst
+++ b/pep-0709.rst
@@ -0,0 +1,320 @@
+PEP: 709
+Title: Inlined comprehensions
+Author: Carl Meyer <carl@oddbird.net>
+Sponsor: Guido van Rossum <guido@python.org>
+Discussions-To: https://discuss.python.org/t/pep-709-inlined-comprehensions/24240
+Status: Draft
+Type: Standards Track
+Content-Type: text/x-rst
+Created: 24-Feb-2023
+Python-Version: 3.12
+Post-History: `25-Feb-2023 <https://discuss.python.org/t/pep-709-inlined-comprehensions/24240>`__
+
+
+Abstract
+========
+
+Comprehensions are currently compiled as nested functions, which provides
+isolation of the comprehension's iteration variable, but is inefficient at
+runtime. This PEP proposes to inline list, dictionary, and set comprehensions
+into the function where they are defined, and provide the expected isolation by
+pushing/popping clashing locals on the stack. This change makes comprehensions
+much faster: up to 2x faster for a microbenchmark of a comprehension alone,
+translating to an 11% speedup for one sample benchmark derived from real-world
+code that makes heavy use of comprehensions in the context of doing actual
+work.
+
+
+Motivation
+==========
+
+Comprehensions are a popular and widely-used feature of the Python language.
+The nested-function compilation of comprehensions optimizes for compiler
+simplicity at the expense of performance of user code. It is possible to
+provide near-identical semantics (see `Backwards Compatibility`_) with much
+better runtime performance for all users of comprehensions, with only a small
+increase in compiler complexity.
+
+
+Rationale
+=========
+
+Inlining is a common compiler optimization in many languages.  Generalized
+inlining of function calls at compile time in Python is near-impossible, since
+call targets may be patched at runtime. Comprehensions are a special case,
+where we have a call target known statically in the compiler that can neither
+be patched (barring undocumented and unsupported fiddling with bytecode
+directly) nor escape.
+
+Inlining also permits other compiler optimizations of bytecode to be more
+effective, because they can now "see through" the comprehension bytecode,
+instead of it being an opaque call.
+
+Normally a performance improvement would not require a PEP. In this case, the
+simplest and most efficient implementation results in some user-visible effects,
+so this is not just a performance improvement; it is also a (small) change to
+the language.
+
+
+Specification
+=============
+
+Given a simple comprehension::
+
+  def f(lst):
+      return [x for x in lst]
+
+The compiler currently emits the following bytecode for the function ``f``:
+
+.. code-block:: text
+
+   1           0 RESUME                   0
+
+   2           2 LOAD_CONST               1 (<code object <listcomp> at 0x...)
+               4 MAKE_FUNCTION            0
+               6 LOAD_FAST                0 (lst)
+               8 GET_ITER
+              10 CALL                     0
+              20 RETURN_VALUE
+
+   Disassembly of <code object <listcomp> at 0x...>:
+   2           0 RESUME                   0
+               2 BUILD_LIST               0
+               4 LOAD_FAST                0 (.0)
+         >>    6 FOR_ITER                 4 (to 18)
+              10 STORE_FAST               1 (x)
+              12 LOAD_FAST                1 (x)
+              14 LIST_APPEND              2
+              16 JUMP_BACKWARD            6 (to 6)
+         >>   18 END_FOR
+              20 RETURN_VALUE
+
+The bytecode for the comprehension is in a separate code object. Each time
+``f()`` is called, a new single-use function object is allocated (by
+``MAKE_FUNCTION``), called (allocating and then destroying a new frame on the
+Python stack), and then immediately thrown away.
+
+Under this PEP, the compiler will emit the following bytecode for ``f()``
+instead:
+
+.. code-block:: text
+
+  1           0 RESUME                   0
+
+  2           2 LOAD_FAST                0 (lst)
+              4 GET_ITER
+              6 LOAD_FAST_AND_CLEAR      1 (x)
+              8 SWAP                     2
+             10 BUILD_LIST               0
+             12 SWAP                     2
+        >>   14 FOR_ITER                 4 (to 26)
+             18 STORE_FAST               1 (x)
+             20 LOAD_FAST                1 (x)
+             22 LIST_APPEND              2
+             24 JUMP_BACKWARD            6 (to 14)
+        >>   26 END_FOR
+             28 SWAP                     2
+             30 STORE_FAST               1 (x)
+             32 RETURN_VALUE
+
+There is no longer a separate code object, nor creation of a single-use function
+object, nor any need to create and destroy a Python frame.
+
+Isolation of the ``x`` iteration variable is achieved by the combination of the
+new ``LOAD_FAST_AND_CLEAR`` opcode at offset ``6``, which saves any outer value
+of ``x`` on the stack before running the comprehension, and the ``STORE_FAST``
+at offset ``30``, which restores the outer value of ``x`` (if any) after running
+the comprehension.
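+
+The isolation guarantee itself can be checked from Python code. The following
+is a minimal sketch; the function name ``outer_value_survives`` is purely
+illustrative. It should behave the same whether the comprehension is compiled
+as a nested function or inlined, since the observable result is what this
+section aims to preserve:
+
```python
def outer_value_survives(lst):
    # ``x`` is an ordinary local of the enclosing function.
    x = "outer"
    # The comprehension reuses the name ``x`` as its iteration variable.
    doubled = [x * 2 for x in lst]
    # The outer ``x`` must be unaffected once the comprehension finishes.
    return x, doubled


result = outer_value_survives([1, 2, 3])
```
+
+Here ``result`` is ``("outer", [2, 4, 6])``: the comprehension's ``x`` never
+leaks into the enclosing function.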
+
+If the comprehension accesses variables from the outer scope, inlining avoids
+the need to place these variables in a cell, allowing the comprehension (and all
+other code in the outer function) to access them as normal fast locals instead.
+This provides further performance gains.
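+
+One way to observe this effect is to inspect the code object of a function
+whose comprehension reads an outer variable. The sketch below uses the
+hypothetical helper names ``scale`` and ``describe``; only the documented code
+object attributes ``co_cellvars`` and ``co_consts`` are relied upon. On an
+interpreter without inlining, one would expect ``k`` to appear in
+``co_cellvars`` and a nested ``<listcomp>`` code object in ``co_consts``; with
+inlining, both collections would be expected to be empty:
+
```python
import types


def scale(lst, k):
    # The comprehension reads ``k`` from the enclosing function.
    return [x * k for x in lst]


def describe(func):
    """Report cell variables and nested code objects of ``func``."""
    code = func.__code__
    nested = [
        const.co_name
        for const in code.co_consts
        if isinstance(const, types.CodeType)
    ]
    return {"cellvars": code.co_cellvars, "nested_code": nested}


info = describe(scale)
```
+
+The contents of ``info`` depend on the interpreter version, so no particular
+output is asserted here.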
+
+Only comprehensions occurring inside functions, where fast-locals
+(``LOAD_FAST``/``STORE_FAST``) are used, will be inlined. Module-level
+comprehensions will continue to create and call a function.
+
+Generator expressions are currently never inlined in the reference
+implementation of this PEP. In the future, some generator expressions may be
+inlined, where the returned generator object does not leak.
+
+
+Backwards Compatibility
+=======================
+
+Comprehension inlining will cause the following visible behavior changes. No
+changes in the standard library or test suite were necessary to adapt to these
+changes in the implementation, suggesting the impact in user code is likely to
+be minimal.
+
+Specialized tools depending on undocumented details of compiler bytecode output
+may of course be affected in ways beyond the below, but these tools already must
+adapt to bytecode changes in each Python version.
+
+locals() includes outer variables
+---------------------------------
+
+Calling ``locals()`` within a comprehension will include all locals of the
+function containing the comprehension. For example, given the following function::
+
+  def f(lst):
+      return [locals() for x in lst]
+
+Calling ``f([1])`` in current Python will return::
+
+  [{'.0': <list_iterator object at 0x7f8d37170460>, 'x': 1}]
+
+where ``.0`` is an internal implementation detail: the synthetic sole argument
+to the comprehension "function".
+
+Under this PEP, it will instead return::
+
+  [{'lst': [1], 'x': 1}]
+
+This now includes the outer ``lst`` variable as a local, and eliminates the
+synthetic ``.0``.
+
+No comprehension frame in tracebacks
+------------------------------------
+
+Under this PEP, a comprehension will no longer have its own dedicated frame in
+a stack trace. For example, given this function::
+
+  def g():
+      raise RuntimeError("boom")
+
+  def f():
+      return [g() for x in [1]]
+
+Currently, calling ``f()`` results in the following traceback:
+
+.. code-block:: text
+
+   Traceback (most recent call last):
+     File "<stdin>", line 1, in <module>
+     File "<stdin>", line 5, in f
+     File "<stdin>", line 5, in <listcomp>
+     File "<stdin>", line 2, in g
+   RuntimeError: boom
+
+Note the dedicated frame for ``<listcomp>``.
+
+Under this PEP, the traceback looks like this instead:
+
+.. code-block:: text
+
+   Traceback (most recent call last):
+     File "<stdin>", line 1, in <module>
+     File "<stdin>", line 5, in f
+     File "<stdin>", line 2, in g
+   RuntimeError: boom
+
+There is no longer an extra frame for the list comprehension. The frame for the
+``f`` function has the correct line number for the comprehension, however, so
+this simply makes the traceback more compact without losing any useful
+information.
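+
+Code that inspects tracebacks programmatically can see the same difference.
+The following sketch collects the function names in the traceback using the
+standard ``traceback.extract_tb``; the helper name ``frame_names`` is an
+illustrative assumption, not part of this PEP:
+
```python
import sys
import traceback


def g():
    raise RuntimeError("boom")


def f():
    return [g() for x in [1]]


def frame_names():
    """Return the function names in the traceback raised by ``f()``."""
    try:
        f()
    except RuntimeError:
        tb = sys.exc_info()[2]
        return [entry.name for entry in traceback.extract_tb(tb)]
    return []


names = frame_names()
```
+
+Without inlining, ``names`` would be expected to contain a ``<listcomp>``
+entry between ``f`` and ``g``; under this PEP that entry would be absent.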
+
+It is theoretically possible that code using warnings with the ``stacklevel``
+argument could observe a behavior change due to the frame stack change. In
+practice, however, this seems unlikely. It would require a warning raised in
+library code that is always called through a comprehension in that same
+library, where the warning is using a ``stacklevel`` of 3+ to bypass the
+comprehension and its containing function and point to a calling frame outside
+the library. In such a scenario it would usually be simpler and more reliable
+to raise the warning closer to the calling code and bypass fewer frames.
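+
+The scenario described above can be sketched as follows. All names
+(``_check``, ``process``, ``caller``, ``warning_points_at_caller``) are
+hypothetical. The ``stacklevel=4`` value is chosen on the assumption that the
+comprehension has its own frame, so that the levels are ``_check``, the
+comprehension, ``process``, and then the caller:
+
```python
import warnings


def _check(value):
    # Intended to skip _check, the comprehension frame, and process().
    warnings.warn("value is deprecated", DeprecationWarning, stacklevel=4)
    return value


def process(values):
    return [_check(v) for v in values]


def caller():
    return process([1])


def warning_points_at_caller():
    """Return True if the warning is attributed to the line in caller()."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        caller()
    # The body of caller() is the line right after its ``def`` line.
    expected_line = caller.__code__.co_firstlineno + 1
    return caught[0].lineno == expected_line


attributed_to_caller = warning_points_at_caller()
```
+
+With a dedicated comprehension frame the warning would be attributed to
+``caller()``; once the comprehension is inlined, the same ``stacklevel`` would
+point one frame further out.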
+
+
+UnboundLocalError instead of NameError
+--------------------------------------
+
+Although the value of the comprehension iteration variable is saved and
+restored to provide isolation, it still becomes a local variable of the outer
+function under this PEP. This implies a small behavior change in a function
+where the comprehension iteration variable is accessed outside the
+comprehension without ever being set outside the comprehension::
+
+   def f(lst):
+       items = [x for x in lst]
+       return x
+
+Under this PEP, calling ``f([1])`` will raise ``UnboundLocalError``, where
+currently it raises ``NameError``. ``UnboundLocalError`` is a subclass of
+``NameError``, so this should not impact code catching ``NameError``.
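+
+The subclass relationship is what keeps existing handlers working. The sketch
+below deliberately uses an ordinary unbound local rather than a comprehension
+variable, so that it behaves identically regardless of how comprehensions are
+compiled; the function names are illustrative only:
+
```python
def read_unbound_local():
    if False:
        value = 1  # never runs, but makes ``value`` a local variable
    return value  # raises UnboundLocalError


def caught_by_name_error_handler():
    """Show that ``except NameError`` also catches UnboundLocalError."""
    try:
        read_unbound_local()
    except NameError as exc:
        return type(exc).__name__
    return None


caught = caught_by_name_error_handler()
```
+
+Here ``caught`` is ``"UnboundLocalError"``, even though the handler names only
+``NameError``.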
+
+
+How to Teach This
+=================
+
+It is not intuitively obvious that comprehension syntax will or should result
+in creation and call of a nested function. For new users not already accustomed
+to the prior behavior, I suspect the new behavior in this PEP will be more
+intuitive and require less explanation. ("Why is there a ``<listcomp>`` line in
+my traceback when I didn't define any such function? What is this ``.0``
+variable I see in ``locals()``?")
+
+
+Security Implications
+=====================
+
+None known.
+
+
+Reference Implementation
+========================
+
+This PEP has a reference implementation in the form of `a PR against the CPython main
+branch <https://github.com/python/cpython/pull/101441>`_ which passes all tests.
+
+The reference implementation performs the micro-benchmark ``./python -m pyperf
+timeit -s 'l = [1]' '[x for x in l]'`` 1.96x faster than the ``main`` branch (in a
+build compiled with ``--enable-optimizations``).
+
+The reference implementation performs the ``comprehensions`` benchmark in the
+`pyperformance <https://github.com/python/pyperformance>`_ benchmark suite
+(which is not a micro-benchmark of comprehensions alone, but tests
+real-world-derived code doing realistic work using comprehensions) 11% faster
+than the ``main`` branch (again in optimized builds). Other benchmarks in
+pyperformance (none of which use comprehensions heavily) show no impact
+beyond the noise.
+
+The implementation has no impact on non-comprehension code.
+
+
+Rejected Ideas
+==============
+
+More efficient comprehension calling, without inlining
+------------------------------------------------------
+
+An `alternate approach <https://github.com/python/cpython/pull/101310>`_
+introduces a new opcode for "calling" a comprehension in streamlined fashion
+without the need to create a throwaway function object, but still creating a new
+Python frame. This avoids all of the visible effects listed under `Backwards
+Compatibility`_, and provides roughly half of the performance benefit (1.5x
+improvement on the microbenchmark, 4% improvement on the ``comprehensions``
+benchmark in pyperformance). It also requires adding a new pointer to the
+``_PyInterpreterFrame`` struct and a new ``Py_INCREF`` on each frame
+construction, meaning (unlike this PEP) it has a (very small) performance cost
+for all code. It also provides less scope for future optimizations.
+
+This PEP takes the position that full inlining offers sufficient additional
+performance to more than justify the behavior changes.
+
+Inlining module-level comprehensions
+------------------------------------
+
+Module-level comprehensions are generally called only once (when the module is
+imported), so optimizing their performance is low priority. Inlining them would
+require separate code paths in the compiler to handle a module global namespace
+dictionary instead of fast-locals. It would be difficult or impossible to avoid
+breaking semantics, since the comprehension iteration variable itself would be
+a module global which might be referenced inside other functions that in turn
+could be called within the comprehension.
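+
+The semantic hazard can be illustrated with a small sketch. The module source
+below is run through ``exec`` with a single namespace dictionary so that it
+behaves like module-level code; the names ``MODULE_SOURCE``, ``show``, and
+``namespace`` are illustrative assumptions:
+
```python
MODULE_SOURCE = '''
x = "module global"

def show():
    # Reads the module global ``x``, not the comprehension's ``x``.
    return x

result = [show() for x in range(2)]
'''

# A single dict serves as the module's global namespace.
namespace = {}
exec(MODULE_SOURCE, namespace)

result = namespace["result"]
global_x = namespace["x"]
```
+
+Here ``result`` is ``["module global", "module global"]`` and ``global_x`` is
+still ``"module global"``. A naive inlining that stored the iteration variable
+directly in the module namespace would instead let ``show()`` observe ``0``
+and ``1``, which is the breakage described above.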
+
+
+Copyright
+=========
+
+This document is placed in the public domain or under the
+CC0-1.0-Universal license, whichever is more permissive.