PEP: 709
Title: Inlined comprehensions
Author: Carl Meyer <carl@oddbird.net>
Sponsor: Guido van Rossum <guido@python.org>
Discussions-To: https://discuss.python.org/t/pep-709-inlined-comprehensions/24240
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 24-Feb-2023
Python-Version: 3.12
Post-History: `25-Feb-2023 <https://discuss.python.org/t/pep-709-inlined-comprehensions/24240>`__

Abstract
========

Comprehensions are currently compiled as nested functions, which provides
isolation of the comprehension's iteration variable, but is inefficient at
runtime. This PEP proposes to inline list, dictionary, and set comprehensions
into the function where they are defined, and provide the expected isolation by
pushing/popping clashing locals on the stack. This change makes comprehensions
much faster: up to 2x faster for a microbenchmark of a comprehension alone,
translating to an 11% speedup for one sample benchmark derived from real-world
code that makes heavy use of comprehensions in the context of doing actual
work.

Motivation
==========

Comprehensions are a popular and widely-used feature of the Python language.
The nested-function compilation of comprehensions optimizes for compiler
simplicity at the expense of performance of user code. It is possible to
provide near-identical semantics (see `Backwards Compatibility`_) with much
better runtime performance for all users of comprehensions, with only a small
increase in compiler complexity.

Rationale
=========

Inlining is a common compiler optimization in many languages. Generalized
inlining of function calls at compile time in Python is near-impossible, since
call targets may be patched at runtime. Comprehensions are a special case,
where we have a call target known statically in the compiler that can neither
be patched (barring undocumented and unsupported fiddling with bytecode
directly) nor escape.

Inlining also permits other compiler optimizations of bytecode to be more
effective, because they can now "see through" the comprehension bytecode,
instead of it being an opaque call.

Normally a performance improvement would not require a PEP. In this case, the
simplest and most efficient implementation results in some user-visible effects,
so this is not just a performance improvement, it is a (small) change to the
language.

Specification
=============

Given a simple comprehension::

    def f(lst):
        return [x for x in lst]

The compiler currently emits the following bytecode for the function ``f``:

.. code-block:: text

     1           0 RESUME                   0

     2           2 LOAD_CONST               1 (<code object <listcomp> at 0x...)
                 4 MAKE_FUNCTION            0
                 6 LOAD_FAST                0 (lst)
                 8 GET_ITER
                10 CALL                     0
                20 RETURN_VALUE

   Disassembly of <code object <listcomp> at 0x...>:

     2           0 RESUME                   0
                 2 BUILD_LIST               0
                 4 LOAD_FAST                0 (.0)
           >>    6 FOR_ITER                 4 (to 18)
                10 STORE_FAST               1 (x)
                12 LOAD_FAST                1 (x)
                14 LIST_APPEND              2
                16 JUMP_BACKWARD            6 (to 6)
           >>   18 END_FOR
                20 RETURN_VALUE

The bytecode for the comprehension is in a separate code object. Each time
``f()`` is called, a new single-use function object is allocated (by
``MAKE_FUNCTION``), called (allocating and then destroying a new frame on the
Python stack), and then immediately thrown away.
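
As a quick illustration of the current behavior (a small sketch, not part of
this PEP's specification), the nested code object is visible among the
function's constants::

    import types

    def f(lst):
        return [x for x in lst]

    # Today this finds the separate <listcomp> code object; under this PEP
    # the comprehension is compiled inline, so no such constant exists.
    nested = [c for c in f.__code__.co_consts
              if isinstance(c, types.CodeType)]
    print(nested)  # e.g. [<code object <listcomp> at 0x...>] currently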

Under this PEP, the compiler will emit the following bytecode for ``f()``
instead:

.. code-block:: text

     1           0 RESUME                   0

     2           2 LOAD_FAST                0 (lst)
                 4 GET_ITER
                 6 LOAD_FAST_AND_CLEAR      1 (x)
                 8 SWAP                     2
                10 BUILD_LIST               0
                12 SWAP                     2
           >>   14 FOR_ITER                 4 (to 26)
                18 STORE_FAST               1 (x)
                20 LOAD_FAST                1 (x)
                22 LIST_APPEND              2
                24 JUMP_BACKWARD            6 (to 14)
           >>   26 END_FOR
                28 SWAP                     2
                30 STORE_FAST               1 (x)
                32 RETURN_VALUE

There is no longer a separate code object, nor creation of a single-use function
object, nor any need to create and destroy a Python frame.

Isolation of the ``x`` iteration variable is achieved by the combination of the
new ``LOAD_FAST_AND_CLEAR`` opcode at offset ``6``, which saves any outer value
of ``x`` on the stack before running the comprehension, and ``30 STORE_FAST``,
which restores the outer value of ``x`` (if any) after running the
comprehension.
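
In terms of visible semantics, this save/restore preserves the existing
isolation of the iteration variable; for example (an illustrative snippet, not
new behavior)::

    def f(lst):
        x = "outer"
        items = [x for x in lst]
        # The comprehension's ``x`` does not leak: the outer ``x`` is saved
        # before the loop and restored afterwards.
        return x, items

    print(f([1, 2]))  # ('outer', [1, 2]) both with and without this PEP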

If the comprehension accesses variables from the outer scope, inlining avoids
the need to place these variables in a cell, allowing the comprehension (and all
other code in the outer function) to access them as normal fast locals instead.
This provides further performance gains.
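
For example, in the following function (shown only to illustrate the point),
``factor`` is today turned into a closure cell shared with the hidden
``<listcomp>`` function, which can be observed via
``scale.__code__.co_cellvars``; with inlining it can remain a plain fast
local::

    def scale(values, factor):
        # ``factor`` belongs to the enclosing function's scope but is also
        # read inside the comprehension.
        return [v * factor for v in values]

    print(scale.__code__.co_cellvars)  # ('factor',) today; () with inlining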

Only comprehensions occurring inside functions, where fast-locals
(``LOAD_FAST/STORE_FAST``) are used, will be inlined. Module-level
comprehensions will continue to create and call a function.
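
One rough way to see whether a given comprehension was inlined (an
illustrative check, assuming the reference implementation's approach) is to
look in the containing function's bytecode for the ``MAKE_FUNCTION`` that the
nested-function compilation requires::

    import dis

    def inlined():
        return [x for x in range(3)]

    # True under the current compilation scheme (a separate <listcomp>
    # function is created); False once the comprehension is inlined.
    makes_nested_function = any(
        instr.opname == "MAKE_FUNCTION"
        for instr in dis.get_instructions(inlined)
    )
    print(makes_nested_function)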

Generator expressions are currently never inlined in the reference
implementation of this PEP. In the future, some generator expressions may be
inlined, where the returned generator object does not leak.

Backwards Compatibility
=======================

Comprehension inlining will cause the following visible behavior changes. No
changes in the standard library or test suite were necessary to adapt to these
changes in the implementation, suggesting the impact in user code is likely to
be minimal.

Specialized tools depending on undocumented details of compiler bytecode output
may of course be affected in ways beyond the below, but these tools already must
adapt to bytecode changes in each Python version.

locals() includes outer variables
---------------------------------

Calling ``locals()`` within a comprehension will include all locals of the
function containing the comprehension. E.g. given the following function::

    def f(lst):
        return [locals() for x in lst]

Calling ``f([1])`` in current Python will return::

    [{'.0': <list_iterator object at 0x7f8d37170460>, 'x': 1}]

where ``.0`` is an internal implementation detail: the synthetic sole argument
to the comprehension "function".

Under this PEP, it will instead return::

    [{'lst': [1], 'x': 1}]

This now includes the outer ``lst`` variable as a local, and eliminates the
synthetic ``.0``.

No comprehension frame in tracebacks
------------------------------------

Under this PEP, a comprehension will no longer have its own dedicated frame in
a stack trace. For example, given this function::

    def g():
        raise RuntimeError("boom")

    def f():
        return [g() for x in [1]]

Currently, calling ``f()`` results in the following traceback:

.. code-block:: text

   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "<stdin>", line 5, in f
     File "<stdin>", line 5, in <listcomp>
     File "<stdin>", line 2, in g
   RuntimeError: boom

Note the dedicated frame for ``<listcomp>``.

Under this PEP, the traceback looks like this instead:

.. code-block:: text

   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "<stdin>", line 5, in f
     File "<stdin>", line 2, in g
   RuntimeError: boom

There is no longer an extra frame for the list comprehension. The frame for the
``f`` function has the correct line number for the comprehension, however, so
this simply makes the traceback more compact without losing any useful
information.

It is theoretically possible that code using warnings with the ``stacklevel``
argument could observe a behavior change due to the frame stack change. In
practice, however, this seems unlikely. It would require a warning raised in
library code that is always called through a comprehension in that same
library, where the warning is using a ``stacklevel`` of 3+ to bypass the
comprehension and its containing function and point to a calling frame outside
the library. In such a scenario it would usually be simpler and more reliable
to raise the warning closer to the calling code and bypass fewer frames.
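
A hypothetical sketch of that scenario (all names invented for illustration)::

    import warnings

    def _check(item):
        # Under the current compilation the relevant frames are:
        #   1: _check, 2: <listcomp>, 3: process, 4: process's caller.
        # stacklevel=4 therefore attributes the warning to the code calling
        # process(); with the <listcomp> frame gone, the same stacklevel
        # would point one frame further out than intended.
        warnings.warn("item is deprecated", DeprecationWarning, stacklevel=4)
        return item

    def process(items):
        return [_check(i) for i in items]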

How to Teach This
=================

It is not intuitively obvious that comprehension syntax will or should result
in creation and call of a nested function. For new users not already accustomed
to the prior behavior, I suspect the new behavior in this PEP will be more
intuitive and require less explanation. ("Why is there a ``<listcomp>`` line in
my traceback when I didn't define any such function? What is this ``.0``
variable I see in ``locals()``?")

Security Implications
=====================

None known.

Reference Implementation
========================

This PEP has a reference implementation in the form of `a PR against the CPython main
branch <https://github.com/python/cpython/pull/101441>`_ which passes all tests.

The reference implementation performs the micro-benchmark ``./python -m pyperf
timeit -s 'l = [1]' '[x for x in l]'`` 1.96x faster than the ``main`` branch (in a
build compiled with ``--enable-optimizations``).

The reference implementation performs the ``comprehensions`` benchmark in the
`pyperformance <https://github.com/python/pyperformance>`_ benchmark suite
(which is not a micro-benchmark of comprehensions alone, but tests
real-world-derived code doing realistic work using comprehensions) 11% faster
than the ``main`` branch (again in optimized builds). Other benchmarks in
pyperformance (none of which use comprehensions heavily) don't show any impact
outside the noise.

The implementation has no impact on non-comprehension code.

Rejected Ideas
==============

More efficient comprehension calling, without inlining
------------------------------------------------------

An `alternate approach <https://github.com/python/cpython/pull/101310>`_
introduces a new opcode for "calling" a comprehension in streamlined fashion
without the need to create a throwaway function object, but still creating a new
Python frame. This avoids all of the visible effects listed under `Backwards
Compatibility`_, and provides roughly half of the performance benefit (1.5x
improvement on the microbenchmark, 4% improvement on ``comprehensions``
benchmark in pyperformance). It also requires adding a new pointer to the
``_PyInterpreterFrame`` struct and a new ``Py_INCREF`` on each frame
construction, meaning (unlike this PEP) it has a (very small) performance cost
for all code. It also provides less scope for future optimizations.

This PEP takes the position that full inlining offers sufficient additional
performance to more than justify the behavior changes.

Inlining module-level comprehensions
------------------------------------

Module-level comprehensions are generally called only once (when the module is
imported), so optimizing their performance is low priority. Inlining them would
require separate code paths in the compiler to handle a module global namespace
dictionary instead of fast-locals. It would be difficult or impossible to avoid
breaking semantics, since the comprehension iteration variable itself would be
a module global which might be referenced inside other functions that in turn
could be called within the comprehension.
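
A sketch of that hazard (illustrative only)::

    x = "outer"

    def describe():
        # Reads the module global ``x``.
        return x

    # Today the comprehension's ``x`` lives in a hidden nested function, so
    # describe() sees only the pre-existing global and this evaluates to
    # ['outer', 'outer', 'outer']. If the comprehension were inlined into the
    # module namespace, the loop would rebind the global ``x`` and describe()
    # would observe the loop's current value instead.
    values = [describe() for x in range(3)]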

Copyright
=========

This document is placed in the public domain or under the
CC0-1.0-Universal license, whichever is more permissive.