PEP: 703
Title: Making the Global Interpreter Lock Optional in CPython
Author: Sam Gross <colesbury at gmail.com>
Sponsor: Łukasz Langa <lukasz at python.org>
Discussions-To: https://discuss.python.org/t/22606
Status: Accepted
Type: Standards Track
Content-Type: text/x-rst
Created: 09-Jan-2023
Python-Version: 3.13
Post-History: `09-Jan-2023 <https://discuss.python.org/t/22606>`__,
              `04-May-2023 <https://discuss.python.org/t/26503>`__
Resolution: `24-Oct-2023 <https://discuss.python.org/t/pep-703-making-the-global-interpreter-lock-optional-in-cpython-acceptance/37075>`__

.. note::
   The Steering Council accepts PEP 703, but with clear proviso: that
   the rollout be gradual and break as little as possible, and that we can roll
   back any changes that turn out to be too disruptive – which includes
   potentially rolling back all of PEP 703 entirely if necessary
   (however unlikely or undesirable we expect that to be).


Abstract
========

CPython's global interpreter lock ("GIL") prevents multiple threads
from executing Python code at the same time. The GIL is an obstacle
to using multi-core CPUs from Python efficiently. This PEP proposes
adding a build configuration (``--disable-gil``) to CPython to let it
run Python code without the global interpreter lock and with the
changes needed to make the interpreter thread-safe.

Motivation
==========

The GIL is a major obstacle to concurrency. For scientific computing
tasks, this lack of concurrency is often a bigger issue than speed of
executing Python code, since most of the processor cycles are spent
in optimized CPU or GPU kernels. The GIL introduces a global
bottleneck that can prevent other threads from making progress if
they call any Python code. There are existing ways to enable
parallelism in CPython today, but those techniques come with
significant limitations (see `Alternatives`_).

This section focuses on the GIL's impact on scientific computing,
particularly AI/ML workloads, because that is the area with which this
author has the most experience, but the GIL also affects other users
of Python.


The GIL Makes Many Types of Parallelism Difficult to Express
------------------------------------------------------------

Neural network-based AI models expose multiple opportunities for
parallelism. For example, individual operations may be parallelized
internally ("intra-operator"), multiple operations may be executed
simultaneously ("inter-operator"), and requests (spanning multiple
operations) may also be parallelized. Efficient execution requires
exploiting multiple types of parallelism [#yuemmwang2019]_.

The GIL makes it difficult to express inter-operator parallelism, as
well as some forms of request parallelism, efficiently in Python. In
other programming languages, a system might use threads to run
different parts of a neural network on separate CPU cores, but this is
inefficient in Python due to the GIL. Similarly, latency-sensitive
inference workloads frequently use threads to parallelize across
requests, but face the same scaling bottlenecks in Python.

The challenges the GIL poses to exploiting parallelism in Python
frequently come up in reinforcement learning. Heinrich Kuttler,
author of the NetHack Learning Environment and Member of Technical
Staff at Inflection AI, writes:

    Recent breakthroughs in reinforcement learning, such as on `Dota
    2`_, `StarCraft`_, and `NetHack`_ rely on running multiple
    environments (simulated games) in parallel using asynchronous
    actor-critic methods. Straightforward multithreaded implementations
    in Python don't scale beyond more than a few parallel environments
    due to GIL contention. Multiprocessing, with communication via
    shared memory or UNIX sockets, adds much complexity and in effect
    rules out interacting with CUDA from different workers, severely
    restricting the design space.

.. _Dota 2: https://openai.com/five/
.. _StarCraft: https://www.deepmind.com/blog/alphastar-grandmaster-level-in-starcraft-ii-using-multi-agent-reinforcement-learning
.. _NetHack: https://ai.facebook.com/blog/nethack-learning-environment-to-advance-deep-reinforcement-learning/

Manuel Kroiss, software engineer at DeepMind on the reinforcement
learning team, describes how the bottlenecks posed by the GIL lead to
rewriting Python codebases in C++, making the code less accessible:

    We frequently battle issues with the Python GIL at DeepMind. In many
    of our applications, we would like to run on the order of 50-100
    threads per process. However, we often see that even with fewer
    than 10 threads the GIL becomes the bottleneck. To work around this
    problem, we sometimes use subprocesses, but in many cases the
    inter-process communication becomes too big of an overhead. To
    deal with the GIL, we usually end up translating large parts of our
    Python codebase into C++. This is undesirable because it makes the
    code less accessible to researchers.


Projects that involve interfacing with multiple hardware devices face
similar challenges: efficient communication requires use of multiple
CPU cores. The `Dose-3D`_ project aims to improve cancer
radiotherapy with precise dose planning. It uses medical phantoms
(stand-ins for human tissue) together with custom hardware and a
server application written in Python. Paweł Jurgielewicz, lead
software architect for the data acquisition system on the Dose-3D
project, describes the scaling challenges posed by the GIL and how
using a fork of Python without the GIL simplified the project:

    In the Dose-3D project, the key challenge was to maintain a stable,
    non-trivial concurrent communication link with hardware units while
    utilizing a 1 Gbit/s UDP/IP connection to the maximum. Naturally,
    we started with the multiprocessing package, but at some point, it
    became clear that most CPU time was consumed by the data transfers
    between the data processing stages, not by data processing itself.
    The CPython multithreading implementation based on GIL was a dead
    end too. When we found out about the "nogil" fork of Python it took
    a single person less than half a working day to adjust the codebase
    to use this fork and the results were astonishing. Now we can focus
    on data acquisition system development rather than fine-tuning data
    exchange algorithms.

.. _Dose-3D: https://dose3d.fis.agh.edu.pl/en/projekt-dose-3d-z-programu-team-net-fnp-eng/


Allen Goodman, author of `CellProfiler`_ and staff engineer at
Prescient Design and Genentech, describes how the GIL makes
biological methods research more difficult in Python:

    Issues with Python's global interpreter lock are a frequent source
    of frustration throughout biological methods research.

    I wanted to better understand the current multithreading situation
    so I reimplemented parts of `HMMER`_, a standard method for
    multiple-sequence alignment. I chose this method because it
    stresses both single-thread performance (scoring) and
    multi-threaded performance (searching a database of sequences). The
    GIL became the bottleneck when using only eight threads. This is a
    method where the current popular implementations rely on 64 or
    even 128 threads per process. I tried moving to subprocesses but
    was blocked by the prohibitive IPC costs. HMMER is a relatively
    elementary bioinformatics method and newer methods have far bigger
    multi-threading demands.

    Method researchers are begging to use Python (myself included),
    because of its ease of use, the Python ecosystem, and because "it's
    what people know." Many biologists only know a little bit of
    programming (and that's almost always Python). Until Python's
    multithreading situation is addressed, C and C++ will remain the
    lingua franca of the biological methods research community.

.. _CellProfiler: https://cellprofiler.org/
.. _HMMER: http://hmmer.org/


The GIL Affects Python Library Usability
----------------------------------------

The GIL is a CPython implementation detail that limits multithreaded
parallelism, so it might seem unintuitive to think of it as a
usability issue. However, library authors frequently care a great
deal about performance and will design APIs that support working
around the GIL. These workarounds frequently lead to APIs that are
more difficult to use. Consequently, users of these APIs may
experience the GIL as a *usability* issue and not just a performance
issue.

For example, PyTorch exposes a multiprocessing-based API called
``DataLoader`` for building data input pipelines. It uses ``fork()``
on Linux because it is generally faster and uses less memory
than ``spawn()``, but this leads to additional challenges for users:
creating a ``DataLoader`` after accessing a GPU can lead to confusing
CUDA errors. Accessing GPUs within a ``DataLoader`` worker quickly
leads to out-of-memory errors because processes do not share CUDA
contexts (unlike threads within a process).

Olivier Grisel, scikit-learn developer and software engineer at Inria,
describes how having to work around the GIL in scikit-learn related
libraries leads to a more complex and confusing user experience:

    Over the years, scikit-learn developers have maintained ancillary
    libraries such as ``joblib`` and ``loky`` to try to work around some
    of the limitations of multiprocessing: extra memory usage partially
    mitigated via semi-automated memory mapping of large data buffers,
    slow worker startup by transparently reusing a pool of long
    running workers, fork-safety problems of third-party native runtime
    libraries such as GNU OpenMP by never using the fork-only
    start-method, ability to perform parallel calls of interactively
    defined functions in notebooks and REPLs in cross-platform manner
    via cloudpickle. Despite our efforts, this multiprocessing-based
    solution is still brittle, complex to maintain and confusing to
    datascientists with limited understanding of system-level
    constraints. Furthermore, there are still irreducible limitations
    such as the overhead caused by the pickle-based
    serialization/deserialization steps required for inter-process
    communication. A lot of this extra work and complexity would not be
    needed anymore if we could use threads without contention on
    multicore hosts (sometimes with 64 physical cores or more) to run
    data science pipelines that alternate between Python-level
    operations and calls to native libraries.

Ralf Gommers, co-director of Quansight Labs and NumPy and SciPy
maintainer, describes how the GIL affects the user experience of
NumPy and numeric Python libraries:

    A key problem in NumPy and the stack of packages built around it is
    that NumPy is still (mostly) single-threaded --- and that has shaped
    significant parts of the user experience and projects built around
    it. NumPy does release the GIL in its inner loops (which do the
    heavy lifting), but that is not nearly enough. NumPy doesn't offer
    a solution to utilize all CPU cores of a single machine well, and
    instead leaves that to Dask and other multiprocessing solutions.
    Those aren't very efficient and are also more clumsy to use. That
    clumsiness comes mainly in the extra abstractions and layers the
    users need to concern themselves with when using, e.g.,
    ``dask.array`` which wraps ``numpy.ndarray``. It also shows up in
    oversubscription issues that the user must explicitly be aware of
    and manage via either environment variables or a third package,
    ``threadpoolctl``. The main reason is that NumPy calls into BLAS
    for linear algebra - and those calls it has no control over, they
    do use all cores by default via either pthreads or OpenMP.

    Coordinating on APIs and design decisions to control parallelism is
    still a major amount of work, and one of the harder challenges
    across the PyData ecosystem. It would have looked a lot different
    (better, easier) without a GIL.


GPU-Heavy Workloads Require Multi-Core Processing
-------------------------------------------------

Many high-performance computing (HPC) and AI workloads make heavy use
of GPUs. These applications frequently require efficient multi-core
CPU execution even though the bulk of the computation runs on a GPU.

Zachary DeVito, PyTorch core developer and researcher at FAIR
(Meta AI), describes how the GIL makes multithreaded scaling
inefficient even when the bulk of computation is performed outside of
Python:

    In PyTorch, Python is commonly used to orchestrate ~8 GPUs and ~64
    CPU threads, growing to 4k GPUs and 32k CPU threads for big models.
    While the heavy lifting is done outside of Python, the speed of
    GPUs makes even just the orchestration in Python not scalable. We
    often end up with 72 processes in place of one because of the GIL.
    Logging, debugging, and performance tuning are orders-of-magnitude
    more difficult in this regime, continuously causing lower developer
    productivity.

The use of many processes (instead of threads) makes common tasks more
difficult. Zachary DeVito continues:

    On three separate occasions in the past couple of months
    (reducing redundant compute in data loaders, writing model
    checkpoints asynchronously, and parallelizing compiler
    optimizations), I spent an order-of-magnitude more time figuring
    out how to work around GIL limitations than actually solving the
    particular problem.

Even GPU-heavy workloads frequently have a CPU-intensive component.
For example, computer vision tasks typically require
multiple "pre-processing" steps in the data input pipeline, like
image decoding, cropping, and resizing. These tasks are commonly
performed on the CPU and may use Python libraries like `Pillow`_
or `Pillow-SIMD`_. It is necessary to run the data input pipeline
on multiple CPU cores in order to keep the GPU "fed" with data.

The increase in GPU performance compared to individual CPU cores makes
multi-core performance more important. It is progressively more
difficult to keep the GPUs fully occupied. To do so requires efficient
use of multiple CPU cores, especially on multi-GPU systems. For
example, NVIDIA's DGX-A100 has 8 GPUs and two 64-core CPUs in order to
keep the GPUs "fed" with data.

.. _Pillow: https://pillow.readthedocs.io/en/stable/
.. _Pillow-SIMD: https://github.com/uploadcare/pillow-simd


The GIL Makes Deploying Python AI Models Difficult
--------------------------------------------------

Python is widely used to develop neural network-based AI models. In
PyTorch, models are frequently deployed as part of multi-threaded,
mostly C++, environments. Python is often viewed skeptically
because the GIL can be a global bottleneck, preventing efficient
scaling even though the vast majority of the computations
occur "outside" of Python with the GIL released. The torchdeploy
paper [#torchdeploy]_ shows experimental evidence for these scaling
bottlenecks in multiple model architectures.

PyTorch provides a number of mechanisms for deploying Python AI
models that avoid or work around the GIL, but they all come with
substantial limitations. For example, `TorchScript
<https://pytorch.org/docs/stable/jit.html>`_ captures a
representation of the model that can be executed from C++ without any
Python dependencies, but it only supports a limited subset of Python
and often requires rewriting some of the model's code. The
`torch::deploy <https://pytorch.org/docs/stable/package.html>`_ API
allows multiple Python interpreters, each with its own GIL, in the
same process (similar to :pep:`684`). However, ``torch::deploy`` has
limited support for Python modules that use C-API extensions.


Motivation Summary
------------------

Python's global interpreter lock makes it difficult to use modern
multi-core CPUs efficiently for many scientific and numeric computing
applications. Heinrich Kuttler, Manuel Kroiss, and Paweł
Jurgielewicz found that multi-threaded implementations in Python did
not scale well for their tasks and that using multiple processes
was not a suitable alternative.

The scaling bottlenecks are not solely in core numeric tasks. Both
Zachary DeVito and Paweł Jurgielewicz described challenges with
coordination and communication in Python.

Olivier Grisel, Ralf Gommers, and Zachary DeVito described how current
workarounds for the GIL are "complex to maintain" and cause "lower
developer productivity." The GIL makes it more difficult to develop
and maintain scientific and numeric computing libraries, and it leads
to library designs that are more difficult to use.



Specification
=============

Build Configuration Changes
---------------------------

The global interpreter lock will remain the default for CPython builds
and python.org downloads. A new build configuration flag,
``--disable-gil``, will be added to the configure script. It will build
CPython with support for running without the global interpreter lock.

When built with ``--disable-gil``, CPython will define the ``Py_GIL_DISABLED``
macro in Python/patchlevel.h. The ABI tag will include the letter "t"
(for "threading").

The ``--disable-gil`` builds of CPython will still support optionally
running with the GIL enabled at runtime (see `PYTHONGIL Environment
Variable`_ and `Py_mod_gil Slot`_).

Overview of CPython Changes
---------------------------

Removing the global interpreter lock requires substantial changes to
CPython internals, but relatively few changes to the public Python
and C APIs. This section describes the required changes to the
CPython implementation followed by the proposed API changes.

The implementation changes can be grouped into the following four
categories:

* Reference counting
* Memory management
* Container thread-safety
* Locking and atomic APIs

Reference Counting
------------------

Removing the GIL requires changes to CPython's
reference counting implementation to make it thread-safe.
Furthermore, it needs to have low execution overhead and allow for
efficient scaling with multiple threads. This PEP proposes a
combination of three techniques to address these constraints. The
first is a switch from plain non-atomic reference counting to biased
reference counting, which is a thread-safe reference counting
technique with lower execution overhead than plain atomic reference
counting. The other two techniques are immortalization and a limited
form of deferred reference counting; they address some of the
multi-threaded scalability issues with reference counting by avoiding
some reference count modifications.

Biased reference counting (BRC) is a technique first described in 2018
by Jiho Choi, Thomas Shull, and Josep Torrellas [#brc]_. It is based on the
observation that most objects are only accessed by a single thread,
even in multi-threaded programs. Each object is associated with an
owning thread (the thread that created it). Reference counting
operations from the owning thread use non-atomic instructions to
modify a "local" reference count. Other threads use atomic
instructions to modify a "shared" reference count. This design avoids
many atomic read-modify-write operations that are expensive on
contemporary processors.

The implementation of BRC proposed in this PEP largely matches the
original description of biased reference counting, but differs in
details like the size of reference counting fields and special bits in
those fields. BRC requires storing three pieces of information in each
object's header: the "local" reference count, the "shared" reference
count, and the identifier of the owning thread. The BRC paper packs
these three things into a single 64-bit field. This PEP proposes using
three separate fields in each object's header to avoid potential issues
due to reference count overflow. Additionally, the PEP supports a
faster deallocation path that avoids an atomic operation in the common
case.

The proposed ``PyObject`` struct (also called ``struct _object``) is
below:

.. code-block:: c

    struct _object {
        _PyObject_HEAD_EXTRA
        uintptr_t ob_tid;         // owning thread id (4-8 bytes)
        uint16_t __padding;       // reserved for future use (2 bytes)
        PyMutex ob_mutex;         // per-object mutex (1 byte)
        uint8_t ob_gc_bits;       // GC fields (1 byte)
        uint32_t ob_ref_local;    // local reference count (4 bytes)
        Py_ssize_t ob_ref_shared; // shared reference count and state bits (4-8 bytes)
        PyTypeObject *ob_type;
    };

The ``ob_tid``, ``ob_ref_local``, and ``ob_ref_shared`` fields are used by
the biased reference counting implementation. The ``ob_gc_bits`` field
is used to store garbage collection flags that were previously stored in
``PyGC_Head`` (see `Garbage Collection (Cycle Collection)`_). The
``ob_mutex`` field provides a per-object lock in a single byte.



Immortalization
'''''''''''''''

Some objects, such as interned strings, small integers, statically
allocated PyTypeObjects, and the ``True``, ``False``, and ``None``
objects stay alive for the lifetime of the program. These objects are
marked as immortal by setting the local reference count field
(``ob_ref_local``) to ``UINT32_MAX``.

The ``Py_INCREF`` and ``Py_DECREF`` macros are no-ops for immortal
objects. This avoids contention on the reference count fields of
these objects when multiple threads access them concurrently.

This proposed immortalization scheme is very similar to :pep:`683`,
adopted in Python 3.12, but with slightly different bit representation
in the reference count fields for immortal objects in order to work
with biased reference counting and deferred reference counting. See
also `Why Not Use PEP 683 Immortalization?`_.

Biased Reference Counting
'''''''''''''''''''''''''

Biased reference counting has a fast-path for objects "owned" by the
current thread and a slow-path for other objects. Ownership is
indicated by the ``ob_tid`` field. Determining the thread id requires
platform specific code [#tid]_. A value of ``0`` in ``ob_tid``
indicates that the object is not owned by any thread.

The ``ob_ref_local`` field stores the local reference count and two
flags. The two most significant bits are used to indicate the object
is immortal or uses deferred reference counting (see `Deferred
reference counting`_).

The ``ob_ref_shared`` field stores the shared reference count. The
two *least* significant bits are used to store the reference
counting state. The shared reference count is therefore shifted left by
two. The ``ob_ref_shared`` field uses the least significant bits
because the shared reference count can be temporarily negative; increfs
and decrefs may not be balanced between threads.

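
As a rough illustration of this layout, the sketch below packs a signed
shared count and a two-bit state into one integer and recovers both parts.
The helper names (``shared_pack``, ``shared_count``, ``shared_state``,
``shared_add``) and the ``SHARED_*`` macros are hypothetical and are not
part of CPython. The sketch assumes only what is stated above: the count
sits above two low state bits and may be negative.

.. code-block:: c

    #include <assert.h>
    #include <stdint.h>

    /* Hypothetical names for illustration; not CPython's macros. */
    #define SHARED_SHIFT 2
    #define SHARED_STATE_MASK ((uintptr_t)0x3)

    /* Combine a (possibly negative) count with a 2-bit state.
       Multiplication is used instead of "<<" because left-shifting a
       negative signed value is undefined behavior in C. */
    static intptr_t shared_pack(intptr_t count, unsigned state)
    {
        return count * (1 << SHARED_SHIFT)
               + (intptr_t)(state & SHARED_STATE_MASK);
    }

    /* Extract the state from the two least significant bits. */
    static unsigned shared_state(intptr_t field)
    {
        return (unsigned)((uintptr_t)field & SHARED_STATE_MASK);
    }

    /* Extract the count; removing the state bits first makes the
       division exact, so it is also correct for negative counts. */
    static intptr_t shared_count(intptr_t field)
    {
        return (field - (intptr_t)shared_state(field)) / (1 << SHARED_SHIFT);
    }

    /* Adjust the count by delta without disturbing the state bits. */
    static intptr_t shared_add(intptr_t field, intptr_t delta)
    {
        return field + delta * (1 << SHARED_SHIFT);
    }

Under these assumptions, a count of ``-1`` with state ``0b01`` is stored
as ``-3``, and an incref or decref changes the field by ``4``, which is
why the pseudo-code in this PEP adds ``1 << _Py_SHARED_SHIFT``.
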
The possible reference counting states are listed below:

* ``0b00`` - default
* ``0b01`` - weakrefs
* ``0b10`` - queued
* ``0b11`` - merged

The states form a progression: during their lifecycle, objects may
transition to any numerically higher state. Objects can only be
deallocated from the "default" and "merged" states. Other states must
transition to the "merged" state before deallocation. Transitioning
states requires an atomic compare-and-swap on the ``ob_ref_shared``
field.

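
The forward-only progression can be sketched with a C11 compare-and-swap
loop. This is a minimal model, not the reference implementation: the
``advance_state`` helper and the ``STATE_*`` names are hypothetical, and
the sketch changes only the two state bits. The real transition to the
"merged" state also has to combine the local and shared counts, which is
omitted here.

.. code-block:: c

    #include <assert.h>
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical names mirroring the four states listed above. */
    enum {
        STATE_DEFAULT  = 0,
        STATE_WEAKREFS = 1,
        STATE_QUEUED   = 2,
        STATE_MERGED   = 3
    };
    #define STATE_MASK ((uintptr_t)0x3)

    /* Try to move the state bits of *shared forward to new_state while
       preserving the count stored in the upper bits.  Returns true if
       this call performed the transition, false if the field was
       already in the same or a numerically higher state. */
    static bool advance_state(atomic_intptr_t *shared, unsigned new_state)
    {
        intptr_t old = atomic_load(shared);
        for (;;) {
            unsigned cur = (unsigned)((uintptr_t)old & STATE_MASK);
            if (cur >= new_state) {
                return false;  /* states never move backwards */
            }
            intptr_t desired = old - (intptr_t)cur + (intptr_t)new_state;
            if (atomic_compare_exchange_weak(shared, &old, desired)) {
                return true;
            }
            /* The CAS failed and refreshed "old" with the current
               value (another thread changed the count or the state);
               loop and re-check. */
        }
    }

The retry loop matters because other threads may concurrently modify the
count bits of the same field; the compare-and-swap ensures a state change
is never applied on top of a stale count.
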
Default (``0b00``)
""""""""""""""""""

Objects are initially created in the default state. This is the only
state that allows for the quick deallocation code path. Otherwise, the
thread must merge the local and shared reference count fields, which
requires an atomic compare-and-swap.

This quick deallocation code path would not be thread-safe with
concurrent dereferencing of weakrefs, so the first time a weak
reference is created, the object is transitioned to the "weakrefs"
state if it is currently in the "default" state.

Similarly, the quick deallocation code path would not be thread-safe
with the lockless list and dictionary accesses (see `Optimistically
Avoiding Locking`_), so the first time a non-owning thread
attempts to retrieve an object in the "default" state it falls back to
the slower locking code path and transitions the object to
the "weakrefs" state.


Weakrefs (``0b01``)
"""""""""""""""""""

Objects in the "weakrefs" and higher states support dereferencing weakrefs
as well as the lockless list and dictionary access by non-owning
threads. They require transitioning to the merged state before
deallocation, which is more expensive than the quick deallocation code
path supported by the "default" state.


Queued (``0b10``)
"""""""""""""""""

The queued state indicates that a non-owning thread has requested
that the reference count fields be merged. This can happen when the
shared reference count becomes negative (due to an imbalance between
increfs and decrefs between threads). The object is inserted into the
owning thread's queue of objects to be merged. The owning thread is
notified via the ``eval_breaker`` mechanism. In practice, this
operation is rare. Most objects are only accessed by a single thread
and those objects accessed by multiple threads rarely have negative
shared reference counts.

If the owning thread has terminated, the acting thread immediately
merges the local and shared reference count fields and transitions to
the merged state.


Merged (``0b11``)
"""""""""""""""""

The merged state indicates that the object is not owned by any thread.
The ``ob_tid`` field is zero in this state and ``ob_ref_local`` is not
used. Once the shared reference count reaches zero, the object can
be deallocated from the merged state.


Reference counting pseudo-code
""""""""""""""""""""""""""""""


The proposed ``Py_INCREF`` and ``Py_DECREF`` operations should behave
as follows (using C-like pseudo-code):

.. code-block:: c

    // low two bits of "ob_ref_shared" are used for flags
    #define _Py_SHARED_SHIFT 2

    // immortal objects store UINT32_MAX in "ob_ref_local"
    // (see the Immortalization section)
    #define _Py_IMMORTAL_REFCNT UINT32_MAX

    void _Py_MergeZeroRefcount(PyObject *op);
    void _Py_DecRefShared(PyObject *op);

    void Py_INCREF(PyObject *op)
    {
        uint32_t new_local = op->ob_ref_local + 1;
        if (new_local == 0)
            return;  // object is immortal (UINT32_MAX + 1 wraps to 0)
        if (op->ob_tid == _Py_ThreadId())
            op->ob_ref_local = new_local;  // fast path: owning thread
        else
            atomic_add(&op->ob_ref_shared, 1 << _Py_SHARED_SHIFT);
    }

    void Py_DECREF(PyObject *op)
    {
        if (op->ob_ref_local == _Py_IMMORTAL_REFCNT) {
            return;  // object is immortal
        }
        if (op->ob_tid == _Py_ThreadId()) {
            op->ob_ref_local -= 1;  // fast path: owning thread
            if (op->ob_ref_local == 0) {
                _Py_MergeZeroRefcount(op);  // merge refcount
            }
        }
        else {
            _Py_DecRefShared(op);  // slow path
        }
    }

    void _Py_MergeZeroRefcount(PyObject *op)
    {
        if (op->ob_ref_shared == 0) {
            // quick deallocation code path (common case)
            op->ob_tid = 0;
            _Py_Dealloc(op);
        }
        else {
            // slower merging path not shown
        }
    }

The reference implementation [#nogil312]_ contains implementations of
``_Py_MergeZeroRefcount`` and ``_Py_DecRefShared``.

Note that the above is pseudocode: in practice, the implementation
should use "relaxed atomics" to access ``ob_tid`` and
``ob_ref_local`` to avoid undefined behavior in C and C++.


Deferred Reference Counting
'''''''''''''''''''''''''''

A few types of objects, such as top-level functions, code objects,
modules, and methods, tend to be frequently accessed by many threads
concurrently. These objects don't necessarily live for the lifetime of
the program, so immortalization is not a good fit. This PEP proposes a
limited form of deferred reference counting to avoid contention on
these objects' reference count fields in multi-threaded programs.

Typically, the interpreter modifies objects' reference counts as they
are pushed to and popped from the interpreter's stack. The
interpreter skips these reference counting operations for objects
that use deferred reference counting. Objects that support deferred
reference counting are marked by setting the two most significant
bits in the local reference count field to one.

Because some reference counting operations are skipped, the reference
count fields no longer reflect the true number of references to these
objects. The true reference count is the sum of the reference count
fields plus any skipped references from each thread's interpreter
stack. The true reference count can only be safely computed when all
threads are paused during cyclic garbage collection. Consequently,
objects that use deferred reference counting can only be deallocated
during garbage collection cycles.

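
The sum described above can be sketched as follows. This is only an
illustrative model: the ``thread_stack`` type and both helper functions
are hypothetical, interpreter stacks are represented as plain arrays of
object pointers, and the caller is assumed to have paused all threads
before reading them.

.. code-block:: c

    #include <assert.h>
    #include <stddef.h>

    /* Hypothetical model of one thread's interpreter stack. */
    typedef struct {
        const void **slots;  /* object pointers currently on the stack */
        size_t depth;        /* number of live slots */
    } thread_stack;

    /* Count the stack references to obj whose increfs were skipped.
       Assumes every thread is paused, so the stacks cannot change. */
    static ptrdiff_t count_stack_refs(const void *obj,
                                      const thread_stack *stacks,
                                      size_t nstacks)
    {
        ptrdiff_t n = 0;
        for (size_t t = 0; t < nstacks; t++) {
            for (size_t i = 0; i < stacks[t].depth; i++) {
                if (stacks[t].slots[i] == obj) {
                    n++;
                }
            }
        }
        return n;
    }

    /* True count = local field + shared field + skipped stack refs. */
    static ptrdiff_t true_refcount(const void *obj,
                                   ptrdiff_t local, ptrdiff_t shared,
                                   const thread_stack *stacks,
                                   size_t nstacks)
    {
        return local + shared + count_stack_refs(obj, stacks, nstacks);
    }

An object whose count fields sum to zero may still be referenced from a
stack, which is why such objects cannot be freed by an ordinary decref.
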
Note that the objects that use deferred reference counting already
naturally form reference cycles in CPython, so they would typically be
deallocated by the garbage collector even without deferred reference
counting. For example, top-level functions and modules form a reference
cycle as do methods and type objects.


Garbage Collector Modifications for Deferred Reference Counting
'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''

The tracing garbage collector finds and deallocates unreferenced
objects. Currently, the tracing garbage collector only finds
unreferenced objects that are part of a reference cycle. With
deferred reference counting, the tracing garbage collector will also
find and collect some unreferenced objects that may not be part of
any reference cycle, but whose collection has been delayed due to
deferred reference counting. This requires that all objects that
support deferred reference counting also have a corresponding type
object that supports tracing garbage collection (through the
``Py_TPFLAGS_HAVE_GC`` flag). Additionally, the garbage collector
will need to traverse each thread's stack to add references to the GC
reference count at the start of each collection.

Reference Counting Type Objects
'''''''''''''''''''''''''''''''

Type objects (``PyTypeObject``) use a mix of reference counting
techniques. Statically allocated type objects are immortalized because
the objects already live for the lifetime of the program. Heap type
objects use deferred reference counting in combination with per-thread
reference counting. Deferred reference counting is not sufficient to
address the multi-threaded scaling bottlenecks with heap types because
most references to heap types are from object instances, not references
on the interpreter stack.

To address this, heap type reference counts are partially stored in a
distributed manner in per-thread arrays. Every thread stores an
array of local reference counts, with one entry per heap type object.
Each heap type object is assigned a unique number that determines its
position in the local reference count arrays. A heap type's true
reference count is the sum of its entries in the per-thread arrays,
plus the reference count on the ``PyTypeObject``, plus any deferred
references in the interpreter stack.

Threads may grow their own type reference count arrays as needed when
incrementing or decrementing the local reference count of a type
object.

Use of the per-thread reference count arrays is limited to a few
places:

* ``PyType_GenericAlloc(PyTypeObject *type, Py_ssize_t nitems)``:
  Increments the current thread's local reference count for ``type``,
  if it is a heap type.
* ``subtype_dealloc(PyObject *self)``: Decrements the current thread's
  local reference count for ``self->ob_type``, if the type is a heap
  type.
* ``gcmodule.c``: Adds each thread's local reference counts to the
  ``gc_refs`` count for the corresponding heap type object.

Additionally, when a thread terminates, it adds any non-zero local
reference counts to each type object's own reference count field.


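The per-thread bookkeeping described in this section can be sketched in
plain C. This is an illustration under stated assumptions, not CPython's
implementation: the names ``heap_type``, ``thread_type_refs``,
``local_adjust``, ``true_refcount``, and ``merge_on_exit`` are
hypothetical, deferred references on the interpreter stack are left out,
and the merge on thread exit omits the atomic update a real
implementation would need.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a heap type object. */
typedef struct {
    size_t unique_id;    /* index into every thread's local array */
    long shared_refcnt;  /* the count stored on the type object itself */
} heap_type;

/* Hypothetical per-thread state: one slot per heap type id. */
typedef struct {
    long *counts;
    size_t size;
} thread_type_refs;

/* Grow the calling thread's array so that `id` is a valid index. */
static int ensure_slot(thread_type_refs *t, size_t id)
{
    if (id < t->size)
        return 0;
    size_t new_size = t->size ? t->size : 4;
    while (new_size <= id)
        new_size *= 2;
    long *p = realloc(t->counts, new_size * sizeof(long));
    if (p == NULL)
        return -1;
    for (size_t i = t->size; i < new_size; i++)
        p[i] = 0;
    t->counts = p;
    t->size = new_size;
    return 0;
}

/* Adjust the local count. Only the owning thread writes its array, so
   no atomic operation is used here. A slot may go negative when an
   instance is allocated on one thread and freed on another. */
static int local_adjust(thread_type_refs *t, const heap_type *tp, long delta)
{
    assert(delta == 1 || delta == -1);
    if (ensure_slot(t, tp->unique_id) < 0)
        return -1;
    t->counts[tp->unique_id] += delta;
    return 0;
}

/* Sum of the type's own field and every thread's entry.
   Deferred references on interpreter stacks are omitted in this sketch. */
static long true_refcount(const heap_type *tp,
                          const thread_type_refs *threads, size_t nthreads)
{
    long total = tp->shared_refcnt;
    for (size_t i = 0; i < nthreads; i++) {
        if (tp->unique_id < threads[i].size)
            total += threads[i].counts[tp->unique_id];
    }
    return total;
}

/* On thread exit, fold the local count back into the type's own field. */
static void merge_on_exit(thread_type_refs *t, heap_type *tp)
{
    if (tp->unique_id < t->size) {
        tp->shared_refcnt += t->counts[tp->unique_id];
        t->counts[tp->unique_id] = 0;
    }
}
```

Note that merging a thread's entry changes where the count is stored but
not the total, which is why the true count is only meaningful as a sum.
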
Memory Management
-----------------

CPython currently uses an internal allocator, pymalloc, which is
optimized for small object allocation. The pymalloc implementation is
not thread-safe without the GIL. This PEP proposes replacing pymalloc
with mimalloc, a general-purpose thread-safe allocator with good
performance, including for small allocations.

Using mimalloc, with some modifications, also addresses two other
issues related to removing the GIL. First, traversing the internal
mimalloc structures allows the garbage collector to find all Python
objects without maintaining a linked list. This is described in more
detail in the garbage collection section. Second, mimalloc heaps and
allocations based on size class enable collections like dict to
generally avoid acquiring locks during read-only operations. This is
described in more detail in the collection thread-safety section.

CPython already requires that objects that support garbage collection
use the GC allocator APIs (typically indirectly by calling
``PyType_GenericAlloc``). This PEP would add additional requirements
to the use of the Python allocator APIs. First, Python objects must
be allocated through object allocation APIs, such as
``PyType_GenericAlloc``, ``PyObject_Malloc``, or other Python APIs
that wrap those calls. Python objects should not be allocated through
other APIs, such as raw calls to C's malloc or the C++ new operator.
Additionally, ``PyObject_Malloc`` should be used only for allocating
Python objects; it should not be used for allocating buffers,
storages, or other data structures that are not PyObjects.

This PEP also imposes restrictions on the pluggable allocator API
(``PyMem_SetAllocator``). When compiling without the GIL, allocators
set using this API must eventually delegate the allocation to the
corresponding underlying allocator, such as ``PyObject_Malloc``, for
Python object allocations. This allows for allocators that "wrap"
underlying allocators, such as Python's tracemalloc and debug
allocator, but not for wholly replacing the allocator.


CPython Free Lists
''''''''''''''''''

CPython makes use of free lists to speed up the allocation of small,
frequently allocated objects like tuples and numbers. These free
lists are moved to ``PyThreadState`` from per-interpreter state.


Garbage Collection (Cycle Collection)
-------------------------------------

The CPython garbage collector requires the following changes to work
with this proposal:

* Use of "stop-the-world" to provide thread-safety guarantees that
  were previously provided by the GIL.
* Elimination of generational garbage collection in favor of a
  non-generational collector.
* Integration with deferred reference counting and biased reference
  counting.

Additionally, the above changes enable removing the
``_gc_prev`` and ``_gc_next`` fields from GC objects. The GC bits
that stored the tracked, finalized, and unreachable states are moved
to the ``ob_gc_bits`` field in the PyObject header.

Stop-the-World
''''''''''''''

The CPython cycle garbage collector currently relies on the global
interpreter lock to prevent other threads from accessing Python
objects while the collector finds cycles. The GIL is never released
during the cycle-finding routine, so the collector can rely on
stable (i.e., unchanging) reference counts and references for the
duration of that routine. However, following cycle detection, the GIL
may be temporarily released while calling objects' finalizers and
clear (``tp_clear``) functions, allowing other threads to run in an
interleaved fashion.

When running without the GIL, the implementation needs a way to ensure
that reference counts remain stable during cycle detection. Threads
running Python code must be paused to ensure that references and
reference counts remain stable. Once the cycles are identified, other
threads are resumed.

The current CPython cyclic garbage collector involves two
cycle-detection passes during each garbage collection cycle.
Consequently, this requires two stop-the-world pauses when running the
garbage collector without the GIL. The first cycle-detection pass
identifies cyclic trash. The second pass runs after finalizers to
identify which objects still remain unreachable. Note that other
threads are resumed before finalizers and ``tp_clear`` functions are
called to avoid introducing potential deadlocks that are not present in
the current CPython behavior.

Thread States
'''''''''''''

To support pausing threads for garbage collection, the PyThreadState
gets a new "status" field. Like the other fields in PyThreadState,
the status field is not part of the public CPython API. The status
field may be in one of three states:

* ``ATTACHED``
* ``DETACHED``
* ``GC``

The ``ATTACHED`` and ``DETACHED`` states correspond closely to
acquiring and releasing the global interpreter lock. When compiling
without the GIL, functions that previously acquired the GIL instead
transition the thread state to ``ATTACHED``, and functions that
previously released the GIL transition the thread state
to ``DETACHED``. Just as threads previously needed to acquire the
GIL before accessing or modifying Python objects, they now must be in
the ``ATTACHED`` state before accessing or modifying Python
objects. Since the same public C-API functions "attach" the thread as
previously acquired the GIL (e.g., ``PyEval_RestoreThread``), the
requirements for thread initialization in extensions remain the same.
The substantial difference is that multiple threads can be in the
attached state simultaneously, while previously only one thread could
acquire the GIL at a time.

During stop-the-world pauses, the thread performing garbage collection
needs to ensure that no other thread is accessing or modifying Python
objects. All other threads must be in the "GC" state. The garbage
collection thread can transition other threads from the ``DETACHED``
state to the GC state using an atomic compare-and-swap operation on
the status field. Threads in the ``ATTACHED`` state are requested to
pause themselves and set their status to "GC", using the
existing "eval breaker" mechanism. At the end of the stop-the-world
pause, all threads in the "GC" state are set to ``DETACHED`` and
woken up if they are paused. Threads that were previously attached
(i.e., executing Python bytecode) can re-attach (set their thread
states to ``ATTACHED``) and resume executing Python code. Threads
that were previously ``DETACHED`` ignore the notification.

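The status transitions can be sketched with C11 atomics. This is a
minimal illustration under stated assumptions: the type ``thread_state``,
the ``STATUS_*`` constants, and the four functions are hypothetical
names, and the eval breaker request, sleeping, and wake-up are not
modeled. The point it shows is that both attaching and parking use a
compare-and-swap from ``DETACHED``, so a thread and the collector cannot
both win.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical encoding of the three states. */
enum { STATUS_DETACHED = 0, STATUS_ATTACHED = 1, STATUS_GC = 2 };

typedef struct {
    atomic_int status;
} thread_state;

/* Called by the thread itself, where it previously acquired the GIL.
   Fails while the collector has parked this thread in the GC state. */
static bool try_attach(thread_state *ts)
{
    int expected = STATUS_DETACHED;
    return atomic_compare_exchange_strong(&ts->status, &expected,
                                          STATUS_ATTACHED);
}

/* Called by the thread itself, where it previously released the GIL. */
static void detach(thread_state *ts)
{
    assert(atomic_load(&ts->status) == STATUS_ATTACHED);
    atomic_store(&ts->status, STATUS_DETACHED);
}

/* Called by the collecting thread. Succeeds only for detached threads;
   an attached thread must instead be asked to pause itself. */
static bool gc_park_if_detached(thread_state *ts)
{
    int expected = STATUS_DETACHED;
    return atomic_compare_exchange_strong(&ts->status, &expected,
                                          STATUS_GC);
}

/* Called by the collecting thread at the end of the pause. */
static void gc_release(thread_state *ts)
{
    int expected = STATUS_GC;
    atomic_compare_exchange_strong(&ts->status, &expected,
                                   STATUS_DETACHED);
}
```

A thread that finds ``try_attach`` failing would wait until the pause
ends and then retry; that waiting is outside the scope of this sketch.
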
Generations
'''''''''''

The existing Python garbage collector uses three generations. When
compiling without the GIL, the garbage collector will only use a single
generation (i.e., it will be non-generational). The primary reason for
this change is to reduce the impact of the stop-the-world pauses in
multithreaded applications. Frequent stop-the-world pauses for
collecting the young generation would have more of an impact on
multi-threaded applications than less frequent collections.


Integration With Deferred and Biased Reference Counting
'''''''''''''''''''''''''''''''''''''''''''''''''''''''

To find unreferenced objects, the cyclic garbage collector computes
the difference between the number of incoming references and the
object's reference count. This difference is called ``gc_refs`` and
is stored in the ``_gc_prev`` field. If ``gc_refs`` is greater than
zero, then the object is guaranteed to be alive (i.e., not cyclic
trash). If ``gc_refs`` is zero, then the object is only alive if it
is transitively referenced by another live object. When computing
this difference, the collector should traverse each thread's stack,
and for every deferred reference, increment the ``gc_refs`` for the
referred object. Since generator objects also have stacks with
deferred references, the same procedure is applied to each
generator's stack.

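The ``gc_refs`` computation with deferred references can be illustrated
on a toy object graph. The sketch below is not the collector's code: the
``object`` struct, its fixed-size ``refs`` array, and
``compute_gc_refs`` are hypothetical, and every object on the stack is
assumed to also be in the tracked set.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_REFS 4

/* Hypothetical tracked object with explicit outgoing references. */
typedef struct object {
    long refcount;                  /* count stored in the object */
    long gc_refs;                   /* scratch value used by the GC */
    size_t nrefs;
    struct object *refs[MAX_REFS];  /* references to tracked objects */
} object;

static void compute_gc_refs(object **tracked, size_t ntracked,
                            object **stack, size_t nstack)
{
    /* Start from the stored reference counts. */
    for (size_t i = 0; i < ntracked; i++) {
        assert(tracked[i]->nrefs <= MAX_REFS);
        tracked[i]->gc_refs = tracked[i]->refcount;
    }
    /* Deferred references on the stack were never counted: add them. */
    for (size_t i = 0; i < nstack; i++)
        stack[i]->gc_refs += 1;
    /* Subtract references that come from other tracked objects. */
    for (size_t i = 0; i < ntracked; i++) {
        for (size_t k = 0; k < tracked[i]->nrefs; k++)
            tracked[i]->refs[k]->gc_refs -= 1;
    }
}
```

For a function and a module that reference each other, both end up with
a ``gc_refs`` of zero unless a stack holds a deferred reference to one of
them, in which case that object is known to be alive and the other is
alive transitively.
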
Python unit tests commonly use ``gc.collect()`` to ensure that any
unreferenced objects are destructed and their finalizers run. Since
biased reference counting can delay the destruction of some objects
that are referenced by multiple threads, it's convenient to ensure
that those objects are destructed during garbage collection, even
though they may not be part of any reference cycles. While other
threads are paused, the garbage collector thread should merge the
reference counts for any queued objects, but not call any destructors
even if the combined reference count is zero. (Calling destructors
while other threads are paused risks introducing deadlocks.) Once
other threads are resumed, the GC thread should call ``_Py_Dealloc``
on those objects with a zero merged reference count.

Container Thread-Safety
-----------------------

In CPython, the global interpreter lock protects against corruption of
internal interpreter states when multiple threads concurrently access
or modify Python objects. For example, if multiple threads
concurrently modify the same list, the GIL ensures that the length of
the list (``ob_size``) accurately matches the number of elements, and
that the reference counts of each element accurately reflect the
number of references to those elements. Without the GIL --- and
absent other changes --- concurrent modifications would corrupt those
fields and likely lead to program crashes.

The GIL does not necessarily ensure that operations are atomic or
remain correct when multiple operations occur concurrently. For
example, ``list.extend(iterable)`` may not appear atomic if the
iterable has an iterator implemented in Python (or releases the GIL
internally). Similarly, ``list.remove(x)`` can remove the wrong
object if it overlaps with another operation that modifies the list,
depending on the implementation of the equality operator. Still, the
GIL ensures that some operations are effectively atomic. For example,
the constructor ``list(set)`` atomically copies the items of the set
to a new list, and some code relies on that copy being atomic
(i.e., having a snapshot of the items in the set). This PEP preserves
that property.

This PEP proposes using per-object locks to provide many of the same
protections that the GIL provides. For example, every list,
dictionary, and set will have an associated lightweight lock. All
operations that modify the object must hold the object's lock. Most
operations that read from the object should acquire the object's lock
as well; the few read operations that can proceed without holding a
lock are described below.

Per-object locks with critical sections provide weaker protections
than the GIL. Because the GIL doesn't necessarily ensure that
concurrent operations are atomic or correct, the per-object locking
scheme also cannot ensure that concurrent operations are atomic or
correct. Instead, per-object locking aims for similar protections as
the GIL, but with mutual exclusion limited to individual objects.

Most operations on an instance of a container type require locking
that object. For example:

* ``list.append``, ``list.insert``, ``list.repeat``,
  ``PyList_SetItem``
* ``dict.__setitem__``, ``PyDict_SetItem``
* ``list.clear``, ``dict.clear``
* ``list.__repr__``, ``dict.__repr__``, etc.
* ``list.extend(iterable)``
* ``setiter_iternext``

Some operations operate directly on two container objects, with
knowledge about both containers' internal structure. For example,
there are internal specializations of ``list.extend(iterable)`` for
specific iterable types, like ``set``. These operations need to lock
both container objects because they access the internals of both
objects simultaneously. Note that the generic implementation of
``list.extend`` only needs to lock one object (the list) because the
other object is accessed indirectly through the thread-safe iterator
API. Operations that lock two containers are:

* ``list.extend(list)``, ``list.extend(set)``,
  ``list.extend(dictitems)``, and other specializations where the
  implementation is specialized for the argument type.
* ``list.concat(list)``
* ``list.__eq__(list)``, ``dict.__eq__(dict)``

Some simple operations can be implemented directly with atomic
accesses and do not need locks because they only access a single
field. These operations include:

* ``len(list)`` i.e., ``list_length(PyListObject *a)``
* ``len(dict)``
* ``len(set)``

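The split between locked modifications and lock-free single-field reads
can be sketched as follows. This is a toy model under stated
assumptions: ``locked_list`` is a hypothetical fixed-capacity container
of ``int`` values, and a C11 ``atomic_flag`` spin lock stands in for the
lightweight per-object lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define CAPACITY 8

typedef struct {
    atomic_flag lock;    /* stand-in for the per-object lock */
    atomic_size_t size;  /* analogue of ob_size */
    int items[CAPACITY];
} locked_list;

static void list_init(locked_list *l)
{
    atomic_flag_clear(&l->lock);
    atomic_init(&l->size, 0);
}

static void list_lock(locked_list *l)
{
    while (atomic_flag_test_and_set_explicit(&l->lock,
                                             memory_order_acquire)) {
        /* spin until the current holder releases the lock */
    }
}

static void list_unlock(locked_list *l)
{
    atomic_flag_clear_explicit(&l->lock, memory_order_release);
}

/* A modifying operation holds the lock for its whole duration, so the
   stored size always matches the number of initialized items. */
static int list_append(locked_list *l, int value)
{
    int status = -1;
    list_lock(l);
    size_t n = atomic_load_explicit(&l->size, memory_order_relaxed);
    if (n < CAPACITY) {
        l->items[n] = value;
        atomic_store_explicit(&l->size, n + 1, memory_order_release);
        status = 0;
    }
    list_unlock(l);
    return status;
}

/* len() reads one field, so a single atomic load suffices. */
static size_t list_length(locked_list *l)
{
    return atomic_load_explicit(&l->size, memory_order_acquire);
}
```
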
A select few operations optimistically avoid locking to improve
performance. These require special implementations and cooperation
from the memory allocator:

* ``list[idx]`` (``list_subscript``)
* ``dict[key]`` (``dict_subscript``)
* ``listiter_next``, ``dictiter_iternextkey/value/item``
* ``list.contains``

Borrowed References
'''''''''''''''''''

Per-object locking provides many of the important protections that the
GIL provides, but there are a few cases where it's not sufficient.
For example, code that relies on upgrading a borrowed reference to
an "owned" reference may be unsafe in certain circumstances:

.. code-block:: c

    PyObject *item = PyList_GetItem(list, idx);
    Py_INCREF(item);

The GIL ensures that no other thread can modify the list in between
the access and the ``Py_INCREF`` call. Without the GIL -- even with
per-object locking -- another thread might modify the list leading to
``item`` being freed between the access and the ``Py_INCREF`` call.

The problematic borrowed reference APIs are supplemented with
functions that return "new references" but are otherwise
equivalent:

* ``PyList_FetchItem(list, idx)`` for ``PyList_GetItem``
* ``PyDict_FetchItem(dict, key)`` for ``PyDict_GetItem``
* ``PyWeakref_FetchObject`` for ``PyWeakref_GetObject``

Note that some APIs that return borrowed references, such as
``PyTuple_GetItem``, are not problematic because tuples are
immutable. Similarly, not all uses of the above APIs are problematic.
For example, ``PyDict_GetItem`` is often used for parsing keyword
argument dictionaries in function calls; those keyword argument
dictionaries are effectively private (not accessible by other
threads).

Python Critical Sections
''''''''''''''''''''''''

Straightforward per-object locking could introduce deadlocks that were
not present when running with the GIL. Threads may hold locks for
multiple objects simultaneously because Python operations can nest.
Operations on objects can invoke operations on other objects,
acquiring multiple per-object locks. If threads try to acquire the
same locks in different orders, they will deadlock.

This PEP proposes a scheme called "Python critical sections" to
implicitly release per-object locks to avoid deadlocks. To
understand the scheme, we first introduce a general approach to avoid
deadlocks, and then propose a refinement of that approach with better
performance.

One way to avoid deadlocks is to allow threads to hold only the lock
(or locks) for a single operation at a time (typically a single lock,
but some operations involve two locks as described above). When a
thread begins a nested operation it should suspend the locks for any
outer operation: before beginning the nested operation, the locks for
the outer operation are released and when the nested operation
completes, the locks for the outer operation are reacquired.

Additionally, the locks for any active operation should be suspended
around potentially blocking operations, such as I/O (i.e., operations
that would have released the GIL). This is because the interaction
between locks and blocking operations can lead to deadlocks in the
same way as the interaction between multiple locks.

To improve performance, this PEP proposes a variation of the above
scheme that still avoids deadlocks. Instead of immediately
suspending locks any time a nested operation begins, locks are only
suspended if the thread would block (i.e., would have released the
GIL). This reduces the number of lock acquisitions and releases for
nested operations, while avoiding deadlocks.

The proposed API for Python critical sections consists of the
following four macros. These are intended to be public (usable by
C-API extensions), but not part of the limited API:

- ``Py_BEGIN_CRITICAL_SECTION(PyObject *op);``:
  Begins a critical section by acquiring the mutex for the referenced
  object. If the object is already locked, then locks for any
  outstanding critical sections are released before this thread waits
  for the referenced object to be unlocked.

- ``Py_END_CRITICAL_SECTION;``:
  Ends the most recent operation, unlocking the mutex. The next
  most recent previous critical section (if any) is resumed if it is
  currently suspended.

- ``Py_BEGIN_CRITICAL_SECTION2(PyObject *a, PyObject *b);``:
  Begins a critical section by acquiring the mutexes for two objects.
  To ensure consistent lock ordering, the order of acquisition is
  determined by memory address (i.e., the mutex with lower memory
  address is acquired first). If either mutex is already locked, then
  locks for any outstanding critical sections are released before this
  thread waits for the referenced objects to be unlocked.

- ``Py_END_CRITICAL_SECTION2;``:
  Behaves the same as ``Py_END_CRITICAL_SECTION`` but unlocks two
  objects.

Additionally, when a thread transitions from the ``ATTACHED`` state to
the ``DETACHED`` state, it should suspend any active critical
sections. When transitioning from ``DETACHED`` to ``ATTACHED``, the
most recent suspended critical section, if any, should be resumed.

Note that operations that lock two containers simultaneously need to use
the ``Py_BEGIN_CRITICAL_SECTION2`` macro. It is not sufficient to nest
two calls to ``Py_BEGIN_CRITICAL_SECTION`` because the inner critical
section may release the locks from the outer critical section.

Optimistically Avoiding Locking
'''''''''''''''''''''''''''''''

A few operations on ``dict`` and ``list`` optimistically avoid
acquiring the per-object locks. They have a fast path operation that
does not acquire locks, but may fall back to a slower operation that
acquires the dictionary's or list's lock when another thread is
concurrently modifying that container.

The operations with an optimistic fast path are:

* ``PyDict_FetchItem/GetItem`` and ``dict.__getitem__``
* ``PyList_FetchItem/GetItem`` and ``list.__getitem__``

Additionally, iterators for ``dict`` and ``list`` use the above
functions so they also optimistically avoid locking when returning
the next item.

There are two motivations for avoiding lock acquisitions in these
functions. The primary reason is that it is necessary for scalable
multi-threaded performance even for simple applications. Dictionaries
hold top-level functions in modules and methods for classes. These
dictionaries are inherently highly shared by many threads in
multi-threaded programs. Contention on these locks in multi-threaded
programs for loading methods and functions would inhibit efficient
scaling in many basic programs.

The secondary motivation for avoiding locking is to reduce overhead
and improve single-threaded performance. Although lock acquisition
has low overhead compared to most operations, accesses to individual
elements of lists and dictionaries are fast operations (so the
locking overhead is comparatively larger) and frequent (so the
overhead has more impact).

This section describes the challenges with implementing dictionary and
list accesses without locking, followed by a description of this PEP's
changes to the Python interpreter required to address those
challenges.

The main challenge is that retrieving an item from a list or
dictionary and incrementing the reference count of that item is not
an atomic operation. In between the time the item is retrieved and
the reference count is incremented, another thread may modify the
list or dictionary, possibly freeing the memory for the previously
retrieved item.

A partial attempt at addressing this issue would be to convert the
reference count increment to a conditional increment, only
incrementing the reference count if it's not zero. This change is
not sufficient because when a Python object's reference count reaches
zero, the object's destructor is called and the memory storing the
object may be re-used for other data structures or returned to the
operating system. Instead, this PEP proposes a technique to ensure
that the reference count fields remain valid for the duration of the
access, so that the conditional reference count increment is safe.
This technique requires cooperation from the memory allocator
(mimalloc) as well as changes to the list and dictionary objects. The
proposed technique is similar to read-copy update (RCU) [#rcu]_, a
synchronization mechanism widely used in the Linux kernel.

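A conditional increment of the kind described above can be sketched
with a compare-and-swap loop. This is a simplified illustration, not
CPython's ``_Py_TRY_INCREF``: the ``object`` struct and ``try_incref``
are hypothetical, and a single atomic counter is assumed in place of the
split local and shared fields used by biased reference counting.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical object with one atomic reference count field. */
typedef struct {
    atomic_long refcnt;
} object;

/* Increment the count only if it is currently non-zero.
   Returns false when the object is already dead (count of zero). */
static bool try_incref(object *op)
{
    long old = atomic_load_explicit(&op->refcnt, memory_order_relaxed);
    while (old != 0) {
        assert(old > 0);
        if (atomic_compare_exchange_weak_explicit(
                &op->refcnt, &old, old + 1,
                memory_order_acquire, memory_order_relaxed)) {
            return true;
        }
        /* On failure `old` now holds the current value; retry. */
    }
    return false;
}
```

The sketch is only safe if the memory behind ``op->refcnt`` still holds a
reference count when it is read, which is the guarantee the allocator
restrictions in this PEP are meant to provide.
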
The current implementation of ``list_item`` (the C function
implementing ``list.__getitem__``) is the following:

.. code-block:: c

    Py_INCREF(a->ob_item[i]);
    return a->ob_item[i];

The proposed implementation uses the conditional increment
(``_Py_TRY_INCREF``) and has additional checks:

.. code-block:: c

    PyObject **ob_item = atomic_load(&a->ob_item);
    PyObject *item = atomic_load(&ob_item[i]);
    if (!item || !_Py_TRY_INCREF(item)) goto retry;
    /* Re-check that the slot still holds the same item. */
    if (item != atomic_load(&ob_item[i])) {
      Py_DECREF(item);
      goto retry;
    }
    /* Re-check that the list still uses the same backing array. */
    if (ob_item != atomic_load(&a->ob_item)) {
      Py_DECREF(item);
      goto retry;
    }
    return item;


The "retry" subroutine implements the locked fallback path when
concurrent modifications to the list cause the above fast,
non-locking path to fail:

.. code-block:: c

    retry:
      PyObject *item;
      /* Take the list's per-object lock, using the macros described
         in the Python Critical Sections section. */
      Py_BEGIN_CRITICAL_SECTION(a);
      item = a->ob_item[i];
      Py_INCREF(item);
      Py_END_CRITICAL_SECTION;
      return item;

The modifications to the ``dict`` implementation are similar, because
the relevant parts of both list and dictionary retrieval involve
loading an item/value from an array at a known index.

The additional checks following the conditional increment are
necessary because the scheme allows immediate re-use of memory,
including the memory that previously held a ``PyObject`` structure or
``list`` or ``dict`` array. Without these extra checks, the function
might return a Python object that was never in the list, if the
memory occupied by the Python object previously held a different
``PyObject`` whose memory previously stored an item in the list.


Mimalloc Changes for Optimistic ``list`` and ``dict`` Access
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''

The implementation requires additional constraints to the memory
allocator, including some changes to the mimalloc code. Some
background on mimalloc's implementation is helpful to understand the
required changes. Individual allocations from mimalloc are
called "blocks." Mimalloc "pages" contain consecutive blocks that
are all the same size. A mimalloc "page" is similar to
a "superblock" in other allocators; it is NOT an operating system
page. A mimalloc "heap" contains pages of various size classes; each
page belongs to a single heap. If none of the blocks of a page are
allocated, then mimalloc may re-use the page for a different size
class or different heap (i.e., it might reinitialize the page).

The list and dictionary access scheme works by partially restricting
re-use of mimalloc pages so that reference count fields remain valid
for the duration of the access. The restricted re-use of mimalloc
pages is enforced by having separate heaps for Python objects
[#heaps]_. This ensures that even if an item is freed during access
and the memory reused for a new object, the new object's reference
count field is placed at the same location in memory. The reference
count field remains valid (or zero) across allocations.

Python objects that support ``Py_TPFLAGS_MANAGED_DICT`` have their
dictionary and weak reference fields preceding the ``PyObject``
header, so their reference count fields are at a different offset from
the start of their allocations. They are stored in a separate mimalloc
heap. Additionally, non-GC objects are stored in their own heap so
that the GC only has to look at GC objects. There are therefore three
mimalloc heaps for Python objects: one for non-GC objects, one for GC
objects with managed dictionaries, and one for GC objects without
managed dictionaries.


Mimalloc Page Reuse
'''''''''''''''''''

It is beneficial to keep the restrictions on mimalloc page reuse to a
short period of time to avoid increasing overall memory usage.
Precisely limiting the restrictions to list and dictionary accesses
would minimize memory usage, but would require expensive
synchronizations. At the other extreme, keeping the restrictions
until the next GC cycle would avoid introducing any extra
synchronizations, but would potentially increase memory usage.

This PEP proposes a system that lies between those two extremes based
on FreeBSD's "GUS" [#gus]_. It uses a combination of global and
per-thread counters (or "sequence numbers") to coordinate the
determination of when it is safe to reuse an empty mimalloc page for
a different heap or for a different size class, or to return it to
the operating system:

* There is a global write sequence number that monotonically
  increases.
* When a mimalloc page is empty, it's tagged with the current write
  sequence number. The thread may also atomically increment the
  global write sequence number.
* Each thread has a local read sequence number that records the most
  recent write sequence number it has observed.
* Threads may observe the write sequence number whenever they are not
  in a list or dictionary access. The reference implementation does
  this in mimalloc's slow-path allocation function. This is called
  regularly enough to be useful, but not so frequently as to
  introduce significant overhead.
* There is a global read sequence number that stores the minimum of
  all active threads' read sequence numbers. A thread may update the
  global read sequence number by scanning each thread's local read
  sequence number. The reference implementation does this before
  allocating a fresh mimalloc page if there are restricted pages
  that could possibly be reused.
* An empty mimalloc page may be reused for a different heap or size
  class when the global read sequence number is larger than the
  page's tag number.

The condition that the global read sequence number is larger than the
page's tag is sufficient because it ensures that any thread that had
a concurrent optimistic list or dictionary access is finished with
that access. In other words, there are no threads accessing the
empty blocks in the freed page, so the page can be used for any other
purpose or even returned to the operating system.

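The sequence-number rules above can be sketched as a small model. This
is an illustration under stated assumptions, not mimalloc or CPython
code: the fixed thread count ``NTHREADS``, the ``page`` struct, and all
function names are hypothetical, the write sequence number is always
incremented when a page is retired (the PEP says a thread "may" do so),
and updates of the global read sequence number are assumed not to race
with each other.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NTHREADS 2

static atomic_uint_fast64_t global_write_seq = 1;
static atomic_uint_fast64_t global_read_seq = 0;
static atomic_uint_fast64_t thread_read_seq[NTHREADS];

/* Hypothetical stand-in for an empty mimalloc page. */
typedef struct {
    uint64_t tag;
} page;

/* The page became empty: tag it with the current write sequence
   number, then advance the global write sequence number. */
static void page_retire(page *pg)
{
    pg->tag = atomic_fetch_add(&global_write_seq, 1);
}

/* Called by thread `tid` at a point where it is not inside an
   optimistic list or dictionary access. */
static void thread_observe(size_t tid)
{
    assert(tid < NTHREADS);
    atomic_store(&thread_read_seq[tid], atomic_load(&global_write_seq));
}

/* Recompute the global read sequence number as the minimum of all
   threads' local read sequence numbers. */
static uint64_t update_global_read_seq(void)
{
    uint64_t min = atomic_load(&thread_read_seq[0]);
    for (size_t i = 1; i < NTHREADS; i++) {
        uint64_t v = atomic_load(&thread_read_seq[i]);
        if (v < min)
            min = v;
    }
    atomic_store(&global_read_seq, min);
    return min;
}

/* Reuse is allowed once every thread has observed a write sequence
   number newer than the page's tag. */
static bool page_can_be_reused(const page *pg)
{
    return atomic_load(&global_read_seq) > pg->tag;
}
```

In this model a page stays restricted until the slowest thread has made
a fresh observation, which matches the role of the minimum in the rules
above.
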
Optimistic ``dict`` and ``list`` Access Summary
'''''''''''''''''''''''''''''''''''''''''''''''

This PEP proposes a technique for thread-safe list and dictionary
accesses that typically avoids acquiring locks. This reduces
execution overhead and avoids some multi-threaded scaling bottlenecks
in common operations, like calling functions and methods. The scheme
works by placing temporary restrictions on mimalloc page reuse to
ensure that objects' reference count fields remain valid after
objects are freed so that conditional reference count increment
operations are safe. The restrictions are placed on mimalloc pages
instead of on individual objects to improve opportunities for memory
reuse. The restrictions are lifted as soon as the system can
determine that there are no outstanding accesses involving the empty
mimalloc page. To determine this, the system uses a combination of
lightweight per-thread sequence counters and also tags pages when
they are empty. Once each thread's local counter is larger than the
page's tag, it can be reused for any purpose or returned to the
operating system. The restrictions are also lifted whenever the
cyclic garbage collector runs because the stop-the-world pause
ensures that threads do not have any outstanding references to empty
mimalloc pages.


Specializing Interpreter
------------------------

The specializing interpreter requires some changes to be thread-safe
when running without the GIL:

* Concurrent specializations are prevented by using a mutex. This
  prevents multiple threads writing to the same inline cache.
* In multi-threaded programs running without the GIL, each bytecode is
  only specialized once. This prevents a thread from reading a
  partially written inline cache.
* Locking also ensures that cached values of ``tp_version_tag`` and
  ``keys_version`` are consistent with the cached descriptors and other
  values.
* Modifications to inline counters use "relaxed atomics". In other
  words, some counter decrements may be missed or overwritten, but that
  does not affect correctness.


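The "specialize at most once, under a mutex" rule can be sketched as a
small Python model. This is only an illustration of the control flow:
the real logic is C code inside the interpreter, a Python model says
nothing about memory ordering, and the names ``Instruction``,
``_specialize_lock``, and ``try_specialize`` are hypothetical.

.. code-block:: python

    import threading


    class Instruction:
        """Hypothetical model of one bytecode instruction and its inline cache."""

        def __init__(self, opcode):
            self.opcode = opcode
            self.cache = None
            self.specialized = False


    _specialize_lock = threading.Lock()


    def try_specialize(instr, new_opcode, cache):
        """Specialize instr at most once; return True if this call did so."""
        with _specialize_lock:
            if instr.specialized:
                return False            # already specialized: leave the cache alone
            instr.cache = cache         # fill the inline cache first ...
            instr.opcode = new_opcode   # ... then switch to the specialized opcode
            instr.specialized = True
            return True


    instr = Instruction("LOAD_ATTR")
    print(try_specialize(instr, "LOAD_ATTR_INSTANCE_VALUE", {"version": 1}))  # True
    print(try_specialize(instr, "LOAD_ATTR_SLOT", {"version": 2}))            # False
    print(instr.opcode)                                # LOAD_ATTR_INSTANCE_VALUE

Because a second specialization is refused, the cache written by the
first one is never overwritten while another thread might be reading it.
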
``Py_mod_gil`` Slot
-------------------

In ``--disable-gil`` builds, when loading an extension, CPython will
check for a new :pep:`489`-style ``Py_mod_gil`` slot. If the slot is
set to ``Py_mod_gil_not_used``, then extension loading proceeds as
normal. If the slot is not set, the interpreter pauses all threads and
enables the GIL before continuing. Additionally, the interpreter will
issue a visible warning that names the extension, states that the GIL
was enabled (and why), and describes the steps the user can take to
override it.


``PYTHONGIL`` Environment Variable
----------------------------------

In ``--disable-gil`` builds, the user can also override the behavior at
runtime by setting the ``PYTHONGIL`` environment variable. Setting
``PYTHONGIL=0`` forces the GIL to be disabled, overriding the module
slot logic. Setting ``PYTHONGIL=1`` forces the GIL to be enabled.

The ``PYTHONGIL=0`` override is important because extensions that are
not thread-safe can still be useful in multi-threaded applications. For
example, one may want to use the extension from only a single thread or
guard access by locks. For context, there are already some extensions
that are not thread-safe even with the GIL, and users already have to
take these sorts of steps.

The ``PYTHONGIL=1`` override is sometimes useful for debugging.


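As an illustration, a script can report which mode it is actually
running in. The sketch below assumes the ``sys._is_gil_enabled()``
function provided by CPython 3.13; that function is not described in
this PEP, so the code falls back to ``True`` when it is absent. The
helper name ``gil_enabled`` is hypothetical.

.. code-block:: python

    import os
    import sys


    def gil_enabled():
        """Return True if the GIL is active in this interpreter."""
        check = getattr(sys, "_is_gil_enabled", None)
        # Older interpreters have no such function and always use the GIL.
        return True if check is None else bool(check())


    print("PYTHONGIL =", os.environ.get("PYTHONGIL", "<unset>"))
    print("GIL enabled:", gil_enabled())

Running the script with ``PYTHONGIL=0`` or ``PYTHONGIL=1`` in the
environment is one way to check that an override took effect.
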
Rationale
=========

Non-Generational Garbage Collection
-----------------------------------

This PEP proposes switching from a generational cyclic garbage
collector to a non-generational collector (when CPython is built
without the GIL). That is equivalent to only having one generation
(the "old" generation). There are two reasons for this proposed
change.

Cyclic garbage collection, even for just the young generation,
requires pausing other threads in the program. The author is
concerned that frequent collections of the young generation would
inhibit efficient scaling in multi-threaded programs. This is a
concern for young generations (but not the old generation) because
the young generations are collected after a fixed number of
allocations, while the collections for the older generation are
scheduled in proportion to the number of live objects in the heap.
Additionally, it is difficult to efficiently keep track of objects in
each generation without the GIL. For example, CPython currently uses
a linked list of objects in each generation. If CPython were to keep
that design, those lists would need to be made thread-safe, and it's
not clear how to do that efficiently.

Generational garbage collection is used to good effect in many other
language runtimes. For example, many of the Java HotSpot garbage
collector implementations use multiple generations [#hotspotgc]_. In
these runtimes, a young generation is frequently a throughput win:
since a large percentage of the young generation is typically "dead,"
the GC is able to reclaim a large amount of memory relative to the
amount of work performed. For example, several Java benchmarks show
over 90% of "young" objects are typically collected [#decapo]_
[#exploitingmemoryjava]_. This is commonly referred to as the "weak
generational hypothesis;" the observation is that most objects die
young. This pattern is reversed in CPython due to the use of
reference counting. Although most objects still die young, they are
collected when their reference counts reach zero. Objects that
survive to a garbage collection cycle are most likely to remain
alive [#cpythongc]_. This difference means that generational
collection is much less effective in CPython than in many other
language runtimes [#golangc]_.


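The division of labor between reference counting and the cyclic
collector can be observed from Python code. The following sketch
assumes CPython's behavior (other implementations may differ); the
class ``Node`` and the variable names are made up for the example.

.. code-block:: python

    import gc
    import weakref


    class Node:
        pass


    gc.disable()                      # keep the cyclic collector out of the way

    acyclic = Node()
    acyclic_ref = weakref.ref(acyclic)
    del acyclic                       # refcount reaches zero: freed right away
    freed_by_refcount = acyclic_ref() is None

    cyclic = Node()
    cyclic.self_ref = cyclic          # the cycle keeps the refcount above zero
    cyclic_ref = weakref.ref(cyclic)
    del cyclic
    alive_before_gc = cyclic_ref() is not None

    gc.collect()                      # only the cyclic collector can free it
    freed_by_gc = cyclic_ref() is None
    gc.enable()

    print(freed_by_refcount, alive_before_gc, freed_by_gc)

Short-lived acyclic objects never reach the cyclic collector, so the
objects it does examine are skewed toward long-lived ones.
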
Optimistic Avoiding Locking in ``dict`` and ``list`` Accesses
-------------------------------------------------------------

This proposal relies on a scheme that mostly avoids acquiring locks
when accessing individual elements in lists and dictionaries. Note
that this is not "lock free" in the sense of "lock-free"
and "wait-free" algorithms that guarantee forward progress. It
simply avoids acquiring locks (mutexes) in the common case to improve
parallelism and reduce overhead.

A much simpler alternative would be to use reader-writer locks to
protect dictionary and list accesses. Reader-writer locks allow
concurrent reads, but not updates, which might seem ideal for lists
and dictionaries. The problem is that reader-writer locks have
substantial overhead and poor scalability, particularly when the
critical sections are small, as they are for single-element
dictionary and list accesses [#perfbook]_. The poor reader
scalability stems from the fact that readers must all update the same
data structure, such as the number of readers in
``pthread_rwlocks``.

The technique described in this PEP is related to RCU
("read-copy-update") [#rcu]_ and, to a lesser extent, hazard
pointers, two well-known schemes for optimizing concurrent,
read-mostly data structures. RCU is widely used in the Linux kernel
to protect shared data structures in a scalable manner. Both the
technique in this PEP and RCU work by deferring reclamation while
readers may be accessing the concurrent data structure. RCU is most
commonly used to protect individual objects (like hash tables or
linked lists), while this PEP proposes a scheme to protect larger
blocks of memory (mimalloc "pages") [#typesafe_rcu]_.

The need for this scheme is largely due to the use of reference
counting in CPython. If CPython only relied on a tracing garbage
collector, then this scheme would probably not be necessary because
tracing garbage collectors already defer reclamation in the required
manner. This would not "solve" scaling issues, but would shift many
of the challenges to the garbage collector implementation.


Backwards Compatibility
=======================

This PEP poses a number of backwards compatibility issues when
building CPython with the ``--disable-gil`` flag, but those issues do
not occur when using the default build configuration. Nearly all the
backwards compatibility concerns involve the C-API:

* CPython builds without the GIL will not be ABI compatible with the
  standard CPython build or with the stable ABI due to changes to the
  Python object header needed to support biased reference counting.
  C-API extensions will need to be rebuilt specifically for this
  version.
* C-API extensions that rely on the GIL to protect global state or
  object state in C code will need additional explicit locking to
  remain thread-safe when run without the GIL.
* C-API extensions that use borrowed references in ways that are not
  safe without the GIL will need to use the equivalent new APIs that
  return non-borrowed references. Note that only some uses of
  borrowed references are a concern; only references to objects that
  might be freed by other threads pose an issue.
* Custom memory allocators (``PyMem_SetAllocator``) are required to
  delegate the actual allocation to the previously set allocator. For
  example, the Python debug allocator and tracing allocators will
  continue to work because they delegate the allocation to the
  underlying allocator. On the other hand, wholesale replacing of the
  allocator (e.g., with jemalloc or tcmalloc) will not work
  correctly.
* Python objects must be allocated through the standard APIs, such as
  ``PyType_GenericNew`` or ``PyObject_Malloc``. Non-Python objects
  must **not** be allocated through those APIs. For example, it is
  currently acceptable to allocate buffers (non-Python objects)
  through ``PyObject_Malloc``; that will no longer be allowed and
  buffers should instead be allocated through ``PyMem_Malloc``,
  ``PyMem_RawMalloc``, or ``malloc``.

There are fewer potential backwards compatibility issues for Python
code:

* Destructors and weak reference callbacks for code objects and
  top-level function objects are delayed until the next cyclic
  garbage collection due to the use of deferred reference counting.
* Destructors for some objects accessed by multiple threads may be
  delayed slightly due to biased reference counting. This is rare:
  most objects, even those accessed by multiple threads, are
  destroyed immediately as soon as their reference counts are zero.
  Two places in the Python standard library tests required
  ``gc.collect()`` calls to continue to pass.


Distribution
============

This PEP poses new challenges for distributing Python. At least for
some time, there will be two versions of Python requiring separately
compiled C-API extensions. It may take some time for C-API extension
authors to build ``--disable-gil`` compatible packages and upload
them to PyPI. Additionally, some authors may be hesitant to support
the ``--disable-gil`` mode until it has wide adoption, but adoption
will likely depend on the availability of Python's rich set of
extensions.

To mitigate this, the author will work with Anaconda to distribute
a ``--disable-gil`` version of Python together with compatible
packages from conda channels. This centralizes the challenges of
building extensions, and the author believes this will enable more
people to use Python without the GIL sooner than they would otherwise
be able to.


Performance
===========

The changes to make CPython thread-safe without the GIL increase
execution overhead for ``--disable-gil`` builds. The performance
impact is different for programs that use only a single thread compared
to programs that use multiple threads, so the table below reports
execution overhead separately for these two types of programs.


.. list-table:: Execution Overhead on pyperformance 1.0.6
   :header-rows: 1
   :widths: auto

   * -
     - Intel Skylake
     - AMD Zen 3
   * - One thread
     - 6%
     - 5%
   * - Multiple threads
     - 8%
     - 7%

The baseline used to measure overhead is ``018be4c`` from `PR 19474`_,
which implements immortal objects for Python 3.12. The largest
contribution to execution overhead is biased reference counting
followed by per-object locking. For thread-safety reasons, an
application running with multiple threads will only specialize a given
bytecode once; this is why the overhead for programs that use multiple
threads is larger compared to programs that only use one thread.
However, with the GIL disabled, programs that use multiple threads
should also be able to more effectively use multiple CPU cores.

Note that this PEP would not affect the performance of the default
(non ``--disable-gil``) builds of CPython.

.. _PR 19474: https://github.com/python/cpython/pull/19474


Build Bots
==========

The stable build bots will also include ``--disable-gil`` builds.


How to Teach This
=================

As part of implementing the ``--disable-gil`` mode, the author will
write a "HOWTO" guide [#howto]_ for making packages compatible when
running Python without the GIL.


Reference Implementation
========================

There are two GitHub repositories implementing versions of CPython
without the GIL:

* https://github.com/colesbury/nogil-3.12
* https://github.com/colesbury/nogil

The ``nogil-3.12`` repository is based on Python 3.12.0a4. It is
useful for evaluating single-threaded execution overhead and as a
reference implementation for this PEP. It is less useful for
evaluating C-API extension compatibility because many extensions are
not currently compatible with Python 3.12. Due to limited time for
the 3.12 port, the ``nogil-3.12`` implementation does not skip all
deferred reference counts. As a temporary workaround, the
implementation immortalizes objects that use deferred reference
counting in programs that spawn multiple threads.

The ``nogil`` repository is based on Python 3.9.10. It is useful for
evaluating multi-threading scaling in real world applications and
extension compatibility. It is more stable and well tested than the
``nogil-3.12`` repository.

Alternatives
============

Python currently supports a number of ways to enable parallelism, but
the existing techniques come with significant limitations.

Multiprocessing
---------------

The multiprocessing library allows Python programs to start and
communicate with Python subprocesses. This allows for parallelism
because each subprocess has its own Python interpreter (i.e., there's
one GIL per process). Multiprocessing has a few substantial
limitations. Communication between processes is limited: objects
generally need to be serialized or copied to shared memory. This
introduces overhead (due to serialization) and complicates building
APIs on top of multiprocessing. Starting a subprocess is also more
expensive than starting a thread, especially with the "spawn"
implementation. Starting a thread takes ~100 µs, while spawning a
subprocess takes ~50 ms (50,000 µs) due to Python re-initialization.

Finally, many C and C++ libraries support access from multiple
threads but do not support access or use across multiple processes.

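The startup-cost gap can be measured roughly with the standard
library. The sketch below is an approximation, not the benchmark
behind the figures above: it launches a fresh interpreter with
``subprocess`` as a proxy for the re-initialization cost of a "spawn"
worker, and the function names are hypothetical. Absolute numbers
depend heavily on the machine and operating system.

.. code-block:: python

    import subprocess
    import sys
    import threading
    import time


    def time_thread_start():
        """Seconds to start and join one thread that does nothing."""
        start = time.perf_counter()
        t = threading.Thread(target=lambda: None)
        t.start()
        t.join()
        return time.perf_counter() - start


    def time_interpreter_start():
        """Seconds to launch a fresh Python interpreter that exits at once."""
        start = time.perf_counter()
        subprocess.run([sys.executable, "-c", "pass"], check=True)
        return time.perf_counter() - start


    print(f"thread:      {time_thread_start() * 1e6:.0f} us")
    print(f"interpreter: {time_interpreter_start() * 1e6:.0f} us")
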
Releasing the GIL in C-API Extensions
-------------------------------------

C-API extensions can release the GIL around long-running functions.
This allows for some degree of parallelism, since multiple threads
can run concurrently when the GIL is released, but the overhead of
acquiring and releasing the GIL typically prevents this from scaling
efficiently beyond a few threads. Many scientific computing
libraries release the GIL in computationally heavy functions, and the
CPython standard library releases the GIL around blocking I/O.

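The effect of releasing the GIL around a blocking call can be seen
with ``time.sleep``, which stands in here for blocking I/O. This is a
minimal sketch; the helper name ``run_sleepers`` is hypothetical.

.. code-block:: python

    import threading
    import time


    def run_sleepers(n_threads, seconds):
        """Run n_threads threads that each block in time.sleep()."""
        threads = [threading.Thread(target=time.sleep, args=(seconds,))
                   for _ in range(n_threads)]
        start = time.perf_counter()
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return time.perf_counter() - start


    elapsed = run_sleepers(4, 0.2)
    print(f"4 threads x 0.2 s of blocking took {elapsed:.2f} s")

The four waits overlap, so the total is close to 0.2 s rather than
0.8 s. The same overlap does not occur for threads that spend their
time executing Python bytecode while the GIL is enabled.
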
Internal Parallelization
------------------------

Functions implemented in C may use multiple threads internally. For
example, Intel's NumPy distribution, PyTorch, and TensorFlow all use
this technique to internally parallelize individual operations. This
works well when the basic operations are large enough to be
parallelized efficiently, but not when there are many small
operations or when the operations depend on some Python code. Calling
into Python from C requires acquiring the GIL -- even short snippets
of Python code can inhibit scaling.


Related Work
=============


Per-Interpreter GIL
-------------------

The recently accepted :pep:`684` proposes a per-interpreter GIL to
address multi-core parallelism. This would allow parallelism between
interpreters in the same process, but places substantial restrictions
on sharing Python data between interpreters. Both this PEP
and :pep:`684` address multi-core parallelism, but with different
tradeoffs and techniques. It is feasible to implement both PEPs in
CPython at the same time.


Gilectomy
---------

Gilectomy [#gilectomy]_ was a project by Larry Hastings to remove the
GIL in CPython. Like the design proposed by this PEP, the Gilectomy
supported multiple threads running in parallel within the same
interpreter (i.e., "free-threading") and made use of fine-grained
locking. The reference implementation in this PEP improves on
single-threaded performance and scalability compared to the
Gilectomy.


PyParallel
----------

PyParallel [#pyparallel]_ was a proof-of-concept fork of Python 3.3 by
Trent Nelson that supported multiple threads running simultaneously
in a single Python process. The fork introduced the concept
of "parallel threads" -- threads that can run simultaneously while
the main Python thread is suspended. Parallel threads had read-only
access to objects created by the main thread. Objects created within
parallel threads lived for the lifetime of the creating thread. For
HTTP servers, this might correspond to the lifetime of a request.


python-safethread
-----------------

The python-safethread [#pythonsafethread]_ project was a patch to
Python 3.0 by Adam Olsen to remove the GIL. Some aspects of the
project are similar to the design proposed by this PEP. Both use
fine-grained locking and optimize reference counting for cases
where the object is created and accessed by the same thread.


Greg Stein's Free-Threading Patch
---------------------------------

In 1996, Greg Stein published a patch against Python 1.4 that removed
the GIL [#gsteinpatch]_. The patch used atomic reference counting on
Windows and a global reference count lock on Linux. List and
dictionary accesses were protected by mutexes. Parts of the patch
were adopted in CPython. In particular, the patch introduced a
PyThreadState structure and correct per-thread exception handling.

Dave Beazley revisited the patch in a 2011 blog post [#dabeaz]_.


Jython and IronPython
---------------------

Some alternative Python implementations like Jython [#jython]_ and
IronPython [#ironpython]_ do not have a global interpreter lock.
However, they do not support CPython extensions. (The implementations
can interface with code written in Java or C#).


PyPy-STM
--------

The pypy-stm [#pypystm]_ interpreter is a variant of PyPy that uses
software transactional memory. The authors report single-threaded
performance overhead in the 20%-50% range compared to PyPy. It is
not compatible with CPython extensions.


Rejected Ideas
==============

Why Not Use a Concurrent Garbage Collector?
-------------------------------------------

Many recent garbage collectors are mostly concurrent -- they avoid long
stop-the-world pauses by allowing the garbage collector to run
concurrently with the application. So why not use a concurrent
collector?

Concurrent collection requires write barriers (or read barriers). The
author is not aware of a way to add write barriers to CPython without
substantially breaking the C-API.


Why Not Deprecate ``PyDict_GetItem`` in Favor of ``PyDict_FetchItem``?
----------------------------------------------------------------------

This PEP proposes a new API ``PyDict_FetchItem`` which behaves like
``PyDict_GetItem``, but returns a new reference instead of a borrowed
reference. As described in `Borrowed References`_, some uses of
borrowed references that were safe when running with the GIL are
unsafe when running without the GIL and need to be replaced by
functions like ``PyDict_FetchItem`` that return new references.

This PEP does *not* propose deprecating ``PyDict_GetItem`` and similar
functions that return borrowed references for a few reasons:

* Many of the uses of borrowed references are safe, even when running
  without the GIL. For example, C API functions often use
  ``PyDict_GetItem`` to retrieve items from the keyword
  argument dictionary. These calls are safe because the keyword
  argument dictionary is only visible to a single thread.
* I tried this approach early on and found that wholesale replacing of
  ``PyDict_GetItem`` with ``PyDict_FetchItem`` frequently introduced
  new reference counting bugs. In my opinion, the risk of
  introducing new reference counting bugs generally outweighs the
  risks of missing a ``PyDict_GetItem`` call that is unsafe without
  the GIL.


Why Not Use PEP 683 Immortalization?
------------------------------------

Like :pep:`683`, this PEP proposes an immortalization scheme for
Python objects, but the PEPs use different bit representations to
mark immortal objects. The schemes cannot be identical because this
PEP depends on biased reference counting, which has two reference
count fields instead of one.


Open Issues
===========

Improved Specialization
-----------------------

The Python 3.11 release introduced quickening and specialization as part
of the Faster CPython project, substantially improving performance.
Specialization replaces slow bytecode instructions with faster
variants [#pep659]_. To maintain thread-safety, applications that use
multiple threads (and run without the GIL) will only specialize each
bytecode once, which can lower performance on some programs. It is
possible to support specializing multiple times, but that requires more
investigation and is not part of this PEP.


Python Build Modes
------------------

This PEP introduces a new build mode (``--disable-gil``) that is not
ABI compatible with the standard build mode. The additional build
mode adds complexity for both Python core developers and extension
developers. The author believes a worthwhile goal is to combine
these build modes and have the global interpreter lock controlled at
runtime, possibly disabled by default. The path to this goal remains
an open issue, but a possible path might look like the following:

#. In 2024, CPython 3.13 is released with support for a
   ``--disable-gil`` build time flag. There are two ABIs for
   CPython, one with the GIL and one without. Extension authors
   target both ABIs.
#. After 2--3 releases (i.e., in 2026--2027), CPython is released
   with the GIL controlled by a runtime environment variable or
   flag. The GIL is enabled by default. There is only a single ABI.
#. After another 2--3 releases (i.e., 2028--2030), CPython switches to
   the GIL being disabled by default. The GIL can still be enabled
   at runtime via an environment variable or command line flag.

This PEP covers the first step, with the remaining steps left as open
issues. In this scenario, there would be a two to three year period
where extension authors would target an extra CPython build per
supported CPU architecture and OS.

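While two build modes coexist, tooling may need to tell them apart.
One way to do so, sketched below, assumes the ``Py_GIL_DISABLED``
configuration variable that CPython 3.13 exposes through ``sysconfig``;
this PEP does not specify that variable, and the helper name
``is_free_threaded_build`` is hypothetical.

.. code-block:: python

    import sysconfig


    def is_free_threaded_build():
        """True if this interpreter was compiled without the GIL."""
        # The variable is 1 on --disable-gil builds and 0 or None otherwise.
        return bool(sysconfig.get_config_var("Py_GIL_DISABLED"))


    print("free-threaded build:", is_free_threaded_build())

This reports how the interpreter was built, not whether the GIL is
currently active: a ``--disable-gil`` build may still run with the GIL
enabled.
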
Integration
-----------

The reference implementation changes approximately 15,000 lines of code
in CPython and includes mimalloc, which is also approximately 15,000
lines of code. Most changes are not performance sensitive and can be
included in both ``--disable-gil`` and the default builds. Some
macros, like ``Py_BEGIN_CRITICAL_SECTION``, will be no-ops in the
default build. The author does not expect a huge number of ``#ifdef``
statements to support the ``--disable-gil`` builds.


Mitigations for Single-Threaded Performance
-------------------------------------------

The changes proposed in the PEP will increase execution overhead for
``--disable-gil`` builds compared to Python builds with the GIL. In
other words, it will have slower single-threaded performance. There
are some possible optimizations to reduce execution overhead,
especially for ``--disable-gil`` builds that only use a single
thread. These may be worthwhile if a longer term goal is to have a
single build mode, but the choice of optimizations and their
trade-offs remain an open issue.


References
==========

.. [#yuemmwang2019] "Exploiting Parallelism Opportunities with Deep Learning Frameworks."
   Yu Emma Wang, Carole-Jean Wu, Xiaodong Wang, Kim Hazelwood, David Brooks. 2019.
   https://arxiv.org/abs/1908.04705.

.. [#torchdeploy] "Using Python for Model Inference in Deep Learning."
   Zachary DeVito, Jason Ansel, Will Constable, Michael Suo, Ailing Zhang, Kim Hazelwood. 2021.
   https://arxiv.org/abs/2104.00254. See Figure 5.

.. [#brc] "Biased reference counting: minimizing atomic operations in garbage collection".
   Jiho Choi, Thomas Shull, and Josep Torrellas. PACT 2018.
   https://dl.acm.org/doi/abs/10.1145/3243176.3243195.

.. [#pep683] :pep:`683` -- Immortal Objects, Using a Fixed Refcount.

.. [#tid] https://github.com/colesbury/nogil/blob/f7e45d6bfbbd48c8d5cf851c116b73b85add9fc6/Include/object.h#L428-L455.

.. [#rcu] "What is RCU, Fundamentally?"
   Paul E. McKenney, Jonathan Walpole. 2017.
   https://lwn.net/Articles/262464/

.. [#heaps] There are two heaps for Python objects because PyObjects
   that support cyclic garbage collection have extra fields preceding
   the PyObject struct.

.. [#gus] "Global Unbounded Sequences (GUS)"
   https://github.com/freebsd/freebsd-src/blob/9408f36627b74a472dc82f7a43320235c0c9055a/sys/kern/subr_smr.c#L44.
   See also https://people.kernel.org/joelfernandes/gus-vs-rcu.

.. [#perfbook] "Is Parallel Programming Hard, And, If So, What Can You Do About It?"
   Paul E. McKenney. 2022.
   https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html.

.. [#typesafe_rcu] ``SLAB_TYPESAFE_BY_RCU`` is an example in which RCU
   protects blocks of memory and not any individual object. See
   https://www.kernel.org/doc/html/latest/RCU/whatisRCU.html#analogy-with-reference-counting.

.. [#hotspotgc] "HotSpot Virtual Machine Garbage Collection Tuning Guide."
   https://docs.oracle.com/en/java/javase/12/gctuning/hotspot-virtual-machine-garbage-collection-tuning-guide.pdf.
   Most of the hotspot garbage collectors are generational, with the
   notable exception of ZGC, although there is ongoing work to make
   that generational.

.. [#decapo] `The DaCapo Benchmarks: Java Benchmarking Development and
   Analysis
   <https://openresearch-repository.anu.edu.au/bitstream/1885/33723/2/01_Blackburn_The_DaCapo_Benchmarks:_Java_2006.pdf>`_.
   See column "Nursery Survival" in Table 4.

.. [#exploitingmemoryjava] "Exploiting memory usage patterns to improve garbage collections in Java."
   https://dl.acm.org/doi/abs/10.1145/1852761.1852768.

.. [#cpythongc] "most things usually turn out to be reachable"
   https://github.com/python/cpython/blob/cd6655a8589e99ae4088b3bed4a692a19ed48779/Modules/gcmodule.c#L1106.

.. [#golangc] The Go team observed something similar in Go, but due to
   escape analysis and pass-by-value instead of reference
   counting. Recent versions of Go use a non-generational garbage
   collector. https://go.dev/blog/ismmkeynote.

.. [#nogil] https://github.com/colesbury/nogil.

.. [#nogil312] https://github.com/colesbury/nogil-3.12.

.. [#howto] Python HOWTOs.
   https://docs.python.org/3/howto/index.html.

.. [#pep659] :pep:`659` -- Specializing Adaptive Interpreter.

.. [#gilectomy] Gilectomy.
   Larry Hastings. 2016.
   https://github.com/larryhastings/gilectomy/tree/gilectomy.

.. [#pyparallel] PyParallel.
   Trent Nelson. 2016.
   http://pyparallel.org/.

.. [#pythonsafethread] python-safethread.
   Adam Olsen. 2008.
   https://launchpad.net/python-safethread

.. [#gsteinpatch] https://www.python.org/ftp/python/contrib-09-Dec-1999/System/threading.tar.gz.

.. [#dabeaz] An Inside Look at the GIL Removal Patch of Lore.
   David Beazley. 2011.
   https://dabeaz.blogspot.com/2011/08/inside-look-at-gil-removal-patch-of.html.

.. [#jython] Jython.
   https://www.jython.org/

.. [#ironpython] IronPython.
   https://ironpython.net/

.. [#pypystm] PyPy: Software Transactional Memory.
   https://doc.pypy.org/en/latest/stm.html


Acknowledgments
===============

Thanks to Hugh Leather, Łukasz Langa, and Eric Snow for providing
feedback on drafts of this PEP.

Copyright
=========

This document is placed in the public domain or under the
CC0-1.0-Universal license, whichever is more permissive.