PEP: 703
Title: Making the Global Interpreter Lock Optional in CPython
Author: Sam Gross <colesbury at gmail.com>
Sponsor: Łukasz Langa <lukasz at python.org>
Discussions-To: https://discuss.python.org/t/22606
Status: Accepted
Type: Standards Track
Content-Type: text/x-rst
Created: 09-Jan-2023
Python-Version: 3.13
Post-History: `09-Jan-2023 <https://discuss.python.org/t/22606>`__,
              `04-May-2023 <https://discuss.python.org/t/26503>`__
Resolution: https://discuss.python.org/t/pep-703-making-the-global-interpreter-lock-optional-in-cpython-acceptance/37075

.. note::

   The Steering Council accepts PEP 703, but with a clear proviso: that
   the rollout be gradual and break as little as possible, and that we
   can roll back any changes that turn out to be too disruptive – which
   includes potentially rolling back all of PEP 703 entirely if
   necessary (however unlikely or undesirable we expect that to be).

Abstract
========

CPython's global interpreter lock ("GIL") prevents multiple threads
from executing Python code at the same time. The GIL is an obstacle
to using multi-core CPUs from Python efficiently. This PEP proposes
adding a build configuration (``--disable-gil``) to CPython to let it
run Python code without the global interpreter lock and with the
changes necessary to make the interpreter thread-safe.

Motivation
==========

The GIL is a major obstacle to concurrency. For scientific computing
tasks, this lack of concurrency is often a bigger issue than the speed
of executing Python code, since most of the processor cycles are spent
in optimized CPU or GPU kernels. The GIL introduces a global
bottleneck that can prevent other threads from making progress if
they call any Python code. There are existing ways to enable
parallelism in CPython today, but those techniques come with
significant limitations (see `Alternatives`_).

This section focuses on the GIL's impact on scientific computing,
particularly AI/ML workloads, because that is the area with which this
author has the most experience, but the GIL also affects other users
of Python.

The GIL Makes Many Types of Parallelism Difficult to Express
-------------------------------------------------------------

Neural network-based AI models expose multiple opportunities for
parallelism. For example, individual operations may be parallelized
internally ("intra-operator"), multiple operations may be executed
simultaneously ("inter-operator"), and requests (spanning multiple
operations) may also be parallelized. Efficient execution requires
exploiting multiple types of parallelism [#yuemmwang2019]_.

The GIL makes it difficult to express inter-operator parallelism, as
well as some forms of request parallelism, efficiently in Python. In
other programming languages, a system might use threads to run
different parts of a neural network on separate CPU cores, but this is
inefficient in Python due to the GIL. Similarly, latency-sensitive
inference workloads frequently use threads to parallelize across
requests, but face the same scaling bottlenecks in Python.

The challenges the GIL poses to exploiting parallelism in Python
frequently come up in reinforcement learning. Heinrich Kuttler,
author of the NetHack Learning Environment and Member of Technical
Staff at Inflection AI, writes:

   Recent breakthroughs in reinforcement learning, such as on `Dota
   2`_, `StarCraft`_, and `NetHack`_, rely on running multiple
   environments (simulated games) in parallel using asynchronous
   actor-critic methods. Straightforward multithreaded implementations
   in Python don't scale beyond more than a few parallel environments
   due to GIL contention. Multiprocessing, with communication via
   shared memory or UNIX sockets, adds much complexity and in effect
   rules out interacting with CUDA from different workers, severely
   restricting the design space.

.. _Dota 2: https://openai.com/five/
.. _StarCraft: https://www.deepmind.com/blog/alphastar-grandmaster-level-in-starcraft-ii-using-multi-agent-reinforcement-learning
.. _NetHack: https://ai.facebook.com/blog/nethack-learning-environment-to-advance-deep-reinforcement-learning/

Manuel Kroiss, software engineer at DeepMind on the reinforcement
learning team, describes how the bottlenecks posed by the GIL lead to
rewriting Python codebases in C++, making the code less accessible:

   We frequently battle issues with the Python GIL at DeepMind. In many
   of our applications, we would like to run on the order of 50-100
   threads per process. However, we often see that even with fewer
   than 10 threads the GIL becomes the bottleneck. To work around this
   problem, we sometimes use subprocesses, but in many cases the
   inter-process communication becomes too big of an overhead. To
   deal with the GIL, we usually end up translating large parts of our
   Python codebase into C++. This is undesirable because it makes the
   code less accessible to researchers.

Projects that involve interfacing with multiple hardware devices face
similar challenges: efficient communication requires use of multiple
CPU cores. The `Dose-3D`_ project aims to improve cancer
radiotherapy with precise dose planning. It uses medical phantoms
(stand-ins for human tissue) together with custom hardware and a
server application written in Python. Paweł Jurgielewicz, lead
software architect for the data acquisition system on the Dose-3D
project, describes the scaling challenges posed by the GIL and how
using a fork of Python without the GIL simplified the project:

   In the Dose-3D project, the key challenge was to maintain a stable,
   non-trivial concurrent communication link with hardware units while
   utilizing a 1 Gbit/s UDP/IP connection to the maximum. Naturally,
   we started with the multiprocessing package, but at some point, it
   became clear that most CPU time was consumed by the data transfers
   between the data processing stages, not by data processing itself.
   The CPython multithreading implementation based on the GIL was a
   dead end too. When we found out about the "nogil" fork of Python,
   it took a single person less than half a working day to adjust the
   codebase to use this fork, and the results were astonishing. Now we
   can focus on data acquisition system development rather than
   fine-tuning data exchange algorithms.

.. _Dose-3D: https://dose3d.fis.agh.edu.pl/en/projekt-dose-3d-z-programu-team-net-fnp-eng/

Allen Goodman, author of `CellProfiler`_ and staff engineer at
Prescient Design and Genentech, describes how the GIL makes
biological methods research more difficult in Python:

   Issues with Python's global interpreter lock are a frequent source
   of frustration throughout biological methods research.

   I wanted to better understand the current multithreading situation
   so I reimplemented parts of `HMMER`_, a standard method for
   multiple-sequence alignment. I chose this method because it
   stresses both single-thread performance (scoring) and
   multi-threaded performance (searching a database of sequences). The
   GIL became the bottleneck when using only eight threads. This is a
   method where the current popular implementations rely on 64 or
   even 128 threads per process. I tried moving to subprocesses but
   was blocked by the prohibitive IPC costs. HMMER is a relatively
   elementary bioinformatics method and newer methods have far bigger
   multi-threading demands.

   Method researchers are begging to use Python (myself included),
   because of its ease of use, the Python ecosystem, and because "it's
   what people know." Many biologists only know a little bit of
   programming (and that's almost always Python). Until Python's
   multithreading situation is addressed, C and C++ will remain the
   lingua franca of the biological methods research community.

.. _CellProfiler: https://cellprofiler.org/
.. _HMMER: http://hmmer.org/

The GIL Affects Python Library Usability
----------------------------------------

The GIL is a CPython implementation detail that limits multithreaded
parallelism, so it might seem unintuitive to think of it as a
usability issue. However, library authors frequently care a great
deal about performance and will design APIs that support working
around the GIL. These workarounds frequently lead to APIs that are
more difficult to use. Consequently, users of these APIs may
experience the GIL as a *usability* issue and not just a performance
issue.

For example, PyTorch exposes a multiprocessing-based API called
``DataLoader`` for building data input pipelines. It uses ``fork()``
on Linux because it is generally faster and uses less memory
than ``spawn()``, but this leads to additional challenges for users:
creating a ``DataLoader`` after accessing a GPU can lead to confusing
CUDA errors. Accessing GPUs within a ``DataLoader`` worker quickly
leads to out-of-memory errors because processes do not share CUDA
contexts (unlike threads within a process).

Olivier Grisel, scikit-learn developer and software engineer at Inria,
describes how having to work around the GIL in scikit-learn-related
libraries leads to a more complex and confusing user experience:

   Over the years, scikit-learn developers have maintained ancillary
   libraries such as ``joblib`` and ``loky`` to try to work around some
   of the limitations of multiprocessing: extra memory usage partially
   mitigated via semi-automated memory mapping of large data buffers,
   slow worker startup by transparently reusing a pool of long
   running workers, fork-safety problems of third-party native runtime
   libraries such as GNU OpenMP by never using the fork-only
   start-method, ability to perform parallel calls of interactively
   defined functions in notebooks and REPLs in a cross-platform manner
   via cloudpickle. Despite our efforts, this multiprocessing-based
   solution is still brittle, complex to maintain and confusing to
   data scientists with limited understanding of system-level
   constraints. Furthermore, there are still irreducible limitations
   such as the overhead caused by the pickle-based
   serialization/deserialization steps required for inter-process
   communication. A lot of this extra work and complexity would not be
   needed anymore if we could use threads without contention on
   multicore hosts (sometimes with 64 physical cores or more) to run
   data science pipelines that alternate between Python-level
   operations and calls to native libraries.

Ralf Gommers, co-director of Quansight Labs and NumPy and SciPy
maintainer, describes how the GIL affects the user experience of
NumPy and numeric Python libraries:

   A key problem in NumPy and the stack of packages built around it is
   that NumPy is still (mostly) single-threaded --- and that has shaped
   significant parts of the user experience and projects built around
   it. NumPy does release the GIL in its inner loops (which do the
   heavy lifting), but that is not nearly enough. NumPy doesn't offer
   a solution to utilize all CPU cores of a single machine well, and
   instead leaves that to Dask and other multiprocessing solutions.
   Those aren't very efficient and are also more clumsy to use. That
   clumsiness comes mainly in the extra abstractions and layers the
   users need to concern themselves with when using, e.g.,
   ``dask.array`` which wraps ``numpy.ndarray``. It also shows up in
   oversubscription issues that the user must explicitly be aware of
   and manage via either environment variables or a third package,
   ``threadpoolctl``. The main reason is that NumPy calls into BLAS
   for linear algebra --- and those calls it has no control over; they
   do use all cores by default via either pthreads or OpenMP.

   Coordinating on APIs and design decisions to control parallelism is
   still a major amount of work, and one of the harder challenges
   across the PyData ecosystem. It would have looked a lot different
   (better, easier) without a GIL.

GPU-Heavy Workloads Require Multi-Core Processing
-------------------------------------------------

Many high-performance computing (HPC) and AI workloads make heavy use
of GPUs. These applications frequently require efficient multi-core
CPU execution even though the bulk of the computation runs on a GPU.

Zachary DeVito, PyTorch core developer and researcher at FAIR
(Meta AI), describes how the GIL makes multithreaded scaling
inefficient even when the bulk of computation is performed outside of
Python:

   In PyTorch, Python is commonly used to orchestrate ~8 GPUs and ~64
   CPU threads, growing to 4k GPUs and 32k CPU threads for big models.
   While the heavy lifting is done outside of Python, the speed of
   GPUs makes even just the orchestration in Python not scalable. We
   often end up with 72 processes in place of one because of the GIL.
   Logging, debugging, and performance tuning are orders-of-magnitude
   more difficult in this regime, continuously causing lower developer
   productivity.

The use of many processes (instead of threads) makes common tasks more
difficult. Zachary DeVito continues:

   On three separate occasions in the past couple of months
   (reducing redundant compute in data loaders, writing model
   checkpoints asynchronously, and parallelizing compiler
   optimizations), I spent an order-of-magnitude more time figuring
   out how to work around GIL limitations than actually solving the
   particular problem.

Even GPU-heavy workloads frequently have a CPU-intensive component.
For example, computer vision tasks typically require
multiple "pre-processing" steps in the data input pipeline, like
image decoding, cropping, and resizing. These tasks are commonly
performed on the CPU and may use Python libraries like `Pillow`_
or `Pillow-SIMD`_. It is necessary to run the data input pipeline
on multiple CPU cores in order to keep the GPU "fed" with data.

The increase in GPU performance compared to individual CPU cores makes
multi-core performance more important. It is progressively more
difficult to keep the GPUs fully occupied. To do so requires efficient
use of multiple CPU cores, especially on multi-GPU systems. For
example, NVIDIA's DGX-A100 has 8 GPUs and two 64-core CPUs in order to
keep the GPUs "fed" with data.

.. _Pillow: https://pillow.readthedocs.io/en/stable/
.. _Pillow-SIMD: https://github.com/uploadcare/pillow-simd

The GIL Makes Deploying Python AI Models Difficult
--------------------------------------------------

Python is widely used to develop neural network-based AI models. In
PyTorch, models are frequently deployed as part of multi-threaded,
mostly C++, environments. Python is often viewed skeptically
because the GIL can be a global bottleneck, preventing efficient
scaling even though the vast majority of the computations
occur "outside" of Python with the GIL released. The torchdeploy
paper [#torchdeploy]_ shows experimental evidence for these scaling
bottlenecks in multiple model architectures.

PyTorch provides a number of mechanisms for deploying Python AI
models that avoid or work around the GIL, but they all come with
substantial limitations. For example, `TorchScript
<https://pytorch.org/docs/stable/jit.html>`_ captures a
representation of the model that can be executed from C++ without any
Python dependencies, but it only supports a limited subset of Python
and often requires rewriting some of the model's code. The
`torch::deploy <https://pytorch.org/docs/stable/package.html>`_ API
allows multiple Python interpreters, each with its own GIL, in the
same process (similar to :pep:`684`). However, ``torch::deploy`` has
limited support for Python modules that use C-API extensions.

Motivation Summary
------------------

Python's global interpreter lock makes it difficult to use modern
multi-core CPUs efficiently for many scientific and numeric computing
applications. Heinrich Kuttler, Manuel Kroiss, and Paweł
Jurgielewicz found that multi-threaded implementations in Python did
not scale well for their tasks and that using multiple processes
was not a suitable alternative.

The scaling bottlenecks are not solely in core numeric tasks. Both
Zachary DeVito and Paweł Jurgielewicz described challenges with
coordination and communication in Python.

Olivier Grisel, Ralf Gommers, and Zachary DeVito described how current
workarounds for the GIL are "complex to maintain" and cause "lower
developer productivity." The GIL makes it more difficult to develop
and maintain scientific and numeric computing libraries, as well as
leading to library designs that are more difficult to use.

Specification
=============

Build Configuration Changes
---------------------------

The global interpreter lock will remain the default for CPython builds
and python.org downloads. A new build configuration flag,
``--disable-gil``, will be added to the configure script that will
build CPython with support for running without the global interpreter
lock.

When built with ``--disable-gil``, CPython will define the ``Py_NOGIL``
macro in Python/patchlevel.h. The ABI tag will include the letter "t"
(for "threading").

The ``--disable-gil`` builds of CPython will still support optionally
running with the GIL enabled at runtime (see `PYTHONGIL Environment
Variable`_ and `Py_mod_gil Slot`_).

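For example, an extension could detect the build configuration at
compile time through this macro. A minimal sketch (the variable name
is illustrative and not part of this proposal):

.. code-block:: c

   #include "Python.h"

   /* Py_NOGIL is defined (in Python/patchlevel.h) only when CPython
      was configured with --disable-gil. */
   #ifdef Py_NOGIL
   static const char *cpython_build = "built with --disable-gil";
   #else
   static const char *cpython_build = "standard build (GIL enabled)";
   #endif
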
Overview of CPython Changes
---------------------------

Removing the global interpreter lock requires substantial changes to
CPython internals, but relatively few changes to the public Python
and C APIs. This section describes the required changes to the
CPython implementation followed by the proposed API changes.

The implementation changes can be grouped into the following four
categories:

* Reference counting
* Memory management
* Container thread-safety
* Locking and atomic APIs

Reference Counting
------------------

Removing the GIL requires changes to CPython's
reference counting implementation to make it thread-safe.
Furthermore, it needs to have low execution overhead and allow for
efficient scaling with multiple threads. This PEP proposes a
combination of three techniques to address these constraints. The
first is a switch from plain non-atomic reference counting to biased
reference counting, which is a thread-safe reference counting
technique with lower execution overhead than plain atomic reference
counting. The other two techniques are immortalization and a limited
form of deferred reference counting; they address some of the
multi-threaded scalability issues with reference counting by avoiding
some reference count modifications.

Biased reference counting (BRC) is a technique first described in 2018
by Jiho Choi, Thomas Shull, and Josep Torrellas [#brc]_. It is based on
the observation that most objects are only accessed by a single thread,
even in multi-threaded programs. Each object is associated with an
owning thread (the thread that created it). Reference counting
operations from the owning thread use non-atomic instructions to
modify a "local" reference count. Other threads use atomic
instructions to modify a "shared" reference count. This design avoids
many atomic read-modify-write operations that are expensive on
contemporary processors.

The implementation of BRC proposed in this PEP largely matches the
original description of biased reference counting, but differs in
details like the size of reference counting fields and special bits in
those fields. BRC requires storing three pieces of information in each
object's header: the "local" reference count, the "shared" reference
count, and the identifier of the owning thread. The BRC paper packs
these three things into a single 64-bit field. This PEP proposes using
three separate fields in each object's header to avoid potential issues
due to reference count overflow. Additionally, the PEP supports a
faster deallocation path that avoids an atomic operation in the common
case.

The proposed ``PyObject`` struct (also called ``struct _object``) is
below:

.. code-block:: c

   struct _object {
       _PyObject_HEAD_EXTRA
       uintptr_t ob_tid;         // owning thread id (4-8 bytes)
       uint16_t __padding;       // reserved for future use (2 bytes)
       PyMutex ob_mutex;         // per-object mutex (1 byte)
       uint8_t ob_gc_bits;       // GC fields (1 byte)
       uint32_t ob_ref_local;    // local reference count (4 bytes)
       Py_ssize_t ob_ref_shared; // shared reference count and state bits (4-8 bytes)
       PyTypeObject *ob_type;
   };

The ``ob_tid``, ``ob_ref_local``, and ``ob_ref_shared`` fields are used
by the biased reference counting implementation. The ``ob_gc_bits``
field is used to store garbage collection flags that were previously
stored in ``PyGC_Head`` (see `Garbage Collection (Cycle Collection)`_).
The ``ob_mutex`` field provides a per-object lock in a single byte.

Immortalization
'''''''''''''''

Some objects, such as interned strings, small integers, statically
allocated PyTypeObjects, and the ``True``, ``False``, and ``None``
objects stay alive for the lifetime of the program. These objects are
marked as immortal by setting the local reference count field
(``ob_ref_local``) to ``UINT32_MAX``.

The ``Py_INCREF`` and ``Py_DECREF`` macros are no-ops for immortal
objects. This avoids contention on the reference count fields of
these objects when multiple threads access them concurrently.

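As a rough sketch, assuming the proposed ``PyObject`` layout above,
checking and marking immortality could look like the following; the
helper names are hypothetical and not part of this proposal:

.. code-block:: c

   #include <stdint.h>

   // Hypothetical helpers illustrating the immortalization scheme.
   static inline int _Py_IsImmortalSketch(PyObject *op)
   {
       return op->ob_ref_local == UINT32_MAX;
   }

   static inline void _Py_SetImmortalSketch(PyObject *op)
   {
       op->ob_tid = 0;                 // no owning thread
       op->ob_ref_local = UINT32_MAX;  // marks the object immortal
   }
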
This proposed immortalization scheme is very similar to :pep:`683`,
adopted in Python 3.12, but with a slightly different bit representation
in the reference count fields for immortal objects in order to work
with biased reference counting and deferred reference counting. See
also `Why Not Use PEP 683 Immortalization?`_.

Biased Reference Counting
'''''''''''''''''''''''''

Biased reference counting has a fast-path for objects "owned" by the
current thread and a slow-path for other objects. Ownership is
indicated by the ``ob_tid`` field. Determining the thread id requires
platform-specific code [#tid]_. A value of ``0`` in ``ob_tid``
indicates that the object is not owned by any thread.

The ``ob_ref_local`` field stores the local reference count and two
flags. The two most significant bits are used to indicate that the
object is immortal or uses deferred reference counting (see `Deferred
reference counting`_).

The ``ob_ref_shared`` field stores the shared reference count. The
two *least* significant bits are used to store the reference
counting state. The shared reference count is therefore shifted left by
two. The ``ob_ref_shared`` field uses the least significant bits
because the shared reference count can be temporarily negative; increfs
and decrefs may not be balanced between threads.

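Concretely, unpacking ``ob_ref_shared`` as described above might be
sketched with the following macros; they are illustrative, not proposed
API, though ``_Py_SHARED_SHIFT`` also appears in the pseudo-code later
in this section:

.. code-block:: c

   // Illustrative only: the two low bits hold the state; the rest of
   // the field is the (possibly negative) shared reference count.
   #define _Py_SHARED_SHIFT 2
   #define SHARED_REFCOUNT(op) ((op)->ob_ref_shared >> _Py_SHARED_SHIFT)
   #define SHARED_STATE(op)    ((op)->ob_ref_shared & 0x3)
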
The possible reference counting states are listed below:

* ``0b00`` - default
* ``0b01`` - weakrefs
* ``0b10`` - queued
* ``0b11`` - merged

The states form a progression: during their lifecycle, objects may
transition to any numerically higher state. Objects can only be
deallocated from the "default" and "merged" states. Other states must
transition to the "merged" state before deallocation. Transitioning
states requires an atomic compare-and-swap on the ``ob_ref_shared``
field.

Default (``0b00``)
""""""""""""""""""

Objects are initially created in the default state. This is the only
state that allows for the quick deallocation code path. Otherwise, the
thread must merge the local and shared reference count fields, which
requires an atomic compare-and-swap.

This quick deallocation code path would not be thread-safe with
concurrent dereferencing of weakrefs, so the first time a weak
reference is created, the object is transitioned to the "weakrefs"
state if it is currently in the "default" state.

Similarly, the quick deallocation code path would not be thread-safe
with the lockless list and dictionary accesses (see `Optimistically
Avoiding Locking`_), so the first time a non-owning thread
attempts to retrieve an object in the "default" state it falls back to
the slower locking code path and transitions the object to
the "weakrefs" state.

Weakrefs (``0b01``)
"""""""""""""""""""

Objects in the "weakrefs" and higher states support dereferencing
weakrefs as well as the lockless list and dictionary access by
non-owning threads. They require transitioning to the merged state
before deallocation, which is more expensive than the quick
deallocation code path supported by the "default" state.

Queued (``0b10``)
"""""""""""""""""

The queued state indicates that a non-owning thread has requested
that the reference count fields be merged. This can happen when the
shared reference count becomes negative (due to an imbalance between
increfs and decrefs between threads). The object is inserted into the
owning thread's queue of objects to be merged. The owning thread is
notified via the ``eval_breaker`` mechanism. In practice, this
operation is rare. Most objects are only accessed by a single thread
and those objects accessed by multiple threads rarely have negative
shared reference counts.

If the owning thread has terminated, the acting thread immediately
merges the local and shared reference count fields and transitions to
the merged state.

Merged (``0b11``)
"""""""""""""""""

The merged state indicates that the object is not owned by any thread.
The ``ob_tid`` field is zero in this state and ``ob_ref_local`` is not
used. Once the shared reference count reaches zero, the object can
be deallocated from the merged state.

Reference counting pseudo-code
""""""""""""""""""""""""""""""

The proposed ``Py_INCREF`` and ``Py_DECREF`` operations should behave
as follows (using C-like pseudo-code):

.. code-block:: c

   // low two bits of "ob_ref_shared" are used for flags
   #define _Py_SHARED_SHIFT 2

   void Py_INCREF(PyObject *op)
   {
       uint32_t new_local = op->ob_ref_local + 1;
       if (new_local == 0)
           return; // object is immortal
       if (op->ob_tid == _Py_ThreadId())
           op->ob_ref_local = new_local;
       else
           atomic_add(&op->ob_ref_shared, 1 << _Py_SHARED_SHIFT);
   }

   void Py_DECREF(PyObject *op)
   {
       if (op->ob_ref_local == _Py_IMMORTAL_REFCNT) {
           return; // object is immortal
       }
       if (op->ob_tid == _Py_ThreadId()) {
           op->ob_ref_local -= 1;
           if (op->ob_ref_local == 0) {
               _Py_MergeZeroRefcount(op); // merge refcount
           }
       }
       else {
           _Py_DecRefShared(op); // slow path
       }
   }

   void _Py_MergeZeroRefcount(PyObject *op)
   {
       if (op->ob_ref_shared == 0) {
           // quick deallocation code path (common case)
           op->ob_tid = 0;
           _Py_Dealloc(op);
       }
       else {
           // slower merging path not shown
       }
   }

The reference implementation [#nogil312]_ contains implementations of
``_Py_MergeZeroRefcount`` and ``_Py_DecRefShared``.

Note that the above is pseudocode: in practice, the implementation
should use "relaxed atomics" to access ``ob_tid`` and
``ob_ref_local`` to avoid undefined behavior in C and C++.

Deferred Reference Counting
'''''''''''''''''''''''''''

A few types of objects, such as top-level functions, code objects,
modules, and methods, tend to be frequently accessed by many threads
concurrently. These objects don't necessarily live for the lifetime of
the program, so immortalization is not a good fit. This PEP proposes a
limited form of deferred reference counting to avoid contention on
these objects' reference count fields in multi-threaded programs.

Typically, the interpreter modifies objects' reference counts as they
are pushed to and popped from the interpreter's stack. The
interpreter skips these reference counting operations for objects
that use deferred reference counting. Objects that support deferred
reference counting are marked by setting the two most significant
bits in the local reference count field to one.

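A minimal sketch of that marking, assuming the proposed ``PyObject``
layout; the name and mask are hypothetical and may differ from the
reference implementation:

.. code-block:: c

   #include <stdint.h>

   // Hypothetical mask covering the two most significant bits of the
   // 32-bit ob_ref_local field. Note that an immortal object
   // (ob_ref_local == UINT32_MAX) also has these bits set; the real
   // implementation distinguishes the two cases.
   #define _Py_REF_DEFERRED_MASK (((uint32_t)0x3) << 30)

   static inline void mark_deferred_sketch(PyObject *op)
   {
       op->ob_ref_local |= _Py_REF_DEFERRED_MASK;
   }
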
Because some reference counting operations are skipped, the reference
count fields no longer reflect the true number of references to these
objects. The true reference count is the sum of the reference count
fields plus any skipped references from each thread's interpreter
stack. The true reference count can only be safely computed when all
threads are paused during cyclic garbage collection. Consequently,
objects that use deferred reference counting can only be deallocated
during garbage collection cycles.

Note that the objects that use deferred reference counting already
naturally form reference cycles in CPython, so they would typically be
deallocated by the garbage collector even without deferred reference
counting. For example, top-level functions and modules form a reference
cycle, as do methods and type objects.

Garbage Collector Modifications for Deferred Reference Counting
'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''

The tracing garbage collector finds and deallocates unreferenced
objects. Currently, the tracing garbage collector only finds
unreferenced objects that are part of a reference cycle. With
deferred reference counting, the tracing garbage collector will also
find and collect some unreferenced objects that may not be part of
any reference cycle, but whose collection has been delayed due to
deferred reference counting. This requires that all objects that
support deferred reference counting also have a corresponding type
object that supports tracing garbage collection (through the
``Py_TPFLAGS_HAVE_GC`` flag). Additionally, the garbage collector
will need to traverse each thread's stack to add references to the GC
reference count at the start of each collection.

Reference Counting Type Objects
'''''''''''''''''''''''''''''''

Type objects (``PyTypeObject``) use a mix of reference counting
techniques. Statically allocated type objects are immortalized because
the objects already live for the lifetime of the program. Heap type
objects use deferred reference counting in combination with per-thread
reference counting. Deferred reference counting is not sufficient to
address the multi-threaded scaling bottlenecks with heap types because
most references to heap types are from object instances, not references
on the interpreter stack.

To address this, heap type reference counts are partially stored in a
distributed manner in per-thread arrays. Every thread stores an
array of local reference counts for each heap type object. Heap type
objects are assigned a unique number that determines their position in
the local reference count arrays. A heap type's true reference count
is the sum of its entries in the per-thread arrays, plus the reference
count on the ``PyTypeObject``, plus any deferred references in the
interpreter stack.

Threads may grow their own type reference count arrays as needed when
incrementing or decrementing the local reference count of a type
object.

Use of the per-thread reference count arrays is limited to a few
places:

* ``PyType_GenericAlloc(PyTypeObject *type, Py_ssize_t nitems)``:
  Increments the current thread's local reference count for ``type``,
  if it is a heap type.
* ``subtype_dealloc(PyObject *self)``: Decrements the current thread's
  local reference count for ``self->ob_type``, if the type is a heap
  type.
* ``gcmodule.c``: Adds each thread's local reference counts to the
  ``gc_refs`` count for the corresponding heap type object.

Additionally, when a thread terminates, it adds any non-zero local
reference counts to each type object's own reference count field.

Memory Management
-----------------

CPython currently uses an internal allocator, pymalloc, which is
optimized for small object allocation. The pymalloc implementation is
not thread-safe without the GIL. This PEP proposes replacing pymalloc
with mimalloc, a general-purpose thread-safe allocator with good
performance, including for small allocations.

Using mimalloc, with some modifications, also addresses two other
issues related to removing the GIL. First, traversing the internal
mimalloc structures allows the garbage collector to find all Python
objects without maintaining a linked list. This is described in more
detail in the garbage collection section. Second, mimalloc heaps and
allocations based on size class enable collections like dict to
generally avoid acquiring locks during read-only operations. This is
described in more detail in the collection thread-safety section.

CPython already requires that objects that support garbage collection
use the GC allocator APIs (typically indirectly by calling
``PyType_GenericAlloc``). This PEP would add additional requirements
to the use of the Python allocator APIs. First, Python objects must
be allocated through object allocation APIs, such as
``PyType_GenericAlloc``, ``PyObject_Malloc``, or other Python APIs
that wrap those calls. Python objects should not be allocated through
other APIs, such as raw calls to C's malloc or the C++ new operator.
Additionally, ``PyObject_Malloc`` should be used only for allocating
Python objects; it should not be used for allocating buffers,
storage, or other data structures that are not PyObjects.

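The following hedged sketch illustrates these rules for a hypothetical
extension type (the type and helper names are illustrative): the object
itself comes from an object allocation API, while its internal buffer
does not:

.. code-block:: c

   #include "Python.h"

   typedef struct {
       PyObject_HEAD
       double *buffer;
   } MyObject;

   static MyObject *my_object_new(PyTypeObject *type, Py_ssize_t n)
   {
       // The PyObject itself must come from an object allocation API.
       MyObject *self = (MyObject *)PyType_GenericAlloc(type, 0);
       if (self == NULL) {
           return NULL;
       }
       // Non-PyObject data: use PyMem_Malloc (or similar), never
       // PyObject_Malloc, for buffers and other storage.
       self->buffer = (double *)PyMem_Malloc(n * sizeof(double));
       if (self->buffer == NULL) {
           Py_DECREF((PyObject *)self);
           return (MyObject *)PyErr_NoMemory();
       }
       return self;
   }
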
This PEP also imposes restrictions on the pluggable allocator API
(``PyMem_SetAllocator``). When compiling without the GIL, allocators
set using this API must eventually delegate the allocation to the
corresponding underlying allocator, such as ``PyObject_Malloc``, for
Python object allocations. This allows for allocators that "wrap"
underlying allocators, such as Python's tracemalloc and debug
allocator, but not for wholly replacing the allocator.

CPython Free Lists
''''''''''''''''''

CPython makes use of free lists to speed up the allocation of small,
frequently allocated objects like tuples and numbers. These free
lists are moved to ``PyThreadState`` from per-interpreter state.

Garbage Collection (Cycle Collection)
-------------------------------------

The CPython garbage collector requires the following changes to work
with this proposal:

* Use of "stop-the-world" pauses to provide the thread-safety
  guarantees that were previously provided by the GIL.
* Elimination of generational garbage collection in favor of a
  non-generational collector.
* Integration with deferred reference counting and biased reference
  counting.

Additionally, the above changes enable removing the
``_gc_prev`` and ``_gc_next`` fields from GC objects. The GC bits
that stored the tracked, finalized, and unreachable states are moved
to the ``ob_gc_bits`` field in the PyObject header.

Stop-the-World
''''''''''''''

The CPython cycle garbage collector currently relies on the global
interpreter lock to prevent other threads from accessing Python
objects while the collector finds cycles. The GIL is never released
during the cycle-finding routine, so the collector can rely on
stable (i.e., unchanging) reference counts and references for the
duration of that routine. However, following cycle detection, the GIL
may be temporarily released while calling objects' finalizers and
clear (``tp_clear``) functions, allowing other threads to run in an
interleaved fashion.

When running without the GIL, the implementation needs a way to ensure
that reference counts remain stable during cycle detection. Threads
running Python code must be paused to ensure that references and
reference counts remain stable. Once the cycles are identified, other
threads are resumed.

The current CPython cyclic garbage collector involves two
cycle-detection passes during each garbage collection cycle.
Consequently, this requires two stop-the-world pauses when running the
garbage collector without the GIL. The first cycle-detection pass
identifies cyclic trash. The second pass runs after finalizers to
identify which objects still remain unreachable. Note that other
threads are resumed before finalizers and ``tp_clear`` functions are
called to avoid introducing potential deadlocks that are not present in
the current CPython behavior.

Thread States
'''''''''''''

To support pausing threads for garbage collection, the PyThreadState
gets a new "status" field. Like the other fields in PyThreadState,
the status field is not part of the public CPython API. The status
field may be in one of three states:

* ``ATTACHED``
* ``DETACHED``
* ``GC``

The ``ATTACHED`` and ``DETACHED`` states correspond closely to
acquiring and releasing the global interpreter lock. When compiling
without the GIL, functions that previously acquired the GIL instead
transition the thread state to ``ATTACHED``, and functions that
previously released the GIL transition the thread state
to ``DETACHED``. Just as threads previously needed to acquire the
GIL before accessing or modifying Python objects, they now must be in
the ``ATTACHED`` state before accessing or modifying Python
objects. Since the same public C-API functions "attach" the thread as
previously acquired the GIL (e.g., ``PyEval_RestoreThread``), the
requirements for thread initialization in extensions remain the same.
The substantial difference is that multiple threads can be in the
attached state simultaneously, while previously only one thread could
acquire the GIL at a time.

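For example, the familiar pattern for releasing the GIL around
blocking work is unchanged; in a sketch using the existing C API:

.. code-block:: c

   #include "Python.h"

   static void do_blocking_work(void)
   {
       // In a GIL build this releases the GIL; under --disable-gil it
       // transitions this thread from ATTACHED to DETACHED.
       PyThreadState *tstate = PyEval_SaveThread();

       // ... blocking I/O or long-running native code; no Python
       // objects may be accessed here ...

       // GIL build: reacquire the GIL; --disable-gil: DETACHED to
       // ATTACHED.
       PyEval_RestoreThread(tstate);
   }
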
During stop-the-world pauses, the thread performing garbage collection
needs to ensure that no other thread is accessing or modifying Python
objects. All other threads must be in the ``GC`` state. The garbage
collection thread can transition other threads from the ``DETACHED``
state to the ``GC`` state using an atomic compare-and-swap operation on
the status field. Threads in the ``ATTACHED`` state are requested to
pause themselves and set their status to ``GC``, using the
existing "eval breaker" mechanism. At the end of the stop-the-world
pause, all threads in the ``GC`` state are set to ``DETACHED`` and
woken up if they are paused. Threads that were previously attached
(i.e., executing Python bytecode) can re-attach (set their thread
states to ``ATTACHED``) and resume executing Python code. Threads
that were previously ``DETACHED`` ignore the notification.

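The ``DETACHED``-to-``GC`` transition might be sketched as follows; the
enum values and helper are hypothetical stand-ins for internal details:

.. code-block:: c

   #include <stdatomic.h>

   // Hypothetical status values; illustrative only.
   enum py_thread_status { DETACHED = 0, ATTACHED = 1, GC_STATE = 2 };

   // Attempt to move a detached thread into the GC state so that it
   // cannot re-attach during the stop-the-world pause. Fails (returns
   // 0) if the thread is ATTACHED, in which case the eval-breaker
   // mechanism is used instead.
   static int try_park_detached_thread(_Atomic int *status)
   {
       int expected = DETACHED;
       return atomic_compare_exchange_strong(status, &expected, GC_STATE);
   }
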
Generations
'''''''''''

The existing Python garbage collector uses three generations. When
compiling without the GIL, the garbage collector will only use a single
generation (i.e., it will be non-generational). The primary reason for
this change is to reduce the impact of the stop-the-world pauses in
multithreaded applications. Frequent stop-the-world pauses for
collecting the young generation would have more of an impact on
multi-threaded applications than less frequent collections.

Integration With Deferred and Biased Reference Counting
'''''''''''''''''''''''''''''''''''''''''''''''''''''''

To find unreferenced objects, the cyclic garbage collector computes
the difference between the number of incoming references and the
object's reference count. This difference is called ``gc_refs`` and
is stored in the ``_gc_prev`` field. If ``gc_refs`` is greater than
zero, then the object is guaranteed to be alive (i.e., not cyclic
trash). If ``gc_refs`` is zero, then the object is only alive if it
is transitively referenced by another live object. When computing
this difference, the collector should traverse each thread's stack,
and for every deferred reference, increment the ``gc_refs`` for the
referred object. Since generator objects also have stacks with
deferred references, the same procedure is applied to each
generator's stack.

Python unit tests commonly use ``gc.collect()`` to ensure that any
unreferenced objects are destructed and their finalizers run. Since
biased reference counting can delay the destruction of some objects
that are referenced by multiple threads, it's convenient to ensure
that those objects are destructed during garbage collection, even
though they may not be part of any reference cycles. While other
threads are paused, the garbage collector thread should merge the
reference counts for any queued objects, but not call any destructors
even if the combined reference count is zero. (Calling destructors
while other threads are paused risks introducing deadlocks.) Once
other threads are resumed, the GC thread should call ``_Py_Dealloc``
on those objects with a zero merged reference count.

Container Thread-Safety
-----------------------

In CPython, the global interpreter lock protects against corruption of
internal interpreter states when multiple threads concurrently access
or modify Python objects. For example, if multiple threads
concurrently modify the same list, the GIL ensures that the length of
the list (``ob_size``) accurately matches the number of elements, and
that the reference counts of each element accurately reflect the
number of references to those elements. Without the GIL --- and
absent other changes --- concurrent modifications would corrupt those
fields and likely lead to program crashes.

The GIL does not necessarily ensure that operations are atomic or
remain correct when multiple operations occur concurrently. For
example, ``list.extend(iterable)`` may not appear atomic if the
iterable has an iterator implemented in Python (or releases the GIL
internally). Similarly, ``list.remove(x)`` can remove the wrong
object if it overlaps with another operation that modifies the list,
depending on the implementation of the equality operator. Still, the
GIL ensures that some operations are effectively atomic. For example,
the constructor ``list(set)`` atomically copies the items of the set
to a new list, and some code relies on that copy being atomic
(i.e., having a snapshot of the items in the set). This PEP preserves
that property.

This PEP proposes using per-object locks to provide many of the same
protections that the GIL provides. For example, every list,
dictionary, and set will have an associated lightweight lock. All
operations that modify the object must hold the object's lock. Most
operations that read from the object should acquire the object's lock
as well; the few read operations that can proceed without holding a
lock are described below.

Per-object locks with critical sections provide weaker protections
than the GIL. Because the GIL doesn't necessarily ensure that
concurrent operations are atomic or correct, the per-object locking
scheme also cannot ensure that concurrent operations are atomic or
correct. Instead, per-object locking aims for similar protections as
the GIL, but with mutual exclusion limited to individual objects.

Most operations on an instance of a container type require locking
that object. For example:

* ``list.append``, ``list.insert``, ``list.repeat``,
  ``PyList_SetItem``
* ``dict.__setitem__``, ``PyDict_SetItem``
* ``list.clear``, ``dict.clear``
* ``list.__repr__``, ``dict.__repr__``, etc.
* ``list.extend(iterable)``
* ``setiter_iternext``

Some operations operate directly on two container objects, with
knowledge about both containers' internal structure. For example,
there are internal specializations of ``list.extend(iterable)`` for
specific iterable types, like ``set``. These operations need to lock
both container objects because they access the internals of both
objects simultaneously. Note that the generic implementation of
``list.extend`` only needs to lock one object (the list) because the
other object is accessed indirectly through the thread-safe iterator
API. Operations that lock two containers are:

* ``list.extend(list)``, ``list.extend(set)``,
  ``list.extend(dictitems)``, and other specializations where the
  implementation is specialized for the argument type.
* ``list.concat(list)``
* ``list.__eq__(list)``, ``dict.__eq__(dict)``

Some simple operations can be implemented directly with atomic
accesses and do not need locks because they only access a single
field, as sketched after the list. These operations include:

* ``len(list)``, i.e., ``list_length(PyListObject *a)``
* ``len(dict)``
* ``len(set)``

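For instance, ``list_length`` can be a single atomic read of
``ob_size``; a sketch using C11 atomics (the real implementation would
use CPython's internal atomic helpers rather than this cast):

.. code-block:: c

   #include <stdatomic.h>
   #include "Python.h"

   // Sketch: no per-object lock is needed because only one field is
   // read; concurrent resizes simply make the result a snapshot.
   static Py_ssize_t list_length_sketch(PyListObject *a)
   {
       return atomic_load_explicit(
           (const _Atomic Py_ssize_t *)&((PyVarObject *)a)->ob_size,
           memory_order_relaxed);
   }
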
A select few operations optimistically avoid locking to improve
performance. These require special implementations and cooperation
from the memory allocator:

* ``list[idx]`` (``list_subscript``)
* ``dict[key]`` (``dict_subscript``)
* ``listiter_next``, ``dictiter_iternextkey/value/item``
* ``list.contains``

Borrowed References
'''''''''''''''''''

Per-object locking provides many of the important protections that the
GIL provides, but there are a few cases where it's not sufficient.
For example, code that relies on upgrading a borrowed reference to
an "owned" reference may be unsafe in certain circumstances:

.. code-block:: c

   PyObject *item = PyList_GetItem(list, idx);
   Py_INCREF(item);

The GIL ensures that no other thread can modify the list in between
the access and the ``Py_INCREF`` call. Without the GIL --- even with
per-object locking --- another thread might modify the list, leading to
``item`` being freed between the access and the ``Py_INCREF`` call.

The problematic borrowed reference APIs are supplemented with
functions that return "new references" but are otherwise
equivalent (a usage sketch follows the list):

* ``PyList_FetchItem(list, idx)`` for ``PyList_GetItem``
* ``PyDict_FetchItem(dict, key)`` for ``PyDict_GetItem``
* ``PyWeakref_FetchObject`` for ``PyWeakref_GetObject``

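A minimal usage sketch of the proposed ``PyList_FetchItem``; aside from
returning a new reference, it is assumed here to behave like
``PyList_GetItem`` (returning ``NULL`` on error):

.. code-block:: c

   // The item is returned as a new (owned) reference, so there is no
   // window in which another thread's modification of the list can
   // free it between the access and the reference count increment.
   PyObject *item = PyList_FetchItem(list, idx);
   if (item == NULL) {
       return NULL;  // exception set, e.g., index out of range
   }
   // ... use item ...
   Py_DECREF(item);  // release the owned reference
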
Note that some APIs that return borrowed references, such as
``PyTuple_GetItem``, are not problematic because tuples are
immutable. Similarly, not all uses of the above APIs are problematic.
For example, ``PyDict_GetItem`` is often used for parsing keyword
argument dictionaries in function calls; those keyword argument
dictionaries are effectively private (not accessible by other
threads).

Python Critical Sections
''''''''''''''''''''''''

Straightforward per-object locking could introduce deadlocks that were
not present when running with the GIL. Threads may hold locks for
multiple objects simultaneously because Python operations can nest.
Operations on objects can invoke operations on other objects,
acquiring multiple per-object locks. If threads try to acquire the
same locks in different orders, they will deadlock.

This PEP proposes a scheme called "Python critical sections" to
implicitly release per-object locks to avoid deadlocks. To
understand the scheme, we first introduce a general approach to avoid
deadlocks, and then propose a refinement of that approach with better
performance.

One way to avoid deadlocks is to allow threads to hold only the lock
(or locks) for a single operation at a time (typically a single lock,
but some operations involve two locks as described above). When a
thread begins a nested operation, it should suspend the locks for any
outer operation: before beginning the nested operation, the locks for
the outer operation are released, and when the nested operation
completes, the locks for the outer operation are reacquired.

Additionally, the locks for any active operation should be suspended
around potentially blocking operations, such as I/O (i.e., operations
that would have released the GIL). This is because the interaction
between locks and blocking operations can lead to deadlocks in the
same way as the interaction between multiple locks.

To improve performance, this PEP proposes a variation of the above
scheme that still avoids deadlocks. Instead of immediately
suspending locks any time a nested operation begins, locks are only
suspended if the thread would block (i.e., would have released the
GIL). This reduces the number of lock acquisitions and releases for
nested operations, while avoiding deadlocks.

The proposed API for Python critical sections are the following four
|
|
|
|
|
macros. These are intended to be public (usable by C-API extensions),
|
2023-05-04 13:00:54 -04:00
|
|
|
|
but not part of the limited API:
|
2023-01-10 10:52:02 -05:00
|
|
|
|
|
2023-05-04 13:00:54 -04:00
|
|
|
|
- ``Py_BEGIN_CRITICAL_SECTION(PyObject *op);``:
|
|
|
|
|
Begins a critical section by acquiring the mutex for the referenced
|
|
|
|
|
object. If the object is already locked, then locks for any
|
|
|
|
|
outstanding critical sections are released before this thread waits
|
|
|
|
|
for referenced object to be unlocked.
|
2023-01-10 10:52:02 -05:00
|
|
|
|
|
2023-05-04 13:00:54 -04:00
|
|
|
|
- ``Py_END_CRITICAL_SECTION;``:
|
|
|
|
|
Ends the most recent operation, unlocking the mutex. The next
|
2023-01-10 10:52:02 -05:00
|
|
|
|
most recent previous critical section (if any) is resumed if it is
|
|
|
|
|
currently suspended.
|
|
|
|
|
|
2023-05-04 13:00:54 -04:00
|
|
|
|
- ``Py_BEGIN_CRITICAL_SECTION2(PyObject *a, PyObject *b);``:
|
|
|
|
|
Begins a critical section by acquiring the mutexes for two objects.
|
2023-01-10 10:52:02 -05:00
|
|
|
|
To ensure consistent lock ordering, the order of acquisition is
|
|
|
|
|
determined by memory address (i.e., the mutex with lower memory
|
|
|
|
|
address is acquired first). If either mutex is already locked, then
|
|
|
|
|
locks for any outstanding critical sections are released before this
|
2023-05-04 13:00:54 -04:00
|
|
|
|
thread waits for the referenced objects to be unlocked.
|
2023-01-10 10:52:02 -05:00
|
|
|
|
|
2023-05-04 13:00:54 -04:00
|
|
|
|
- ``Py_END_CRITICAL_SECTION2;``:
|
2023-01-10 10:52:02 -05:00
|
|
|
|
Behaves the same as ``Py_END_CRITICAL_SECTION`` but unlocks two
|
2023-05-04 13:00:54 -04:00
|
|
|
|
objects.
|
2023-01-10 10:52:02 -05:00
|
|
|
|
|
|
|
|
|
Additionally, when a thread transitions from the ``ATTACHED`` state to
|
|
|
|
|
the ``DETACHED`` state, it should suspend any active critical
|
|
|
|
|
sections. When transitioning from ``DETACHED`` to ``ATTACHED``, the
|
|
|
|
|
most recent suspended critical section, if any, should be resumed.
|
|
|
|
|
|
2023-05-04 13:00:54 -04:00
|
|
|
|
Note that operations that lock two containers simultaneously need to use
|
|
|
|
|
the ``Py_BEGIN_CRITICAL_SECTION2`` macro. It is not sufficient to nest
|
|
|
|
|
two calls to ``Py_BEGIN_CRITICAL_SECTION`` because the inner critical
|
|
|
|
|
section may release the locks from the outer critical section.
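
As an illustration, a C-API function might use the single-object
macros as in the following sketch. This is not from the reference
implementation; ``get_size_locked`` is a hypothetical helper:

.. code-block:: c

   /* Hypothetical sketch: read an object's size while holding its
      per-object mutex. */
   static Py_ssize_t
   get_size_locked(PyObject *op)
   {
       Py_ssize_t size;
       Py_BEGIN_CRITICAL_SECTION(op);
       /* The mutex for "op" is held here. It may be implicitly
          suspended if this thread blocks or begins a nested critical
          section, per the rules above. */
       size = Py_SIZE(op);
       Py_END_CRITICAL_SECTION;
       return size;
   }

An operation that reads or writes two containers at once would use
``Py_BEGIN_CRITICAL_SECTION2(a, b)`` instead, for the reason given in
the preceding paragraph.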

Optimistically Avoiding Locking
'''''''''''''''''''''''''''''''

A few operations on ``dict`` and ``list`` optimistically avoid
acquiring the per-object locks. They have a fast path operation that
does not acquire locks, but may fall back to a slower operation that
acquires the dictionary's or list's lock when another thread is
concurrently modifying that container.

The operations with an optimistic fast path are:

* ``PyDict_FetchItem/GetItem`` and ``dict.__getitem__``
* ``PyList_FetchItem/GetItem`` and ``list.__getitem__``

Additionally, iterators for ``dict`` and ``list`` use the above
functions so they also optimistically avoid locking when returning
the next item.

There are two motivations for avoiding lock acquisitions in these
functions. The primary reason is that it is necessary for scalable
multi-threaded performance even for simple applications. Dictionaries
hold top-level functions in modules and methods for classes. These
dictionaries are inherently highly shared by many threads in
multi-threaded programs. Contention on these locks in multi-threaded
programs for loading methods and functions would inhibit efficient
scaling in many basic programs.

The secondary motivation for avoiding locking is to reduce overhead
and improve single-threaded performance. Although lock acquisition
has low overhead compared to most operations, accessing individual
elements of lists and dictionaries is a fast operation (so the
locking overhead is comparatively larger) and frequent (so the
overhead has more impact).

This section describes the challenges with implementing dictionary and
list accesses without locking, followed by a description of this PEP's
changes to the Python interpreter required to address those
challenges.

The main challenge is that retrieving an item from a list or
dictionary and incrementing the reference count of that item is not
an atomic operation. In between the time the item is retrieved and
the reference count is incremented, another thread may modify the
list or dictionary, possibly freeing the memory for the previously
retrieved item.

A partial attempt at addressing this issue would be to convert the
reference count increment to a conditional increment, only
incrementing the reference count if it's not zero. This change is
not sufficient because when a Python object's reference count reaches
zero, the object's destructor is called and the memory storing the
object may be re-used for other data structures or returned to the
operating system. Instead, this PEP proposes a technique to ensure
that the reference count fields remain valid for the duration of the
access, so that the conditional reference count increment is safe.
This technique requires cooperation from the memory allocator
(mimalloc) as well as changes to the list and dictionary objects. The
proposed technique is similar to read-copy update (RCU) [#rcu]_, a
synchronization mechanism widely used in the Linux kernel.
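
To make the conditional increment concrete, a minimal sketch of a
``_Py_TRY_INCREF`` helper might look like the following. This is an
illustration only -- the reference implementation's version also has
to handle biased and deferred reference counts, and the atomic type
qualifiers are glossed over here:

.. code-block:: c

   /* Sketch: increment the reference count only if it is non-zero.
      Returns 1 on success; returns 0 if the object is being (or has
      been) freed, in which case the caller must retry or fall back. */
   static int
   _Py_TRY_INCREF(PyObject *op)
   {
       Py_ssize_t cnt = atomic_load(&op->ob_refcnt);
       while (cnt != 0) {
           /* On failure, the compare-exchange reloads cnt and the
              loop retries. */
           if (atomic_compare_exchange_weak(&op->ob_refcnt, &cnt,
                                            cnt + 1)) {
               return 1;
           }
       }
       return 0;
   }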

The current implementation of ``list_item`` (the C function
implementing ``list.__getitem__``) is the following:

.. code-block:: c

   Py_INCREF(a->ob_item[i]);
   return a->ob_item[i];

The proposed implementation uses the conditional increment
(``_Py_TRY_INCREF``) and has additional checks:

.. code-block:: c

   /* Load the item array and the item without holding the list's
      lock. */
   PyObject **ob_item = atomic_load(&a->ob_item);
   PyObject *item = atomic_load(&ob_item[i]);
   /* The conditional increment fails if the item was concurrently
      freed. */
   if (!item || !_Py_TRY_INCREF(item)) goto retry;
   /* Check that the slot still holds the same item... */
   if (item != atomic_load(&ob_item[i])) {
       Py_DECREF(item);
       goto retry;
   }
   /* ...and that the list still uses the same item array. */
   if (ob_item != atomic_load(&a->ob_item)) {
       Py_DECREF(item);
       goto retry;
   }
   return item;
The "retry" subroutine implements the locked fallback path when
|
|
|
|
|
concurrent modifications to the list cause the above fast,
|
|
|
|
|
non-locking path to fail:
|
|
|
|
|
|
|
|
|
|
.. code-block:: c
|
|
|
|
|
|
|
|
|
|
retry:
|
|
|
|
|
PyObject *item;
|
|
|
|
|
Py_BEGIN_CRITICAL_SECTION(a->ob_mutex);
|
|
|
|
|
item = a->ob_item[i];
|
|
|
|
|
Py_INCREF(item);
|
|
|
|
|
Py_END_CRITICAL_SECTION(a->ob_mutex);
|
|
|
|
|
return item;
|
|
|
|
|
|
|
|
|
|
The modifications to the ``dict`` implementation are similar, because
|
|
|
|
|
the relevant parts of both list and dictionary retrieval involve
|
|
|
|
|
loading an item/value from an array at a known index.
|
|
|
|
|
|
|
|
|
|
The additional checks following the conditional increment are
|
|
|
|
|
necessary because the scheme allows immediate re-use of memory,
|
|
|
|
|
including the memory that previously held a ``PyObject`` structure or
|
|
|
|
|
``list`` or ``dict`` array. Without these extra checks, the function
|
|
|
|
|
might return a Python object that was never in the list, if the
|
|
|
|
|
memory occupied by the Python object previously held a different
|
|
|
|
|
``PyObject`` whose memory previously stored an item in the list.

Mimalloc Changes for Optimistic ``list`` and ``dict`` Access
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''

The implementation requires additional constraints on the memory
allocator, including some changes to the mimalloc code. Some
background on mimalloc's implementation is helpful to understand the
required changes. Individual allocations from mimalloc are
called "blocks." Mimalloc "pages" contain consecutive blocks that
are all the same size. A mimalloc "page" is similar to
a "superblock" in other allocators; it is NOT an operating system
page. A mimalloc "heap" contains pages of various size classes; each
page belongs to a single heap. If none of the blocks of a page are
allocated, then mimalloc may re-use the page for a different size
class or different heap (i.e., it might reinitialize the page).

The list and dictionary access scheme works by partially restricting
re-use of mimalloc pages so that reference count fields remain valid
for the duration of the access. The restricted re-use of mimalloc
pages is enforced by having separate heaps for Python objects
[#heaps]_. This ensures that even if an item is freed during access
and the memory reused for a new object, the new object's reference
count field is placed at the same location in memory. The reference
count field remains valid (or zero) across allocations.

Python objects that support ``Py_TPFLAGS_MANAGED_DICT`` have their
dictionary and weak reference fields preceding the ``PyObject``
header, so their reference count fields are at a different offset from
the start of their allocations. They are stored in a separate mimalloc
heap. Additionally, non-GC objects are stored in their own heap so
that the GC only has to look at GC objects. There are therefore three
mimalloc heaps for Python objects: one for non-GC objects, one for GC
objects with managed dictionaries, and one for GC objects without
managed dictionaries.

Mimalloc Page Reuse
'''''''''''''''''''

It is beneficial to keep the restrictions on mimalloc page reuse to a
short period of time to avoid increasing overall memory usage.
Precisely limiting the restrictions to list and dictionary accesses
would minimize memory usage, but would require expensive
synchronizations. At the other extreme, keeping the restrictions
until the next GC cycle would avoid introducing any extra
synchronizations, but would potentially increase memory usage.

This PEP proposes a system that lies between those two extremes based
on FreeBSD's "GUS" [#gus]_. It uses a combination of global and
per-thread counters (or "sequence numbers") to coordinate the
determination of when it is safe to reuse an empty mimalloc page for
a different heap or for a different size class, or to return it to
the operating system:

* There is a global write sequence number that monotonically
  increases.
* When a mimalloc page is empty, it's tagged with the current write
  sequence number. The thread may also atomically increment the
  global write sequence number.
* Each thread has a local read sequence number that records the most
  recent write sequence number it has observed.
* Threads may observe the write sequence number whenever they are not
  in a list or dictionary access. The reference implementation does
  this in mimalloc's slow-path allocation function. This is called
  regularly enough to be useful, but not so frequently as to
  introduce significant overhead.
* There is a global read sequence number that stores the minimum of
  all active threads' read sequence numbers. A thread may update the
  global read sequence number by scanning each thread's local read
  sequence number. The reference implementation does this before
  allocating a fresh mimalloc page if there are restricted pages
  that could possibly be reused.
* An empty mimalloc page may be reused for a different heap or size
  class when the global read sequence number is larger than the
  page's tag number.

The condition that the global read sequence number is larger than the
page's tag is sufficient because it ensures that any thread that had
a concurrent optimistic list or dictionary access is finished with
that access. In other words, there are no threads accessing the
empty blocks in the freed page, so the page can be used for any other
purpose or even returned to the operating system.
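
The following sketch shows the shape of this protocol in C. It is
purely illustrative: the identifiers (``global_write_seq``,
``global_read_seq``, ``tag_seq``, ``read_seq``, ``ThreadState``) are
hypothetical and do not match the reference implementation, and atomic
type qualifiers are glossed over:

.. code-block:: c

   static uint64_t global_write_seq;  /* monotonically increasing */
   static uint64_t global_read_seq;   /* min of all threads' read seqs */

   /* Called when a mimalloc page becomes empty: tag it with the
      current write sequence number. */
   void tag_empty_page(mi_page_t *page) {
       page->tag_seq = atomic_load(&global_write_seq);
       atomic_fetch_add(&global_write_seq, 1);
   }

   /* Called periodically (e.g., in the slow-path allocator), while
      the thread is not inside an optimistic list/dict access. */
   void observe_write_seq(ThreadState *t) {
       t->read_seq = atomic_load(&global_write_seq);
   }

   /* The page may be reused for another heap or size class (or
      returned to the OS) only after every thread has observed a
      sequence number newer than the page's tag. */
   bool page_may_be_reused(mi_page_t *page) {
       return atomic_load(&global_read_seq) > page->tag_seq;
   }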

Optimistic ``dict`` and ``list`` Access Summary
'''''''''''''''''''''''''''''''''''''''''''''''

This PEP proposes a technique for thread-safe list and dictionary
accesses that typically avoids acquiring locks. This reduces
execution overhead and avoids some multi-threaded scaling bottlenecks
in common operations, like calling functions and methods. The scheme
works by placing temporary restrictions on mimalloc page reuse to
ensure that objects' reference count fields remain valid after
objects are freed so that conditional reference count increment
operations are safe. The restrictions are placed on mimalloc pages
instead of on individual objects to improve opportunities for memory
reuse. The restrictions are lifted as soon as the system can
determine that there are no outstanding accesses involving the empty
mimalloc page. To determine this, the system uses a combination of
lightweight per-thread sequence counters and also tags pages when
they are empty. Once each thread's local counter is larger than the
page's tag, the page can be reused for any purpose or returned to the
operating system. The restrictions are also lifted whenever the
cyclic garbage collector runs because the stop-the-world pause
ensures that threads do not have any outstanding references to empty
mimalloc pages.

Specializing Interpreter
------------------------

The specializing interpreter requires some changes to be thread-safe
when running without the GIL:

* Concurrent specializations are prevented by using a mutex. This
  prevents multiple threads from writing to the same inline cache.
* In multi-threaded programs running without the GIL, each bytecode is
  only specialized once. This prevents a thread from reading a
  partially written inline cache.
* Locking also ensures that cached values of ``tp_version_tag`` and
  ``keys_version`` are consistent with the cached descriptors and other
  values.
* Modifications to inline counters use "relaxed atomics" (see the
  sketch below). In other words, some counter decrements may be
  missed or overwritten, but that does not affect correctness.
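
As a generic illustration of the last point (not the interpreter's
actual counter code), a "relaxed" decrement in C11 looks like this:

.. code-block:: c

   #include <stdatomic.h>
   #include <stdint.h>

   static _Atomic uint16_t counter;

   void decrement_counter(void)
   {
       /* Relaxed operations are atomic (no torn reads or writes) but
          impose no ordering on other memory operations. Two racing
          threads may each load the same value, so one decrement can
          be lost -- acceptable for an advisory counter. */
       uint16_t c = atomic_load_explicit(&counter, memory_order_relaxed);
       atomic_store_explicit(&counter, c - 1, memory_order_relaxed);
   }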

``Py_mod_gil`` Slot
-------------------

In ``--disable-gil`` builds, when loading an extension, CPython will
check for a new :pep:`489`-style ``Py_mod_gil`` slot. If the slot is
set to ``Py_mod_gil_not_used``, then extension loading proceeds as
normal. If the slot is not set, the interpreter pauses all threads and
enables the GIL before continuing. Additionally, the interpreter will
issue a visible warning naming the extension, stating that the GIL was
enabled (and why), and describing the steps the user can take to
override it.
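
For illustration, a multi-phase initialization module might declare
the slot as in the sketch below. The slot constants follow the names
proposed above; the module name ``spam`` is a placeholder:

.. code-block:: c

   #include <Python.h>

   /* Sketch: a PEP 489 multi-phase init module declaring that it is
      safe to run without the GIL. */
   static PyModuleDef_Slot spam_slots[] = {
       {Py_mod_gil, Py_mod_gil_not_used},
       {0, NULL}
   };

   static struct PyModuleDef spam_module = {
       PyModuleDef_HEAD_INIT,
       .m_name = "spam",
       .m_size = 0,
       .m_slots = spam_slots,
   };

   PyMODINIT_FUNC
   PyInit_spam(void)
   {
       return PyModuleDef_Init(&spam_module);
   }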

``PYTHONGIL`` Environment Variable
----------------------------------

In ``--disable-gil`` builds, the user can also override the behavior at
runtime by setting the ``PYTHONGIL`` environment variable. Setting
``PYTHONGIL=0`` forces the GIL to be disabled, overriding the module
slot logic. Setting ``PYTHONGIL=1`` forces the GIL to be enabled.

The ``PYTHONGIL=0`` override is important because extensions that are
not thread-safe can still be useful in multi-threaded applications. For
example, one may want to use the extension from only a single thread or
guard access by locks. For context, there are already some extensions
that are not thread-safe even with the GIL, and users already have to
take these sorts of steps.

The ``PYTHONGIL=1`` override is sometimes useful for debugging.

Rationale
=========

Non-Generational Garbage Collection
-----------------------------------

This PEP proposes switching from a generational cyclic garbage
collector to a non-generational collector (when CPython is built
without the GIL). That is equivalent to only having one generation
(the "old" generation). There are two reasons for this proposed
change.

Cyclic garbage collection, even for just the young generation,
requires pausing other threads in the program. The author is
concerned that frequent collections of the young generation would
inhibit efficient scaling in multi-threaded programs. This is a
concern for young generations (but not the old generation) because
the young generations are collected after a fixed number of
allocations, while the collections for the older generation are
scheduled in proportion to the number of live objects in the heap.
Additionally, it is difficult to efficiently keep track of objects in
each generation without the GIL. For example, CPython currently uses
a linked list of objects in each generation. If CPython were to keep
that design, those lists would need to be made thread-safe, and it's
not clear how to do that efficiently.

Generational garbage collection is used to good effect in many other
language runtimes. For example, many of the Java HotSpot garbage
collector implementations use multiple generations [#hotspotgc]_. In
these runtimes, a young generation is frequently a throughput win:
since a large percentage of the young generation is typically "dead,"
the GC is able to reclaim a large amount of memory relative to the
amount of work performed. For example, several Java benchmarks show
over 90% of "young" objects are typically collected [#decapo]_
[#exploitingmemoryjava]_. This is commonly referred to as the "weak
generational hypothesis;" the observation is that most objects die
young. This pattern is reversed in CPython due to the use of
reference counting. Although most objects still die young, they are
collected when their reference counts reach zero. Objects that
survive to a garbage collection cycle are most likely to remain
alive [#cpythongc]_. This difference means that generational
collection is much less effective in CPython than in many other
language runtimes [#golangc]_.

Optimistically Avoiding Locking in ``dict`` and ``list`` Accesses
-----------------------------------------------------------------

This proposal relies on a scheme that mostly avoids acquiring locks
when accessing individual elements in lists and dictionaries. Note
that this is not "lock free" in the sense of "lock-free"
and "wait-free" algorithms that guarantee forward progress. It
simply avoids acquiring locks (mutexes) in the common case to improve
parallelism and reduce overhead.

A much simpler alternative would be to use reader-writer locks to
protect dictionary and list accesses. Reader-writer locks allow
concurrent reads, but not updates, which might seem ideal for lists
and dictionaries. The problem is that reader-writer locks have
substantial overhead and poor scalability, particularly when the
critical sections are small, as they are for single-element
dictionary and list accesses [#perfbook]_. The poor reader
scalability stems from the fact that readers must all update the same
data structure, such as the number of readers in
``pthread_rwlocks``.

The technique described in this PEP is related to RCU
("read-copy-update") [#rcu]_ and, to a lesser extent, hazard
pointers, two well-known schemes for optimizing concurrent,
read-mostly data structures. RCU is widely used in the Linux kernel
to protect shared data structures in a scalable manner. Both the
technique in this PEP and RCU work by deferring reclamation while
readers may be accessing the concurrent data structure. RCU is most
commonly used to protect individual objects (like hash tables or
linked lists), while this PEP proposes a scheme to protect larger
blocks of memory (mimalloc "pages") [#typesafe_rcu]_.

The need for this scheme is largely due to the use of reference
counting in CPython. If CPython only relied on a tracing garbage
collector, then this scheme would probably not be necessary because
tracing garbage collectors already defer reclamation in the required
manner. This would not "solve" scaling issues, but would shift many
of the challenges to the garbage collector implementation.

Backwards Compatibility
=======================

This PEP poses a number of backwards compatibility issues when
building CPython with the ``--disable-gil`` flag, but those issues do
not occur when using the default build configuration. Nearly all the
backwards compatibility concerns involve the C-API:

* CPython builds without the GIL will not be ABI compatible with the
  standard CPython build or with the stable ABI due to changes to the
  Python object header needed to support biased reference counting.
  C-API extensions will need to be rebuilt specifically for this
  version.
* C-API extensions that rely on the GIL to protect global state or
  object state in C code will need additional explicit locking to
  remain thread-safe when run without the GIL.
* C-API extensions that use borrowed references in ways that are not
  safe without the GIL will need to use the equivalent new APIs that
  return non-borrowed references. Note that only some uses of
  borrowed references are a concern; only references to objects that
  might be freed by other threads pose an issue.
* Custom memory allocators (``PyMem_SetAllocator``) are required to
  delegate the actual allocation to the previously set allocator. For
  example, the Python debug allocator and tracing allocators will
  continue to work because they delegate the allocation to the
  underlying allocator. On the other hand, wholesale replacing of the
  allocator (e.g., with jemalloc or tcmalloc) will not work
  correctly.
* Python objects must be allocated through the standard APIs, such as
  ``PyType_GenericNew`` or ``PyObject_Malloc``. Non-Python objects
  must **not** be allocated through those APIs. For example, it is
  currently acceptable to allocate buffers (non-Python objects)
  through ``PyObject_Malloc``; that will no longer be allowed and
  buffers should instead be allocated through ``PyMem_Malloc``,
  ``PyMem_RawMalloc``, or ``malloc`` (see the sketch below).
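
For example, a minimal sketch of the required change for a scratch
buffer in an extension (illustrative only):

.. code-block:: c

   static int
   process(Py_ssize_t size)
   {
       /* Previously common, but no longer allowed under this PEP,
          because PyObject_Malloc would draw from the Python-object
          heaps:
          char *buf = (char *)PyObject_Malloc(size); */

       /* Correct: non-Python data uses the generic allocators. */
       char *buf = (char *)PyMem_Malloc(size);
       if (buf == NULL) {
           PyErr_NoMemory();
           return -1;
       }
       /* ... use buf ... */
       PyMem_Free(buf);
       return 0;
   }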

There are fewer potential backwards compatibility issues for Python
code:

* Destructors and weak reference callbacks for code objects and
  top-level function objects are delayed until the next cyclic
  garbage collection due to the use of deferred reference counting.
* Destructors for some objects accessed by multiple threads may be
  delayed slightly due to biased reference counting. This is rare:
  most objects, even those accessed by multiple threads, are
  destroyed immediately when their reference counts reach zero.
  Two places in the Python standard library tests required
  ``gc.collect()`` calls to continue to pass.

Distribution
============

This PEP poses new challenges for distributing Python. At least for
some time, there will be two versions of Python requiring separately
compiled C-API extensions. It may take some time for C-API extension
authors to build ``--disable-gil`` compatible packages and upload
them to PyPI. Additionally, some authors may be hesitant to support
the ``--disable-gil`` mode until it has wide adoption, but adoption
will likely depend on the availability of Python's rich set of
extensions.

To mitigate this, the author will work with Anaconda to distribute
a ``--disable-gil`` version of Python together with compatible
packages from conda channels. This centralizes the challenges of
building extensions, and the author believes this will enable more
people to use Python without the GIL sooner than they would otherwise
be able to.

Performance
===========

The changes to make CPython thread-safe without the GIL increase
execution overhead for ``--disable-gil`` builds. The performance
impact is different for programs that use only a single thread compared
to programs that use multiple threads, so the table below reports
execution overhead separately for these two types of programs.

.. list-table:: Execution Overhead on pyperformance 1.0.6
   :header-rows: 1
   :widths: auto

   * -
     - Intel Skylake
     - AMD Zen 3
   * - One thread
     - 6%
     - 5%
   * - Multiple threads
     - 8%
     - 7%

The baseline used to measure overhead is ``018be4c`` from `PR 19474`_,
which implements immortal objects for Python 3.12. The largest
contribution to execution overhead is biased reference counting,
followed by per-object locking. For thread-safety reasons, an
application running with multiple threads will only specialize a given
bytecode once; this is why the overhead for programs that use multiple
threads is larger compared to programs that only use one thread.
However, with the GIL disabled, programs that use multiple threads
should also be able to more effectively use multiple CPU cores.

Note that this PEP would not affect the performance of the default
(non ``--disable-gil``) builds of CPython.

.. _PR 19474: https://github.com/python/cpython/pull/19474

Build Bots
==========

The stable build bots will also include ``--disable-gil`` builds.

How to Teach This
=================

As part of implementing the ``--disable-gil`` mode, the author will
write a "HOWTO" guide [#howto]_ for making packages compatible when
running Python without the GIL.

Reference Implementation
========================

There are two GitHub repositories implementing versions of CPython
without the GIL:

* https://github.com/colesbury/nogil-3.12
* https://github.com/colesbury/nogil

The ``nogil-3.12`` repository is based on Python 3.12.0a4. It is
useful for evaluating single-threaded execution overhead and as a
reference implementation for this PEP. It is less useful for
evaluating C-API extension compatibility because many extensions are
not currently compatible with Python 3.12. Due to limited time for
the 3.12 port, the ``nogil-3.12`` implementation does not skip all
deferred reference counts. As a temporary workaround, the
implementation immortalizes objects that use deferred reference
counting in programs that spawn multiple threads.

The ``nogil`` repository is based on Python 3.9.10. It is useful for
evaluating multi-threading scaling in real world applications and
extension compatibility. It is more stable and well tested than the
``nogil-3.12`` repository.

Alternatives
============

Python currently supports a number of ways to enable parallelism, but
the existing techniques come with significant limitations.

Multiprocessing
---------------

The multiprocessing library allows Python programs to start and
communicate with Python subprocesses. This allows for parallelism
because each subprocess has its own Python interpreter (i.e., there's
one GIL per process). Multiprocessing has a few substantial
limitations. Communication between processes is limited: objects
generally need to be serialized or copied to shared memory. This
introduces overhead (due to serialization) and complicates building
APIs on top of multiprocessing. Starting a subprocess is also more
expensive than starting a thread, especially with the "spawn"
implementation. Starting a thread takes ~100 µs, while spawning a
subprocess takes ~50 ms (50,000 µs) due to Python re-initialization.

Finally, many C and C++ libraries support access from multiple
threads but do not support access or use across multiple processes.

Releasing the GIL in C-API Extensions
-------------------------------------

C-API extensions can release the GIL around long-running functions.
This allows for some degree of parallelism, since multiple threads
can run concurrently when the GIL is released, but the overhead of
acquiring and releasing the GIL typically prevents this from scaling
efficiently beyond a few threads. Many scientific computing
libraries release the GIL in computationally heavy functions, and the
CPython standard library releases the GIL around blocking I/O.

Internal Parallelization
------------------------

Functions implemented in C may use multiple threads internally. For
example, Intel's NumPy distribution, PyTorch, and TensorFlow all use
this technique to internally parallelize individual operations. This
works well when the basic operations are large enough to be
parallelized efficiently, but not when there are many small
operations or when the operations depend on some Python code. Calling
into Python from C requires acquiring the GIL -- even short snippets
of Python code can inhibit scaling.

Related Work
============

Per-Interpreter GIL
-------------------

The recently accepted :pep:`684` proposes a per-interpreter GIL to
address multi-core parallelism. This would allow parallelism between
interpreters in the same process, but places substantial restrictions
on sharing Python data between interpreters. Both this PEP
and :pep:`684` address multi-core parallelism, but with different
tradeoffs and techniques. It is feasible to implement both PEPs in
CPython at the same time.

Gilectomy
---------

Gilectomy [#gilectomy]_ was a project by Larry Hastings to remove the
GIL in CPython. Like the design proposed by this PEP, the Gilectomy
supported multiple threads running in parallel within the same
interpreter (i.e., "free-threading") and made use of fine-grained
locking. The reference implementation in this PEP improves on
single-threaded performance and scalability compared to the
Gilectomy.

PyParallel
----------

PyParallel [#pyparallel]_ was a proof-of-concept fork of Python 3.3 by
Trent Nelson that supported multiple threads running simultaneously
in a single Python process. The fork introduced the concept
of "parallel threads" -- threads that can run simultaneously while
the main Python thread is suspended. Parallel threads had read-only
access to objects created by the main thread. Objects created within
parallel threads lived for the lifetime of the creating thread. For
HTTP servers, this might correspond to the lifetime of a request.

python-safethread
-----------------

The python-safethread [#pythonsafethread]_ project was a patch to
Python 3.0 by Adam Olsen to remove the GIL. Some aspects of the
project are similar to the design proposed by this PEP. Both use
fine-grained locking and optimize reference counting for cases
where the object is created and accessed by the same thread.

Greg Stein's Free-Threading Patch
---------------------------------

In 1996, Greg Stein published a patch against Python 1.4 that removed
the GIL [#gsteinpatch]_. The patch used atomic reference counting on
Windows and a global reference count lock on Linux. List and
dictionary accesses were protected by mutexes. Parts of the patch
were adopted in CPython. In particular, the patch introduced a
``PyThreadState`` structure and correct per-thread exception handling.

Dave Beazley revisited the patch in a 2011 blog post [#dabeaz]_.

Jython and IronPython
---------------------

Some alternative Python implementations like Jython [#jython]_ and
IronPython [#ironpython]_ do not have a global interpreter lock.
However, they do not support CPython extensions. (The implementations
can interface with code written in Java or C#.)

PyPy-STM
--------

The pypy-stm [#pypystm]_ interpreter is a variant of PyPy that uses
software transactional memory. The authors report single-threaded
performance overhead in the 20%-50% range compared to PyPy. It is
not compatible with CPython extensions.

Rejected Ideas
==============

Why Not Use a Concurrent Garbage Collector?
-------------------------------------------

Many recent garbage collectors are mostly concurrent -- they avoid long
stop-the-world pauses by allowing the garbage collector to run
concurrently with the application. So why not use a concurrent
collector?

Concurrent collection requires write barriers (or read barriers). The
author is not aware of a way to add write barriers to CPython without
substantially breaking the C-API.

Why Not Deprecate ``PyDict_GetItem`` in Favor of ``PyDict_FetchItem``?
----------------------------------------------------------------------

This PEP proposes a new API ``PyDict_FetchItem`` which behaves like
``PyDict_GetItem``, but returns a new reference instead of a borrowed
reference. As described in `Borrowed References`_, some uses of
borrowed references that were safe when running with the GIL are
unsafe when running without the GIL and need to be replaced by
functions like ``PyDict_FetchItem`` that return new references.

This PEP does *not* propose deprecating ``PyDict_GetItem`` and similar
functions that return borrowed references for a few reasons:

* Many of the uses of borrowed references are safe, even when running
  without the GIL. For example, C API functions often use
  ``PyDict_GetItem`` to retrieve items from the keyword
  argument dictionary. These calls are safe because the keyword
  argument dictionary is only visible to a single thread.
* The author tried this approach early on and found that wholesale
  replacement of ``PyDict_GetItem`` with ``PyDict_FetchItem``
  frequently introduced new reference counting bugs. In the author's
  opinion, the risk of introducing new reference counting bugs
  generally outweighs the risks of missing a ``PyDict_GetItem`` call
  that is unsafe without the GIL.

Why Not Use PEP 683 Immortalization?
------------------------------------

Like :pep:`683`, this PEP proposes an immortalization scheme for
Python objects, but the PEPs use different bit representations to
mark immortal objects. The schemes cannot be identical because this
PEP depends on biased reference counting, which has two reference
count fields instead of one.

Open Issues
===========

Improved Specialization
-----------------------

The Python 3.11 release introduced quickening and specialization as part
of the faster CPython project, substantially improving performance.
Specialization replaces slow bytecode instructions with faster
variants [#pep659]_. To maintain thread-safety, applications that use
multiple threads (and run without the GIL) will only specialize each
bytecode once, which can lower performance on some programs. It is
possible to support specializing multiple times, but that requires more
investigation and is not part of this PEP.

Python Build Modes
------------------

This PEP introduces a new build mode (``--disable-gil``) that is not
ABI compatible with the standard build mode. The additional build
mode adds complexity for both Python core developers and extension
developers. The author believes a worthwhile goal is to combine
these build modes and have the global interpreter lock controlled at
runtime, possibly disabled by default. The path to this goal remains
an open issue, but a possible path might look like the following:

#. In 2024, CPython 3.13 is released with support for a
   ``--disable-gil`` build time flag. There are two ABIs for
   CPython, one with the GIL and one without. Extension authors
   target both ABIs.
#. After 2--3 releases (i.e., in 2026--2027), CPython is released
   with the GIL controlled by a runtime environment variable or
   flag. The GIL is enabled by default. There is only a single ABI.
#. After another 2--3 releases (i.e., 2028--2030), CPython switches to
   the GIL being disabled by default. The GIL can still be enabled
   at runtime via an environment variable or command line flag.

This PEP covers the first step, with the remaining steps left as open
issues. In this scenario, there would be a two to three year period
where extension authors would target an extra CPython build per
supported CPU architecture and OS.

Integration
-----------

The reference implementation changes approximately 15,000 lines of code
in CPython and includes mimalloc, which is also approximately 15,000
lines of code. Most changes are not performance sensitive and can be
included in both ``--disable-gil`` and the default builds. Some
macros, like ``Py_BEGIN_CRITICAL_SECTION``, will be no-ops in the
default build. The author does not expect a huge number of ``#ifdef``
statements to support the ``--disable-gil`` builds.

Mitigations for Single-Threaded Performance
-------------------------------------------

The changes proposed in the PEP will increase execution overhead for
``--disable-gil`` builds compared to Python builds with the GIL. In
other words, it will have slower single-threaded performance. There
are some possible optimizations to reduce execution overhead,
especially for ``--disable-gil`` builds that only use a single
thread. These may be worthwhile if a longer term goal is to have a
single build mode, but the choice of optimizations and their
trade-offs remain an open issue.

References
==========

.. [#yuemmwang2019] "Exploiting Parallelism Opportunities with Deep Learning Frameworks."
   Yu Emma Wang, Carole-Jean Wu, Xiaodong Wang, Kim Hazelwood, David Brooks. 2019.
   https://arxiv.org/abs/1908.04705.

.. [#torchdeploy] "Using Python for Model Inference in Deep Learning."
   Zachary DeVito, Jason Ansel, Will Constable, Michael Suo, Ailing Zhang, Kim Hazelwood. 2021.
   https://arxiv.org/abs/2104.00254. See Figure 5.

.. [#brc] "Biased reference counting: minimizing atomic operations in garbage collection".
   Jiho Choi, Thomas Shull, and Josep Torrellas. PACT 2018.
   https://dl.acm.org/doi/abs/10.1145/3243176.3243195.

.. [#pep683] :pep:`683` -- Immortal Objects, Using a Fixed Refcount.

.. [#tid] https://github.com/colesbury/nogil/blob/f7e45d6bfbbd48c8d5cf851c116b73b85add9fc6/Include/object.h#L428-L455.

.. [#rcu] "What is RCU, Fundamentally?"
   Paul E. McKenney, Jonathan Walpole. 2017.
   https://lwn.net/Articles/262464/

.. [#heaps] There are two heaps for Python objects because PyObjects
   that support cyclic garbage collection have extra fields preceding
   the PyObject struct.

.. [#gus] "Global Unbounded Sequences (GUS)"
   https://github.com/freebsd/freebsd-src/blob/9408f36627b74a472dc82f7a43320235c0c9055a/sys/kern/subr_smr.c#L44.
   See also https://people.kernel.org/joelfernandes/gus-vs-rcu.

.. [#perfbook] "Is Parallel Programming Hard, And, If So, What Can You Do About It?"
   Paul E. McKenney. 2022.
   https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html.

.. [#typesafe_rcu] ``SLAB_TYPESAFE_BY_RCU`` is an example in which RCU
   protects blocks of memory and not any individual object. See
   https://www.kernel.org/doc/html/latest/RCU/whatisRCU.html#analogy-with-reference-counting.

.. [#hotspotgc] "HotSpot Virtual Machine Garbage Collection Tuning Guide."
   https://docs.oracle.com/en/java/javase/12/gctuning/hotspot-virtual-machine-garbage-collection-tuning-guide.pdf.
   Most of the HotSpot garbage collectors are generational, with the
   notable exception of ZGC, although there is ongoing work to make
   that generational.

.. [#decapo] `The DaCapo Benchmarks: Java Benchmarking Development and
   Analysis
   <https://openresearch-repository.anu.edu.au/bitstream/1885/33723/2/01_Blackburn_The_DaCapo_Benchmarks:_Java_2006.pdf>`_.
   See column "Nursery Survival" in Table 4.

.. [#exploitingmemoryjava] "Exploiting memory usage patterns to improve garbage collections in Java."
   https://dl.acm.org/doi/abs/10.1145/1852761.1852768.

.. [#cpythongc] "most things usually turn out to be reachable"
   https://github.com/python/cpython/blob/cd6655a8589e99ae4088b3bed4a692a19ed48779/Modules/gcmodule.c#L1106.

.. [#golangc] The Go team observed something similar in Go, but due to
   escape analysis and pass-by-value instead of reference
   counting. Recent versions of Go use a non-generational garbage
   collector. https://go.dev/blog/ismmkeynote.

.. [#nogil] https://github.com/colesbury/nogil.

.. [#nogil312] https://github.com/colesbury/nogil-3.12.

.. [#howto] Python HOWTOs.
   https://docs.python.org/3/howto/index.html.

.. [#pep659] :pep:`659` -- Specializing Adaptive Interpreter.

.. [#gilectomy] Gilectomy.
   Larry Hastings. 2016.
   https://github.com/larryhastings/gilectomy/tree/gilectomy.

.. [#pyparallel] PyParallel.
   Trent Nelson. 2016.
   http://pyparallel.org/.

.. [#pythonsafethread] python-safethread.
   Adam Olsen. 2008.
   https://launchpad.net/python-safethread

.. [#gsteinpatch] https://www.python.org/ftp/python/contrib-09-Dec-1999/System/threading.tar.gz.

.. [#dabeaz] An Inside Look at the GIL Removal Patch of Lore.
   David Beazley. 2011.
   https://dabeaz.blogspot.com/2011/08/inside-look-at-gil-removal-patch-of.html.

.. [#jython] Jython.
   https://www.jython.org/

.. [#ironpython] IronPython.
   https://ironpython.net/

.. [#pypystm] PyPy: Software Transactional Memory.
   https://doc.pypy.org/en/latest/stm.html

Acknowledgments
===============

Thanks to Hugh Leather, Łukasz Langa, and Eric Snow for providing
feedback on drafts of this PEP.

Copyright
=========

This document is placed in the public domain or under the
CC0-1.0-Universal license, whichever is more permissive.