python-peps/pep-0509.txt

438 lines
15 KiB
Plaintext
Raw Normal View History

PEP: 509
Title: Add dict.__version__
Version: $Revision$
Last-Modified: $Date$
Author: Victor Stinner <victor.stinner@gmail.com>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 4-January-2016
Python-Version: 3.6
Abstract
========
Add a new read-only ``__version__`` property to ``dict`` and
``collections.UserDict`` types, incremented at each change.
Rationale
=========
In Python, the builtin ``dict`` type is used by many instructions. For
example, the ``LOAD_GLOBAL`` instruction searchs for a variable in the
global namespace, or in the builtins namespace (two dict lookups).
Python uses ``dict`` for the builtins namespace, globals namespace, type
namespaces, instance namespaces, etc. The local namespace (namespace of
a function) is usually optimized to an array, but it can be a dict too.
Python is hard to optimize because almost everything is mutable: builtin
functions, function code, global variables, local variables, ... can be
modified at runtime. Implementing optimizations respecting the Python
semantic requires to detect when "something changes": we will call these
checks "guards".
The speedup of optimizations depends on the speed of guard checks. This
PEP proposes to add a version to dictionaries to implement efficient
guards on namespaces.
Example of optimization: replace loading a global variable with a
constant. This optimization requires a guard on the global variable to
check if it was modified. If the variable is modified, the variable must
be loaded at runtime, instead of using the constant.
Guard example
=============
Pseudo-code of an efficient guard to check if a dictionary key was
modified (created, updated or deleted)::
UNSET = object()
class Guard:
def __init__(self, dict, key):
self.dict = dict
self.key = key
self.value = dict.get(key, UNSET)
self.version = dict.__version__
def check(self):
"""Return True if the dictionary value did not changed."""
version = self.dict.__version__
if version == self.version:
# Fast-path: avoid the dictionary lookup
return True
value = self.dict.get(self.key, UNSET)
if value == self.value:
# another key was modified:
# cache the new dictionary version
self.version = version
return True
return False
Changes
=======
Add a read-only ``__version__`` property to builtin ``dict`` type and to
the ``collections.UserDict`` type. New empty dictionaries are initilized
to version ``0``. The version is incremented at each change:
* ``clear()`` if the dict was non-empty
* ``pop(key)`` if the key exists
* ``popitem()`` if the dict is non-empty
* ``setdefault(key, value)`` if the `key` does not exist
* ``__detitem__(key)`` if the key exists
* ``__setitem__(key, value)`` if the `key` doesn't exist or if the value
is different
* ``update(...)`` if new values are different than existing values (the
version can be incremented multiple times)
Example::
>>> d = {}
>>> d.__version__
0
>>> d['key'] = 'value'
>>> d.__version__
1
>>> d['key'] = 'new value'
>>> d.__version__
2
>>> del d['key']
>>> d.__version__
3
If a dictionary is created with items, the version is also incremented
at each dictionary insertion. Example::
>>> d=dict(x=7, y=33)
>>> d.__version__
2
The version is not incremented is an existing key is modified to the
same value, but only the identifier of the value is tested, not the
content of the value. Example::
>>> d={}
>>> value = object()
>>> d['key'] = value
>>> d.__version__
2
>>> d['key'] = value
>>> d.__version__
2
.. note::
CPython uses some singleton like integers in the range [-5; 257],
empty tuple, empty strings, Unicode strings of a single character in
the range [U+0000; U+00FF], etc. When a key is set twice to the same
singleton, the version is not modified.
The PEP is designed to implement guards on namespaces, only the ``dict``
type can be used for namespaces in practice. ``collections.UserDict``
is modified because it must mimicks ``dict``. ``collections.Mapping`` is
unchanged.
Integer overflow
================
The implementation uses the C unsigned integer type ``size_t`` to store
the version. On 32-bit systems, the maximum version is ``2**32-1``
(more than ``4.2 * 10 ** 9``, 4 billions). On 64-bit systems, the maximum
version is ``2**64-1`` (more than ``1.8 * 10**19``).
The C code uses ``version++``. On integer overflow, the version is
wrapped to ``0`` (and then continue to be incremented).
The check ``dict.__version__ == old_version`` can be true after an
integer overflow, so a guard can return false even if the value changed,
which is wrong. The bug occurs if the dict is modified at least ``2**64``
times (on 64-bit system) between two checks of the guard.
Using a more complex type (ex: ``PyLongObject``) to avoid the overflow
would slow down operations on the ``dict`` type. Even if there is a
theorical risk of missing a value change, the risk is considered too low
compared to the slow down of using a more complex type.
Usage of dict.__version__
=========================
Detect dictionary mutation during iteration
-------------------------------------------
Currently, iterating on a dictionary only detects when the dictionary
size changes, but not when keys or values are modified. Using the
dictionary version, it would be possible to detect when keys and values
are modified.
See the `issue #19332: Guard against changing dict during iteration
<https://bugs.python.org/issue19332>`_.
astoptimizer of FAT Python
--------------------------
The astoptimizer of the FAT Python project implements many optimizations
which require guards on namespaces. Examples:
* Call pure builtins: to replace ``len("abc")`` with ``3``, guards on
``builtins.__dict__['len']`` and ``globals()['len']`` are required
* Loop unrolling: to unroll the loop ``for i in range(...): ...``,
guards on ``builtins.__dict__['range']`` and ``globals()['range']``
are required
The `FAT Python
<http://faster-cpython.readthedocs.org/fat_python.html>`_ project is a
static optimizer for Python 3.6.
Pyjion
------
According of Brett Cannon, one of the two main developers of Pyjion, Pyjion can
also benefit from dictionary version to implement optimizations.
Pyjion is a JIT compiler for Python based upon CoreCLR (Microsoft .NET Core
runtime).
Unladen Swallow
---------------
Even if dictionary version was not explicitly mentionned, optimization globals
and builtins lookup was part of the Unladen Swallow plan: "Implement one of the
several proposed schemes for speeding lookups of globals and builtins."
Source: `Unladen Swallow ProjectPlan
<https://code.google.com/p/unladen-swallow/wiki/ProjectPlan>`_.
Unladen Swallow is a fork of CPython 2.6.1 adding a JIT compiler implemented
with LLVM. The project stopped in 2011: `Unladen Swallow Retrospective
<http://qinsb.blogspot.com.au/2011/03/unladen-swallow-retrospective.html>`_.
Implementation
==============
See the `issue #26058: Add dict.__version__ read-only property
<https://bugs.python.org/issue26058>`_.
On pybench and timeit microbenchmarks, the patch does not seem to add
any overhead on dictionary operations.
Alternatives
============
2016-01-09 18:30:32 -05:00
Bikeshedding on the property name
---------------------------------
* ``__cache_token__``: name proposed by Nick Coghlan, name coming from
`abc.get_cache_token()
<https://docs.python.org/3/library/abc.html#abc.get_cache_token>`_.
* ``__version__``
* ``__timestamp__``
Add a version but don't expose it at the Python level
-----------------------------------------------------
Guards can be implemented in C to access directly the ``ma_version``
field of the ``PyDictObject`` structure.
Add a version to each dict entry
--------------------------------
A single version per dictionary requires to keep a strong reference to
the value which can keep the value alive longer than expected. If we add
also a version per dictionary entry, the guard can rely on the entry
version and so avoid the strong reference to the value (only strong
references to a dictionary and key are needed).
Changes: add a ``getversion(key)`` method to dictionary which returns
``None`` if the key doesn't exist. When a key is created or modified,
the entry version is set to the dictionary version which is incremented
at each change (create, modify, delete).
Pseudo-code of an efficient guard to check if a dict key was modified
using ``getversion()``::
UNSET = object()
class Guard:
def __init__(self, dict, key):
self.dict = dict
self.key = key
self.dict_version = dict.__version__
self.entry_version = dict.getversion(key)
def check(self):
"""Return True if the dictionary value did not changed."""
dict_version = self.dict.__version__
if dict_version == self.version:
# Fast-path: avoid the dictionary lookup
return True
# lookup in the dictionary, but get the entry version,
#not the value
entry_version = self.dict.getversion(self.key)
if entry_version == self.entry_version:
# another key was modified:
# cache the new dictionary version
self.dict_version = dict_version
return True
return False
This main drawback of this option is the impact on the memory footprint.
It increases the size of each dictionary entry, so the overhead depends
on the number of buckets (dictionary entries, used or unused yet). For
example, it increases the size of each dictionary entry by 8 bytes on
64-bit system if we use ``size_t``.
In Python, the memory footprint matters and the trend is more to reduce
it. Examples:
* `PEP 393 -- Flexible String Representation
<https://www.python.org/dev/peps/pep-0393/>`_
* `PEP 412 -- Key-Sharing Dictionary
<https://www.python.org/dev/peps/pep-0412/>`_
Add a new dict subtype
----------------------
Add a new ``verdict`` type, subtype of ``dict``. When guards are needed,
use the ``verdict`` for namespaces (module namespace, type namespace,
instance namespace, etc.) instead of ``dict``.
Leave the ``dict`` type unchanged to not add any overhead (memory
footprint) when guards are not needed.
Technical issue: a lot of C code in the wild, including CPython core,
expect the exact ``dict`` type. Issues:
* ``exec()`` requires a ``dict`` for globals and locals. A lot of code
use ``globals={}``. It is not possible to cast the ``dict`` to a
``dict`` subtype because the caller expects the ``globals`` parameter
to be modified (``dict`` is mutable).
* Functions call directly ``PyDict_xxx()`` functions, instead of calling
``PyObject_xxx()`` if the object is a ``dict`` subtype
* ``PyDict_CheckExact()`` check fails on ``dict`` subtype, whereas some
functions require the exact ``dict`` type.
* ``Python/ceval.c`` does not completly supports dict subtypes for
namespaces
The ``exec()`` issue is a blocker issue.
Other issues:
* The garbage collector has a special code to "untrack" ``dict``
instances. If a ``dict`` subtype is used for namespaces, the garbage
collector may be unable to break some reference cycles.
* Some functions have a fast-path for ``dict`` which would not be taken
for ``dict`` subtypes, and so it would make Python a little bit
slower.
Prior Art
=========
2016-01-10 18:15:41 -05:00
Method cache and type version tag
---------------------------------
In 2007, Armin Rigo wrote a patch to to implement a cache of methods. It
was merged into Python 2.6. The patch adds a "type attribute cache
version tag" (``tp_version_tag``) and a "valid version tag" flag to
types (the ``PyTypeObject`` structure).
The type version tag is not available at the Python level.
The version tag has the C type ``unsigned int``. The cache is a global
hash table of 4096 entries, shared by all types. The cache is global to
"make it fast, have a deterministic and low memory footprint, and be
easy to invalidate". Each cache entry has a version tag. A global
version tag is used to create the next version tag, it also has the C
type ``unsigned int``.
By default, a type has its "valid version tag" flag cleared to indicate
that the version tag is invalid. When the first method of the type is
cached, the version tag and the "valid version tag" flag are set. When a
type is modified, the "valid version tag" flag of the type and its
subclasses is cleared. Later, when a cache entry of these types is used,
the entry is removed because its version tag is outdated.
On integer overflow, the whole cache is cleared and the global version
tag is reset to ``0``.
See also `issue #1685986: Method cache
<https://bugs.python.org/issue1685986>`_ and `issue #1700288: Armin's
method cache optimization updated for Python 2.6
<https://bugs.python.org/issue1700288>`_.
Guard against changing dict during iteration
--------------------------------------------
In 2013, Serhiy Storchaka proposed a patch for the `issue #19332: Guard
against changing dict during iteration
<https://bugs.python.org/issue19332>`_ (mentioned above) which adds a
``size_t ma_count`` field to the ``PyDictObject`` structure. This field
is incremented when the dictionary is modified, and so is very similar
to the proposed ``dict.__version__``.
Cached globals+builtins lookup
------------------------------
In 2006, Andrea Griffini proposes a patch implementing a `Cached
globals+builtins lookup optimization <https://bugs.python.org/issue1616125>`_.
The patch adds a private ``timestamp`` field to dict.
See the thread on python-dev: `About dictionary lookup caching
<https://mail.python.org/pipermail/python-dev/2006-December/070348.html>`_.
Globals / builtins cache
------------------------
In 2010, Antoine Pitrou proposed a `Globals / builtins cache
<http://bugs.python.org/issue10401>`_ which adds a private
``ma_version`` field to the ``dict`` type. The patch adds a "global and
builtin cache" to functions and frames, and changes ``LOAD_GLOBAL`` and
``STORE_GLOBAL`` instructions to use the cache.
PySizer
-------
`PySizer <http://pysizer.8325.org/>`_: a memory profiler for Python,
Google Summer of Code 2005 project by Nick Smallbone.
This project has a patch for CPython 2.4 which adds ``key_time`` and
``value_time`` fields to dictionary entries. It uses a global
process-wide counter for dictionaries, incremented each time that a
dictionary is modified. The times are used to decide when child objects
first appeared in their parent objects.
Discussion
==========
Thread on the python-ideas mailing list: `RFC: PEP: Add dict.__version__
<https://mail.python.org/pipermail/python-ideas/2016-January/037702.html>`_.
Copyright
=========
This document has been placed in the public domain.