PEP: 552
Title: Deterministic pycs
Version: $Revision$
Last-Modified: $Date$
Author: Benjamin Peterson <benjamin@python.org>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 2017-09-04
Python-Version: 3.7
Post-History: 2017-09-07


Abstract
========

This PEP proposes an extension to the pyc format to make it more deterministic.


Rationale
=========

A `reproducible build`_ is one where the same byte-for-byte output is generated
every time the same sources are built, even across different machines (naturally
subject to the requirement that they have reasonably similar environments set
up). Reproducibility is important for security. It is also a key concept in
content-based build systems such as Bazel_, which are most effective when the
output files’ contents are a deterministic function of the input files’
contents.

The current Python pyc format is the marshaled code object of the module
prefixed by a `magic number`_, the source timestamp, and the source file
size. The presence of a source timestamp means that a pyc is not a deterministic
function of the input file’s contents: it also depends on volatile metadata, the
mtime of the source. Thus, pycs are a barrier to proper reproducibility.

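The effect of the timestamp can be seen in a minimal sketch of the current
three-word header. The helper ``legacy_pyc_header`` and the two mtime values
are hypothetical, and little-endian packing is an assumption of this sketch
rather than something stated above.

```python
import struct
from importlib.util import MAGIC_NUMBER


def legacy_pyc_header(mtime: int, source_size: int) -> bytes:
    """Build a three-word header: magic number, source mtime, source size."""
    # Assumption: both fields are little-endian unsigned 32-bit words.
    return MAGIC_NUMBER + struct.pack(
        "<II", mtime & 0xFFFFFFFF, source_size & 0xFFFFFFFF
    )


source = b"print('hello')\n"

# The same source bytes, "built" at two different (made-up) times.
header_a = legacy_pyc_header(1_504_483_200, len(source))
header_b = legacy_pyc_header(1_504_569_600, len(source))

# The headers differ although the source content is identical.
headers_differ = header_a != header_b
```

Only the mtime word changes between the two headers, which is enough to make
the resulting pyc files differ byte for byte.
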
Distributors of Python code are currently stuck with the options of

1. not distributing pycs and losing the caching advantages

2. distributing pycs and losing reproducibility

3. carefully giving all Python source files a deterministic timestamp
   (see, for example, https://github.com/python/cpython/pull/296)

4. doing a complicated mixture of 1. and 2. like generating pycs at installation
   time

None of these options are very attractive. This PEP proposes allowing the
timestamp to be replaced with a deterministic hash. The current timestamp
invalidation method will remain the default, though. Despite its nondeterminism,
timestamp invalidation works well for many workflows and use cases. The
hash-based pyc format can impose the cost of reading and hashing every source
file, which is more expensive than simply checking timestamps. Thus, for now, we
expect it to be used mainly by distributors and power use cases.

(Note there are other problems [#frozensets]_ [#interning]_ we do not
address here that can make pycs non-deterministic.)


Specification
=============

The pyc header currently consists of three 32-bit words. We will expand it to
four. The first word will continue to be the magic number, versioning the
bytecode and pyc format. The second word, conceptually the new word, will be a
bit field. The interpretation of the rest of the header and the invalidation
behavior of the pyc depend on the contents of the bit field.

If the bit field is 0, the pyc is a traditional timestamp-based pyc. I.e., the
third and fourth words will be the timestamp and file size respectively, and
invalidation will be done by comparing the metadata of the source file with that
in the header.

If the lowest bit of the bit field is set, the pyc is a hash-based pyc. We call
the second lowest bit the ``check_source`` flag. Following the bit field is a
64-bit hash of the source file, occupying the third and fourth words. The hash
is a SipHash_ with a hardcoded key, computed over the contents of the source
file. Another fast hash like MD5 or BLAKE2_ would also work. We choose SipHash
because Python already has a builtin implementation of it from :pep:`456`,
although an interface that allows picking the SipHash key must be exposed to
Python. Security of the hash is not a concern, though we pass over
completely-broken hashes like MD5 to ease auditing of Python in controlled
environments.

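The header layout described above can be sketched as a small parser. The names
``PycHeader`` and ``parse_pyc_header`` are hypothetical, the words are assumed
to be little-endian, and rejecting bit-field values other than those described
is a choice of this sketch, not part of the specification.

```python
import struct
from typing import NamedTuple, Optional

HASH_BASED = 0b01    # lowest bit: the pyc is hash-based
CHECK_SOURCE = 0b10  # second lowest bit: the check_source flag


class PycHeader(NamedTuple):
    magic: bytes
    flags: int
    mtime: Optional[int]          # timestamp-based pycs only
    source_size: Optional[int]    # timestamp-based pycs only
    source_hash: Optional[bytes]  # hash-based pycs only


def parse_pyc_header(data: bytes) -> PycHeader:
    """Split a four-word header into fields according to the bit field."""
    if len(data) < 16:
        raise ValueError("a pyc header is four 32-bit words (16 bytes)")
    magic = data[:4]
    # Assumption: header words are little-endian.
    (flags,) = struct.unpack("<I", data[4:8])
    if flags & HASH_BASED:
        # Words three and four together hold the 64-bit source hash.
        return PycHeader(magic, flags, None, None, data[8:16])
    if flags != 0:
        # Sketch-only choice: refuse bit fields the text does not describe.
        raise ValueError(f"unrecognized bit field: {flags:#x}")
    mtime, size = struct.unpack("<II", data[8:16])
    return PycHeader(magic, flags, mtime, size, None)


# Usage with a placeholder magic number and a dummy hash.
hashed = parse_pyc_header(b"MAGC" + struct.pack("<I", 0b11) + b"\x01" * 8)
stamped = parse_pyc_header(b"MAGC" + struct.pack("<III", 0, 1504483200, 15))
```

Here ``hashed`` carries a hash and has ``check_source`` set, while ``stamped``
is interpreted as a traditional timestamp-based header.
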
When Python encounters a hash-based pyc, its behavior depends on the setting of
the ``check_source`` flag. If the ``check_source`` flag is set, Python will
determine the validity of the pyc by hashing the source file and comparing the
hash with the expected hash in the pyc. If the pyc needs to be regenerated, it
will be regenerated as a hash-based pyc again with the ``check_source`` flag
set.

For hash-based pycs with the ``check_source`` flag unset, Python will simply
load the pyc without checking the hash of the source file. The expectation in
this case is that some external system (e.g., the local Linux distribution’s
package manager) is responsible for keeping pycs up to date, so Python itself
doesn’t have to check. Even when validation is disabled, the hash field should
be set correctly, so out-of-band consistency checkers can verify that the pyc is
up to date. Note also that the :pep:`3147` edict that pycs without corresponding
source files not be loaded will still be enforced for hash-based pycs.

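An out-of-band consistency checker of the kind mentioned above might look like
the following sketch. It assumes an interpreter that implements this proposal
(CPython 3.7 or later), where the hash function is available as
``importlib.util.source_hash`` and header words are little-endian; the function
name ``pyc_matches_source`` is hypothetical.

```python
import importlib.util


def pyc_matches_source(pyc_path: str, source_path: str) -> bool:
    """Return True if a hash-based pyc records the hash of the current source."""
    with open(pyc_path, "rb") as f:
        header = f.read(16)
    if len(header) < 16 or header[:4] != importlib.util.MAGIC_NUMBER:
        return False  # truncated file or pyc written for another bytecode version
    flags = int.from_bytes(header[4:8], "little")
    if not flags & 0b1:
        raise ValueError("not a hash-based pyc")
    with open(source_path, "rb") as f:
        expected = importlib.util.source_hash(f.read())
    # The check_source flag is deliberately ignored: the hash field is
    # expected to be correct whether or not the interpreter validates it.
    return header[8:16] == expected
```

A package manager could run such a check after installation or upgrade instead
of relying on the interpreter to validate each pyc at import time.
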
The programmatic APIs of ``py_compile`` and ``compileall`` will support
generation of hash-based pycs. Principally, ``py_compile`` will define a new
enumeration corresponding to all the available pyc invalidation modes::

    from enum import Enum, auto

    class PycInvalidationMode(Enum):
        TIMESTAMP = auto()
        CHECKED_HASH = auto()
        UNCHECKED_HASH = auto()

``py_compile.compile``, ``compileall.compile_dir``, and
``compileall.compile_file`` will all gain an ``invalidation_mode`` parameter,
which accepts a value of the ``PycInvalidationMode`` enumeration.

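As an illustration of the ``invalidation_mode`` parameter, the sketch below
compiles one source file in every mode and records the bit field each pyc
ends up with. It assumes an interpreter that implements this proposal (CPython
3.7 or later) and little-endian header words; the file names are made up.

```python
import os
import py_compile
import tempfile

flags_by_mode = {}

with tempfile.TemporaryDirectory() as tmp:
    source_path = os.path.join(tmp, "example.py")
    with open(source_path, "w") as f:
        f.write("ANSWER = 42\n")

    for mode in py_compile.PycInvalidationMode:
        pyc_path = py_compile.compile(
            source_path,
            cfile=os.path.join(tmp, f"{mode.name.lower()}.pyc"),
            invalidation_mode=mode,
            doraise=True,
        )
        with open(pyc_path, "rb") as f:
            header = f.read(16)
        # The second header word is the bit field.
        flags_by_mode[mode.name] = int.from_bytes(header[4:8], "little")
```

With the flag layout described earlier, ``TIMESTAMP`` should yield a bit field
of 0, ``UNCHECKED_HASH`` only the lowest bit, and ``CHECKED_HASH`` both bits.
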
The ``compileall`` tool will be extended with a new command-line option,
``--invalidation-mode``, to generate hash-based pycs with and without the
``check_source`` bit set. ``--invalidation-mode`` will be a tristate option
taking values ``timestamp`` (the default), ``checked-hash``, and
``unchecked-hash`` corresponding to the values of ``PycInvalidationMode``.

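One way to map the option values onto the enumeration is sketched below. The
helper ``mode_from_option`` is hypothetical, and the sketch assumes an
interpreter in which ``py_compile.PycInvalidationMode`` exists (CPython 3.7 or
later).

```python
import py_compile


def mode_from_option(value: str) -> py_compile.PycInvalidationMode:
    """Map an --invalidation-mode value to the matching enumeration member."""
    try:
        # "checked-hash" -> "CHECKED_HASH", looked up by member name.
        return py_compile.PycInvalidationMode[value.replace("-", "_").upper()]
    except KeyError:
        raise ValueError(f"unknown invalidation mode: {value!r}") from None
```

Looking members up by name keeps the option values and the enumeration in
step without a separate translation table.
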
``importlib`` will be extended with a ``source_hash(source)`` function that
computes the hash used by the pyc writing code for a bytestring **source**.

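The properties expected of this function can be checked with a short sketch.
It assumes CPython 3.7 or later, where the function is exposed as
``importlib.util.source_hash`` and returns the 64-bit hash as eight bytes.

```python
import importlib.util

source = b"print('hello')\n"
digest = importlib.util.source_hash(source)

# The hash is 64 bits wide and depends only on the source bytes.
is_64_bit = isinstance(digest, bytes) and len(digest) == 8
is_repeatable = digest == importlib.util.source_hash(source)
changes_with_source = digest != importlib.util.source_hash(source + b"# edit\n")
```

Because the key is hardcoded, the same source bytes should hash to the same
value on every machine running the same interpreter version.
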
Runtime configuration of hash-based pyc invalidation will be facilitated by a
new ``--check-hash-based-pycs`` interpreter option. This is a tristate option
taking the values ``default``, ``always``, and ``never``. The default value,
``default``, means the ``check_source`` flag in hash-based pycs determines
invalidation as described above. ``always`` causes the interpreter to hash the
source file for invalidation regardless of the value of the ``check_source``
bit. ``never`` causes the interpreter to always assume hash-based pycs are
valid. Timestamp-based pycs are unaffected by this option.


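The interaction between the interpreter option and the ``check_source`` flag
can be summarized as a small decision function. This is a sketch of the rules
stated above, not the interpreter's implementation; the name
``must_hash_source`` is hypothetical.

```python
HASH_BASED = 0b01    # lowest bit of the bit field
CHECK_SOURCE = 0b10  # second lowest bit of the bit field


def must_hash_source(flags: int, check_hash_based_pycs: str = "default") -> bool:
    """Decide whether the source must be hashed before a pyc is trusted."""
    if not flags & HASH_BASED:
        return False  # timestamp-based pycs are unaffected by the option
    if check_hash_based_pycs == "always":
        return True
    if check_hash_based_pycs == "never":
        return False
    if check_hash_based_pycs == "default":
        return bool(flags & CHECK_SOURCE)
    raise ValueError(f"invalid value: {check_hash_based_pycs!r}")
```

For example, an unchecked hash-based pyc (bit field ``0b01``) is hashed only
under ``always``, while a checked one (``0b11``) is hashed unless the option is
``never``.
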
References
==========

.. _reproducible build: https://reproducible-builds.org/
.. _Bazel: https://bazel.build/
.. _BLAKE2: https://blake2.net/
.. _SipHash: https://131002.net/siphash/
.. [#frozensets] http://benno.id.au/blog/2013/01/15/python-determinism
.. [#interning] http://bugzilla.opensuse.org/show_bug.cgi?id=1049186
.. _magic number: https://docs.python.org/3/library/importlib.html#importlib.util.MAGIC_NUMBER


Credits
=======

The author would like to thank Gregory P. Smith, Christian Heimes, and Steve
Dower for useful conversations on the topic of this PEP.


Copyright
=========

This document has been placed in the public domain.



..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 80
   coding: utf-8
   End: