126 lines
4.5 KiB
Plaintext
126 lines
4.5 KiB
Plaintext
PEP: 552
|
||
Title: Deterministic pycs
|
||
Version: $Revision$
|
||
Last-Modified: $Date$
|
||
Author: Benjamin Peterson <benjamin@python.org>
|
||
Status: Draft
|
||
Type: Standards Track
|
||
Content-Type: text/x-rst
|
||
Created: 2017-09-04
|
||
Python-Version: 3.7
|
||
|
||
|
||
Abstract
|
||
========
|
||
|
||
This PEP proposes to an extension to the pyc format to make it more
|
||
deterministic.
|
||
|
||
|
||
Rationale
|
||
=========
|
||
|
||
A `reproducible build`_ is one where the same byte-for-byte output is generated
|
||
every time the same sources are built—even across different machines (naturally
|
||
subject to the requirement that they have rather similar environments
|
||
setup). Reproducibility is important for security. It is also a key concept in
|
||
content-based build systems such as Bazel_, which are most effective when the
|
||
output files’ contents are a deterministic function of the input files’
|
||
contents.
|
||
|
||
The current Python pyc format is the marshaled code object of the module
|
||
prefixed by a `magic number`_, the source timestamp, and the source file
|
||
size. The presence of a source timestamp means that a pyc is not a deterministic
|
||
function of the input file’s contents—it also depends on the volatile metadata,
|
||
mtime, of the source. Thus, the pycs are a barrier to proper reproducibility.
|
||
|
||
Distributors of Python code are currently stuck with the options of
|
||
|
||
1. not distributing pycs and losing the caching advantages
|
||
|
||
2. distributing pycs and losing reproducibility
|
||
|
||
3. carefully giving all Python source files a deterministic timestamp
|
||
(see https://github.com/python/cpython/pull/296)
|
||
|
||
4. doing a complicated mixture of 1. and 2. like generating pycs at installation
|
||
time
|
||
|
||
None of these options are very attractive. This PEP proposes a better solution.
|
||
|
||
(Note there are other problems [#frozensets]_ [#interning]_ we do not
|
||
address here that can make pycs non-deterministic.)
|
||
|
||
|
||
Specification
|
||
=============
|
||
|
||
Python will begin to recognize two magic number variants for every pyc
|
||
version. One magic number will correspond to the current pyc format and the
|
||
other to "hash-based" pycs introduced by this PEP.
|
||
|
||
In hash-based pycs, the second field in the pyc header (currently the
|
||
"timestamp" field) will become a bitset of flags. We define the lowest flag in
|
||
this bitset called ``check_source``. Following the bitset is a 64-bit hash of
|
||
the source file. We will use a SipHash_ (with a hardcoded key) of the contents
|
||
of the source file. Another a fast hash like MD5 or BLAKE2_ would also work. We
|
||
choose SipHash because Python already has a builtin implementation of it from
|
||
:pep:`456`, although an interface that allows picking the SipHash key must be
|
||
exposed to Python.
|
||
|
||
When Python encounters a hash-based pyc, its behavior depends on the setting of
|
||
the ``check_source`` flag. If the ``check_source`` flag is set, Python will
|
||
determine the validity of the pyc by hashing the source file and comparing the
|
||
hash with the expected hash in the pyc. If the pyc needs to be regenerated, it
|
||
will be regenerated as a hash-based pyc again with the ``check_source`` flag
|
||
set.
|
||
|
||
For hash-based pycs with the ``check_source`` unset, Python will simply load the
|
||
pyc without checking the hash of the source file. The expectation in this case
|
||
is that some external system (e.g., the local Linux distribution’s package
|
||
manager) is responsible for keeping pycs up to date, so Python itself doesn’t
|
||
have to check. Even when validation is disabled, the hash field should be set
|
||
correctly, so out-of-band consistency checkers can verify the up-to-dateness of
|
||
the pyc. Note also that the :pep:`3147` edict that pycs without corresponding
|
||
source files not be loaded will still be enforced for hash-based pycs.
|
||
|
||
The ``py_compile`` and ``compileall`` tools will be extended with new options to
|
||
generate hash-based pycs with and without the ``check_source`` bit set. Their
|
||
programmatic APIs will also gain similar functionality.
|
||
|
||
|
||
References
|
||
==========
|
||
|
||
.. _reproducible build: https://reproducible-builds.org/
|
||
.. _Bazel: https://bazel.build/
|
||
.. _BLAKE2: https://blake2.net/
|
||
.. _SipHash: https://131002.net/siphash/
|
||
.. [#frozensets] http://benno.id.au/blog/2013/01/15/python-determinism
|
||
.. [#interning] http://bugzilla.opensuse.org/show_bug.cgi?id=1049186
|
||
.. _magic number: https://docs.python.org/3/library/importlib.html#importlib.util.MAGIC_NUMBER
|
||
|
||
|
||
Credits
|
||
=======
|
||
|
||
The author would like to thank Gregory P. Smith for useful conversations on the
|
||
topic of this PEP.
|
||
|
||
|
||
Copyright
|
||
=========
|
||
|
||
This document has been placed in the public domain.
|
||
|
||
|
||
|
||
..
|
||
Local Variables:
|
||
mode: indented-text
|
||
indent-tabs-mode: nil
|
||
sentence-end-double-space: t
|
||
fill-column: 70
|
||
coding: utf-8
|
||
End:
|