2017-09-05 14:21:50 -04:00
|
|
|
|
PEP: 552
|
|
|
|
|
Title: Deterministic pycs
|
|
|
|
|
Version: $Revision$
|
|
|
|
|
Last-Modified: $Date$
|
|
|
|
|
Author: Benjamin Peterson <benjamin@python.org>
|
|
|
|
|
Status: Draft
|
|
|
|
|
Type: Standards Track
|
|
|
|
|
Content-Type: text/x-rst
|
|
|
|
|
Created: 2017-09-04
|
2017-09-05 14:26:53 -04:00
|
|
|
|
Python-Version: 3.7
|
2017-09-05 14:21:50 -04:00
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Abstract
|
|
|
|
|
========
|
|
|
|
|
|
|
|
|
|
This PEP proposes to an extension to the pyc format to make it more
|
|
|
|
|
deterministic.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Rationale
|
|
|
|
|
=========
|
|
|
|
|
|
|
|
|
|
A `reproducible build`_ is one where the same byte-for-byte output is generated
|
|
|
|
|
every time the same sources are built—even across different machines (naturally
|
|
|
|
|
subject to the requirement that they have rather similar environments
|
|
|
|
|
setup). Reproducibility is important for security. It is also a key concept in
|
|
|
|
|
content-based build systems such as Bazel_, which are most effective when the
|
|
|
|
|
output files’ contents are a deterministic function of the input files’
|
|
|
|
|
contents.
|
|
|
|
|
|
|
|
|
|
The current Python pyc format is the marshaled code object of the module
|
|
|
|
|
prefixed by a `magic number`_, the source timestamp, and the source file
|
|
|
|
|
size. The presence of a source timestamp means that a pyc is not a deterministic
|
|
|
|
|
function of the input file’s contents—it also depends on the volatile metadata,
|
|
|
|
|
mtime, of the source. Thus, the pycs are a barrier to proper reproducibility.
|
|
|
|
|
|
|
|
|
|
Distributors of Python code are currently stuck with the options of
|
2017-09-06 13:56:06 -04:00
|
|
|
|
|
2017-09-05 14:21:50 -04:00
|
|
|
|
1. not distributing pycs and losing the caching advantages
|
2017-09-06 13:56:06 -04:00
|
|
|
|
|
2017-09-05 14:21:50 -04:00
|
|
|
|
2. distributing pycs and losing reproducibility
|
2017-09-06 13:56:06 -04:00
|
|
|
|
|
2017-09-05 14:21:50 -04:00
|
|
|
|
3. carefully giving all Python source files a deterministic timestamp (see
|
2017-09-06 17:46:40 -04:00
|
|
|
|
https://github.com/python/cpython/pull/296)
|
2017-09-06 13:56:06 -04:00
|
|
|
|
|
2017-09-05 14:21:50 -04:00
|
|
|
|
4. doing a complicated mixture of 1. and 2. like generating pycs at installation
|
|
|
|
|
time
|
2017-09-06 13:56:06 -04:00
|
|
|
|
|
|
|
|
|
None of these options are very attractive. This PEP proposes a better solution.
|
2017-09-05 14:21:50 -04:00
|
|
|
|
|
|
|
|
|
(Note there are `other problems`_ we do not address here that can make pycs
|
|
|
|
|
non-deterministic.)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Specification
|
|
|
|
|
=============
|
|
|
|
|
|
|
|
|
|
Python will begin to recognize two magic number variants for every pyc
|
|
|
|
|
version. One magic number will correspond to the current pyc format and the
|
|
|
|
|
other to "hash-based" pycs introduced by this PEP.
|
|
|
|
|
|
|
|
|
|
In hash-based pycs, the second field in the pyc header (currently the
|
|
|
|
|
"timestamp" field) will contain the SipHash_ of the contents of the source
|
|
|
|
|
file. Another a fast hash like MD5 or BLAKE2_ would also work. We choose SipHash
|
|
|
|
|
because Python already has a builtin implementation of it from :pep:`456`. The
|
|
|
|
|
third field in the pyc header (currently the "source size" field) will become a
|
|
|
|
|
bitset of flags. We define the lowest flag in this bitset called
|
|
|
|
|
``check_source``
|
|
|
|
|
|
|
|
|
|
When Python encounters a hash-based pyc, its behavior depends on the setting of
|
|
|
|
|
the ``check_source`` flag. If the ``check_source`` flag is set, Python will
|
|
|
|
|
determine the validity of the pyc by hashing the source file and comparing the
|
|
|
|
|
hash with the expected hash in the pyc. If the pyc needs to be regenerated, it
|
|
|
|
|
will be regenerated as a hash-based pyc again with the ``check_source`` flag
|
|
|
|
|
set.
|
|
|
|
|
|
|
|
|
|
For hash-based pycs with the ``check_source`` unset, Python will simply load the
|
|
|
|
|
pyc without checking the hash of the source file. The expectation in this case
|
|
|
|
|
is that some external system (e.g., the local Linux distribution’s package
|
|
|
|
|
manager) is responsible for keeping pycs up to date, so Python itself doesn’t
|
|
|
|
|
have to check. Even when validation is disabled, the hash field should be set
|
|
|
|
|
correctly, so out-of-band consistency checkers can verify the up-to-dateness of
|
|
|
|
|
the pyc. Note also that the :pep:`3147` edict that pycs without corresponding
|
|
|
|
|
source files not be loaded will still be enforced for hash-based pycs.
|
|
|
|
|
|
|
|
|
|
The ``py_compile`` and ``compileall`` tools will be extended with new options to
|
|
|
|
|
generate hash-based pycs with and without the ``check_source`` bit set. Their
|
|
|
|
|
programmatic APIs will also gain similar functionality.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
References
|
|
|
|
|
==========
|
|
|
|
|
|
|
|
|
|
.. _reproducible build: https://reproducible-builds.org/
|
|
|
|
|
.. _Bazel: https://bazel.build/
|
|
|
|
|
.. _BLAKE2: https://blake2.net/
|
|
|
|
|
.. _SipHash: https://131002.net/siphash/
|
|
|
|
|
.. _other problems: http://benno.id.au/blog/2013/01/15/python-determinism
|
|
|
|
|
.. _magic number: https://docs.python.org/3/library/importlib.html#importlib.util.MAGIC_NUMBER
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Credits
|
|
|
|
|
=======
|
|
|
|
|
|
|
|
|
|
The author would like to thank Gregory P. Smith for useful conversations on the
|
|
|
|
|
topic of this PEP.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Copyright
|
|
|
|
|
=========
|
|
|
|
|
|
|
|
|
|
This document has been placed in the public domain.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
..
|
|
|
|
|
Local Variables:
|
|
|
|
|
mode: indented-text
|
|
|
|
|
indent-tabs-mode: nil
|
|
|
|
|
sentence-end-double-space: t
|
|
|
|
|
fill-column: 70
|
|
|
|
|
coding: utf-8
|
|
|
|
|
End:
|