pep 552: deterministic pycs
parent bed0e724ff
commit 05e7f55fc1

@@ -0,0 +1,118 @@
PEP: 552
Title: Deterministic pycs
Version: $Revision$
Last-Modified: $Date$
Author: Benjamin Peterson <benjamin@python.org>
Status: Draft
Python-Version: 3.7
Type: Standards Track
Content-Type: text/x-rst
Created: 2017-09-04


Abstract
========

This PEP proposes an extension to the pyc format to make it more
deterministic.


Rationale
=========

A `reproducible build`_ is one where the same byte-for-byte output is generated
every time the same sources are built, even across different machines (provided
that they have similarly configured environments). Reproducibility is important
for security. It is also a key concept in content-based build systems such as
Bazel_, which are most effective when the output files’ contents are a
deterministic function of the input files’ contents.

The current Python pyc format is the marshaled code object of the module
prefixed by a `magic number`_, the source timestamp, and the source file
size. The presence of a source timestamp means that a pyc is not a deterministic
function of the input file’s contents: it also depends on volatile metadata, the
mtime of the source. Thus, pycs are a barrier to proper reproducibility.

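As a rough illustration of the problem, the three header fields described above
can be assembled by hand. This is only a sketch: ``legacy_pyc_header`` is a
hypothetical helper, not a CPython function, and the example timestamps and
source size are made up. It assumes that both integer fields are stored as
32-bit little-endian values after the 4-byte magic number.

```python
import struct
from importlib.util import MAGIC_NUMBER


def legacy_pyc_header(mtime, source_size):
    """Return the 12-byte header: magic number, source mtime, source size."""
    # Both integer fields are written as 32-bit little-endian values.
    return MAGIC_NUMBER + struct.pack("<II", mtime & 0xFFFFFFFF,
                                      source_size & 0xFFFFFFFF)


# Same source size, two different modification times (made-up values).
first = legacy_pyc_header(mtime=1504483200, source_size=120)
second = legacy_pyc_header(mtime=1504569600, source_size=120)

# The headers differ although the source contents are unchanged.
print(first == second)
```

Touching the source file without editing it is therefore enough to change the
bytes of the resulting pyc.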
Distributors of Python code are currently stuck with the following options:

1. not distributing pycs and losing the caching advantages

2. distributing pycs and losing reproducibility

3. carefully giving all Python source files a deterministic timestamp (see
   https://github.com/python/cpython/pull/296)

4. doing a complicated mixture of 1. and 2., such as generating pycs at
   installation time

None of these options is very attractive. This PEP proposes a better solution.

(Note that there are `other problems`_ that can make pycs non-deterministic;
this PEP does not address them.)


Specification
=============

Python will begin to recognize two magic number variants for every pyc
version. One magic number will correspond to the current pyc format and the
other to "hash-based" pycs introduced by this PEP.

In hash-based pycs, the second field in the pyc header (currently the
"timestamp" field) will contain the SipHash_ of the contents of the source
file. Another fast hash like MD5 or BLAKE2_ would also work. We choose SipHash
because Python already has a builtin implementation of it from :pep:`456`. The
third field in the pyc header (currently the "source size" field) will become a
bitset of flags. We define the lowest flag in this bitset to be
``check_source``.

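The header layout described above might be sketched as follows. Several details
here are assumptions that this draft does not settle: ``HASH_BASED_MAGIC`` is a
made-up placeholder, the hash field is assumed to keep the 4-byte width of the
timestamp field it replaces, and truncated BLAKE2b stands in for SipHash (the
text notes that another fast hash would also work). The helper names are
hypothetical.

```python
import hashlib
import struct

# Hypothetical placeholder: the draft does not assign the hash-based magic.
HASH_BASED_MAGIC = b"HPYC"
CHECK_SOURCE = 0x1  # lowest bit of the flags field


def source_hash(source_bytes):
    """Stand-in hash: BLAKE2b truncated to 4 bytes, not SipHash."""
    return hashlib.blake2b(source_bytes, digest_size=4).digest()


def hash_based_header(source_bytes, check_source):
    """Return magic number, source hash, and flags as a 12-byte header."""
    flags = CHECK_SOURCE if check_source else 0
    return (HASH_BASED_MAGIC + source_hash(source_bytes)
            + struct.pack("<I", flags))


source = b"print('hello')\n"
checked = hash_based_header(source, check_source=True)
unchecked = hash_based_header(source, check_source=False)
```

Because every field is derived from the source contents or from a fixed
setting, building the header twice from the same bytes gives the same result
regardless of the file's mtime.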
When Python encounters a hash-based pyc, its behavior depends on the setting of
the ``check_source`` flag. If the ``check_source`` flag is set, Python will
determine the validity of the pyc by hashing the source file and comparing the
hash with the expected hash in the pyc. If the pyc needs to be regenerated, it
will be regenerated as a hash-based pyc again with the ``check_source`` flag
set.

For hash-based pycs with the ``check_source`` flag unset, Python will simply
load the pyc without checking the hash of the source file. The expectation in
this case is that some external system (e.g., the local Linux distribution’s
package manager) is responsible for keeping pycs up to date, so Python itself
doesn’t have to check. Even when validation is disabled, the hash field should
be set correctly, so that out-of-band consistency checkers can verify that the
pyc is up to date. Note also that the :pep:`3147` edict that pycs without
corresponding source files not be loaded will still be enforced for hash-based
pycs.

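The two cases described above (flag set and flag unset) can be summarized as a
small decision function. This is a sketch under assumptions, not the import
system's actual code: the 12-byte header layout, the ``b"HPYC"`` magic
placeholder, the truncated BLAKE2b stand-in for SipHash, and the names
``hash_based_pyc_is_valid`` and ``make_header`` are all hypothetical.

```python
import hashlib
import struct

CHECK_SOURCE = 0x1  # lowest bit of the flags field


def source_hash(source_bytes):
    """Stand-in hash: BLAKE2b truncated to 4 bytes, not SipHash."""
    return hashlib.blake2b(source_bytes, digest_size=4).digest()


def hash_based_pyc_is_valid(header, source_bytes):
    """Decide whether a hash-based pyc may be loaded without recompiling.

    ``header`` holds the 12 header bytes; ``source_bytes`` is the source
    file's contents, or None if the source file is missing.
    """
    if source_bytes is None:
        # A pyc without a corresponding source file is not loaded.
        return False
    stored_hash = header[4:8]
    (flags,) = struct.unpack("<I", header[8:12])
    if not flags & CHECK_SOURCE:
        # Unchecked: an external system keeps the pyc up to date.
        return True
    # Checked: the stored hash must match the hash of the current source.
    return stored_hash == source_hash(source_bytes)


def make_header(source_bytes, flags):
    # Hypothetical 4-byte magic placeholder followed by hash and flags.
    return b"HPYC" + source_hash(source_bytes) + struct.pack("<I", flags)


old_source = b"x = 1\n"
new_source = b"x = 2\n"
checked = make_header(old_source, CHECK_SOURCE)
unchecked = make_header(old_source, 0)
```

With these definitions, ``hash_based_pyc_is_valid(checked, new_source)``
rejects the stale pyc, while the unchecked variant is accepted without the
source being hashed. An out-of-band checker could still compare the stored hash
against the source itself.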
The ``py_compile`` and ``compileall`` tools will be extended with new options to
generate hash-based pycs with and without the ``check_source`` bit set. Their
programmatic APIs will also gain similar functionality.


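This draft does not name the new options. For illustration only, and not as a
statement of this draft's design: in CPython 3.7 and later, the idea is exposed
through the ``invalidation_mode`` argument of ``py_compile.compile`` and the
``py_compile.PycInvalidationMode`` enum. The released header layout also
differs from the one described in this draft (a 32-bit flags word follows the
magic number, then an 8-byte hash), and the sketch below relies on that
released layout when it reads the header back.

```python
import os
import py_compile
import struct
import tempfile

Mode = py_compile.PycInvalidationMode


def compile_header(source_path, pyc_path, mode):
    """Compile source_path with the given mode; return the 16 header bytes."""
    py_compile.compile(source_path, cfile=pyc_path, doraise=True,
                       invalidation_mode=mode)
    with open(pyc_path, "rb") as f:
        return f.read(16)


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "mod.py")
    with open(src, "wb") as f:
        f.write(b"x = 1\n")

    checked = compile_header(src, os.path.join(tmp, "checked.pyc"),
                             Mode.CHECKED_HASH)
    # Change only the mtime; the contents stay the same.
    os.utime(src, (0, 0))
    rechecked = compile_header(src, os.path.join(tmp, "rechecked.pyc"),
                               Mode.CHECKED_HASH)
    unchecked = compile_header(src, os.path.join(tmp, "unchecked.pyc"),
                               Mode.UNCHECKED_HASH)

# The flags word follows the 4-byte magic number.
(checked_flags,) = struct.unpack("<I", checked[4:8])
(unchecked_flags,) = struct.unpack("<I", unchecked[4:8])
```

The header is unaffected by the mtime change, and the two modes differ only in
the flags word. ``compileall`` accepts the same ``invalidation_mode`` argument
in those releases.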
References
==========

.. _reproducible build: https://reproducible-builds.org/
.. _Bazel: https://bazel.build/
.. _BLAKE2: https://blake2.net/
.. _SipHash: https://131002.net/siphash/
.. _other problems: http://benno.id.au/blog/2013/01/15/python-determinism
.. _magic number: https://docs.python.org/3/library/importlib.html#importlib.util.MAGIC_NUMBER


Credits
=======

The author would like to thank Gregory P. Smith for useful conversations on the
topic of this PEP.


Copyright
=========

This document has been placed in the public domain.



..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End: