PEP: 552
Title: Deterministic pycs
Version: $Revision$
Last-Modified: $Date$
Author: Benjamin Peterson <benjamin@python.org>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 2017-09-04
Python-Version: 3.7
Post-History: 2017-09-07


Abstract
========

This PEP proposes an extension to the pyc format to make it more deterministic.


Rationale
=========

A `reproducible build`_ is one where the same byte-for-byte output is generated
every time the same sources are built, even across different machines (naturally
subject to the requirement that they have rather similar environments
set up). Reproducibility is important for security. It is also a key concept in
content-based build systems such as Bazel_, which are most effective when the
output files’ contents are a deterministic function of the input files’
contents.

The current Python pyc format is the marshaled code object of the module
prefixed by a `magic number`_, the source timestamp, and the source file
size. The presence of a source timestamp means that a pyc is not a deterministic
function of the input file’s contents: it also depends on volatile metadata, the
mtime of the source. Thus, pycs are a barrier to proper reproducibility.

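The layout described above can be sketched with ``struct``. This is only an
illustration: the helper name ``legacy_pyc_header`` and the sample mtime and
size values are hypothetical, and the sketch assumes the two metadata fields
are packed as little-endian 32-bit integers::

    import struct
    from importlib.util import MAGIC_NUMBER


    def legacy_pyc_header(mtime, size, magic=MAGIC_NUMBER):
        """Pack a timestamp-style header: magic, source mtime, source size."""
        # Both metadata fields are truncated to 32 bits (assumed little-endian).
        return magic + struct.pack("<II", mtime & 0xFFFFFFFF, size & 0xFFFFFFFF)


    # Identical source size, mtimes one second apart (hypothetical values).
    first = legacy_pyc_header(mtime=1504635710, size=120)
    second = legacy_pyc_header(mtime=1504635711, size=120)
    print(first == second)

Only the timestamp word differs between the two headers, which is enough to
make two pycs built from identical source contents differ byte for byte.
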
Distributors of Python code are currently stuck with the options of

1. not distributing pycs and losing the caching advantages

2. distributing pycs and losing reproducibility

3. carefully giving all Python source files a deterministic timestamp
   (see, for example, https://github.com/python/cpython/pull/296)

4. doing a complicated mixture of (1) and (2), such as generating pycs at
   installation time

None of these options are very attractive. This PEP proposes allowing the
timestamp to be replaced with a deterministic hash. The current timestamp
invalidation method will remain the default, though. Despite its nondeterminism,
timestamp invalidation works well for many workflows and use cases. The
hash-based pyc format can impose the cost of reading and hashing every source
file, which is more expensive than simply checking timestamps. Thus, for now, we
expect it to be used mainly by distributors and in advanced use cases.

(Note there are other problems [#frozensets]_ [#interning]_ we do not
address here that can make pycs non-deterministic.)


Specification
=============

The pyc header currently consists of three 32-bit words. We will expand it to
four. The first word will continue to be the magic number, versioning the
bytecode and pyc format. The second word, conceptually the new word, will be a
bit field. The interpretation of the rest of the header and the invalidation
behavior of the pyc depend on the contents of the bit field.

If the bit field is 0, the pyc is a traditional timestamp-based pyc. I.e., the
third and fourth words will be the timestamp and file size respectively, and
invalidation will be done by comparing the metadata of the source file with that
in the header.

If the lowest bit of the bit field is set, the pyc is a hash-based pyc. We call
the second lowest bit the ``check_source`` flag. Following the bit field is a
64-bit hash of the source file. We will use a SipHash_ of the contents of the
source file, computed with a hardcoded key. Another fast hash like MD5 or
BLAKE2_ would also work. We choose SipHash because Python already has a builtin
implementation of it from :pep:`456`, although an interface that allows picking
the SipHash key must be exposed to Python. Security of the hash is not a
concern, though we pass over completely-broken hashes like MD5 to ease auditing
of Python in controlled environments.

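A minimal sketch of how a reader might interpret the proposed four-word header
follows. The names ``PycHeader`` and ``parse_pyc_header`` are hypothetical, the
fields are assumed to be little-endian, and rejecting bit-field values other
than the ones described above is an assumption of this sketch rather than part
of the proposal::

    import struct
    from collections import namedtuple

    PycHeader = namedtuple(
        "PycHeader", "magic flags check_source mtime size source_hash"
    )


    def parse_pyc_header(data):
        """Split the first 16 bytes of a pyc into the proposed four words."""
        if len(data) < 16:
            raise ValueError("header is shorter than four 32-bit words")
        magic = data[:4]
        (flags,) = struct.unpack("<I", data[4:8])
        if flags == 0:
            # Timestamp-based pyc: words three and four are mtime and size.
            mtime, size = struct.unpack("<II", data[8:16])
            return PycHeader(magic, flags, False, mtime, size, None)
        if flags & 0b01 and not flags & ~0b11:
            # Hash-based pyc: words three and four hold the 64-bit source hash.
            check_source = bool(flags & 0b10)
            return PycHeader(magic, flags, check_source, None, None, data[8:16])
        # Assumption of this sketch: any other bit pattern is treated as invalid.
        raise ValueError("unrecognized bit field: {:#x}".format(flags))

Returning the hash as raw bytes keeps the sketch independent of whichever hash
function produced it.
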
When Python encounters a hash-based pyc, its behavior depends on the setting of
the ``check_source`` flag. If the ``check_source`` flag is set, Python will
determine the validity of the pyc by hashing the source file and comparing the
hash with the expected hash in the pyc. If the pyc needs to be regenerated, it
will be regenerated as a hash-based pyc again with the ``check_source`` flag
set.

For hash-based pycs with the ``check_source`` flag unset, Python will simply
load the pyc without checking the hash of the source file. The expectation in
this case is that some external system (e.g., the local Linux distribution’s
package manager) is responsible for keeping pycs up to date, so Python itself
doesn’t have to check. Even when validation is disabled, the hash field should
be set correctly, so out-of-band consistency checkers can verify that the pyc
is up to date. Note also that the :pep:`3147` edict that pycs without
corresponding source files not be loaded will still be enforced for hash-based
pycs.

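The two behaviors can be summarized in a short sketch. The function names are
hypothetical, and ``stand_in_source_hash`` uses a 64-bit BLAKE2b digest from
``hashlib`` purely as a stand-in for the keyed SipHash proposed here, so it
will not produce the same values. The source is passed as a callable so that
the unchecked path never reads it::

    import hashlib


    def stand_in_source_hash(source_bytes):
        """Stand-in only: a 64-bit BLAKE2b digest, not the proposed SipHash."""
        return hashlib.blake2b(source_bytes, digest_size=8).digest()


    def hash_pyc_is_usable(flags, stored_hash, read_source,
                           hash_func=stand_in_source_hash):
        """Decide whether a hash-based pyc may be loaded as-is."""
        if not flags & 0b01:
            raise ValueError("not a hash-based pyc")
        if not flags & 0b10:
            # check_source unset: trust the external system, skip hashing.
            return True
        # check_source set: read and hash the source, then compare.
        return hash_func(read_source()) == stored_hash

The sketch does not model the :pep:`3147` requirement that the source file
exist, nor the regeneration of a stale pyc.
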
The programmatic APIs of ``py_compile`` and ``compileall`` will support
generation of hash-based pycs. Principally, ``py_compile`` will define a new
enumeration corresponding to all the available pyc invalidation modes::

    from enum import Enum, auto

    class PycInvalidationMode(Enum):
        # Member values are illustrative; only the names are specified.
        TIMESTAMP = auto()
        CHECKED_HASH = auto()
        UNCHECKED_HASH = auto()

``py_compile.compile``, ``compileall.compile_dir``, and
``compileall.compile_file`` will all gain an ``invalidation_mode`` parameter,
which accepts a value of the ``PycInvalidationMode`` enumeration.

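Assuming an interpreter that implements this proposal, the parameter might be
used as sketched below. The enumeration and parameter names are the ones given
above; the helper ``compile_header`` and the file names are hypothetical. The
sketch compiles the same source twice, changing only its mtime in between, and
compares the resulting 16-byte headers::

    import os
    import py_compile
    import tempfile


    def compile_header(source_path, cfile, mode):
        """Compile source_path to cfile and return the pyc's 16-byte header."""
        py_compile.compile(source_path, cfile=cfile, doraise=True,
                           invalidation_mode=mode)
        with open(cfile, "rb") as f:
            return f.read(16)


    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "example.py")
        pyc = os.path.join(tmp, "example.pyc")
        with open(src, "w") as f:
            f.write("x = 1\n")

        mode = py_compile.PycInvalidationMode.CHECKED_HASH
        before = compile_header(src, pyc, mode)
        os.utime(src, (10**9, 10**9))  # change only the source's timestamps
        after = compile_header(src, pyc, mode)

        headers_match = before == after
        flags = int.from_bytes(before[4:8], "little")

    print(headers_match, bin(flags))

With a hash-based mode the header depends only on the source contents, so the
two headers are expected to match; with the default timestamp mode the third
word would differ.
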
The ``compileall`` tool will be extended with a new command-line option,
``--invalidation-mode``, to generate hash-based pycs with and without the
``check_source`` bit set. ``--invalidation-mode`` will be a tristate option
taking values ``timestamp`` (the default), ``checked-hash``, and
``unchecked-hash`` corresponding to the values of ``PycInvalidationMode``.

``importlib`` will be extended with a ``source_hash(source)`` function that
computes the hash used by the pyc writing code for a bytestring **source**.

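A hedged sketch of an out-of-band consistency checker built on this function
follows. It assumes the function is reachable as ``importlib.util.source_hash``
(the location used by CPython 3.7 and later; the text above only says
``importlib``). The helper name ``pyc_matches_source`` and the file names are
hypothetical::

    import importlib.util
    import os
    import py_compile
    import tempfile


    def pyc_matches_source(pyc_path, source_path):
        """Return True if a hash-based pyc's stored hash matches its source."""
        with open(pyc_path, "rb") as f:
            header = f.read(16)
        flags = int.from_bytes(header[4:8], "little")
        if len(header) < 16 or not flags & 0b01:
            raise ValueError("not a hash-based pyc")
        with open(source_path, "rb") as f:
            expected = importlib.util.source_hash(f.read())
        return header[8:16] == expected


    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "example.py")
        pyc = os.path.join(tmp, "example.pyc")
        with open(src, "w") as f:
            f.write("x = 1\n")
        py_compile.compile(
            src, cfile=pyc, doraise=True,
            invalidation_mode=py_compile.PycInvalidationMode.UNCHECKED_HASH,
        )
        fresh = pyc_matches_source(pyc, src)
        with open(src, "w") as f:
            f.write("x = 2\n")  # edit the source without recompiling
        stale = pyc_matches_source(pyc, src)

    print(fresh, stale)

Because the hash field is filled in even for unchecked pycs, a tool like this
can detect a stale pyc that the interpreter itself would load without checking.

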
References
==========

.. _reproducible build: https://reproducible-builds.org/
.. _Bazel: https://bazel.build/
.. _BLAKE2: https://blake2.net/
.. _SipHash: https://131002.net/siphash/
.. [#frozensets] http://benno.id.au/blog/2013/01/15/python-determinism
.. [#interning] http://bugzilla.opensuse.org/show_bug.cgi?id=1049186
.. _magic number: https://docs.python.org/3/library/importlib.html#importlib.util.MAGIC_NUMBER


Credits
=======

The author would like to thank Gregory P. Smith, Christian Heimes, and Steve
Dower for useful conversations on the topic of this PEP.


Copyright
=========

This document has been placed in the public domain.



..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End: