PEP: 552
Title: Deterministic pycs
Version: $Revision$
Last-Modified: $Date$
Author: Benjamin Peterson <benjamin@python.org>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 2017-09-04
Python-Version: 3.7
Abstract
========
This PEP proposes an extension to the pyc format to make it more
deterministic.
Rationale
=========
A `reproducible build`_ is one where the same byte-for-byte output is generated
every time the same sources are built, even across different machines (naturally
subject to the requirement that they have rather similar environments
set up). Reproducibility is important for security. It is also a key concept in
content-based build systems such as Bazel_, which are most effective when the
output files' contents are a deterministic function of the input files'
contents.
The current Python pyc format is the marshaled code object of the module
prefixed by a `magic number`_, the source timestamp, and the source file
size. The presence of a source timestamp means that a pyc is not a deterministic
function of the input files' contents; it also depends on volatile metadata,
the mtime of the source. Thus, pycs are a barrier to proper reproducibility.
Distributors of Python code are currently stuck with the options of

1. not distributing pycs and losing the caching advantages

2. distributing pycs and losing reproducibility

3. carefully giving all Python source files a deterministic timestamp
   (see https://github.com/python/cpython/pull/296)

4. doing a complicated mixture of 1. and 2. like generating pycs at installation
   time

None of these options are very attractive. This PEP proposes allowing the
timestamp to be replaced with a deterministic hash. The current timestamp
invalidation method will remain the default, though. Despite its nondeterminism,
timestamp invalidation works well for many workflows and use cases. The
hash-based pyc format can impose the cost of reading and hashing every source
file, which is more expensive than simply checking timestamps. Thus, for now, we
expect it to be used mainly by distributors and power use cases.
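
The difference between the two invalidation schemes can be observed with a
small experiment. The sketch below assumes CPython 3.7 or later, where
``py_compile.compile`` accepts ``invalidation_mode``; the helper
``header_after_compile`` and the file names are invented for this example. It
compiles the same source twice with different mtimes and compares only the
16-byte pyc header, since the marshaled body is a separate concern.

```python
import os
import py_compile
import tempfile


def header_after_compile(src, pyc, mode):
    """Compile src to pyc and return the 16-byte pyc header (illustrative helper)."""
    py_compile.compile(src, cfile=pyc, doraise=True, invalidation_mode=mode)
    with open(pyc, "rb") as f:
        return f.read(16)


headers_match = {}
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "example.py")
    pyc = os.path.join(tmp, "example.pyc")
    with open(src, "w") as f:
        f.write("ANSWER = 42\n")

    for mode in (py_compile.PycInvalidationMode.TIMESTAMP,
                 py_compile.PycInvalidationMode.CHECKED_HASH):
        # Same contents, two different modification times.
        os.utime(src, (1_000_000_000, 1_000_000_000))
        first = header_after_compile(src, pyc, mode)
        os.utime(src, (1_500_000_000, 1_500_000_000))
        second = header_after_compile(src, pyc, mode)
        headers_match[mode.name] = first == second

print(headers_match)
```

The timestamp-based headers are expected to differ because the mtime is
embedded in them, while the hash-based headers depend only on the source
contents.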
(Note there are other problems [#frozensets]_ [#interning]_ we do not
address here that can make pycs non-deterministic.)
Specification
=============
The pyc header currently consists of three 32-bit words. We will expand it to
four. The first word will continue to be the magic number, versioning the
bytecode and pyc format. The second word, conceptually the new word, will be a
bit field. The interpretation of the rest of the header and the invalidation
behavior of the pyc depend on the contents of the bit field.
If the bit field is 0, the pyc is a traditional timestamp-based pyc. I.e., the
third and fourth words will be the timestamp and file size respectively, and
invalidation will be done by comparing the metadata of the source file with that
in the header.
If the lowest bit of the bit field is set, the pyc is a hash-based pyc. We call
the second lowest bit the ``check_source`` flag. Following the bit field is a
64-bit hash of the source file. We will use SipHash_ with a hardcoded key to
hash the contents of the source file. Another fast hash like MD5 or BLAKE2_ would
also work. We choose SipHash because Python already has a builtin implementation
of it from :pep:`456`, although an interface that allows picking the SipHash key
must be exposed to Python. Security of the hash is not a concern, though we pass
over red-flag hashes like MD5 to ease auditing of Python in controlled
environments.
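
As a rough illustration of the layout described above, the following sketch
decodes a 16-byte header. The function name ``parse_pyc_header``, the
dictionary keys, and the constant names are hypothetical and exist only for
this example. Little-endian words are an assumption here, chosen to match how
CPython writes the existing timestamp and size words. Rejecting other flag
values is also a choice made for the sketch, not something this section
specifies.

```python
import struct

# Flag bits as described in the text: the lowest bit marks a hash-based pyc,
# the second lowest bit is check_source.
HASH_BASED = 0b01
CHECK_SOURCE = 0b10


def parse_pyc_header(header: bytes) -> dict:
    """Decode the proposed 4-word pyc header (hypothetical helper)."""
    if len(header) < 16:
        raise ValueError("header must be at least 16 bytes")
    magic = header[:4]
    (flags,) = struct.unpack("<I", header[4:8])  # assumed little-endian
    if flags == 0:
        # Traditional pyc: third and fourth words are mtime and source size.
        mtime, size = struct.unpack("<II", header[8:16])
        return {"magic": magic, "kind": "timestamp", "mtime": mtime, "size": size}
    if flags & HASH_BASED:
        # Hash-based pyc: the remaining 8 bytes are the 64-bit source hash.
        return {
            "magic": magic,
            "kind": "hash",
            "check_source": bool(flags & CHECK_SOURCE),
            "source_hash": header[8:16],
        }
    raise ValueError(f"unrecognized bit field: {flags:#x}")
```

Keeping the header at a fixed 16 bytes in both cases means a reader can fetch
it in one read and branch on the bit field afterwards.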
When Python encounters a hash-based pyc, its behavior depends on the setting of
the ``check_source`` flag. If the ``check_source`` flag is set, Python will
determine the validity of the pyc by hashing the source file and comparing the
hash with the expected hash in the pyc. If the pyc needs to be regenerated, it
will be regenerated as a hash-based pyc again with the ``check_source`` flag
set.
For hash-based pycs with the ``check_source`` flag unset, Python will simply
load the pyc without checking the hash of the source file. The expectation in
this case is that some external system (e.g., the local Linux distribution's
package manager) is responsible for keeping pycs up to date, so Python itself
doesn't have to check. Even when validation is disabled, the hash field should
be set correctly, so out-of-band consistency checkers can verify the
up-to-dateness of the pyc. Note also that the :pep:`3147` edict that pycs
without corresponding source files not be loaded will still be enforced for
hash-based pycs.
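
The loading rules for hash-based pycs can be summarized in a short sketch. The
function ``hash_pyc_is_usable`` and the constant names are hypothetical and
only mirror the prose above; this is not the import system's actual code. The
sketch uses ``importlib.util.source_hash``, which is where CPython 3.7 and
later expose the hashing function.

```python
import importlib.util
import os

HASH_BASED = 0b01
CHECK_SOURCE = 0b10


def hash_pyc_is_usable(flags: int, stored_hash: bytes, source_path: str) -> bool:
    """Decide whether a hash-based pyc may be loaded (illustrative only)."""
    if not flags & HASH_BASED:
        raise ValueError("not a hash-based pyc")
    # Mirrors the rule mentioned above: no source file, no loading.
    if not os.path.exists(source_path):
        return False
    if not flags & CHECK_SOURCE:
        # Unchecked: trust the external system that maintains the pycs.
        return True
    # Checked: hash the source and compare with the hash stored in the header.
    with open(source_path, "rb") as f:
        return importlib.util.source_hash(f.read()) == stored_hash
```

When the function returns ``False`` for a checked pyc, the text above implies
the pyc would be regenerated as a hash-based pyc with ``check_source`` set.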
The programmatic APIs of ``py_compile`` and ``compileall`` will support
generation of hash-based pycs. Principally, ``py_compile`` will define a new
enumeration corresponding to all the available pyc invalidation modes::

  class PycInvalidationMode(Enum):
      TIMESTAMP
      CHECKED_HASH
      UNCHECKED_HASH

``py_compile.compile``, ``compileall.compile_dir``, and
``compileall.compile_file`` will all gain an ``invalidation_mode`` parameter,
which accepts a value of the ``PycInvalidationMode`` enumeration.
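
A minimal usage sketch follows, assuming CPython 3.7 or later, where this
parameter and enumeration are available. The file names are made up for the
example. It compiles one source file in each mode and reads back the bit field
from the second header word, assuming little-endian byte order.

```python
import os
import py_compile
import tempfile

flags_by_mode = {}
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "example.py")
    with open(src, "w") as f:
        f.write("ANSWER = 42\n")

    for mode in py_compile.PycInvalidationMode:
        pyc = os.path.join(tmp, mode.name.lower() + ".pyc")
        py_compile.compile(src, cfile=pyc, doraise=True, invalidation_mode=mode)
        with open(pyc, "rb") as f:
            header = f.read(16)
        # Second 32-bit word of the header is the bit field.
        flags_by_mode[mode.name] = int.from_bytes(header[4:8], "little")

print(flags_by_mode)
```

With the bit assignments described in the Specification, the expected bit
fields are 0 for ``TIMESTAMP``, 3 for ``CHECKED_HASH`` (both bits set), and 1
for ``UNCHECKED_HASH``.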
The ``compileall`` tool will be extended with a new command-line option,
``--invalidation-mode``, to generate hash-based pycs with and without the
``check_source`` bit set. ``--invalidation-mode`` will be a tristate option
taking values ``timestamp`` (the default), ``checked-hash``, and
``unchecked-hash`` corresponding to the values of ``PycInvalidationMode``.
``importlib`` will be extended with a ``source_hash(source)`` function that
computes the hash used by the pyc writing code for a bytestring **source**.
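
For illustration, the sketch below calls the function as it is exposed in
CPython 3.7 and later, namely ``importlib.util.source_hash``. The exact module
path is taken from that implementation rather than from the wording above, and
the sample source bytes are arbitrary.

```python
import importlib.util

source = b"print('hello')\n"
digest = importlib.util.source_hash(source)

# The digest is 8 bytes (64 bits), matching the hash field in the pyc header.
print(len(digest), digest.hex())

# Hashing the same bytes again yields the same digest.
same = importlib.util.source_hash(source)

# Editing the source is expected to change the digest.
changed = importlib.util.source_hash(source + b"# edited\n")
```

An out-of-band consistency checker could compare such a digest with the hash
field read from an unchecked pyc to detect stale files.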
References
==========
.. _reproducible build: https://reproducible-builds.org/
.. _Bazel: https://bazel.build/
.. _BLAKE2: https://blake2.net/
.. _SipHash: https://131002.net/siphash/
.. [#frozensets] http://benno.id.au/blog/2013/01/15/python-determinism
.. [#interning] http://bugzilla.opensuse.org/show_bug.cgi?id=1049186
.. _magic number: https://docs.python.org/3/library/importlib.html#importlib.util.MAGIC_NUMBER
Credits
=======
The author would like to thank Gregory P. Smith, Christian Heimes, and Steve
Dower for useful conversations on the topic of this PEP.
Copyright
=========
This document has been placed in the public domain.

..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End: