From 05e7f55fc164e362039f70b4917ebd4275b2e962 Mon Sep 17 00:00:00 2001 From: Benjamin Peterson Date: Tue, 5 Sep 2017 11:21:50 -0700 Subject: [PATCH] pep 552: deterministic pycs --- pep-0552.txt | 118 +++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 118 insertions(+) create mode 100644 pep-0552.txt diff --git a/pep-0552.txt b/pep-0552.txt new file mode 100644 index 000000000..edb77d7a2 --- /dev/null +++ b/pep-0552.txt @@ -0,0 +1,118 @@ +PEP: 552 +Title: Deterministic pycs +Version: $Revision$ +Last-Modified: $Date$ +Author: Benjamin Peterson +Status: Draft +Python-Version: 3.7 +Type: Standards Track +Content-Type: text/x-rst +Created: 2017-09-04 + + +Abstract +======== + +This PEP proposes to an extension to the pyc format to make it more +deterministic. + + +Rationale +========= + +A `reproducible build`_ is one where the same byte-for-byte output is generated +every time the same sources are built—even across different machines (naturally +subject to the requirement that they have rather similar environments +setup). Reproducibility is important for security. It is also a key concept in +content-based build systems such as Bazel_, which are most effective when the +output files’ contents are a deterministic function of the input files’ +contents. + +The current Python pyc format is the marshaled code object of the module +prefixed by a `magic number`_, the source timestamp, and the source file +size. The presence of a source timestamp means that a pyc is not a deterministic +function of the input file’s contents—it also depends on the volatile metadata, +mtime, of the source. Thus, the pycs are a barrier to proper reproducibility. + +Distributors of Python code are currently stuck with the options of +1. not distributing pycs and losing the caching advantages +2. distributing pycs and losing reproducibility +3. carefully giving all Python source files a deterministic timestamp (see +https://github.com/python/cpython/pull/296) +4. doing a complicated mixture of 1. and 2. like generating pycs at installation +time +None of these are very attractive. This PEP proposes a better solution. + +(Note there are `other problems`_ we do not address here that can make pycs +non-deterministic.) + + +Specification +============= + +Python will begin to recognize two magic number variants for every pyc +version. One magic number will correspond to the current pyc format and the +other to "hash-based" pycs introduced by this PEP. + +In hash-based pycs, the second field in the pyc header (currently the +"timestamp" field) will contain the SipHash_ of the contents of the source +file. Another a fast hash like MD5 or BLAKE2_ would also work. We choose SipHash +because Python already has a builtin implementation of it from :pep:`456`. The +third field in the pyc header (currently the "source size" field) will become a +bitset of flags. We define the lowest flag in this bitset called +``check_source`` + +When Python encounters a hash-based pyc, its behavior depends on the setting of +the ``check_source`` flag. If the ``check_source`` flag is set, Python will +determine the validity of the pyc by hashing the source file and comparing the +hash with the expected hash in the pyc. If the pyc needs to be regenerated, it +will be regenerated as a hash-based pyc again with the ``check_source`` flag +set. + +For hash-based pycs with the ``check_source`` unset, Python will simply load the +pyc without checking the hash of the source file. The expectation in this case +is that some external system (e.g., the local Linux distribution’s package +manager) is responsible for keeping pycs up to date, so Python itself doesn’t +have to check. Even when validation is disabled, the hash field should be set +correctly, so out-of-band consistency checkers can verify the up-to-dateness of +the pyc. Note also that the :pep:`3147` edict that pycs without corresponding +source files not be loaded will still be enforced for hash-based pycs. + +The ``py_compile`` and ``compileall`` tools will be extended with new options to +generate hash-based pycs with and without the ``check_source`` bit set. Their +programmatic APIs will also gain similar functionality. + + +References +========== + +.. _reproducible build: https://reproducible-builds.org/ +.. _Bazel: https://bazel.build/ +.. _BLAKE2: https://blake2.net/ +.. _SipHash: https://131002.net/siphash/ +.. _other problems: http://benno.id.au/blog/2013/01/15/python-determinism +.. _magic number: https://docs.python.org/3/library/importlib.html#importlib.util.MAGIC_NUMBER + + +Credits +======= + +The author would like to thank Gregory P. Smith for useful conversations on the +topic of this PEP. + + +Copyright +========= + +This document has been placed in the public domain. + + + +.. + Local Variables: + mode: indented-text + indent-tabs-mode: nil + sentence-end-double-space: t + fill-column: 70 + coding: utf-8 + End: