PEP: 552 Title: Deterministic pycs Version: $Revision$ Last-Modified: $Date$ Author: Benjamin Peterson Status: Draft Type: Standards Track Content-Type: text/x-rst Created: 2017-09-04 Python-Version: 3.7 Abstract ======== This PEP proposes to an extension to the pyc format to make it more deterministic. Rationale ========= A `reproducible build`_ is one where the same byte-for-byte output is generated every time the same sources are built—even across different machines (naturally subject to the requirement that they have rather similar environments setup). Reproducibility is important for security. It is also a key concept in content-based build systems such as Bazel_, which are most effective when the output files’ contents are a deterministic function of the input files’ contents. The current Python pyc format is the marshaled code object of the module prefixed by a `magic number`_, the source timestamp, and the source file size. The presence of a source timestamp means that a pyc is not a deterministic function of the input file’s contents—it also depends on the volatile metadata, mtime, of the source. Thus, the pycs are a barrier to proper reproducibility. Distributors of Python code are currently stuck with the options of 1. not distributing pycs and losing the caching advantages 2. distributing pycs and losing reproducibility 3. carefully giving all Python source files a deterministic timestamp (see https://github.com/python/cpython/pull/296) 4. doing a complicated mixture of 1. and 2. like generating pycs at installation time None of these options are very attractive. This PEP proposes a better solution. (Note there are other problems [#frozensets]_ [#interning]_ we do not address here that can make pycs non-deterministic.) Specification ============= Python will begin to recognize two magic number variants for every pyc version. One magic number will correspond to the current pyc format and the other to "hash-based" pycs introduced by this PEP. In hash-based pycs, the second field in the pyc header (currently the "timestamp" field) will contain the SipHash_ of the contents of the source file. Another a fast hash like MD5 or BLAKE2_ would also work. We choose SipHash because Python already has a builtin implementation of it from :pep:`456`. The third field in the pyc header (currently the "source size" field) will become a bitset of flags. We define the lowest flag in this bitset called ``check_source`` When Python encounters a hash-based pyc, its behavior depends on the setting of the ``check_source`` flag. If the ``check_source`` flag is set, Python will determine the validity of the pyc by hashing the source file and comparing the hash with the expected hash in the pyc. If the pyc needs to be regenerated, it will be regenerated as a hash-based pyc again with the ``check_source`` flag set. For hash-based pycs with the ``check_source`` unset, Python will simply load the pyc without checking the hash of the source file. The expectation in this case is that some external system (e.g., the local Linux distribution’s package manager) is responsible for keeping pycs up to date, so Python itself doesn’t have to check. Even when validation is disabled, the hash field should be set correctly, so out-of-band consistency checkers can verify the up-to-dateness of the pyc. Note also that the :pep:`3147` edict that pycs without corresponding source files not be loaded will still be enforced for hash-based pycs. The ``py_compile`` and ``compileall`` tools will be extended with new options to generate hash-based pycs with and without the ``check_source`` bit set. Their programmatic APIs will also gain similar functionality. References ========== .. _reproducible build: https://reproducible-builds.org/ .. _Bazel: https://bazel.build/ .. _BLAKE2: https://blake2.net/ .. _SipHash: https://131002.net/siphash/ .. [#frozensets] http://benno.id.au/blog/2013/01/15/python-determinism .. [#interning] http://bugzilla.opensuse.org/show_bug.cgi?id=1049186 .. _magic number: https://docs.python.org/3/library/importlib.html#importlib.util.MAGIC_NUMBER Credits ======= The author would like to thank Gregory P. Smith for useful conversations on the topic of this PEP. Copyright ========= This document has been placed in the public domain. .. Local Variables: mode: indented-text indent-tabs-mode: nil sentence-end-double-space: t fill-column: 70 coding: utf-8 End: