simplify pyc header format

Benjamin Peterson 2017-09-08 10:34:58 -07:00
parent 2c88e2d062
commit e909d34d27
1 changed files with 19 additions and 12 deletions

@@ -61,19 +61,26 @@ address here that can make pycs non-deterministic.)
Specification
=============
Python will begin to recognize two magic number variants for every pyc
version. One magic number will correspond to the current pyc format and the
other to "hash-based" pycs introduced by this PEP.
The pyc header currently consists of 3 32-bit words. We will expand it to 4. The
first word will continue to be the magic number, versioning the bytecode and pyc
format. The second word, conceptually the new word, will be a bit field. The
interpretation of the rest of the header and invalidation behavior of the pyc
depend on the contents of the bit field.
In hash-based pycs, the second field in the pyc header (currently the
"timestamp" field) will become a bitset of flags. We define the lowest flag in
this bitset as ``check_source``. Following the bitset is a 64-bit hash of
the source file. We will use SipHash_ with a hardcoded key to hash the contents
of the source file. Another fast hash like MD5 or BLAKE2_ would also work. We
choose SipHash because Python already has a builtin implementation of it from
:pep:`456`, although an interface that allows picking the SipHash key must be
exposed to Python. Security of the hash is not a concern, though we pass over
red-flag hashes like MD5 to ease auditing of Python in controlled environments.
If the bit field is 0, the pyc is a traditional timestamp-based pyc. I.e., the
third and fourth words will be the timestamp and file size respectively, and
invalidation will be done by comparing the timestamp and file size of the source
file with those in the header.
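The timestamp-based check described above could be sketched roughly as follows.
This is only an illustration: the function name ``source_matches_header`` is
hypothetical, and truncating the source's mtime and size to 32 bits is an
assumption made so the values fit the 32-bit header words.

```python
import os


def source_matches_header(source_path: str, header_mtime: int, header_size: int) -> bool:
    """Return True if the source file still matches a timestamp-based header.

    Hypothetical helper: header_mtime and header_size stand for the third
    and fourth 32-bit words of the header.
    """
    st = os.stat(source_path)
    # Assumption: both values are stored truncated to 32 bits.
    mtime = int(st.st_mtime) & 0xFFFFFFFF
    size = st.st_size & 0xFFFFFFFF
    return mtime == header_mtime and size == header_size
```

If the function returns False, the pyc would be treated as stale and regenerated from the source.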
If the lowest bit of the bit field is set, the pyc is a hash-based pyc. We call
the second lowest bit the ``check_source`` flag. Following the bit field is a
64-bit hash of the source file. We will use SipHash_ with a hardcoded key to hash
the contents of the source file. Another fast hash like MD5 or BLAKE2_ would
also work. We choose SipHash because Python already has a builtin implementation
of it from :pep:`456`, although an interface that allows picking the SipHash key
must be exposed to Python. Security of the hash is not a concern, though we pass
over red-flag hashes like MD5 to ease auditing of Python in controlled
environments.
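To make the four-word layout concrete, here is a minimal parsing sketch. The
names ``PycHeader`` and ``parse_pyc_header`` are hypothetical, the little-endian
byte order is an assumption carried over from how existing pyc headers are
written, and rejecting a nonzero bit field whose lowest bit is clear is a choice
made for this sketch, since the text does not define that case.

```python
import struct
from typing import NamedTuple, Optional

HASH_BASED = 0b01    # lowest bit: the pyc is hash-based
CHECK_SOURCE = 0b10  # second lowest bit: the check_source flag


class PycHeader(NamedTuple):
    magic: bytes                  # word 1: magic number
    flags: int                    # word 2: bit field
    mtime: Optional[int]          # word 3 (timestamp-based only)
    source_size: Optional[int]    # word 4 (timestamp-based only)
    source_hash: Optional[bytes]  # words 3-4 (hash-based only), 64 bits


def parse_pyc_header(data: bytes) -> PycHeader:
    """Split a 16-byte header into its fields (illustrative only)."""
    if len(data) < 16:
        raise ValueError("pyc header must be at least 16 bytes")
    magic = data[:4]
    # Assumption: words are little-endian unsigned 32-bit integers.
    (flags,) = struct.unpack_from("<I", data, 4)
    if flags == 0:
        mtime, size = struct.unpack_from("<II", data, 8)
        return PycHeader(magic, flags, mtime, size, None)
    if flags & HASH_BASED:
        return PycHeader(magic, flags, None, None, data[8:16])
    # Not covered by the text; this sketch treats it as an error.
    raise ValueError(f"unrecognized bit field: {flags:#x}")
```

A caller would then branch on ``header.flags & CHECK_SOURCE`` to decide whether the source file needs to be consulted at all.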
When Python encounters a hash-based pyc, its behavior depends on the setting of
the ``check_source`` flag. If the ``check_source`` flag is set, Python will