From 0c6f86f0b910bd86b66caa04555d27b4dfc3be14 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Fridol=C3=ADn=20Pokorn=C3=BD?=
Date: Mon, 3 Apr 2023 16:54:21 +0200
Subject: [PATCH] PEP 710: Recording the provenance of installed packages (#3076)

---
 .github/CODEOWNERS |   1 +
 pep-0710.rst       | 619 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 620 insertions(+)
 create mode 100644 pep-0710.rst

diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS
index 4c906f0db..dd5f5a5d4 100644
--- a/.github/CODEOWNERS
+++ b/.github/CODEOWNERS
@@ -590,6 +590,7 @@ pep-0706.rst @encukou
 pep-0707.rst @iritkatriel
 pep-0708.rst @dstufft
 pep-0709.rst @carljm
+pep-0710.rst @dstufft
 # ...
 # pep-0754.txt
 # ...
diff --git a/pep-0710.rst b/pep-0710.rst
new file mode 100644
index 000000000..a63963921
--- /dev/null
+++ b/pep-0710.rst
@@ -0,0 +1,619 @@
+PEP: 710
+Title: Recording the provenance of installed packages
+Author: Fridolín Pokorný
+Sponsor: Donald Stufft
+PEP-Delegate: Paul Moore
+Discussions-To: https://discuss.python.org/t/draft-pep-recording-provenance-of-installed-packages/24838
+Status: Draft
+Type: Standards Track
+Topic: Packaging
+Content-Type: text/x-rst
+Created: 27-Mar-2023
+Post-History: `03-Dec-2021 `__,
+              `30-Jan-2023 `__,
+              `14-Mar-2023 `__,
+
+Abstract
+========
+
+This PEP describes a way to record the provenance of installed Python distributions.
+The record is created by an installer and is available to users in
+the form of a JSON file ``provenance_url.json`` in the ``.dist-info`` directory.
+The mentioned JSON file captures additional metadata to allow recording a URL to a
+:term:`distribution package` together with the installed distribution hash. This
+proposal is built on top of :pep:`610` following
+:ref:`its corresponding canonical PyPA spec ` and
+complements ``direct_url.json`` with ``provenance_url.json`` for when packages
+are identified by a name, and optionally a version.
+
+Motivation
+==========
+
+Installing a Python :term:`Project` involves downloading a :term:`Distribution Package`
+from a :term:`Package Index`
+and extracting its content to an appropriate place. After the installation
+process is done, information about the release artifact used as well as its source
+is generally lost. However, there are use cases for keeping records of
+distributions used for installing packages and their provenance.
+
+Python wheels can be built with different compiler flags or supporting
+different wheel tags. In both cases, users might get into a situation in which
+multiple wheels might be considered by installers (possibly from different
+package indexes) and immediately finding out which wheel file was actually used
+during the installation might be helpful. This way, developers can use
+information about wheels to debug issues, making sure the desired wheel was
+actually installed. Another use case could be tools reporting software
+installed, such as tools reporting an SBOM (Software Bill of Materials), that might
+give more accurate reports. Yet another use case could be reconstruction of the
+Python environment by pinning each installed package to a specific distribution
+artifact consumed from a Python package index.
+
+Rationale
+=========
+
+The motivation described in this PEP is an extension of that in :pep:`610`.
+In addition to recording provenance information for packages installed using a direct URL,
+installers should also do so for packages installed by name
+(and optionally version) from Python package indexes.
+
+The idea described in this PEP originated in a tool called `micropipenv`_
+that is used to install
+:term:`distribution packages ` in containerized
+environments (see the reported issue `thoth-station/micropipenv#206`_).
+Currently, the assembled containerized application does not implicitly carry
+information about the provenance of installed distribution packages
+(unless these are installed from full URLs and recorded via ``direct_url.json``).
+This requires container image suppliers to link
+container images with the corresponding build process, its configuration and
+the application source code for checking requirements files in cases when
+software present in containerized environments needs to be audited.
+
+The `subsequent discussion in the Discourse thread
+`__ also brought up
+pip's new ``--report`` option that can
+`generate a detailed JSON report `__ about
+the installation process. This option could help with the provenance problem
+this PEP addresses. Nevertheless, this option needs to be *explicitly* passed
+to pip to obtain the provenance information, and includes additional metadata that
+might not be necessary for checking the provenance (such as Python version
+requirements of each distribution package). Also, this option is
+specific to pip as of the writing of this PEP.
+
+Note the current :ref:`spec for recording installed packages
+` defines a ``RECORD`` file that
+records installed files, but not the distribution artifact from which these
+files were obtained. Auditing installed artifacts can be performed
+based on matching the entries listed in the ``RECORD`` file. However, this
+technique requires a pre-computed database of files each artifact provides or a
+comparison with the actual artifact content. Both approaches are relatively
+expensive and time-consuming operations that could be eliminated with the
+proposed ``provenance_url.json`` file.
+
+Recording provenance information for installed distribution packages,
+both those obtained from direct URLs and by name/version from an index,
+can simplify auditing Python environments in general, beyond just
+the specific use case for containerized applications mentioned earlier.
+The community project `pip-audit
+`__ expressed possible interest in
+`pypa/pip-audit#170`_.
+
+Specification
+=============
+
+The keywords “MUST”, “MUST NOT”, “REQUIRED”, “SHOULD”,
+“SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL”
+in this document are to be interpreted as described in :rfc:`2119`.
+
+The ``provenance_url.json`` file SHOULD be created in the ``.dist-info``
+directory by installers when installing a :term:`Distribution Package`
+specified by name (and optionally by :term:`Version Specifier`).
+
+This file MUST NOT be created when installing a distribution package from a requirement
+specifying a direct URL reference (including a VCS URL).
+
+Only one of the files ``provenance_url.json`` and ``direct_url.json`` (from :pep:`610`)
+may be present in a given ``.dist-info`` directory; installers MUST NOT add both.
+
+The ``provenance_url.json`` JSON file MUST be a dictionary, compliant with
+:rfc:`8259` and UTF-8 encoded.
+
+If present, it MUST contain exactly two keys. The first one is ``url``, with
+type ``string``. The second key MUST be ``archive_info`` with a value defined
+below.
+
+The value of the ``url`` key MUST be the URL from which the distribution package was downloaded. If a wheel is
+built from a source distribution, the ``url`` value MUST be the URL from which
+the source distribution was downloaded. If a wheel is downloaded and installed directly,
+the ``url`` field MUST be the URL from which the wheel was downloaded.
+As in the :ref:`direct URL origin specification`, the ``url`` value
+MUST be stripped of any sensitive authentication information for security reasons.
+
+The user:password section of the URL MAY however be composed of environment
+variables, matching the following regular expression:
+
+.. code-block:: text
+
+    \$\{[A-Za-z0-9-_]+\}(:\$\{[A-Za-z0-9-_]+\})?
+
+Additionally, the user:password section of the URL MAY be a well-known,
+non-security sensitive string.
+A typical example is ``git`` in the case of a
+URL such as ``ssh://git@gitlab.com``.
+
+The value of ``archive_info`` MUST be a dictionary with a single key
+``hashes``. The value of ``hashes`` is a dictionary mapping hash function names to a
+hex-encoded digest of the file referenced by the ``url`` value. Multiple hashes
+can be included, and it is up to the consumer to decide what to do with
+multiple hashes (it may validate all of them or a subset of them, or nothing at
+all).
+
+Each hash MUST be one of the single-argument hashes provided by
+:data:`py3.11:hashlib.algorithms_guaranteed`, excluding ``sha1`` and ``md5`` which MUST NOT be used.
+As of Python 3.11, with ``shake_128`` and ``shake_256`` excluded
+for being multi-argument, the allowed set of hashes is:
+
+.. code-block:: python
+
+    >>> import hashlib
+    >>> sorted(hashlib.algorithms_guaranteed - {"shake_128", "shake_256", "sha1", "md5"})
+    ['blake2b', 'blake2s', 'sha224', 'sha256', 'sha384', 'sha3_224', 'sha3_256', 'sha3_384', 'sha3_512', 'sha512']
+
+Each hash MUST be referenced by the canonical name of the hash, always lower case.
+
+Hashes ``sha1`` and ``md5`` MUST NOT be present, due to the security
+limitations of these hash algorithms. Conversely, hash ``sha256`` SHOULD
+be included.
+
+Installers that cache distribution packages from an index SHOULD keep
+information related to the cached distribution artifact, so that
+the ``provenance_url.json`` file can be created even when installing distribution packages
+from the installer's cache.
+
+Backwards Compatibility
+=======================
+
+Following the :ref:`packaging:recording-installed-packages` specification,
+installers may keep additional installer-specific files in the ``.dist-info``
+directory.
+To make sure this PEP does not cause any backwards compatibility
+issues, a :ref:`comprehensive survey of installers and libraries <710-tool-survey>`
+was conducted. It found no current tools that use a similarly-named file,
+and no other major feasibility concerns.
+
+The :ref:`Wheel specification ` lists files that can be
+present in the ``.dist-info`` directory. None of these file names collide with
+the proposed ``provenance_url.json`` file from this PEP.
+
+Presence of provenance_url.json in installers and libraries
+-----------------------------------------------------------
+
+A comprehensive survey of the existing installers, libraries, and dependency
+managers in the Python ecosystem analyzed the implications of adding support for
+``provenance_url.json`` to each tool.
+In summary, no major backwards compatibility issues, conflicts or feasibility blockers
+were found as of the time of writing of this PEP. More details about the survey
+can be found in the :ref:`710-tool-survey` section.
+
+Compatibility with direct_url.json
+----------------------------------
+
+This proposal does not make any changes to the ``direct_url.json`` file
+described in :pep:`610` and :ref:`its corresponding canonical PyPA spec
+`.
+
+The content of the ``provenance_url.json`` file was designed to eventually
+allow installers to reuse some of the logic supporting ``direct_url.json`` when a
+direct URL refers to a source archive or a wheel.
+
+The main difference between the ``provenance_url.json`` and ``direct_url.json``
+files is the mandatory keys and their values in the ``provenance_url.json`` file.
+This helps make sure consumers of the ``provenance_url.json`` file can rely
+on its content, if the file is present in the ``.dist-info`` directory.
+
+Security Implications
+=====================
+
+One of the main security features of the ``provenance_url.json`` file is the
+ability to audit installed artifacts in Python environments.
+Tools can check
+which Python package indexes were used to install Python :term:`distribution
+packages ` as well as the hash digests of their release
+artifacts.
+
+As an example, we can take the recent compromised dependency chain in `the
+PyTorch incident `__.
+The PyTorch index provided a package named ``torchtriton``. An attacker
+published ``torchtriton`` on PyPI, which ran a malicious binary. By checking
+the URL of the installed Python distribution stated in the
+``provenance_url.json`` file, tools can automatically check the source of the
+installed Python distribution. In case of the PyTorch incident, the URL of
+``torchtriton`` should point to the PyTorch index, not PyPI. Tools can help
+identify such malicious Python distributions by checking the
+installed Python distribution URL. A more exact check can also include the hash
+of the installed Python distribution stated in the ``provenance_url.json``
+file. Such checks on hashes can be helpful for mirrored Python package indexes
+where Python distributions are not distinguishable by their source URLs, making
+sure only desired Python package distributions are installed.
+
+A malicious actor can intentionally adjust the content of
+``provenance_url.json`` to possibly hide provenance information of the
+installed Python distribution. A security check which would uncover such
+malicious activity is beyond the scope of this PEP as it would require monitoring
+actions on the filesystem and eventually reviewing user or file permissions.
+
+How to Teach This
+=================
+
+The ``provenance_url.json`` metadata file is intended for tools and is not
+directly visible to end users.
+
+Examples
+========
+
+Examples of a valid provenance_url.json
+---------------------------------------
+
+A valid ``provenance_url.json`` listing multiple hashes:
+
+.. code-block:: json
+
+    {
+      "archive_info": {
+        "hashes": {
+          "blake2s": "fffeaf3d0bd71dc960ca2113af890a2f2198f2466f8cd58ce4b77c1fc54601ff",
+          "sha256": "236bcb61156d76c4b8a05821b988c7b8c35bf0da28a4b614e8d6ab5212c25c6f",
+          "sha3_256": "c856930e0f707266d30e5b48c667a843d45e79bb30473c464e92dfa158285eab",
+          "sha512": "6bad5536c30a0b2d5905318a1592948929fbac9baf3bcf2e7faeaf90f445f82bc2b656d0a89070d8a6a9395761f4793c83187bd640c64b2656a112b5be41f73d"
+        }
+      },
+      "url": "https://files.pythonhosted.org/packages/07/51/2c0959c5adf988c44d9e1e0d940f5b074516ecc87e96b1af25f59de9ba38/pip-23.0.1-py3-none-any.whl"
+    }
+
+A valid ``provenance_url.json`` listing a single hash entry:
+
+.. code-block:: json
+
+    {
+      "archive_info": {
+        "hashes": {
+          "sha256": "236bcb61156d76c4b8a05821b988c7b8c35bf0da28a4b614e8d6ab5212c25c6f"
+        }
+      },
+      "url": "https://files.pythonhosted.org/packages/07/51/2c0959c5adf988c44d9e1e0d940f5b074516ecc87e96b1af25f59de9ba38/pip-23.0.1-py3-none-any.whl"
+    }
+
+A valid ``provenance_url.json`` listing a source distribution which was used to
+build and install a wheel:
+
+.. code-block:: json
+
+    {
+      "archive_info": {
+        "hashes": {
+          "sha256": "8bfe29f17c10e2f2e619de8033a07a224058d96b3bfe2ed61777596f7ffd7fa9"
+        }
+      },
+      "url": "https://files.pythonhosted.org/packages/1d/43/ad8ae671de795ec2eafd86515ef9842ab68455009d864c058d0c3dcf680d/micropipenv-0.0.1.tar.gz"
+    }
+
+Examples of an invalid provenance_url.json
+------------------------------------------
+
+The following example includes a ``hash`` key in the ``archive_info`` dictionary
+as originally designed in :pep:`610` and the data structure documented in
+:ref:`packaging:direct-url`.
+The ``hash`` key MUST NOT be present, to prevent any possible confusion
+with ``hashes`` and to avoid the additional checks that would be required to keep hash
+values in sync.
+
+.. code-block:: json
+
+    {
+      "archive_info": {
+        "hash": "sha256=236bcb61156d76c4b8a05821b988c7b8c35bf0da28a4b614e8d6ab5212c25c6f",
+        "hashes": {
+          "sha256": "236bcb61156d76c4b8a05821b988c7b8c35bf0da28a4b614e8d6ab5212c25c6f"
+        }
+      },
+      "url": "https://files.pythonhosted.org/packages/07/51/2c0959c5adf988c44d9e1e0d940f5b074516ecc87e96b1af25f59de9ba38/pip-23.0.1-py3-none-any.whl"
+    }
+
+Another example demonstrates an invalid hash name. The referenced hash name does not
+correspond to the canonical hash names described in this PEP and
+in the Python docs under :attr:`py3.11:hashlib.hash.name`.
+
+.. code-block:: json
+
+    {
+      "archive_info": {
+        "hashes": {
+          "SHA-256": "236bcb61156d76c4b8a05821b988c7b8c35bf0da28a4b614e8d6ab5212c25c6f"
+        }
+      },
+      "url": "https://files.pythonhosted.org/packages/07/51/2c0959c5adf988c44d9e1e0d940f5b074516ecc87e96b1af25f59de9ba38/pip-23.0.1-py3-none-any.whl"
+    }
+
+
+Example pip commands and their effect on provenance_url.json and direct_url.json
+--------------------------------------------------------------------------------
+
+These commands generate a ``direct_url.json`` file but do not generate a
+``provenance_url.json`` file.
+These examples follow examples from :pep:`610`:
+
+* ``pip install https://example.com/app-1.0.tgz``
+* ``pip install https://example.com/app-1.0.whl``
+* ``pip install "git+https://example.com/repo/app.git#egg=app&subdirectory=setup"``
+* ``pip install ./app``
+* ``pip install file:///home/user/app``
+* ``pip install --editable "git+https://example.com/repo/app.git#egg=app&subdirectory=setup"`` (in which case, ``url`` will be the local directory where the git repository has been cloned to, and ``dir_info`` will be present with ``"editable": true`` and no ``vcs_info`` will be set)
+* ``pip install -e ./app``
+
+Commands that generate a ``provenance_url.json`` file but do not generate
+a ``direct_url.json`` file:
+
+* ``pip install app``
+* ``pip install app~=2.2.0``
+* ``pip install app --no-index --find-links "https://example.com/"``
+
+This behaviour can be tested using changes to pip implemented in the PR
+`pypa/pip#11865`_.
+
+Reference Implementation
+========================
+
+A proof-of-concept for creating the ``provenance_url.json`` metadata file when
+installing a Python :term:`Distribution Package` is available in the PR to pip
+`pypa/pip#11865`_. It reuses the already available implementation for the
+:ref:`direct URL data structure ` to provide
+the ``provenance_url.json`` metadata file for cases when ``direct_url.json`` is not
+created.
+
+A prototype called `pip-preserve `_ was developed to
+demonstrate creation of ``requirements.txt`` files considering ``direct_url.json``
+and ``provenance_url.json`` metadata files. This tool mimics the ``pip
+freeze`` functionality, but the listing of installed packages also includes
+the hashes of the Python distribution artifacts.
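
The idea of a freeze-like listing that also carries artifact hashes can be
sketched with the standard library alone. The following is a minimal
illustration, not the pip-preserve implementation: the function names
``provenance_requirement`` and ``freeze_with_provenance`` are hypothetical, and
the sketch assumes the installer has written ``provenance_url.json`` in the
layout described in the Specification.

```python
import json
from importlib import metadata


def provenance_requirement(dist):
    """Return a hash-pinned requirement line for ``dist``, or None.

    ``dist`` is an ``importlib.metadata.Distribution``. None is returned when
    no ``provenance_url.json`` file was recorded for it.
    """
    raw = dist.read_text("provenance_url.json")
    if raw is None:
        return None
    record = json.loads(raw)
    hashes = record["archive_info"]["hashes"]
    # One --hash=<name>:<digest> option per recorded digest, in a stable order.
    options = " ".join(
        f"--hash={name}:{digest}" for name, digest in sorted(hashes.items())
    )
    return f"{dist.metadata['Name']}=={dist.version} {options}"


def freeze_with_provenance():
    """Yield requirement lines for installed distributions that have a record."""
    for dist in metadata.distributions():
        line = provenance_requirement(dist)
        if line is not None:
            yield line
```

Calling ``print("\n".join(freeze_with_provenance()))`` would list only the
distributions for which an installer recorded provenance. Note that the
recorded digest belongs to the downloaded artifact, which is the source
distribution when a wheel was built locally, and that a given installer may
not accept every hash name this PEP allows.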
+
+Rejected Ideas
+==============
+
+Naming the file direct_url.json instead of provenance_url.json
+--------------------------------------------------------------
+
+To preserve backwards compatibility with the
+:ref:`Direct URL Origin specification `,
+the file cannot be named ``direct_url.json``, as per the text of that specification:
+
+    This file MUST NOT be created when installing a distribution from an other
+    type of requirement (i.e. name plus version specifier).
+
+Such a change might introduce backwards compatibility issues for consumers of
+``direct_url.json`` who rely on its presence only when distributions are
+installed using a direct URL reference.
+
+Deprecating direct_url.json and using only provenance_url.json
+--------------------------------------------------------------
+
+File ``direct_url.json`` is already well established with :pep:`610` being accepted and is
+already used by installers. For example, ``pip`` uses ``direct_url.json`` to
+report a direct URL reference on ``pip freeze``. Deprecating
+``direct_url.json`` would require additional changes to the ``pip freeze``
+implementation in pip (see PR `fridex/pip#2`_) and could introduce backwards compatibility
+issues for already existing ``direct_url.json`` consumers.
+
+Keeping the hash key in the archive_info dictionary
+---------------------------------------------------
+
+:pep:`610` and :ref:`its corresponding canonical PyPA spec ` discuss
+the possibility to include the ``hash`` key alongside the ``hashes`` key in the
+``archive_info`` dictionary. This PEP explicitly does not include the ``hash`` key in
+the ``provenance_url.json`` file and allows only the ``hashes`` key to be present.
+By doing so we eliminate possible redundancy in the file, possible confusion,
+and any additional checks that would need to be done to make sure the hashes are in
+sync.
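
How a consumer might enforce the structural rules of the Specification,
including the absence of a ``hash`` key, can be sketched as a small validator.
This is an illustrative sketch only: ``validate_provenance`` and
``ALLOWED_HASHES`` are hypothetical names, the allowed hash names are copied
from the list in the Specification, and accepting both upper- and lower-case
hex digits is an assumption made here rather than a rule of this PEP.

```python
import re

# Hash names permitted by the Specification (as of Python 3.11).
ALLOWED_HASHES = frozenset({
    "blake2b", "blake2s", "sha224", "sha256", "sha384",
    "sha3_224", "sha3_256", "sha3_384", "sha3_512", "sha512",
})
_HEX = re.compile(r"[0-9a-fA-F]+")


def validate_provenance(record):
    """Return a list of problems; an empty list means no problem was found."""
    if not isinstance(record, dict):
        return ["top-level value must be a dictionary"]
    problems = []
    if set(record) != {"url", "archive_info"}:
        problems.append("top-level keys must be exactly 'url' and 'archive_info'")
    if not isinstance(record.get("url"), str):
        problems.append("'url' must be a string")
    archive_info = record.get("archive_info")
    # A legacy 'hash' key next to 'hashes' is rejected by this check.
    if not isinstance(archive_info, dict) or set(archive_info) != {"hashes"}:
        problems.append("'archive_info' must be a dictionary with the single key 'hashes'")
        return problems
    hashes = archive_info["hashes"]
    if not isinstance(hashes, dict):
        problems.append("'hashes' must be a dictionary")
        return problems
    for name, digest in hashes.items():
        if name not in ALLOWED_HASHES:
            problems.append(f"hash name {name!r} is not allowed")
        elif not isinstance(digest, str) or not _HEX.fullmatch(digest):
            problems.append(f"digest for {name!r} is not hex-encoded")
    return problems
```

Applied to the documents in the Examples section after parsing them with
``json.loads``, the valid examples would yield an empty list, while the one
containing ``hash`` and the one using ``SHA-256`` would each yield one problem.
The sketch does not check the ``url`` value for embedded credentials.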
+
+Making the hashes key optional
+------------------------------
+
+:pep:`610` and :ref:`its corresponding canonical PyPA spec `
+recommend including the ``hashes`` key of the ``archive_info`` in the
+``direct_url.json`` file but it is not required (per the :rfc:`2119` language):
+
+    A hashes key SHOULD be present as a dictionary mapping a hash name to a hex
+    encoded digest of the file.
+
+This PEP requires the ``hashes`` key be included in ``archive_info``
+in the ``provenance_url.json`` file if that file is created; per this PEP:
+
+    The value of ``archive_info`` MUST be a dictionary with a single key
+    ``hashes``.
+
+By doing so, consumers of ``provenance_url.json`` can check
+artifact digests when the ``provenance_url.json`` file is created by installers.
+
+Open Issues
+===========
+
+Availability of the provenance_url.json file in Conda
+-----------------------------------------------------
+
+We would like to get feedback on the ``provenance_url.json`` file from the Conda
+maintainers. It is not clear whether Conda would like to adopt the
+``provenance_url.json`` file. Conda already stores provenance related
+information (similar to the provenance information proposed in this PEP) in
+JSON files located in the ``conda-meta`` directory `following its actions
+during installation
+`__.
+
+Using provenance_url.json in downstream installers
+--------------------------------------------------
+
+The proposed ``provenance_url.json`` file was meant to be adopted primarily by
+Python installers. Other installers, such as APT or DNF, might record the
+provenance of the installed downstream Python distributions in their own
+way specific to downstream package management. The proposed file is
+not expected to be created by these downstream package installers and thus they
+were intentionally left out of this PEP.
+However, any input by developers or
+maintainers of these installers is valuable to possibly enrich the
+``provenance_url.json`` file with information that would help in some way.
+
+.. _710-tool-survey:
+
+Appendix: Survey of installers and libraries
+============================================
+
+pip
+---
+
+The function from pip's internal API responsible for installing wheels, named
+`_install_wheel
+`__,
+does not store any ``provenance_url.json`` file in the ``.dist-info``
+directory. Additionally, a prototype introducing the mentioned file to pip in
+`pypa/pip#11865`_ demonstrates incorporating logic for handling the
+``provenance_url.json`` file in pip's source code.
+
+As pip is used by some of the tools mentioned below to install Python package
+distributions, findings for pip apply to these tools as well, since pip does not
+allow parametrizing creation of files in the ``.dist-info`` directory in its
+internal API. Most of the tools mentioned below that use pip invoke pip as a
+subprocess which has no effect on the eventual presence of the
+``provenance_url.json`` file in the ``.dist-info`` directory.
+
+distlib
+-------
+
+`distlib`_ implements low-level functionality to manipulate the
+``dist-info`` directory. The database of installed distributions does not use
+any file named ``provenance_url.json``, based on `distlib's source code
+`__.
+
+Pipenv
+------
+
+`Pipenv`_ uses pip `to install Python package distributions
+`__.
+No additional logic was identified that would cause backwards
+compatibility issues when introducing the ``provenance_url.json`` file in the
+``.dist-info`` directory.
+
+installer
+---------
+
+`installer`_ does not create a ``provenance_url.json`` file explicitly.
+Nevertheless, as per the :ref:`Recording Installed Projects `
+specification, installer allows passing the ``additional_metadata`` argument to
+create a file in the ``.dist-info`` directory - see `the source code
+`__.
+To avoid any backwards compatibility issues, any library or tool using
+installer must not request creating the ``provenance_url.json`` file using the
+mentioned ``additional_metadata`` argument.
+
+Poetry
+------
+
+The installation logic in `Poetry`_ depends on the
+``installer.modern-installer`` configuration option (`see docs
+`__).
+
+For cases when the ``installer.modern-installer`` configuration option is set
+to ``false``, Poetry uses `pip for installing Python package distributions
+`__.
+
+On the other hand, when the ``installer.modern-installer`` configuration option is
+set to ``true``, Poetry uses `installer to install Python package distributions
+`__.
+As can be seen from the linked sources, no additional
+metadata file named ``provenance_url.json`` is passed that would cause compatibility
+issues with this PEP.
+
+Conda
+-----
+
+`Conda`_ does not create any ``provenance_url.json`` file
+`when Python package distributions are installed
+`__.
+
+Hatch
+-----
+
+`Hatch`_ uses pip `to install project dependencies
+`__.
+
+micropipenv
+-----------
+
+As `micropipenv`_ is a wrapper on top of pip, it uses
+pip to install Python distributions, for both `lock files
+`__
+as well as `for requirements files
+`__.
+
+Thamos
+------
+
+`Thamos`_ uses micropipenv `to install Python package
+distributions
+`__,
+hence any findings for micropipenv apply to Thamos.
+
+PDM
+---
+
+`PDM`_ uses installer `to install binary distributions
+`__.
+The only additional metadata file it eventually creates in the ``.dist-info``
+directory is `the REFER_TO file
+`__.
+
+References
+==========
+
+.. _pypa/pip#11865: https://github.com/pypa/pip/pull/11865
+
+.. _fridex/pip#2: https://github.com/fridex/pip/pull/2/
+
+.. _pip_preserve: https://pypi.org/project/pip-preserve/
+
+.. _thoth-station/micropipenv#206: https://github.com/thoth-station/micropipenv/issues/206
+
+.. _pypa/pip-audit#170: https://github.com/pypa/pip-audit/issues/170
+
+.. _pip_installation_report: https://pip.pypa.io/en/stable/reference/installation-report/
+
+.. _distlib: https://distlib.readthedocs.io/
+
+.. _Pipenv: https://pipenv.pypa.io/
+
+.. _installer: https://github.com/pypa/installer
+
+.. _Poetry: https://python-poetry.org/
+
+.. _Conda: https://docs.conda.io/
+
+.. _Hatch: https://hatch.pypa.io/
+
+.. _micropipenv: https://github.com/thoth-station/micropipenv
+
+.. _Thamos: https://github.com/thoth-station/thamos/
+
+.. _PDM: https://pdm.fming.dev/
+
+Acknowledgements
+================
+
+Thanks to Dustin Ingram, Brett Cannon, and Paul Moore for the initial discussion in
+which this idea originated.
+
+Thanks to Donald Stufft, Ofek Lev, and Trishank Kuppusamy for early feedback
+and support to work on this PEP.
+
+Thanks to Gregory P. Smith, Stéphane Bidoul, and C.A.M. Gerlach for
+reviewing this PEP and providing valuable suggestions.
+
+Thanks to Stéphane Bidoul and Chris Jerdonek for :pep:`610`.
+
+Last, but not least, thanks to Donald Stufft for sponsoring this PEP.
+
+Copyright
+=========
+
+This document is placed in the public domain or under the CC0-1.0-Universal
+license, whichever is more permissive.