PEP: 458
Title: Secure PyPI downloads with signed repository metadata
Version: $Revision$
Last-Modified: $Date$
Author: Trishank Karthik Kuppusamy <karthik@trishank.com>,
        Vladimir Diaz <vladimir.diaz@nyu.edu>,
        Marina Moore <mm9693@nyu.edu>,
        Lukas Puehringer <lukas.puehringer@nyu.edu>,
        Joshua Lock <jlock@vmware.com>,
        Lois Anne DeLong <lad278@nyu.edu>,
        Justin Cappos <jcappos@nyu.edu>
Sponsor: Alyssa Coghlan <ncoghlan@gmail.com>
BDFL-Delegate: Donald Stufft <donald@stufft.io>
Discussions-To: https://discuss.python.org/t/pep-458-secure-pypi-downloads-with-package-signing/2648
Status: Accepted
Type: Standards Track
Topic: Packaging
Content-Type: text/x-rst
Created: 27-Sep-2013
Post-History: 06-Jan-2019, 13-Nov-2019
Resolution: https://discuss.python.org/t/pep-458-secure-pypi-downloads-with-package-signing/2648/115


Abstract
========

This PEP describes changes to the PyPI infrastructure that are needed to ensure
that users get valid packages from PyPI. These changes should have minimal
impact on other parts of the ecosystem. The PEP focuses on communication between
PyPI and users, and so does not require any action by package developers.
Developers will upload packages using the current process, and PyPI will
automatically generate signed repository metadata for these packages.

In order for the security mechanism to be
effective, additional work will need to be done by PyPI consumers (like pip) to
verify the signatures and metadata provided by PyPI. This verification can be
transparent to users (unless it fails) and provides an automatic security
mechanism. There is documentation for how to consume TUF metadata in the TUF
repository. However, changes to PyPI consumers are not a pre-requisite for
publishing the metadata from PyPI, and can be done
according to the timelines and priorities of individual projects.


Proposed TUF Integration
========================

This PEP proposes how The Update Framework [2]_ (TUF__) should be integrated with the
Python Package Index (PyPI [1]_). TUF was designed to be a flexible
security add-on to a software updater or package manager.  A full implementation
of the framework
integrates best security practices, such as separating role responsibilities,
adopting the many-man rule for signing packages, keeping signing keys offline,
and revocation of expired or compromised signing keys.  As a result, attackers
would need to steal multiple signing keys, which are stored independently,
in order to compromise the role responsible for specifying a repository's available
files.  Or, alternatively, a role
responsible for indicating the latest snapshot of the repository may also have to be
compromised.

__ https://www.linuxfoundation.org/cloud-containers-virtualization/2017/10/cncf-host-two-security-projects-notary-tuf-specification/

The initial integration proposed in this PEP will allow modern package managers,
such as pip [3]_, to be more secure against attacks on PyPI mirrors and PyPI's
own content distribution network, and to better protect users from such attacks.
Specifically, this PEP describes how PyPI processes
should be adapted to generate and incorporate TUF metadata (i.e., the minimum
security model).  This minimum security model supports verification of PyPI
distributions that are signed with keys stored on PyPI. Distributions that are
uploaded by developers are signed by PyPI, requiring no action from developers
(other than uploading the distribution), and are immediately available for
download.  The minimum security model also minimizes PyPI administrative
responsibilities by automating much of the signing process.

There is no discussion in *this* PEP of support for project distributions that
are signed by developers (maximum security model). This possible future extension
is covered in detail in :pep:`480`.  The maximum security model requires more PyPI
administrative work (though no added work for clients), and also proposes
an easy-to-use key management solution for developers/publishers, ideas on how
to interface with a potential future build farm on PyPI infrastructure, and the
feasibility of end-to-end signing.

While it does provide implementation recommendations, this PEP does not
prescribe exactly how package managers, such as pip, should be adapted
to install or update projects from PyPI with TUF metadata.   Package managers
interested in adopting TUF on the client side may consult its `library
documentation`__, which was created for this purpose.

__ https://theupdateframework.readthedocs.io/en/stable/api/tuf.ngclient.html

Non-goals
=========

This PEP does not eliminate any existing features from PyPI. In particular, it
does not replace existing support for OpenPGP signatures. Developers can continue
to upload detached OpenPGP signatures along with distributions. In the future,
:pep:`480` may allow developers to directly sign TUF metadata using their OpenPGP keys.


PEP Status
==========

Due to the amount of work required to implement this PEP, in early
2019 it was deferred until appropriate funding could be secured to
implement the PEP. The Python Software Foundation secured this funding
[22]_ and new PEP coauthors restarted PEP discussion__.

__ https://discuss.python.org/t/pep-458-secure-pypi-downloads-with-package-signing/2648/


Motivation
==========

Attacks on software repositories are common, even in organizations with very
good security practices__. The resulting repository compromise allows an
attacker to edit all files stored on the repository and sign these files using
any keys stored on the repository (online keys). In many signing schemes (like
TLS), this access allows the attacker to replace files on the repository and
make it look like these files are coming from PyPI. Without a way to revoke and
replace the trusted private key, it is very challenging to recover from a
repository compromise. In addition to the dangers of repository compromise,
software repositories are vulnerable to an attacker on the network (MITM)
intercepting and changing files. These and other attacks on software
repositories are detailed here__.

This PEP, together with the follow-up proposal in :pep:`480`, aims to protect users
of PyPI from compromises of the integrity, consistency, and freshness properties
of PyPI packages, and enhances compromise resilience by mitigating key risk and
providing mechanisms to recover from a compromise of PyPI or its signing keys.

__ https://github.com/theupdateframework/pip/wiki/Attacks-on-software-repositories
__ https://theupdateframework.github.io/security.html

On January 5, 2013, the Python Software Foundation (PSF) announced [4]_ that a security
breach had occurred on the python.org wikis for Python and Jython. As a result, all of
the wiki data was destroyed. Fortunately, the PyPI infrastructure was not affected by
this breach. However, the incident is a reminder that PyPI needs to take defensive
steps to protect users as much as possible in the event of a compromise.  Attacks on
software repositories happen all the time [5]_.  The PSF must accept the
possibility of security breaches and prepare PyPI accordingly because it is a
valuable resource used by thousands, if not millions, of people.

Before the wiki attack, PyPI used MD5 hashes to tell package managers, such as
pip, whether or not a distribution file was corrupted in transit.  However, the absence
of SSL made it hard for package managers to verify transport integrity to PyPI.
It was therefore easy to launch a man-in-the-middle attack between pip and
PyPI, and arbitrarily change the content of distributions.  As a result, users could be tricked into
installing malicious distributions.  After the wiki
attack, several steps were proposed (some of which were implemented) to deliver
a much higher level of security than was previously the case. These steps included
requiring SSL to
communicate with PyPI [6]_, restricting project names [7]_, and migrating from
MD5 to SHA-2 hashes [8]_.

Though necessary, these steps are insufficient to protect distributions because attacks are still
possible through other avenues.  For example, a public mirror is trusted to
honestly mirror PyPI, but some mirrors may misbehave, whether by accident or through
malicious intervention.
Package managers such as pip are supposed to use signatures from PyPI to verify
distribution files downloaded from a :pep:`public mirror <381>`, but none are known to actually
do so [10]_.  Therefore, it would be wise to add more security measures to
detect attacks from public mirrors or content delivery networks [11]_ (CDNs).

Even though official mirrors have been :pep:`deprecated on PyPI <449>`, a
wide variety of other attack vectors on package managers remain [13]_.  These attacks
can crash client systems, cause obsolete distributions to be installed, or even
allow an attacker to execute arbitrary code.  In `September 2013`__, a post was
made to the Distutils mailing list showing that the latest version of pip (at
the time) was susceptible to such attacks, and how TUF could protect users
against them [14]_.  Specifically, testing was done to see how pip would
respond to these attacks with and without TUF.  Attacks tested included replay
and freeze, arbitrary installation, slow retrieval, and endless data.  The post
also included a demonstration of how pip would respond if PyPI were
compromised.

__ https://mail.python.org/pipermail/distutils-sig/2013-September/022755.html

To provide compromise resilient protection of PyPI, this PEP proposes the use of
The Update Framework [2]_ (TUF). TUF provides protection from a variety of
attacks on software update systems, while also providing mechanisms to recover
from a repository compromise. TUF has been used in production by a number of
organizations, including in the Cloud Native Computing Foundation’s Notary
service, which provides the infrastructure for container image signing in Docker
Registry. The TUF specification has been the subject of three independent
security audits__.

__ https://theupdateframework.github.io/audits.html

The scope of *this* PEP is protecting users from compromises of PyPI mirrors,
and PyPI's own TLS termination and content distribution infrastructure.
Protection from compromises of PyPI itself is discussed in :pep:`480`.


Threat Model
============

The threat model assumes the following:

* Offline keys are safe and securely stored.

* Attackers *cannot* compromise PyPI's trusted keys stored online.

* Attackers can respond to client requests.

An attacker is considered successful if it can cause a client to install (or
leave installed) something other than the most up-to-date version of a
software distribution file. If the attacker is preventing the installation
of updates, they do not want clients to realize there is anything wrong.

This threat model describes the minimum security model. The maximum security
model described in :pep:`480` also assumes that attackers can compromise PyPI's
online keys.


Definitions
===========

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD",
"SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be
interpreted as described in :rfc:`2119`.

This PEP focuses only on integrating TUF into PyPI. However, the reader is
encouraged to review TUF design principles [2]_ and SHOULD be
familiar with the TUF specification [16]_.

The following terms used in this PEP are defined in the Python Packaging
Glossary [17]_: *project*, *release*, *distribution*.

Additional terms used in this PEP are defined as follows:

* Role: TUF specifies one *root* role and multiple other roles to which the
  *root* role delegates responsibilities, directly or indirectly. The term
  *top-level* role refers to the *root* role and any role specified directly by
  the *root* role, i.e. *timestamp*, *snapshot* and *targets* roles. Each role
  has a single metadata file that it is trusted to provide.

* Distribution file: A versioned archive file that contains Python packages,
  modules, and other resource files that are used to distribute a release. The
  terms *distribution file*, *distribution package* [17]_, or simply
  *distribution* or *package* may be used interchangeably in this PEP.

* Simple index: The HTML page that contains internal links to distribution
  files.

* Target files: As a rule of thumb, target files are all files on PyPI whose
  integrity should be guaranteed with TUF. Typically, this includes
  distribution files and PyPI metadata, such as simple indices.

* Metadata: Metadata are signed files that describe roles, other metadata, and
  target files. Unless otherwise specified, metadata means TUF-specific
  metadata.

* Repository: A repository is a source for named metadata and target
  files.  Clients request metadata and target files stored on a repository.

* Consistent snapshot: A set of TUF metadata and target files that capture the
  complete state of all projects on PyPI as they existed at some fixed point in
  time.

* Developer: Either the owner or maintainer of a project who is allowed to
  update the TUF metadata, as well as target files for a project.

* Online key: A private cryptographic key that MUST be stored on the PyPI
  server infrastructure.  This is usually to allow automated signing with the
  key.  However, an attacker who compromises the PyPI infrastructure will be
  able to read these keys.

* Offline key: A private cryptographic key that MUST be stored independent of
  the PyPI server infrastructure.  This prevents automated signing with the
  key.  An attacker who compromises the PyPI infrastructure will not be able to
  immediately read these keys.

* Threshold signature scheme: A role can increase its resilience to key
  compromises by specifying that at least t out of n keys are REQUIRED to sign
  its metadata.  A compromise of t-1 keys is insufficient to compromise the
  role itself.  Saying that a role requires (t, n) keys denotes the threshold
  signature property.
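
The threshold property can be illustrated with a minimal sketch. It assumes
that signature verification has already happened elsewhere and that the
caller passes in the IDs of the keys whose signatures verified; the function
name and key IDs are hypothetical.

```python
def meets_threshold(signing_keys, role_keys, threshold):
    """Return True if at least `threshold` distinct role keys signed.

    `signing_keys` stands in for the key IDs whose signatures verified;
    real signature verification is out of scope for this sketch.
    """
    # Ignore signatures from keys the role does not trust, and count each
    # trusted key at most once.
    return len(set(signing_keys) & set(role_keys)) >= threshold


# Hypothetical (2, 3) role: any two of its three keys must sign.
role_keys = {"key-a", "key-b", "key-c"}

one_key = meets_threshold({"key-a"}, role_keys, threshold=2)            # t-1 keys
two_keys = meets_threshold({"key-a", "key-c"}, role_keys, threshold=2)  # t keys
untrusted = meets_threshold({"key-a", "key-x"}, role_keys, threshold=2)
```

Here a single key (t-1) does not satisfy the role, and a signature from a
key outside the role's key set does not count toward the threshold.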


Overview of TUF
===============

At its highest level, TUF provides applications with a secure method for
knowing about and obtaining new versions of files. On the
surface, this all sounds simple. The basic steps for updating applications are:

* Knowing an update exists.

* Downloading a correct copy of the latest version of the updated file.

The problem is that updating applications is only simple when there are no
malicious activities in the picture. If an attacker is trying to interfere with
these seemingly simple steps, there is plenty they can do.

Assume a software updater takes the approach of most systems (at least the ones
that try to be secure). It downloads both the file it wants and a cryptographic
signature of the file. The software updater already knows which key it trusts
to make the signature. It checks that the signature is correct and was made by
this trusted key. Unfortunately, the software updater is still at risk in many
ways, including the following scenarios:

* An attacker keeps giving the software updater the same update file, so it
  never realizes there is an update.

* An attacker gives the software updater an older, insecure version of a file
  that it already has, so it downloads that one and blindly uses it thinking it
  is newer.

* An attacker gives the software updater a newer version of a file, but
  not the newest one.  The file is newer to the software updater, but it
  may be insecure and exploitable by the attacker.

* An attacker compromises the key used to sign these files and now the software
  updater downloads a malicious file that is properly signed.

TUF is designed to address these attacks, and others, by adding signed metadata
(text files that describe the repository's files) to the repository and
referencing the metadata files during the update procedure.  Repository files
are verified against the information included in the metadata before they are
handed off to the software update system.  The framework also provides
multi-signature trust, explicit and implicit revocation of cryptographic keys,
responsibility separation of the metadata, and minimized key risk.  For a full
list and outline of the repository attacks and software updater weaknesses
addressed by TUF, see Appendix A.
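
As a rough illustration of how signed metadata helps against the first three
scenarios above, the following sketch checks a version number and an
expiration date after signatures have been verified. The function and
parameter names are hypothetical and the checks are simplified compared to
the full TUF client workflow.

```python
from datetime import datetime, timezone


def check_metadata(trusted_version, new_version, expires, now):
    """Reject metadata that is stale or older than what the client trusts."""
    if now >= expires:
        # A replayed, never-changing file eventually expires (freeze attack).
        raise ValueError("metadata has expired")
    if new_version < trusted_version:
        # A version older than the trusted copy indicates a rollback attack.
        raise ValueError("metadata version went backwards")
    return new_version


now = datetime(2019, 11, 7, tzinfo=timezone.utc)
tomorrow = datetime(2019, 11, 8, tzinfo=timezone.utc)
yesterday = datetime(2019, 11, 6, tzinfo=timezone.utc)

accepted = check_metadata(trusted_version=5, new_version=6, expires=tomorrow, now=now)
```

Because the expiration date and version number are covered by the signature,
an attacker without the signing keys cannot alter them to hide a replayed
file.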


Integrating PyPI with TUF
=========================

A software update system must complete two main tasks to integrate with TUF.
First, the repository on the server side MUST be modified to provide signed
TUF metadata. This PEP is concerned with the first part of the integration,
and the changes on PyPI required to support software updates with TUF.

Second, it must add the framework to the client side of the update system. For
example, TUF MAY be integrated with the pip package manager. Thus, new versions
of pip going forward SHOULD use TUF by default to download and verify distributions
from PyPI before installing them. However, there may be unforeseen issues that
might prevent users from installing or updating distributions, including pip itself,
via TUF. Therefore, pip SHOULD provide an option, e.g.,
``--unsafely-disable-package-verification``, in order to work around such issues
until they are resolved. Note, the proposed option name is purposefully long,
because a user must be helped to understand that the action is unsafe and not
generally recommended.

We assume that pip would use TUF to verify distributions downloaded only from PyPI.
pip MAY support TAP 4__ in order to use TUF to also verify distributions downloaded
from :pep:`elsewhere <470>`.

__ https://github.com/theupdateframework/taps/blob/master/tap4.md


What Additional Repository Files are Required on PyPI?
------------------------------------------------------

In order for package managers like pip to download and verify distributions with
TUF, a few extra files MUST be added to PyPI. These extra repository files are
called TUF metadata, and they contain such information as which keys can be trusted,
the `cryptographic hashes`__ of files, signatures, metadata version numbers, and
the date after which the metadata should be considered expired.

__ https://en.wikipedia.org/wiki/Cryptographic_hash_function

When a package manager wants to check for updates, it asks TUF to do the work.
That is, a package manager never has to deal with this additional metadata or
understand what's going on underneath. If TUF reports back that there are
updates available, a package manager can then ask TUF to download these files
from PyPI. TUF downloads them and checks them against the TUF metadata that it
also downloads from the repository. If the downloaded target files are
trustworthy, TUF then hands them over to the package manager.

The `Document formats`__ section of the TUF specification provides information
about each type of required metadata and its expected content.  The next
section covers the different kinds of metadata RECOMMENDED for PyPI.

__ https://theupdateframework.github.io/specification/latest/#document-formats

In addition, all target files SHOULD be available on disk at least twice:
once under their original filename, to provide backwards compatibility, and
once with their SHA-512 hash included in their
filename. This is required to produce `Consistent Snapshots`_.

Depending on the file system used, different data deduplication mechanisms MAY
be employed to avoid the storage increase from hard copies of target files.
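
A minimal sketch of how the hash-named copy of a target file could be derived
is shown here. It assumes the ``<hex digest>.<original filename>`` layout; the
helper name and the example path are hypothetical, and the exact naming used
by PyPI may differ.

```python
import hashlib
from pathlib import PurePosixPath


def hashed_target_path(target_path, content):
    """Return the path of the hash-named copy of a target file."""
    digest = hashlib.sha512(content).hexdigest()
    path = PurePosixPath(target_path)
    # Assumed layout: "<hex digest>.<original filename>" in the same directory.
    return str(path.with_name(f"{digest}.{path.name}"))


original = "packages/example-1.0.tar.gz"  # hypothetical target path
hashed = hashed_target_path(original, b"example distribution bytes")
```

Both names refer to the same bytes, which is why deduplication (for example,
hard links) can avoid storing the content twice.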


PyPI and TUF Metadata
=====================

TUF metadata provides information that clients can use to make update
decisions.  For example, a *targets* metadata file lists the available target files
on PyPI and includes the required signatures, cryptographic hashes, and
file sizes for each.  Different metadata files provide different information and are
signed by separate roles. The *root* role indicates what metadata belongs to
each role. The concept of roles allows TUF to delegate responsibilities
to multiple roles, thus minimizing the impact of any one compromised role.

TUF requires four top-level roles.  These are *root*, *timestamp*, *snapshot*,
and *targets*.  The *root* role specifies the public cryptographic keys of the
top-level roles (including its own).  The *timestamp* role references the
latest *snapshot* and can signify when a new snapshot of the repository is
available.  The *snapshot* role indicates the latest version of all the TUF
metadata files (other than *timestamp*). The *targets* role lists the file
paths of available target files together with their cryptographic hashes.
The file paths must be specified relative to a base URL. This allows the
actual target files to be served from anywhere, as long as the base URL
can be accessed by the client. Each top-level role will serve its
responsibilities without exception.  Table 1 provides an overview of the
roles used in TUF.

+-----------------------------------------------------------------------------+
|                          Roles and Responsibilities                         |
+-----------------+-----------------------------------------------------------+
|       root      | The root role is the locus of trust for the entire        |
|                 | repository. The root role signs the root.json metadata    |
|                 | file. This file indicates which keys are authorized for   |
|                 | each of the top-level roles, including for the root role  |
|                 | itself. The roles "root", "snapshot", "timestamp" and     |
|                 | "targets" must be specified and each has a list of public |
|                 | keys.                                                     |
+-----------------+-----------------------------------------------------------+
|     targets     | The targets role is responsible for indicating which      |
|                 | target files are available from the repository. More      |
|                 | precisely, it shares the responsibility of providing      |
|                 | information about the content of updates. The targets     |
|                 | role signs targets.json metadata, and can delegate trust  |
|                 | for repository files to other roles (delegated roles).    |
+-----------------+-----------------------------------------------------------+
| delegated roles | If the top-level targets role performs delegation, the    |
|                 | resulting delegated roles can then provide their own      |
|                 | metadata files. The format of the metadata files provided |
|                 | by delegated targets roles is the same as that of         |
|                 | targets.json. As with targets.json, the latest version of |
|                 | metadata files belonging to delegated roles are described |
|                 | in the snapshot role's metadata.                          |
+-----------------+-----------------------------------------------------------+
|     snapshot    | The snapshot role is responsible for ensuring that        |
|                 | clients see a consistent repository state. It provides    |
|                 | repository state information by indicating the latest     |
|                 | versions of the top-level targets and delegated targets   |
|                 | metadata files on the repository in snapshot.json. root   |
|                 | and timestamp are not listed in snapshot.json, because    |
|                 | timestamp signs for its freshness, after snapshot.json    |
|                 | has been created, and root, which has all top-level keys, |
|                 | is required ahead of time to trust any of the top-level   |
|                 | roles.                                                    |
+-----------------+-----------------------------------------------------------+
|    timestamp    | The timestamp role is responsible for providing           |
|                 | information about the timeliness of available updates.    |
|                 | Timeliness information is made available by frequently    |
|                 | signing a new timestamp.json file that has a short        |
|                 | expiration time. This file indicates the latest version   |
|                 | of snapshot.json.                                         |
+-----------------+-----------------------------------------------------------+

Table 1: An overview of the TUF roles.
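
To make the *root* row of Table 1 more concrete, the following sketch shows
an abbreviated *root* metadata structure as a Python dictionary, loosely
following the layout in the TUF specification. All key IDs, key values,
thresholds, and dates are placeholders chosen for illustration, not PyPI's
actual values.

```python
# Abbreviated, hypothetical root metadata. Key IDs, thresholds, and the
# expiration date are placeholders, not PyPI's real values.
root_metadata = {
    "signed": {
        "_type": "root",
        "version": 1,
        "expires": "2020-11-07T00:00:00Z",
        "consistent_snapshot": True,
        "keys": {
            "keyid-root-1": {"keytype": "ed25519", "keyval": {"public": "..."}},
            "keyid-root-2": {"keytype": "ed25519", "keyval": {"public": "..."}},
            "keyid-targets": {"keytype": "ed25519", "keyval": {"public": "..."}},
            "keyid-snapshot": {"keytype": "ed25519", "keyval": {"public": "..."}},
            "keyid-timestamp": {"keytype": "ed25519", "keyval": {"public": "..."}},
        },
        "roles": {
            "root": {"keyids": ["keyid-root-1", "keyid-root-2"], "threshold": 2},
            "targets": {"keyids": ["keyid-targets"], "threshold": 1},
            "snapshot": {"keyids": ["keyid-snapshot"], "threshold": 1},
            "timestamp": {"keyids": ["keyid-timestamp"], "threshold": 1},
        },
    },
    "signatures": [],  # omitted in this sketch
}


def keys_for_role(root, role):
    """Return the key IDs and threshold that root authorizes for a role."""
    info = root["signed"]["roles"][role]
    return info["keyids"], info["threshold"]


timestamp_keys = keys_for_role(root_metadata, "timestamp")
```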

Unless otherwise specified, this PEP RECOMMENDS that every metadata or
target file be hashed using the SHA2-512 function of
the `SHA-2`__ family. SHA-2 has native and well-tested Python 2 and 3
support (allowing for verification of these hashes without additional,
non-Python dependencies). If stronger security guarantees are
required, then both SHA2-256 and SHA2-512 or both SHA2-256 and `SHA3-256`__
MAY be used instead. SHA2-256 and SHA3-256
are based on very different designs from each other, providing extra protection
against `collision attacks`__. However, SHA-3
requires installing additional, non-Python dependencies for `Python 2`__.

__ https://en.wikipedia.org/wiki/SHA-2
__ https://en.wikipedia.org/wiki/SHA-3
__ https://en.wikipedia.org/wiki/Collision_attack
__ https://pip.pypa.io/en/latest/development/release-process/#python-2-support
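
The hashing options above can be sketched with the standard library's
``hashlib`` module on Python 3. The helper names are hypothetical, and the
sketch assumes the expected hashes come from already verified metadata.

```python
import hashlib
import hmac


def compute_hashes(data, algorithms=("sha256", "sha512")):
    """Hash `data` with each named algorithm (names as used by hashlib.new)."""
    return {name: hashlib.new(name, data).hexdigest() for name in algorithms}


def matches(data, expected_hashes):
    """Return True only if every listed hash matches the downloaded data."""
    actual = compute_hashes(data, tuple(expected_hashes))
    return all(
        hmac.compare_digest(actual[name], expected)
        for name, expected in expected_hashes.items()
    )


payload = b"example distribution bytes"
# Pairing SHA2-256 with SHA3-256 combines two differently designed functions.
recorded = compute_hashes(payload, ("sha256", "sha3_256"))
```

Requiring every listed hash to match means that an attacker would need a
simultaneous collision in both functions to substitute a file.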


Signing Metadata and Repository Management
------------------------------------------

The top-level *root* role signs for the keys of the top-level *timestamp*,
*snapshot*, *targets*, and *root* roles.  The *timestamp* role signs for every
new snapshot of the repository metadata.  The *snapshot* role signs for
*targets* and all delegated targets roles. The delegated targets role *bins*
further delegates to the *bin-n* roles, which sign for all distribution files
belonging to registered PyPI projects.

Figure 1 provides an overview of the roles available within PyPI, which
includes the top-level roles and the roles delegated to by *targets*. The figure
also indicates the types of keys used to sign each role, and which roles are
trusted to sign for files available on PyPI.  The next two sections cover the
details of signing repository files and the types of keys used for each role.

.. image:: pep-0458-1.png
   :class: invert-in-dark-mode

Figure 1: An overview of the role metadata available on PyPI.

The roles that change most frequently are *timestamp*, *snapshot* and roles
delegated to by *bins* (i.e., *bin-n*).  The *timestamp* and *snapshot*
metadata MUST be updated whenever *root*, *targets* or delegated metadata are
updated.  Observe, though, that *root* and *targets* metadata are much less
likely to be updated as often as delegated metadata.  Similarly, the *bins* role
will only be updated when a *bin-n* role is added, updated, or removed.  Therefore, *timestamp*,
*snapshot*, and *bin-n* metadata will most likely be updated frequently (possibly every
minute) due to delegated metadata being updated frequently in order to support
continuous delivery of projects.  Continuous delivery is a set of processes
that PyPI uses to produce snapshots that can safely coexist and be deleted
independently of other snapshots [18]_.
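
The update cascade described above can be sketched as follows. The version
numbers and the ``bin-7`` role name are hypothetical, and real metadata would
also need to be re-signed at each step.

```python
def record_upload(versions, bin_role):
    """Bump the metadata versions affected by a new distribution upload."""
    updated = dict(versions)
    # The bin-n role that signs for the new distribution changes first,
    # snapshot must then list the new bin-n version, and timestamp must
    # point at the new snapshot.
    for role in (bin_role, "snapshot", "timestamp"):
        updated[role] = updated.get(role, 0) + 1
    return updated


versions = {
    "root": 1, "targets": 1, "bins": 1,
    "bin-7": 3, "snapshot": 40, "timestamp": 40,
}
after = record_upload(versions, "bin-7")
```

Note that *root*, *targets*, and *bins* keep their versions, which is why
they can be signed with keys that are not needed for day-to-day uploads.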

Every year, PyPI administrators SHOULD sign for *root* and *targets* role keys.
Automation will continuously sign for a timestamped snapshot of all projects. A
repository `Metadata API`__ is available that can be used to `manage a TUF
repository`__.

__ https://theupdateframework.readthedocs.io/en/stable/api/tuf.api.html
__ https://github.com/theupdateframework/python-tuf/blob/v0.20.0/examples/repo_example/basic_repo.py

In standard operation, the *bin-n* metadata will be updated and signed as new
distributions are uploaded to PyPI. However, there will also need to be a
one-time online initialization mechanism to create and sign *bin-n* metadata for
all existing distributions that are part of the PyPI repository every time PyPI
is re-initialized.


How to Establish Initial Trust in the PyPI Root Keys
----------------------------------------------------

Package managers like pip MUST ship the *root* metadata file with the
installation files that users initially download. This includes information
about the keys trusted for all top-level roles (including the root keys themselves).
Package managers must also bundle a TUF client library. Any new version of *root*
metadata that the TUF client library may download is verified against the root keys
initially bundled with the package manager. If a root key is compromised,
but a threshold of keys are still secured, then PyPI administrators MUST push new
*root* metadata that revokes trust in the compromised keys. If a threshold of root
keys are compromised, then the *root* metadata MUST be updated out-of-band.
(However, the threshold of root keys should be chosen so that this event is extremely
unlikely.) Package managers do not necessarily need to be updated immediately if root
keys are revoked or added between new releases of the package manager, as the TUF update
process automatically handles cases where a threshold of previous *root* keys sign
for new *root* keys (assuming no backwards-incompatibility in the TUF specification
used). So, for example, if a package manager was initially shipped with version 1 of
the *root* metadata, and a threshold of *root* keys in version 1 signed version 2 of
the *root* metadata, and a threshold of *root* keys in version 2 signed version 3 of
the *root* metadata, then the package manager should be able to transparently update
its copy of the *root* metadata from version 1 to 3 using its TUF client library.
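
The version 1 to 3 example above can be sketched as a loop over successive
*root* versions. This is a simplified model: the ``signed_by`` sets stand in
for real signature verification, the key names are hypothetical, and the
additional check that each new *root* is signed by a threshold of its own
keys follows the TUF specification rather than the text above.

```python
def verify_root_chain(trusted_root, newer_roots):
    """Walk root metadata versions one at a time and return the latest."""
    for new_root in newer_roots:
        if new_root["version"] != trusted_root["version"] + 1:
            raise ValueError("root versions must be consecutive")
        signers = set(new_root["signed_by"])
        # Each new root must be signed by a threshold of the keys in the
        # currently trusted root, and by a threshold of its own keys.
        for root in (trusted_root, new_root):
            if len(signers & set(root["keys"])) < root["threshold"]:
                raise ValueError(
                    f"root version {new_root['version']} lacks a threshold of signatures"
                )
        trusted_root = new_root
    return trusted_root


# Hypothetical roots; "signed_by" lists the keys whose signatures verified.
root_v1 = {"version": 1, "keys": {"a", "b", "c"}, "threshold": 2, "signed_by": {"a", "b"}}
root_v2 = {"version": 2, "keys": {"b", "c", "d"}, "threshold": 2, "signed_by": {"b", "c"}}
root_v3 = {"version": 3, "keys": {"c", "d", "e"}, "threshold": 2, "signed_by": {"c", "d"}}

latest = verify_root_chain(root_v1, [root_v2, root_v3])
```

Key ``a`` is dropped in version 2 and key ``b`` in version 3, yet a client
that shipped with version 1 can still reach version 3 because each step is
vouched for by the previous one.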

Thus, to repeat, the latest good copy of *root* metadata and a TUF client library MUST
be included in any new version of pip shipped with CPython (via ensurepip). The TUF
client library inside the package manager then loads the *root* metadata and downloads
the rest of the roles, including updating the *root* metadata if it has changed.
An `outline of the update process`__ is available.

__ https://github.com/theupdateframework/specification/blob/master/tuf-spec.md#5-detailed-workflows


Minimum Security Model
----------------------

There are two security models to consider when integrating TUF into PyPI.  The
one proposed in this PEP is the minimum security model, which supports
verification of PyPI distributions signed with private cryptographic
keys stored on PyPI.  Distributions uploaded by developers are signed by PyPI
and immediately available for download.  A possible future extension to this
PEP, discussed in :pep:`480`, proposes the maximum security model and allows
a developer to sign for their project.  Developer keys are not stored online:
therefore, projects are safe from PyPI compromises.

The minimum security model requires no action from a developer and protects
against malicious CDNs [19]_ and public mirrors.  To support continuous
delivery of uploaded distributions, PyPI signs for projects with an online key.
This level of security prevents projects from being accidentally or
deliberately tampered with by a mirror or a CDN because neither will
have any of the keys required to sign for projects.  However, it does not
protect projects from attackers who have compromised PyPI, since they can
then manipulate TUF metadata using the keys stored online.

This PEP proposes that the *bin-n* roles sign for all PyPI projects with online
keys. These *bin-n* roles MUST all be delegated by the upper-level *bins* role,
which is signed with an offline key, and in turn MUST be delegated by the
top-level *targets* role, which is also signed with an offline key.
This means that when a package manager such as pip (i.e., using TUF) downloads
a distribution file from a project on PyPI, it will consult the *targets* role about
the TUF metadata for that distribution file.  If ultimately no *bin-n* roles
delegated by *targets* via *bins* specify the distribution file, then it is
considered to be non-existent on PyPI.
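
The lookup through *targets*, *bins*, and *bin-n* can be sketched as a walk
over a delegation graph. The role names ``bin-0`` and ``bin-1`` and the file
names are hypothetical, and signature checks are omitted.

```python
# Hypothetical delegation graph and per-role target listings.
DELEGATIONS = {"targets": ["bins"], "bins": ["bin-0", "bin-1"]}
SIGNED_TARGETS = {
    "bin-0": {"alpha-2.1-py3-none-any.whl"},
    "bin-1": {"example-1.0.tar.gz"},
}


def find_signing_role(filename, role="targets"):
    """Return the role that signs for `filename`, or None if no role does."""
    if filename in SIGNED_TARGETS.get(role, set()):
        return role
    for delegated in DELEGATIONS.get(role, []):
        found = find_signing_role(filename, delegated)
        if found is not None:
            return found
    # No role in this branch lists the file: treat it as non-existent.
    return None


signer = find_signing_role("example-1.0.tar.gz")
missing = find_signing_role("not-on-pypi-0.1.tar.gz")
```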

Note, the reason why *targets* does not directly delegate to *bin-n*, but
instead uses the intermediary *bins* role, is so that other delegations can
easily be added or removed, without affecting the *bins*-to-*bin-n* mapping.
This is crucial for the implementation of :pep:`480`.
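
The delegation chain from *targets* via *bins* to the *bin-n* roles can be
sketched as follows.  This is only an illustration of the lookup order, not
the TUF client workflow: the ``METADATA`` layout and the ``find_target``
helper are hypothetical, and signature, threshold, and expiry checks are
omitted.

.. code-block:: python

    from typing import Optional

    # Hypothetical, heavily simplified view of already-verified metadata.
    METADATA = {
        "targets": {"delegations": ["bins"], "targets": {}},
        "bins": {"delegations": ["bin-0", "bin-1"], "targets": {}},
        "bin-0": {
            "delegations": [],
            "targets": {"simple/foo/index.html": {"length": 512}},
        },
        "bin-1": {
            "delegations": [],
            "targets": {"packages/foo-1.0.tar.gz": {"length": 2048}},
        },
    }


    def find_target(path: str, role: str = "targets") -> Optional[dict]:
        """Search from *targets* via *bins* down to the *bin-n* roles."""
        meta = METADATA[role]
        if path in meta["targets"]:
            return meta["targets"][path]
        for delegated_role in meta["delegations"]:
            found = find_target(path, delegated_role)
            if found is not None:
                return found
        # No role in the delegation chain lists the file, so the client
        # treats it as non-existent on PyPI.
        return None

A path listed by no *bin-n* role yields ``None``, which corresponds to the
"non-existent on PyPI" case described above.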


Metadata Expiry Times
---------------------

The metadata for the *root*, *targets*, and *bins* roles SHOULD each expire in
one year, because these metadata files are expected to change very rarely.

The *timestamp*, *snapshot*, and *bin-n* metadata SHOULD each expire in one day
because a CDN or mirror SHOULD synchronize itself with PyPI every day.
Furthermore, this generous time frame also takes into account client clocks
that are highly skewed or adrift.
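
As a rough sketch, the recommended lifetimes could be expressed as a small
lookup table.  The names ``LIFETIMES``, ``expiry_for``, and ``is_expired``
are hypothetical, "one year" is assumed to mean 365 days, and *bin-n* roles
are assumed to be named like ``bin-0``.

.. code-block:: python

    from datetime import datetime, timedelta, timezone

    # Lifetimes recommended in this section (one year assumed to be 365 days).
    LIFETIMES = {
        "root": timedelta(days=365),
        "targets": timedelta(days=365),
        "bins": timedelta(days=365),
        "timestamp": timedelta(days=1),
        "snapshot": timedelta(days=1),
        "bin-n": timedelta(days=1),
    }


    def expiry_for(role: str, signed_at: datetime) -> datetime:
        """Return the expiry time for metadata of `role` signed at `signed_at`."""
        # Assumption: individual bin-n roles are named "bin-0", "bin-1", ...
        key = "bin-n" if role.startswith("bin-") else role
        return signed_at + LIFETIMES[key]


    def is_expired(expires: datetime, now: datetime) -> bool:
        """A client would reject metadata once its expiry time has passed."""
        return now >= expires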


Metadata Scalability
--------------------

As the number of projects and distributions on a repository grows, TUF metadata will need to
grow correspondingly.  For example, consider the *bins* role.  In August 2013,
it was found that the size of the *bins* metadata was about 42MB if the *bins*
role itself signed for about 220K PyPI targets (which are simple indices and
distributions).  This PEP does not delve into the details, but TUF features a
so-called `"hashed bin delegation"`__ scheme that splits a large targets metadata file
into many small ones.  This allows a TUF client updater to intelligently
download only a small number of TUF metadata files in order to update any
project signed for by the *bins* role.  For example, applying this scheme to
the previous repository resulted in pip downloading between 1.3KB and 111KB to
install or upgrade a PyPI project via TUF.

__ https://github.com/theupdateframework/python-tuf/blob/v0.20.0/examples/repo_example/hashed_bin_delegation.py

Based on our findings as of the time this document was updated for
implementation (Nov 7 2019), summarized in Tables 2-3, PyPI SHOULD
split all targets in the *bins* role by delegating them to 16,384
*bin-n* roles (see C10 in Table 2). Each *bin-n* role would sign
for the PyPI targets whose SHA2-512 hashes fall into that bin
(see Figure 1 and `Consistent Snapshots`_). It was found
that this number of bins would result in a 5-9% metadata overhead
(relative to the average size of downloaded distribution files; see V13 and
V15 in Table 3) for returning users, and a 69% overhead for new
users who are installing pip for the first time (see V17 in Table 3).
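
To illustrate how a target could be assigned to one of the 16,384 bins, the
following sketch hashes the target path with SHA2-512 and maps the first four
hexadecimal digits (V1 in Table 3) to a bin.  The helper names are
hypothetical, and the assumption that each bin covers a contiguous range of
prefixes mirrors TUF's ``path_hash_prefixes`` delegations rather than a
layout mandated by this PEP.

.. code-block:: python

    import hashlib

    NUM_BINS = 16_384                            # C10 in Table 2
    PREFIX_LEN = 4                               # V1 in Table 3
    NUM_PREFIXES = 16 ** PREFIX_LEN              # V2 in Table 3 (65,536)
    PREFIXES_PER_BIN = NUM_PREFIXES // NUM_BINS  # 4 prefixes per bin


    def bin_index(target_path: str) -> int:
        """Return the index n of the bin-n role responsible for a target path."""
        digest = hashlib.sha512(target_path.encode("utf-8")).hexdigest()
        prefix = int(digest[:PREFIX_LEN], 16)
        return prefix // PREFIXES_PER_BIN


    def bin_prefixes(index: int) -> list:
        """List the path hash prefixes covered by the bin with this index."""
        start = index * PREFIXES_PER_BIN
        return [
            f"{prefix:0{PREFIX_LEN}x}"
            for prefix in range(start, start + PREFIXES_PER_BIN)
        ]

With these numbers, every bin covers four consecutive prefixes; for example,
``bin_prefixes(0)`` returns ``["0000", "0001", "0002", "0003"]``.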


A few assumptions used in calculating these metadata overhead percentages:

1. We are ignoring root, timestamp, and top-level targets metadata.
2. pip will always be bundled with the latest good copy of metadata for all
   roles.

+------+--------------------------------------------------+-----------+
| Name | Description                                      | Value     |
+------+--------------------------------------------------+-----------+
| C1   | # of bytes in a SHA2-512 hexadecimal digest      | 128       |
+------+--------------------------------------------------+-----------+
| C2   | # of bytes for a SHA2-512 public key ID          | 64        |
+------+--------------------------------------------------+-----------+
| C3   | # of bytes for an Ed25519 signature              | 128       |
+------+--------------------------------------------------+-----------+
| C4   | # of bytes for an Ed25519 public key             | 64        |
+------+--------------------------------------------------+-----------+
| C5   | # of bytes for a target relative file path       | 256       |
+------+--------------------------------------------------+-----------+
| C6   | # of bytes to encode a target file size          | 7         |
+------+--------------------------------------------------+-----------+
| C7   | # of bytes to encode a version number            | 6         |
+------+--------------------------------------------------+-----------+
| C8   | # of targets (simple indices and distributions)  | 2,273,539 |
+------+--------------------------------------------------+-----------+
| C9   | Average # of bytes for a downloaded distribution | 2,184,393 |
+------+--------------------------------------------------+-----------+
| C10  | # of bins                                        | 16,384    |
+------+--------------------------------------------------+-----------+

C8 was computed by querying the number of release files.
C9 was derived by taking the average between a rough estimate of the average
size of release files *downloaded* over the past 31 days (1,628,321 bytes),
and the average size of release files on disk (2,740,465 bytes).
Ee Durbin helped to provide these numbers on November 7, 2019.

Table 2: A list of constants used to calculate metadata overhead.

+------+------------------------------------------------------------------------------------+------------------------------+-----------+
| Name | Description                                                                        | Formula                      | Value     |
+------+------------------------------------------------------------------------------------+------------------------------+-----------+
| V1   | Length of a path hash prefix                                                       | math.ceil(math.log(C10, 16)) | 4         |
+------+------------------------------------------------------------------------------------+------------------------------+-----------+
| V2   | Total # of path hash prefixes                                                      | 16**V1                       | 65,536    |
+------+------------------------------------------------------------------------------------+------------------------------+-----------+
| V3   | Avg # of targets per bin                                                           | math.ceil(C8/C10)            | 139       |
+------+------------------------------------------------------------------------------------+------------------------------+-----------+
| V4   | Avg size of SHA-512 hashes per bin                                                 | V3*C1                        | 17,792    |
+------+------------------------------------------------------------------------------------+------------------------------+-----------+
| V5   | Avg size of target paths per bin                                                   | V3*C5                        | 35,584    |
+------+------------------------------------------------------------------------------------+------------------------------+-----------+
| V6   | Avg size of lengths per bin                                                        | V3*C6                        | 973       |
+------+------------------------------------------------------------------------------------+------------------------------+-----------+
| V7   | Avg size of bin-n metadata (bytes)                                                 | V4+V5+V6                     | 54,349    |
+------+------------------------------------------------------------------------------------+------------------------------+-----------+
| V8   | Total size of public key IDs in bins                                               | C10*C2                       | 1,048,576 |
+------+------------------------------------------------------------------------------------+------------------------------+-----------+
| V9   | Total size of path hash prefixes in bins                                           | V1*V2                        | 262,144   |
+------+------------------------------------------------------------------------------------+------------------------------+-----------+
| V10  | Est. size of bins metadata (bytes)                                                 | V8+V9                        | 1,310,720 |
+------+------------------------------------------------------------------------------------+------------------------------+-----------+
| V11  | Est. size of snapshot metadata (bytes)                                             | C10*C7                       | 98,304    |
+------+------------------------------------------------------------------------------------+------------------------------+-----------+
| V12  | Est. size of metadata overhead per distribution per returning user (same snapshot) | 2*V7                         | 108,698   |
+------+------------------------------------------------------------------------------------+------------------------------+-----------+
| V13  | Est. metadata overhead per distribution per returning user (same snapshot)         | round((V12/C9)*100)          | 5%        |
+------+------------------------------------------------------------------------------------+------------------------------+-----------+
| V14  | Est. size of metadata overhead per distribution per returning user (diff snapshot) | V12+V11                      | 207,002   |
+------+------------------------------------------------------------------------------------+------------------------------+-----------+
| V15  | Est. metadata overhead per distribution per returning user (diff snapshot)         | round((V14/C9)*100)          | 9%        |
+------+------------------------------------------------------------------------------------+------------------------------+-----------+
| V16  | Est. size of metadata overhead per distribution per new user                       | V14+V10                      | 1,517,722 |
+------+------------------------------------------------------------------------------------+------------------------------+-----------+
| V17  | Est. metadata overhead per distribution per new user                               | round((V16/C9)*100)          | 69%       |
+------+------------------------------------------------------------------------------------+------------------------------+-----------+

Table 3: Estimated metadata overheads for new and returning users.
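
The values in Table 3 can be reproduced directly from the constants in
Table 2.  The following script simply evaluates the formulas listed in the
table; the variable names follow the table rows.

.. code-block:: python

    import math

    # Constants from Table 2.
    C1 = 128         # bytes in a SHA2-512 hexadecimal digest
    C2 = 64          # bytes for a public key ID
    C5 = 256         # bytes for a target relative file path
    C6 = 7           # bytes to encode a target file size
    C7 = 6           # bytes to encode a version number
    C8 = 2_273_539   # number of targets
    C9 = 2_184_393   # average bytes for a downloaded distribution
    C10 = 16_384     # number of bins

    # Formulas from Table 3.
    V1 = math.ceil(math.log(C10, 16))
    V2 = 16 ** V1
    V3 = math.ceil(C8 / C10)
    V4, V5, V6 = V3 * C1, V3 * C5, V3 * C6
    V7 = V4 + V5 + V6
    V8 = C10 * C2
    V9 = V1 * V2
    V10 = V8 + V9
    V11 = C10 * C7
    V12 = 2 * V7
    V13 = round(V12 / C9 * 100)
    V14 = V12 + V11
    V15 = round(V14 / C9 * 100)
    V16 = V14 + V10
    V17 = round(V16 / C9 * 100)

    print(f"returning user, same snapshot: {V12:,} bytes ({V13}%)")
    print(f"returning user, diff snapshot: {V14:,} bytes ({V15}%)")
    print(f"new user:                      {V16:,} bytes ({V17}%)")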

The interested reader may find an interactive version of the metadata overhead
calculator here__:

__ https://docs.google.com/spreadsheets/d/11_XkeHrf4GdhMYVqpYWsug6JNz5ZK6HvvmDZX0__K2I/edit?usp=sharing

This number of bins SHOULD increase when the metadata overhead for returning
users exceeds 50%. Presently, this SHOULD happen when the number of targets
increases at least 10x from over 2M to over 22M, at which point the metadata
overhead for returning and new users would be around 50-54% and 114%
respectively, assuming that the number of bins stays fixed. If the number of
bins is increased, then the cost for all users would effectively be the cost
for new users, because their cost would be dominated by the (once-in-a-while)
cost of downloading the large number of delegations in the ``bins`` metadata.
If the cost for new users should prove to be too much, primarily due to the
overhead of downloading the ``bins`` metadata, then this subject SHOULD be
revisited before that happens.
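
The 10x scenario can be checked with a compact version of the Table 3
formulas.  The function name ``overhead_percentages`` is hypothetical, and
the sketch assumes that the average download size (C9) and the per-entry
sizes from Table 2 stay unchanged as the number of targets grows.

.. code-block:: python

    import math


    def overhead_percentages(num_targets, num_bins=16_384,
                             avg_download=2_184_393):
        """Return (same snapshot, diff snapshot, new user) overheads in percent."""
        prefix_len = math.ceil(math.log(num_bins, 16))           # V1
        per_target = 128 + 256 + 7                               # C1 + C5 + C6
        bin_n = math.ceil(num_targets / num_bins) * per_target   # V7
        bins = num_bins * 64 + prefix_len * 16 ** prefix_len     # V10
        snapshot = num_bins * 6                                  # V11
        same = 2 * bin_n                                         # V12
        diff = same + snapshot                                   # V14
        new = diff + bins                                        # V16
        return tuple(
            round(size / avg_download * 100) for size in (same, diff, new)
        )


    print(overhead_percentages(2_273_539))    # (5, 9, 69)
    print(overhead_percentages(22_735_390))   # (50, 54, 114)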

Note that changes to the number of bins on the server are transparent to the
client.  The package manager will be required to download a fresh set of
metadata, as though it were a new user, but this operation will not require any
explicit code logic or user interaction in order to do so.

It is possible to make TUF metadata more compact by representing it in a binary
format, as opposed to the JSON text format.  Nevertheless, a sufficiently large
number of projects and distributions will introduce scalability challenges at
some point, and therefore the *bins* role will still need delegations (as
outlined in Figure 1) in order to address the problem.  The JSON format is an
open and well-known standard for data interchange that is already supported by
the TUF reference implementation, and it is therefore the data format
recommended by this PEP.  However, due to the large number of delegations, compressed
versions of all metadata SHOULD also be made available to clients via the
existing Warehouse mechanisms for HTTP compression. In addition, the JSON
metadata could be compressed before being sent to clients. The TUF reference
implementation does not currently support downloading compressed JSON metadata,
but this could be added to reduce the metadata size.


PyPI and Key Requirements
=========================

In this section, the kinds of keys required to sign for TUF roles on PyPI are
examined.  TUF is agnostic with respect to choices of digital signature
algorithms.  However, this PEP RECOMMENDS that all digital signatures be
produced with the Ed25519 algorithm [15]_.  Ed25519 has native and
well-tested Python support (allowing for verification of signatures without
additional, non-Python dependencies), uses small keys, and is supported
by modern HSM and authentication token hardware.


Number and Type Of Keys Recommended
-----------------------------------

The *root* role key is critical for security and should very rarely be used.
It is primarily used for key revocation, and it is the locus of trust for all
of PyPI.  The *root* role signs for the keys that are authorized for each of
the top-level roles (including its own).  Keys belonging to the *root* role are
intended to be very well-protected and used with the least frequency of all
keys.  It is RECOMMENDED that the PSF board determine the current set of trusted
root key holders, each of whom will own a (strong) root key.
A majority of them can then constitute a quorum to revoke or endow trust in all
top-level keys.  Alternatively, the system administrators of PyPI could be
given responsibility for signing for the *root* role.  Therefore, the *root*
role SHOULD require (t, n) keys, where n is the number of key holders determined
by the PSF board, and t > 1 (so that at least two members must sign the *root*
role).
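
The (t, n) requirement can be illustrated with a minimal threshold check.
The function ``meets_threshold`` and its arguments are hypothetical; the
actual signature check is passed in as a ``verify`` callable (for example, an
Ed25519 verification routine), so the sketch only shows how valid signatures
from distinct trusted keys are counted.

.. code-block:: python

    from typing import Callable, Dict, Iterable, Tuple


    def meets_threshold(
        payload: bytes,
        signatures: Iterable[Tuple[str, bytes]],
        trusted_keys: Dict[str, bytes],
        threshold: int,
        verify: Callable[[bytes, bytes, bytes], bool],
    ) -> bool:
        """Return True if at least `threshold` distinct trusted keys signed."""
        valid_key_ids = set()
        for key_id, signature in signatures:
            public_key = trusted_keys.get(key_id)
            if public_key is None:
                continue  # signature by a key that the role does not trust
            if verify(public_key, payload, signature):
                # A key counts once, no matter how many signatures it made.
                valid_key_ids.add(key_id)
        return len(valid_key_ids) >= threshold

Counting key IDs in a set rather than counting signatures ensures that a
single key holder cannot reach the threshold alone by signing repeatedly.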

The *targets* role will be used only to sign for the static delegation of all
targets to the *bins* role.  Since these target delegations must be secured
against attacks in the event of a compromise, the keys for the *targets* role
MUST be offline and independent of other keys.  For simplicity of key
management, without sacrificing security, it is RECOMMENDED that the keys of
the *targets* role be permanently discarded as soon as they have been created
and used to sign for the role.  Therefore, the *targets* role SHOULD require
(2, 2) keys.  Again, this is because the keys are going to be permanently
discarded, and more offline keys will not help resist key recovery
attacks [20]_ unless the diversity of cryptographic algorithms is maintained.

For similar reasons, the keys for the *bins* role SHOULD be set up similarly to
the keys for the *targets* role.

In order to support continuous delivery, the keys for the *timestamp*,
*snapshot*, and all *bin-n* roles MUST be online. There is little benefit in
requiring all of these roles to use different online keys, since attackers
would presumably be able to compromise all of them if they compromise PyPI.
Therefore, it is reasonable to use one online key for all of them.


Managing online keys
---------------------

The online key shared by the *timestamp*, *snapshot*, and all *bin-n* roles
MAY be stored, encrypted or not, on the Python infrastructure. For example,
the key MAY be kept on a self-hosted key management service (e.g. Hashicorp
Vault__), or a third-party one (e.g. AWS KMS__, Google Cloud KMS__, or Azure
Key Vault__).

__ https://www.vaultproject.io/
__ https://aws.amazon.com/kms/
__ https://cloud.google.com/kms/
__ https://docs.microsoft.com/en-us/azure/key-vault/basic-concepts

Some of these key management services allow keys to be stored on Hardware
Security Modules (HSMs) (e.g., Hashicorp Vault__, AWS CloudHSM__, Google
Cloud HSM__, Azure Key Vault__). This prevents attackers from exfiltrating
the online private key (albeit not from using it, although their actions
may now be cryptographically auditable). However, this requires modifying
the reference TUF implementation to support HSMs (WIP__).

__ https://www.vaultproject.io/docs/enterprise/hsm/index.html
__ https://aws.amazon.com/cloudhsm/
__ https://cloud.google.com/hsm/
__ https://docs.microsoft.com/en-us/azure/key-vault/key-vault-hsm-protected-keys
__ https://github.com/secure-systems-lab/securesystemslib/pull/170

Regardless of where and how this online key is kept, its use SHOULD be
carefully logged, monitored, and audited, ideally in such a manner that
attackers who compromise PyPI are unable to immediately turn off this logging,
monitoring, and auditing.

Managing offline keys
----------------------

As explained in the previous section, the *root*, *targets*, and *bins* role
keys MUST be offline for maximum security. These keys will be offline in the
sense that their private keys MUST NOT be stored on PyPI, though some of them
MAY be online in the private infrastructure of the project.

There SHOULD be an offline key ceremony to generate, backup, and store these
keys in such a manner that the private keys can be read only by the Python
administrators when necessary (e.g., when rotating the keys for the
top-level TUF roles). Thus, keys SHOULD be generated, preferably in a physical
location where side-channel attacks__ are not a concern, using:

1. A trusted, airgapped__ computer with a true random number generator__, and
   with no **data** persisting after the ceremony
2. A trusted operating system
3. A trusted set of third-party packages (such as updated versions of
   cryptographic libraries or the TUF reference implementation, where the
   versions provided by the trusted operating system are not recent enough)

__ https://en.wikipedia.org/wiki/Side-channel_attack
__ https://en.wikipedia.org/wiki/Air_gap_(networking)
__ https://en.wikipedia.org/wiki/Hardware_random_number_generator

In order to avoid the persistence of sensitive data (e.g., private keys) other than
on backup media after the ceremony, offline keys SHOULD be generated
encrypted using strong passwords, on one of the following (in decreasing order
of trust): private HSMs (e.g., YubiHSM__), cloud-based HSMs (e.g., those listed
above), volatile memory (e.g., RAM), or nonvolatile memory
(e.g., SSD or microSD).  If keys must be generated on nonvolatile memory,
then this memory MUST be irrecoverably destroyed after the keys have been
securely backed up.

__ https://www.yubico.com/products/yubihsm/

Passwords used to encrypt keys SHOULD be stored somewhere durable and
trustworthy to which only Python admins have access.

In order to minimize OPSEC__ errors during the ceremony, scripts SHOULD be
written, for execution on the trusted key-generation computer, to automate
tedious steps of the ceremony, such as:

- Exporting to sneakernet__ all code and data (previous TUF metadata and *root*
  keys) required to generate new keys and replace old ones
- Tightening the firewall, updating the entire operating system in order to
  fix security vulnerabilities, and airgapping the computer
- Exporting *all* new TUF metadata and keys to encrypted backup media.
  This backup provides a complete copy of the data required to restore the PyPI
  TUF repository
- Exporting *only* new TUF metadata and online keys to encrypted backup media.
  This backup provides all online data for import into the PyPI infrastructure
  and is useful, e.g., when the online data needs to be restored from a previous
  archived state
- Printing and saving cryptographic hashes of new TUF metadata. This printed copy
  provides an additional offline paper backup, which can be used as a comparison
  in the case of a compromise

__ https://en.wikipedia.org/wiki/Operations_security
__ https://en.wikipedia.org/wiki/Sneakernet
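
As an example of one such script, the last step could be automated roughly as
follows.  This is a sketch under assumptions: the helper names are
hypothetical, the metadata is assumed to be stored as ``*.json`` files in a
single directory, and SHA-256 is chosen here only for illustration.

.. code-block:: python

    import hashlib
    from pathlib import Path


    def metadata_digests(metadata_dir: str) -> dict:
        """Map each metadata filename in the directory to its SHA-256 digest."""
        digests = {}
        for path in sorted(Path(metadata_dir).glob("*.json")):
            digests[path.name] = hashlib.sha256(path.read_bytes()).hexdigest()
        return digests


    def format_for_printing(digests: dict) -> str:
        """Render one 'DIGEST  FILENAME' line per file for the paper backup."""
        return "\n".join(
            f"{digest}  {name}" for name, digest in sorted(digests.items())
        )

The resulting text can be sent to a printer during the ceremony and later
compared against digests recomputed from the online metadata.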

Note that the one-time keys for the *targets* and *bins* roles MAY be safely
generated, used, and deleted during the offline key ceremony. Furthermore,
the *root* keys need not be generated during the offline key ceremony itself.
Instead, a threshold t of n Python administrators, as discussed above, MAY
independently sign the *root* metadata **after** the offline key ceremony used
to generate all other keys.


How Should Metadata be Generated?
=================================

Project developers expect the distributions they upload to PyPI to be
immediately available for download.  Unfortunately, there will be problems when
many readers and writers simultaneously access the same metadata and
target files.  That is, there needs to be a way to ensure consistency of
metadata and target files when multiple developers simultaneously change these
files. There are also issues with consistency on PyPI
without TUF, but the problem is more severe with signed metadata that MUST keep
track of the files available on PyPI in real-time.

Suppose that PyPI generates a *snapshot* that indicates the latest version of
every metadata, except *timestamp*, at version 1 and a client requests this
*snapshot* from PyPI.  While the client is busy downloading this *snapshot*,
PyPI then timestamps a new snapshot at, say, version 2.  Without ensuring
consistency of metadata, the client would find itself with a copy of *snapshot*
that disagrees with what is available on PyPI. The result would be indistinguishable from
arbitrary metadata injected by an attacker.  The problem would also occur with
mirrors attempting to sync with PyPI.

Consistent Snapshots
--------------------

To keep TUF metadata on PyPI consistent with the highly volatile target files,
consistent snapshots SHOULD be used. Each consistent snapshot captures the
state of all known projects at a given time and MAY safely coexist with any
other snapshot, or be deleted independently, without affecting any other
snapshot.

To maintain consistent snapshots, all TUF metadata MUST, when written to disk,
include a version number in their filename:

  VERSION_NUMBER.ROLENAME.json,
    where VERSION_NUMBER is an incrementing integer, and ROLENAME is one of the
    top-level metadata roles -- *root*, *snapshot* or *targets* -- or one of
    the delegated targets roles -- *bins* or *bin-n*.

The only exception is the *timestamp* metadata file, whose version would not be known
in advance when a client performs an update. The *timestamp* metadata
lists the
version of the *snapshot* metadata, which in turn lists the versions of the
*targets* and delegated targets metadata, all as part of a given consistent
snapshot.
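
The naming rule can be sketched as a small helper.  The function name
``metadata_filename`` is hypothetical.

.. code-block:: python

    from typing import Optional


    def metadata_filename(rolename: str, version: Optional[int] = None) -> str:
        """Return the on-disk filename for a role's metadata."""
        if rolename == "timestamp":
            # The only exception: clients cannot know its version in advance.
            return "timestamp.json"
        if version is None:
            raise ValueError(f"a version number is required for {rolename!r}")
        return f"{version}.{rolename}.json"

For example, ``metadata_filename("snapshot", 2)`` returns
``"2.snapshot.json"``, while ``metadata_filename("timestamp")`` always
returns ``"timestamp.json"``.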

In normal usage, version number overflow is unlikely to occur. An 8-byte integer,
for instance, can be incremented once per millisecond and last almost 300 million
years. If an attacker increases the version number arbitrarily, the repository
can recover by revoking the compromised keys and resetting the version number as
described in the TUF specification__.

__ https://github.com/theupdateframework/specification/blame/4b82990afdc6c6d77aa9d43e0632f01bb9e7752c/tuf-spec.md#L1112-L1120
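
The overflow estimate can be checked with a short calculation, assuming a
signed 8-byte integer and a year of 365.25 days.

.. code-block:: python

    MAX_INT64 = 2 ** 63 - 1                            # largest signed 8-byte integer
    MS_PER_YEAR = 1000 * 60 * 60 * 24 * 365.25         # milliseconds in one year

    # One increment per millisecond until the counter is exhausted.
    years = MAX_INT64 / MS_PER_YEAR
    print(f"about {years / 1e6:.0f} million years")    # about 292 million years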

The *targets* or delegated targets metadata refer to the actual target
files, including their cryptographic hashes as specified above.
Thus, to mark a target file as part of a consistent snapshot it MUST, when
written to disk, include its hash in its filename:

  HASH.FILENAME
    where HASH is the `hex digest`__ of the hash of the file contents and
    FILENAME is the original filename.

This means that there MAY be multiple copies of every target file, one for each
of the cryptographic hash functions specified above.

__ https://docs.python.org/3.7/library/hashlib.html#hashlib.hash.hexdigest
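
A minimal sketch of this naming scheme follows.  The function name is
hypothetical, and the default algorithms are placeholders for whichever hash
functions the repository lists for its targets.

.. code-block:: python

    import hashlib


    def consistent_target_filenames(filename, content,
                                    algorithms=("sha256", "sha512")):
        """Return one HASH.FILENAME name per hash algorithm."""
        return [
            f"{hashlib.new(algorithm, content).hexdigest()}.{filename}"
            for algorithm in algorithms
        ]

Calling the helper with two algorithms yields two names for the same file
contents, which is why multiple copies of a target file may exist on disk.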

Assuming infinite disk space, strictly incrementing version numbers, and no
`hash collisions`__, a client may safely read from one snapshot while PyPI
produces another snapshot.

__ https://en.wikipedia.org/wiki/Collision_(computer_science)

Clients, such as pip, that use the TUF protocol MUST be modified to download
every metadata and target file, except for *timestamp* metadata, by including
in the requested filename the version of the file (for metadata) or the
cryptographic hash of the file (for target files).

In this simple but effective manner, PyPI is able to capture a consistent
snapshot of all projects and the associated metadata at a given time.  The next
subsection provides implementation details of this idea.

Note: This PEP does not prohibit using advanced file systems or tools to
produce consistent snapshots. There are two important reasons for proposing a simple solution in this PEP.
First, the solution does not mandate that PyPI
use any particular file system or tool.  Second, the generic file-system based
approach allows mirrors to use extant file transfer tools, such as rsync, to
efficiently transfer consistent snapshots from PyPI.


Producing Consistent Snapshots
------------------------------

When a new distribution file is uploaded to PyPI, PyPI MUST update the
responsible *bin-n* metadata. Remember that all target files are sorted into
bins by their filename hashes. PyPI MUST also update *snapshot* to account for
the updated *bin-n* metadata, and *timestamp* to account for the updated
*snapshot* metadata. These updates SHOULD be handled by an automated *snapshot
process*.

File uploads MAY be handled in parallel; however, consistent snapshots MUST be
produced in a strictly sequential manner. Furthermore, as long as distribution
files are self-contained, a consistent snapshot MAY be produced for each
uploaded file. To do so, upload processes place new distribution files into a
concurrency-safe FIFO queue, and the snapshot process reads from that queue one
file at a time and performs the following tasks:

First, it adds the new file path to the relevant *bin-n* metadata, increments
its version number, signs it with the *bin-n* role key, and writes it to
*VERSION_NUMBER.bin-N.json*.

Then, it takes the most recent *snapshot* metadata, updates its *bin-n*
metadata version numbers, increments its own version number, signs it with the
*snapshot* role key, and writes it to *VERSION_NUMBER.snapshot.json*.

And finally, the snapshot process takes the most recent *timestamp* metadata,
updates its *snapshot* metadata hash and version number, increments its own
version number, sets a new expiration time, signs it with the *timestamp* role
key, and writes it to *timestamp.json*.
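
The three steps above can be sketched as follows.  This is a simplified
illustration, not the Warehouse implementation: the ``SnapshotProcess`` class
and the ``sign`` placeholder are hypothetical, an in-memory dictionary stands
in for the disk, the responsible bin name is assumed to be known, and
expiration times and real signatures are omitted.

.. code-block:: python

    import hashlib
    import json
    import queue


    def sign(role: str, signed: dict) -> dict:
        """Placeholder: a real implementation would use the role's online key."""
        return {"signed": signed, "signatures": [f"<signature by {role} key>"]}


    class SnapshotProcess:
        def __init__(self):
            self.uploads = queue.Queue()  # concurrency-safe FIFO of (bin, path)
            self.files = {}               # filename -> JSON text ("the disk")
            self.bins = {}                # bin name -> bin-n metadata
            self.snapshot = {"version": 0, "meta": {}}
            self.timestamp = {"version": 0}

        def _write(self, filename: str, role: str, signed: dict) -> str:
            text = json.dumps(sign(role, signed), sort_keys=True)
            self.files[filename] = text
            return text

        def process_one(self) -> None:
            """Handle exactly one uploaded file, strictly sequentially."""
            bin_name, path = self.uploads.get()

            # 1. Add the file to the bin-n metadata and bump its version.
            bin_meta = self.bins.setdefault(
                bin_name, {"version": 0, "targets": []}
            )
            bin_meta["targets"].append(path)
            bin_meta["version"] += 1
            self._write(
                f"{bin_meta['version']}.{bin_name}.json", bin_name, bin_meta
            )

            # 2. Record the new bin-n version in snapshot and bump its version.
            self.snapshot["meta"][bin_name] = bin_meta["version"]
            self.snapshot["version"] += 1
            snapshot_text = self._write(
                f"{self.snapshot['version']}.snapshot.json",
                "snapshot",
                self.snapshot,
            )

            # 3. Point timestamp at the new snapshot (hash and version).
            self.timestamp["snapshot_version"] = self.snapshot["version"]
            self.timestamp["snapshot_sha256"] = hashlib.sha256(
                snapshot_text.encode("utf-8")
            ).hexdigest()
            self.timestamp["version"] += 1
            self._write("timestamp.json", "timestamp", self.timestamp)

Because older versioned files are never overwritten, a client that started
with *snapshot* version 1 can still fetch ``1.snapshot.json`` and the
*bin-n* files it lists while version 2 is being written.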

When updating *bin-n* metadata for a consistent snapshot, the snapshot process
SHOULD also include any new or updated hashes of simple index pages in the
relevant *bin-n* metadata. Note that simple index pages may be generated
dynamically on API calls, so it is important that their output remains stable
throughout the validity of a consistent snapshot.

Since the snapshot process MUST generate consistent snapshots in a strictly
sequential manner, it constitutes a bottleneck. Fortunately, the operation of
signing is fast enough that this may be done a thousand or more times per
second.

Moreover, PyPI MAY serve distribution files to clients before the corresponding
consistent snapshot metadata is generated. In that case the client software
SHOULD inform the user that full TUF protection is not yet available but will
be shortly.

PyPI SHOULD use a `transaction log`__ to record upload processes and the
snapshot queue for auditing and to recover from errors after a server failure.

__ https://en.wikipedia.org/wiki/Transaction_log


Cleaning up old metadata
------------------------

To avoid running out of disk space due to the constant production of new
consistent snapshots, PyPI SHOULD regularly delete old consistent snapshots,
i.e. metadata and target files that were obsoleted some reasonable time in
the past, such as 1 hour.

In order to preserve the latest consistent snapshot, PyPI MAY use a
"mark-and-sweep" algorithm: walk from the root of the latest
consistent snapshot (i.e., from *timestamp* to *snapshot* to *targets* and
delegated targets, down to the target files), mark all visited files, and
delete all unmarked files. The last few consistent snapshots may be preserved
in a similar fashion.
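
A minimal mark-and-sweep sketch is shown below.  The ``mark_and_sweep``
function and the ``files`` mapping (each filename mapped to the filenames it
references) are hypothetical stand-ins for parsing real metadata, and the
short hash prefixes in the usage example are placeholders.  Versioned *root*
metadata is always kept, since it is not part of any consistent snapshot.

.. code-block:: python

    def mark_and_sweep(files: dict, start: list) -> set:
        """Delete every file not reachable from `start`; return deleted names."""
        marked = set()
        stack = list(start)
        while stack:  # mark phase: walk all references
            name = stack.pop()
            if name in marked or name not in files:
                continue
            marked.add(name)
            stack.extend(files[name])

        # Sweep phase: root metadata is never deleted.
        doomed = {
            name for name in files
            if name not in marked and not name.endswith(".root.json")
        }
        for name in doomed:
            del files[name]
        return doomed

To preserve the last few consistent snapshots as well, their *snapshot*
files can be added to ``start``:

.. code-block:: python

    files = {
        "timestamp.json": ["2.snapshot.json"],
        "2.snapshot.json": ["1.targets.json", "1.bins.json", "2.bin-0.json"],
        "1.snapshot.json": ["1.targets.json", "1.bins.json", "1.bin-0.json"],
        "1.targets.json": [],
        "1.bins.json": [],
        "2.bin-0.json": ["aaa.foo-1.0.tar.gz"],
        "1.bin-0.json": ["aaa.foo-1.0.tar.gz", "bbb.bad-0.1.tar.gz"],
        "aaa.foo-1.0.tar.gz": [],
        "bbb.bad-0.1.tar.gz": [],
        "1.root.json": [],
    }
    deleted = mark_and_sweep(files, ["timestamp.json"])
    # deleted == {"1.snapshot.json", "1.bin-0.json", "bbb.bad-0.1.tar.gz"}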

Deleting a consistent snapshot will cause clients to see nothing except HTTP
404 responses to any request for a file within that consistent snapshot.
Clients SHOULD then retry their requests (as before) with the latest consistent
snapshot.

Note that *root* metadata, even though versioned, is not part of any consistent
snapshot. PyPI MUST NOT delete old versions of *root* metadata. This guarantees
that clients can update to the latest *root* role keys, no matter how outdated
their local *root* metadata is.


Revoking Trust in Projects and Distributions
============================================

From time to time either a project or a distribution will need to be revoked.
To revoke trust in either a project or a distribution, the associated bin-n
role can simply remove the corresponding targets and re-sign the bin-n
metadata. This only requires use of the online bin-n key.
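
As a rough sketch, revocation amounts to producing a new version of the
bin-n metadata without the revoked targets.  The ``revoke_targets`` helper
and the metadata layout are hypothetical; signing and the subsequent
*snapshot* and *timestamp* updates are left out.

.. code-block:: python

    import copy


    def revoke_targets(bin_meta: dict, revoked_paths: set) -> dict:
        """Return new bin-n metadata without the revoked targets."""
        new_meta = copy.deepcopy(bin_meta)  # leave the old version untouched
        for path in revoked_paths:
            new_meta["targets"].pop(path, None)
        new_meta["version"] += 1
        return new_meta  # to be re-signed with the online bin-n key

Revoking an entire project would pass the paths of all of its distributions
and its simple index page in ``revoked_paths``.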


Key Compromise Analysis
=======================

This PEP has covered the minimum security model, the TUF roles that should be
added to support continuous delivery of distributions, and how to generate and
sign the metadata for each role.  The remaining sections discuss how PyPI
SHOULD audit repository metadata, and the methods PyPI can use to detect and
recover from a PyPI compromise.

Table 4 summarizes a few of the attacks possible when a threshold number of
private cryptographic keys (belonging to any of the PyPI roles) are
compromised.  The leftmost column lists the roles (or a combination of roles)
that have been compromised, and the columns to its right show whether the
compromised roles leave clients susceptible to malicious updates, a freeze
attack, or metadata inconsistency attacks. Note that if the timestamp, snapshot,
and bin-n roles are stored in the same online location, a compromise of one
means they will all be compromised. Therefore, the table considers these
roles together. A version of this table that considers these roles separately
is included in :pep:`480`.

+-----------------+-------------------+----------------+--------------------------------+
| Role Compromise | Malicious Updates | Freeze Attack  | Metadata Inconsistency Attacks |
+=================+===================+================+================================+
| targets         | NO                                                                  |
| **OR**          | timestamp and snapshot need to cooperate                            |
| bins            |                                                                     |
+-----------------+-------------------+----------------+--------------------------------+
| timestamp       | YES                                                                 |
| **AND**         | limited by earliest root, targets, or bins metadata expiry time     |
| snapshot        |                                                                     |
| **AND**         |                                                                     |
| bin-n           |                                                                     |
+-----------------+-------------------+----------------+--------------------------------+
| root            | YES                                                                 |
+-----------------+-------------------+----------------+--------------------------------+

Table 4: Attacks possible by compromising certain combinations of role keys.

In `September 2013`__, it was shown how the latest version (at the time) of pip
was susceptible to these attacks and how TUF could protect users against them
[14]_.

__ https://mail.python.org/pipermail/distutils-sig/2013-September/022755.html

Note that compromising *targets* or *bins*
does not immediately allow an attacker to serve malicious
updates.  The attacker must also compromise the *timestamp* and *snapshot*
roles, which are both online and therefore more likely to be compromised.
This means that, in order to launch any attack, one must not only be able to
act as a man-in-the-middle, but also compromise the *timestamp* key (or
compromise the *root* keys and sign a new *timestamp* key).  To launch any
attack other than a freeze attack, one must also compromise the *snapshot* key.
In practice, this PEP recommends storing the *snapshot*, *timestamp*, and
*bin-n* keys together, or even using the same key for all of these roles.
Because of this, the attacker only needs to compromise this single server to
perform any of the attacks listed above. Note that clients are still protected
against compromises of non-signing infrastructure such as CDNs or mirrors.
Moreover, the offline *root* key will
allow the repository to recover from an attack by revoking the online key(s).
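
The reasoning above can be summarized as a small decision function.  This is a
minimal sketch for illustration only, not part of any TUF implementation: the
function name ``attack_capability``, the capability labels, and the
simplification that a role counts as compromised once a threshold of its keys
is held are all assumptions.

```python
# Roles that sign (directly or by delegation) for target files.
SIGNING_ROLES = {"targets", "bins", "bin-n"}


def attack_capability(compromised_roles):
    """Classify what an attacker holding these role keys could do.

    Hypothetical labels: "none", "freeze", "beyond-freeze",
    "malicious-updates".
    """
    roles = set(compromised_roles)
    if "root" in roles:
        # A threshold of root keys can sign new keys for every other role.
        return "malicious-updates"
    if "timestamp" not in roles:
        # Every attack requires the timestamp key.
        return "none"
    if "snapshot" not in roles:
        # With timestamp alone, only a freeze attack is possible.
        return "freeze"
    if roles & SIGNING_ROLES:
        # timestamp and snapshot cooperate with a role that signs targets.
        return "malicious-updates"
    return "beyond-freeze"
```

For example, ``attack_capability({"targets"})`` yields ``"none"`` under these
assumptions, because *timestamp* and *snapshot* have not been compromised.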

The maximum security model shows how TUF mitigates online key compromises by
introducing additional roles for end-to-end signing. Details about how to
generate developer keys and sign uploaded distributions are provided in
:pep:`480`.


In the Event of a Key Compromise
--------------------------------

A key compromise means that a threshold of keys belonging to the metadata
roles on PyPI, as well as the PyPI infrastructure, has been compromised and
used to sign new metadata on PyPI.

If a threshold number of *timestamp*, *snapshot*, *targets*, *bins* or *bin-n*
keys have been compromised, then PyPI MUST take the following steps:

1. Revoke the *timestamp*, *snapshot* and *targets* role keys from
   the *root* role.  This is done by replacing the compromised *timestamp*,
   *snapshot* and *targets* keys with newly issued keys.

2. Revoke the *bins* keys from the *targets* role by replacing their keys with
   newly issued keys.  Sign the new *targets* role metadata and discard the new
   keys (because, as explained earlier, this increases the security of
   *targets* metadata).

3. All targets of the *bin-n* roles SHOULD be compared with the last known
   good consistent snapshot in which none of the *timestamp*, *snapshot*,
   *bins* or *bin-n* keys
   were known to have been compromised.  Added, updated or deleted targets in
   the compromised consistent snapshot that do not match the last known good
   consistent snapshot MAY be restored to their previous versions.  After
   ensuring the integrity of all *bin-n* targets, their keys should be renewed
   in the *bins* metadata.

4. The *bins* and *bin-n* metadata MUST have their version numbers incremented,
   expiry times suitably extended, and signatures renewed.

5. A new timestamped consistent snapshot MUST be issued.

Following these steps would preemptively protect all of these roles, even if
only one of them may have been compromised.
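
The comparison described in step 3 could be sketched as follows.  This is an
illustrative sketch under stated assumptions: each snapshot is represented as a
plain mapping from target path to hash digest, and the function name
``diff_targets`` and the example paths are hypothetical.

```python
def diff_targets(last_good, suspect):
    """Compare two {target_path: digest} mappings.

    Returns the targets that were added, updated, or deleted in the
    suspect snapshot relative to the last known good snapshot.
    """
    added = sorted(p for p in suspect if p not in last_good)
    deleted = sorted(p for p in last_good if p not in suspect)
    updated = sorted(
        p for p in suspect
        if p in last_good and suspect[p] != last_good[p]
    )
    return {"added": added, "updated": updated, "deleted": deleted}


# Hypothetical example data.
last_good = {"a/a-1.0.tar.gz": "h1", "b/b-1.0.tar.gz": "h2"}
suspect = {"a/a-1.0.tar.gz": "hX", "c/c-1.0.tar.gz": "h3"}
report = diff_targets(last_good, suspect)
```

Every path in ``report`` is a candidate for being restored to its previous
version before the *bin-n* keys are renewed.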

If a threshold number of *root* keys have been compromised, then PyPI MUST take
the above steps and also replace all *root* keys in the *root* role.

It is also RECOMMENDED that PyPI sufficiently document compromises with
security bulletins.  These security bulletins will be most informative when
users of pip-with-TUF are unable to install or update a project because the
keys for the *timestamp*, *snapshot* or *root* roles are no longer valid.  They
could then visit the PyPI web site to consult security bulletins that would
help to explain why they are no longer able to install or update, and then take
action accordingly.  As long as a threshold number of *root* keys have not been
revoked due to a compromise, the *root* metadata can be safely updated, because
a threshold number of existing *root* keys will sign the new *root* metadata.
TUF clients will then be able to verify the integrity of the new *root*
metadata with a threshold number of previously known *root* keys.  This will be
the common case.  In the worst case, in which a threshold number of *root* keys
have been revoked due to a compromise, an end-user may choose to obtain the new
*root* metadata through `out-of-band`__ mechanisms.

__ https://en.wikipedia.org/wiki/Out-of-band#Authentication
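
The threshold check that a client applies to new *root* metadata can be
sketched as below.  This is a simplified illustration, not the TUF reference
implementation: HMAC is used only as a stand-in for a real public-key signature
scheme, and the names ``toy_sign`` and ``meets_threshold`` are hypothetical.

```python
import hashlib
import hmac


def toy_sign(key, payload):
    # ASSUMPTION: HMAC stands in for a real signature scheme.  It is
    # symmetric and therefore NOT suitable for actual metadata signing.
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def meets_threshold(payload, signatures, trusted_keys, threshold):
    """Check that enough previously known keys signed the payload.

    signatures:   {keyid: signature} attached to the new metadata
    trusted_keys: {keyid: key} known to the client beforehand
    """
    valid = set()
    for keyid, signature in signatures.items():
        key = trusted_keys.get(keyid)
        if key is None:
            continue  # signatures from unknown keys do not count
        if hmac.compare_digest(signature, toy_sign(key, payload)):
            valid.add(keyid)
    return len(valid) >= threshold
```

Counting distinct key IDs, rather than signatures, ensures that a single key
cannot satisfy the threshold on its own.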


Auditing Snapshots
------------------

If a malicious party compromises PyPI, they can sign arbitrary files with any
of the online keys.  The roles with offline keys (i.e., *root*, *targets* and *bins*)
are still protected.  To safely recover from a repository compromise, snapshots
should be audited to ensure files are only restored to trusted versions.

When a repository compromise has been detected, the integrity of three types of
information must be validated:

1. If the online keys of the repository have been compromised, they can be
   revoked by having the *targets* role sign new metadata delegating to a new
   key.

2. If the role metadata on the repository has been changed, this would impact
   the metadata that is signed by online keys.  Any role information created
   since the last period should be discarded. As a result, developers of new
   projects will need to re-register their projects.

3. If the target files themselves may have been tampered with, they can be
   validated using the stored hash information for target files that existed
   at the time of the last period.

In order to safely restore snapshots in the event of a compromise, PyPI SHOULD
maintain a small number of its own mirrors to copy PyPI snapshots according to
some schedule.  The mirroring protocol can be used immediately for this
purpose.  The mirrors must be secured and isolated such that they are
responsible only for mirroring PyPI. The mirrors can be checked against one
another to detect accidental or malicious failures.

Another approach is to periodically generate the cryptographic hash of
*snapshot* and publish it, for example by tweeting it.  If a user later comes
forward with the actual metadata, the repository maintainers can verify that
metadata file's cryptographic hash against the published value.  Alternatively,
PyPI may periodically archive its own versions of *snapshot* rather than rely
on externally provided metadata.  In this case,
PyPI SHOULD take the cryptographic hash of every target file on the
repository and store this data on an offline device. If any target file
hash has changed, this indicates an attack.
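
A minimal sketch of such an audit is shown below, assuming the target files are
available in a local directory tree and that SHA-256 is the hash in use.  The
function names and the manifest layout are hypothetical, not part of PyPI or
Warehouse.

```python
import hashlib
from pathlib import Path


def sha256_file(path, chunk_size=65536):
    """Hash a file in chunks so large targets need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file below root (as a relative POSIX path) to its digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_targets(archived_manifest, root):
    """List archived targets whose current digest differs or is missing."""
    current = build_manifest(root)
    return sorted(
        path for path, digest in archived_manifest.items()
        if current.get(path) != digest
    )
```

The manifest returned by ``build_manifest`` is what would be stored offline; a
non-empty result from ``changed_targets`` would warrant investigation.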

Attacks that serve different versions of metadata, or that freeze a
distribution at a specific version, can be handled by TUF with techniques
like implicit key revocation and metadata mismatch detection [2]_.


Managing Future Changes to the Update Process
=============================================

If breaking changes are made to the update process, PyPI should implement these
changes without disrupting existing clients. For general guidance on how to do
so, see the ongoing discussion in the TAP repository__.

__ https://github.com/theupdateframework/taps/pull/107

Note that the changes to PyPI from this PEP will be backwards compatible. The
location of target files and simple indices are not changed in this PEP, so any
existing PyPI clients will still be able to perform updates using these files.
This PEP adds the ability for clients to use TUF metadata to improve the
security of the update process.


Hash Algorithm Transition Plan
------------------------------

If the algorithm used to hash target and metadata files becomes vulnerable, it
SHOULD be replaced by a stronger hash algorithm.

The TUF metadata format allows digests from different hash algorithms to be
listed alongside each other, together with an algorithm identifier, so that
clients can seamlessly switch between algorithms.
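
As an illustration, a client might select the first algorithm it supports from
the digests listed for a file.  This is a sketch only: the function name
``verify_target``, the preference order, and the choice of ``blake2b`` and
``sha256`` as the "new" and "old" algorithms are assumptions.

```python
import hashlib


def verify_target(data, hashes, preferred=("blake2b", "sha256")):
    """Verify data against the first listed algorithm the client prefers.

    hashes: {algorithm_identifier: hex_digest}, as listed in the metadata
    """
    for name in preferred:
        if name in hashes:
            return hashlib.new(name, data).hexdigest() == hashes[name]
    # No overlap between what the metadata lists and what the client knows.
    raise ValueError("no supported hash algorithm listed")
```

A client that only knows the old algorithm can pass ``preferred=("sha256",)``
and keeps working for as long as the metadata still lists that digest.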

However, once support for an old algorithm is turned off, clients that don't
support the new algorithm will only be able to install or update packages,
including the client itself, by disabling TUF verification. To allow clients to
transition without temporarily losing TUF security guarantees, we recommend
the following procedure.

1. Implement new algorithm in Warehouse.

2. Regenerate existing, unexpired TUF metadata to include hashes using both the
   old and new algorithms. All new metadata going forward shall list both hash
   algorithms.
   Note, only TUF metadata that lists hash digests for target files or other
   metadata needs to be renewed, that is *bin-n*, *snapshot* and *timestamp*.
   Thus, only online keys are required to sign renewed metadata.

3. Announce transition on high-visibility channels, such as `packaging on
   Python Discourse`__ and the `PyPI changes mailing list`__.

4. Give popular clients such as pip and bandersnatch the chance to adopt new
   hash algorithm.

5. Give end-users the chance to update clients.

6. Get rough consensus from PyPI maintainers to remove old hash algorithm.

7. Remove Warehouse support for old algorithm and only support new algorithm.

__ https://discuss.python.org/c/packaging
__ https://mail.python.org/mailman3/lists/pypi-announce.python.org/
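
Step 2 of the procedure above could look roughly like the following when
regenerating the entry for a single target file.  This is a hedged sketch: the
``length`` and ``hashes`` fields follow the TUF metadata format, while the
function name ``target_entry`` and the specific algorithm names are
assumptions, and this is not Warehouse's actual code.

```python
import hashlib


def target_entry(data, algorithms=("sha256", "blake2b")):
    """Build a metadata entry listing one digest per hash algorithm."""
    return {
        "length": len(data),
        "hashes": {
            name: hashlib.new(name, data).hexdigest()
            for name in algorithms
        },
    }


# During the transition, both the old and the new digest are listed.
entry = target_entry(b"example distribution bytes")
```

Once the old algorithm is retired in step 7, the same function would be called
with only the new algorithm identifier.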


Appendix A: Repository Attacks Prevented by TUF
===============================================

* **Arbitrary software installation**: An attacker installs anything it wants
  on the client system. That is, an attacker can provide arbitrary files in
  response to download requests and the files will not be detected as
  illegitimate.

* **Rollback attacks**: An attacker presents a software update system with
  files older than those the client has already seen. This causes the client to use
  outdated files.

* **Indefinite freeze attacks**: An attacker continues to present a software
  update system with the same files the client has already seen. The result is
  that the client does not know that new files are available.

* **Endless data attacks**: An attacker responds to a file download request
  with an endless stream of data, causing harm to clients (e.g., a disk
  partition filling up or memory exhaustion).

* **Slow retrieval attacks**: An attacker responds to clients with a very slow
  stream of data that essentially results in the client never continuing the
  update process.

* **Extraneous dependencies attacks**: An attacker indicates to clients that in
  order to install the software they want, they also need to install
  unrelated software.  This unrelated software can be from a trusted source
  but may have known vulnerabilities that are exploitable by the attacker.

* **Mix-and-match attacks**: An attacker presents clients with a view of a
  repository that includes files that never existed together on the repository
  at the same time. This can result in, for example, outdated versions of
  dependencies being installed.

* **Wrong software installation**: An attacker provides a client with a trusted
  file that is not the one the client wanted.

* **Malicious mirrors preventing updates**: An attacker in control of one
  repository mirror is able to prevent users from obtaining updates from
  other, good mirrors.

* **Vulnerability to key compromises**: An attacker who is able to compromise a
  single key or fewer than a given threshold of keys can compromise clients.
  This includes relying on a single online key, such as only being protected
  by SSL, or a single offline key, as most software update systems use
  to sign files.
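
Several of the attacks listed above are countered by simple client-side checks.
The following sketch illustrates three of them (rollback, indefinite freeze,
and endless data) under simplifying assumptions; the names ``UpdateRejected``,
``check_metadata`` and ``read_bounded`` are hypothetical and do not correspond
to any particular TUF client API.

```python
from datetime import datetime, timezone


class UpdateRejected(Exception):
    """Raised when metadata or a download fails a safety check."""


def check_metadata(new_version, expires, trusted_version, now=None):
    """Reject rolled-back or expired metadata."""
    now = now or datetime.now(timezone.utc)
    if new_version < trusted_version:
        # Rollback: older than metadata the client has already seen.
        raise UpdateRejected("version is older than trusted version")
    if expires <= now:
        # Freeze: stale metadata cannot be replayed indefinitely.
        raise UpdateRejected("metadata has expired")


def read_bounded(stream, max_length, chunk_size=8192):
    """Read a download, refusing more bytes than the metadata allows."""
    chunks, total = [], 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)
        if total > max_length:
            # Endless data: stop instead of filling disk or memory.
            raise UpdateRejected("download exceeds expected length")
        chunks.append(chunk)
    return b"".join(chunks)
```

A slow retrieval attack would additionally require a timeout or minimum
transfer rate, which is omitted here for brevity.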


References
==========

.. [1] https://pypi.python.org
.. [2] https://theupdateframework.github.io/papers/survivable-key-compromise-ccs2010.pdf
.. [3] http://www.pip-installer.org
.. [4] https://wiki.python.org/moin/WikiAttack2013
.. [5] https://github.com/theupdateframework/pip/wiki/Attacks-on-software-repositories
.. [6] https://mail.python.org/pipermail/distutils-sig/2013-April/020596.html
.. [7] https://mail.python.org/pipermail/distutils-sig/2013-May/020701.html
.. [8] https://mail.python.org/pipermail/distutils-sig/2013-July/022008.html
.. [10] https://mail.python.org/pipermail/distutils-sig/2013-September/022773.html
.. [11] https://mail.python.org/pipermail/distutils-sig/2013-May/020848.html
.. [13] https://theupdateframework.github.io/papers/attacks-on-package-managers-ccs2008.pdf
.. [14] https://mail.python.org/pipermail/distutils-sig/2013-September/022755.html
.. [15] http://ed25519.cr.yp.to/
.. [16] https://github.com/theupdateframework/specification/blob/master/tuf-spec.md
.. [17] https://packaging.python.org/glossary
.. [18] https://en.wikipedia.org/wiki/Continuous_delivery
.. [19] https://mail.python.org/pipermail/distutils-sig/2013-August/022154.html
.. [20] https://en.wikipedia.org/wiki/Key-recovery_attack
.. [22] https://pyfound.blogspot.com/2019/09/pypi-security-q4-2019-request-for.html

Acknowledgements
================

This material is based upon work supported by the National Science Foundation
under Grants No. CNS-1345049 and CNS-0959138. Any opinions, findings, and
conclusions or recommendations expressed in this material are those of the
author(s) and do not necessarily reflect the views of the National Science
Foundation.

We thank Alyssa Coghlan, Daniel Holth, Donald Stufft, and the distutils-sig
community in general for helping us to think about how to usably and
efficiently integrate TUF with PyPI.

Roger Dingledine, Sebastian Hahn, Nick Mathewson, Martin Peck and Justin Samuel
helped us to design TUF from its predecessor, Thandy of the Tor project.

We appreciate the efforts of Konstantin Andrianov, Geremy Condra, Zane Fisher,
Justin Samuel, Tian Tian, Santiago Torres, John Ward, and Yuyu Zheng in
developing TUF.

Vladimir Diaz, Monzur Muhammad, Sai Teja Peddinti, Sumana Harihareswara,
Ee Durbin and Dustin Ingram helped us to review this PEP.

Zane Fisher helped us to review and transcribe this PEP.

Copyright
=========

This document has been placed in the public domain.