PEP: 691
Title: JSON-based Simple API for Python Package Indexes
Author: Donald Stufft <donald@stufft.io>,
        Pradyun Gedam <pradyunsg@gmail.com>,
        Cooper Lees <me@cooperlees.com>,
        Dustin Ingram <di@python.org>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
BDFL-Delegate: Donald Stufft <donald@stufft.io>
Discussions-To: https://discuss.python.org/t/pep-691-json-based-simple-api-for-python-package-indexes/15553
Created: 04-May-2022
Post-History: `05-May-2022 <https://discuss.python.org/t/pep-691-json-based-simple-api-for-python-package-indexes/15553>`__


Abstract
========

The "Simple Repository API" that was defined in :pep:`503` (and was in use much
longer than that) has served us reasonably well for a very long time. However,
the reliance on using HTML as the data exchange mechanism has several
shortcomings.

There are two major issues with an HTML-based API:

- While HTML5 is a standard, it's an incredibly complex standard and ensuring
  completely correct parsing of it involves complex logic that does not
  currently exist within the Python standard library (nor the standard library
  of many other languages).

  This means that to actually accept everything that is technically valid, tools
  have to pull in large dependencies or they have to rely on the standard library's
  ``html.parser`` library, which is lighter weight but potentially doesn't
  fully support HTML5.

- HTML5 is primarily designed as a markup language to present documents for human
  consumption. Our use of it is driven largely by historical and accidental
  reasons, and it's unlikely anyone would design an API that relied on it if
  they were starting from scratch.

  The primary issue with using a markup format designed for human consumption
  is that there's not a great way to actually encode data within HTML. We've
  gotten around this by limiting the data we put in this API and being creative
  with how we can cram data into the API (for instance, hashes are embedded as
  URL fragments, adding the ``data-yanked`` attribute in :pep:`592`).

:pep:`503` was largely an attempt to standardize what was already in use, so it
did not propose any large changes to the API.

In the intervening years, we've regularly talked about an "API V2" that would
re-envision the entire API of PyPI. However, due to time constraints,
that effort has not gained much, if any, traction beyond people thinking that it
would be nice to do it.

This PEP attempts to take a different route. It doesn't fundamentally change
the overall API structure, but instead specifies a new representation of the
existing data contained in existing :pep:`503` responses in a format that is
easier for software to parse rather than using a human-centric document format.

Goals
=====

- **Enable zero configuration discovery.** Clients of the simple API **MUST** be
  able to gracefully determine whether a target repository supports this PEP
  without relying on any form of out of band communication (configuration, prior
  knowledge, etc). Individual clients **MAY** choose to require configuration
  to enable the use of this API, however.
- **Enable clients to drop support for "legacy" HTML parsing.** While it is expected
  that most clients will keep supporting HTML-only repositories for a while, if not
  forever, it should be possible for a client to choose to support only the new
  API formats and no longer invoke an HTML parser.
- **Enable repositories to drop support for "legacy" HTML formats.** Similar to
  clients, it is expected that most repositories will continue to support HTML
  responses for a long time, or forever. It should be possible for a repository to
  choose to only support the new formats.
- **Maintain full support for existing HTML-only clients.** We **MUST NOT** break
  existing clients that are accessing the API as a strictly :pep:`503` API. The only
  exception to this is if the repository itself has chosen to no longer support
  the HTML format.
- **Minimal additional HTTP requests.** Using this API **MUST NOT** drastically
  increase the number of HTTP requests an installer must make in order to function.
  Ideally it will require 0 additional requests, but if needed it may require one
  or two additional requests (total, not per dependency).
- **Minimal additional unique responses.** Due to the nature of how large
  repositories like PyPI cache responses, this PEP should not introduce a
  significantly or combinatorially large number of additional unique responses
  that the repository may produce.
- **Supports TUF.** This PEP **MUST** be able to function within the bounds of
  what TUF can support (:pep:`458`), and must be able to be secured using it.
- **Require only the standard library, or small external dependencies for clients.**
  Parsing an API response should ideally require nothing but the standard
  library, however it would be acceptable to require a small, pure Python
  dependency.


Specification
=============

To enable parsing responses with only the standard library, this PEP specifies that
all responses (besides the files themselves, and the HTML responses from
:pep:`503`) should be encoded using `JSON <https://www.json.org/>`_.

To enable zero configuration discovery and to minimize the number of additional HTTP
requests, this PEP extends :pep:`503` such that all of the API endpoints (other than the
files themselves) will utilize HTTP content negotiation to allow client and server to
select the correct format to serve, i.e. either HTML or JSON.

Format Selection
----------------

In version 1.0, an HTML response will be the default when requesting:

- ``/simple/``
- ``/simple/foo/`` (like :pep:`503`, the trailing ``/`` is expected)

To request a JSON response, an ``Accept`` header needs to be added to the
request to specify the response type and version. For version 1.0 this will look like:

  ``Accept: application/vnd.pypi.simple.v1+json``

The version is optional; when it is omitted, the latest version will always be returned:

  ``Accept: application/vnd.pypi.simple+json``

This is for clients who always want the latest version and should expect potential
breakages. Additionally, it is a potentially useful way to run integration tests
against a possibly breaking version.

Specifying HTML is also allowed so clients can be explicit to backends (e.g. if we
switch to a JSON default in the future):

  ``Accept: application/vnd.pypi.simple.v1+html``

Using ``text/html`` will also work, which will serve the latest API version. To
be explicit, clients should use the versioned HTML ``Accept`` value. If no
``Accept`` header is specified, the latest HTML version will be returned unless
the backend *only* supports JSON. Backends may default to returning JSON in the
future.

The ``Accept`` header also allows you to say that you prefer the V1 Simple JSON API,
that if it's not available you prefer the V1 HTML API, and that if that's not available
either, you will take plain ``text/html``. This would look like:

  ``Accept: application/vnd.pypi.simple.v1+json, application/vnd.pypi.simple.v1+html, text/html``
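
As a rough illustration of the header above, a client might construct its request
as follows. This is a minimal sketch using only the standard library; the index URL
and the helper name ``build_simple_request`` are hypothetical, and the request is
built but not sent.

.. code-block:: python

    import urllib.request

    # The media types from the example above: JSON v1, HTML v1, then plain HTML.
    ACCEPT = ", ".join([
        "application/vnd.pypi.simple.v1+json",
        "application/vnd.pypi.simple.v1+html",
        "text/html",
    ])


    def build_simple_request(index_url: str, project: str) -> urllib.request.Request:
        """Build (but do not send) a request for a project page."""
        # Like PEP 503, the trailing slash is expected.
        url = f"{index_url.rstrip('/')}/{project}/"
        return urllib.request.Request(url, headers={"Accept": ACCEPT})


    # Hypothetical repository URL, used for illustration only.
    request = build_simple_request("https://example.com/simple", "foo")

The same URL is used for both representations; only the ``Accept`` header changes.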

Versioning
----------

Versioning will adhere to :pep:`629` format (``Major.Minor``) and will be
included in the ``Accept`` header that clients send to obtain a JSON
response. We don't foresee the use of *Minor* versioning but will support it if
the need does arise.

The ``Accept`` value for clients requesting version 1.0 of the JSON API will be:

  ``application/vnd.pypi.simple.v1+json``

An example of an ``Accept`` value that a newer API could support **would** look like:

  ``application/vnd.pypi.simple.v2+json``

If a version that does not exist is requested, the server will explicitly return a
`406 Not Acceptable
<https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/406>`_ HTTP status
code. The response will also indicate available API versions and links to
version formats.
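
The selection logic described in this section could be sketched on the repository
side roughly as follows. This is an illustrative sketch, not a prescribed
implementation: the function name ``select_content_type`` and the tables it uses are
hypothetical, and the sketch ignores ``q`` parameters and wildcards for brevity.

.. code-block:: python

    from typing import Optional

    # Media types this hypothetical repository can serve.
    SUPPORTED = (
        "application/vnd.pypi.simple.v1+json",
        "application/vnd.pypi.simple.v1+html",
    )
    # Unversioned names map to the latest version this repository supports.
    LATEST = {
        "application/vnd.pypi.simple+json": "application/vnd.pypi.simple.v1+json",
        "text/html": "application/vnd.pypi.simple.v1+html",
    }
    DEFAULT = "application/vnd.pypi.simple.v1+html"


    def select_content_type(accept: Optional[str]) -> Optional[str]:
        """Return the representation to serve, or None for 406 Not Acceptable."""
        if not accept:
            # No Accept header: fall back to the latest HTML version.
            return DEFAULT
        for item in accept.split(","):
            # Drop parameters such as ";q=0.9"; this sketch ignores their values.
            media_type = item.split(";", 1)[0].strip().lower()
            media_type = LATEST.get(media_type, media_type)
            if media_type in SUPPORTED:
                return media_type
        return None

A request for ``application/vnd.pypi.simple.v2+json`` alone yields ``None`` here,
which is the case where the server would answer with a 406 status code.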


TUF Support - PEP 458
---------------------

:pep:`458` states that the "Simple Index" needs to be hashable. To adhere to the TUF
standard, we will need a target for each response, i.e. the HTML and JSON (plus any
future type) response. To provide this we could have two targets per API endpoint:

- ``/simple/foo/vnd.pypi.simple.v1.html``
- ``/simple/foo/vnd.pypi.simple.v1.json``

Additionally, when calculating the digest of a JSON response, indices should
use the `Canonical JSON <https://wiki.laptop.org/go/Canonical_JSON>`_ format.
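
To illustrate why a canonical form matters for digests, the sketch below hashes a
response after serializing it with sorted keys and no insignificant whitespace.
It only approximates Canonical JSON: ``json.dumps`` is assumed to agree with it for
the plain strings, booleans, lists and dictionaries used in this API, but may differ
for other input such as control characters. The function name ``canonical_digest``
is hypothetical.

.. code-block:: python

    import hashlib
    import json


    def canonical_digest(response: dict, algorithm: str = "sha256") -> str:
        """Hash a response using a serialization that approximates Canonical JSON."""
        # Sorted keys and compact separators make the byte stream reproducible
        # regardless of the order in which the dictionary was built.
        encoded = json.dumps(
            response, sort_keys=True, separators=(",", ":"), ensure_ascii=False
        ).encode("utf-8")
        return hashlib.new(algorithm, encoded).hexdigest()


    # Two logically equal responses built in a different key order.
    first = canonical_digest({"name": "foo", "files": []})
    second = canonical_digest({"files": [], "name": "foo"})

Both calls produce the same digest, which would not be guaranteed if each
repository serialized its responses with arbitrary key order and whitespace.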


Root URL
--------

The root URL ``/`` for this PEP (which represents the base URL) will be a JSON encoded
dictionary where each key is a string of the normalized project name, and the value is
a dictionary with a single key, ``url``, which represents the URL that the project can
be fetched from. As an example::

    {
      "frob": {"url": "/frob/"},
      "spamspamspam": {"url": "/spamspamspam/"}
    }
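
A client could consume such a response with nothing but the standard library. The
sketch below parses the example and resolves each project URL against the URL the
response was fetched from; the ``base_url`` value is a hypothetical placeholder.

.. code-block:: python

    import json
    from urllib.parse import urljoin

    # Assumed for illustration: the URL the root response was fetched from.
    base_url = "https://example.com/"
    body = '{"frob": {"url": "/frob/"}, "spamspamspam": {"url": "/spamspamspam/"}}'

    projects = json.loads(body)

    # Project URLs may be relative, so resolve them against the response URL.
    project_urls = {
        name: urljoin(base_url, entry["url"]) for name, entry in projects.items()
    }

    # Looking up a single project is a plain dictionary access.
    frob_url = project_urls["frob"]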

Below the root URL is another URL for each individual project contained within
a repository. The format of this URL is ``/<project>/`` where the ``<project>``
is replaced by the :pep:`503`-canonicalized name for that project, so a project named
"Holy_Grail" would have a URL like ``/holy-grail/``. This URL must respond with a
JSON encoded dictionary that has two keys, ``name``, which represents the normalized
name of the project and ``files``. The ``files`` key is a list of dictionaries,
each one representing an individual file.

Each individual file dictionary has the following keys:

- ``filename``: The filename that is being represented.
- ``url``: The URL that the file can be fetched from.
- ``hashes``: A dictionary mapping a hash name to a hex encoded digest of the file.
  Multiple hashes can be included, and it is up to the client to decide what to do
  with multiple hashes (it may validate all of them or a subset of them, or nothing
  at all). These hash names **SHOULD** always be normalized to be lowercase.

  The ``hashes`` dictionary **MUST** be present, even if no hashes are available
  for the file, however it is **HIGHLY** recommended that at least one secure,
  guaranteed to be available hash is always included.
- ``requires-python``: An **optional** key that exposes the *Requires-Python*
  metadata field, specified in :pep:`345`. Where this is present, installer tools
  **SHOULD** ignore the download when installing to a Python version that
  doesn't satisfy the requirement.
- ``dist-info-metadata-available``: An **optional** key that indicates
  that metadata for this file is available, via the same location as specified in
  :pep:`658` (``{file_url}.metadata``). Where this is present, it **MUST** be either
  ``true`` or a dictionary mapping a hash name to a hex encoded digest of the
  metadata file.
- ``gpg-sig``: An **optional** key that acts as a boolean to indicate if the file has
  an associated GPG signature or not. If this key does not exist, then the signature
  may or may not exist.
- ``yanked``: An **optional** key which may have no value, or may have an
  arbitrary string as a value. The presence of a ``yanked`` key **SHOULD**
  be interpreted as indicating that the file pointed to by the ``url`` field
  has been "Yanked" as per :pep:`592`.

As an example::

    {
      "name": "holygrail",
      "files": [
        {
          "filename": "holygrail-1.0.tar.gz",
          "url": "https://example.com/files/holygrail-1.0.tar.gz",
          "hashes": {"sha256": "...", "blake2b": "..."},
          "requires-python": ">=3.7",
          "yanked": "Had a vulnerability"
        },
        {
          "filename": "holygrail-1.0-py3-none-any.whl",
          "url": "https://example.com/files/holygrail-1.0-py3-none-any.whl",
          "hashes": {"sha256": "...", "blake2b": "..."},
          "requires-python": ">=3.7",
          "dist-info-metadata-available": true
        }
      ]
    }
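
The sketch below shows one way a client might read a project response like the one
above. It is deliberately simplified: it skips every yanked file, whereas :pep:`592`
allows more nuanced handling, and the key ``some-future-key`` is a made-up example of
an unknown key that the client ignores.

.. code-block:: python

    import json

    body = """
    {
      "name": "holygrail",
      "files": [
        {
          "filename": "holygrail-1.0.tar.gz",
          "url": "https://example.com/files/holygrail-1.0.tar.gz",
          "hashes": {"sha256": "..."},
          "yanked": "Had a vulnerability"
        },
        {
          "filename": "holygrail-1.0-py3-none-any.whl",
          "url": "https://example.com/files/holygrail-1.0-py3-none-any.whl",
          "hashes": {"sha256": "..."},
          "dist-info-metadata-available": true,
          "some-future-key": "ignored"
        }
      ]
    }
    """

    project = json.loads(body)

    # A file counts as yanked when the key is present, whatever its value.
    not_yanked = [f for f in project["files"] if "yanked" not in f]

    # Per PEP 658, the metadata lives next to the file at "{file_url}.metadata".
    metadata_urls = [
        f["url"] + ".metadata"
        for f in not_yanked
        if f.get("dist-info-metadata-available")
    ]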

In addition to the above, the following constraints are placed on the API:

* While JSON doesn't natively support a URL type, any value that represents a
  URL in this API may be either absolute or relative as long as it points to
  the correct location. If relative, it is relative to the current URL as if
  it were HTML.

* Additional keys may be added to any dictionary objects in the API responses
  and clients **MUST** ignore keys that they don't understand.

* By default, any hash algorithm available via `hashlib
  <https://docs.python.org/3/library/hashlib.html>`_ (specifically any that can
  be passed to ``hashlib.new()`` and do not require additional parameters) can
  be used as a key for the hashes dictionary. At least one secure algorithm from
  ``hashlib.algorithms_guaranteed`` **SHOULD** always be included. At the time
  of this PEP, ``sha256`` specifically is recommended.

* Unlike ``data-requires-python`` in :pep:`503`, the ``requires-python`` key does not
  require any special escaping other than anything JSON does naturally.

* Future features **MAY** be implemented or only supported when operating under JSON.
  This would be decided on a case by case basis depending on how important the feature
  is, how widely used HTML is at that point, and how difficult representing the feature
  in HTML would be.

* All requirements of :pep:`503` that are not HTML specific still apply.
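
As an illustration of the hash constraint above, a client could verify a downloaded
file against the ``hashes`` dictionary roughly like this. The preference order in
``PREFERRED`` and the function name ``verify`` are assumptions made for this sketch;
the PEP itself only recommends ``sha256`` specifically.

.. code-block:: python

    import hashlib
    from typing import Mapping

    # Assumed preference order. All three names are in
    # hashlib.algorithms_guaranteed and need no extra parameters.
    PREFERRED = ("sha256", "sha512", "blake2b")


    def verify(data: bytes, hashes: Mapping[str, str]) -> bool:
        """Check data against the first preferred algorithm present in hashes."""
        # Hash names SHOULD already be lowercase, but normalize defensively.
        normalized = {name.lower(): digest.lower() for name, digest in hashes.items()}
        for name in PREFERRED:
            if name in normalized:
                return hashlib.new(name, data).hexdigest() == normalized[name]
        # No usable hash: this sketch treats the file as unverified.
        return False


    payload = b"example file contents"
    good = {"sha256": hashlib.sha256(payload).hexdigest()}

Here ``verify(payload, good)`` succeeds, while altered bytes or an empty ``hashes``
dictionary make the check fail.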


FAQ
===


Why JSON instead of X format?
-----------------------------

JSON parsers are widely available in most, if not every, language. A JSON
parser is also available in the Python standard library. It's not the perfect
format, but it's good enough.


Why not add X feature?
----------------------

The general goal of this PEP is to change or add very little. We will instead focus
largely on translating the existing information contained within our HTML responses
into a sensible JSON representation. This will include :pep:`658` metadata required
for packaging tooling.

The only real new capability that is added in this PEP is the ability to have
multiple hashes for a single file. That was done because the current mechanism being
limited to a single hash has made it painful in the past to migrate hashes
(md5 to sha256) and the cost of making the hashes a dictionary and allowing multiple
is pretty low.

The API was generally designed to allow further extension through adding new keys,
so if there's some new piece of data that an installer might need, future PEPs can
easily make that available.


Why is the root URL a dictionary instead of a list?
---------------------------------------------------

The most natural direct translation of the root URL being a list of links is to turn
it into a list of objects. However, stepping back, that's not the most natural way
to actually represent this data. The list was a result of an HTML limitation that we had to
work around. With a list (either of ``<a>`` tags, or objects) there's nothing stopping
you from listing the same project twice and other unwanted patterns.

A dictionary also allows for average constant-time access given the project name.


Why include the filename when the URL has it already?
-----------------------------------------------------

We could reduce the size of our responses by removing the ``filename`` key and expecting
clients to pull that information out of the URL.

Currently this PEP chooses not to do that, largely because :pep:`503` explicitly required
that the filename be available via the anchor tag of the links, though that was largely
because *something* had to be there. It's not clear if repositories in the wild always
have a filename as the last part of the URL or if they're relying on the filename in the
anchor tag.

It also makes the responses slightly nicer to read for a human, as you get a nice short
unique identifier.

If we gained reasonable confidence that we could mandate that the filename be in the
URL, then we could drop this data and reduce the size of the JSON response.


Why not break out other pieces of information from the filename?
----------------------------------------------------------------

Currently clients are expected to parse a number of pieces of information from the
filename such as project name, version, ABI tags, etc. We could break these out
and add them as keys to the file object.

This PEP has chosen not to do that because doing so would increase the size of the
API responses, and most clients are going to require the ability to parse that
information out of file names anyway, regardless of what the API does. Thus it makes
sense to keep that functionality inside of the clients.


Why Content Negotiation instead of multiple URLs?
-------------------------------------------------

Another reasonable way to implement this would be to duplicate the API routes and
include some marker in the URL itself for JSON, such as making the URLs something
like ``/simple/foo.json``, ``/simple/_index.json``, etc.

This makes some things simpler like TUF integration and fully static serving of a
repository (since ``.json`` files can just be written out).

However, this has three pretty major issues:

- Our current URL structure relies on the fact that there is a URL that represents
  the "root", ``/``, to serve the list of projects. If we want to have separate URLs
  for JSON and HTML, we would need to come up with some way to have two root URLs.

  Something like ``/`` being HTML and ``/_index.json`` being JSON, since ``_index``
  isn't a valid project name could work. But ``/`` being HTML doesn't work great if
  a repository wants to remove support for HTML.

  Another option could be moving all of the existing HTML URLs under a namespace while
  making a new namespace for JSON. Since ``/<project>/`` was defined, we would have to
  make these namespaces not valid project names, so something like ``/_html/`` and
  ``/_json/`` could work, then just redirect the non-namespaced URLs to whatever the
  "default" for that repository is (likely HTML, unless they've disabled HTML, in which case JSON).
- With separate URLs, there's no good way to support zero configuration discovery
  that a repository supports the JSON URLs without making additional HTTP requests to
  determine if the JSON URL exists or not.

  The most naive implementation of this would be to request the JSON URL and fall back
  to the HTML URL for *every* single request, but that would perform horribly
  and violate the goal of minimal additional HTTP requests.

  The most likely implementation of this would be to make some sort of repository level
  configuration file that somehow indicates what is supported. We would have the same
  namespace problem as above, with the same solution, something like ``/_config.json``
  or so could hold that data, and a client could first make an HTTP request to that,
  and if it exists pull it down and parse it to learn about the capabilities of this
  particular repository.
- Separate URLs offer no natural place for versioning, whereas the use of ``Accept``
  allows us to add versioning into this field.

All that being said, it is the opinion of this PEP that those three issues combined make
using separate API routes a less desirable solution than relying on content
negotiation to select the most ideal representation of the data.
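
With content negotiation, a client can issue a single request and then dispatch on
what came back. The sketch below assumes the repository reports the selected
representation in the response ``Content-Type`` header, which this PEP does not spell
out; the names ``file_urls`` and ``_AnchorCollector`` are hypothetical.

.. code-block:: python

    import json
    from html.parser import HTMLParser


    class _AnchorCollector(HTMLParser):
        """Collect href values from the <a> tags of a PEP 503 page."""

        def __init__(self) -> None:
            super().__init__()
            self.hrefs = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                href = dict(attrs).get("href")
                if href is not None:
                    self.hrefs.append(href)


    def file_urls(content_type: str, body: str) -> list:
        """Return file URLs from either representation of a project page."""
        # Ignore parameters such as "; charset=utf-8".
        media_type = content_type.split(";", 1)[0].strip().lower()
        if media_type.endswith("+json"):
            return [f["url"] for f in json.loads(body)["files"]]
        if media_type.endswith("+html") or media_type == "text/html":
            collector = _AnchorCollector()
            collector.feed(body)
            return collector.hrefs
        raise ValueError(f"unsupported content type: {content_type!r}")

No extra request is needed to discover what the repository supports: an HTML-only
repository simply answers with HTML and the client takes the second branch.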


Appendix 1: Survey of use cases to cover
========================================

This was done through a discussion between ``pip`` and ``bandersnatch``
maintainers, who are the first two potential users for the new API. This is
how they use the Simple + JSON APIs today:

- ``pip``:

  - List of all files for a particular release
  - Metadata of each individual artifact:

    - was it yanked? (``data-yanked``)
    - what's the requires-python? (``data-requires-python``)
    - what's the hash of this file? (currently, hash in URL)
    - Full metadata (``data-dist-info-metadata``)
    - [Bonus] what are the declared dependencies, if available (list-of-strings, null if unavailable)?

- ``bandersnatch`` - Only uses legacy JSON API + XMLRPC today:

  - Generates Simple HTML rather than copying from PyPI

    - Maybe this changes with the new API and we verbatim pull these API assets from PyPI

  - List of all files for a particular release.

    - Work out URL for release files to download

  - Metadata of each individual artifact.

    - Write out the JSON to mirror storage today (disk/S3)

      - Required metadata used (via Package class - https://github.com/pypa/bandersnatch/blob/main/src/bandersnatch/package.py):

        - metadata["info"]
        - metadata["last_serial"]
        - metadata["releases"]

          - digests
          - URL

  - XML-RPC calls (we'd love to deprecate these, but we don't think they should go in the Simple API)

    - [Bonus] Get packages since serial X (or all)

      - XML-RPC Call: ``changelog_since_serial``

    - [Bonus] Get all packages with serial

      - XML-RPC Call: ``list_packages_with_serial``