PEP 691: Draft PEP for Simple JSON API (#2578)
Co-authored-by: Cooper Lees <me@cooperlees.com> Co-authored-by: Donald Stufft <donald@stufft.io> Co-authored-by: Pradyun Gedam <pradyunsg@users.noreply.github.com>
parent 5232fda0ab
commit 171c27e292
@ -571,6 +571,7 @@ pep-0687.rst @encukou
pep-0688.rst @jellezijlstra
pep-0689.rst @encukou
pep-0690.rst @warsaw
pep-0691.rst @dstufft
# ...
# pep-0754.txt
# ...

@ -0,0 +1,452 @@
PEP: 691
Title: JSON-based Simple API for Python Package Indexes
Author: Donald Stufft <donald@stufft.io>,
        Pradyun Gedam <pradyunsg@gmail.com>,
        Cooper Lees <me@cooperlees.com>,
        Dustin Ingram <di@python.org>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
BDFL-Delegate: Donald Stufft <donald@stufft.io>
Discussions-To: https://discuss.python.org/t/AAAAAA/999999
Created: 04-May-2022


Abstract
========

The "Simple Repository API" that was defined in :pep:`503` (and was in use much
longer than that) has served us reasonably well for a very long time. However,
the reliance on using HTML as the data exchange mechanism has several
shortcomings.

There are two major issues with an HTML-based API:

- While HTML5 is a standard, it's an incredibly complex standard and ensuring
  completely correct parsing of it involves complex logic that does not
  currently exist within the Python standard library (nor the standard library
  of many other languages).

  This means that to actually accept everything that is technically valid, tools
  have to pull in large dependencies or they have to rely on the standard library's
  ``html.parser`` library, which is lighter weight but potentially doesn't
  fully support HTML5.

- HTML5 is primarily designed as a markup language to present documents for human
  consumption. Our use of it is driven largely by historical and accidental
  reasons, and it's unlikely anyone would design an API that relied on it if
  they were starting from scratch.

  The primary issue with using a markup format designed for human consumption
  is that there's not a great way to actually encode data within HTML. We've
  gotten around this by limiting the data we put in this API and being creative
  with how we can cram data into the API (for instance, hashes are embedded as
  URL fragments, and :pep:`592` added the ``data-yanked`` attribute).

:pep:`503` was largely an attempt to standardize what was already in use, so it
did not propose any large changes to the API.

In the intervening years, we've regularly talked about an "API V2" that would
re-envision the entire API of PyPI. However, due to time constraints,
that effort has not gained much, if any, traction beyond people thinking that it
would be nice to do it.

This PEP attempts to take a different route. It doesn't fundamentally change
the overall API structure, but instead specifies a new representation of the
existing data contained in existing :pep:`503` responses in a format that is
easier for software to parse rather than using a human-centric document format.

Goals
=====

- **Enable zero configuration discovery.** Clients of the simple API **MUST** be
  able to gracefully determine whether a target repository supports this PEP
  without relying on any form of out of band communication (configuration, prior
  knowledge, etc). Individual clients **MAY** choose to require configuration
  to enable the use of this API, however.
- **Enable clients to drop support for "legacy" HTML parsing.** While it is expected
  that most clients will keep supporting HTML-only repositories for a while, if not
  forever, it should be possible for a client to choose to support only the new
  API formats and no longer invoke an HTML parser.
- **Enable repositories to drop support for "legacy" HTML formats.** Similar to
  clients, it is expected that most repositories will continue to support HTML
  responses for a long time, or forever. It should be possible for a repository to
  choose to only support the new formats.
- **Maintain full support for existing HTML-only clients.** We **MUST NOT** break
  existing clients that are accessing the API as a strictly :pep:`503` API. The only
  exception to this is if the repository itself has chosen to no longer support
  the HTML format.
- **Minimal additional HTTP requests.** Using this API **MUST NOT** drastically
  increase the number of HTTP requests an installer must do in order to function.
  Ideally it will require 0 additional requests, but if needed it may require one
  or two additional requests (total, not per dependency).
- **Minimal additional unique responses.** Due to the nature of how large
  repositories like PyPI cache responses, this PEP should not introduce a
  significantly or combinatorially large number of additional unique responses
  that the repository may produce.
- **Support TUF.** This PEP **MUST** be able to function within the bounds of
  what TUF can support (:pep:`458`), and must be able to be secured using it.
- **Require only the standard library, or small external dependencies for clients.**
  Parsing an API response should ideally require nothing but the standard
  library, however it would be acceptable to require a small, pure Python
  dependency.


Specification
=============

To enable parsing responses with only the standard library, this PEP specifies that
all responses (besides the files themselves, and the HTML responses from
:pep:`503`) should be encoded using `JSON <https://www.json.org/>`_.

To enable zero configuration discovery and to minimize the number of additional HTTP
requests, this PEP extends :pep:`503` such that all of the API endpoints (other than the
files themselves) will utilize HTTP content negotiation to allow client and server to
select the correct format to serve, i.e. either HTML or JSON.

Format Selection
----------------

In version 1.0, an HTML response will be the default when requesting:

- ``/simple/``
- ``/simple/foo/``

Like :pep:`503`, the trailing ``/`` is expected.

To request a JSON response, the ``Accept`` header will need to be added to the
request to specify the response type and version. For version 1.0 this will look like:

``Accept: application/vnd.pypi.simple.v1+json``

The version is optional. When it is omitted, the latest version is always returned:

``Accept: application/vnd.pypi.simple+json``

This is for clients who always want the latest version and should expect potential
breakages. Additionally, it is a potentially useful way to run integration tests
against a possibly breaking version.

Specifying HTML is also allowed so clients can be explicit to backends (e.g. if we
switch to a JSON default in the future):

``Accept: application/vnd.pypi.simple.v1+html``

Using ``text/html`` will also work, which will serve the latest API version. To
be explicit, clients should use the specific HTML ``Accept`` value. If no
``Accept`` is specified, the latest HTML version will be returned unless
the backend *only* supports JSON. Backends may default to returning JSON in the
future.

The ``Accept`` header also allows you to say that you prefer the V1 Simple JSON API,
that if it is not available you prefer the V1 HTML API, and that if that is not
available either, you will accept plain ``text/html``. This would look like:

``Accept: application/vnd.pypi.simple.v1+json, application/vnd.pypi.simple.v1+html, text/html``

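As a rough illustration of the negotiation described above, a client could attach the combined ``Accept`` value to its request and then branch on the media type of the response. This is only a sketch: the helper names ``build_request`` and ``response_format`` are hypothetical, the URL is a placeholder, and it assumes the repository reports the selected representation in the ``Content-Type`` response header, which this draft does not spell out.

```python
import urllib.request

# Media types taken from the text above, listed in order of preference.
ACCEPT = ", ".join(
    [
        "application/vnd.pypi.simple.v1+json",
        "application/vnd.pypi.simple.v1+html",
        "text/html",
    ]
)


def build_request(url: str) -> urllib.request.Request:
    """Return a GET request that asks for JSON but tolerates HTML."""
    return urllib.request.Request(url, headers={"Accept": ACCEPT})


def response_format(content_type: str) -> str:
    """Classify a Content-Type header value as 'json' or 'html'.

    Assumption: the server echoes the negotiated media type in Content-Type.
    """
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type.endswith("+json"):
        return "json"
    if media_type.endswith("+html") or media_type == "text/html":
        return "html"
    raise ValueError(f"unexpected media type: {media_type!r}")


# Placeholder URL; no network access happens until the request is opened.
request = build_request("https://example.invalid/simple/foo/")
```

A client that wants to stop invoking an HTML parser would simply drop the two HTML entries from ``ACCEPT`` and treat any non-JSON response as an error.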
Versioning
----------

Versioning will adhere to :pep:`629` format (``Major.Minor``) and will be
included in the ``Accept`` header that clients add to obtain a JSON
response. We don't foresee the use of *Minor* versioning but will support it if
the need does arise.

The ``Accept`` value for clients accessing version 1.0 of the API will be:

``application/vnd.pypi.simple.v1+json``

An example of an ``Accept`` value that a newer API version could support **would** look like:

``application/vnd.pypi.simple.v2+json``

If a version that does not exist is requested, the server will explicitly return a
`406 Not Acceptable
<https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/406>`_ HTTP status
code. The response will also indicate available API versions and links to
version formats.


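To make the versioning rule concrete, the sketch below shows how a repository might map a single requested media type to either a normal response or ``406 Not Acceptable``. It is an assumption-laden illustration: the names ``parse_simple_media_type``, ``status_for`` and ``SUPPORTED_MAJOR`` are hypothetical, it uses the ``application/vnd.pypi.simple.v1+json`` spelling from the Format Selection section, and it looks at one media type at a time rather than a full comma-separated ``Accept`` header.

```python
import re

# Matches the versioned and unversioned Simple API media types used in this PEP.
_MEDIA_TYPE = re.compile(
    r"^application/vnd\.pypi\.simple(?:\.v(?P<major>\d+))?\+(?P<fmt>json|html)$"
)

# Hypothetical: the major versions this repository can serve.
SUPPORTED_MAJOR = {1}


def parse_simple_media_type(value):
    """Return (major_or_None, format), or None if value is not a Simple API type."""
    match = _MEDIA_TYPE.match(value.strip().lower())
    if match is None:
        return None
    major = match.group("major")
    return (int(major) if major is not None else None, match.group("fmt"))


def status_for(media_type):
    """Return the HTTP status a server might answer with for one media type."""
    if media_type.strip().lower() == "text/html":
        return 200  # plain HTML is served at the latest version
    parsed = parse_simple_media_type(media_type)
    if parsed is None:
        return 406
    major, _fmt = parsed
    if major is not None and major not in SUPPORTED_MAJOR:
        return 406  # the requested version does not exist
    return 200
```

An unversioned type such as ``application/vnd.pypi.simple+json`` parses with a major version of ``None``, which this sketch treats as "serve the latest".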
TUF Support - PEP 458
---------------------

:pep:`458` states that the "Simple Index" needs to be hashable. To adhere to the TUF
standard, we will need a target for each response, i.e. the HTML and JSON (plus any
future type) response. To provide this we could have two targets per API endpoint:

- ``/simple/foo/vnd.pypi.simple.v1.html``
- ``/simple/foo/vnd.pypi.simple.v1.json``

Additionally, when calculating the digest of a JSON response, indices should
use the `Canonical JSON <https://wiki.laptop.org/go/Canonical_JSON>`_ format.


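The point of a canonical form is that two semantically identical responses hash to the same digest. The sketch below only approximates that idea with the standard library: it sorts keys and removes insignificant whitespace, but it is not a complete implementation of the linked Canonical JSON format (for example, its string escaping rules differ from those of ``json.dumps``). The function name ``response_digest`` and the choice of ``sha256`` are assumptions made for illustration.

```python
import hashlib
import json


def response_digest(document: dict) -> str:
    """Hash a JSON document after serializing it deterministically.

    Approximation only: sorted keys and compact separators, not the full
    Canonical JSON rules.
    """
    encoded = json.dumps(
        document,
        sort_keys=True,          # key order no longer affects the bytes
        separators=(",", ":"),   # no insignificant whitespace
        ensure_ascii=False,      # keep non-ASCII text as UTF-8
    ).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


# Two dictionaries with the same content but different key order.
first = {"name": "holygrail", "files": []}
second = {"files": [], "name": "holygrail"}
same_digest = response_digest(first) == response_digest(second)
```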
Root URL
--------

The root URL ``/`` for this PEP (which represents the base URL) will be a JSON encoded
dictionary where each key is a string of the normalized project name, and the value is
a dictionary with a single key, ``url``, which represents the URL that the project can
be fetched from. As an example::

    {
      "frob": {"url": "/frob/"},
      "spamspamspam": {"url": "/spamspamspam/"}
    }

Below the root URL is another URL for each individual project contained within
a repository. The format of this URL is ``/<project>/`` where the ``<project>``
is replaced by the :pep:`503`-canonicalized name for that project, so a project named
"Holy_Grail" would have a URL like ``/holy-grail/``. This URL must respond with a
JSON encoded dictionary that has two keys, ``name``, which represents the normalized
name of the project and ``files``. The ``files`` key is a list of dictionaries,
each one representing an individual file.

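For reference, the canonicalization mentioned above is the :pep:`503` name normalization: runs of ``-``, ``_`` and ``.`` collapse to a single ``-`` and the result is lowercased. A minimal sketch follows; the helper ``project_url`` and the sample root document are hypothetical and only mirror the example in the text.

```python
import re


def normalize(name: str) -> str:
    """Normalize a project name as described in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


def project_url(name: str) -> str:
    """Build the per-project URL, relative to the repository root."""
    return f"/{normalize(name)}/"


# Hypothetical root response, keyed by normalized project name.
root = {
    "frob": {"url": "/frob/"},
    "spamspamspam": {"url": "/spamspamspam/"},
}

# Looking up a project given a user-supplied spelling of its name.
frob_entry = root.get(normalize("FROB"))
```

Because the root document is keyed by the normalized name, a client can normalize whatever spelling the user typed and look the project up directly.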
Each individual file dictionary has the following keys:

- ``filename``: The filename that is being represented.
- ``url``: The URL that the file can be fetched from.
- ``hashes``: A dictionary mapping a hash name to a hex encoded digest of the file.
  Multiple hashes can be included, and it is up to the client to decide what to do
  with multiple hashes (it may validate all of them or a subset of them, or nothing
  at all). These hash names **SHOULD** always be normalized to be lowercase.

  The ``hashes`` dictionary **MUST** be present, even if no hashes are available
  for the file, however it is **HIGHLY** recommended that at least one secure,
  guaranteed to be available hash is always included.
- ``requires-python``: An **optional** key that exposes the *Requires-Python*
  metadata field, specified in :pep:`345`. Where this is present, installer tools
  **SHOULD** ignore the download when installing to a Python version that
  doesn't satisfy the requirement.
- ``dist-info-metadata-available``: An **optional** key that indicates
  that metadata for this file is available, via the same location as specified in
  :pep:`658` (``{file_url}.metadata``). Where this is present, it **MUST** be either
  ``true`` or a dictionary mapping a hash name to a hex encoded digest of the metadata file.
- ``gpg-sig``: An **optional** key that acts as a boolean to indicate if the file has
  an associated GPG signature or not. If this key does not exist, then the signature
  may or may not exist.
- ``yanked``: An **optional** key which may have no value, or may have an
  arbitrary string as a value. The presence of a ``yanked`` key **SHOULD**
  be interpreted as indicating that the file pointed to by the ``url`` field
  has been "Yanked" as per :pep:`592`.

As an example::

    {
      "name": "holygrail",
      "files": [
        {
          "filename": "holygrail-1.0.tar.gz",
          "url": "https://example.com/files/holygrail-1.0.tar.gz",
          "hashes": {"sha256": "...", "blake2b": "..."},
          "requires-python": ">=3.7",
          "yanked": "Had a vulnerability"
        },
        {
          "filename": "holygrail-1.0-py3-none-any.whl",
          "url": "https://example.com/files/holygrail-1.0-py3-none-any.whl",
          "hashes": {"sha256": "...", "blake2b": "..."},
          "requires-python": ">=3.7",
          "dist-info-metadata-available": true
        }
      ]
    }

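A client consuming a project response like the example above only needs the standard library ``json`` module. The sketch below uses a deliberately simplistic policy, skipping every file that carries a ``yanked`` key; real installers may apply the more nuanced :pep:`592` rules. The function name ``unyanked_files`` is hypothetical and the hash values are placeholders.

```python
import json

# Sample response body modeled on the example above (hash values elided).
PAYLOAD = """
{
  "name": "holygrail",
  "files": [
    {
      "filename": "holygrail-1.0.tar.gz",
      "url": "https://example.com/files/holygrail-1.0.tar.gz",
      "hashes": {"sha256": "..."},
      "requires-python": ">=3.7",
      "yanked": "Had a vulnerability"
    },
    {
      "filename": "holygrail-1.0-py3-none-any.whl",
      "url": "https://example.com/files/holygrail-1.0-py3-none-any.whl",
      "hashes": {"sha256": "..."},
      "requires-python": ">=3.7",
      "dist-info-metadata-available": true
    }
  ]
}
"""


def unyanked_files(payload: str) -> list:
    """Return (filename, url) pairs for files without a ``yanked`` key."""
    project = json.loads(payload)
    selected = []
    for entry in project["files"]:
        # The presence of the key, with or without a reason, marks the file as yanked.
        if "yanked" in entry:
            continue
        # Only known keys are read; unknown keys are simply ignored.
        selected.append((entry["filename"], entry["url"]))
    return selected


candidates = unyanked_files(PAYLOAD)
```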
In addition to the above, the following constraints are placed on the API:

* While JSON doesn't natively support a URL type, any value that represents a
  URL in this API may be either absolute or relative as long as it points to
  the correct location. If relative, it is relative to the current URL as if
  it were HTML.

* Additional keys may be added to any dictionary objects in the API responses
  and clients **MUST** ignore keys that they don't understand.

* By default, any hash algorithm available via `hashlib
  <https://docs.python.org/3/library/hashlib.html>`_ (specifically any that can
  be passed to ``hashlib.new()`` and do not require additional parameters) can
  be used as a key for the hashes dictionary. At least one secure algorithm from
  ``hashlib.algorithms_guaranteed`` **SHOULD** always be included. At the time
  of this PEP, ``sha256`` specifically is recommended.

* Unlike ``data-requires-python`` in :pep:`503`, the ``requires-python`` key does not
  require any special escaping other than anything JSON does naturally.

* Future features **MAY** be implemented or only supported when operating under JSON.
  This would be decided on a case by case basis depending on how important the feature
  is, how widely used HTML is at that point, and how difficult representing the feature
  in HTML would be.

* All requirements of :pep:`503` that are not HTML specific still apply.


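Two of the constraints above, relative URLs and the ``hashes`` dictionary, can be sketched with the standard library alone. The helper names ``resolve_file_url``, ``pick_hash`` and ``verify`` are hypothetical, and the preference for ``sha256`` simply follows the recommendation in the text. Note that ``hashlib.algorithms_guaranteed`` also contains algorithms that are not considered secure, so a real client would likely apply a stricter allow list.

```python
import hashlib
from urllib.parse import urljoin


def resolve_file_url(page_url: str, file_url: str) -> str:
    """Resolve a possibly relative file URL against the page it came from."""
    return urljoin(page_url, file_url)


def pick_hash(hashes: dict):
    """Choose one (algorithm, digest) pair, preferring sha256."""
    if "sha256" in hashes:
        return "sha256", hashes["sha256"]
    for name in sorted(hashes):
        # The shake_* algorithms need a length parameter, so they are skipped.
        if name in hashlib.algorithms_guaranteed and not name.startswith("shake_"):
            return name, hashes[name]
    return None


def verify(data: bytes, hashes: dict) -> bool:
    """Return True if data matches the chosen digest from the hashes dictionary."""
    chosen = pick_hash(hashes)
    if chosen is None:
        return False  # nothing usable to check against
    name, expected = chosen
    return hashlib.new(name, data).hexdigest() == expected.lower()


resolved = resolve_file_url(
    "https://example.com/simple/holygrail/",
    "../../files/holygrail-1.0.tar.gz",
)
```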
FAQ
===


Why JSON instead of X format?
-----------------------------

JSON parsers are widely available in most, if not every, language. A JSON
parser is also available in the Python standard library. It's not the perfect
format, but it's good enough.


Why not add X feature?
----------------------

The general goal of this PEP is to change or add very little. We will instead focus
largely on translating the existing information contained within our HTML responses
into a sensible JSON representation. This will include :pep:`658` metadata required
for packaging tooling.

The only real new capability that is added in this PEP is the ability to have
multiple hashes for a single file. That was done because the current mechanism being
limited to a single hash has made it painful in the past to migrate hashes
(md5 to sha256) and the cost of making the hashes a dictionary and allowing multiple
is pretty low.

The API was generally designed to allow further extension through adding new keys,
so if there's some new piece of data that an installer might need, future PEPs can
easily make that available.


Why is the root URL a dictionary instead of a list?
---------------------------------------------------

The most natural direct translation of the root URL being a list of links is to turn
it into a list of objects. However, stepping back, that's not the most natural way
to actually represent this data. This was a result of an HTML limitation that we had to
work around. With a list (either of ``<a>`` tags, or objects) there's nothing stopping
you from listing the same project twice and other unwanted patterns.

A dictionary also allows for constant-time access on average given the project name.


Why include the filename when the URL has it already?
-----------------------------------------------------

We could reduce the size of our responses by removing the ``filename`` key and expecting
clients to pull that information out of the URL.

Currently this PEP chooses not to do that, largely because :pep:`503` explicitly required
that the filename be available via the anchor tag of the links, though that was largely
because *something* had to be there. It's not clear if repositories in the wild always
have a filename as the last part of the URL or if they're relying on the filename in the
anchor tag.

It also makes the responses slightly nicer to read for a human, as you get a nice short
unique identifier.

If we gain reasonable confidence that we can mandate that the filename is in the URL,
then we could drop this data and reduce the size of the JSON response.


Why not break out other pieces of information from the filename?
----------------------------------------------------------------

Currently clients are expected to parse a number of pieces of information from the
filename such as project name, version, ABI tags, etc. We could break these out
and add them as keys to the file object.

This PEP has chosen not to do that because doing so would increase the size of the
API responses, and most clients are going to require the ability to parse that
information out of file names anyway, regardless of what the API does. Thus it makes
sense to keep that functionality inside of the clients.


Why Content Negotiation instead of multiple URLs?
-------------------------------------------------

Another reasonable way to implement this would be to duplicate the API routes and
include some marker in the URL itself for JSON, such as making the URLs be something
like ``/simple/foo.json``, ``/simple/_index.json``, etc.

This makes some things simpler like TUF integration and fully static serving of a
repository (since ``.json`` files can just be written out).

However, this approach has three pretty major issues:

- Our current URL structure relies on the fact that there is a URL that represents
  the "root", ``/`` to serve the list of projects. If we want to have separate URLs
  for JSON and HTML, we would need to come up with some way to have two root URLs.

  Something like ``/`` being HTML and ``/_index.json`` being JSON could work, since
  ``_index`` isn't a valid project name. But ``/`` being HTML doesn't work great if
  a repository wants to remove support for HTML.

  Another option could be moving all of the existing HTML URLs under a namespace while
  making a new namespace for JSON. Since ``/<project>/`` was defined, we would have to
  make these namespaces not valid project names, so something like ``/_html/`` and
  ``/_json/`` could work, then just redirect the non namespaced URLs to whatever the
  "default" for that repository is (likely HTML, unless they've disabled HTML then JSON).

- With separate URLs, there's no good way to support zero configuration discovery
  that a repository supports the JSON URLs without making additional HTTP requests to
  determine if the JSON URL exists or not.

  The most naive implementation of this would be to request the JSON URL and fall back
  to the HTML URL for *every* single request, but that would perform horribly
  and violate the goal of minimal additional HTTP requests.

  The most likely implementation of this would be to make some sort of repository level
  configuration file that somehow indicates what is supported. We would have the same
  namespace problem as above, with the same solution, something like ``/_config.json``
  or so could hold that data, and a client could first make an HTTP request to that,
  and if it exists pull it down and parse it to learn about the capabilities of this
  particular repository.

- The use of ``Accept`` also allows us to add versioning into this field.

All being said, it is the opinion of this PEP that those three issues combined make
using separate API routes a less desirable solution than relying on content
negotiation to select the most ideal representation of the data.


Appendix 1: Survey of use cases to cover
========================================

This was done through a discussion between ``pip`` and ``bandersnatch``
maintainers, who are the first two potential users for the new API. This is
how they use the Simple + JSON APIs today:

- ``pip``:

  - List of all files for a particular release
  - Metadata of each individual artifact:

    - was it yanked? (``data-yanked``)
    - what's the requires-python? (``data-requires-python``)
    - what's the hash of this file? (currently, hash in URL)
    - Full metadata (``data-dist-info-metadata``)
    - [Bonus] what are the declared dependencies, if available (list-of-strings, null if unavailable)?

- ``bandersnatch`` - Only uses legacy JSON API + XMLRPC today:

  - Generates Simple HTML rather than copying from PyPI

    - Maybe this changes with the new API and we verbatim pull these API assets from PyPI

  - List of all files for a particular release.

    - Work out URL for release files to download

  - Metadata of each individual artifact.

    - Write out the JSON to mirror storage today (disk/S3)

  - Required metadata used (via Package class - https://github.com/pypa/bandersnatch/blob/main/src/bandersnatch/package.py):

    - metadata["info"]
    - metadata["last_serial"]
    - metadata["releases"]

      - digests
      - URL

  - XML-RPC calls (we'd love to deprecate - but we don't think they should go in the Simple API)

    - [Bonus] Get packages since serial X (or all)

      - XML-RPC Call: ``changelog_since_serial``

    - [Bonus] Get all packages with serial

      - XML-RPC Call: ``list_packages_with_serial``