PEP: 691
Title: JSON-based Simple API for Python Package Indexes
Author: Donald Stufft <donald@stufft.io>,
        Pradyun Gedam <pradyunsg@gmail.com>,
        Cooper Lees <me@cooperlees.com>,
        Dustin Ingram <di@python.org>
PEP-Delegate: Brett Cannon <brett@python.org>
Discussions-To: https://discuss.python.org/t/pep-691-json-based-simple-api-for-python-package-indexes/15553
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 04-May-2022
Post-History: `05-May-2022 <https://discuss.python.org/t/pep-691-json-based-simple-api-for-python-package-indexes/15553>`__
Abstract
========
The "Simple Repository API" that was defined in :pep:`503` (and was in use much
longer than that) has served us reasonably well for a very long time. However,
the reliance on using HTML as the data exchange mechanism has several
shortcomings.
There are two major issues with an HTML-based API:
- While HTML5 is a standard, it's an incredibly complex standard and ensuring
completely correct parsing of it involves complex logic that does not
currently exist within the Python standard library (nor the standard library
of many other languages).
This means that to actually accept everything that is technically valid, tools
have to pull in large dependencies or they have to rely on the standard library's
``html.parser`` library, which is lighter weight but potentially doesn't
fully support HTML5.
- HTML5 is primarily designed as a markup language to present documents for human
consumption. Our use of it is driven largely by historical and accidental
reasons, and it's unlikely anyone would design an API that relied on it if
they were starting from scratch.
The primary issue with using a markup format designed for human consumption
is that there's not a great way to actually encode data within HTML. We've
gotten around this by limiting the data we put in this API and being creative
with how we can cram data into the API (for instance, hashes are embedded as
URL fragments, and :pep:`592` added the ``data-yanked`` attribute).
:pep:`503` was largely an attempt to standardize what was already in use, so it
did not propose any large changes to the API.
In the intervening years, we've regularly talked about an "API V2" that would
re-envision the entire API of PyPI. However, due to time constraints,
that effort has not gained much, if any, traction beyond people thinking that it
would be nice to do it.
This PEP attempts to take a different route. It doesn't fundamentally change
the overall API structure, but instead specifies a new serialization of the
existing data contained in existing :pep:`503` responses in a format that is
easier for software to parse rather than using a human-centric document format.
Goals
=====
- **Enable zero configuration discovery.** Clients of the simple API **MUST** be
able to gracefully determine whether a target repository supports this PEP
without relying on any form of out of band communication (configuration, prior
knowledge, etc). Individual clients **MAY** choose to require configuration
to enable the use of this API, however.
- **Enable clients to drop support for "legacy" HTML parsing.** While it is expected
that most clients will keep supporting HTML-only repositories for a while, if not
forever, it should be possible for a client to choose to support only the new
API formats and no longer invoke an HTML parser.
- **Enable repositories to drop support for "legacy" HTML formats.** Similar to
clients, it is expected that most repositories will continue to support HTML
responses for a long time, or forever. It should be possible for a repository to
choose to only support the new formats.
- **Maintain full support for existing HTML-only clients.** We **MUST NOT** break
existing clients that are accessing the API as a strictly :pep:`503` API. The only
exception is if the repository itself has chosen to no longer support
the HTML format.
- **Minimal additional HTTP requests.** Using this API **MUST NOT** drastically
increase the number of HTTP requests an installer must make in order to function.
Ideally it will require 0 additional requests, but if needed it may require one
or two additional requests (total, not per dependency).
- **Minimal additional unique responses.** Due to the nature of how large
repositories like PyPI cache responses, this PEP should not introduce a
significantly or combinatorially large number of additional unique responses
that the repository may produce.
- **Supports TUF.** This PEP **MUST** be able to function within the bounds of
what TUF can support (:pep:`458`), and must be able to be secured using it.
- **Require only the standard library, or small external dependencies for clients.**
Parsing an API response should ideally require nothing but the standard
library, however it would be acceptable to require a small, pure Python
dependency.
Specification
=============
To enable parsing responses with only the standard library, this PEP specifies that
all responses (besides the files themselves, and the HTML responses from
:pep:`503`) should be serialized using `JSON <https://www.json.org/>`_.
To enable zero configuration discovery and to minimize the number of additional HTTP
requests, this PEP extends :pep:`503` such that all of the API endpoints (other than the
files themselves) will utilize HTTP content negotiation to allow client and server to
select the correct serialization format to serve, i.e. either HTML or JSON.
Versioning
----------
Versioning will adhere to :pep:`629` format (``Major.Minor``), which has defined the
existing HTML responses to be ``1.0``. Since this PEP does not introduce new features
into the API, rather it describes a different serialization format for the existing
features, this PEP does not change the existing ``1.0`` version, and instead just
describes how to serialize that into JSON.
Similar to :pep:`629`, the major version number **MUST** be incremented if any
changes to the new format would result in no longer being able to expect existing
clients to meaningfully understand the format.
Likewise, the minor version **MUST** be incremented if features are
added or removed from the format, but existing clients would be expected to continue
to meaningfully understand the format.
Changes that would not result in existing clients being unable to meaningfully
understand the format and which do not represent features being added or removed
may occur without changing the version number.
This is intentionally vague, as this PEP believes it is best left up to future PEPs
that make any changes to the API to investigate and decide whether or not that
change should increment the major or minor version.
Future versions of the API may add things that can only be represented in a subset
of the available serializations of that version. The version numbers of all
serializations **SHOULD** be kept in sync, but the specifics of how a feature serializes into each
format may differ, including whether or not that feature is present at all.
It is the intent of this PEP that the API should be thought of as URL endpoints that
return data, whose interpretation is defined by the version of that data, and then
serialized into the target serialization format.
JSON Serialization
------------------
The URL structure from :pep:`503` still applies, as this PEP only adds an additional
serialization format for the already existing API.
The following constraints apply to all JSON serialized responses described in this
PEP:
* All JSON responses will *always* be a JSON object rather than an array or other
type.
* While JSON doesn't natively support a URL type, any value that represents a
URL in this API may be either absolute or relative as long as it points to
the correct location. If relative, it is relative to the current URL as if
it were HTML.
* Additional keys may be added to any dictionary objects in the API responses
and clients **MUST** ignore keys that they don't understand.
* All JSON responses will have a ``meta`` key, which contains information related to
the response itself, rather than the content of the response.
* All JSON responses will have a ``meta.api-version`` key, which will be a string that
contains the :pep:`629` ``Major.Minor`` version number, with the same fail/warn
semantics as in :pep:`629`.
* All requirements of :pep:`503` that are not HTML specific still apply.
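As an illustration of the ``meta.api-version`` constraint above, the following is a minimal sketch of how a client might apply the :pep:`629` fail/warn semantics. The names ``SUPPORTED_MAJOR``, ``SUPPORTED_MINOR``, ``UnsupportedAPIVersion`` and ``check_api_version`` are hypothetical and not part of this PEP.

```python
import warnings

# Assumption: this hypothetical client implements version 1.0 of the API.
SUPPORTED_MAJOR, SUPPORTED_MINOR = 1, 0


class UnsupportedAPIVersion(Exception):
    """Hypothetical error raised when the major version is too new."""


def check_api_version(response: dict) -> tuple:
    """Validate ``meta.api-version`` and return it as ``(major, minor)``."""
    major_str, minor_str = response["meta"]["api-version"].split(".")
    major, minor = int(major_str), int(minor_str)
    if major > SUPPORTED_MAJOR:
        # A newer major version may not be meaningfully understood: fail.
        raise UnsupportedAPIVersion(f"unsupported API version {major}.{minor}")
    if major == SUPPORTED_MAJOR and minor > SUPPORTED_MINOR:
        # A newer minor version is still usable, but may carry unknown features.
        warnings.warn(f"API version {major}.{minor} is newer than supported")
    return major, minor


print(check_api_version({"meta": {"api-version": "1.0"}}))  # (1, 0)
```

Unknown keys elsewhere in the response are simply never looked at, which is one way to satisfy the requirement that clients ignore keys they don't understand.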
Project List
~~~~~~~~~~~~
The root URL ``/`` for this PEP (which represents the base URL) will be a JSON encoded
dictionary which has a single key, ``projects``, which is itself a dictionary where each
key is a string of the normalized project name, and the value is a dictionary with a
single key, ``url``, which represents the URL that the project can be fetched from. As
an example:

.. code-block:: json

    {
      "meta": {
        "api-version": "1.0"
      },
      "projects": {
        "frob": {"url": "/frob/"},
        "spamspamspam": {"url": "/spamspamspam/"}
      }
    }
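A minimal sketch of consuming the project list, assuming the response body has already been fetched. It resolves each ``url`` value against the URL of the index page, since values may be relative. The function name ``project_urls`` and the ``https://example.com/simple/`` index URL are hypothetical.

```python
import json
from urllib.parse import urljoin

# Example body mirroring the project list response shown above.
BODY = """
{
  "meta": {"api-version": "1.0"},
  "projects": {
    "frob": {"url": "/frob/"},
    "spamspamspam": {"url": "/spamspamspam/"}
  }
}
"""


def project_urls(index_url: str, body: str) -> dict:
    """Map each normalized project name to an absolute project URL."""
    data = json.loads(body)
    if not isinstance(data, dict):
        raise ValueError("a Simple API JSON response must be a JSON object")
    return {
        # urljoin resolves relative values the same way an HTML link would be.
        name: urljoin(index_url, info["url"])
        for name, info in data["projects"].items()
    }


urls = project_urls("https://example.com/simple/", BODY)
print(urls["frob"])  # https://example.com/frob/
```

Note that a value such as ``/frob/`` is an absolute path, so it resolves against the host rather than against ``/simple/``.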
Project Detail
~~~~~~~~~~~~~~
The format of this URL is ``/<project>/`` where the ``<project>`` is replaced by the
:pep:`503`-canonicalized name for that project, so a project named "Holy_Grail" would
have a URL like ``/holy-grail/``.
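The canonicalization referred to here can be sketched with the regular expression given in :pep:`503`. The helper name ``project_detail_path`` is hypothetical.

```python
import re


def normalize(name: str) -> str:
    """Normalize a project name as described in PEP 503."""
    # Runs of "-", "_" and "." collapse to a single "-", then lowercase.
    return re.sub(r"[-_.]+", "-", name).lower()


def project_detail_path(name: str) -> str:
    """Build the project detail path, relative to the repository root."""
    return f"/{normalize(name)}/"


print(project_detail_path("Holy_Grail"))  # /holy-grail/
```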
This URL must respond with a JSON encoded dictionary that has two keys, ``name``, which
represents the normalized name of the project and ``files``. The ``files`` key is a
list of dictionaries, each one representing an individual file.
Each individual file dictionary has the following keys:
- ``filename``: The filename that is being represented.
- ``url``: The URL that the file can be fetched from.
- ``hashes``: A dictionary mapping a hash name to a hex encoded digest of the file.
Multiple hashes can be included, and it is up to the client to decide what to do
with multiple hashes (it may validate all of them or a subset of them, or nothing
at all). These hash names **SHOULD** always be normalized to be lowercase.
The ``hashes`` dictionary **MUST** be present, even if no hashes are available
for the file, however it is **HIGHLY** recommended that at least one secure,
guaranteed to be available hash is always included.
By default, any hash algorithm available via `hashlib
<https://docs.python.org/3/library/hashlib.html>`_ (specifically any that can
be passed to ``hashlib.new()`` and do not require additional parameters) can
be used as a key for the hashes dictionary. At least one secure algorithm from
``hashlib.algorithms_guaranteed`` **SHOULD** always be included. At the time
of this PEP, ``sha256`` specifically is recommended.
- ``requires-python``: An **optional** key that exposes the *Requires-Python*
metadata field, specified in :pep:`345`. Where this is present, installer tools
**SHOULD** ignore the download when installing to a Python version that
doesn't satisfy the requirement.
Unlike ``data-requires-python`` in :pep:`503`, the ``requires-python`` key does not
require any special escaping other than anything JSON does naturally.
- ``dist-info-metadata``: An **optional** key that indicates
that metadata for this file is available, via the same location as specified in
:pep:`658` (``{file_url}.metadata``). Where this is present, it **MUST** be
either a boolean to indicate if the file has an associated metadata file, or a
dictionary mapping hash names to a hex encoded digest of the metadata file.
When this is a dictionary of hashes, then all the same requirements and
recommendations as the ``hashes`` key hold true for this key as well.
If this key is missing then the metadata file may or may not exist. If the key
value is truthy, then the metadata file is present, and if it is falsey then it
is not.
It is recommended that servers make the hashes of the metadata file available if
possible.
- ``gpg-sig``: An **optional** key that acts as a boolean to indicate if the file has
an associated GPG signature or not. If this key does not exist, then the signature
may or may not exist.
- ``yanked``: An **optional** key which may be a boolean to indicate if the file
has been yanked, or a non empty, but otherwise arbitrary, string to indicate that
a file has been yanked with a specific reason. If the ``yanked`` key is present
and is a truthy value, then it **SHOULD** be interpreted as indicating that the
file pointed to by the ``url`` field has been "Yanked" as per :pep:`592`.
As an example:

.. code-block:: json

    {
      "meta": {
        "api-version": "1.0"
      },
      "name": "holygrail",
      "files": [
        {
          "filename": "holygrail-1.0.tar.gz",
          "url": "https://example.com/files/holygrail-1.0.tar.gz",
          "hashes": {"sha256": "...", "blake2b": "..."},
          "requires-python": ">=3.7",
          "yanked": "Had a vulnerability"
        },
        {
          "filename": "holygrail-1.0-py3-none-any.whl",
          "url": "https://example.com/files/holygrail-1.0-py3-none-any.whl",
          "hashes": {"sha256": "...", "blake2b": "..."},
          "requires-python": ">=3.7",
          "dist-info-metadata": true
        }
      ]
    }
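To make the ``hashes`` and ``yanked`` semantics more concrete, here is a minimal sketch of how a client might check a downloaded file against one of the listed digests and interpret the ``yanked`` key. The preference order in ``PREFERRED_HASHES`` and the helper names are assumptions; the PEP leaves it to clients to decide which hashes to validate.

```python
import hashlib
from typing import Optional, Tuple

# Assumption: this client's preference order among guaranteed algorithms.
PREFERRED_HASHES = ("sha256", "sha512", "blake2b")


def pick_hash(hashes: dict) -> Optional[Tuple[str, str]]:
    """Return the first preferred (name, digest) pair that hashlib guarantees."""
    for name in PREFERRED_HASHES:
        if name in hashes and name in hashlib.algorithms_guaranteed:
            return name, hashes[name]
    return None


def verify_file(content: bytes, file_entry: dict) -> bool:
    """Check downloaded bytes against one digest from the file entry."""
    chosen = pick_hash(file_entry["hashes"])
    if chosen is None:
        return False  # Policy choice of this sketch: no usable hash, no trust.
    name, expected = chosen
    return hashlib.new(name, content).hexdigest() == expected.lower()


def yanked_status(file_entry: dict) -> Tuple[bool, str]:
    """Return (is_yanked, reason); the reason is empty when none was given."""
    yanked = file_entry.get("yanked", False)
    reason = yanked if isinstance(yanked, str) else ""
    return bool(yanked), reason


entry = {
    "filename": "holygrail-1.0.tar.gz",
    "hashes": {"sha256": hashlib.sha256(b"example bytes").hexdigest()},
    "yanked": "Had a vulnerability",
}
print(verify_file(b"example bytes", entry))  # True
print(yanked_status(entry))  # (True, 'Had a vulnerability')
```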
Content-Types
-------------
This PEP proposes that all responses from the Simple API will have a standard
content type that describes what the response is (a Simple API response), what
version of the API it represents, and what serialization format has been used.
The structure of this content type will be:

.. code-block:: text

    application/vnd.pypi.simple.$version+format
Since only major versions should be disruptive to clients attempting to
understand one of these API responses, only the major version will be included
in the content type, and will be prefixed with a ``v`` to clarify that it is a
version number.
This means that for the existing 1.0 API, the content types would be:
- **JSON:** ``application/vnd.pypi.simple.v1+json``
- **HTML:** ``application/vnd.pypi.simple.v1+html``
In addition to the above, a special "meta" version is supported named ``latest``,
whose purpose is to allow clients to request the absolute latest version, without
having to know ahead of time what that version is. It is recommended however,
that clients be explicit about what versions they support.
To support existing clients which expect the existing :pep:`503` API responses to
use the ``text/html`` content type, this PEP further defines ``text/html`` as an alias
for the ``application/vnd.pypi.simple.v1+html`` content type.
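A minimal sketch of how a client might split one of these content types into its version and format parts, treating ``text/html`` as the alias described above. The function name ``parse_simple_content_type`` and the regular expression are illustrative assumptions, not part of this PEP.

```python
import re
from typing import Optional, Tuple

# Matches e.g. "application/vnd.pypi.simple.v1+json" or "...simple.latest+html".
_SIMPLE_RE = re.compile(r"application/vnd\.pypi\.simple\.(v\d+|latest)\+(json|html)")


def parse_simple_content_type(content_type: str) -> Optional[Tuple[str, str]]:
    """Return (version, format) for a Simple API content type, else None."""
    # Drop parameters such as "; charset=utf-8" before matching.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type == "text/html":
        # Alias for application/vnd.pypi.simple.v1+html.
        return ("v1", "html")
    match = _SIMPLE_RE.fullmatch(media_type)
    return (match.group(1), match.group(2)) if match else None


print(parse_simple_content_type("application/vnd.pypi.simple.v1+json"))  # ('v1', 'json')
print(parse_simple_content_type("text/html; charset=utf-8"))  # ('v1', 'html')
```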
Version + Format Selection
--------------------------
Now that there are multiple possible serializations, we need a mechanism to allow
clients to indicate which serialization formats they're able to understand. In
addition, it would be beneficial if any possible new major version of the API could
be added without disrupting existing clients expecting the previous API version.
To enable this, this PEP standardizes on the use of HTTP's
`Server-Driven Content Negotiation <https://developer.mozilla.org/en-US/docs/Web/HTTP/Content_negotiation>`_.
While this PEP won't fully describe the entirety of server-driven content
negotiation, the flow is roughly:
1. The client makes an HTTP request containing an ``Accept`` header listing all
of the version+format content types that they are able to understand.
2. The server inspects that header, selects one of the listed content types,
then returns a response using that content type.
3. If the server does not support any of the content types in the ``Accept``
header or if the client did not provide an ``Accept`` header at all, then
it is able to choose between 3 different options for how to respond:
a. Select a default content type other than what the client has requested
and return a response with that.
b. Return an HTTP ``406 Not Acceptable`` response to indicate that none of
the requested content types were available, and the server was unable
or unwilling to select a default content type to respond with.
c. Return an HTTP ``300 Multiple Choices`` response that contains a list of
all of the possible responses that could have been chosen.
4. The client interprets the response, handling the different types of responses
that the server may have responded with.
This PEP does not specify which choices the server makes in regards to handling
a content type that it isn't able to return, and clients **SHOULD** be prepared
to handle all of the possible responses in whatever way makes the most sense for
that client.
However, as there is no standard format for how a ``300 Multiple Choices``
response can be interpreted, this PEP highly discourages servers from utilizing
that option, as clients will have no way to understand and select a different
content-type to request. In addition, it's unlikely that the client *could*
understand a different content type anyways, so at best this response would
likely just be treated the same as a ``406 Not Acceptable`` error.
This PEP **does** require that if the meta version ``latest`` is being used, the
server **MUST** respond with the content type for the actual version that is
contained in the response
(i.e. an ``Accept: application/vnd.pypi.simple.latest+json`` request that returns
a ``v1.x`` response should have a ``Content-Type`` of
``application/vnd.pypi.simple.v1+json``).
The ``Accept`` header is a comma separated list of content types that the client
understands and is able to process. It supports three different formats for each
content type that is being requested:
- ``$type/$subtype``
- ``$type/*``
- ``*/*``
For the purpose of selecting a version+format, the most useful of these is
``$type/$subtype``, as that is the only way to actually specify the version
and format you want.
The order of the content types listed in the ``Accept`` header does not have any
specific meaning, and the server **SHOULD** consider all of them to be equally
valid to respond with. If a client wishes to specify that they prefer a specific
content type over another, they may use the ``Accept`` header's
`quality value <https://developer.mozilla.org/en-US/docs/Glossary/Quality_values>`_
syntax.
This allows a client to specify a priority for a specific entry in their
``Accept`` header, by appending a ``;q=`` followed by a value between ``0`` and
``1`` inclusive, with up to 3 decimal digits. When interpreting this value,
an entry with a higher quality has priority over an entry with a lower quality,
and any entry without a quality present will default to a quality of ``1``.
However, clients should keep in mind that a server is free to select **any** of
the content types they've asked for, regardless of their requested priority, and
it may even return a content type that they did **not** ask for.
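The quality value rules above can be sketched from the server's side as follows. This is only one possible selection strategy, under the assumptions that the server's own preference order is ``SERVER_TYPES``, that wildcard entries are ignored, and that the header is well formed. The names ``parse_accept`` and ``select_content_type`` are hypothetical.

```python
from typing import Dict, Optional

# Assumption: content types this hypothetical server can produce, most preferred first.
SERVER_TYPES = [
    "application/vnd.pypi.simple.v1+json",
    "application/vnd.pypi.simple.v1+html",
    "text/html",
]


def parse_accept(header: str) -> Dict[str, float]:
    """Map each media type in an Accept header to its quality value."""
    result = {}
    for entry in header.split(","):
        media_type, *params = (part.strip() for part in entry.split(";"))
        if not media_type:
            continue
        quality = 1.0  # Entries without an explicit quality default to 1.
        for param in params:
            key, _, value = param.partition("=")
            if key.strip().lower() == "q":
                quality = float(value)
        result[media_type.lower()] = quality
    return result


def select_content_type(accept_header: str) -> Optional[str]:
    """Pick the acceptable type with the highest quality, else None."""
    accepted = parse_accept(accept_header)
    candidates = [
        # Ties on quality are broken by the server's own preference order.
        (accepted[content_type], -index, content_type)
        for index, content_type in enumerate(SERVER_TYPES)
        if accepted.get(content_type, 0) > 0  # A quality of 0 means "not acceptable".
    ]
    return max(candidates)[2] if candidates else None


print(select_content_type("text/html;q=0.5, application/vnd.pypi.simple.v1+html;q=0.9"))
# application/vnd.pypi.simple.v1+html
```

When ``select_content_type`` returns ``None``, the server would fall back to one of the options listed in step 3, such as a default content type or a ``406 Not Acceptable`` response.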
To aid clients in determining the content type of the response that they have
received from an API request, this PEP requires that servers always include a
``Content-Type`` header indicating the content type of the response. This is
technically a backwards incompatible change, however in practice
`pip has been enforcing this requirement <https://github.com/pypa/pip/blob/cf3696a81b341925f82f20cb527e656176987565/src/pip/_internal/index/collector.py#L123-L150>`_
so the risk of actual breakage is low.
An example of how a client can operate would look like:

.. code-block:: python

    import email.message
    import requests

    def parse_content_type(header: str) -> str:
        # Extract the bare media type, dropping any parameters (e.g. charset).
        m = email.message.Message()
        m["content-type"] = header
        return m.get_content_type()

    # Construct our list of acceptable content types, we want to prefer
    # that we get a v1 response serialized using JSON, however we also
    # can support a v1 response serialized using HTML. For compatibility
    # we also request text/html, but we prefer it least of all since we
    # don't know if it's actually a Simple API response, or just some
    # random HTML page that we've gotten due to a misconfiguration.
    # A quality of 0 would mean "not acceptable", so we use small
    # non-zero values to express the lower preference instead.
    CONTENT_TYPES = [
        "application/vnd.pypi.simple.v1+json",
        "application/vnd.pypi.simple.v1+html;q=0.2",
        "text/html;q=0.01",  # For legacy compatibility
    ]
    ACCEPT = ", ".join(CONTENT_TYPES)

    # Actually make our request to the API, requesting all of the content
    # types that we find acceptable, and letting the server select one of
    # them out of the list.
    resp = requests.get("https://pypi.org/simple/", headers={"Accept": ACCEPT})

    # If the server does not support any of the content types you requested,
    # AND it has chosen to return an HTTP 406 error instead of a default
    # response then this will raise an exception for the 406 error.
    resp.raise_for_status()

    # Determine what kind of response we've gotten to ensure that it is one
    # that we can support, and if it is, dispatch to a function that will
    # understand how to interpret that particular version+serialization. If
    # we don't understand the content type we've gotten, then we'll raise
    # an exception.
    content_type = parse_content_type(resp.headers.get("content-type", ""))
    match content_type:
        case "application/vnd.pypi.simple.v1+json":
            handle_v1_json(resp)
        case "application/vnd.pypi.simple.v1+html" | "text/html":
            handle_v1_html(resp)
        case _:
            raise Exception(f"Unknown content type: {content_type}")
If a client wishes to only support HTML or only support JSON, then they would
just remove the content types that they do not want from the ``Accept`` header,
and turn receiving them into an error.
Alternative Negotiation Mechanisms
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
While using HTTP's Content negotiation is considered the standard way for a client
and server to coordinate to ensure that the client is getting an HTTP response that
it is able to understand, there are situations where that mechanism may not be
sufficient. For those cases this PEP has alternative negotiation mechanisms that
may *optionally* be used instead.
URL Parameter
^^^^^^^^^^^^^
Servers that implement the Simple API may choose to support a URL parameter named
``format`` to allow clients to request a specific version+format of the response.
The value of the ``format`` parameter should be **one** of the valid content types.
Passing multiple content types, wild cards, quality values, etc is **not** supported.
Supporting this parameter is optional, and clients **SHOULD NOT** rely on it for
interacting with the API. This negotiation mechanism is intended to allow for easier
human based exploration of the API within a browser, or to allow documentation or
notes to link to a specific version+format.
Servers that do not support this parameter may choose to return an error when it is
present, or they may simply ignore its presence.
When a server does implement this parameter, it **SHOULD** take precedence over any
values in the client's ``Accept`` header, and if the server does not support the
requested format, it may choose to fall back to the ``Accept`` header, or choose any
of the error conditions that standard server-driven content negotiation typically
has (e.g. ``406 Not Acceptable``, ``300 Multiple Choices``, or selecting a default
type to return).
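A small sketch of building a link that uses the ``format`` parameter, for example to paste into documentation. The helper name ``with_format`` and the ``https://example.com/simple/`` URL are hypothetical, and the sketch assumes the URL has no existing query string. The main point it shows is that the ``+`` in the content type needs to be percent-encoded, since a bare ``+`` in a query string is commonly decoded as a space.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def with_format(url: str, content_type: str) -> str:
    """Append a ``format`` query parameter holding exactly one content type."""
    # urlencode percent-encodes "+" as "%2B" and "/" as "%2F".
    return f"{url}?{urlencode({'format': content_type})}"


link = with_format(
    "https://example.com/simple/", "application/vnd.pypi.simple.v1+json"
)
print(link)
# https://example.com/simple/?format=application%2Fvnd.pypi.simple.v1%2Bjson

# Decoding the query string recovers the original content type.
print(parse_qs(urlsplit(link).query)["format"][0])
# application/vnd.pypi.simple.v1+json
```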
Endpoint Configuration
^^^^^^^^^^^^^^^^^^^^^^
This option technically is not a special option at all, it is just a natural
consequence of using content negotiation and allowing servers to select which of the
available content types is their default.
If a server is unwilling or unable to implement the server-driven content negotiation,
and would instead rather require users to explicitly configure their client to select
the version they want, then that is a supported configuration.
To enable this, a server should make multiple endpoints (for instance,
``/simple/v1+html/`` and/or ``/simple/v1+json/``) for each version+format that they
wish to support. Under that endpoint, they can host a copy of their repository that
only supports one (or a subset) of the content-types. When a client makes a request
using the ``Accept`` header, the server can ignore it and return the content type
that corresponds to that endpoint.
For clients that wish to require specific configuration, they can keep track of
which version+format a specific repository URL was configured for, and when making
a request to that server, emit an ``Accept`` header that *only* includes the correct
content type.
TUF Support - PEP 458
---------------------
:pep:`458` requires that all API responses are hashable and that they can be uniquely
identified by a path relative to the repository root. For a Simple API repository, the
target path is the root of our API (e.g. ``/simple/`` on PyPI). This creates
challenges when accessing the API using a TUF client instead of directly using a
standard HTTP client, as the TUF client cannot handle the fact that a target could
have multiple different representations that all hash differently.
:pep:`458` does not specify what the target path should be for the Simple API, but I
believe that TUF requires that the target paths be "file-like", in other words, a
path like ``simple/PROJECT/`` is not acceptable, because it technically points to a
directory.
The saving grace is that the target path does not *have* to actually match the URL
being fetched from the Simple API, and it can just be a sigil that the fetching code
knows how to transform into the actual URL that needs to be fetched. This same thing
can hold true for other aspects of the actual HTTP request, such as the ``Accept``
header.
Ultimately figuring out how to map a directory to a filename is out of scope for this
PEP (but it would be in scope for :pep:`458`), and this PEP defers making a decision
about how exactly to represent this inside of :pep:`458` metadata.
However, it appears that the current WIP branch against pip that attempts to implement
:pep:`458` is using a target path like ``simple/PROJECT/index.html``. This could be
modified to include the API version and serialization format using something like
``simple/PROJECT/vnd.pypi.simple.vN.FORMAT``. So the v1 HTML format would be
``simple/PROJECT/vnd.pypi.simple.v1.html`` and the v1 JSON format would be
``simple/PROJECT/vnd.pypi.simple.v1.json``.
In this case, since ``text/html`` is an alias to ``application/vnd.pypi.simple.v1+html``,
it will likely make the most sense to normalize to the more explicit name when
interacting through TUF.
Likewise the ``latest`` metaversion should not be included in the targets, only
explicitly declared versions should be supported.
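Purely as an illustration of the naming idea floated above, the mapping from a content type to a file-like target path could look like the following sketch. This PEP defers the actual decision, so the ``simple/PROJECT/vnd.pypi.simple.vN.FORMAT`` layout and the function name ``tuf_target_path`` are assumptions rather than a specified scheme.

```python
def tuf_target_path(project: str, content_type: str) -> str:
    """Map a project and content type to a hypothetical TUF target path."""
    if content_type == "text/html":
        # Normalize the legacy alias to the more explicit name.
        content_type = "application/vnd.pypi.simple.v1+html"
    prefix = "application/"
    if not content_type.startswith(prefix):
        raise ValueError(f"not a Simple API content type: {content_type}")
    name, _, fmt = content_type[len(prefix):].partition("+")
    if not fmt or name.endswith(".latest"):
        # Only explicitly declared versions are valid targets.
        raise ValueError(f"no target path for: {content_type}")
    return f"simple/{project}/{name}.{fmt}"


print(tuf_target_path("holy-grail", "application/vnd.pypi.simple.v1+json"))
# simple/holy-grail/vnd.pypi.simple.v1.json
```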
Recommendations
===============
This section is non-normative, and represents what the PEP authors believe to be
the best default implementation decisions for something implementing this PEP, but
it does **not** represent any sort of requirement to match these decisions.
These decisions have been chosen to maximize the number of requests that can be
moved onto the newest version of an API, while maintaining the greatest amount
of compatibility. In addition, they've also tried to make using the API provide
guardrails that attempt to push clients into making the best choices it can.
It is recommended that servers:
- Support all 3 content types described in this PEP, using server-driven
content negotiation, for as long as they reasonably can, or at least as
long as they're receiving non trivial traffic that uses the HTML responses.
- When encountering an ``Accept`` header that does not contain any content types
that it knows how to work with, should not ever return a ``300 Multiple Choices``
response, and should prefer to return a ``406 Not Acceptable`` response instead.
- However, if choosing to use the endpoint configuration, you should prefer to
return a ``200 OK`` response in the expected content type for that endpoint.
- When selecting an acceptable version, should choose the highest version that
the client supports, with the most expressive/featureful serialization format,
taking into account the specificity of the client requests as well as any
quality priority values they have expressed, and it should only use the
``text/html`` content type as a last resort.
It is recommended that clients:
- Support all 3 content types described in this PEP, using server-driven
content negotiation, for as long as they reasonably can.
- When constructing an ``Accept`` header, include all of the content types
that you support.
You should generally *not* include a quality priority value for your content
types, unless you have implementation specific reasons that you want the
server to take into account (for example, if you're using the standard library
HTML parser and you're worried that there may be some kinds of HTML responses
that you're unable to parse in some edge cases).
The one exception to this recommendation is that it is recommended that you
*should* include a small, non-zero quality value such as ``;q=0.01`` on the legacy
``text/html`` content type (a quality of ``0`` marks a type as not acceptable),
unless it is the only content type that you are requesting.
- Explicitly select what versions they are looking for, rather than using the
``latest`` meta version during normal operation.
- Check the ``Content-Type`` of the response and ensure it matches something
that you were expecting.
FAQ
===
Does this mean PyPI is planning to drop support for HTML/PEP 503?
-----------------------------------------------------------------
No, PyPI has no plans at this time to drop support for :pep:`503` or HTML
responses.
While this PEP does give repositories the flexibility to do that, that largely
exists to ensure that things like the Endpoint Configuration mechanism are
able to work, and to ensure that clients do not make any assumptions that would
prevent, at some point in the future, gracefully dropping support for HTML.
The existing HTML responses incur almost no maintenance burden on PyPI and
there is no pressing need to remove them. The only real benefit to dropping them
would be to reduce the number of items cached in our CDN.
If in the future PyPI *does* wish to drop support for them, doing so would
almost certainly be the topic of a PEP, or at a minimum a public, open, discussion
and would be informed by metrics showing any impact to end users.
Why JSON instead of X format?
-----------------------------
JSON parsers are widely available in most, if not every, language. A JSON
parser is also available in the Python standard library. It's not the perfect
format, but it's good enough.
Why not add X feature?
----------------------
The general goal of this PEP is to change or add very little. We will instead focus
largely on translating the existing information contained within our HTML responses
into a sensible JSON representation. This will include :pep:`658` metadata required
for packaging tooling.
The only real new capability that is added in this PEP is the ability to have
multiple hashes for a single file. That was done because the current mechanism being
limited to a single hash has made it painful in the past to migrate hashes
(md5 to sha256) and the cost of making the hashes a dictionary and allowing multiple
is pretty low.
The API was generally designed to allow further extension through adding new keys,
so if there's some new piece of data that an installer might need, future PEPs can
easily make that available.
Why is the root URL a dictionary instead of a list?
---------------------------------------------------
The most natural direct translation of the root URL being a list of links is to turn
it into a list of objects. However, stepping back, that's not the most natural way
to actually represent this data. This was a result of an HTML limitation that we had to
work around. With a list (either of ``<a>`` tags, or objects) there's nothing stopping
you from listing the same project twice and other unwanted patterns.
A dictionary also allows for an average of constant-time access given the project name.
Why include the filename when the URL has it already?
-----------------------------------------------------
We could reduce the size of our responses by removing the ``filename`` key and expecting
clients to pull that information out of the URL.
Currently this PEP chooses not to do that, largely because :pep:`503` explicitly required
that the filename be available via the anchor tag of the links, though that was largely
because *something* had to be there. It's not clear if repositories in the wild always
have a filename as the last part of the URL or if they're relying on the filename in the
anchor tag.
It also makes the responses slightly nicer to read for a human, as you get a nice short
unique identifier.
If we gain reasonable confidence that the filename can be mandated to be in the URL,
then we could drop this data and reduce the size of the JSON response.
Why not break out other pieces of information from the filename?
----------------------------------------------------------------
Currently clients are expected to parse a number of pieces of information from the
filename such as project name, version, ABI tags, etc. We could break these out
and add them as keys to the file object.
This PEP has chosen not to do that because doing so would increase the size of the
API responses, and most clients are going to require the ability to parse that
information out of file names anyway, regardless of what the API does. Thus it makes
sense to keep that functionality inside of the clients.
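For a sense of what that client-side parsing involves, the following is a simplified sketch that splits a wheel filename into the components defined by :pep:`427`. It is illustrative only: it skips validation and normalization, and real clients would more likely rely on an existing library such as ``packaging``.

```python
def parse_wheel_filename(filename):
    """Split a wheel filename into its PEP 427 components (no validation)."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 5:
        name, version, python_tag, abi_tag, platform_tag = parts
        build_tag = None
    elif len(parts) == 6:
        # The optional build tag sits between the version and the Python tag.
        name, version, build_tag, python_tag, abi_tag, platform_tag = parts
    else:
        raise ValueError(f"unexpected wheel filename: {filename!r}")
    return {
        "name": name,
        "version": version,
        "build": build_tag,
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }
```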
Why Content Negotiation instead of multiple URLs?
-------------------------------------------------
Another reasonable way to implement this would be to duplicate the API routes and
include some marker in the URL itself for JSON, such as making the URLs something
like ``/simple/foo.json``, ``/simple/_index.json``, etc.
This makes some things simpler like TUF integration and fully static serving of a
repository (since ``.json`` files can just be written out).
However, this has three pretty major issues:
- Our current URL structure relies on the fact that there is a URL that represents
the "root", ``/`` to serve the list of projects. If we want to have separate URLs
for JSON and HTML, we would need to come up with some way to have two root URLs.
Something like ``/`` being HTML and ``/_index.json`` being JSON could work, since
``_index`` isn't a valid project name. But ``/`` being HTML doesn't work great if
a repository wants to remove support for HTML.
Another option could be moving all of the existing HTML URLs under a namespace while
making a new namespace for JSON. Since ``/<project>/`` was defined, we would have to
make these namespaces not valid project names, so something like ``/_html/`` and
``/_json/`` could work, then just redirect the non-namespaced URLs to whatever the
"default" for that repository is (likely HTML, or JSON if they've disabled HTML).
- With separate URLs, there's no good way to support zero configuration discovery
that a repository supports the JSON URLs without making additional HTTP requests to
determine if the JSON URL exists or not.
The most naive implementation of this would be to request the JSON URL and fall back
to the HTML URL for *every* single request, but that would perform horribly
and violate the goal of minimal additional HTTP requests.
The most likely implementation of this would be to make some sort of repository level
configuration file that somehow indicates what is supported. We would have the same
namespace problem as above, with the same solution, something like ``/_config.json``
or so could hold that data, and a client could first make an HTTP request to that,
and if it exists pull it down and parse it to learn about the capabilities of this
particular repository.
- The use of ``Accept`` also allows us to add versioning into this field.
All that being said, it is the opinion of this PEP that those three issues combined make
using separate API routes a less desirable solution than relying on content
negotiation to select the most ideal representation of the data.
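As a minimal client-side sketch of this approach, a single request can advertise every representation the client understands, and the response's ``Content-Type`` then tells the client which parser to use. The quality values and helper names below are illustrative assumptions, and the request is only constructed, not sent.

```python
from urllib.request import Request

# Illustrative preference order: JSON first, then HTML, then the legacy alias.
ACCEPT = ", ".join(
    [
        "application/vnd.pypi.simple.v1+json",
        "application/vnd.pypi.simple.v1+html;q=0.2",
        "text/html;q=0.01",
    ]
)


def build_request(url):
    """Build (but do not send) a request that asks for JSON first."""
    return Request(url, headers={"Accept": ACCEPT})


def response_format(content_type):
    """Map a response Content-Type to the parser the client should use."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type == "application/vnd.pypi.simple.v1+json":
        return "json"
    if media_type in ("application/vnd.pypi.simple.v1+html", "text/html"):
        return "html"
    raise ValueError(f"unsupported content type: {content_type!r}")
```

A repository that only speaks HTML would answer the same request with ``text/html``, so no extra round trip is needed to discover what it supports.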
Does this mean that static servers are no longer supported?
-----------------------------------------------------------
In short, no, static servers are still (almost) fully supported by this PEP.
The specifics of how they are supported will depend on the static server in
question. For example:
- **S3:** S3 fully supports custom content types, however it does not support
any form of content negotiation. In order to have a server hosted on S3, you
would have to use the "Endpoint configuration" style of negotiation, and
users would have to configure their clients explicitly.
- **GitHub Pages:** GitHub Pages does not support custom content types, so the
S3 solution is not currently workable, which means that only ``text/html``
repositories would function.
- **Apache:** Apache fully supports server-driven content negotiation, and would
just need to be configured to map the custom content types to specific file extensions.
Doesn't TUF support require having different URLs for each representation?
--------------------------------------------------------------------------
While in TUF, each target can only have a single representation, and by default
that is assumed to map exactly to the target path that is being referenced
within TUF, there is actually no requirement that the target path is the same
as the server path, or that the same data can't be represented by multiple targets.
In fact, TUF doesn't support the Simple API URLs as they are already, because
TUF assumes that a target points to a filename, but all of the Simple API URLs
are directories. Thus regardless of this PEP, there is going to have to be
something that translates between the naming of the targets within the TUF
metadata, and the actual requests being made to the server.
Currently the WIP TUF implementation for pip maps a target like
``simple/PROJECT/index.html`` to an HTTP request to fetch ``/simple/PROJECT/``.
However there is no reason that it could not be extended to map a target
like ``/simple/PROJECT/vnd.pypi.simple.v1.html`` to an HTTP request to
fetch ``/simple/PROJECT/`` with an ``Accept`` header of
``application/vnd.pypi.simple.v1+html``.
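A translation layer along those lines could be sketched as follows. The function name and the exact naming rule for targets are assumptions for illustration; they are not taken from the WIP TUF implementation.

```python
import posixpath


def request_for_target(target):
    """Map a TUF target path to (request path, Accept header or None)."""
    directory, leaf = posixpath.split(target.lstrip("/"))
    path = "/" + directory + "/"
    if leaf == "index.html":
        # Existing mapping: fetch the directory URL with no special header.
        return path, None
    if leaf.startswith("vnd.pypi.simple."):
        # Hypothetical rule: the final "." becomes "+" in the content type.
        base, _, suffix = leaf.rpartition(".")
        return path, f"application/{base}+{suffix}"
    raise ValueError(f"unrecognized target: {target!r}")
```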
Why not add an ``application/json`` alias like ``text/html``?
-------------------------------------------------------------
This PEP believes that it is best for both clients and servers to be explicit
about the types of the API responses that are being used, and a content type
like ``application/json`` is the exact opposite of explicit.
The ``text/html`` alias exists as a compromise primarily to
ensure that existing consumers of the API continue to function as they already
do. There is no such expectation of existing clients using the Simple API with
an ``application/json`` content type.
In addition, ``application/json`` has no versioning in it, which means that
if there is ever a ``2.0`` version of the Simple API, we will be forced to make
a decision. Should ``application/json`` preserve backwards compatibility and
continue to be an alias for ``application/vnd.pypi.simple.v1+json``, or should
it be updated to be an alias for ``application/vnd.pypi.simple.v2+json``?
This problem doesn't exist for ``text/html``, because the assumption is that
HTML will remain a legacy format, and will likely not gain *any* new features,
much less features that require breaking compatibility. So having it be an
alias for ``application/vnd.pypi.simple.v1+html`` is effectively the same as
having it be an alias for ``application/vnd.pypi.simple.latest+html``, since
``1.0`` will likely be the only HTML version to exist.
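The asymmetry between the two aliases can be seen in a small sketch of how a client might interpret content types. The helper below is hypothetical: it resolves the ``text/html`` alias to a versioned type, but has no defensible answer for ``application/json`` and so rejects it.

```python
import re

_VERSIONED = re.compile(
    r"application/vnd\.pypi\.simple\.(?P<version>v\d+|latest)\+(?P<format>json|html)"
)

# text/html is the only alias kept, so that existing clients keep working.
_ALIASES = {"text/html": "application/vnd.pypi.simple.v1+html"}


def parse_content_type(content_type):
    """Return (version, format) for a Simple API content type."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    media_type = _ALIASES.get(media_type, media_type)
    match = _VERSIONED.fullmatch(media_type)
    if match is None:
        raise ValueError(f"not a Simple API content type: {content_type!r}")
    return match["version"], match["format"]
```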
The largest benefit to adding the ``application/json`` content type is that
there are services that do not allow you to have custom content types, and require
you to select one of their preset content types. The main example of this is
GitHub Pages: the lack of ``application/json`` support in this PEP means
that static repositories will no longer be able to be hosted on GitHub Pages
unless GitHub adds the ``application/vnd.pypi.simple.v1+json`` content type.
This PEP believes that the benefits are not large enough to add that content
type alias at this time, and that its inclusion would likely be a footgun
waiting for unsuspecting people to accidentally pick it up. Especially given
that we can always add it in the future, but removing things is a lot harder
to do.
Appendix 1: Survey of use cases to cover
========================================
This was done through a discussion between ``pip``, ``PyPI``, and ``bandersnatch``
maintainers, who are the two first potential users for the new API. This is
how they use the Simple + JSON APIs today:
- ``pip``:

  - List of all files for a particular release
  - Metadata of each individual artifact:

    - was it yanked? (``data-yanked``)
    - what's the python-requires? (``data-python-requires``)
    - what's the hash of this file? (currently, hash in URL)
    - Full metadata (``data-dist-info-metadata``)
    - [Bonus] what are the declared dependencies, if available (list-of-strings, null if unavailable)?

- ``bandersnatch`` - Only uses legacy JSON API + XMLRPC today:

  - Generates Simple HTML rather than copying from PyPI

    - Maybe this changes with the new API and we verbatim pull these API assets from PyPI

  - List of all files for a particular release.

    - Work out URL for release files to download

  - Metadata of each individual artifact.

    - Write out the JSON to mirror storage today (disk/S3)

      - Required metadata used
        (via `Package class <https://github.com/pypa/bandersnatch/blob/main/src/bandersnatch/package.py>`__):

        - ``metadata["info"]``
        - ``metadata["last_serial"]``
        - ``metadata["releases"]``

          - digests
          - URL

  - XML-RPC calls (we'd love to deprecate - but we don't think they should go in the Simple API)

    - [Bonus] Get packages since serial X (or all)

      - XML-RPC Call: ``changelog_since_serial``

    - [Bonus] Get all packages with serial

      - XML-RPC Call: ``list_packages_with_serial``

Copyright
=========
This document is placed in the public domain or under the
CC0-1.0-Universal license, whichever is more permissive.