PEP: 691
Title: JSON-based Simple API for Python Package Indexes
Author: Donald Stufft, Pradyun Gedam, Cooper Lees, Dustin Ingram
PEP-Delegate: Brett Cannon
Discussions-To: https://discuss.python.org/t/pep-691-json-based-simple-api-for-python-package-indexes/15553
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 04-May-2022
Post-History: 05-May-2022

Abstract
========

The "Simple Repository API" that was defined in :pep:`503` (and was in use much
longer than that) has served us reasonably well for a very long time. However,
the reliance on using HTML as the data exchange mechanism has several
shortcomings.

There are two major issues with an HTML-based API:

- While HTML5 is a standard, it's an incredibly complex standard and ensuring
  completely correct parsing of it involves complex logic that does not
  currently exist within the Python standard library (nor the standard library
  of many other languages).

  This means that to actually accept everything that is technically valid, tools
  have to pull in large dependencies or they have to rely on the standard
  library's ``html.parser`` library, which is lighter weight but potentially
  doesn't fully support HTML5.

- HTML5 is primarily designed as a markup language to present documents for
  human consumption. Our use of it is driven largely by historical and
  accidental reasons, and it's unlikely anyone would design an API that relied
  on it if they were starting from scratch.

  The primary issue with using a markup format designed for human consumption
  is that there's not a great way to actually encode data within HTML. We've
  gotten around this by limiting the data we put in this API and being creative
  with how we can cram data into the API (for instance, hashes are embedded as
  URL fragments, adding the ``data-yanked`` attribute in :pep:`592`).

:pep:`503` was largely an attempt to standardize what was already in use, so it
did not propose any large changes to the API.

In the intervening years, we've regularly talked about an "API V2" that would
re-envision the entire API of PyPI. However, due to time constraints, that
effort has not gained much, if any, traction beyond people thinking that it
would be nice to do.

This PEP attempts to take a different route. It doesn't fundamentally change
the overall API structure, but instead specifies a new serialization of the
existing data contained in existing :pep:`503` responses in a format that is
easier for software to parse rather than using a human centric document format.

Goals
=====

- **Enable zero configuration discovery.** Clients of the simple API **MUST** be
  able to gracefully determine whether a target repository supports this PEP
  without relying on any form of out of band communication (configuration, prior
  knowledge, etc). Individual clients **MAY** choose to require configuration
  to enable the use of this API, however.
- **Enable clients to drop support for "legacy" HTML parsing.** While it is
  expected that most clients will keep supporting HTML-only repositories for a
  while, if not forever, it should be possible for a client to choose to support
  only the new API formats and no longer invoke an HTML parser.
- **Enable repositories to drop support for "legacy" HTML formats.** Similar to
  clients, it is expected that most repositories will continue to support HTML
  responses for a long time, or forever. It should be possible for a repository
  to choose to only support the new formats.
- **Maintain full support for existing HTML-only clients.** We **MUST NOT**
  break existing clients that are accessing the API as a strictly :pep:`503`
  API. The only exception to this is if the repository itself has chosen to no
  longer support the HTML format.
- **Minimal additional HTTP requests.** Using this API **MUST NOT** drastically
  increase the number of HTTP requests an installer must make in order to
  function. Ideally it will require 0 additional requests, but if needed it may
  require one or two additional requests (total, not per dependency).
- **Minimal additional unique responses.** Due to the nature of how large
  repositories like PyPI cache responses, this PEP should not introduce a
  significantly or combinatorially large number of additional unique responses
  that the repository may produce.
- **Supports TUF.** This PEP **MUST** be able to function within the bounds of
  what TUF can support (:pep:`458`), and must be able to be secured using it.
- **Require only the standard library, or small external dependencies for
  clients.** Parsing an API response should ideally require nothing but the
  standard library, however it would be acceptable to require a small, pure
  Python dependency.

Specification
=============

To enable parsing responses with only the standard library, this PEP specifies
that all responses (besides the files themselves, and the HTML responses from
:pep:`503`) should be serialized using JSON.

To enable zero configuration discovery and to minimize the number of additional
HTTP requests, this PEP extends :pep:`503` such that all of the API endpoints
(other than the files themselves) will utilize HTTP content negotiation to allow
client and server to select the correct serialization format to serve, i.e.
either HTML or JSON.

Versioning
----------

Versioning will adhere to :pep:`629` format (``Major.Minor``), which has defined
the existing HTML responses to be ``1.0``. Since this PEP does not introduce new
features into the API, rather it describes a different serialization format for
the existing features, this PEP does not change the existing ``1.0`` version,
and instead just describes how to serialize that into JSON.

Similar to :pep:`629`, the major version number **MUST** be incremented if any
changes to the new format would result in no longer being able to expect
existing clients to meaningfully understand the format.

Likewise, the minor version **MUST** be incremented if features are added or
removed from the format, but existing clients would be expected to continue to
meaningfully understand the format.

Changes that would not result in existing clients being unable to meaningfully
understand the format and which do not represent features being added or
removed may occur without changing the version number.

This is intentionally vague, as this PEP believes it is best left up to future
PEPs that make any changes to the API to investigate and decide whether or not
that change should increment the major or minor version.

Future versions of the API may add things that can only be represented in a
subset of the available serializations of that version. The version numbers of
all serializations **SHOULD** be kept in sync, but the specifics of how a
feature serializes into each format may differ, including whether or not that
feature is present at all.

It is the intent of this PEP that the API should be thought of as URL endpoints
that return data, whose interpretation is defined by the version of that data,
and then serialized into the target serialization format.

JSON Serialization
------------------

The URL structure from :pep:`503` still applies, as this PEP only adds an
additional serialization format for the already existing API.

The following constraints apply to all JSON serialized responses described in
this PEP:

* All JSON responses will *always* be a JSON object rather than an array or
  other type.

* While JSON doesn't natively support a URL type, any value that represents a
  URL in this API may be either absolute or relative as long as it points to
  the correct location. If relative, it is relative to the current URL as if it
  were HTML.

* Additional keys may be added to any dictionary objects in the API responses
  and clients **MUST** ignore keys that they don't understand.

* All JSON responses will have a ``meta`` key, which contains information
  related to the response itself, rather than the content of the response.

* All JSON responses will have a ``meta.api-version`` key, which will be a
  string that contains the :pep:`629` ``Major.Minor`` version number, with the
  same fail/warn semantics as in :pep:`629`.

* All requirements of :pep:`503` that are not HTML specific still apply.

Project List
~~~~~~~~~~~~

The root URL ``/`` for this PEP (which represents the base URL) will be a JSON
encoded dictionary which, besides the ``meta`` key described above, has a single
key, ``projects``. That key is itself a dictionary where each key is a string of
the normalized project name, and the value is a dictionary with a single key,
``url``, which represents the URL that the project can be fetched from. As an
example:

.. code-block:: json

    {
      "meta": {
        "api-version": "1.0"
      },
      "projects": {
        "frob": {"url": "/frob/"},
        "spamspamspam": {"url": "/spamspamspam/"}
      }
    }

Project Detail
~~~~~~~~~~~~~~

The format of this URL is ``/<project>/`` where the ``<project>`` is replaced by
the :pep:`503`-canonicalized name for that project, so a project named
"Holy_Grail" would have a URL like ``/holy-grail/``.

This URL must respond with a JSON encoded dictionary that, besides ``meta``, has
two keys: ``name``, which represents the normalized name of the project, and
``files``. The ``files`` key is a list of dictionaries, each one representing an
individual file.

Each individual file dictionary has the following keys:

- ``filename``: The filename that is being represented.
- ``url``: The URL that the file can be fetched from.
- ``hashes``: A dictionary mapping a hash name to a hex encoded digest of the
  file. Multiple hashes can be included, and it is up to the client to decide
  what to do with multiple hashes (it may validate all of them or a subset of
  them, or nothing at all). These hash names **SHOULD** always be normalized to
  be lowercase.

  The ``hashes`` dictionary **MUST** be present, even if no hashes are available
  for the file, however it is **HIGHLY** recommended that at least one secure,
  guaranteed to be available hash is always included.

  By default, any hash algorithm available via ``hashlib`` (specifically any
  that can be passed to ``hashlib.new()`` and do not require additional
  parameters) can be used as a key for the hashes dictionary. At least one
  secure algorithm from ``hashlib.algorithms_guaranteed`` **SHOULD** always be
  included. At the time of this PEP, ``sha256`` specifically is recommended.
- ``requires-python``: An **optional** key that exposes the *Requires-Python*
  metadata field, specified in :pep:`345`. Where this is present, installer
  tools **SHOULD** ignore the download when installing to a Python version that
  doesn't satisfy the requirement.

  Unlike ``data-requires-python`` in :pep:`503`, the ``requires-python`` key
  does not require any special escaping other than anything JSON does naturally.
- ``dist-info-metadata``: An **optional** key that indicates that metadata for
  this file is available, via the same location as specified in :pep:`658`
  (``{file_url}.metadata``). Where this is present, it **MUST** be either a
  boolean to indicate if the file has an associated metadata file, or a
  dictionary mapping hash names to a hex encoded digest of the metadata's hash.

  When this is a dictionary of hashes, then all the same requirements and
  recommendations as the ``hashes`` key hold true for this key as well.

  If this key is missing then the metadata file may or may not exist. If the
  key value is truthy, then the metadata file is present, and if it is falsey
  then it is not.

  It is recommended that servers make the hashes of the metadata file available
  if possible.
- ``gpg-sig``: An **optional** key that acts as a boolean to indicate if the
  file has an associated GPG signature or not. If this key does not exist, then
  the signature may or may not exist.
- ``yanked``: An **optional** key which may be a boolean to indicate if the
  file has been yanked, or a non empty, but otherwise arbitrary, string to
  indicate that a file has been yanked with a specific reason. If the
  ``yanked`` key is present and is a truthy value, then it **SHOULD** be
  interpreted as indicating that the file pointed to by the ``url`` field has
  been "Yanked" as per :pep:`592`.

As an example:

.. code-block:: json

    {
      "meta": {
        "api-version": "1.0"
      },
      "name": "holygrail",
      "files": [
        {
          "filename": "holygrail-1.0.tar.gz",
          "url": "https://example.com/files/holygrail-1.0.tar.gz",
          "hashes": {"sha256": "...", "blake2b": "..."},
          "requires-python": ">=3.7",
          "yanked": "Had a vulnerability"
        },
        {
          "filename": "holygrail-1.0-py3-none-any.whl",
          "url": "https://example.com/files/holygrail-1.0-py3-none-any.whl",
          "hashes": {"sha256": "...", "blake2b": "..."},
          "requires-python": ">=3.7",
          "dist-info-metadata": true
        }
      ]
    }

Content-Types
-------------

This PEP proposes that all responses from the Simple API will have a standard
content type that describes what the response is (a Simple API response), what
version of the API it represents, and what serialization format has been used.

The structure of this content type will be:

.. code-block:: text

    application/vnd.pypi.simple.$version+$format

Since only major versions should be disruptive to clients attempting to
understand one of these API responses, only the major version will be included
in the content type, and will be prefixed with a ``v`` to clarify that it is a
version number.

This means that for the existing 1.0 API, the content types would be:

- **JSON:** ``application/vnd.pypi.simple.v1+json``
- **HTML:** ``application/vnd.pypi.simple.v1+html``

In addition to the above, a special "meta" version is supported named ``latest``,
whose purpose is to allow clients to request the absolute latest version, without
having to know ahead of time what that version is.
It is recommended, however, that clients be explicit about which versions they
support.

To support existing clients which expect the existing :pep:`503` API responses
to use the ``text/html`` content type, this PEP further defines ``text/html`` as
an alias for the ``application/vnd.pypi.simple.v1+html`` content type.

Version + Format Selection
--------------------------

Now that there are multiple possible serializations, we need a mechanism to
allow clients to indicate which serialization formats they're able to
understand. In addition, it would be a benefit if any possible new major version
of the API could be added without disrupting existing clients expecting the
previous API version.

To enable this, this PEP standardizes on the use of HTTP's Server-Driven Content
Negotiation.

While this PEP won't fully describe the entirety of server-driven content
negotiation, the flow is roughly:

1. The client makes an HTTP request containing an ``Accept`` header listing all
   of the version+format content types that it is able to understand.
2. The server inspects that header, selects one of the listed content types,
   then returns a response using that content type.
3. If the server does not support any of the content types in the ``Accept``
   header or if the client did not provide an ``Accept`` header at all, then
   the server is able to choose between 3 different options for how to respond:

   a. Select a default content type other than what the client has requested
      and return a response with that.
   b. Return a HTTP ``406 Not Acceptable`` response to indicate that none of
      the requested content types were available, and the server was unable or
      unwilling to select a default content type to respond with.
   c. Return a HTTP ``300 Multiple Choices`` response that contains a list of
      all of the possible responses that could have been chosen.
4. The client interprets the response, handling the different types of
   responses that the server may have responded with.
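
As a rough illustration of step 4, a client might classify the server's answer
along the following lines. This is only a sketch: the ``interpret`` function and
the ``NegotiationError`` exception are hypothetical names that this PEP does not
define, and decoding HTML bodies as UTF-8 is an assumption made to keep the
example short.

```python
import json

JSON_V1 = "application/vnd.pypi.simple.v1+json"
HTML_V1 = "application/vnd.pypi.simple.v1+html"


class NegotiationError(Exception):
    """Hypothetical error: the server's answer cannot be used by this client."""


def interpret(status, content_type_header, body):
    """Classify a Simple API response as ("json", dict) or ("html", str)."""
    if status == 406:
        raise NegotiationError("server supports none of the requested content types")
    if status == 300:
        # There is no standard body format for this status, so it is
        # handled the same way as a 406.
        raise NegotiationError("server returned 300 Multiple Choices")
    if status != 200:
        raise NegotiationError(f"unexpected HTTP status: {status}")

    # Drop parameters such as "; charset=utf-8" before comparing.
    kind = content_type_header.split(";", 1)[0].strip().lower()
    if kind == JSON_V1:
        data = json.loads(body)
        if not isinstance(data, dict):
            raise NegotiationError("JSON responses must be JSON objects")
        return "json", data
    if kind in (HTML_V1, "text/html"):
        # Assumption: the HTML body is UTF-8 encoded.
        return "html", body.decode("utf-8")
    # The server picked a default that this client did not ask for.
    raise NegotiationError(f"unknown content type: {kind!r}")
```

The final branch covers option (a) above, where a server falls back to a default
content type that the client never listed.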

This PEP does not specify which choices the server makes in regards to handling
a content type that it isn't able to return, and clients **SHOULD** be prepared
to handle all of the possible responses in whatever way makes the most sense for
that client.

However, as there is no standard format for how a ``300 Multiple Choices``
response can be interpreted, this PEP highly discourages servers from utilizing
that option, as clients will have no way to understand and select a different
content-type to request. In addition, it's unlikely that the client *could*
understand a different content type anyway, so at best this response would
likely just be treated the same as a ``406 Not Acceptable`` error.

This PEP **does** require that if the meta version ``latest`` is being used, the
server **MUST** respond with the content type for the actual version that is
contained in the response (i.e. an ``Accept: application/vnd.pypi.simple.latest+json``
request that returns a ``v1.x`` response should have a ``Content-Type`` of
``application/vnd.pypi.simple.v1+json``).

The ``Accept`` header is a comma separated list of content types that the client
understands and is able to process. It supports three different formats for each
content type that is being requested:

- ``$type/$subtype``
- ``$type/*``
- ``*/*``

For the use of selecting a version+format, the most useful of these is
``$type/$subtype``, as that is the only way to actually specify the version
and format you want.

The order of the content types listed in the ``Accept`` header does not have any
specific meaning, and the server **SHOULD** consider all of them to be equally
valid to respond with. If a client wishes to specify that it prefers a specific
content type over another, it may use the ``Accept`` header's quality value
syntax.
This allows a client to specify a priority for a specific entry in its
``Accept`` header, by appending a ``;q=`` followed by a value between ``0`` and
``1`` inclusive, with up to 3 decimal digits. When interpreting this value, an
entry with a higher quality has priority over an entry with a lower quality, and
any entry without a quality present will default to a quality of ``1``. Note
that in HTTP a quality of exactly ``0`` marks a content type as *not
acceptable*, so a client that still wants a format as a last resort should give
it a small non-zero value.

However, clients should keep in mind that a server is free to select **any** of
the content types they've asked for, regardless of their requested priority, and
it may even return a content type that they did **not** ask for.

To aid clients in determining the content type of the response that they have
received from an API request, this PEP requires that servers always include a
``Content-Type`` header indicating the content type of the response. This is
technically a backwards incompatible change, however in practice pip has been
enforcing this requirement, so the risk of actual breakage is low.

An example of how a client can operate would look like:

.. code-block:: python

    import requests

    # Construct our list of acceptable content types, we want to prefer
    # that we get a v1 response serialized using JSON, however we also
    # can support a v1 response serialized using HTML. For compatibility
    # we also request text/html, but we prefer it least of all since we
    # don't know if it's actually a Simple API response, or just some
    # random HTML page that we've gotten due to a misconfiguration.
    CONTENT_TYPES = [
        "application/vnd.pypi.simple.v1+json",
        "application/vnd.pypi.simple.v1+html",
        "text/html;q=0.01",  # For legacy compatibility (q=0 would reject it)
    ]
    ACCEPT = ", ".join(CONTENT_TYPES)

    # Actually make our request to the API, requesting all of the content
    # types that we find acceptable, and letting the server select one of
    # them out of the list.
    resp = requests.get("https://pypi.org/simple/", headers={"Accept": ACCEPT})

    # If the server does not support any of the content types you requested,
    # AND it has chosen to return a HTTP 406 error instead of a default
    # response then this will raise an exception for the 406 error.
    resp.raise_for_status()

    # Determine what kind of response we've gotten to ensure that it is one
    # that we can support, and if it is, dispatch to a function that will
    # understand how to interpret that particular version+serialization. If
    # we don't understand the content type we've gotten, then we'll raise
    # an exception. Parameters such as "; charset=utf-8" are dropped first.
    content_type = resp.headers.get("content-type", "").split(";", 1)[0].strip().lower()
    match content_type:
        case "application/vnd.pypi.simple.v1+json":
            handle_v1_json(resp)
        case "application/vnd.pypi.simple.v1+html" | "text/html":
            handle_v1_html(resp)
        case _:
            raise Exception(f"Unknown content type: {content_type}")

If a client wishes to only support HTML or only support JSON, then it would
just remove the content types that it does not want from the ``Accept`` header,
and turn receiving them into an error.

Alternative Negotiation Mechanisms
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

While using HTTP's Content negotiation is considered the standard way for a
client and server to coordinate to ensure that the client is getting an HTTP
response that it is able to understand, there are situations where that
mechanism may not be sufficient. For those cases this PEP has alternative
negotiation mechanisms that may *optionally* be used instead.

URL Parameter
^^^^^^^^^^^^^

Servers that implement the Simple API may choose to support a URL parameter
named ``format`` to allow the clients to request a specific version of the URL.

The value of the ``format`` parameter should be **one** of the valid content
types. Passing multiple content types, wild cards, quality values, etc is
**not** supported.

Supporting this parameter is optional, and clients **SHOULD NOT** rely on it for
interacting with the API.
This negotiation mechanism is intended to allow for easier human based
exploration of the API within a browser, or to allow documentation or notes to
link to a specific version+format.

Servers that do not support this parameter may choose to return an error when it
is present, or they may simply ignore its presence.

When a server does implement this parameter, it **SHOULD** take precedence over
any values in the client's ``Accept`` header, and if the server does not support
the requested format, it may choose to fall back to the ``Accept`` header, or
choose any of the error conditions that standard server-driven content
negotiation typically has (e.g. ``406 Not Acceptable``, ``300 Multiple Choices``,
or selecting a default type to return).

Endpoint Configuration
^^^^^^^^^^^^^^^^^^^^^^

This option technically is not a special option at all, it is just a natural
consequence of using content negotiation and allowing servers to select which of
the available content types is their default.

If a server is unwilling or unable to implement the server-driven content
negotiation, and would instead rather require users to explicitly configure
their client to select the version they want, then that is a supported
configuration.

To enable this, a server should make multiple endpoints (for instance,
``/simple/v1+html/`` and/or ``/simple/v1+json/``), one for each version+format
that it wishes to support. Under each endpoint, it can host a copy of its
repository that only supports one (or a subset) of the content-types. When a
client makes a request using the ``Accept`` header, the server can ignore it and
return the content type that corresponds to that endpoint.

For clients that wish to require specific configuration, they can keep track of
which version+format a specific repository URL was configured for, and when
making a request to that server, emit an ``Accept`` header that *only* includes
the correct content type.
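
The client side of endpoint configuration could be sketched as follows. The
repository URLs, the ``CONFIGURED`` mapping and both function names are
hypothetical and exist only for illustration; a real client would load this
mapping from its own configuration.

```python
from typing import Dict

V1_JSON = "application/vnd.pypi.simple.v1+json"
V1_HTML = "application/vnd.pypi.simple.v1+html"

# Hypothetical user configuration: one version+format per repository URL.
CONFIGURED: Dict[str, str] = {
    "https://static.example.com/simple/v1+json/": V1_JSON,
    "https://static.example.com/simple/v1+html/": V1_HTML,
}


def accept_header(repository_url: str) -> str:
    """Return an Accept value naming only the configured content type."""
    return CONFIGURED[repository_url]


def is_expected(repository_url: str, content_type_header: str) -> bool:
    """Check a response's Content-Type against the configured format."""
    expected = CONFIGURED[repository_url]
    # Drop parameters such as "; charset=utf-8" before comparing.
    received = content_type_header.split(";", 1)[0].strip().lower()
    # text/html is an alias for the v1 HTML content type.
    if received == "text/html":
        received = V1_HTML
    return received == expected
```

Because the server may ignore the ``Accept`` header entirely in this mode,
checking the returned ``Content-Type`` is what actually catches a misconfigured
endpoint.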

TUF Support - PEP 458
---------------------

:pep:`458` requires that all API responses are hashable and that they can be
uniquely identified by a path relative to the repository root. For a Simple API
repository, the target path is the root of our API (e.g. ``/simple/`` on PyPI).
This creates challenges when accessing the API using a TUF client instead of
directly using a standard HTTP client, as the TUF client cannot handle the fact
that a target could have multiple different representations that all hash
differently.

:pep:`458` does not specify what the target path should be for the Simple API,
but TUF appears to require that the target paths be "file-like"; in other words,
a path like ``simple/PROJECT/`` is not acceptable, because it technically points
to a directory.

The saving grace is that the target path does not *have* to actually match the
URL being fetched from the Simple API, and it can just be a sigil that the
fetching code knows how to transform into the actual URL that needs to be
fetched. This same thing can hold true for other aspects of the actual HTTP
request, such as the ``Accept`` header.

Ultimately figuring out how to map a directory to a filename is out of scope for
this PEP (but it would be in scope for :pep:`458`), and this PEP defers making a
decision about how exactly to represent this inside of :pep:`458` metadata.

However, it appears that the current WIP branch against pip that attempts to
implement :pep:`458` is using a target path like ``simple/PROJECT/index.html``.
This could be modified to include the API version and serialization format using
something like ``simple/PROJECT/vnd.pypi.simple.vN.FORMAT``. So the v1 HTML
format would be ``simple/PROJECT/vnd.pypi.simple.v1.html`` and the v1 JSON
format would be ``simple/PROJECT/vnd.pypi.simple.v1.json``.
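
A mapping of that shape could be sketched as below. The ``tuf_target_path``
function is a hypothetical helper and not part of :pep:`458` or of any TUF
client; it only shows that a content type can be turned into a file-like target
name mechanically, and it accepts nothing but explicitly versioned content types
plus the ``text/html`` alias.

```python
import re

# Only explicit versions (v1, v2, ...) are accepted; the "latest" meta
# version deliberately does not match this pattern.
_CONTENT_TYPE = re.compile(r"application/vnd\.pypi\.simple\.(v\d+)\+(json|html)")


def tuf_target_path(project: str, content_type: str) -> str:
    """Build a file-like target path such as simple/PROJECT/vnd.pypi.simple.v1.json."""
    # text/html is an alias, so use the more explicit name.
    if content_type == "text/html":
        content_type = "application/vnd.pypi.simple.v1+html"
    match = _CONTENT_TYPE.fullmatch(content_type)
    if match is None:
        raise ValueError(f"no TUF target for content type: {content_type!r}")
    version, serialization = match.groups()
    return f"simple/{project}/vnd.pypi.simple.{version}.{serialization}"
```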

In this case, since ``text/html`` is an alias to
``application/vnd.pypi.simple.v1+html``, when interacting through TUF it will
likely make the most sense to normalize to the more explicit name.

Likewise the ``latest`` metaversion should not be included in the targets, only
explicitly declared versions should be supported.

Recommendations
===============

This section is non-normative, and represents what the PEP authors believe to be
the best default implementation decisions for something implementing this PEP,
but it does **not** represent any sort of requirement to match these decisions.

These decisions have been chosen to maximize the number of requests that can be
moved onto the newest version of an API, while maintaining the greatest amount
of compatibility. In addition, they also try to provide guardrails that push
clients into making the best choices they can.

It is recommended that servers:

- Support all 3 content types described in this PEP, using server-driven
  content negotiation, for as long as they reasonably can, or at least as long
  as they're receiving non trivial traffic that uses the HTML responses.

- When encountering an ``Accept`` header that does not contain any content types
  that they know how to work with, never return a ``300 Multiple Choices``
  response, and prefer to return a ``406 Not Acceptable`` response instead.

  - However, if choosing to use the endpoint configuration, prefer to return a
    ``200 OK`` response in the expected content type for that endpoint.

- When selecting an acceptable version, choose the highest version that the
  client supports, with the most expressive/featureful serialization format,
  taking into account the specificity of the client requests as well as any
  quality priority values they have expressed, and only use the ``text/html``
  content type as a last resort.
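
A minimal sketch of a server-side selection routine that follows these
recommendations is shown below. The names ``SUPPORTED``, ``LATEST``,
``parse_accept`` and ``select_content_type`` are hypothetical, the parser is
deliberately simplified rather than a complete HTTP ``Accept`` implementation,
and falling back to ``text/html`` when no ``Accept`` header is sent is an
assumption made for the benefit of legacy clients.

```python
from typing import Dict, Optional

# Content types this hypothetical server can produce, most preferred first.
SUPPORTED = [
    "application/vnd.pypi.simple.v1+json",
    "application/vnd.pypi.simple.v1+html",
    "text/html",
]

# The "latest" meta version is resolved to a concrete version up front, so
# the Content-Type that is sent back always names the real version.
LATEST = {
    "application/vnd.pypi.simple.latest+json": "application/vnd.pypi.simple.v1+json",
    "application/vnd.pypi.simple.latest+html": "application/vnd.pypi.simple.v1+html",
}


def parse_accept(header: str) -> Dict[str, float]:
    """Map each media range in an Accept header to its quality value."""
    ranges: Dict[str, float] = {}
    for item in header.split(","):
        media_range, *params = [part.strip() for part in item.split(";")]
        if not media_range:
            continue
        quality = 1.0  # entries without an explicit q default to 1
        for param in params:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0  # unparseable weight: treat as not acceptable
        media_range = media_range.lower()
        ranges[LATEST.get(media_range, media_range)] = quality
    return ranges


def select_content_type(accept: Optional[str]) -> Optional[str]:
    """Return the content type to serve, or None to signal 406 Not Acceptable."""
    if not accept:
        return "text/html"  # assumption: legacy default when no header is sent
    ranges = parse_accept(accept)

    def quality(content_type: str) -> float:
        main_type = content_type.split("/")[0]
        # The most specific matching range wins: exact, then type/*, then */*.
        for candidate in (content_type, f"{main_type}/*", "*/*"):
            if candidate in ranges:
                return ranges[candidate]
        return 0.0

    best, best_quality = None, 0.0
    for content_type in SUPPORTED:
        q = quality(content_type)
        if q > best_quality:  # ties keep the earlier, more preferred entry
            best, best_quality = content_type, q
    return best
```

Returning ``None`` rather than raising leaves the choice of how to report the
failure to the caller, which per the list above would normally be a
``406 Not Acceptable`` response.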

It is recommended that clients:

- Support all 3 content types described in this PEP, using server-driven
  content negotiation, for as long as they reasonably can.

- When constructing an ``Accept`` header, include all of the content types
  that they support.

  Clients should generally *not* include a quality priority value for their
  content types, unless they have implementation specific reasons that they
  want the server to take into account (for example, if they're using the
  standard library HTML parser and are worried that there may be some kinds of
  HTML responses that they're unable to parse in some edge cases).

  The one exception to this recommendation is that clients *should* include a
  low, non-zero quality value such as ``;q=0.01`` on the legacy ``text/html``
  content type, unless it is the only content type being requested. A value of
  ``;q=0`` should not be used for this, because in HTTP it marks the content
  type as not acceptable.

- Explicitly select what versions they are looking for, rather than using the
  ``latest`` meta version during normal operation.

- Check the ``Content-Type`` of the response and ensure it matches something
  that they were expecting.

FAQ
===

Does this mean PyPI is planning to drop support for HTML/PEP 503?
-----------------------------------------------------------------

No, PyPI has no plans at this time to drop support for :pep:`503` or HTML
responses.

While this PEP does give repositories the flexibility to do that, that largely
exists to ensure that things like the Endpoint Configuration mechanism are able
to work, and to ensure that clients do not make any assumptions that would
prevent, at some point in the future, gracefully dropping support for HTML.

The existing HTML responses incur almost no maintenance burden on PyPI and
there is no pressing need to remove them. The only real benefit to dropping them
would be to reduce the number of items cached in our CDN.
If in the future PyPI *does* wish to drop support for them, doing so would
almost certainly be the topic of a PEP, or at a minimum a public, open
discussion, and would be informed by metrics showing any impact to end users.

Why JSON instead of X format?
-----------------------------

JSON parsers are widely available in most, if not every, language. A JSON
parser is also available in the Python standard library. It's not the perfect
format, but it's good enough.

Why not add X feature?
----------------------

The general goal of this PEP is to change or add very little. We will instead
focus largely on translating the existing information contained within our HTML
responses into a sensible JSON representation. This will include :pep:`658`
metadata required for packaging tooling.

The only real new capability that is added in this PEP is the ability to have
multiple hashes for a single file. That was done because the current mechanism
being limited to a single hash has made it painful in the past to migrate
hashes (md5 to sha256) and the cost of making the hashes a dictionary and
allowing multiple is pretty low.

The API was generally designed to allow further extension through adding new
keys, so if there's some new piece of data that an installer might need, future
PEPs can easily make that available.

Why is the root URL a dictionary instead of a list?
---------------------------------------------------

The most natural direct translation of the root URL being a list of links is to
turn it into a list of objects. However, stepping back, that's not the most
natural way to actually represent this data. The list was a result of an HTML
limitation that we had to work around. With a list (either of ``<a>`` tags, or
objects) there's nothing stopping you from listing the same project twice and
other unwanted patterns.

A dictionary also allows for an average of constant-time access given the
project name.

Why include the filename when the URL has it already?
-----------------------------------------------------

We could reduce the size of our responses by removing the ``filename`` key and
expecting clients to pull that information out of the URL.

Currently this PEP chooses not to do that, largely because :pep:`503` explicitly
required that the filename be available via the anchor tag of the links, though
that was largely because *something* had to be there. It's not clear if
repositories in the wild always have a filename as the last part of the URL or
if they're relying on the filename in the anchor tag.

It also makes the responses slightly nicer to read for a human, as you get a
nice short unique identifier.

If we gained reasonable confidence that we could mandate that the filename is
in the URL, then we could drop this data and reduce the size of the JSON
response.

Why not break out other pieces of information from the filename?
----------------------------------------------------------------

Currently clients are expected to parse a number of pieces of information from
the filename such as project name, version, ABI tags, etc. We could break these
out and add them as keys to the file object.

This PEP has chosen not to do that because doing so would increase the size of
the API responses, and most clients are going to require the ability to parse
that information out of file names anyway, regardless of what the API does. Thus
it makes sense to keep that functionality inside of the clients.

Why Content Negotiation instead of multiple URLs?
-------------------------------------------------

Another reasonable way to implement this would be to duplicate the API routes
and include some marker in the URL itself for JSON, such as making the URLs be
something like ``/simple/foo.json``, ``/simple/_index.json``, etc.

This makes some things simpler like TUF integration and fully static serving of
a repository (since ``.json`` files can just be written out).
However, this approach has three pretty major issues:

- Our current URL structure relies on the fact that there is a URL that
  represents the "root", ``/``, to serve the list of projects. If we want to
  have separate URLs for JSON and HTML, we would need to come up with some way
  to have two root URLs.

  Something like ``/`` being HTML and ``/_index.json`` being JSON could work,
  since ``_index`` isn't a valid project name. But ``/`` being HTML doesn't
  work great if a repository wants to remove support for HTML.

  Another option could be moving all of the existing HTML URLs under a
  namespace while making a new namespace for JSON. Since ``/<project>/`` was
  defined, we would have to make these namespaces not valid project names, so
  something like ``/_html/`` and ``/_json/`` could work, then just redirect
  the non-namespaced URLs to whatever the "default" for that repository is
  (likely HTML, or JSON if they've disabled HTML).

- With separate URLs, there's no good way to support zero configuration
  discovery that a repository supports the JSON URLs without making additional
  HTTP requests to determine if the JSON URL exists or not.

  The most naive implementation of this would be to request the JSON URL and
  fall back to the HTML URL for *every* single request, but that would perform
  horribly and violate the goal of minimal additional HTTP requests.

  The most likely implementation of this would be to make some sort of
  repository level configuration file that somehow indicates what is
  supported. We would have the same namespace problem as above, with the same
  solution: something like ``/_config.json`` could hold that data, and a
  client could first make an HTTP request to that, and if it exists pull it
  down and parse it to learn about the capabilities of this particular
  repository.
- The use of ``Accept`` also allows us to add versioning into this field.

All that being said, it is the opinion of this PEP that those three issues
combined make using separate API routes a less desirable solution than relying
on content negotiation to select the most ideal representation of the data.


Does this mean that static servers are no longer supported?
-----------------------------------------------------------

In short, no, static servers are still (almost) fully supported by this PEP.

The specifics of how they are supported will depend on the static server in
question. For example:

- **S3:** S3 fully supports custom content types, however it does not support
  any form of content negotiation. In order to have a server hosted on S3, you
  would have to use the "Endpoint configuration" style of negotiation, and
  users would have to configure their clients explicitly.
- **GitHub Pages:** GitHub Pages does not support custom content types, so the
  S3 solution is not currently workable, which means that only ``text/html``
  repositories would function.
- **Apache:** Apache fully supports server-driven content negotiation, and
  would just need to be configured to map the custom content types to a
  specific extension.


Doesn't TUF support require having different URLs for each representation?
--------------------------------------------------------------------------

While in TUF each target can only have a single representation, and by default
that is assumed to map exactly to the target path that is being referenced
within TUF, there is actually no requirement that the target path is the same
as the server path, nor any requirement that the same data cannot be
represented by multiple targets.

In fact, TUF doesn't support the Simple API URLs as they are already, because
TUF assumes that a target points to a filename, but all of the Simple API URLs
are directories.
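
The decoupling of target path from server path can be pictured as a small
translation step. The sketch below is only an illustration under stated
assumptions: the function name ``request_for_target`` and the lookup table are
hypothetical, the ``Accept`` value paired with ``index.html`` and the whole
``.json`` entry are assumptions, and the target naming is one possible
convention rather than something TUF or this PEP mandates.

```python
from typing import Dict, Tuple

# Hypothetical mapping from the last component of a TUF target path to the
# ``Accept`` header sent with the HTTP request. The "index.html" value and
# the ".json" entry are assumptions made for illustration.
ACCEPT_FOR_TARGET_NAME: Dict[str, str] = {
    "index.html": "text/html",
    "vnd.pypi.simple.v1.html": "application/vnd.pypi.simple.v1+html",
    "vnd.pypi.simple.v1.json": "application/vnd.pypi.simple.v1+json",
}


def request_for_target(target: str) -> Tuple[str, str]:
    """Translate a TUF target path into (server path, Accept header)."""
    # Split off the final component; everything before it is the directory.
    directory, _, name = target.strip("/").rpartition("/")
    accept = ACCEPT_FOR_TARGET_NAME[name]  # raises KeyError for unknown names
    # Simple API URLs are directories, so the server path ends with a slash.
    return f"/{directory}/", accept
```

Several distinct targets can then resolve to the same server path and differ
only in the ``Accept`` header that accompanies the request.
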
Thus regardless of this PEP, there is going to have to be something that
translates between the naming of the targets within the TUF metadata and the
actual requests being made to the server.

Currently the WIP TUF implementation for pip maps a target like
``simple/PROJECT/index.html`` to an HTTP request to fetch
``/simple/PROJECT/``. However, there is no reason that it could not be
extended to map a target like ``/simple/PROJECT/vnd.pypi.simple.v1.html`` to
an HTTP request to fetch ``/simple/PROJECT/`` with an ``Accept`` header of
``application/vnd.pypi.simple.v1+html``.


Why not add an ``application/json`` alias like ``text/html``?
-------------------------------------------------------------

This PEP believes that it is best for both clients and servers to be explicit
about the types of the API responses that are being used, and a content type
like ``application/json`` is the exact opposite of explicit.

The ``text/html`` alias exists as a compromise primarily to ensure that
existing consumers of the API continue to function as they already do. There
is no such expectation of existing clients using the Simple API with an
``application/json`` content type.

In addition, ``application/json`` has no versioning in it, which means that
if there is ever a ``2.0`` version of the Simple API, we will be forced to make
a decision. Should ``application/json`` preserve backwards compatibility and
continue to be an alias for ``application/vnd.pypi.simple.v1+json``, or should
it be updated to be an alias for ``application/vnd.pypi.simple.v2+json``?

This problem doesn't exist for ``text/html``, because the assumption is that
HTML will remain a legacy format, and will likely not gain *any* new features,
much less features that require breaking compatibility.
So having it be an alias for ``application/vnd.pypi.simple.v1+html`` is
effectively the same as having it be an alias for
``application/vnd.pypi.simple.latest+html``, since ``1.0`` will likely be the
only HTML version to exist.

The largest benefit to adding the ``application/json`` content type is that
some hosting services do not allow custom content types and require you to
select one of their preset content types. The main example of this is GitHub
Pages: the lack of ``application/json`` support in this PEP means that static
repositories that want to serve JSON will not be able to be hosted on GitHub
Pages unless GitHub adds the ``application/vnd.pypi.simple.v1+json`` content
type.

This PEP believes that the benefits are not large enough to add that content
type alias at this time, and that its inclusion would likely be a footgun
waiting for unsuspecting people to accidentally pick it up, especially given
that we can always add it in the future, whereas removing things is a lot
harder to do.


Appendix 1: Survey of use cases to cover
========================================

This was done through a discussion between ``pip``, ``PyPI``, and
``bandersnatch`` maintainers, who are the two first potential users for the
new API. This is how they use the Simple + JSON APIs today:

- ``pip``:

  - List of all files for a particular release
  - Metadata of each individual artifact:

    - was it yanked? (``data-yanked``)
    - what's the python-requires? (``data-python-requires``)
    - what's the hash of this file? (currently, hash in URL)
    - Full metadata (``data-dist-info-metadata``)
    - [Bonus] what are the declared dependencies, if available
      (list-of-strings, null if unavailable)?

- ``bandersnatch`` - Only uses legacy JSON API + XMLRPC today:

  - Generates Simple HTML rather than copying from PyPI

    - Maybe this changes with the new API and we verbatim pull these API
      assets from PyPI

  - List of all files for a particular release.

    - Work out the URL for release files to download

  - Metadata of each individual artifact.

    - Write out the JSON to mirror storage today (disk/S3)

      - Required metadata used (via the Package class):

        - ``metadata["info"]``
        - ``metadata["last_serial"]``
        - ``metadata["releases"]``

          - digests
          - URL

  - XML-RPC calls (we'd love to deprecate these, but we don't think they
    should go in the Simple API)

    - [Bonus] Get packages since serial X (or all)

      - XML-RPC Call: ``changelog_since_serial``

    - [Bonus] Get all packages with serial

      - XML-RPC Call: ``list_packages_with_serial``


Copyright
=========

This document is placed in the public domain or under the CC0-1.0-Universal
license, whichever is more permissive.