PEP: 694
Title: Upload 2.0 API for Python Package Repositories
Author: Donald Stufft
Discussions-To: https://discuss.python.org/t/pep-694-upload-2-0-api-for-python-package-repositories/16879
Status: Draft
Type: Standards Track
Topic: Packaging
Content-Type: text/x-rst
Created: 11-Jun-2022
Post-History: `27-Jun-2022 `__


Abstract
========

There is currently no standardized API for uploading files to a Python package
repository such as PyPI. Instead, everyone has been forced to reverse engineer
the non-standard API from PyPI.

That API, while functional, leaks a lot of implementation details of the
original PyPI code base, which have since had to be faithfully replicated in
the new code base and in alternative implementations.

Beyond the above, there are a number of major issues with the current API:

- It is a fully synchronous API, which means that we're forced to have a single
  request being held open for potentially a long time, both for the upload
  itself, and then while the repository processes the uploaded file to
  determine success or failure.

- It does not support any mechanism for resuming an upload. With the largest
  file size on PyPI being just under 1GB, that's a lot of wasted bandwidth if
  a large file has a network blip towards the end of an upload.

- It treats a single file as the atomic unit of operation, which can be
  problematic when a release might have multiple binary wheels. This can cause
  people to get different versions while the files are uploading, and if the
  sdist happens to not go last, some hard to build packages may end up being
  built from source.

- It has very limited support for communicating back to the user, with no
  support for multiple errors, warnings, deprecations, etc. It is limited
  entirely to the HTTP status code and reason phrase, of which the reason
  phrase has been deprecated since HTTP/2
  (:rfc:`RFC 7540 <7540#section-8.1.2.4>`).
- The metadata for a release/file is submitted alongside the file, however this
  metadata is famously unreliable, and most installers instead choose to
  download the entire file and read the metadata from it, in part due to that
  unreliability.

- There is no mechanism for allowing a repository to do any sort of sanity
  checks before bandwidth starts getting expended on an upload, whereas a lot
  of the cases of invalid metadata or incorrect permissions could be checked
  prior to upload.

- It has no support for "staging" a draft release prior to publishing it to the
  repository.

- It has no support for creating new projects without uploading a file.

This PEP proposes a new API for uploads, and deprecates the existing
non-standard API.


Status Quo
==========

This does not attempt to be fully exhaustive documentation of the current API,
but gives a high level overview of the existing API.


Endpoint
--------

The existing upload API (and the now removed register API) lives at a URL,
currently ``https://upload.pypi.org/legacy/``, and to communicate which
specific API you want to call, you add a ``:action`` URL parameter with a value
of ``file_upload``. The values of ``submit``, ``submit_pkg_info``, and
``doc_upload`` also used to be supported, but no longer are.

It also has a ``protocol_version`` parameter, in theory to allow new versions
of the API to be written, but in practice that has never happened, and the
value is always ``1``.

So in practice, on PyPI, the endpoint is
``https://upload.pypi.org/legacy/?:action=file_upload&protocol_version=1``.


Encoding
--------

The data to be submitted is submitted as a ``POST`` request with the content
type of ``multipart/form-data``. This is for historical reasons: this API was
not actually designed as an API, but rather was a form on the initial PyPI
implementation, and client code was then written to programmatically submit
that form.
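As a small illustration of the legacy endpoint described above, the following sketch assembles the full URL from the base endpoint and the two query parameters. The helper name ``legacy_upload_url`` is hypothetical and only used for this example; the colon is kept unescaped so the result matches the URL quoted in the text.

```python
from urllib.parse import urlencode

# Base endpoint quoted in the text above.
LEGACY_ENDPOINT = "https://upload.pypi.org/legacy/"


def legacy_upload_url(endpoint: str = LEGACY_ENDPOINT) -> str:
    """Build the legacy upload URL (hypothetical helper, for illustration)."""
    # safe=":" keeps the leading colon of ":action" from being percent-encoded.
    query = urlencode(
        {":action": "file_upload", "protocol_version": "1"},
        safe=":",
    )
    return f"{endpoint}?{query}"


if __name__ == "__main__":
    print(legacy_upload_url())
```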

Content
-------

Roughly speaking, the metadata contained within the package is submitted as
parts where the content-disposition is ``form-data``, and the name is the name
of the field. The names of these various pieces of metadata are not documented,
and they sometimes, but not always, match the names used in the ``METADATA``
files. The casing rarely matches, and overall the ``METADATA`` to ``form-data``
conversion is extremely inconsistent.

The file itself is then sent as an ``application/octet-stream`` part with the
name of ``content``, and if there is a PGP signature attached, then it will be
included as an ``application/octet-stream`` part with the name of
``gpg_signature``.


Specification
=============

This PEP traces the root cause of most of the issues with the existing API to
roughly two things:

- The metadata is submitted alongside the file, rather than being parsed from
  the file itself.

  - This is actually fine if used as a pre-check, but it should be validated
    against the actual ``METADATA`` or similar files within the distribution.

- It supports a single request, using nothing but form data, that either
  succeeds or fails, and everything is done and contained within that single
  request.

We then propose a multi-request workflow, that essentially boils down to:

1. Initiate an upload session.
2. Upload the file(s) as part of the upload session.
3. Complete the upload session.
4. (Optional) Check the status of an upload session.

All URLs described here will be relative to the root endpoint, which may be
located anywhere within the URL structure of a domain. So it could be at
``https://upload.example.com/``, or ``https://example.com/upload/``.


Versioning
----------

This PEP uses the same ``MAJOR.MINOR`` versioning system as used in :pep:`691`,
but it is otherwise independently versioned. The existing API is considered by
this spec to be version ``1.0``, but it otherwise does not attempt to modify
that API in any way.

Endpoints
---------

Create an Upload Session
~~~~~~~~~~~~~~~~~~~~~~~~

To create a new upload session, you can send a ``POST`` request to ``/``,
with a payload that looks like:

.. code-block:: json

    {
      "meta": {
        "api-version": "2.0"
      },
      "name": "foo",
      "version": "1.0"
    }

This currently has three keys, ``meta``, ``name``, and ``version``.

The ``meta`` key is included in all payloads, and it describes information
about the payload itself.

The ``name`` key is the name of the project that this session is attempting to
add files to.

The ``version`` key is the version of the project that this session is
attempting to add files to.

If creating the session was successful, then the server must return a response
that looks like:

.. code-block:: json

    {
      "meta": {
        "api-version": "2.0"
      },
      "urls": {
        "upload": "...",
        "draft": "...",
        "publish": "..."
      },
      "valid-for": 604800,
      "status": "pending",
      "files": {},
      "notices": [
        "a notice to display to the user"
      ]
    }

Besides the ``meta`` key, this response has five keys, ``urls``,
``valid-for``, ``status``, ``files``, and ``notices``.

The ``urls`` key is a dictionary mapping identifiers to URLs related to this
session.

The ``valid-for`` key is an integer representing how long, in seconds, until
the server itself will expire this session (and thus all of the URLs contained
in it). The session **SHOULD** live at least this much longer unless the client
itself has canceled the session. Servers **MAY** choose to *increase* this
time, but should never *decrease* it, except naturally through the passage of
time.

The ``status`` key is a string that contains one of ``pending``,
``published``, ``errored``, or ``canceled``. This string represents the overall
status of the session.

The ``files`` key is a mapping containing the filenames that have been uploaded
to this session, to a mapping containing details about each file.

The ``notices`` key is an optional key that points to an array of notices that
the server wishes to communicate to the end user that are not specific to any
one file.

For each filename in ``files`` the mapping has three keys, ``status``, ``url``,
and ``notices``.

The ``status`` key is the same as the top level ``status`` key, except that it
indicates the status of a specific file.

The ``url`` key is the *absolute* URL that the client should upload that
specific file to (or use to delete that file).

The ``notices`` key is an optional key, that is an array of notices that the
server wishes to communicate to the end user that are specific to this file.

The required response code to a successful creation of the session is a
``201 Created`` response and it **MUST** include a ``Location`` header that is
the URL for this session, which may be used to check its status or cancel it.

For the ``urls`` key, there are currently three keys that may appear:

The ``upload`` key, which is the upload endpoint for this session to initiate
a file upload.

The ``draft`` key, which is the repository URL that these files are available
at prior to publishing.

The ``publish`` key, which is the endpoint to trigger publishing the session.

In addition to the above, if a second session is created for the same
name+version pair, then the upload server **MUST** return the already existing
session rather than creating a new, empty one.


Upload Each File
~~~~~~~~~~~~~~~~

Once you have initiated an upload session for one or more files, then you have
to actually upload each of those files.

There is no set endpoint for actually uploading the file. It is given to the
client by the server as part of the creation of the upload session, and clients
**MUST NOT** assume that there is any commonality to what those URLs look like
from one session to the next.

To initiate a file upload, a client sends a ``POST`` request to the upload URL
in the session, with a request body that looks like:

.. code-block:: json

    {
      "meta": {
        "api-version": "2.0"
      },
      "filename": "foo-1.0.tar.gz",
      "size": 1000,
      "hashes": {"sha256": "...", "blake2b": "..."},
      "metadata": "..."
    }

Besides the standard ``meta`` key, this currently has 4 keys:

- ``filename``: The filename of the file being uploaded.
- ``size``: The size, in bytes, of the file that is being uploaded.
- ``hashes``: A mapping of hash names to hex encoded digests. Each of these
  digests is the digest of that file, when hashed by the hash identified in the
  name.

  By default, any hash algorithm available via `hashlib `_ (specifically any
  that can be passed to ``hashlib.new()`` and does not require additional
  parameters) can be used as a key for the hashes dictionary. At least one
  secure algorithm from ``hashlib.algorithms_guaranteed`` **MUST** always be
  included. At the time of this PEP, ``sha256`` specifically is recommended.

  Multiple hashes may be passed at a time, but all hashes must be valid for the
  file.
- ``metadata``: An optional key that is a string containing the file's
  `core metadata `_.

Servers **MAY** use the data provided in this request to do some sanity
checking prior to allowing the file to be uploaded, which may include but is
not limited to:

- Checking if the ``filename`` already exists.
- Checking if the ``size`` would invalidate some quota.
- Checking if the contents of the ``metadata``, if provided, are valid.

If the server determines that the client should attempt the upload, it will
return a ``201 Created`` response, with an empty body, and a ``Location``
header pointing to the URL that the file itself should be uploaded to.

At this point, the status of the session should show the filename, with the
above URL included in it.


Upload Data
+++++++++++

To upload the file, a client has two choices: it may upload the file as either
a single chunk, or as multiple chunks.
Either option is acceptable, but it is recommended that most clients should
choose to upload each file as a single chunk as that requires fewer requests
and typically has better performance.

However for particularly large files, uploading within a single request may
result in timeouts, so larger files may need to be uploaded in multiple chunks.

In either case, the client must generate a unique token (or nonce) for each
upload attempt for a file, and **MUST** include that token in each request in
the ``Upload-Token`` header. The ``Upload-Token`` is a binary blob encoded
using base64 surrounded by a ``:`` on either side. Clients **SHOULD** use at
least 32 bytes of cryptographically random data. You can generate it using the
following:

.. code-block:: python

    import base64
    import secrets

    # 32 random bytes, base64 encoded, wrapped in a ":" on either side.
    header = ":" + base64.b64encode(secrets.token_bytes(32)).decode() + ":"

The one time that it is permissible to omit the ``Upload-Token`` from an upload
request is when a client wishes to opt out of the resumable or chunked file
upload feature completely. In that case, the client **MAY** omit the
``Upload-Token``, and the file must be successfully uploaded in a single HTTP
request, and if it fails, the entire file must be resent in another single HTTP
request.

To upload in a single chunk, a client sends a ``POST`` request to the URL from
the session response for that filename. The client **MUST** include a
``Content-Length`` header that is equal to the size of the file in bytes, and
this **MUST** match the ``size`` given when the file upload was initiated.

As an example, if uploading a 100,000 byte file, you would send headers like::

    Content-Length: 100000
    Upload-Token: :nYuc7Lg2/Lv9S4EYoT9WE6nwFZgN/TcUXyk9wtwoABg=:

If the upload completes successfully, the server **MUST** respond with a
``201 Created`` status. At this point this file **MUST NOT** be present in the
repository, but merely staged until the upload session has completed.
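
The single chunk case above can be sketched as a pair of small helpers that produce the request headers. This is only an illustration under stated assumptions: the function names ``new_upload_token`` and ``single_chunk_headers`` are hypothetical, and the actual HTTP request is left to whatever client library is in use.

```python
import base64
import secrets


def new_upload_token(num_bytes: int = 32) -> str:
    """Return a fresh Upload-Token value: base64 data wrapped in colons."""
    raw = secrets.token_bytes(num_bytes)
    return ":" + base64.b64encode(raw).decode("ascii") + ":"


def single_chunk_headers(declared_size: int, data: bytes, token: str) -> dict[str, str]:
    """Build headers for uploading a whole file in one POST request.

    ``declared_size`` is the ``size`` value the client sent when it
    initiated the file upload; Content-Length has to agree with it.
    """
    if len(data) != declared_size:
        raise ValueError(
            f"file is {len(data)} bytes but {declared_size} bytes were declared"
        )
    return {
        # File bodies use application/octet-stream, not the JSON content type.
        "Content-Type": "application/octet-stream",
        "Content-Length": str(len(data)),
        "Upload-Token": token,
    }


if __name__ == "__main__":
    payload = b"x" * 1000
    print(single_chunk_headers(1000, payload, new_upload_token()))
```

A client that opts out of resumable uploads entirely could drop the ``Upload-Token`` entry from the returned mapping.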

To upload in multiple chunks, a client sends multiple ``POST`` requests to the
same URL as before, one for each chunk.

This time however, the ``Content-Length`` is equal to the size, in bytes, of
the chunk that they are sending. In addition, the client **MUST** include an
``Upload-Offset`` header which indicates the byte offset at which the content
included in this request starts, and an ``Upload-Incomplete`` header set to
``1``.

As an example, if uploading a 100,000 byte file in 1000 byte chunks, and this
chunk is the second one (byte offsets 1000 through 1999, counting from ``0``),
you would send headers like::

    Content-Length: 1000
    Upload-Token: :nYuc7Lg2/Lv9S4EYoT9WE6nwFZgN/TcUXyk9wtwoABg=:
    Upload-Offset: 1000
    Upload-Incomplete: 1

However, the **final** chunk of data omits the ``Upload-Incomplete`` header,
since at that point the upload is no longer incomplete.

For each successful chunk, the server **MUST** respond with a ``202 Accepted``
response, except for the final chunk, which **MUST** be a ``201 Created``.

The following constraints are placed on uploads regardless of whether they are
single chunk or multiple chunks:

- A client **MUST NOT** perform multiple ``POST`` requests in parallel for the
  same file to avoid race conditions and data loss or corruption. The server
  **MAY** terminate any ongoing ``POST`` request that utilizes the same
  ``Upload-Token``.
- If the offset provided in ``Upload-Offset`` is not ``0`` or the next chunk
  in an incomplete upload, then the server **MUST** respond with a
  ``409 Conflict``.
- Once an upload has started with a specific token, you may not use another
  token for that file without deleting the in progress upload.
- Once a file has uploaded successfully, you may initiate another upload for
  that file, and doing so will replace that file.


Resume Upload
+++++++++++++

To resume an upload, you first have to know how much of the data the server
has already received, regardless of whether you were originally uploading the
file as a single chunk, or in multiple chunks.

To get the status of an individual upload, a client can make a ``HEAD``
request with their existing ``Upload-Token`` to the same URL they were
uploading to.

The server **MUST** respond with a ``204 No Content`` response, with an
``Upload-Offset`` header that indicates what offset the client should continue
uploading from. If the server has not received any data, then this would be
``0``; if it has received 1007 bytes then it would be ``1007``.

Once the client has retrieved the offset that they need to start from, they can
upload the rest of the file as described above, either in a single request
containing all of the remaining data or in multiple chunks.


Canceling an In Progress Upload
+++++++++++++++++++++++++++++++

If a client wishes to cancel an upload of a specific file, for instance because
they need to upload a different file, they may do so by issuing a ``DELETE``
request to the file upload URL with the ``Upload-Token`` used to upload the
file in the first place.

A successful cancellation request **MUST** respond with a ``204 No Content``.


Delete an Uploaded File
+++++++++++++++++++++++

Already uploaded files may be deleted by issuing a ``DELETE`` request to the
file upload URL without the ``Upload-Token``.

A successful deletion request **MUST** respond with a ``204 No Content``.


Session Status
~~~~~~~~~~~~~~

Similarly to file upload, the session URL is provided in the response to
creating the upload session, and clients **MUST NOT** assume that there is any
commonality to what those URLs look like from one session to the next.

To check the status of a session, clients issue a ``GET`` request to the
session URL, to which the server will respond with the same response that they
got when they initially created the upload session, except with any changes to
``status``, ``valid-for``, or updated ``files`` reflected.
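
To make the session status payload more concrete, here is a minimal sketch of how a client might digest it once it has been decoded from JSON. The helper names ``files_by_status`` and ``collect_notices``, the sample filenames, and the sample notice text are all hypothetical; only the key names come from the response format described earlier.

```python
from collections import defaultdict


def files_by_status(session: dict) -> dict[str, list[str]]:
    """Group the filenames in a session status payload by their status."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for filename, details in session.get("files", {}).items():
        grouped[details["status"]].append(filename)
    return dict(grouped)


def collect_notices(session: dict) -> list[str]:
    """Gather session-level notices, then per-file notices with a prefix."""
    notices = list(session.get("notices", []))
    for filename, details in session.get("files", {}).items():
        # "notices" is optional at both levels, hence the .get() defaults.
        for notice in details.get("notices", []):
            notices.append(f"{filename}: {notice}")
    return notices


if __name__ == "__main__":
    # Hypothetical decoded response; URLs are elided as in the PEP examples.
    sample = {
        "meta": {"api-version": "2.0"},
        "urls": {"upload": "...", "draft": "...", "publish": "..."},
        "valid-for": 604800,
        "status": "pending",
        "files": {
            "foo-1.0.tar.gz": {"status": "pending", "url": "..."},
            "foo-1.0-py3-none-any.whl": {
                "status": "errored",
                "url": "...",
                "notices": ["example notice"],
            },
        },
        "notices": ["a notice to display to the user"],
    }
    print(files_by_status(sample))
    print(collect_notices(sample))
```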

Session Cancellation
~~~~~~~~~~~~~~~~~~~~

To cancel an upload session, a client issues a ``DELETE`` request to the same
session URL as before. At that point, the server marks the session as canceled,
**MAY** purge any data that was uploaded as part of that session, and future
attempts to access that session URL or any of the file upload URLs **MAY**
return a ``404 Not Found``.

To prevent a lot of dangling sessions, servers may also choose to cancel a
session of their own accord. It is recommended that servers expunge their
sessions after no less than a week, but each server may choose their own
schedule.


Session Completion
~~~~~~~~~~~~~~~~~~

To complete a session, and publish the files that have been included in it,
a client **MUST** send a ``POST`` request to the ``publish`` URL in the
session status payload.

If the server is able to immediately complete the session, it may do so
and return a ``201 Created`` response. If it is unable to immediately
complete the session (for instance, if it needs to do processing that may
take longer than reasonable in a single HTTP request), then it may return
a ``202 Accepted`` response.

In either case, the server should include a ``Location`` header pointing
back to the session status URL, and if the server returned a ``202 Accepted``,
the client may poll that URL to watch for the status to change.


Errors
------

All error responses that contain a body will have a body that looks like:

.. code-block:: json

    {
      "meta": {
        "api-version": "2.0"
      },
      "message": "...",
      "errors": [
        {
          "source": "...",
          "message": "..."
        }
      ]
    }

Besides the standard ``meta`` key, this has two top level keys, ``message``
and ``errors``.

The ``message`` key is a singular message that encapsulates all errors that
may have happened on this request.

The ``errors`` key is an array of specific errors, each of which contains
a ``source`` key, which is a string that indicates what the source of the
error is, and a ``message`` key for that specific error.
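
As an illustration of the error body layout, the sketch below turns a decoded error payload into lines suitable for showing to a user. The helper name ``format_error_body`` and the sample values are hypothetical; the code assumes nothing about the strings beyond the key names given above.

```python
def format_error_body(body: dict) -> list[str]:
    """Render an error response body as human-readable lines."""
    lines = [body.get("message", "")]
    for error in body.get("errors", []):
        # Each specific error carries a "source" and its own "message".
        lines.append(f"  [{error['source']}] {error['message']}")
    return lines


if __name__ == "__main__":
    # Hypothetical example payload, already decoded from JSON.
    sample = {
        "meta": {"api-version": "2.0"},
        "message": "upload rejected",
        "errors": [
            {"source": "size", "message": "example: quota exceeded"},
            {"source": "metadata", "message": "example: invalid field"},
        ],
    }
    print("\n".join(format_error_body(sample)))
```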

The ``message`` and ``source`` strings do not have any specific meaning, and
are intended for human interpretation to figure out what the underlying issue
was.


Content-Types
-------------

Like :pep:`691`, this PEP proposes that all requests and responses from the
Upload API will have a standard content type that describes what the content
is, what version of the API it represents, and what serialization format has
been used.

The structure of this content type will be:

.. code-block:: text

    application/vnd.pypi.upload.$version+format

Since only major versions should be disruptive to systems attempting to
understand one of these API content bodies, only the major version will be
included in the content type, and will be prefixed with a ``v`` to clarify
that it is a version number.

Unlike :pep:`691`, this PEP does not change the existing ``1.0`` API in any
way, so servers will be required to host the new API described in this PEP at
a different endpoint than the existing upload API.

This means that for the new 2.0 API, the content types would be:

- **JSON:** ``application/vnd.pypi.upload.v2+json``

In addition to the above, a special "meta" version is supported named
``latest``, whose purpose is to allow clients to request the absolute latest
version, without having to know ahead of time what that version is. It is
recommended however, that clients be explicit about what versions they support.

These content types **DO NOT** apply to the file uploads themselves, only to
the other API requests/responses in the upload API. The files themselves should
use the ``application/octet-stream`` content-type.


Version + Format Selection
--------------------------

Again similar to :pep:`691`, this PEP standardizes on using server-driven
content negotiation to allow clients to request different versions or
serialization formats, which includes the ``format`` URL parameter.
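
The content type structure described above can be derived mechanically from an API version string. The following is a minimal sketch; the helper names ``upload_content_type`` and ``negotiation_headers`` are hypothetical, and the use of the standard ``Accept`` request header to express the client's preference is an assumption based on how HTTP server-driven content negotiation usually works.

```python
def upload_content_type(api_version: str, fmt: str = "json") -> str:
    """Map a MAJOR.MINOR API version to its content type string."""
    # Only the major version appears in the content type, prefixed with "v".
    major = api_version.split(".", 1)[0]
    return f"application/vnd.pypi.upload.v{major}+{fmt}"


def negotiation_headers(api_version: str = "2.0") -> dict[str, str]:
    """Headers for a JSON API request (not for the file bodies themselves)."""
    content_type = upload_content_type(api_version)
    return {
        "Content-Type": content_type,
        # Assumption: the client states what it can accept via Accept.
        "Accept": content_type,
    }


if __name__ == "__main__":
    print(negotiation_headers())
```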

Since this PEP expects the existing legacy ``1.0`` upload API to exist at a
different endpoint, and it currently only provides for JSON serialization, this
mechanism is not particularly useful, and clients only have a single version
and serialization they can request. However clients **SHOULD** be set up to
handle content negotiation gracefully in the case that additional formats or
versions are added in the future.


FAQ
===

Does this mean PyPI is planning to drop support for the existing upload API?
----------------------------------------------------------------------------

At this time PyPI does not have any specific plans to drop support for the
existing upload API.

Unlike with :pep:`691` there are wide benefits to doing so, so it is likely
that we will want to drop support for it at some point in the future, but
until this API is implemented, and receiving broad use, it would be premature
to make any plans for actually dropping support for it.


Is this Resumable Upload protocol based on anything?
----------------------------------------------------

Yes!

It's actually the protocol specified in an `Active Internet-Draft `_, where
the authors took what they learned implementing `tus `_ to provide the idea of
resumable uploads in a wholly generic, standards based way.

The only deviation we've made from that spec is that we don't use the
``104 Upload Resumption Supported`` informational response in the first
``POST`` request. This decision was made for a few reasons:

- The ``104 Upload Resumption Supported`` is the only part of that draft which
  does not rely entirely on things that are already supported in the existing
  standards, since it was adding a new informational status.
- Many clients and web frameworks don't support ``1xx`` informational
  responses in a very good way, if at all; adding it would complicate
  implementation for very little benefit.
- The purpose of the ``104 Upload Resumption Supported`` support is to allow
  clients to determine that an arbitrary endpoint that they're interacting with
  supports resumable uploads. Since this PEP is mandating support for that in
  servers, clients can just assume that the server they are interacting with
  supports it, which makes using it unneeded.
- In theory, if the support for ``1xx`` responses got resolved and the draft
  gets accepted with it in, we can add that in at a later date without changing
  the overall flow of the API.

There is a risk that the above draft doesn't get accepted, but even if it does
not, that doesn't actually affect us. It would just mean that our support for
resumable uploads is an application specific protocol, but is still wholly
standards compliant.


Open Questions
==============

Multipart Uploads vs tus
------------------------

This PEP currently bases the actual uploading of files on an internet draft
from tus.io that supports resumable file uploads.

That protocol requires a few things:

- That the client selects a secure ``Upload-Token`` that they use to identify
  uploading a single file.
- That if clients don't upload the entire file in one shot, that they have
  to submit the chunks serially, and in the correct order, with all but the
  final chunk having an ``Upload-Incomplete: 1`` header.
- Resumption of an upload is essentially just querying the server to see how
  much data they've gotten, then sending the remaining bytes (either as a
  single request, or in chunks).
- The upload implicitly is completed when the server successfully gets all of
  the data from the client.

This has one big benefit: if a client doesn't care about resuming their
upload, the client-side work to support resumable uploads can be completely
ignored. They can just ``POST`` the file to the URL, and if it doesn't succeed,
they can just ``POST`` the whole file again.

The other benefit is that even if you do want to support resumption, you can
still just ``POST`` the file, and unless you *need* to resume the upload,
that's all you have to do.

Another, possibly theoretical, benefit is that for hashing the uploaded files,
the serial chunks requirement means that the server can maintain hashing state
between requests, update it for each request, then write that file back to
storage. Unfortunately this isn't actually possible to do with Python's
hashlib, though there are some libraries like `Rehash `_ that implement it,
but they don't support every hash that hashlib does (specifically not blake2
or sha3 at the time of writing).

We might also need to reconstitute the upload for processing anyway to do
things like extract metadata, etc. from it, which would make it a moot point.

The downside is that there is no ability to parallelize the upload of a single
file because each chunk has to be submitted serially.

AWS S3 has a similar API (and most blob stores have copied it either wholesale
or something like it) which they call multipart uploading.

The basic flow for a multipart upload is:

1. Initiate a Multipart Upload to get an Upload ID.
2. Break your file up into chunks, and upload each one of them individually.
3. Once all chunks have been uploaded, finalize the upload.

   - This is the step where any errors would occur.

It does not directly support resuming an upload, but it allows clients to
control the "blast radius" of failure by adjusting the size of each part
they upload, and if any of the parts fail, they only have to resend those
specific parts.

This has a big benefit in that it allows parallelization in uploading files,
allowing clients to maximize their bandwidth using multiple threads to send
the data.

We wouldn't need an explicit step (1), because our session would implicitly
initiate a multipart upload for each file.

It does have its own downsides:

- Clients have to do more work on every request to have something resembling
  resumable uploads. They would *have* to break the file up into multiple parts
  rather than just making a single POST request, and only needing to deal
  with the complexity if something fails.

- Clients that don't care about resumption at all still have to deal with
  the third explicit step, though they could just upload the file all as a
  single part.

  - S3 works around this by having another API for one shot uploads, but
    I'd rather not have two different APIs for uploading the same file.

- Verifying hashes gets somewhat more complicated. AWS implements hashing
  multipart uploads by hashing each part, then the overall hash is just a
  hash of those hashes, not of the content itself. We need to know the
  actual hash of the file itself for PyPI, so we would have to reconstitute
  the file and read its content and hash it once it's been fully uploaded,
  though we could still use the hash of hashes trick for checksumming the
  upload itself.

  - See above about whether this is actually a downside in practice, or
    if it's just in theory.

I lean towards the tus style resumable uploads as I think they're simpler
to use and to implement, and the main downside is that we possibly leave
some multi-threaded performance on the table, which I think that I'm
personally fine with?

I guess one additional benefit of the S3 style multipart uploads is that
you don't have to try and do any sort of protection against parallel
uploads, since they're just supported. That alone might erase most of the
server side implementation simplification.


Copyright
=========

This document is placed in the public domain or under the
CC0-1.0-Universal license, whichever is more permissive.