PEP: 694
Title: Upload 2.0 API for Python Package Repositories
Author: Donald Stufft <donald@stufft.io>
Discussions-To: https://discuss.python.org/t/pep-694-upload-2-0-api-for-python-package-repositories/16879
Status: Draft
Type: Standards Track
Topic: Packaging
Content-Type: text/x-rst
Created: 11-Jun-2022
Post-History: `27-Jun-2022 <https://discuss.python.org/t/pep-694-upload-2-0-api-for-python-package-repositories/16879>`__

Abstract
========

There currently is not a standardized API for uploading files to a Python
package repository such as PyPI. Instead, everyone has been forced to reverse
engineer the non-standard API from PyPI.

That API, while functional, leaks a lot of implementation details of the
original PyPI code base, which have had to be faithfully replicated in the
new code base and in alternative implementations.

Beyond the above, there are a number of major issues with the current API:

- It is a fully synchronous API, which means that we're forced to have a
  single request held open for potentially a long time, both for the upload
  itself, and then while the repository processes the uploaded file to
  determine success or failure.

- It does not support any mechanism for resuming an upload. With the largest
  file on PyPI being just under 1 GB in size, that's a lot of wasted
  bandwidth if a large file suffers a network blip towards the end of an
  upload.

- It treats a single file as the atomic unit of operation. This can be
  problematic when a release has multiple binary wheels: people may get
  different versions while the files are uploading, and if the sdist happens
  not to go last, some hard-to-build packages may be built from source.

- It has very limited support for communicating back to the user, limited
  entirely to the HTTP status code and status message (something which I'm
  not sure is even technically valid HTTP?). It has no support for multiple
  errors, warnings, deprecations, etc.

- The metadata for a release/file is submitted alongside the file; however,
  this metadata is famously unreliable, and most installers instead choose
  to download the entire file and read the metadata from it, in part due to
  that unreliability.

- There is no mechanism for allowing a repository to do any sort of sanity
  checks before bandwidth starts getting expended on an upload, even though
  many of the cases of invalid metadata or incorrect permissions could be
  checked prior to upload.

- It has no support for "staging" a draft release prior to publishing it to
  the repository.

- It has no support for creating new projects without uploading a file.

This PEP proposes a new API for uploads, and deprecates the existing
non-standard API.

Status Quo
==========

This section does not attempt to be fully exhaustive documentation of the
current API; it only gives a high-level overview of the existing API.

Endpoint
--------

The existing upload API (and the now-removed register API) lives at a URL,
currently ``https://upload.pypi.org/legacy/``. To communicate which specific
API you want to call, you add an ``:action`` URL parameter with a value of
``file_upload``. The values ``submit``, ``submit_pkg_info``, and
``doc_upload`` also used to be supported, but no longer are.

It also has a ``protocol_version`` parameter, in theory to allow new
versions of the API to be written, but in practice that has never happened,
and the value is always ``1``.

So in practice, on PyPI, the endpoint is
``https://upload.pypi.org/legacy/?:action=file_upload&protocol_version=1``.

Encoding
--------

The data to be submitted is sent as a ``POST`` request with the content type
``multipart/form-data``. This is historical in nature: this API was not
actually designed as an API, but rather was a form on the initial PyPI
implementation, and client code was then written to programmatically submit
that form.

Content
-------

Roughly speaking, the metadata contained within the package is submitted as
parts where the content-disposition is ``form-data``, and the name is the
name of the field. The names of these various pieces of metadata are not
documented, and they sometimes, but not always, match the names used in the
``METADATA`` files. The casing rarely matches, and overall the ``METADATA``
to ``form-data`` conversion is extremely inconsistent.

The file itself is then sent as an ``application/octet-stream`` part with
the name ``content``, and if there is a PGP signature attached, it is
included as an ``application/octet-stream`` part with the name
``gpg_signature``.

Specification
=============

This PEP traces the root cause of most of the issues with the existing API
to roughly two things:

- The metadata is submitted alongside the file, rather than being parsed
  from the file itself.

  - This is actually fine if used as a pre-check, but it should be
    reconciled against the actual ``METADATA`` or similar files within the
    distribution.

- It supports a single request, using nothing but form data, that either
  succeeds or fails, and everything is done and contained within that
  single request.

We then propose a multiple request workflow, that essentially boils down to:

1. Initiate an upload session.
2. Upload the file(s) as part of the upload session.
3. Complete the upload session.
4. (Optional) Check the status of an upload session.

All URLs described here will be relative to the root endpoint, which may be
located anywhere within the URL structure of a domain. So it could be at
``https://upload.example.com/``, or ``https://example.com/upload/``.

Versioning
----------

This PEP uses the same ``MAJOR.MINOR`` versioning system as used in
:pep:`691`, but it is otherwise independently versioned. The existing API is
considered by this spec to be version ``1.0``, but this PEP otherwise does
not attempt to modify that API in any way.

Endpoints
---------

Create an Upload Session
~~~~~~~~~~~~~~~~~~~~~~~~

To create a new upload session, you can send a ``POST`` request to ``/``,
with a payload that looks like:

.. code-block:: json

  {
    "meta": {
      "api-version": "2.0"
    },
    "name": "foo",
    "version": "1.0"
  }

This currently has three keys: ``meta``, ``name``, and ``version``.

The ``meta`` key is included in all payloads, and it describes information
about the payload itself.

The ``name`` key is the name of the project that this session is attempting
to add files to.

The ``version`` key is the version of the project that this session is
attempting to add files to.

If creating the session was successful, then the server must return a
response that looks like:

.. code-block:: json

  {
    "meta": {
      "api-version": "2.0"
    },
    "urls": {
      "upload": "...",
      "draft": "...",
      "publish": "..."
    },
    "valid-for": 604800,
    "status": "pending",
    "files": {},
    "notices": [
      "a notice to display to the user"
    ]
  }

Besides the ``meta`` key, this response has five keys: ``urls``,
``valid-for``, ``status``, ``files``, and ``notices``.

The ``urls`` key is a dictionary mapping identifiers to URLs related to this
session.

The ``valid-for`` key is an integer representing how long, in seconds, until
the server itself will expire this session (and thus all of the URLs
contained within it). The session **SHOULD** live at least this much longer
unless the client itself has canceled the session. Servers **MAY** choose to
*increase* this time, but should never *decrease* it, except naturally
through the passage of time.

The ``status`` key is a string that contains one of ``pending``,
``published``, ``errored``, or ``canceled``; this string represents the
overall status of the session.

The ``files`` key is a mapping of the filenames that have been uploaded to
this session to a mapping containing details about each file.

The ``notices`` key is an optional key that points to an array of notices
that the server wishes to communicate to the end user and that are not
specific to any one file.

For each filename in ``files``, the mapping has three keys: ``status``,
``url``, and ``notices``.

The ``status`` key is the same as the top level ``status`` key, except that
it indicates the status of a specific file.

The ``url`` key is the *absolute* URL that the client should upload that
specific file to (or use to delete that file).

The ``notices`` key is an optional key, which is an array of notices that
the server wishes to communicate to the end user that are specific to this
file.

The required response code to a successful creation of the session is a
``201 Created`` response, and it **MUST** include a ``Location`` header that
is the URL for this session, which may be used to check its status or
cancel it.

For the ``urls`` key, there are currently three keys that may appear:

The ``upload`` key, which is the upload endpoint for this session to
initiate a file upload.

The ``draft`` key, which is the repository URL at which these files are
available prior to publishing.

The ``publish`` key, which is the endpoint to trigger publishing the
session.

In addition to the above, if a second session is created for the same
name+version pair, then the upload server **MUST** return the already
existing session rather than creating a new, empty one.

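As an illustration, a minimal client sketch for creating a session might
look like the following. It uses the third-party ``requests`` library, and
the root endpoint and credentials are hypothetical, not part of this
specification:

.. code-block:: python

  import requests

  # Hypothetical root endpoint and credentials; the actual values are
  # repository-specific and not defined by this PEP.
  UPLOAD_ROOT = "https://upload.example.com/"
  AUTH = ("__token__", "<an API token>")

  resp = requests.post(
      UPLOAD_ROOT,
      auth=AUTH,
      headers={
          "Content-Type": "application/vnd.pypi.upload.v2+json",
          "Accept": "application/vnd.pypi.upload.v2+json",
      },
      json={
          "meta": {"api-version": "2.0"},
          "name": "foo",
          "version": "1.0",
      },
  )
  resp.raise_for_status()  # a successful creation is a 201 Created

  session = resp.json()
  session_url = resp.headers["Location"]  # status checks and cancelation
  upload_url = session["urls"]["upload"]  # used to initiate file uploads
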
Upload Each File
~~~~~~~~~~~~~~~~

Once you have initiated an upload session for one or more files, you then
have to actually upload each of those files.

There is no set endpoint for actually uploading the file; that is given to
the client by the server as part of the creation of the upload session, and
clients **MUST NOT** assume that there is any stability to what those URLs
look like from one session to the next.

To initiate a file upload, a client sends a ``POST`` request to the upload
URL in the session, with a request body that looks like:

.. code-block:: json

  {
    "meta": {
      "api-version": "2.0"
    },
    "filename": "foo-1.0.tar.gz",
    "size": 1000,
    "hashes": {"sha256": "...", "blake2b": "..."},
    "metadata": "..."
  }

Besides the standard ``meta`` key, this currently has 4 keys:

- ``filename``: The filename of the file being uploaded.
- ``size``: The size, in bytes, of the file that is being uploaded.
- ``hashes``: A mapping of hash names to hex-encoded digests, where each of
  these digests is the digest of that file when hashed by the hash
  identified in the name.

  By default, any hash algorithm available via `hashlib
  <https://docs.python.org/3/library/hashlib.html>`_ (specifically, any that
  can be passed to ``hashlib.new()`` and does not require additional
  parameters) can be used as a key for the hashes dictionary. At least one
  secure algorithm from ``hashlib.algorithms_guaranteed`` **MUST** always
  be included. At the time of this PEP, ``sha256`` specifically is
  recommended.

  Multiple hashes may be passed at a time, but all hashes provided must be
  valid for the file.
- ``metadata``: An optional key that is a string containing the file's
  `core metadata <https://packaging.python.org/en/latest/specifications/core-metadata/>`_.

Servers **MAY** use the data provided in this request to do some sanity
checking prior to allowing the file to be uploaded, which may include, but
is not limited to:

- Checking if the ``filename`` already exists.
- Checking if the ``size`` would invalidate some quota.
- Checking if the contents of the ``metadata``, if provided, are valid.

If the server determines that the client should attempt the upload, it will
return a ``201 Created`` response, with an empty body, and a ``Location``
header pointing to the URL that the file itself should be uploaded to.

At this point, the status of the session should show the filename, with the
above URL included in it.

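Continuing the hypothetical client sketch above, initiating the upload of a
file might look like this (``upload_url`` being the value from the
session's ``urls`` mapping):

.. code-block:: python

  import hashlib
  import pathlib

  import requests

  filepath = pathlib.Path("dist/foo-1.0.tar.gz")
  data = filepath.read_bytes()

  resp = requests.post(
      upload_url,  # session["urls"]["upload"] from the sketch above
      auth=AUTH,
      headers={
          "Content-Type": "application/vnd.pypi.upload.v2+json",
          "Accept": "application/vnd.pypi.upload.v2+json",
      },
      json={
          "meta": {"api-version": "2.0"},
          "filename": filepath.name,
          "size": len(data),
          "hashes": {"sha256": hashlib.sha256(data).hexdigest()},
      },
  )
  resp.raise_for_status()  # expects a 201 Created

  file_upload_url = resp.headers["Location"]  # where the bytes get sent
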
Upload Data
+++++++++++

To upload the file, a client has two choices: they may upload the file
either as a single chunk, or as multiple chunks. Either option is
acceptable, but it is recommended that most clients choose to upload each
file as a single chunk, as that requires fewer requests and typically has
better performance.

However, for particularly large files, uploading within a single request may
result in timeouts, so larger files may need to be uploaded in multiple
chunks.

In either case, the client must generate a unique token (or nonce) for each
upload attempt for a file, and **MUST** include that token in each request
in the ``Upload-Token`` header. The ``Upload-Token`` is a binary blob
encoded using base64, surrounded by a ``:`` on either side. Clients
**SHOULD** use at least 32 bytes of cryptographically random data. You can
generate it using the following:

.. code-block:: python

  import base64
  import secrets

  header = ":" + base64.b64encode(secrets.token_bytes(32)).decode() + ":"

The one time that it is permissible to omit the ``Upload-Token`` from an
upload request is when a client wishes to opt out of the resumable or
chunked file upload feature completely. In that case, they **MAY** omit the
``Upload-Token``, and the file must be successfully uploaded in a single
HTTP request, and if it fails, the entire file must be resent in another
single HTTP request.

To upload in a single chunk, a client sends a ``POST`` request to the URL
from the session response for that filename. The client **MUST** include a
``Content-Length`` header that is equal to the size of the file in bytes,
and this **MUST** match the size given in the original session creation.

As an example, if uploading a 100,000 byte file, you would send headers
like::

  Content-Length: 100000
  Upload-Token: :nYuc7Lg2/Lv9S4EYoT9WE6nwFZgN/TcUXyk9wtwoABg=:

If the upload completes successfully, the server **MUST** respond with a
``201 Created`` status. At this point this file **MUST NOT** be present in
the repository, but merely staged until the upload session has completed.

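In the hypothetical client above, a single-chunk upload is then just one
``POST`` of the raw bytes (``requests`` computes ``Content-Length`` from
the body automatically):

.. code-block:: python

  import base64
  import secrets

  import requests

  upload_token = ":" + base64.b64encode(secrets.token_bytes(32)).decode() + ":"

  resp = requests.post(
      file_upload_url,  # the Location header from the initiation response
      auth=AUTH,
      headers={
          "Upload-Token": upload_token,
          "Content-Type": "application/octet-stream",
      },
      data=data,  # the full file contents
  )
  resp.raise_for_status()  # expects a 201 Created
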
To upload in multiple chunks, a client sends multiple ``POST`` requests to
the same URL as before, one for each chunk.

This time, however, the ``Content-Length`` is equal to the size, in bytes,
of the chunk that they are sending. In addition, the client **MUST** include
an ``Upload-Offset`` header which indicates the byte offset that the content
included in this request starts at, and an ``Upload-Incomplete`` header set
to ``1``.

As an example, if uploading a 100,000 byte file in 1000 byte chunks, and
this chunk represents bytes 1001 through 2000, you would send headers like::

  Content-Length: 1000
  Upload-Token: :nYuc7Lg2/Lv9S4EYoT9WE6nwFZgN/TcUXyk9wtwoABg=:
  Upload-Offset: 1001
  Upload-Incomplete: 1

However, the **final** chunk of data omits the ``Upload-Incomplete`` header,
since at that point the upload is no longer incomplete.

For each successful chunk, the server **MUST** respond with a
``202 Accepted`` status, except for the final chunk, which **MUST** be a
``201 Created``.

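A chunked upload under these rules might be sketched as follows; the chunk
size here is an arbitrary choice for illustration:

.. code-block:: python

  CHUNK_SIZE = 10 * 1024 * 1024  # arbitrary example chunk size

  for offset in range(0, len(data), CHUNK_SIZE):
      chunk = data[offset : offset + CHUNK_SIZE]
      headers = {
          "Upload-Token": upload_token,
          "Content-Type": "application/octet-stream",
          "Upload-Offset": str(offset),
      }
      # Every chunk except the final one carries Upload-Incomplete: 1.
      if offset + CHUNK_SIZE < len(data):
          headers["Upload-Incomplete"] = "1"
      resp = requests.post(
          file_upload_url, auth=AUTH, headers=headers, data=chunk
      )
      resp.raise_for_status()  # 202 Accepted per chunk, 201 Created last
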
The following constraints are placed on uploads regardless of whether they
are single chunk or multiple chunks:

- A client **MUST NOT** perform multiple ``POST`` requests in parallel for
  the same file, to avoid race conditions and data loss or corruption. The
  server **MAY** terminate any ongoing ``POST`` request that utilizes the
  same ``Upload-Token``.
- If the offset provided in ``Upload-Offset`` is not ``0`` or the next chunk
  in an incomplete upload, then the server **MUST** respond with a
  ``409 Conflict``.
- Once an upload has started with a specific token, you may not use another
  token for that file without deleting the in-progress upload.
- Once a file has been uploaded successfully, you may initiate another
  upload for that file, and doing so will replace that file.

Resume Upload
+++++++++++++

To resume an upload, you first have to know how much of the data the server
has already received, regardless of whether you were originally uploading
the file as a single chunk or in multiple chunks.

To get the status of an individual upload, a client can make a ``HEAD``
request with their existing ``Upload-Token`` to the same URL they were
uploading to.

The server **MUST** respond with a ``204 No Content`` response, with an
``Upload-Offset`` header that indicates what offset the client should
continue uploading from. If the server has not received any data, then this
would be ``0``; if it has received 1007 bytes, then it would be ``1007``.

Once the client has retrieved the offset that they need to start from, they
can upload the rest of the file as described above, either in a single
request containing all of the remaining data, or in multiple chunks.

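Resumption is thus a ``HEAD`` request to learn the offset, followed by a
``POST`` of the remainder; sketched with the same hypothetical client:

.. code-block:: python

  # Ask the server how much of the file it has already received.
  resp = requests.head(
      file_upload_url,
      auth=AUTH,
      headers={"Upload-Token": upload_token},
  )
  resp.raise_for_status()  # expects a 204 No Content
  offset = int(resp.headers["Upload-Offset"])

  # Send everything from that offset onward as one final chunk.
  resp = requests.post(
      file_upload_url,
      auth=AUTH,
      headers={
          "Upload-Token": upload_token,
          "Content-Type": "application/octet-stream",
          "Upload-Offset": str(offset),
      },
      data=data[offset:],
  )
  resp.raise_for_status()  # expects a 201 Created
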
Canceling an In Progress Upload
+++++++++++++++++++++++++++++++

If a client wishes to cancel an upload of a specific file, for instance
because they need to upload a different file, they may do so by issuing a
``DELETE`` request to the file upload URL with the ``Upload-Token`` used to
upload the file in the first place.

A successful cancelation request **MUST** respond with a
``204 No Content``.

Delete an Uploaded File
+++++++++++++++++++++++

Already uploaded files may be deleted by issuing a ``DELETE`` request to the
file upload URL without the ``Upload-Token``.

A successful deletion request **MUST** respond with a ``204 No Content``.

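Both operations are plain ``DELETE`` requests to the same URL; the only
difference is the presence of the ``Upload-Token`` header. A sketch:

.. code-block:: python

  # Cancel an in-progress upload; the token identifies the attempt.
  resp = requests.delete(
      file_upload_url, auth=AUTH, headers={"Upload-Token": upload_token}
  )
  resp.raise_for_status()  # expects a 204 No Content

  # Delete an already-uploaded file; no token is sent.
  resp = requests.delete(file_upload_url, auth=AUTH)
  resp.raise_for_status()  # expects a 204 No Content
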
Session Status
~~~~~~~~~~~~~~

Similarly to file uploads, the session URL is provided in the response to
creating the upload session, and clients **MUST NOT** assume that there is
any stability to what those URLs look like from one session to the next.

To check the status of a session, clients issue a ``GET`` request to the
session URL, to which the server will respond with the same response that
they got when they initially created the upload session, except with any
changes to ``status``, ``valid-for``, or updated ``files`` reflected.

Session Cancelation
~~~~~~~~~~~~~~~~~~~

To cancel an upload session, a client issues a ``DELETE`` request to the
same session URL as before. At that point the server marks the session as
canceled, **MAY** purge any data that was uploaded as part of that session,
and future attempts to access that session URL or any of the file upload
URLs **MAY** return a ``404 Not Found``.

To prevent a lot of dangling sessions, servers may also choose to cancel a
session of their own accord. It is recommended that servers expunge their
sessions after no less than a week, but each server may choose its own
schedule.

Session Completion
~~~~~~~~~~~~~~~~~~

To complete a session, and publish the files that have been included in it,
a client **MUST** send a ``POST`` request to the ``publish`` URL in the
session status payload.

If the server is able to immediately complete the session, it may do so and
return a ``201 Created`` response. If it is unable to immediately complete
the session (for instance, if it needs to do processing that may take
longer than is reasonable for a single HTTP request), then it may return a
``202 Accepted`` response.

In either case, the server should include a ``Location`` header pointing
back to the session status URL, and if the server returned a
``202 Accepted``, the client may poll that URL to watch for the status to
change.

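Publishing and then polling for completion could look like the following
sketch; the polling interval is an arbitrary choice:

.. code-block:: python

  import time

  resp = requests.post(
      session["urls"]["publish"],
      auth=AUTH,
      headers={"Accept": "application/vnd.pypi.upload.v2+json"},
  )
  resp.raise_for_status()  # 201 Created (done) or 202 Accepted (processing)
  status_url = resp.headers["Location"]

  if resp.status_code == 202:
      # Poll the session status until it is no longer pending.
      while True:
          status = requests.get(
              status_url,
              auth=AUTH,
              headers={"Accept": "application/vnd.pypi.upload.v2+json"},
          ).json()
          if status["status"] != "pending":
              break
          time.sleep(5)  # arbitrary polling interval
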
Errors
------

All error responses that contain a body will have a body that looks like:

.. code-block:: json

  {
    "meta": {
      "api-version": "2.0"
    },
    "message": "...",
    "errors": [
      {
        "source": "...",
        "message": "..."
      }
    ]
  }

Besides the standard ``meta`` key, this has two top level keys: ``message``
and ``errors``.

The ``message`` key is a singular message that encapsulates all errors that
may have happened on this request.

The ``errors`` key is an array of specific errors, each of which contains a
``source`` key, which is a string that indicates what the source of the
error is, and a ``message`` key for that specific error.

The ``message`` and ``source`` strings do not have any specific meaning, and
are intended for humans to interpret to figure out what the underlying
issue was.

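A client that wants to surface these errors might do something like this
sketch, where ``explain_errors`` is a hypothetical helper:

.. code-block:: python

  import requests

  def explain_errors(resp: requests.Response) -> None:
      """Print the error structure from an upload API error response."""
      body = resp.json()
      print(body["message"])
      for error in body.get("errors", []):
          print(f"{error['source']}: {error['message']}")
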
Content-Types
-------------

Like :pep:`691`, this PEP proposes that all requests and responses from the
upload API will have a standard content type that describes what the
content is, what version of the API it represents, and what serialization
format has been used.

The structure of this content type will be:

.. code-block:: text

  application/vnd.pypi.upload.$version+format

Since only major versions should be disruptive to systems attempting to
understand one of these API content bodies, only the major version will be
included in the content type, and it will be prefixed with a ``v`` to
clarify that it is a version number.

Unlike :pep:`691`, this PEP does not change the existing ``1.0`` API in any
way, so servers will be required to host the new API described in this PEP
at a different endpoint than the existing upload API.

This means that for the new 2.0 API, the content types would be:

- **JSON:** ``application/vnd.pypi.upload.v2+json``

In addition to the above, a special "meta" version named ``latest`` is
supported, whose purpose is to allow clients to request the absolute latest
version, without having to know ahead of time what that version is. It is
recommended, however, that clients be explicit about what versions they
support.

These content types **DO NOT** apply to the file uploads themselves, only
to the other API requests/responses in the upload API. The files themselves
should use the ``application/octet-stream`` content type.

Version + Format Selection
--------------------------

Again similar to :pep:`691`, this PEP standardizes on using server-driven
content negotiation to allow clients to request different versions or
serialization formats, which includes the ``format`` URL parameter.

Since this PEP expects the existing legacy ``1.0`` upload API to exist at a
different endpoint, and it currently only provides for JSON serialization,
this mechanism is not particularly useful, and clients only have a single
version and serialization they can request. However, clients **SHOULD** be
set up to handle content negotiation gracefully in the case that additional
formats or versions are added in the future.

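For instance, a client might send an ``Accept`` header that prefers the
``v2`` JSON content type but falls back to the ``latest`` meta version, as
in this sketch:

.. code-block:: python

  headers = {
      # Prefer 2.0 JSON; fall back to whatever the latest version is.
      "Accept": (
          "application/vnd.pypi.upload.v2+json, "
          "application/vnd.pypi.upload.latest+json;q=0.1"
      ),
  }
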
FAQ
===

Does this mean PyPI is planning to drop support for the existing upload API?
----------------------------------------------------------------------------

At this time PyPI does not have any specific plans to drop support for the
existing upload API.

Unlike with :pep:`691`, there are wide benefits to doing so, so it is
likely that we will want to drop support for it at some point in the
future, but until this API is implemented and receiving broad use, it would
be premature to make any plans for actually dropping support for it.

Is this Resumable Upload protocol based on anything?
----------------------------------------------------

Yes!

It's actually the protocol specified in an
`Active Internet-Draft <https://datatracker.ietf.org/doc/draft-tus-httpbis-resumable-uploads-protocol/>`_,
where the authors took what they learned implementing `tus <https://tus.io/>`_
to provide the idea of resumable uploads in a wholly generic, standards-based
way.

The only deviation we've made from that spec is that we don't use the
``104 Upload Resumption Supported`` informational response in the first
``POST`` request. This decision was made for a few reasons:

- The ``104 Upload Resumption Supported`` response is the only part of that
  draft which does not rely entirely on things that are already supported
  in the existing standards, since it adds a new informational status.
- Many clients and web frameworks don't support ``1xx`` informational
  responses in a very good way, if at all; adding it would complicate
  implementation for very little benefit.
- The purpose of the ``104 Upload Resumption Supported`` response is to
  allow clients to determine that an arbitrary endpoint that they're
  interacting with supports resumable uploads. Since this PEP is mandating
  support for that in servers, clients can just assume that the server they
  are interacting with supports it, which makes using it unneeded.
- In theory, if the support for ``1xx`` responses gets resolved and the
  draft gets accepted with it in, we can add that in at a later date
  without changing the overall flow of the API.

There is a risk that the above draft doesn't get accepted, but even if it
does not, that doesn't actually affect us. It would just mean that our
support for resumable uploads is an application-specific protocol, but it
would still be wholly standards compliant.

Open Questions
==============

Multipart Uploads vs tus
------------------------

This PEP currently bases the actual uploading of files on an internet draft
from tus.io that supports resumable file uploads.

That protocol requires a few things:

- That the client selects a secure ``Upload-Token`` that they use to
  identify the upload of a single file.
- That if clients don't upload the entire file in one shot, they have to
  submit the chunks serially, and in the correct order, with all but the
  final chunk having an ``Upload-Incomplete: 1`` header.
- Resumption of an upload is essentially just querying the server to see
  how much data it has gotten, then sending the remaining bytes (either as
  a single request, or in chunks).
- The upload is implicitly completed when the server successfully gets all
  of the data from the client.

This has one big benefit: if a client doesn't care about resuming their
upload, the client-side work to support resumable uploads can be completely
ignored. They can just ``POST`` the file to the URL, and if it doesn't
succeed, they can just ``POST`` the whole file again.

The other benefit is that even if you do want to support resumption, you
can still just ``POST`` the file, and unless you *need* to resume the
upload, that's all you have to do.

Another, possibly theoretical, benefit is that for hashing the uploaded
files, the serial-chunks requirement means that the server can maintain
hashing state between requests, update it for each request, then write that
state back to storage. Unfortunately this isn't actually possible to do
with Python's hashlib, though there are some libraries like `Rehash
<https://github.com/kislyuk/rehash>`_ that implement it, but they don't
support every hash that hashlib does (specifically not blake2 or sha3 at
the time of writing).

We might also need to reconstitute the upload for processing anyway, to do
things like extract metadata from it, which would make this a moot point.

The downside is that there is no ability to parallelize the upload of a
single file, because each chunk has to be submitted serially.

AWS S3 has a similar API (and most blob stores have copied it, either
wholesale or something like it), which they call multipart uploading.

The basic flow for a multipart upload is:

1. Initiate a Multipart Upload to get an Upload ID.
2. Break your file up into chunks, and upload each one of them individually.
3. Once all chunks have been uploaded, finalize the upload.

   - This is the step where any errors would occur.

It does not directly support resuming an upload, but it allows clients to
control the "blast radius" of failure by adjusting the size of each part
they upload, and if any of the parts fail, they only have to resend those
specific parts.

This has a big benefit in that it allows parallelization when uploading
files, letting clients maximize their bandwidth by using multiple threads
to send the data.

We wouldn't need an explicit step (1), because our session would implicitly
initiate a multipart upload for each file.

It does have its own downsides:

- Clients have to do more work on every request to have something
  resembling resumable uploads. They would *have* to break the file up into
  multiple parts, rather than just making a single ``POST`` request and
  only needing to deal with the complexity if something fails.

- Clients that don't care about resumption at all still have to deal with
  the third explicit step, though they could just upload the file all as a
  single part.

  - S3 works around this by having another API for one-shot uploads, but
    I'd rather not have two different APIs for uploading the same file.

- Verifying hashes gets somewhat more complicated. AWS implements hashing
  of multipart uploads by hashing each part; the overall hash is then just
  a hash of those hashes, not of the content itself. We need to know the
  actual hash of the file itself for PyPI, so we would have to reconstitute
  the file, read its content, and hash it once it has been fully uploaded,
  though we could still use the hash-of-hashes trick for checksumming the
  upload itself.

  - See above about whether this is actually a downside in practice, or
    if it's just in theory.

I lean towards the tus-style resumable uploads, as I think they're simpler
to use and to implement, and the main downside is that we possibly leave
some multi-threaded performance on the table, which I think I'm personally
fine with?

I guess one additional benefit of the S3-style multipart uploads is that
you don't have to try to do any sort of protection against parallel
uploads, since they're just supported. That alone might erase most of the
server-side implementation simplification.

Copyright
=========

This document is placed in the public domain or under the
CC0-1.0-Universal license, whichever is more permissive.