516 lines
25 KiB
Plaintext
516 lines
25 KiB
Plaintext
PEP: 470
|
||
Title: Using Multi Repository Support for External to PyPI Package File Hosting
|
||
Version: $Revision$
|
||
Last-Modified: $Date$
|
||
Author: Donald Stufft <donald@stufft.io>,
|
||
BDFL-Delegate: Richard Jones <richard@python.org>
|
||
Discussions-To: distutils-sig@python.org
|
||
Status: Draft
|
||
Type: Process
|
||
Content-Type: text/x-rst
|
||
Created: 12-May-2014
|
||
Post-History: 14-May-2014, 05-Jun-2014, 03-Oct-2014, 13-Oct-2014
|
||
Replaces: 438
|
||
|
||
|
||
Abstract
|
||
========
|
||
|
||
This PEP proposes a mechanism for project authors to register with PyPI an
|
||
external repository where their project's downloads can be located. This
|
||
information can than be included as part of the simple API so that installers
|
||
can use it to tell users where the item they are attempting to install is
|
||
located and what they need to do to enable this additional repository. In
|
||
addition to adding discovery information to make explicit multiple repositories
|
||
easy to use, this PEP also deprecates and removes the implicit multiple
|
||
repository support which currently functions through directly or indirectly
|
||
linking off site via the simple API. Finally this PEP also proposes deprecating
|
||
and removing the functionality added by PEP 438, particularly the additional
|
||
rel information and the meta tag to indicate the API version.
|
||
|
||
This PEP *does* not propose mandating that all authors upload their projects to
|
||
PyPI in order to exist in the index nor does it propose any change to the human
|
||
facing elements of PyPI.
|
||
|
||
|
||
Rationale
|
||
=========
|
||
|
||
Historically PyPI did not have any method of hosting files nor any method of
|
||
automatically retrieving installables, it was instead focused on providing a
|
||
central registry of names, to prevent naming collisions, and as a means of
|
||
discovery for finding projects to use. In the course of time setuptools began
|
||
to scrape these human facing pages, as well as pages linked from those pages,
|
||
looking for things it could automatically download and install. Eventually this
|
||
became the "Simple" API which used a similar URL structure however it
|
||
eliminated any of the extraneous links and information to make the API more
|
||
efficient. Additionally PyPI grew the ability for a project to upload release
|
||
files directly to PyPI enabling PyPI to act as a repository in addition to an
|
||
index.
|
||
|
||
This gives PyPI two equally important roles that it plays in the Python
|
||
ecosystem, that of index to enable easy discovery of Python projects and
|
||
central repository to enable easy hosting, download, and installation of Python
|
||
projects. Due to the history behind PyPI and the very organic growth it has
|
||
experienced the lines between these two roles are blurry, and this blurring has
|
||
caused confusion for the end users of both of these roles and this has in turn
|
||
caused ire between people attempting to use PyPI in different capacities, most
|
||
often when end users want to use PyPI as a repository but the author wants to
|
||
use PyPI solely as an index.
|
||
|
||
This confusion comes down to end users of projects not realizing if a project
|
||
is hosted on PyPI or if it relies on an external service. This often manifests
|
||
itself when the external service is down but PyPI is not. People will see that
|
||
PyPI works, and other projects works, but this one specific one does not. They
|
||
often times do not realize who they need to contact in order to get this fixed
|
||
or what their remediation steps are.
|
||
|
||
By moving to using explicit multiple repositories we can make the lines between
|
||
these two roles much more explicit and remove the "hidden" surprises caused by
|
||
the current implementation of handling people who do not want to use PyPI as a
|
||
repository. However simply moving to explicit multiple repositories is a
|
||
regression in discoverability, and for that reason this PEP adds an extension
|
||
to the current simple API which will enable easy discovery of the specific
|
||
repository that a project can be found in.
|
||
|
||
PEP 438 attempted to solve this issue by allowing projects to explicitly
|
||
declare if they were using the repository features or not, and if they were
|
||
not, it had the installers classify the links it found as either "internal",
|
||
"verifiable external" or "unverifiable external". PEP 438 was accepted and
|
||
implemented in pip 1.4 (released on Jul 23, 2013) with the final transition
|
||
implemented in pip 1.5 (released on Jan 2, 2014).
|
||
|
||
PEP 438 was successful in bringing about more people to utilize PyPI's
|
||
repository features, an altogether good thing given the global CDN powering
|
||
PyPI providing speed ups for a lot of people, however it did so by introducing
|
||
a new point of confusion and pain for both the end users and the authors.
|
||
|
||
|
||
Key User Experience Expectations
|
||
--------------------------------
|
||
|
||
#. Easily allow external hosting to "just work" when appropriately configured
|
||
at the system, user or virtual environment level.
|
||
#. Easily allow package authors to tell PyPI "my releases are hosted <here>"
|
||
and have that advertised in such a way that tools can clearly communicate it
|
||
to users, without silently introducing unexpected dependencies on third
|
||
party services.
|
||
#. Eliminate any and all references to the confusing "verifiable external" and
|
||
"unverifiable external" distinction from the user experience (both when
|
||
installing and when releasing packages).
|
||
#. The repository aspects of PyPI should become *just* the default package
|
||
hosting location (i.e. the only one that is treated as opt-out rather than
|
||
opt-in by most client tools in their default configuration). Aside from that
|
||
aspect, hosting on PyPI should not otherwise provide an enhanced user
|
||
experience over hosting your own package repository.
|
||
#. Do all of the above while providing default behaviour that is secure against
|
||
most attackers below the nation state adversary level.
|
||
|
||
|
||
Why Additional Repositories?
|
||
----------------------------
|
||
|
||
The two common installer tools, pip and easy_install/setuptools, both support
|
||
the concept of additional locations to search for files to satisfy the
|
||
installation requirements and have done so for many years. This means that
|
||
there is no need to "phase" in a new flag or concept and the solution to
|
||
installing a project from a repository other than PyPI will function regardless
|
||
of how old (within reason) the end user's installer is. Not only has this
|
||
concept existed in the Python tooling for some time, but it is a concept that
|
||
exists across languages and even extending to the OS level with OS package
|
||
tools almost universally using multiple repository support making it extremely
|
||
likely that someone is already familiar with the concept.
|
||
|
||
Additionally, the multiple repository approach is a concept that is useful
|
||
outside of the narrow scope of allowing projects which wish to be included on
|
||
the index portion of PyPI but do not wish to utilize the repository portion of
|
||
PyPI. This includes places where a company may wish to host a repository that
|
||
contains their internal packages or where a project may wish to have multiple
|
||
"channels" of releases, such as alpha, beta, release candidate, and final
|
||
release. This could also be used for projects wishing to host files which
|
||
cannot be uploaded to PyPI, such as multi-gigabyte data files or, currently at
|
||
least, Linux Wheels.
|
||
|
||
|
||
Why Not PEP 438 or Similar?
|
||
---------------------------
|
||
|
||
While the additional search location support has existed in pip and setuptools
|
||
for quite some time support for PEP 438 has only existed in pip since the 1.4
|
||
version, and still has yet to be implemented in setuptools. The design of
|
||
PEP 438 did mean that users still benefited for projects which did not require
|
||
external files even with older installers, however for projects which *did*
|
||
require external files, users are still silently being given either potentially
|
||
unreliable or, even worse, unsafe files to download. This system is also unique
|
||
to Python as it arises out of the history of PyPI, this means that it is almost
|
||
certain that this concept will be foreign to most, if not all users, until they
|
||
encounter it while attempting to use the Python toolchain.
|
||
|
||
Additionally, the classification system proposed by PEP 438 has, in practice,
|
||
turned out to be extremely confusing to end users, so much so that it is a
|
||
position of this PEP that the situation as it stands is completely untenable.
|
||
The common pattern for a user with this system is to attempt to install a
|
||
project possibly get an error message (or maybe not if the project ever
|
||
uploaded something to PyPI but later switched without removing old files), see
|
||
that the error message suggests ``--allow-external``, they reissue the command
|
||
adding that flag most likely getting another error message, see that this time
|
||
the error message suggests also adding ``--allow-unverified``, and again issue
|
||
the command a third time, this time finally getting the thing they wish to
|
||
install.
|
||
|
||
This UX failure exists for several reasons.
|
||
|
||
#. If pip can locate files at all for a project on the Simple API it will
|
||
simply use that instead of attempting to locate more. This is generally the
|
||
right thing to do as attempting to locate more would erase a large part of
|
||
the benefit of PEP 438. This means that if a project *ever* uploaded a file
|
||
that matches what the user has requested for install that will be used
|
||
regardless of how old it is.
|
||
#. PEP 438 makes an implicit assumption that most projects would either upload
|
||
themselves to PyPI or would update themselves to directly linking to release
|
||
files. While a large number of projects did ultimately decide to upload to
|
||
PyPI, some of them did so only because the UX around what PEP 438 was so bad
|
||
that they felt forced to do so. More concerning however, is the fact that
|
||
very few projects have opted to directly and safely link to files and
|
||
instead they still simply link to pages which must be scraped in order to
|
||
find the actual files, thus rendering the safe variant
|
||
(``--allow-external``) largely useless.
|
||
#. Even if an author wishes to directly link to their files, doing so safely is
|
||
non-obvious. It requires the inclusion of a MD5 hash (for historical
|
||
reasons) in the hash of the URL. If they do not include this then their
|
||
files will be considered "unverified".
|
||
#. PEP 438 takes a security centric view and disallows any form of a global opt
|
||
in for unverified projects. While this is generally a good thing, it creates
|
||
extremely verbose and repetitive command invocations such as::
|
||
|
||
$ pip install --allow-external myproject --allow-unverified myproject myproject
|
||
$ pip install --allow-all-external --allow-unverified myproject myproject
|
||
|
||
|
||
Multiple Repository/Index Support
|
||
=================================
|
||
|
||
Installers SHOULD implement or continue to offer, the ability to point the
|
||
installer at multiple URL locations. The exact mechanisms for a user to
|
||
indicate they wish to use an additional location is left up to each individual
|
||
implementation.
|
||
|
||
Additionally the mechanism discovering an installation candidate when multiple
|
||
repositories are being used is also up to each individual implementation,
|
||
however once configured an implementation should not discourage, warn, or
|
||
otherwise cast a negative light upon the use of a repository simply because it
|
||
is not the default repository.
|
||
|
||
Currently both pip and setuptools implement multiple repository support by
|
||
using the best installation candidate it can find from either repository,
|
||
essentially treating it as if it were one large repository.
|
||
|
||
Installers SHOULD also implement some mechanism for removing or otherwise
|
||
disabling use of the default repository. The exact specifics of how that is
|
||
achieved is up to each individual implementation.
|
||
|
||
Installers SHOULD also implement some mechanism for whitelisting and
|
||
blacklisting which projects a user wishes to install from a particular
|
||
repository. The exact specifics of how that is achieved is up to each
|
||
individual implementation.
|
||
|
||
|
||
External Index Discovery
|
||
========================
|
||
|
||
One of the problems with using an additional index is one of discovery. Users
|
||
will not generally be aware that an additional index is required at all much
|
||
less where that index can be found. Projects can attempt to convey this
|
||
information using their description on the PyPI page however that excludes
|
||
people who discover their project organically through ``pip search``.
|
||
|
||
To support projects that wish to externally host their files and to enable
|
||
users to easily discover what additional indexes are required, PyPI will gain
|
||
the ability for projects to register external index URLs along with an
|
||
associated comment for each. These URLs will be made available on the simple
|
||
page however they will not be linked or provided in a form that older
|
||
installers will automatically search them.
|
||
|
||
This ability will take the form of a ``<meta>`` tag. The name of this tag must
|
||
be set to ``repository`` or ``find-link`` and the content will be a link to the
|
||
location of the repository. An optional data-description attribute will convey
|
||
any comments or description that the author has provided.
|
||
|
||
An example would look something like::
|
||
|
||
<meta name="repository" content="https://index.example.com/" data-description="Primary Repository">
|
||
<meta name="repository" content="https://index.example.com/Ubuntu-14.04/" data-description="Wheels built for Ubuntu 14.04">
|
||
<meta name="find-link" content="https://links.example.com/find-links/" data-description="A flat index for find links">
|
||
|
||
When an installer fetches the simple page for a project, if it finds this
|
||
additional meta-data then it should use this data to tell the user how to add
|
||
one or more of the additional URLs to search in. This message should include
|
||
any comments that the project has included to enable them to communicate to the
|
||
user and provide hints as to which URL they might want (e.g. if some are only
|
||
useful or compatible with certain platforms or situations). When the installer
|
||
has implemented the auto discovery mechanisms they should also deprecate any of
|
||
the mechanisms added for PEP 438 (such as ``--allow-external``) for removal at
|
||
the end of the deprecation period proposed by the PEP.
|
||
|
||
In addition to the API for programtic access to the registered external
|
||
repositories, PyPI will also prevent these URLs in the UI so that users with
|
||
an installer that does not implement the discovery mechanism can still easily
|
||
discover what repository the project is using to host itself.
|
||
|
||
This feature **MUST** be added to PyPI and be contained in a released version
|
||
of pip prior to starting the deprecation and removal process for the implicit
|
||
offsite hosting functionality.
|
||
|
||
|
||
Deprecation and Removal of Link Spidering
|
||
=========================================
|
||
|
||
.. important:: The deprecation specified in this section **MUST** not start to
|
||
until after the discovery mechanisms have been implemented and released in
|
||
pip.
|
||
|
||
The only exception to this is the addition of the ``pypi-only`` mode and
|
||
defaulting new projects to it without abilility to switch to a different
|
||
mode.
|
||
|
||
A new hosting mode will be added to PyPI. This hosting mode will be called
|
||
``pypi-only`` and will be in addition to the three that PEP 438 has already
|
||
given us which are ``pypi-explicit``, ``pypi-scrape``, ``pypi-scrape-crawl``.
|
||
This new hosting mode will modify a project's simple api page so that it only
|
||
lists the files which are directly hosted on PyPI and will not link to anything
|
||
else.
|
||
|
||
Upon acceptance of this PEP and the addition of the ``pypi-only`` mode, all new
|
||
projects will be defaulted to the PyPI only mode and they will be locked to
|
||
this mode and unable to change this particular setting. ``pypi-only`` projects
|
||
will still be able to register external index URLs as described above - the
|
||
"pypi-only" refers only to the download links that are published directly on
|
||
PyPI.
|
||
|
||
An email will then be sent out to all of the projects which are hosted only on
|
||
PyPI informing them that in one month their project will be automatically
|
||
converted to the ``pypi-only`` mode. A month after these emails have been sent
|
||
any of those projects which were emailed, which still are hosted only on PyPI
|
||
will have their mode set to ``pypi-only``.
|
||
|
||
After that switch, an email will be sent to projects which rely on hosting
|
||
external to PyPI. This email will warn these projects that externally hosted
|
||
files have been deprecated on PyPI and that in 6 months from the time of that
|
||
email that all external links will be removed from the installer APIs. This
|
||
email **MUST** include instructions for converting their projects to be hosted
|
||
on PyPI and **MUST** include links to a script or package that will enable them
|
||
to enter their PyPI credentials and package name and have it automatically
|
||
download and re-host all of their files on PyPI. This email **MUST** also
|
||
include instructions for setting up their own index page and registering that
|
||
with PyPI, including the fact that they can use pythonhosted.org as a host for
|
||
an index page without requiring them to host any additional infrastructure or
|
||
purchase a TLS certificate. This email must also contain a link to the Terms of
|
||
Service for PyPI as many users may have signed up a long time ago and may not
|
||
recall what those terms are. Finally this email must also contain a list of
|
||
the links registered with PyPI where we were able to detect an installable file
|
||
was located.
|
||
|
||
Five months after the initial email, another email must be sent to any projects
|
||
still relying on external hosting. This email will include all of the same
|
||
information that the first email contained, except that the removal date will
|
||
be one month away instead of six.
|
||
|
||
Finally a month later all projects will be switched to the ``pypi-only`` mode
|
||
and PyPI will be modified to remove the externally linked files functionality,
|
||
when switching these projects to the ``pypi-only`` mode we will move any links
|
||
which are able to be used for discovering other projects automatically to as
|
||
an external repository.
|
||
|
||
|
||
Summary of Changes
|
||
==================
|
||
|
||
Repository side
|
||
---------------
|
||
|
||
#. Implement simple API changes to allow the addition of an external
|
||
repository.
|
||
#. *(Optional, Mandatory on PyPI)* Deprecate and remove the hosting modes as
|
||
defined by PEP 438.
|
||
#. *(Optional, Mandatory on PyPI)* Restrict simple API to only list the files
|
||
that are contained within the repository and the external repository
|
||
metadata.
|
||
|
||
Client side
|
||
-----------
|
||
|
||
#. Implement multiple repository support.
|
||
#. Implement some mechanism for removing/disabling the default repository.
|
||
#. Implement the discovery mechanism.
|
||
#. *(Optional)* Deprecate / Remove PEP 438
|
||
|
||
|
||
Impact
|
||
======
|
||
|
||
The large impact of this PEP will be that for users of older installation
|
||
clients they will not get a discovery mechanism built into the install command.
|
||
This will require them to browse to the PyPI web UI and discover the repository
|
||
there. Since any URLs required to instal a project will be automatically
|
||
migrated to the new format, the biggest change to users will be requiring a new
|
||
option to install these projects.
|
||
|
||
Looking at the numbers the actual impact should be quite low, with it affecting
|
||
just 3.8% of projects which host any files only externally or 2.2% which have
|
||
their latest version hosted only externally.
|
||
|
||
6674 unique IP addresses have accessed the Simple API for these 3.8% of
|
||
projects in a single day (2014-09-30). Of those, 99.5% of them installed
|
||
something which could not be verified, and thus they were open to a Remote Code
|
||
Execution via a Man-In-The-Middle attack, while 7.9% installed something which
|
||
could be verified and only 0.4% only installed things which could be verified.
|
||
|
||
This means that 99.5% users of these features, both new and old, are doing
|
||
something unsafe, and for anything using an older copy of pip or using
|
||
setuptools at all they are silently unsafe.
|
||
|
||
|
||
Projects Which Rely on Externally Hosted files
|
||
----------------------------------------------
|
||
|
||
This is determined by crawling the simple index and looking for installable
|
||
files using a similar detection method as pip and setuptools use. The "latest"
|
||
version is determined using ``pkg_resources.parse_version`` sort order and it
|
||
is used to show whether or not the latest version is hosted externally or only
|
||
old versions are.
|
||
|
||
============ ======= ================ =================== =======
|
||
\ PyPI External (old) External (latest) Total
|
||
============ ======= ================ =================== =======
|
||
**Safe** 43313 16 39 43368
|
||
**Unsafe** 0 756 1092 1848
|
||
**Total** 43313 772 1131 45216
|
||
============ ======= ================ =================== =======
|
||
|
||
|
||
Top Externally Hosted Projects by Requests
|
||
------------------------------------------
|
||
|
||
This is determined by looking at the number of requests the
|
||
``/simple/<project>/`` page had gotten in a single day. The total number of
|
||
requests during that day was 10,623,831.
|
||
|
||
============================== ========
|
||
Project Requests
|
||
============================== ========
|
||
PIL 63869
|
||
Pygame 2681
|
||
mysql-connector-python 1562
|
||
pyodbc 724
|
||
elementtree 635
|
||
salesforce-python-toolkit 316
|
||
wxPython 295
|
||
PyXML 251
|
||
RBTools 235
|
||
python-graph-core 123
|
||
cElementTree 121
|
||
============================== ========
|
||
|
||
|
||
Top Externally Hosted Projects by Unique IPs
|
||
--------------------------------------------
|
||
|
||
This is determined by looking at the IP addresses of requests the
|
||
``/simple/<project>/`` page had gotten in a single day. The total number of
|
||
unique IP addresses during that day was 124,604.
|
||
|
||
============================== ==========
|
||
Project Unique IPs
|
||
============================== ==========
|
||
PIL 4553
|
||
mysql-connector-python 462
|
||
Pygame 202
|
||
pyodbc 181
|
||
elementtree 166
|
||
wxPython 126
|
||
RBTools 114
|
||
PyXML 87
|
||
salesforce-python-toolkit 76
|
||
pyDes 76
|
||
============================== ==========
|
||
|
||
|
||
Rejected Proposals
|
||
==================
|
||
|
||
Keep the current classification system but adjust the options
|
||
-------------------------------------------------------------
|
||
|
||
This PEP rejects several related proposals which attempt to fix some of the
|
||
usability problems with the current system but while still keeping the general
|
||
gist of PEP 438.
|
||
|
||
This includes:
|
||
|
||
* Default to allowing safely externally hosted files, but disallow unsafely
|
||
hosted.
|
||
|
||
* Default to disallowing safely externally hosted files with only a global flag
|
||
to enable them, but disallow unsafely hosted.
|
||
|
||
* Continue on the suggested path of PEP 438 and remove the option to unsafely
|
||
host externally but continue to allow the option to safely host externally.
|
||
|
||
These proposals are rejected because:
|
||
|
||
* The classification system introduced in PEP 438 in an entirely unique concept
|
||
to PyPI which is not generically applicable even in the context of Python
|
||
packaging. Adding additional concepts comes at a cost.
|
||
|
||
* The classification system itself is non-obvious to explain and to
|
||
pre-determine what classification of link a project will require entails
|
||
inspecting the project's ``/simple/<project>/`` page, and possibly any URLs
|
||
linked from that page.
|
||
|
||
* The ability to host externally while still being linked for automatic
|
||
discovery is mostly a historic relic which causes a fair amount of pain and
|
||
complexity for little reward.
|
||
|
||
* The installer's ability to optimize or clean up the user interface is limited
|
||
due to the nature of the implicit link scraping which would need to be done.
|
||
This extends to the ``--allow-*`` options as well as the inability to
|
||
determine if a link is expected to fail or not.
|
||
|
||
* The mechanism paints a very broad brush when enabling an option, while
|
||
PEP 438 attempts to limit this with per package options. However a project
|
||
that has existed for an extended period of time may often times have several
|
||
different URLs listed in their simple index. It is not unusual for at least
|
||
one of these to no longer be under control of the project. While an
|
||
unregistered domain will sit there relatively harmless most of the time, pip
|
||
will continue to attempt to install from it on every discovery phase. This
|
||
means that an attacker simply needs to look at projects which rely on unsafe
|
||
external URLs and register expired domains to attack users.
|
||
|
||
|
||
Implement this PEP, but Do Not Remove the Existing Links
|
||
--------------------------------------------------------
|
||
|
||
This is essentially the backwards compatible version of this PEP. It attempts
|
||
to allow people using older clients, or clients which do not implement this
|
||
PEP to continue on as if nothing had changed. This proposal is rejected because
|
||
the vast bulk of those scenarios are unsafe uses of the deprecated features. It
|
||
is the opinion of this PEP that silently allowing unsafe actions to take place
|
||
on behalf of end users is simply not an acceptable solution.
|
||
|
||
|
||
Copyright
|
||
=========
|
||
|
||
This document has been placed in the public domain.
|
||
|
||
|
||
|
||
..
|
||
Local Variables:
|
||
mode: indented-text
|
||
indent-tabs-mode: nil
|
||
sentence-end-double-space: t
|
||
fill-column: 70
|
||
coding: utf-8
|
||
End:
|