PEP 470: use external indexes over link spidering
parent
65da967dc2
commit
08ad02e889
|
@ -0,0 +1,377 @@
|
||||||
|
PEP: 470
|
||||||
|
Title: Using Multi Index Support for External to PyPI Package File Hosting
|
||||||
|
Version: $Revision$
|
||||||
|
Last-Modified: $Date$
|
||||||
|
Author: Donald Stufft <donald@stufft.io>
|
||||||
|
BDFL-Delegate: Richard Jones <richard@python.org>
|
||||||
|
Discussions-To: distutils-sig@python.org
|
||||||
|
Status: Draft
|
||||||
|
Type: Process
|
||||||
|
Content-Type: text/x-rst
|
||||||
|
Created: 12-May-2014
|
||||||
|
Post-History: 14-May-2014
|
||||||
|
|
||||||
|
|
||||||
|
Abstract
|
||||||
|
========
|
||||||
|
|
||||||
|
This PEP proposes that the official means of having an installer locate and
|
||||||
|
find package files which are hosted externally to PyPI become the use of
|
||||||
|
multi index support instead of the practice of using external links on the
|
||||||
|
simple installer API.
|
||||||
|
|
||||||
|
It is important to remember that this is **not** about forcing anyone to host
|
||||||
|
their files on PyPI. If someone does not wish to do so they will never be under
|
||||||
|
any obligation to do so. They can still list their project in PyPI as an index, and
|
||||||
|
the tooling will still allow them to host it elsewhere.
|
||||||
|
|
||||||
|
|
||||||
|
Rationale
|
||||||
|
=========
|
||||||
|
|
||||||
|
There is a long history documented in PEP 438 that explains why externally
|
||||||
|
hosted files exist today in the state that they do on PyPI. For the sake of
|
||||||
|
brevity I will not duplicate that and instead urge readers to first take a look
|
||||||
|
at PEP 438 for background.
|
||||||
|
|
||||||
|
There are currently two primary ways for a project to make itself available
|
||||||
|
without directly hosting the package files on PyPI. They can either include
|
||||||
|
links to the package files in the simple installer API or they can publish
|
||||||
|
a custom package index which contains their project.
|
||||||
|
|
||||||
|
|
||||||
|
Custom Additional Index
|
||||||
|
-----------------------
|
||||||
|
|
||||||
|
Each installer which speaks to PyPI offers a mechanism for the user invoking
|
||||||
|
that installer to provide additional custom locations to search for files
|
||||||
|
during the dependency resolution phase. For pip these locations can be
|
||||||
|
configured per invocation, per shell environment, per requirements file, per
|
||||||
|
virtual environment, and per user.
|
||||||
|
|
||||||
|
The use of additional indexes instead of external links on the simple
|
||||||
|
installer API provides a simple clean interface which is consistent with the
|
||||||
|
way most Linux package systems work (apt-get, yum, etc). More importantly it
|
||||||
|
works the same even for projects which are commercial or otherwise have their
|
||||||
|
access restricted in some form (private networks, passwords, IP ACLs, etc.)
|
||||||
|
while the external links method only realistically works for projects which
|
||||||
|
do not have their access restricted.
|
||||||
|
|
||||||
|
Compared to the complex rules which a project must be aware of to prevent
|
||||||
|
themselves from being considered unsafely hosted, setting up an index is fairly
|
||||||
|
trivial and in the simplest case does not require anything more than a
|
||||||
|
filesystem and a standard web server such as Nginx. Even if using simple
|
||||||
|
static hosting without autoindexing support, it is still straightforward
|
||||||
|
to generate appropriate index pages as static HTML.
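
As a rough illustration of the static HTML case, the sketch below walks an index root laid out as one directory per project and writes a plain ``index.html`` at each level. The function name ``write_static_index`` and the exact markup are assumptions made for this example and are not something this PEP specifies; an installer only needs pages whose anchors point at the project directories and package files.

```python
import html
from pathlib import Path


def write_static_index(root: Path) -> None:
    """Write index.html in the root and in every project directory."""
    projects = sorted(p for p in root.iterdir() if p.is_dir())

    # Root page: one link per project directory.
    root_links = "\n".join(
        f'<a href="{html.escape(p.name)}/">{html.escape(p.name)}</a><br>'
        for p in projects
    )
    (root / "index.html").write_text(
        f"<html><body>\n{root_links}\n</body></html>\n"
    )

    # Project pages: one link per package file in that directory.
    for project in projects:
        files = sorted(
            f for f in project.iterdir()
            if f.is_file() and f.name != "index.html"
        )
        file_links = "\n".join(
            f'<a href="{html.escape(f.name)}">{html.escape(f.name)}</a><br>'
            for f in files
        )
        (project / "index.html").write_text(
            f"<html><body>\n{file_links}\n</body></html>\n"
        )
```

Calling ``write_static_index(Path("/var/www/index.example.com"))`` after adding or removing files would regenerate every page; the generated pages are skipped when listing files, so the function can be re-run safely.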
|
||||||
|
|
||||||
|
Example Index with Nginx
|
||||||
|
~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
1. Create a root directory for your index, for the purposes of the example
|
||||||
|
I'll assume you've chosen ``/var/www/index.example.com/``.
|
||||||
|
2. Inside of this root directory, create a directory for each project such
|
||||||
|
as ``mkdir -p /var/www/index.example.com/{foo,bar,other}/``.
|
||||||
|
3. Place the package files for each project in their respective folder,
|
||||||
|
creating paths like ``/var/www/index.example.com/foo/foo-1.0.tar.gz``.
|
||||||
|
4. Configure nginx to serve the root directory, ideally with TLS, with the
|
||||||
|
   autoindex directive enabled (see below for example configuration).
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
|
    server {
        listen 443 ssl;
        server_name index.example.com;

        ssl_certificate /etc/pki/tls/certs/index.example.com.crt;
        ssl_certificate_key /etc/pki/tls/certs/index.example.com.key;

        root /var/www/index.example.com;

        autoindex on;
    }
|
||||||
|
|
||||||
|
|
||||||
|
Examples of Additional indexes with pip
|
||||||
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
**Invocation:**
|
||||||
|
|
||||||
|
::
|
||||||
|
$ pip install --extra-index-url https://pypi.example.com/ foobar
|
||||||
|
|
||||||
|
**Shell Environment:**
|
||||||
|
|
||||||
|
::
|
||||||
|
$ export PIP_EXTRA_INDEX_URL=https://pypi.example.com/
|
||||||
|
$ pip install foobar
|
||||||
|
|
||||||
|
**Requirements File:**
|
||||||
|
|
||||||
|
::
|
||||||
|
    $ printf '%s\n' '--extra-index-url https://pypi.example.com/' 'foobar' > requirements.txt
|
||||||
|
$ pip install -r requirements.txt
|
||||||
|
|
||||||
|
**Virtual Environment:**
|
||||||
|
|
||||||
|
::
|
||||||
|
$ python -m venv myvenv
|
||||||
|
    $ printf '%s\n' '[global]' 'extra-index-url = https://pypi.example.com/' > myvenv/pip.conf
|
||||||
|
$ myvenv/bin/pip install foobar
|
||||||
|
|
||||||
|
**User:**
|
||||||
|
|
||||||
|
::
|
||||||
|
    $ printf '%s\n' '[global]' 'extra-index-url = https://pypi.example.com/' > ~/.pip/pip.conf
|
||||||
|
$ pip install foobar
|
||||||
|
|
||||||
|
|
||||||
|
External Links on the Simple Installer API
|
||||||
|
------------------------------------------
|
||||||
|
|
||||||
|
PEP 438 proposed a system of classifying file links as either internal,
|
||||||
|
external, or unsafe. It recommended that by default only internal links would
|
||||||
|
be installed by an installer; however, users could opt in to external links on
|
||||||
|
either a global or a per package basis. Additionally they could also opt into
|
||||||
|
unsafe links on a per package basis.
|
||||||
|
|
||||||
|
This system has turned out to be *extremely* unfriendly towards the end users
|
||||||
|
and it is the position of this PEP that the situation has become untenable. The
|
||||||
|
situation as provided by PEP 438 requires an end user to be aware not only of
|
||||||
|
the difference between internal, external, and unsafe, but also to be aware of
|
||||||
|
what hosting mode the package they are trying to install is in, what links are
|
||||||
|
available on that project's /simple/ page, whether or not those links have
|
||||||
|
a properly formatted hash fragment, and what links are available from pages
|
||||||
|
linked to from that project's /simple/ page.
|
||||||
|
|
||||||
|
There are a number of common confusion/pain points with this system that I
|
||||||
|
have witnessed:
|
||||||
|
|
||||||
|
* Users unaware what the simple installer api is at all or how an installer
|
||||||
|
locates installable files.
|
||||||
|
* Users unaware that even if the simple api links to a file, if it does
|
||||||
|
  not include a ``#md5=...`` fragment it will be counted as unsafe.
|
||||||
|
* Users unaware that an installer can look at pages linked from the
|
||||||
|
simple api to determine additional links, or that any links found in this
|
||||||
|
fashion are considered unsafe.
|
||||||
|
* Users unaware, and often surprised, that PyPI supports hosting your files
|
||||||
|
someplace other than PyPI at all.
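
To make the rules behind these pain points concrete, here is a minimal sketch of the classification as this section describes it. It is a simplification and not pip's actual code: the host name in ``PYPI_HOST``, the function name ``classify_link`` and the ``scraped`` parameter are assumptions for illustration, and only the ``#md5=...`` fragment mentioned above is checked.

```python
from urllib.parse import urlsplit

# Assumption: the host that serves files uploaded to PyPI.
PYPI_HOST = "pypi.python.org"


def classify_link(url: str, scraped: bool = False) -> str:
    """Return "internal", "external" or "unsafe" for a file link.

    scraped is True when the link was found on a page linked from the
    simple API rather than on the simple API page itself.
    """
    if scraped:
        # Anything discovered by following links is treated as unsafe.
        return "unsafe"
    parts = urlsplit(url)
    if parts.hostname == PYPI_HOST:
        return "internal"
    if parts.fragment.startswith("md5="):
        # Direct link with a hash fragment: verifiable, so "external".
        return "external"
    return "unsafe"
```

Even in this reduced form the outcome depends on three separate facts about a link (where it is hosted, where it was found, and whether it carries a hash), none of which are visible to someone who simply runs an install command.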
|
||||||
|
|
||||||
|
In addition to that, the information that an installer is able to provide
|
||||||
|
when an installation fails is pretty minimal. We are able to detect if there
|
||||||
|
are externally hosted files directly linked from the simple installer api,
|
||||||
|
however we cannot detect if there are files hosted on a linked page without
|
||||||
|
fetching that page and doing so would cause a massive performance hit just to
|
||||||
|
see if there might be a file there so that a better error message could be
|
||||||
|
provided.
|
||||||
|
|
||||||
|
Finally very few projects have properly linked to their external files so that
|
||||||
|
they can be safely downloaded and verified. At the time of this writing there
|
||||||
|
are a total of 65 projects which have files that are only available externally
|
||||||
|
and are safely hosted.
|
||||||
|
|
||||||
|
The end result of all of this is that with PEP 438, when a user attempts to
|
||||||
|
install a file that is not hosted on PyPI typically the steps they follow are:
|
||||||
|
|
||||||
|
1. First, they attempt to install it normally, using ``pip install foobar``.
|
||||||
|
This fails because the file is not hosted on PyPI and PEP 438 has us default
|
||||||
|
   to only files hosted on PyPI. If pip detected any externally hosted files or other
|
||||||
|
pages that we *could* have attempted to find other files at it will give an
|
||||||
|
error message suggesting that they try ``--allow-external foobar``.
|
||||||
|
2. They then attempt to install their package using
|
||||||
|
``pip install --allow-external foobar foobar``. If they are lucky foobar is
|
||||||
|
one of the packages which is hosted externally and safely and this will
|
||||||
|
succeed. If they are unlucky they will get a different error message
|
||||||
|
suggesting that they *also* try ``--allow-unverified foobar``.
|
||||||
|
3. They then attempt to install their package using
|
||||||
|
``pip install --allow-external foobar --allow-unverified foobar foobar``
|
||||||
|
and this finally works.
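
The escalation in the steps above can be summarised as a mapping from how a project's files are classified to the command that finally succeeds. The helper below is a hypothetical sketch (``install_command`` is not part of pip); it only uses the ``--allow-external`` and ``--allow-unverified`` flags named in the steps.

```python
def install_command(project: str, classification: str) -> str:
    """Build the pip command that works for a given link classification.

    classification is one of "internal", "external" or "unsafe".
    """
    flags = []
    if classification in ("external", "unsafe"):
        # Step 2: opt in to files hosted outside of PyPI.
        flags += ["--allow-external", project]
    if classification == "unsafe":
        # Step 3: additionally opt in to files that cannot be verified.
        flags += ["--allow-unverified", project]
    return " ".join(["pip", "install", *flags, project])
```

The user, however, does not know the classification in advance, which is why the working command is usually found by trial and error.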
|
||||||
|
|
||||||
|
These are the same basic steps that practically everyone goes through every time
|
||||||
|
they try to install something that is not hosted on PyPI. If they are lucky it'll
|
||||||
|
only take them two steps, but typically it requires three steps. Worse, there is
|
||||||
|
no real indication to these people why one package might install after two
|
||||||
|
but most require three. Even worse than that most of them will never get an
|
||||||
|
externally hosted package that does not take three steps, so they will be
|
||||||
|
increasingly annoyed and frustrated at the intermediate step and will likely
|
||||||
|
eventually just start skipping it.
|
||||||
|
|
||||||
|
|
||||||
|
External Index Discovery
|
||||||
|
========================
|
||||||
|
|
||||||
|
One of the problems with using an additional index is one of discovery. Users
|
||||||
|
will not generally be aware that an additional index is required at all, much
|
||||||
|
less where that index can be found. Projects can attempt to convey this
|
||||||
|
information using their description on the PyPI page however that excludes
|
||||||
|
people who discover their project organically through ``pip search``.
|
||||||
|
|
||||||
|
To support projects that wish to externally host their files and to enable
|
||||||
|
users to easily discover what additional indexes are required, PyPI will gain
|
||||||
|
the ability for projects to register external index URLs and additionally an
|
||||||
|
associated comment for each. These URLs will be made available on the simple
|
||||||
|
page however they will not be linked or provided in a form that older
|
||||||
|
installers will automatically search them.
|
||||||
|
|
||||||
|
When an installer fetches the simple page for a project, if it finds this
|
||||||
|
additional meta-data and it cannot find any files for that project in its
|
||||||
|
configured URLs then it should use this data to tell the user how to add one
|
||||||
|
or more of the additional URLs to search in. This message should include any
|
||||||
|
comments that the project has included to enable them to communicate to the
|
||||||
|
user and provide hints as to which URL they might want if some are only
|
||||||
|
useful or compatible with certain platforms or situations.
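
A minimal sketch of such a message is shown below. This PEP does not define how the registered URLs and comments are encoded on the simple page, so the ``ExternalIndex`` type, the ``missing_files_message`` function and the wording of the message are all assumptions for illustration; only the ``--extra-index-url`` option is an existing pip flag.

```python
from typing import List, NamedTuple


class ExternalIndex(NamedTuple):
    """One registered external index (hypothetical representation)."""
    url: str
    comment: str  # free-form hint from the project, may be empty


def missing_files_message(project: str, indexes: List[ExternalIndex]) -> str:
    """Explain to the user how to add the registered external indexes."""
    lines = [f"No files were found for {project!r} in the configured indexes."]
    if indexes:
        lines.append("The project lists these additional indexes:")
        for index in indexes:
            note = f"  ({index.comment})" if index.comment else ""
            lines.append(
                f"  pip install --extra-index-url {index.url} {project}{note}"
            )
    return "\n".join(lines)
```

For example, ``missing_files_message("foobar", [ExternalIndex("https://pypi.example.com/", "Linux builds only")])`` would produce a message that names the index together with the project's own hint.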
|
||||||
|
|
||||||
|
This feature *must* be added to PyPI prior to starting the deprecation and
|
||||||
|
removal process for link spidering.
|
||||||
|
|
||||||
|
|
||||||
|
Deprecation and Removal of Link Spidering
|
||||||
|
=========================================
|
||||||
|
|
||||||
|
A new hosting mode will be added to PyPI. This hosting mode will be called
|
||||||
|
``pypi-only`` and will be in addition to the three that PEP 438 has already given
|
||||||
|
us which are ``pypi-explicit``, ``pypi-scrape``, ``pypi-scrape-crawl``. This
|
||||||
|
new hosting mode will modify a project's simple api page so that it only lists
|
||||||
|
the files which are directly hosted on PyPI and will not link to anything else.
|
||||||
|
|
||||||
|
Upon acceptance of this PEP and the addition of the ``pypi-only`` mode, all new
|
||||||
|
projects will default to the ``pypi-only`` mode and they will be locked to
|
||||||
|
this mode and unable to change this particular setting. ``pypi-only`` projects
|
||||||
|
will still be able to register external index URLs as described above - the
|
||||||
|
"pypi-only" refers only to the download links that are published directly on
|
||||||
|
PyPI.
|
||||||
|
|
||||||
|
An email will then be sent out to all of the projects which are hosted only on
|
||||||
|
PyPI informing them that in one month their project will be automatically
|
||||||
|
converted to the ``pypi-only`` mode. A month after these emails have been sent
|
||||||
|
any of those projects which were emailed, which still are hosted only on PyPI
|
||||||
|
will have their mode set to ``pypi-only``.
|
||||||
|
|
||||||
|
After that switch, an email will be sent to projects which rely on hosting
|
||||||
|
external to PyPI. This email will warn these projects that externally hosted
|
||||||
|
files have been deprecated on PyPI and that in 6 months from the time of that
|
||||||
|
email all external links will be removed from the installer APIs. This
|
||||||
|
email *must* include instructions for converting their projects to be hosted
|
||||||
|
on PyPI and *must* include links to a script or package that will enable them
|
||||||
|
to enter their PyPI credentials and package name and have it automatically
|
||||||
|
download and re-host all of their files on PyPI. This email *must also*
|
||||||
|
include instructions for setting up their own index page and registering that
|
||||||
|
with PyPI.
|
||||||
|
|
||||||
|
Five months after the initial email, another email must be sent to any projects
|
||||||
|
still relying on external hosting. This email will include all of the same
|
||||||
|
information that the first email contained, except that the removal date will
|
||||||
|
be one month away instead of six.
|
||||||
|
|
||||||
|
Finally a month later all projects will be switched to the ``pypi-only`` mode
|
||||||
|
and PyPI will be modified to remove the externally linked files functionality.
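
The timeline above can be worked through with a small date calculation. This is only a sketch under stated assumptions: the names ``add_months`` and ``deprecation_schedule`` are made up for this example, and the warning to externally hosted projects is assumed to go out on the same day as the ``pypi-only`` switch, since the text only says it happens after that switch.

```python
import calendar
from datetime import date
from typing import Dict


def add_months(start: date, months: int) -> date:
    """Return the same day `months` later, clamped to the month's end."""
    index = start.month - 1 + months
    year = start.year + index // 12
    month = index % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def deprecation_schedule(first_email: date) -> Dict[str, date]:
    """Derive the key dates from the date of the first email."""
    # Projects hosted only on PyPI are converted one month after the email.
    pypi_only_switch = add_months(first_email, 1)
    # Assumption: the external-hosting warning is sent on the switch date.
    external_warning = pypi_only_switch
    return {
        "first_email": first_email,
        "pypi_only_switch": pypi_only_switch,
        "external_warning": external_warning,
        "external_reminder": add_months(external_warning, 5),
        "external_removal": add_months(external_warning, 6),
    }
```

Under these assumptions the whole process takes seven months from the first email to the removal of external links.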
|
||||||
|
|
||||||
|
|
||||||
|
Impact
|
||||||
|
======
|
||||||
|
|
||||||
|
============ ======= ========== =======
\            PyPI    External   Total
============ ======= ========== =======
**Safe**     37779   65         37844
**Unsafe**   0       2974       2974
**Total**    37779   3039
============ ======= ========== =======
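
To put the figures in the table into proportion, the short calculation below derives the overall total and two shares from the four counts given above. The names ``COUNTS`` and ``summarize`` are illustrative only; the numbers are copied from the table.

```python
# Project counts from the table above, keyed by safety and hosting location.
COUNTS = {
    "safe": {"pypi": 37779, "external": 65},
    "unsafe": {"pypi": 0, "external": 2974},
}


def summarize(counts: dict) -> dict:
    """Compute column totals and the shares of externally hosted projects."""
    pypi = sum(row["pypi"] for row in counts.values())
    external = sum(row["external"] for row in counts.values())
    total = pypi + external
    return {
        "pypi": pypi,
        "external": external,
        "total": total,
        # Fraction of all projects that rely on external hosting.
        "external_share": external / total,
        # Fraction of externally hosted projects that are safely hosted.
        "safe_external_share": counts["safe"]["external"] / external,
    }
```

Printing ``summarize(COUNTS)`` shows that externally hosted projects are a small minority of all projects, and that safely hosted ones are in turn a small minority of those.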
|
||||||
|
|
||||||
|
|
||||||
|
Rejected Proposals
|
||||||
|
==================
|
||||||
|
|
||||||
|
Keep the current classification system but adjust the options
|
||||||
|
-------------------------------------------------------------
|
||||||
|
|
||||||
|
This PEP rejects several related proposals which attempt to fix some of the
|
||||||
|
usability problems with the current system while still keeping the
|
||||||
|
general gist of PEP 438.
|
||||||
|
|
||||||
|
This includes:
|
||||||
|
|
||||||
|
* Default to allowing safely externally hosted files, but disallow unsafely
|
||||||
|
hosted.
|
||||||
|
* Default to disallowing safely externally hosted files with only a global
|
||||||
|
flag to enable them, but disallow unsafely hosted.
|
||||||
|
|
||||||
|
These proposals are rejected because:
|
||||||
|
|
||||||
|
* The classification "system" is complex, hard to explain, and requires an
|
||||||
|
intimate knowledge of how the simple API works in order to be able to reason
|
||||||
|
about which classification is required. This is reflected in the fact that
|
||||||
|
the code to implement it is complicated and hard to understand as well.
|
||||||
|
|
||||||
|
* People are generally surprised that PyPI allows externally linking to files
|
||||||
|
and doesn't require people to host on PyPI. In contrast most of them are
|
||||||
|
familiar with the concept of multiple software repositories such as is in
|
||||||
|
use by many OSs.
|
||||||
|
|
||||||
|
* PyPI is fronted by a globally distributed CDN which has improved the
|
||||||
|
reliability and speed for end users. It is unlikely that any particular
|
||||||
|
external host has something comparable. This can lead to extremely bad
|
||||||
|
performance for end users when the external host is located in different
|
||||||
|
parts of the world or does not generally have good connectivity.
|
||||||
|
|
||||||
|
  As a data point, many users reported sub-DSL speeds and latency when
|
||||||
|
accessing PyPI from parts of Europe and Asia prior to the use of the CDN.
|
||||||
|
|
||||||
|
* PyPI has monitoring and an on-call rotation of sysadmins who can respond to
  downtime quickly. Again, it is
|
||||||
|
unlikely that any particular external host will have this. This can lead
|
||||||
|
to single packages in a dependency chain being un-installable. This will
|
||||||
|
often confuse users, who often times have no idea that this package relies
|
||||||
|
on an external host, and they cannot figure out why PyPI appears to be up
|
||||||
|
but the installer cannot find a package.
|
||||||
|
|
||||||
|
* PyPI supports mirroring, both for private organizations and public mirrors.
|
||||||
|
The legal terms of uploading to PyPI ensure that mirror operators, both
|
||||||
|
public and private, have the right to distribute the software found on PyPI.
|
||||||
|
However software that is hosted externally does not have this, causing
|
||||||
|
private organizations to need to investigate each package individually and
|
||||||
|
manually to determine if the license allows them to mirror it.
|
||||||
|
|
||||||
|
For public mirrors this essentially means that these externally hosted
|
||||||
|
packages *cannot* be reasonably mirrored. This is particularly troublesome
|
||||||
|
in countries such as China where the bandwidth to outside of China is
|
||||||
|
highly congested making a mirror within China often times a massively better
|
||||||
|
experience.
|
||||||
|
|
||||||
|
* Installers have no method to determine if they should expect any particular
|
||||||
|
URL to be available or not. It is not unusual for the simple API to reference
|
||||||
|
old packages and URLs which have long since stopped working. This causes
|
||||||
|
installers to have to assume that it is OK for any particular URL to not be
|
||||||
|
  accessible. This causes problems where a URL is temporarily down or
|
||||||
|
otherwise unavailable (a common cause of this is using a copy of Python
|
||||||
|
linked against a really ancient copy of OpenSSL which is unable to verify
|
||||||
|
the SSL certificate on PyPI) but it *should* be expected to be up. In this
|
||||||
|
case installers will typically silently ignore this URL and later the user
|
||||||
|
will get a confusing error stating that the installer couldn't find any
|
||||||
|
versions instead of getting the real error message indicating that the URL
|
||||||
|
was unavailable.
|
||||||
|
|
||||||
|
* In the long run, global opt in flags like ``--allow-all-external`` will
|
||||||
|
become little annoyances that developers cargo cult around in order to make
|
||||||
|
their installer work. When they run into a project that requires it they
|
||||||
|
will most likely simply add it to their configuration file for that installer
|
||||||
|
and continue on with whatever they were actually trying to do. This will
|
||||||
|
continue until they try to install their requirements on another computer
|
||||||
|
or attempt to deploy to a server where their install will fail again until
|
||||||
|
they add the "make it work" flag in their configuration file.
|
||||||
|
|
||||||
|
|
||||||
|
Copyright
|
||||||
|
=========
|
||||||
|
|
||||||
|
This document has been placed in the public domain.
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
..
|
||||||
|
Local Variables:
|
||||||
|
mode: indented-text
|
||||||
|
indent-tabs-mode: nil
|
||||||
|
sentence-end-double-space: t
|
||||||
|
fill-column: 70
|
||||||
|
coding: utf-8
|
||||||
|
End:
|