PEP 470: use external indexes over link spidering
parent 65da967dc2
commit 08ad02e889

@ -0,0 +1,377 @@
PEP: 470
Title: Using Multi Index Support for External to PyPI Package File Hosting
Version: $Revision$
Last-Modified: $Date$
Author: Donald Stufft <donald@stufft.io>
BDFL-Delegate: Richard Jones <richard@python.org>
Discussions-To: distutils-sig@python.org
Status: Draft
Type: Process
Content-Type: text/x-rst
Created: 12-May-2014
Post-History: 14-May-2014
|
||||
|
||||
|
||||
Abstract
|
||||
========
|
||||
|
||||
This PEP proposes that the official means of having an installer locate and
|
||||
find package files which are hosted externally to PyPI become the use of
|
||||
multi index support instead of the practice of using external links on the
|
||||
simple installer API.
|
||||
|
||||
It is important to remember that this is **not** about forcing anyone to host
their files on PyPI. If someone does not wish to do so they will never be under
any obligation to. They can still list their project in PyPI as an index, and
the tooling will still allow them to host it elsewhere.
|
||||
|
||||
|
||||
Rationale
|
||||
=========
|
||||
|
||||
There is a long history documented in PEP 438 that explains why externally
|
||||
hosted files exist today in the state that they do on PyPI. For the sake of
|
||||
brevity I will not duplicate that and instead urge readers to first take a look
|
||||
at PEP 438 for background.
|
||||
|
||||
There are currently two primary ways for a project to make itself available
without directly hosting the package files on PyPI. They can either include
links to the package files in the simple installer API or they can publish
a custom package index which contains their project.
|
||||
|
||||
|
||||
Custom Additional Index
|
||||
-----------------------
|
||||
|
||||
Each installer which speaks to PyPI offers a mechanism for the user invoking
|
||||
that installer to provide additional custom locations to search for files
|
||||
during the dependency resolution phase. For pip these locations can be
|
||||
configured per invocation, per shell environment, per requirements file, per
|
||||
virtual environment, and per user.
|
||||
|
||||
The use of additional indexes instead of external links on the simple
installer API provides a simple, clean interface which is consistent with the
way most Linux package systems work (apt-get, yum, etc.). More importantly, it
works the same even for projects which are commercial or otherwise have their
access restricted in some form (private networks, passwords, IP ACLs, etc.),
while the external links method only realistically works for projects which
do not have their access restricted.
|
||||
|
||||
Compared to the complex rules which a project must be aware of to avoid
being considered unsafely hosted, setting up an index is fairly trivial and
in the simplest case does not require anything more than a filesystem and a
standard web server such as Nginx. Even if using simple static hosting
without autoindexing support, it is still straightforward to generate
appropriate index pages as static HTML.
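
As a rough illustration of the static HTML case, the following is a minimal
sketch that writes an ``index.html`` into an index root and into each project
directory below it. The function names ``write_static_index`` and
``_write_page`` are hypothetical, the layout assumed is one directory per
project as in the Nginx example, and the generated markup is only one
plausible shape for such pages, not something this PEP specifies.

```python
import html
import os


def _write_page(directory, title, hrefs):
    """Write a bare-bones index.html into ``directory`` linking to ``hrefs``."""
    links = "\n".join(
        '<a href="{0}">{0}</a><br/>'.format(html.escape(href, quote=True))
        for href in hrefs
    )
    page = (
        "<!DOCTYPE html>\n"
        "<html><head><title>{}</title></head>\n"
        "<body>\n{}\n</body></html>\n"
    ).format(html.escape(title), links)
    with open(os.path.join(directory, "index.html"), "w", encoding="utf-8") as fh:
        fh.write(page)


def write_static_index(root):
    """Generate index pages for ``root`` and return the project names found."""
    projects = sorted(
        name for name in os.listdir(root)
        if os.path.isdir(os.path.join(root, name))
    )
    # Top-level page: one link per project directory.
    _write_page(root, "Simple Index", [name + "/" for name in projects])
    for name in projects:
        project_dir = os.path.join(root, name)
        files = sorted(
            entry for entry in os.listdir(project_dir)
            if entry != "index.html"
            and os.path.isfile(os.path.join(project_dir, entry))
        )
        # Project page: one link per package file.
        _write_page(project_dir, "Links for " + name, files)
    return projects
```

Running ``write_static_index("/var/www/index.example.com")`` after each upload
would keep the pages current; the resulting tree can then be served by any
static file host.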
|
||||
|
||||
Example Index with Nginx
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
1. Create a root directory for your index. For the purposes of the example
   I'll assume you've chosen ``/var/www/index.example.com/``.
2. Inside of this root directory, create a directory for each project, such
   as ``mkdir -p /var/www/index.example.com/{foo,bar,other}/``.
3. Place the package files for each project in their respective folder,
   creating paths like ``/var/www/index.example.com/foo/foo-1.0.tar.gz``.
4. Configure nginx to serve the root directory, ideally with TLS, with the
   autoindex directive enabled (see below for example configuration).
|
||||
|
||||
::

    server {
        listen 443 ssl;
        server_name index.example.com;

        ssl_certificate /etc/pki/tls/certs/index.example.com.crt;
        ssl_certificate_key /etc/pki/tls/certs/index.example.com.key;

        root /var/www/index.example.com;

        autoindex on;
    }
|
||||
|
||||
|
||||
Examples of Additional indexes with pip
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
**Invocation:**
|
||||
|
||||
::

    $ pip install --extra-index-url https://pypi.example.com/ foobar
|
||||
|
||||
**Shell Environment:**
|
||||
|
||||
::

    $ export PIP_EXTRA_INDEX_URL=https://pypi.example.com/
    $ pip install foobar
|
||||
|
||||
**Requirements File:**
|
||||
|
||||
::

    $ printf '%s\n' '--extra-index-url https://pypi.example.com/' 'foobar' > requirements.txt
    $ pip install -r requirements.txt
|
||||
|
||||
**Virtual Environment:**
|
||||
|
||||
::

    $ python -m venv myvenv
    $ printf '%s\n' '[global]' 'extra-index-url = https://pypi.example.com/' > myvenv/pip.conf
    $ myvenv/bin/pip install foobar
|
||||
|
||||
**User:**
|
||||
|
||||
::

    $ mkdir -p ~/.pip
    $ printf '%s\n' '[global]' 'extra-index-url = https://pypi.example.com/' > ~/.pip/pip.conf
    $ pip install foobar
|
||||
|
||||
|
||||
External Links on the Simple Installer API
|
||||
------------------------------------------
|
||||
|
||||
PEP 438 proposed a system of classifying file links as either internal,
external, or unsafe. It recommended that by default only internal links would
be installed by an installer; however, users could opt into external links on
either a global or a per package basis. Additionally, they could also opt into
unsafe links on a per package basis.
|
||||
|
||||
This system has turned out to be *extremely* unfriendly towards the end users
and it is the position of this PEP that the situation has become untenable. The
situation as provided by PEP 438 requires an end user to be aware not only of
the difference between internal, external, and unsafe, but also to be aware of
what hosting mode the package they are trying to install is in, what links are
available on that project's /simple/ page, whether or not those links have
a properly formatted hash fragment, and what links are available from pages
linked to from that project's /simple/ page.
|
||||
|
||||
There are a number of common confusion/pain points with this system that I
|
||||
have witnessed:
|
||||
|
||||
* Users are unaware of what the simple installer API is at all or how an
  installer locates installable files.
* Users are unaware that even if the simple API links to a file, if it does
  not include a ``#md5=...`` fragment it will be counted as unsafe.
* Users are unaware that an installer can look at pages linked from the
  simple API to determine additional links, or that any links found in this
  fashion are considered unsafe.
* Users are unaware and often surprised that PyPI supports hosting your files
  someplace other than PyPI at all.
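
To make the ``#md5=...`` point above concrete, here is a minimal sketch of
how a link's hash fragment could be read and checked against downloaded
bytes. The helper names ``expected_md5`` and ``verify_download`` are
hypothetical and this is not pip's actual implementation; it only
illustrates why a link without a usable fragment cannot be verified.

```python
import hashlib
from urllib.parse import urldefrag


def expected_md5(link):
    """Return the hex digest from a ``#md5=...`` fragment, or None if absent."""
    _, fragment = urldefrag(link)
    name, _, value = fragment.partition("=")
    if name == "md5" and value:
        return value.lower()
    return None


def verify_download(link, data):
    """Return True only if ``link`` carries an md5 fragment matching ``data``."""
    digest = expected_md5(link)
    if digest is None:
        # No usable fragment: there is nothing to verify the download against,
        # which is why such links are counted as unsafe.
        return False
    return hashlib.md5(data).hexdigest() == digest
```

A link such as ``https://example.com/foo-1.0.tar.gz`` (a placeholder URL)
therefore fails the check no matter what is downloaded, while the same URL
with a correct fragment passes.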
|
||||
|
||||
In addition to that, the information that an installer is able to provide
when an installation fails is pretty minimal. We are able to detect if there
are externally hosted files directly linked from the simple installer API;
however, we cannot detect if there are files hosted on a linked page without
fetching that page, and doing so would cause a massive performance hit just to
see if there might be a file there so that a better error message could be
provided.
|
||||
|
||||
Finally, very few projects have properly linked to their external files so that
they can be safely downloaded and verified. At the time of this writing there
are a total of 65 projects which have files that are only available externally
and are safely hosted.
|
||||
|
||||
The end result of all of this is that with PEP 438, when a user attempts to
install a file that is not hosted on PyPI, the steps they typically follow are:

1. First, they attempt to install it normally, using ``pip install foobar``.
   This fails because the file is not hosted on PyPI and PEP 438 has us default
   to only files hosted on PyPI. If pip detected any externally hosted files or
   other pages that we *could* have attempted to find other files at, it will
   give an error message suggesting that they try ``--allow-external foobar``.
2. They then attempt to install their package using
   ``pip install --allow-external foobar foobar``. If they are lucky, foobar is
   one of the packages which is hosted externally and safely and this will
   succeed. If they are unlucky, they will get a different error message
   suggesting that they *also* try ``--allow-unverified foobar``.
3. They then attempt to install their package using
   ``pip install --allow-external foobar --allow-unverified foobar foobar``
   and this finally works.
|
||||
|
||||
These are the same basic steps that practically everyone goes through every
time they try to install something that is not hosted on PyPI. If they are
lucky it'll only take them two steps, but typically it requires three. Worse,
there is no real indication to these people why one package might install
after two steps but most require three. Even worse than that, most of them
will never get an externally hosted package that does not take three steps, so
they will be increasingly annoyed and frustrated at the intermediate step and
will likely eventually just start skipping it.
|
||||
|
||||
|
||||
External Index Discovery
|
||||
========================
|
||||
|
||||
One of the problems with using an additional index is one of discovery. Users
will not generally be aware that an additional index is required at all, much
less where that index can be found. Projects can attempt to convey this
information using their description on the PyPI page; however, that excludes
people who discover their project organically through ``pip search``.
|
||||
|
||||
To support projects that wish to externally host their files and to enable
users to easily discover what additional indexes are required, PyPI will gain
the ability for projects to register external index URLs and additionally an
associated comment for each. These URLs will be made available on the simple
page; however, they will not be linked or provided in a form that older
installers will automatically search.
|
||||
|
||||
When an installer fetches the simple page for a project, if it finds this
additional metadata and it cannot find any files for that project in its
configured URLs, then it should use this data to tell the user how to add one
or more of the additional URLs to search in. This message should include any
comments that the project has included to enable them to communicate to the
user and provide hints as to which URL they might want if some are only
useful or compatible with certain platforms or situations.
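
The installer behaviour described above might look roughly like the
following sketch. The function name ``external_index_hint`` and the
``(url, comment)`` pair representation are assumptions made purely for
illustration; how the registered URLs and comments are actually exposed on
the simple page is not defined here. Only ``--extra-index-url`` is an
existing pip option.

```python
def external_index_hint(project, external_indexes):
    """Build a message for a project with no files in the configured indexes.

    ``external_indexes`` is a list of ``(url, comment)`` pairs. This shape is
    an assumption for the example, not a format defined by the PEP.
    """
    if not external_indexes:
        return "No files were found for {}.".format(project)
    lines = [
        "No files were found for {} in the configured indexes.".format(project),
        "The project has registered the following external indexes:",
    ]
    for url, comment in external_indexes:
        lines.append("  {}".format(url))
        if comment:
            # Pass the project's own comment through so it can steer the user
            # towards the right index for their platform or situation.
            lines.append("    note: {}".format(comment))
        lines.append(
            "    try: pip install --extra-index-url {} {}".format(url, project)
        )
    return "\n".join(lines)
```

The important property is that the installer only *reports* the registered
URLs; it never searches them unless the user explicitly adds one.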
|
||||
|
||||
This feature *must* be added to PyPI prior to starting the deprecation and
|
||||
removal process for link spidering.
|
||||
|
||||
|
||||
Deprecation and Removal of Link Spidering
|
||||
=========================================
|
||||
|
||||
A new hosting mode will be added to PyPI. This hosting mode will be called
``pypi-only`` and will be in addition to the three that PEP 438 has already
given us, which are ``pypi-explicit``, ``pypi-scrape``, and
``pypi-scrape-crawl``. This new hosting mode will modify a project's simple
API page so that it only lists the files which are directly hosted on PyPI and
will not link to anything else.
|
||||
|
||||
Upon acceptance of this PEP and the addition of the ``pypi-only`` mode, all new
projects will be defaulted to the PyPI only mode and they will be locked to
this mode and unable to change this particular setting. ``pypi-only`` projects
will still be able to register external index URLs as described above - the
"pypi-only" refers only to the download links that are published directly on
PyPI.
|
||||
|
||||
An email will then be sent out to all of the projects which are hosted only on
PyPI informing them that in one month their project will be automatically
converted to the ``pypi-only`` mode. A month after these emails have been sent,
any of the emailed projects which are still hosted only on PyPI will have
their mode set to ``pypi-only``.
|
||||
|
||||
After that switch, an email will be sent to projects which rely on hosting
external to PyPI. This email will warn these projects that externally hosted
files have been deprecated on PyPI and that 6 months from the time of that
email all external links will be removed from the installer APIs. This
email *must* include instructions for converting their projects to be hosted
on PyPI and *must* include links to a script or package that will enable them
to enter their PyPI credentials and package name and have it automatically
download and re-host all of their files on PyPI. This email *must also*
include instructions for setting up their own index page and registering that
with PyPI.
|
||||
|
||||
Five months after the initial email, another email must be sent to any projects
|
||||
still relying on external hosting. This email will include all of the same
|
||||
information that the first email contained, except that the removal date will
|
||||
be one month away instead of six.
|
||||
|
||||
Finally, a month later all projects will be switched to the ``pypi-only`` mode
and PyPI will be modified to remove the externally linked files functionality.
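
The schedule above can be worked through with a small sketch. The helper
names ``add_months`` and ``deprecation_timeline`` are hypothetical, and the
sketch assumes that the email to externally hosted projects goes out on the
same day as the switch of the PyPI-only projects, which the text leaves
open. Any start date passed in is likewise only an example.

```python
import datetime


def add_months(day, months):
    """Return ``day`` shifted by ``months`` calendar months.

    The day of the month is clamped to 28 so the result is always valid.
    """
    index = day.year * 12 + (day.month - 1) + months
    return datetime.date(index // 12, index % 12 + 1, min(day.day, 28))


def deprecation_timeline(first_email):
    """List ``(milestone, date)`` pairs counted from the first email."""
    # Assumption: the notice to externally hosted projects is sent on the
    # same day the PyPI-only projects are converted.
    external_notice = add_months(first_email, 1)
    return [
        ("email projects hosted only on PyPI", first_email),
        ("convert those projects; email externally hosted projects",
         external_notice),
        ("send reminder to externally hosted projects",
         add_months(external_notice, 5)),
        ("switch all projects to pypi-only and remove external links",
         add_months(external_notice, 6)),
    ]
```

Under these assumptions the whole process spans seven months from the first
email to the removal of external links.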
|
||||
|
||||
|
||||
Impact
|
||||
======
|
||||
|
||||
============ ======= ========== =======
\            PyPI    External   Total
============ ======= ========== =======
**Safe**     37779   65         37844
**Unsafe**   0       2974       2974
**Total**    37779   3039       40818
============ ======= ========== =======
|
||||
|
||||
|
||||
Rejected Proposals
|
||||
==================
|
||||
|
||||
Keep the current classification system but adjust the options
|
||||
-------------------------------------------------------------
|
||||
|
||||
This PEP rejects several related proposals which attempt to fix some of the
usability problems with the current system while still keeping the
general gist of PEP 438.
|
||||
|
||||
This includes:
|
||||
|
||||
* Default to allowing safely externally hosted files, but disallow unsafely
|
||||
hosted.
|
||||
* Default to disallowing safely externally hosted files with only a global
|
||||
flag to enable them, but disallow unsafely hosted.
|
||||
|
||||
These proposals are rejected because:
|
||||
|
||||
* The classification "system" is complex, hard to explain, and requires an
|
||||
intimate knowledge of how the simple API works in order to be able to reason
|
||||
about which classification is required. This is reflected in the fact that
|
||||
the code to implement it is complicated and hard to understand as well.
|
||||
|
||||
* People are generally surprised that PyPI allows externally linking to files
  and doesn't require people to host on PyPI. In contrast, most of them are
  familiar with the concept of multiple software repositories, such as those
  used by many operating systems.
|
||||
|
||||
* PyPI is fronted by a globally distributed CDN which has improved the
|
||||
reliability and speed for end users. It is unlikely that any particular
|
||||
external host has something comparable. This can lead to extremely bad
|
||||
performance for end users when the external host is located in different
|
||||
parts of the world or does not generally have good connectivity.
|
||||
|
||||
  As a data point, many users reported sub-DSL speeds and high latency when
  accessing PyPI from parts of Europe and Asia prior to the use of the CDN.
|
||||
|
||||
* PyPI has monitoring and an on-call rotation of sysadmins who can respond to
  downtime quickly. Again, it is unlikely that any particular external host
  will have this. This can lead to single packages in a dependency chain being
  un-installable. This will often confuse users, who often have no idea that
  this package relies on an external host, and they cannot figure out why PyPI
  appears to be up but the installer cannot find a package.
|
||||
|
||||
* PyPI supports mirroring, both for private organizations and public mirrors.
|
||||
The legal terms of uploading to PyPI ensure that mirror operators, both
|
||||
public and private, have the right to distribute the software found on PyPI.
|
||||
However software that is hosted externally does not have this, causing
|
||||
private organizations to need to investigate each package individually and
|
||||
manually to determine if the license allows them to mirror it.
|
||||
|
||||
  For public mirrors this essentially means that these externally hosted
  packages *cannot* be reasonably mirrored. This is particularly troublesome
  in countries such as China where the bandwidth to outside of China is
  highly congested, often making a mirror within China a massively better
  experience.
|
||||
|
||||
* Installers have no method to determine if they should expect any particular
  URL to be available or not. It is not unusual for the simple API to reference
  old packages and URLs which have long since stopped working. This causes
  installers to have to assume that it is OK for any particular URL to not be
  accessible. This causes problems where a URL is temporarily down or
  otherwise unavailable (a common cause of this is using a copy of Python
  linked against a really ancient copy of OpenSSL which is unable to verify
  the SSL certificate on PyPI) but it *should* be expected to be up. In this
  case installers will typically silently ignore this URL and later the user
  will get a confusing error stating that the installer couldn't find any
  versions instead of getting the real error message indicating that the URL
  was unavailable.
|
||||
|
||||
* In the long run, global opt-in flags like ``--allow-all-external`` will
  become little annoyances that developers cargo cult around in order to make
  their installer work. When they run into a project that requires it they
  will most likely simply add it to their configuration file for that installer
  and continue on with whatever they were actually trying to do. This will
  continue until they try to install their requirements on another computer
  or attempt to deploy to a server where their install will fail again until
  they add the "make it work" flag in their configuration file.
|
||||
|
||||
|
||||
Copyright
|
||||
=========
|
||||
|
||||
This document has been placed in the public domain.
|
||||
|
||||
|
||||
|
||||
..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End:
|