488 lines
22 KiB
Plaintext
488 lines
22 KiB
Plaintext
PEP: 470
|
||
Title: Using Multi Index Support for External to PyPI Package File Hosting
|
||
Version: $Revision$
|
||
Last-Modified: $Date$
|
||
Author: Donald Stufft <donald@stufft.io>,
|
||
BDFL-Delegate: Richard Jones <richard@python.org>
|
||
Discussions-To: distutils-sig@python.org
|
||
Status: Draft
|
||
Type: Process
|
||
Content-Type: text/x-rst
|
||
Created: 12-May-2014
|
||
Post-History: 14-May-2014, 05-Jun-2014
|
||
|
||
|
||
Abstract
|
||
========
|
||
|
||
This PEP proposes that the official means of having an installer locate and
|
||
find package files which are hosted externally to PyPI become the use of
|
||
multi index support instead of the practice of using external links on the
|
||
simple installer API.
|
||
|
||
It is important to remember that this is **not** about forcing anyone to host
|
||
their files on PyPI. If someone does not wish to do so they will never be under
|
||
any obligation too. They can still list their project in PyPI as an index, and
|
||
the tooling will still allow them to host it elsewhere.
|
||
|
||
This PEP strictly is concerned with the Simple Installer API and how automated
|
||
installers interact with PyPI, it has no bearing on the informational pages
|
||
which are primarily for human consumption.
|
||
|
||
|
||
Rationale
|
||
=========
|
||
|
||
There is a long history documented in PEP 438 that explains why externally
|
||
hosted files exist today in the state that they do on PyPI. For the sake of
|
||
brevity I will not duplicate that and instead urge readers to first take a look
|
||
at PEP 438 for background.
|
||
|
||
There are currently two primary ways for a project to make itself available
|
||
without directly hosting the package files on PyPI. They can either include
|
||
links to the package files in the simpler installer API or they can publish
|
||
a custom package index which contains their project.
|
||
|
||
|
||
Custom Additional Index
|
||
-----------------------
|
||
|
||
Each installer which speaks to PyPI offers a mechanism for the user invoking
|
||
that installer to provide additional custom locations to search for files
|
||
during the dependency resolution phase. For pip these locations can be
|
||
configured per invocation, per shell environment, per requirements file, per
|
||
virtual environment, and per user. The mechanism for specifying additional
|
||
locations have existed within pip and setuptools for many years, by comparison
|
||
the mechanisms in PEP 438 and any other new mechanism will have existed for
|
||
only a short period of time (if they exist at all currently).
|
||
|
||
The use of additional indexes instead of external links on the simple
|
||
installer API provides a simple clean interface which is consistent with the
|
||
way most Linux package systems work (apt-get, yum, etc). More importantly it
|
||
works the same even for projects which are commercial or otherwise have their
|
||
access restricted in some form (private networks, password, IP ACLs etc)
|
||
while the external links method only realistically works for projects which
|
||
do not have their access restricted.
|
||
|
||
Compared to the complex rules which a project must be aware of to prevent
|
||
themselves from being considered unsafely hosted setting up an index is fairly
|
||
trivial and in the simplest case does not require anything more than a
|
||
filesystem and a standard web server such as Nginx or Twisted Web. Even if
|
||
using simple static hosting without autoindexing support, it is still
|
||
straightforward to generate appropriate index pages as static HTML.
|
||
|
||
Example Index with Twisted Web
|
||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||
|
||
1. Create a root directory for your index, for the purposes of the example
|
||
I'll assume you've chosen ``/var/www/index.example.com/``.
|
||
2. Inside of this root directory, create a directory for each project such
|
||
as ``mkdir -p /var/www/index.example.com/{foo,bar,other}/``.
|
||
3. Place the package files for each project in their respective folder,
|
||
creating paths like ``/var/www/index.example.com/foo/foo-1.0.tar.gz``.
|
||
4. Configure Twisted Web to serve the root directory, ideally with TLS.
|
||
|
||
::
|
||
|
||
$ twistd -n web --path /var/www/index.example.com/
|
||
|
||
|
||
Examples of Additional indexes with pip
|
||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||
|
||
**Invocation:**
|
||
|
||
::
|
||
|
||
$ pip install --extra-index-url https://pypi.example.com/ foobar
|
||
|
||
**Shell Environment:**
|
||
|
||
::
|
||
|
||
$ export PIP_EXTRA_INDEX_URL=https://pypi.example.com/
|
||
$ pip install foobar
|
||
|
||
**Requirements File:**
|
||
|
||
::
|
||
|
||
$ echo "--extra-index-url https://pypi.example.com/\nfoobar" > requirements.txt
|
||
$ pip install -r requirements.txt
|
||
|
||
**Virtual Environment:**
|
||
|
||
::
|
||
|
||
$ python -m venv myvenv
|
||
$ echo "[global]\nextra-index-url = https://pypi.example.com/" > myvenv/pip.conf
|
||
$ myvenv/bin/pip install foobar
|
||
|
||
**User:**
|
||
|
||
::
|
||
|
||
$ echo "[global]\nextra-index-url = https://pypi.example.com/" >~/.pip/pip.conf
|
||
$ pip install foobar
|
||
|
||
|
||
External Links on the Simple Installer API
|
||
------------------------------------------
|
||
|
||
PEP 438 proposed a system of classifying file links as either internal,
|
||
external, or unsafe. It recommended that by default only internal links would
|
||
be installed by an installer however users could opt into external links on
|
||
either a global or a per package basis. Additionally they could also opt into
|
||
unsafe links on a per package basis.
|
||
|
||
This system has turned out to be *extremely* unfriendly towards the end users
|
||
and it is the position of this PEP that the situation has become untenable. The
|
||
situation as provided by PEP 438 requires an end user to be aware not only of
|
||
the difference between internal, external, and unsafe, but also to be aware of
|
||
what hosting mode the package they are trying to install is in, what links are
|
||
available on that project's /simple/ page, whether or not those links have
|
||
a properly formatted hash fragment, and what links are available from pages
|
||
linked to from that project's /simple/ page.
|
||
|
||
There are a number of common confusion/pain points with this system that I
|
||
have witnessed:
|
||
|
||
* Users unaware what the simple installer api is at all or how an installer
|
||
locates installable files.
|
||
* Users unaware that even if the simple api links to a file, if it does
|
||
not include a ``#md5=...`` fragment that it will be counted as unsafe.
|
||
* Users unaware that an installer can look at pages linked from the
|
||
simple api to determine additional links, or that any links found in this
|
||
fashion are considered unsafe.
|
||
* Users are unaware and often surprised that PyPI supports hosting your files
|
||
someplace other than PyPI at all.
|
||
|
||
In addition to that, the information that an installer is able to provide
|
||
when an installation fails is pretty minimal. We are able to detect if there
|
||
are externally hosted files directly linked from the simple installer api,
|
||
however we cannot detect if there are files hosted on a linked page without
|
||
fetching that page and doing so would cause a massive performance hit just to
|
||
see if there might be a file there so that a better error message could be
|
||
provided.
|
||
|
||
Finally very few projects have properly linked to their external files so that
|
||
they can be safely downloaded and verified. At the time of this writing there
|
||
are a total of 65 projects which have files that are only available externally
|
||
and are safely hosted.
|
||
|
||
The end result of all of this, is that with PEP 438, when a user attempts to
|
||
install a file that is not hosted on PyPI typically the steps they follow are:
|
||
|
||
1. First, they attempt to install it normally, using ``pip install foobar``.
|
||
This fails because the file is not hosted on PyPI and PEP 438 has us default
|
||
to only hosted on PyPI. If pip detected any externally hosted files or other
|
||
pages that we *could* have attempted to find other files at it will give an
|
||
error message suggesting that they try ``--allow-external foobar``.
|
||
2. They then attempt to install their package using
|
||
``pip install --allow-external foobar foobar``. If they are lucky foobar is
|
||
one of the packages which is hosted externally and safely and this will
|
||
succeed. If they are unlucky they will get a different error message
|
||
suggesting that they *also* try ``--allow-unverified foobar``.
|
||
3. They then attempt to install their package using
|
||
``pip install --allow-external foobar --allow-unverified foobar foobar``
|
||
and this finally works.
|
||
|
||
This is the same basic steps that practically everyone goes through every time
|
||
they try to install something that is not hosted on PyPI. If they are lucky it'll
|
||
only take them two steps, but typically it requires three steps. Worse there is
|
||
no real indication to these people why one package might install after two
|
||
but most require three. Even worse than that most of them will never get an
|
||
externally hosted package that does not take three steps, so they will be
|
||
increasingly annoyed and frustrated at the intermediate step and will likely
|
||
eventually just start skipping it.
|
||
|
||
|
||
External Index Discovery
|
||
========================
|
||
|
||
One of the problems with using an additional index is one of discovery. Users
|
||
will not generally be aware that an additional index is required at all much
|
||
less where that index can be found. Projects can attempt to convey this
|
||
information using their description on the PyPI page however that excludes
|
||
people who discover their project organically through ``pip search``.
|
||
|
||
To support projects that wish to externally host their files and to enable
|
||
users to easily discover what additional indexes are required, PyPI will gain
|
||
the ability for projects to register external index URLs and additionally an
|
||
associated comment for each. These URLs will be made available on the simple
|
||
page however they will not be linked or provided in a form that older
|
||
installers will automatically search them.
|
||
|
||
When an installer fetches the simple page for a project, if it finds this
|
||
additional meta-data and it cannot find any files for that project in it's
|
||
configured URLs then it should use this data to tell the user how to add one
|
||
or more of the additional URLs to search in. This message should include any
|
||
comments that the project has included to enable them to communicate to the
|
||
user and provide hints as to which URL they might want if some are only
|
||
useful or compatible with certain platforms or situations. When the installer
|
||
has implemented the auto discovery mechanisms they should also deprecate any
|
||
of the mechanisms added for PEP 438 (such as ``--allow-external``) for removal
|
||
at the end of the deprecation period proposed by the PEP.
|
||
|
||
This feature *must* be added to PyPI prior to starting the deprecation and
|
||
removal process for link spidering.
|
||
|
||
|
||
Deprecation and Removal of Link Spidering
|
||
=========================================
|
||
|
||
A new hosting mode will be added to PyPI. This hosting mode will be called
|
||
``pypi-only`` and will be in addition to the three that PEP 438 has already
|
||
given us which are ``pypi-explicit``, ``pypi-scrape``, ``pypi-scrape-crawl``.
|
||
This new hosting mode will modify a project's simple api page so that it only
|
||
lists the files which are directly hosted on PyPI and will not link to anything
|
||
else.
|
||
|
||
Upon acceptance of this PEP and the addition of the ``pypi-only`` mode, all new
|
||
projects will by defaulted to the PyPI only mode and they will be locked to
|
||
this mode and unable to change this particular setting. ``pypi-only`` projects
|
||
will still be able to register external index URLs as described above - the
|
||
"pypi-only" refers only to the download links that are published directly on
|
||
PyPI.
|
||
|
||
An email will then be sent out to all of the projects which are hosted only on
|
||
PyPI informing them that in one month their project will be automatically
|
||
converted to the ``pypi-only`` mode. A month after these emails have been sent
|
||
any of those projects which were emailed, which still are hosted only on PyPI
|
||
will have their mode set to ``pypi-only``.
|
||
|
||
After that switch, an email will be sent to projects which rely on hosting
|
||
external to PyPI. This email will warn these projects that externally hosted
|
||
files have been deprecated on PyPI and that in 6 months from the time of that
|
||
email that all external links will be removed from the installer APIs. This
|
||
email *must* include instructions for converting their projects to be hosted
|
||
on PyPI and *must* include links to a script or package that will enable them
|
||
to enter their PyPI credentials and package name and have it automatically
|
||
download and re-host all of their files on PyPI. This email *must also*
|
||
include instructions for setting up their own index page and registering that
|
||
with PyPI, including the fact that they can use pythonhosted.org as a host
|
||
for an index page without requiring them to host any additional infrastructure
|
||
or purchase a TLS certificate. This email must also contain a link to the Terms
|
||
of Service for PyPI as many users may have signed up a long time ago and may
|
||
not recall what those terms are.
|
||
|
||
Five months after the initial email, another email must be sent to any projects
|
||
still relying on external hosting. This email will include all of the same
|
||
information that the first email contained, except that the removal date will
|
||
be one month away instead of six.
|
||
|
||
Finally a month later all projects will be switched to the ``pypi-only`` mode
|
||
and PyPI will be modified to remove the externally linked files functionality.
|
||
At this point in time any installers should finally remove any of the
|
||
deprecated PEP 438 functionality such as ``--allow-external`` and
|
||
``--allow-unverified`` in pip.
|
||
|
||
|
||
PIL
|
||
---
|
||
|
||
It's obvious from the numbers below that the vast bulk of the impact come from
|
||
the PIL project. On 2014-05-17 an email was sent to the contact for PIL
|
||
inquiring whether or not they would be willing to upload to PyPI. A response
|
||
has not been received as of yet (2014-06-05) nor has any change in the hosting
|
||
happened. Due to the popularity of PIL this PEP also proposes that during the
|
||
deprecation period that PyPI Administrators will set the PIL download URL as
|
||
the external index for that project. Allowing the users of PIL to take
|
||
advantage of the auto discovery mechanisms although the project has seemingly
|
||
become unmaintained.
|
||
|
||
|
||
Impact
|
||
======
|
||
|
||
The largest impact of this is going to be projects where the maintainers are
|
||
no longer maintaining the project, for one reason or another. For these
|
||
projects it's unlikely that a maintainer will arrive to set the external index
|
||
metadata which would allow the auto discovery mechanism to find it.
|
||
|
||
Looking at the numbers factoring out PIL (which has been special cased above)
|
||
the actual impact should be quite low, with it affecting just 6.9% of projects
|
||
which host only externally or 2.8% which have their latest version hosted
|
||
externally. This represents a mere 3883 unique IP addresses. The break down of
|
||
this is that of those 3883 addresses, 100% of them installed something that
|
||
could not be verified while only 3% installed something which could be.
|
||
|
||
|
||
Projects Which Rely on Externally Hosted files
|
||
----------------------------------------------
|
||
|
||
This is determined by crawling the simple index and looking for installable
|
||
files using a similar detection method as pip and setuptools use. The "latest"
|
||
version is determined using ``pkg_resources.parse_version`` sort order and it
|
||
is used to show whether or not the latest version is hosted externally or only
|
||
old versions are.
|
||
|
||
============ ======= ================ =================== =======
|
||
\ PyPI External (old) External (latest) Total
|
||
============ ======= ================ =================== =======
|
||
**Safe** 38716 31 35 38782
|
||
**Unsafe** 0 1659 1169 2828
|
||
**Total** 38716 1690 1204 41610
|
||
============ ======= ================ =================== =======
|
||
|
||
|
||
Top Externally Hosted Projects by Requests
|
||
------------------------------------------
|
||
|
||
This is determined by looking at the number of requests the
|
||
``/simple/<project>/`` page had gotten in a single day. The total number of
|
||
requests during that day was 17,960,467.
|
||
|
||
============================== ========
|
||
Project Requests
|
||
============================== ========
|
||
PIL 13470
|
||
mysql-connector-python 321
|
||
salesforce-python-toolkit 54
|
||
pyodbc 50
|
||
elementtree 44
|
||
atfork 39
|
||
RBTools 29
|
||
django-contrib-requestprovider 28
|
||
wadofstuff-django-serializers 23
|
||
Pygame 21
|
||
============================== ========
|
||
|
||
|
||
Top Externally Hosted Projects by Unique IPs
|
||
--------------------------------------------
|
||
|
||
This is determined by looking at the IP addresses of requests the
|
||
``/simple/<project>/`` page had gotten in a single day. The total number of
|
||
unique IP addresses during that day was 105,587.
|
||
|
||
============================== ==========
|
||
Project Unique IPs
|
||
============================== ==========
|
||
PIL 3515
|
||
mysql-connector-python 117
|
||
pyodbc 34
|
||
elementtree 21
|
||
RBTools 19
|
||
egenix-mx-base 16
|
||
Pygame 14
|
||
salesforce-python-toolkit 13
|
||
django-contrib-requestprovider 12
|
||
wxPython 11
|
||
python-apt 10
|
||
============================== ==========
|
||
|
||
|
||
Rejected Proposals
|
||
==================
|
||
|
||
Keep the current classification system but adjust the options
|
||
-------------------------------------------------------------
|
||
|
||
This PEP rejects several related proposals which attempt to fix some of the
|
||
usability problems with the current system but while still keeping the
|
||
general gist of PEP 438.
|
||
|
||
This includes:
|
||
|
||
* Default to allowing safely externally hosted files, but disallow unsafely
|
||
hosted.
|
||
* Default to disallowing safely externally hosted files with only a global
|
||
flag to enable them, but disallow unsafely hosted.
|
||
* Continue on the suggested path of PEP 438 and remove the option to unsafely
|
||
host externally but continue to allow the option to safely host externally.
|
||
|
||
|
||
These proposals are rejected because:
|
||
|
||
* The classification "system" is complex, hard to explain, and requires an
|
||
intimate knowledge of how the simple API works in order to be able to reason
|
||
about which classification is required. This is reflected in the fact that
|
||
the code to implement it is complicated and hard to understand as well.
|
||
|
||
* People are generally surprised that PyPI allows externally linking to files
|
||
and doesn't require people to host on PyPI. In contrast most of them are
|
||
familiar with the concept of multiple software repositories such as is in
|
||
use by many OSs.
|
||
|
||
* PyPI is fronted by a globally distributed CDN which has improved the
|
||
reliability and speed for end users. It is unlikely that any particular
|
||
external host has something comparable. This can lead to extremely bad
|
||
performance for end users when the external host is located in different
|
||
parts of the world or does not generally have good connectivity.
|
||
|
||
As a data point, many users reported sub DSL speeds and latency when
|
||
accessing PyPI from parts of Europe and Asia prior to the use of the CDN.
|
||
|
||
* PyPI has monitoring and an on-call rotation of sysadmins whom can respond to
|
||
downtime quickly, thus enabling a quicker response to downtime. Again it is
|
||
unlikely that any particular external host will have this. This can lead
|
||
to single packages in a dependency chain being un-installable. This will
|
||
often confuse users, who often times have no idea that this package relies
|
||
on an external host, and they cannot figure out why PyPI appears to be up
|
||
but the installer cannot find a package.
|
||
|
||
* PyPI supports mirroring, both for private organizations and public mirrors.
|
||
The legal terms of uploading to PyPI ensure that mirror operators, both
|
||
public and private, have the right to distribute the software found on PyPI.
|
||
However software that is hosted externally does not have this, causing
|
||
private organizations to need to investigate each package individually and
|
||
manually to determine if the license allows them to mirror it.
|
||
|
||
For public mirrors this essentially means that these externally hosted
|
||
packages *cannot* be reasonably mirrored. This is particularly troublesome
|
||
in countries such as China where the bandwidth to outside of China is
|
||
highly congested making a mirror within China often times a massively better
|
||
experience.
|
||
|
||
* Installers have no method to determine if they should expect any particular
|
||
URL to be available or not. It is not unusual for the simple API to reference
|
||
old packages and URLs which have long since stopped working. This causes
|
||
installers to have to assume that it is OK for any particular URL to not be
|
||
accessible. This causes problems where an URL is temporarily down or
|
||
otherwise unavailable (a common cause of this is using a copy of Python
|
||
linked against a really ancient copy of OpenSSL which is unable to verify
|
||
the SSL certificate on PyPI) but it *should* be expected to be up. In this
|
||
case installers will typically silently ignore this URL and later the user
|
||
will get a confusing error stating that the installer couldn't find any
|
||
versions instead of getting the real error message indicating that the URL
|
||
was unavailable.
|
||
|
||
* In the long run, global opt in flags like ``--allow-all-external`` will
|
||
become little annoyances that developers cargo cult around in order to make
|
||
their installer work. When they run into a project that requires it they
|
||
will most likely simply add it to their configuration file for that installer
|
||
and continue on with whatever they were actually trying to do. This will
|
||
continue until they try to install their requirements on another computer
|
||
or attempt to deploy to a server where their install will fail again until
|
||
they add the "make it work" flag in their configuration file.
|
||
|
||
* The URL classification only works for a certain subset of projects, however
|
||
it does not allow for any project which needs additional restrictions such
|
||
as Access Controls. This means that there would be two methods of doing the
|
||
same thing, linking to a file safely and hosting an index. Hosting an index
|
||
works in all situations and by relying on this we make for a more consistent
|
||
experience no matter the reason for external hosting.
|
||
|
||
* The safe external hosting option hampers the ability of PyPI to upgrade it's
|
||
security infrastructure. For instance if MD5 becomes broken in the future
|
||
there will be no way for PyPI to upgrade the hashes of the projects which
|
||
rely on safe external hosting via MD5 while files that are hosted on PyPI
|
||
can simply be processed over with a new hash function.
|
||
|
||
Copyright
|
||
=========
|
||
|
||
This document has been placed in the public domain.
|
||
|
||
|
||
|
||
..
|
||
Local Variables:
|
||
mode: indented-text
|
||
indent-tabs-mode: nil
|
||
sentence-end-double-space: t
|
||
fill-column: 70
|
||
coding: utf-8
|
||
End:
|