PEP: 470
Title: Using Multi Index Support for External to PyPI Package File Hosting
Version: $Revision$
Last-Modified: $Date$
Author: Donald Stufft
BDFL-Delegate: Richard Jones
Discussions-To: distutils-sig@python.org
Status: Draft
Type: Process
Content-Type: text/x-rst
Created: 12-May-2014
Post-History: 14-May-2014, 05-Jun-2014


Abstract
========

This PEP proposes that the official means of having an installer locate
and find package files which are hosted externally to PyPI become the use
of multi index support instead of the practice of using external links on
the simple installer API.

It is important to remember that this is **not** about forcing anyone to
host their files on PyPI. If someone does not wish to do so they will
never be under any obligation to. They can still list their project in
PyPI as an index, and the tooling will still allow them to host it
elsewhere.

This PEP is strictly concerned with the Simple Installer API and how
automated installers interact with PyPI; it has no bearing on the
informational pages which are primarily for human consumption.


Rationale
=========

There is a long history documented in PEP 438 that explains why
externally hosted files exist today in the state that they do on PyPI.
For the sake of brevity I will not duplicate that and instead urge
readers to first take a look at PEP 438 for background.

There are currently two primary ways for a project to make itself
available without directly hosting the package files on PyPI. They can
either include links to the package files in the simple installer API or
they can publish a custom package index which contains their project.


Custom Additional Index
-----------------------

Each installer which speaks to PyPI offers a mechanism for the user
invoking that installer to provide additional custom locations to search
for files during the dependency resolution phase.
For pip these locations can be configured per invocation, per shell
environment, per requirements file, per virtual environment, and per
user.

The mechanisms for specifying additional locations have existed within
pip and setuptools for many years; by comparison the mechanisms in
PEP 438 and any other new mechanism will have existed for only a short
period of time (if they exist at all currently).

The use of additional indexes instead of external links on the simple
installer API provides a simple clean interface which is consistent with
the way most Linux package systems work (apt-get, yum, etc). More
importantly it works the same even for projects which are commercial or
otherwise have their access restricted in some form (private networks,
password, IP ACLs etc) while the external links method only realistically
works for projects which do not have their access restricted.

Compared to the complex rules which a project must be aware of to prevent
itself from being considered unsafely hosted, setting up an index is
fairly trivial and in the simplest case does not require anything more
than a filesystem and a standard web server such as Nginx or Twisted Web.
Even if using simple static hosting without autoindexing support, it is
still straightforward to generate appropriate index pages as static HTML.


Example Index with Twisted Web
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

1. Create a root directory for your index. For the purposes of the
   example I'll assume you've chosen ``/var/www/index.example.com/``.
2. Inside of this root directory, create a directory for each project,
   for example with
   ``mkdir -p /var/www/index.example.com/{foo,bar,other}/``.
3. Place the package files for each project in their respective folder,
   creating paths like ``/var/www/index.example.com/foo/foo-1.0.tar.gz``.
4. Configure Twisted Web to serve the root directory, ideally with TLS.

   ::

       $ twistd -n web --path /var/www/index.example.com/


Examples of Additional indexes with pip
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Invocation:**

::

    $ pip install --extra-index-url https://pypi.example.com/ foobar

**Shell Environment:**

::

    $ export PIP_EXTRA_INDEX_URL=https://pypi.example.com/
    $ pip install foobar

**Requirements File:**

::

    $ printf '%s\n' "--extra-index-url https://pypi.example.com/" "foobar" > requirements.txt
    $ pip install -r requirements.txt

**Virtual Environment:**

::

    $ python -m venv myvenv
    $ printf '%s\n' "[global]" "extra-index-url = https://pypi.example.com/" > myvenv/pip.conf
    $ myvenv/bin/pip install foobar

**User:**

::

    $ printf '%s\n' "[global]" "extra-index-url = https://pypi.example.com/" > ~/.pip/pip.conf
    $ pip install foobar


External Links on the Simple Installer API
------------------------------------------

PEP 438 proposed a system of classifying file links as either internal,
external, or unsafe. It recommended that by default an installer would
only install from internal links; however, users could opt into external
links on either a global or a per package basis. Additionally they could
also opt into unsafe links on a per package basis.

This system has turned out to be *extremely* unfriendly towards the end
users and it is the position of this PEP that the situation has become
untenable. The situation as provided by PEP 438 requires an end user to
be aware not only of the difference between internal, external, and
unsafe, but also to be aware of what hosting mode the package they are
trying to install is in, what links are available on that project's
/simple/ page, whether or not those links have a properly formatted hash
fragment, and what links are available from pages linked to from that
project's /simple/ page.

There are a number of common confusion/pain points with this system that
I have witnessed:

* Users unaware what the simple installer API is at all or how an
  installer locates installable files.
* Users unaware that even if the simple API links to a file, if it does
  not include a ``#md5=...`` fragment then it will be counted as unsafe.
* Users unaware that an installer can look at pages linked from the
  simple API to determine additional links, or that any links found in
  this fashion are considered unsafe.
* Users are unaware and often surprised that PyPI supports hosting your
  files someplace other than PyPI at all.

In addition to that, the information that an installer is able to provide
when an installation fails is pretty minimal. We are able to detect if
there are externally hosted files directly linked from the simple
installer API; however, we cannot detect if there are files hosted on a
linked page without fetching that page, and doing so would cause a
massive performance hit just to see if there might be a file there so
that a better error message could be provided.

Finally very few projects have properly linked to their external files so
that they can be safely downloaded and verified. At the time of this
writing there are a total of 65 projects which have files that are only
available externally and are safely hosted.

The end result of all of this is that with PEP 438, when a user attempts
to install a file that is not hosted on PyPI, the steps they typically
follow are:

1. First, they attempt to install it normally, using
   ``pip install foobar``. This fails because the file is not hosted on
   PyPI and PEP 438 has us default to only files hosted on PyPI. If pip
   detected any externally hosted files or other pages that we *could*
   have attempted to find other files at, it will give an error message
   suggesting that they try ``--allow-external foobar``.
2. They then attempt to install their package using
   ``pip install --allow-external foobar foobar``. If they are lucky
   foobar is one of the packages which is hosted externally and safely
   and this will succeed. If they are unlucky they will get a different
   error message suggesting that they *also* try
   ``--allow-unverified foobar``.
3. They then attempt to install their package using
   ``pip install --allow-external foobar --allow-unverified foobar foobar``
   and this finally works.

These are the same basic steps that practically everyone goes through
every time they try to install something that is not hosted on PyPI. If
they are lucky it'll only take them two steps, but typically it requires
three steps. Worse, there is no real indication to these people why one
package might install after two steps but most require three. Even worse
than that, most of them will never get an externally hosted package that
does not take three steps, so they will be increasingly annoyed and
frustrated at the intermediate step and will likely eventually just start
skipping it.


External Index Discovery
========================

One of the problems with using an additional index is one of discovery.
Users will not generally be aware that an additional index is required at
all, much less where that index can be found. Projects can attempt to
convey this information using their description on the PyPI page; however,
that excludes people who discover their project organically through
``pip search``.

To support projects that wish to externally host their files and to
enable users to easily discover what additional indexes are required,
PyPI will gain the ability for projects to register external index URLs
and additionally an associated comment for each. These URLs will be made
available on the simple page; however, they will not be linked or provided
in a form that would cause older installers to search them automatically.

When an installer fetches the simple page for a project, if it finds this
additional meta-data and it cannot find any files for that project in its
configured URLs then it should use this data to tell the user how to add
one or more of the additional URLs to search in.
This message should include any comments that the project has included to
enable them to communicate to the user and provide hints as to which URL
they might want if some are only useful or compatible with certain
platforms or situations.

When an installer has implemented the auto discovery mechanisms it should
also deprecate any of the mechanisms added for PEP 438 (such as
``--allow-external``) for removal at the end of the deprecation period
proposed by the PEP.

This feature *must* be added to PyPI prior to starting the deprecation
and removal process for link spidering.


Deprecation and Removal of Link Spidering
=========================================

A new hosting mode will be added to PyPI. This hosting mode will be
called ``pypi-only`` and will be in addition to the three that PEP 438
has already given us which are ``pypi-explicit``, ``pypi-scrape``,
``pypi-scrape-crawl``. This new hosting mode will modify a project's
simple API page so that it only lists the files which are directly hosted
on PyPI and will not link to anything else.

Upon acceptance of this PEP and the addition of the ``pypi-only`` mode,
all new projects will be defaulted to the PyPI only mode and they will be
locked to this mode and unable to change this particular setting.
``pypi-only`` projects will still be able to register external index URLs
as described above; the "pypi-only" refers only to the download links
that are published directly on PyPI.

An email will then be sent out to all of the projects which are hosted
only on PyPI informing them that in one month their project will be
automatically converted to the ``pypi-only`` mode. A month after these
emails have been sent, any of the emailed projects which are still hosted
only on PyPI will have their mode set to ``pypi-only``.

After that switch, an email will be sent to projects which rely on
hosting external to PyPI.
This email will warn these projects that externally hosted files have
been deprecated on PyPI and that 6 months from the time of that email all
external links will be removed from the installer APIs. This email *must*
include instructions for converting their projects to be hosted on PyPI
and *must* include links to a script or package that will enable them to
enter their PyPI credentials and package name and have it automatically
download and re-host all of their files on PyPI. This email *must also*
include instructions for setting up their own index page and registering
that with PyPI, including the fact that they can use pythonhosted.org as
a host for an index page without requiring them to host any additional
infrastructure or purchase a TLS certificate. This email must also
contain a link to the Terms of Service for PyPI as many users may have
signed up a long time ago and may not recall what those terms are.

Five months after the initial email, another email must be sent to any
projects still relying on external hosting. This email will include all
of the same information that the first email contained, except that the
removal date will be one month away instead of six.

Finally a month later all projects will be switched to the ``pypi-only``
mode and PyPI will be modified to remove the externally linked files
functionality. At this point in time any installers should finally remove
any of the deprecated PEP 438 functionality such as ``--allow-external``
and ``--allow-unverified`` in pip.


PIL
---

It's obvious from the numbers below that the vast bulk of the impact
comes from the PIL project. On 2014-05-17 an email was sent to the
contact for PIL inquiring whether or not they would be willing to upload
to PyPI. A response has not been received as of yet (2014-06-05) nor has
any change in the hosting happened.
Due to the popularity of PIL this PEP also proposes that during the
deprecation period PyPI Administrators will set the PIL download URL as
the external index for that project, allowing the users of PIL to take
advantage of the auto discovery mechanisms although the project has
seemingly become unmaintained.


Impact
======

The largest impact of this is going to be projects where the maintainers
are no longer maintaining the project, for one reason or another. For
these projects it's unlikely that a maintainer will arrive to set the
external index metadata which would allow the auto discovery mechanism to
find it.

Looking at the numbers and factoring out PIL (which has been special
cased above), the actual impact should be quite low, with it affecting
just 6.9% of projects which host only externally or 2.8% which have their
latest version hosted externally. This represents a mere 3883 unique IP
addresses. The breakdown is that of those 3883 addresses, 100% of them
installed something that could not be verified while only 3% installed
something which could be.


Projects Which Rely on Externally Hosted files
----------------------------------------------

This is determined by crawling the simple index and looking for
installable files using a similar detection method as pip and setuptools
use. The "latest" version is determined using
``pkg_resources.parse_version`` sort order and it is used to show whether
the latest version is hosted externally or only old versions are.
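
The old/latest classification described above can be sketched in Python.
This is only an illustration under stated assumptions, not the script
used to produce the numbers below: the ``classify`` helper, its input
shape (pairs of version string and a "hosted on PyPI" flag), and the rule
that a version counts as PyPI hosted if any of its files is on PyPI are
all hypothetical. Because ``pkg_resources`` is not always installed, the
sketch falls back to a naive numeric ordering when the import fails.

```python
import re

try:
    # The ordering named in the text; provided by setuptools when installed.
    from pkg_resources import parse_version
except ImportError:
    def parse_version(version):
        """Naive stand-in (assumption): order by the numeric parts only."""
        return tuple(int(part) for part in re.findall(r"\d+", version))


def classify(files):
    """Classify one project from (version, hosted_on_pypi) pairs.

    Returns "pypi", "external-latest", or "external-old". The labels and
    the tie rule are illustrative assumptions.
    """
    # Group the hosting flags of every file by its version string.
    by_version = {}
    for version, on_pypi in files:
        by_version.setdefault(version, []).append(on_pypi)

    # Every file is on PyPI: the project does not rely on external hosting.
    if all(on_pypi for _, on_pypi in files):
        return "pypi"

    # Pick the latest version using the parse_version sort order.
    latest = max(by_version, key=parse_version)

    # No PyPI-hosted file for the latest version: it is external-only.
    if not any(by_version[latest]):
        return "external-latest"
    return "external-old"


print(classify([("1.0", True), ("1.2", False), ("1.10", True)]))
```

Note that ``1.10`` sorts after ``1.2`` under this ordering, which is why
a plain string comparison would not be a suitable substitute.
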

============ ======= ================ =================== =======
\            PyPI    External (old)   External (latest)   Total
============ ======= ================ =================== =======
**Safe**     38716   31               35                  38782
**Unsafe**   0       1659             1169                2828
**Total**    38716   1690             1204                41610
============ ======= ================ =================== =======


Top Externally Hosted Projects by Requests
------------------------------------------

This is determined by looking at the number of requests the ``/simple//``
page had gotten in a single day. The total number of requests during that
day was 17,960,467.

============================== ========
Project                        Requests
============================== ========
PIL                            13470
mysql-connector-python         321
salesforce-python-toolkit      54
pyodbc                         50
elementtree                    44
atfork                         39
RBTools                        29
django-contrib-requestprovider 28
wadofstuff-django-serializers  23
Pygame                         21
============================== ========


Top Externally Hosted Projects by Unique IPs
--------------------------------------------

This is determined by looking at the IP addresses of requests the
``/simple//`` page had gotten in a single day. The total number of unique
IP addresses during that day was 105,587.

============================== ==========
Project                        Unique IPs
============================== ==========
PIL                            3515
mysql-connector-python         117
pyodbc                         34
elementtree                    21
RBTools                        19
egenix-mx-base                 16
Pygame                         14
salesforce-python-toolkit      13
django-contrib-requestprovider 12
wxPython                       11
python-apt                     10
============================== ==========


Rejected Proposals
==================

Keep the current classification system but adjust the options
-------------------------------------------------------------

This PEP rejects several related proposals which attempt to fix some of
the usability problems with the current system while still keeping the
general gist of PEP 438. This includes:

* Default to allowing safely externally hosted files, but disallow
  unsafely hosted.
* Default to disallowing safely externally hosted files with only a
  global flag to enable them, but disallow unsafely hosted.
* Continue on the suggested path of PEP 438 and remove the option to
  unsafely host externally but continue to allow the option to safely
  host externally.

These proposals are rejected because:

* The classification "system" is complex, hard to explain, and requires
  an intimate knowledge of how the simple API works in order to be able
  to reason about which classification is required. This is reflected in
  the fact that the code to implement it is complicated and hard to
  understand as well.

* People are generally surprised that PyPI allows externally linking to
  files and doesn't require people to host on PyPI. In contrast most of
  them are familiar with the concept of multiple software repositories
  such as is in use by many OSs.

* PyPI is fronted by a globally distributed CDN which has improved the
  reliability and speed for end users. It is unlikely that any particular
  external host has something comparable. This can lead to extremely bad
  performance for end users when the external host is located in
  different parts of the world or does not generally have good
  connectivity.

  As a data point, many users reported sub DSL speeds and latency when
  accessing PyPI from parts of Europe and Asia prior to the use of the
  CDN.

* PyPI has monitoring and an on-call rotation of sysadmins who can
  respond to downtime quickly. Again it is unlikely that any particular
  external host will have this. This can lead to single packages in a
  dependency chain being un-installable. This will often confuse users,
  who oftentimes have no idea that this package relies on an external
  host, and they cannot figure out why PyPI appears to be up but the
  installer cannot find a package.

* PyPI supports mirroring, both for private organizations and public
  mirrors. The legal terms of uploading to PyPI ensure that mirror
  operators, both public and private, have the right to distribute the
  software found on PyPI. However software that is hosted externally does
  not have this, causing private organizations to need to investigate
  each package individually and manually to determine if the license
  allows them to mirror it.

  For public mirrors this essentially means that these externally hosted
  packages *cannot* be reasonably mirrored. This is particularly
  troublesome in countries such as China where the bandwidth to outside
  of China is highly congested, making a mirror within China oftentimes a
  massively better experience.

* Installers have no method to determine if they should expect any
  particular URL to be available or not. It is not unusual for the simple
  API to reference old packages and URLs which have long since stopped
  working. This causes installers to have to assume that it is OK for any
  particular URL to not be accessible. This causes problems where a URL
  is temporarily down or otherwise unavailable (a common cause of this is
  using a copy of Python linked against a really ancient copy of OpenSSL
  which is unable to verify the SSL certificate on PyPI) but it *should*
  be expected to be up. In this case installers will typically silently
  ignore this URL and later the user will get a confusing error stating
  that the installer couldn't find any versions instead of getting the
  real error message indicating that the URL was unavailable.

* In the long run, global opt in flags like ``--allow-all-external`` will
  become little annoyances that developers cargo cult around in order to
  make their installer work. When they run into a project that requires
  it they will most likely simply add it to their configuration file for
  that installer and continue on with whatever they were actually trying
  to do. This will continue until they try to install their requirements
  on another computer or attempt to deploy to a server where their
  install will fail again until they add the "make it work" flag in their
  configuration file.

* The URL classification only works for a certain subset of projects;
  it does not allow for any project which needs additional restrictions
  such as Access Controls. This means that there would be two methods of
  doing the same thing, linking to a file safely and hosting an index.
  Hosting an index works in all situations and by relying on this we make
  for a more consistent experience no matter the reason for external
  hosting.

* The safe external hosting option hampers the ability of PyPI to upgrade
  its security infrastructure. For instance if MD5 becomes broken in the
  future there will be no way for PyPI to upgrade the hashes of the
  projects which rely on safe external hosting via MD5, while files that
  are hosted on PyPI can simply be processed over with a new hash
  function.


Copyright
=========

This document has been placed in the public domain.


..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End: