Alexander Belopolsky 2015-08-26 22:24:54 -04:00
commit f880710236
1 changed files with 81 additions and 195 deletions

@ -1,36 +1,25 @@
PEP: 470
Title: Using Multi Repository Support for External to PyPI Package File Hosting
Title: Removing External Hosting Support on PyPI
Version: $Revision$
Last-Modified: $Date$
Author: Donald Stufft <donald@stufft.io>,
BDFL-Delegate: Richard Jones <richard@python.org>
BDFL-Delegate: TBD
Discussions-To: distutils-sig@python.org
Status: Draft
Type: Process
Content-Type: text/x-rst
Created: 12-May-2014
Post-History: 14-May-2014, 05-Jun-2014, 03-Oct-2014, 13-Oct-2014
Post-History: 14-May-2014, 05-Jun-2014, 03-Oct-2014, 13-Oct-2014, 26-Aug-2015
Replaces: 438
Abstract
========
This PEP proposes a mechanism for project authors to register with PyPI an
external repository where their project's downloads can be located. This
information can then be included as part of the simple API so that installers
can use it to tell users where the item they are attempting to install is
located and what they need to do to enable this additional repository. In
addition to adding discovery information to make explicit multiple repositories
easy to use, this PEP also deprecates and removes the implicit multiple
repository support which currently functions through directly or indirectly
linking off site via the simple API. Finally this PEP also proposes deprecating
and removing the functionality added by PEP 438, particularly the additional
rel information and the meta tag to indicate the API version.
This PEP does *not* propose mandating that all authors upload their projects to
PyPI in order to exist in the index nor does it propose any change to the human
facing elements of PyPI.
This PEP proposes the deprecation and removal of support for hosting files
externally to PyPI as well as the deprecation and removal of the functionality
added by PEP 438, particularly rel information to classify different types of
links and the meta-tag to indicate API version.
Rationale
@ -65,14 +54,6 @@ PyPI works, and other projects works, but this one specific one does not. They
often times do not realize who they need to contact in order to get this fixed
or what their remediation steps are.
By moving to using explicit multiple repositories we can make the lines between
these two roles much more explicit and remove the "hidden" surprises caused by
the current implementation of handling people who do not want to use PyPI as a
repository. However simply moving to explicit multiple repositories is a
regression in discoverability, and for that reason this PEP adds an extension
to the current simple API which will enable easy discovery of the specific
repository that a project can be found in.
PEP 438 attempted to solve this issue by allowing projects to explicitly
declare if they were using the repository features or not, and if they were
not, it had the installers classify the links it found as either "internal",
@ -85,16 +66,17 @@ repository features, an altogether good thing given the global CDN powering
PyPI providing speed ups for a lot of people, however it did so by introducing
a new point of confusion and pain for both the end users and the authors.
By moving to using explicit multiple repositories we can make the lines between
these two roles much more explicit and remove the "hidden" surprises caused by
the current implementation of handling people who do not want to use PyPI as a
repository.
Key User Experience Expectations
--------------------------------
#. Easily allow external hosting to "just work" when appropriately configured
at the system, user or virtual environment level.
#. Easily allow package authors to tell PyPI "my releases are hosted <here>"
and have that advertised in such a way that tools can clearly communicate it
to users, without silently introducing unexpected dependencies on third
party services.
#. Eliminate any and all references to the confusing "verifiable external" and
"unverifiable external" distinction from the user experience (both when
installing and when releasing packages).
@ -122,7 +104,7 @@ tools almost universally using multiple repository support making it extremely
likely that someone is already familiar with the concept.
Additionally, the multiple repository approach is a concept that is useful
outside of the narrow scope of allowing projects which wish to be included on
outside of the narrow scope of allowing projects that wish to be included on
the index portion of PyPI but do not wish to utilize the repository portion of
PyPI. This includes places where a company may wish to host a repository that
contains their internal packages or where a project may wish to have multiple
@ -215,64 +197,9 @@ repository. The exact specifics of how that is achieved is up to each
individual implementation.
External Index Discovery
========================
One of the problems with using an additional index is one of discovery. Users
will not generally be aware that an additional index is required at all much
less where that index can be found. Projects can attempt to convey this
information using their description on the PyPI page however that excludes
people who discover their project organically through ``pip search``.
To support projects that wish to externally host their files and to enable
users to easily discover what additional indexes are required, PyPI will gain
the ability for projects to register external index URLs along with an
associated comment for each. These URLs will be made available on the simple
page however they will not be linked or provided in a form that older
installers will automatically search.
This ability will take the form of a ``<meta>`` tag. The name of this tag must
be set to ``repository`` or ``find-link`` and the content will be a link to the
location of the repository. An optional data-description attribute will convey
any comments or description that the author has provided.
An example would look something like::
    <meta name="repository" content="https://index.example.com/" data-description="Primary Repository">
    <meta name="repository" content="https://index.example.com/Ubuntu-14.04/" data-description="Wheels built for Ubuntu 14.04">
    <meta name="find-link" content="https://links.example.com/find-links/" data-description="A flat index for find links">
When an installer fetches the simple page for a project, if it finds this
additional meta-data then it should use this data to tell the user how to add
one or more of the additional URLs to search in. This message should include
any comments that the project has included to enable them to communicate to the
user and provide hints as to which URL they might want (e.g. if some are only
useful or compatible with certain platforms or situations). When the installer
has implemented the auto discovery mechanisms they should also deprecate any of
the mechanisms added for PEP 438 (such as ``--allow-external``) for removal at
the end of the deprecation period proposed by the PEP.
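A minimal sketch of how an installer might read this metadata, assuming the simple page is already available as a string. The class name, the function names and the wording of the hint are hypothetical and are not part of the PEP or of any installer; ``--extra-index-url`` and ``--find-links`` are existing pip options used here only to illustrate what a hint could suggest.

```python
from html.parser import HTMLParser

# The two tag names the proposal allows for discovery metadata.
DISCOVERY_NAMES = {"repository", "find-link"}

# Illustrative mapping from tag name to the pip option a hint could mention.
PIP_OPTION = {"repository": "--extra-index-url", "find-link": "--find-links"}


class ExternalIndexParser(HTMLParser):
    """Collect (kind, url, description) triples from a simple page."""

    def __init__(self):
        super().__init__()
        self.indexes = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        kind = attributes.get("name")
        url = attributes.get("content")
        # Ignore unrelated <meta> tags and tags without a target URL.
        if kind in DISCOVERY_NAMES and url:
            description = attributes.get("data-description") or ""
            self.indexes.append((kind, url, description))


def discover_external_indexes(simple_page_html):
    """Return the external indexes advertised on one project's simple page."""
    parser = ExternalIndexParser()
    parser.feed(simple_page_html)
    parser.close()
    return parser.indexes


def format_hints(indexes):
    """Build one human readable hint line per advertised index."""
    hints = []
    for kind, url, description in indexes:
        hint = "{} {}".format(PIP_OPTION[kind], url)
        if description:
            hint += "  ({})".format(description)
        hints.append(hint)
    return hints
```

The parser only reports what the page advertises; whether to act on a hint stays with the user, which matches the intent that nothing is searched automatically.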
In addition to the API for programmatic access to the registered external
repositories, PyPI will also present these URLs in the UI so that users with
an installer that does not implement the discovery mechanism can still easily
discover what repository the project is using to host itself.
This feature **MUST** be added to PyPI and be contained in a released version
of pip prior to starting the deprecation and removal process for the implicit
offsite hosting functionality.
Deprecation and Removal of Link Spidering
=========================================
.. important:: The deprecation specified in this section **MUST NOT** start
   until after the discovery mechanisms have been implemented and released in
   pip.

   The only exception to this is the addition of the ``pypi-only`` mode and
   defaulting new projects to it without the ability to switch to a different
   mode.
A new hosting mode will be added to PyPI. This hosting mode will be called
``pypi-only`` and will be in addition to the three that PEP 438 has already
given us which are ``pypi-explicit``, ``pypi-scrape``, ``pypi-scrape-crawl``.
@ -282,44 +209,34 @@ else.
Upon acceptance of this PEP and the addition of the ``pypi-only`` mode, all new
projects will be defaulted to the PyPI only mode and they will be locked to
this mode and unable to change this particular setting. ``pypi-only`` projects
will still be able to register external index URLs as described above - the
"pypi-only" refers only to the download links that are published directly on
PyPI.
this mode and unable to change this particular setting.
An email will then be sent out to all of the projects which are hosted only on
PyPI informing them that in one month their project will be automatically
converted to the ``pypi-only`` mode. A month after these emails have been sent
any of those projects which were emailed, which still are hosted only on PyPI
will have their mode set to ``pypi-only``.
will have their mode set permanently to ``pypi-only``.
After that switch, an email will be sent to projects which rely on hosting
At the same time, an email will be sent to projects which rely on hosting
external to PyPI. This email will warn these projects that externally hosted
files have been deprecated on PyPI and that in 6 months from the time of that
files have been deprecated on PyPI and that in 3 months from the time of that
email all external links will be removed from the installer APIs. This
email **MUST** include instructions for converting their projects to be hosted
on PyPI and **MUST** include links to a script or package that will enable them
to enter their PyPI credentials and package name and have it automatically
download and re-host all of their files on PyPI. This email **MUST** also
include instructions for setting up their own index page and registering that
with PyPI, including the fact that they can use pythonhosted.org as a host for
an index page without requiring them to host any additional infrastructure or
purchase a TLS certificate. This email must also contain a link to the Terms of
Service for PyPI as many users may have signed up a long time ago and may not
recall what those terms are. Finally this email must also contain a list of
the links registered with PyPI where we were able to detect an installable file
was located.
include instructions for setting up their own index page. This email must also
contain a link to the Terms of Service for PyPI as many users may have signed
up a long time ago and may not recall what those terms are. Finally this email
must also contain a list of the links registered with PyPI where we were able
to detect an installable file was located.
Five months after the initial email, another email must be sent to any projects
Two months after the initial email, another email must be sent to any projects
still relying on external hosting. This email will include all of the same
information that the first email contained, except that the removal date will
be one month away instead of six.
be one month away instead of three.
Finally a month later all projects will be switched to the ``pypi-only`` mode
and PyPI will be modified to remove the externally linked files functionality,
when switching these projects to the ``pypi-only`` mode we will automatically
move any links which can be used for discovering other projects so that they
are registered as an external repository.
and PyPI will be modified to remove the externally linked files functionality.
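The timeline can be sketched as simple date arithmetic. This sketch assumes the three-month window (both notices sent on the same day, a reminder after two months, removal after three). The start date in the usage note and the function names are hypothetical, and clamping to the last day of a short month is a choice made here, not something the PEP specifies.

```python
import calendar
from datetime import date


def add_months(start, months):
    """Return the date `months` calendar months after `start`.

    The day is clamped to the last day of the target month, so
    30 November plus three months lands on the last day of February.
    """
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


def deprecation_schedule(first_email):
    """List (date, action) pairs for the three-month deprecation window."""
    return [
        (first_email,
         "email PyPI-only projects and externally hosted projects"),
        (add_months(first_email, 1),
         "set projects hosted only on PyPI to pypi-only"),
        (add_months(first_email, 2),
         "send reminder to projects still relying on external hosting"),
        (add_months(first_email, 3),
         "switch all projects to pypi-only and remove external links"),
    ]
```

Calling ``deprecation_schedule(date(2015, 9, 1))`` with that hypothetical start date yields steps on the first of September, October, November and December 2015.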
Summary of Changes
@ -328,116 +245,85 @@ Summary of Changes
Repository side
---------------
#. Implement simple API changes to allow the addition of an external
#. Deprecate and remove the hosting modes as defined by PEP 438.
#. Restrict simple API to only list the files that are contained within the
repository.
#. *(Optional, Mandatory on PyPI)* Deprecate and remove the hosting modes as
defined by PEP 438.
#. *(Optional, Mandatory on PyPI)* Restrict simple API to only list the files
that are contained within the repository and the external repository
metadata.
Client side
-----------
#. Implement multiple repository support.
#. Implement some mechanism for removing/disabling the default repository.
#. Implement the discovery mechanism.
#. *(Optional)* Deprecate / Remove PEP 438
#. Deprecate / Remove PEP 438
Impact
======
The largest impact of this PEP will be that users of older installation
clients will not get a discovery mechanism built into the install command.
This will require them to browse to the PyPI web UI and discover the repository
there. Since any URLs required to install a project will be automatically
migrated to the new format, the biggest change for users will be requiring a
new option to install these projects.
To determine impact, we've looked at all projects using a method of searching
PyPI which is similar to what pip and setuptools use and searched for all
files available on PyPI, safely linked from PyPI, unsafely linked from PyPI,
and finally unsafely available outside of PyPI. When the same file was found
in multiple locations it was deduplicated and counted in only one location
based on the following preferences: PyPI > Safely Off PyPI > Unsafely Off PyPI.
This gives us the broadest possible definition of impact: it means that any
single file for this project may no longer be visible by default, however that
file could be years old, or it could be a binary file while there is an sdist
available on PyPI. This means that the *real* impact will likely be much
smaller, but in an attempt not to miscount we take the broadest possible
definition.
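The deduplication rule can be sketched as follows. This is an illustration of the stated preference order only, not the script that produced the numbers in this PEP; the location labels, function names and sample file names are all hypothetical.

```python
# Lower rank wins: PyPI > Safely Off PyPI > Unsafely Off PyPI.
PREFERENCE = {"pypi": 0, "safe-external": 1, "unsafe-external": 2}


def deduplicate(sightings):
    """Map each file name to the most preferred location it was seen in.

    `sightings` is an iterable of (filename, location) pairs, where
    location is one of the keys of PREFERENCE.
    """
    best = {}
    for filename, location in sightings:
        current = best.get(filename)
        if current is None or PREFERENCE[location] < PREFERENCE[current]:
            best[filename] = location
    return best


def count_by_location(best):
    """Count how many deduplicated files ended up in each location."""
    counts = {location: 0 for location in PREFERENCE}
    for location in best.values():
        counts[location] += 1
    return counts
```

A file seen both on PyPI and at an unverifiable external URL is therefore counted once, under PyPI, which is why a project only counts as affected when some file is available nowhere safer.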
Looking at the numbers the actual impact should be quite low, with it affecting
just 3.8% of projects which host any files only externally or 2.2% which have
their latest version hosted only externally.
6674 unique IP addresses have accessed the Simple API for these 3.8% of
projects in a single day (2014-09-30). Of those, 99.5% of them installed
something which could not be verified, and thus they were open to a Remote Code
Execution via a Man-In-The-Middle attack, while 7.9% installed something which
could be verified and only 0.4% only installed things which could be verified.
This means that 99.5% of the users of these features, both new and old, are
doing something unsafe, and for anything using an older copy of pip or using
setuptools at all they are silently unsafe.
At the time of this writing there are 65,232 projects hosted on PyPI and of
those, 59 of them rely on external files that are safely hosted outside of PyPI
and 931 of them rely on external files which are unsafely hosted outside of
PyPI. This shows us that 1.5% of projects will be affected in some way by this
change while 98.5% will continue to function as they always have. In addition,
only 5% of the projects affected are using the features provided by PEP 438 to
safely host outside of PyPI while 95% of them are exposing their users to
Remote Code Execution via a Man In The Middle attack.
Projects Which Rely on Externally Hosted files
----------------------------------------------
Data Sovereignty
================
This is determined by crawling the simple index and looking for installable
files using a similar detection method as pip and setuptools use. The "latest"
version is determined using ``pkg_resources.parse_version`` sort order and it
is used to show whether or not the latest version is hosted externally or only
old versions are.
In the discussions around previous versions of this PEP, one of the key use
cases for wanting to host files externally to PyPI was due to data sovereignty
requirements for people living in jurisdictions outside of the USA, where PyPI
is currently hosted. The author of this PEP is not blind to these concerns and
realizes that this PEP represents a regression for the people that have these
concerns, however the current situation is presenting an extremely poor user
experience and the feature is only being used by a small percentage of
projects. In addition, the data sovereignty problem requires familiarity with
the laws outside of the home jurisdiction of the author of this PEP, who is
also the principal developer and operator of PyPI. For these reasons, a
solution for the problem of data sovereignty has been deferred and is
considered outside of the scope for this PEP.
============ ======= ================ =================== =======
\            PyPI    External (old)   External (latest)   Total
============ ======= ================ =================== =======
**Safe**     43313   16               39                  43368
**Unsafe**   0       756              1092                1848
**Total**    43313   772              1131                45216
============ ======= ================ =================== =======
Top Externally Hosted Projects by Requests
------------------------------------------
This is determined by looking at the number of requests the
``/simple/<project>/`` page had gotten in a single day. The total number of
requests during that day was 10,623,831.
============================== ========
Project                        Requests
============================== ========
PIL                            63869
Pygame                         2681
mysql-connector-python         1562
pyodbc                         724
elementtree                    635
salesforce-python-toolkit      316
wxPython                       295
PyXML                          251
RBTools                        235
python-graph-core              123
cElementTree                   121
============================== ========
Top Externally Hosted Projects by Unique IPs
--------------------------------------------
This is determined by looking at the IP addresses of requests the
``/simple/<project>/`` page had gotten in a single day. The total number of
unique IP addresses during that day was 124,604.
============================== ==========
Project                        Unique IPs
============================== ==========
PIL                            4553
mysql-connector-python         462
Pygame                         202
pyodbc                         181
elementtree                    166
wxPython                       126
RBTools                        114
PyXML                          87
salesforce-python-toolkit      76
pyDes                          76
============================== ==========
If someone for whom the issue of data sovereignty matters wishes to
put forth the effort, then at that time a system can be designed, implemented,
and ultimately deployed and operated that would satisfy both the needs of
non-US users that cannot upload their projects to a system on US soil and the
quality of user experience that PyPI is attempting to provide.
Rejected Proposals
==================
Allow easier discovery of externally hosted indexes
---------------------------------------------------
A previous version of this PEP included a new feature added to both PyPI and
installers that would allow project authors to enter into PyPI a list of
URLs that would instruct installers to ignore any files uploaded to PyPI and
instead return an error telling the end user about these extra URLs that they
can add to their installer to make the installation work.
This idea is rejected because it provides a similarly painful end user
experience where people will first attempt to install something, get an error,
then have to re-run the installation with the correct options.
Keep the current classification system but adjust the options
-------------------------------------------------------------