PEP: 766
Title: Explicit Priority Choices Among Multiple Indexes
Author: Michael Sarahan <msarahan@gmail.com>
Sponsor: Barry Warsaw <barry@python.org>
PEP-Delegate: Paul Moore <p.f.moore@gmail.com>
Discussions-To: https://discuss.python.org/t/pep-for-handling-multiple-indexes-index-priority/71589
Status: Draft
Type: Informational
Topic: Packaging
Created: 18-Nov-2024
Post-History: `18-Nov-2024 <https://discuss.python.org/t/pep-for-handling-multiple-indexes-index-priority/71589>`__,

Abstract
========

Package resolution is a key part of the Python user experience as the means of
extending Python's core functionality. The experience of package resolution is
mostly taken for granted until someone encounters a situation where the package
installer does something they don't expect. The installer behavior with
multiple indexes has been `a common source of unexpected behavior
<https://github.com/pypa/pip/issues/8606>`__. Through its ubiquity, pip has
long defined the standard expected behavior across other tools in the ecosystem,
but Python installers are diverging with respect to how they handle multiple
indexes. At the core of this divergence is whether index contents are combined
before resolving distributions, or each index is handled individually in order.
pip merges all indexes before matching distributions, while uv matches
distributions on one index before moving on to the next. Each approach has
advantages and disadvantages. This PEP aims to describe each of these
behaviors, which are referred to as “version priority” and “index priority”
respectively, so that community discussions and troubleshooting can share a
common vocabulary, and so that tools can implement predictable behavior based on
these descriptions.

Motivation
==========

Python package users frequently find themselves in need of specifying an index
or package source other than PyPI. There are many reasons for external indexes
to exist:

- File size/quota limitations on PyPI
- Implementation variants, such as `different GPU library builds in PyTorch <https://pytorch.org/get-started/locally/>`__
- `Local builds of packages shared internally at an organization <https://github.com/pypa/pip/issues/8606>`__
- `Situations where a local package has remote dependencies
  <https://github.com/pypa/pip/issues/11624>`__, and the user wishes to prioritize
  local packages over remote dependencies, while still falling back to remote
  dependencies where needed

In most of these cases, it is not desirable to completely forgo PyPI. Instead,
users generally want PyPI to still be a source of packages, but a lower priority
source. Unfortunately, `pip's current design precludes this concept of priority <https://github.com/pypa/pip/issues/8606>`__.

Some Python installer tools have developed alternative ways to handle multiple
indexes that incorporate mechanisms to express index priority, such as `uv
<https://docs.astral.sh/uv/pip/compatibility/#packages-that-exist-on-multiple-indexes>`__
and `PDM
<https://pdm-project.org/latest/usage/config/#respect-the-order-of-the-sources>`__.
The innovation and the potential for customization are exciting, but they come at
the risk of further fragmenting the Python packaging ecosystem, which is already
perceived as one of Python's weak points. The motivation of this PEP is to encourage
installers to provide more insight into how they handle multiple indexes, and to
provide a vocabulary that can be common to the broader community.

Specification
=============

“Version priority”
------------------

This behavior is characterized by the installer always getting the
"best" version of a package, regardless of the index that it comes
from. "Best" is defined by the installer's algorithm for optimizing
the various traits of a package, also factoring in user input (such as
preferring only binaries, or no binaries). While installers may differ
in their optimization criteria and user options, the general trait that
all version priority installers share is that the index
contents are collated prior to candidate selection.

Version priority is most useful when all configured indexes are equally trusted
and well-behaved regarding the distribution interchangeability assumption.
Mirrors are especially well-behaved in this regard. That interchangeability
assumption is what makes comparing distributions of a given package meaningful.
Without it, the installer is no longer comparing “apples to apples.” In
practice, it is common for different indexes to have files that have different
contents than other indexes, such as builds for special hardware, or differing
metadata for the same package. Version priority behavior can lead to
undesirable, unexpected outcomes in these cases, and this is where `users
generally look for some kind of index priority
<https://github.com/pypa/pip/issues/8606>`__. Additionally, when there is a
difference in trust among indexes, version priority does not provide a way to
prefer more trusted indexes over less trusted indexes. This has been exploited by
dependency confusion attacks, and :pep:`708` was proposed as a way of
hard-coding a notion of trusted external indexes into the index.
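To make the collation step concrete, here is a minimal sketch of version priority. It is not taken from pip or any other installer: the ``Candidate`` record, the ``select_version_priority`` function, the example index URLs, and the reduction of "best" to "highest version tuple" are all assumptions made for illustration.

```python
from __future__ import annotations

from typing import NamedTuple


class Candidate(NamedTuple):
    """Hypothetical record for one distribution found on an index."""

    name: str
    version: tuple[int, ...]  # simplified stand-in for a parsed version
    index_url: str


def select_version_priority(
    indexes: dict[str, list[Candidate]], name: str
) -> Candidate | None:
    """Collate candidates from every index, then choose the best overall."""
    pool = [
        candidate
        for candidates in indexes.values()
        for candidate in candidates
        if candidate.name == name
    ]
    # "Best" is reduced to "highest version" here; real installers weigh more.
    return max(pool, key=lambda candidate: candidate.version, default=None)


INTERNAL = "https://internal.example/simple"  # assumed private index
PYPI = "https://pypi.org/simple"

indexes = {
    INTERNAL: [Candidate("demo", (1, 0), INTERNAL)],
    PYPI: [Candidate("demo", (2, 0), PYPI)],
}
chosen = select_version_priority(indexes, "demo")
print(chosen.index_url)
```

In this sketch the newer release on the second index is selected even though the internal index is listed first, which is the outcome that surprises users who expect the listing order to matter.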

The "version priority" name is new, and introduction of new terms should always
be minimized. This PEP looks toward the uv project, which refers to `its implementation of the version priority
behavior <https://docs.astral.sh/uv/pip/compatibility/#packages-that-exist-on-multiple-indexes>`__
as “``unsafe-best-match``.” Naming is really hard here. On one hand, it
isn't accurate to call pip's default behavior intrinsically “unsafe.”
The addition of possibly malicious indexes is what
introduces concern with this behavior. :pep:`708` added a way to restrict
installers from drawing packages from unexpected, potentially insecure
indexes. On the other hand, the term “best-match” is technically
correct, but also misleading. The “best match” varies by user and by
application. “Best” is technically correct in the sense that it is a
global optimum according to the match criteria specified above, but that
is not necessarily what is “best” in a user's eyes. “Version priority”
is a proposed term that avoids the concerns with the uv terminology,
while approximating the behavior in the most user-identifiable way that
packages are compared.

“Index priority”
----------------

In index priority, the resolver finds candidates for each index, one at a time.
The resolver proceeds to subsequent indexes only if the current package request
has no viable candidates. Index priority does not combine indexes into one
global, flat namespace. Because indexes are searched in order, the package from
an earlier index will be preferred over a package from a later index,
regardless of whether the later index had a better match with the installer's
optimization criteria. For a given installer, the optimization criteria and
selection algorithm should be the same for both index priority and version
priority. It is only the treatment of multiple indexes that differs: all
together for version priority, and individually for index priority.

The order of specification of indexes determines their priority in the
finding process. As a result, the way that installers load the index
configuration must be predictable and reproducible. This PEP does not prescribe
any particular mechanism, other than to say that installers should provide
a way of ordering their collection of sources. Installers should also
ideally provide optional debugging output that provides insight into
which index is being considered.

Each package's finder should start at the beginning of the list of indexes, so each
package starts over with the index list. In other words, if one package has no
valid candidates on the first index, but finds a hit on the second index,
subsequent packages should still start their search on the first index, rather than
starting on the second.
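The search order described above can be sketched in a few lines. As before, this is an illustration under assumptions rather than any installer's implementation: ``Candidate``, ``select_index_priority``, and the example URLs are hypothetical, and the per-index selection rule is simply "highest version tuple".

```python
from __future__ import annotations

from typing import NamedTuple


class Candidate(NamedTuple):
    """Hypothetical record for one distribution found on an index."""

    name: str
    version: tuple[int, ...]  # simplified stand-in for a parsed version
    index_url: str


def select_index_priority(
    ordered_indexes: list[tuple[str, list[Candidate]]], name: str
) -> Candidate | None:
    """Return the best candidate from the first index that has a viable one."""
    # The loop always begins at the highest-priority index, so an earlier
    # fall-through for a different package has no effect on this request.
    for _index_url, candidates in ordered_indexes:
        viable = [c for c in candidates if c.name == name]
        if viable:
            # Same selection rule as version priority, scoped to one index.
            return max(viable, key=lambda c: c.version)
    return None


INTERNAL = "https://internal.example/simple"  # assumed private index
PYPI = "https://pypi.org/simple"
ordered_indexes = [
    (INTERNAL, [Candidate("demo", (1, 0), INTERNAL)]),
    (PYPI, [Candidate("demo", (2, 0), PYPI), Candidate("other", (3, 1), PYPI)]),
]

for requested in ("other", "demo"):
    chosen = select_index_priority(ordered_indexes, requested)
    print(requested, chosen.version, chosen.index_url)
```

Here ``other`` falls through to the second index, while the following request for ``demo`` still resolves from the first index, to the older version, because each request restarts at the top of the list.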

One desirable behavior that the index priority strategy implies is that
there are no “surprise” updates, where a version bump on a
lower-priority index wins out over a curated, approved higher-priority
index. This is related to the security improvement of :pep:`708`, where
packages can restrict the external indexes that distributions can come
from, but index priority is more configurable by end users. Package installs are
expected to change only when either the higher-priority index or the
index priority configuration changes. This stability and predictability
makes it more viable to configure indexes as a more persistent property of an
environment, rather than a one-off argument for one install command.

Cache keys
~~~~~~~~~~

Because index priority is acknowledging the possibility that different indexes
may have different content for a given package, caching and lockfiles should now
include the index from which distributions were downloaded. Without this
aspect, it is possible that after changing the list of configured indexes, the
cache or lockfile could provide a similarly-named distribution from a
lower-priority index. If every index follows the recommended behavior of
providing identical files across indexes for a given filename, this is not an
issue. However, that recommendation is not readily enforceable, and augmenting
the cache key with origin index would be a wise defensive change.
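One way to picture such a key is sketched below. The ``cache_key`` function and its exact recipe (a SHA-256 digest over the index URL and the filename) are assumptions for illustration only; this PEP does not prescribe a key format, and existing installers may structure their caches quite differently.

```python
import hashlib


def cache_key(index_url: str, filename: str) -> str:
    """Build a cache key from the origin index and the distribution filename."""
    # Trailing slashes are dropped so one index is not cached under two keys.
    origin = index_url.rstrip("/")
    return hashlib.sha256(f"{origin}\n{filename}".encode("utf-8")).hexdigest()


wheel = "demo-1.0-py3-none-any.whl"
internal_key = cache_key("https://internal.example/simple/", wheel)
pypi_key = cache_key("https://pypi.org/simple/", wheel)
print(internal_key != pypi_key)
```

Because the origin is part of the key, an identically named file from a different index can never be served from the cache entry of another index.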

Ways that a request falls through to a lower priority index
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- Package name is not present at all in higher priority index
- All distributions from higher priority index filtered out due to
  version specifier, compatible Python version, platform tag, yanking or otherwise
- A denylist configuration for the installer specifies that a particular package
  name should be ignored on a given index
- A higher priority index is unreachable (e.g. blocked by firewall
  rules, temporarily unavailable due to maintenance, other miscellaneous
  and temporary networking issues). This is a less clear-cut detail that
  should be controllable by users. On one hand, this behavior would lead
  to less predictable, likely unreproducible results by unexpectedly
  falling through to lower priority indexes. On the other hand, graceful
  fallback may be more valuable to some users, especially if they can
  safely assume that all of their indexes are equally trusted. pip's
  behavior today is graceful fallback: you see warnings if an index is
  having connection issues, but the installation will proceed with any
  other available indexes. Because index priority can convey different trust
  levels between indexes, installers that implement index priority should
  default to raising errors and aborting on network issues. Installers may
  choose to provide a flag to allow fall-through to lower-priority indexes in
  case of network error.
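The recommended default for unreachable indexes could look roughly like the sketch below. Everything named here is hypothetical: ``IndexUnreachableError``, the ``allow_network_fallthrough`` parameter, and the ``fake_fetch`` stand-in for a network call are illustrative choices, not options of any existing installer.

```python
from __future__ import annotations

from typing import Callable


class IndexUnreachableError(RuntimeError):
    """Hypothetical error raised when a higher-priority index cannot be reached."""


def find_candidates(
    index_urls: list[str],
    fetch: Callable[[str, str], list[str]],
    name: str,
    allow_network_fallthrough: bool = False,
) -> tuple[str, list[str]] | None:
    """Search indexes in order, aborting on network errors unless told otherwise."""
    for index_url in index_urls:
        try:
            filenames = fetch(index_url, name)
        except ConnectionError as exc:
            if not allow_network_fallthrough:
                raise IndexUnreachableError(
                    f"{index_url} is unreachable; not falling through"
                ) from exc
            continue  # the user opted in to skipping unreachable indexes
        if filenames:
            return index_url, filenames
    return None


def fake_fetch(index_url: str, name: str) -> list[str]:
    """Stand-in for a network call: the internal index is always down."""
    if "internal" in index_url:
        raise ConnectionError("simulated outage")
    return [f"{name}-2.0-py3-none-any.whl"]


urls = ["https://internal.example/simple", "https://pypi.org/simple"]
print(find_candidates(urls, fake_fetch, "demo", allow_network_fallthrough=True))
```

Calling ``find_candidates`` without the opt-in argument raises ``IndexUnreachableError`` in this scenario, so a temporary outage on the trusted index cannot silently redirect the install to a less trusted one.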

Treatment within a given index follows existing behavior, but stops at
the bounds of one index and moves on to the next index only after all
priority preferences within the one index are exhausted. This means that
existing priorities among the unified collection of packages apply to
each index individually before falling through to a lower priority
index.

There are tradeoffs to make at every level of the optimization criteria:

- version: index priority will use an older version from a higher-priority index
  even if a newer version is available on another index.
- wheel vs sdist: Should the installer use an sdist from a higher-priority
  index before trying a wheel from a lower-priority index?
- more platform-specific wheels before less specific ones: Should the
  installer use less specific wheels from higher-priority indexes
  before using more specific wheels from lower priority indexes?
- flags such as pip's ``--prefer-binary``: Should the installer use an sdist from a higher
  priority index before considering wheels on a lower priority index?

Installers are free to implement these priorities in different ways for
themselves, but they should document their optimization criteria and how they
handle fall-through to lower-priority indexes. For example, an installer could
say that ``--prefer-binary`` should not install an sdist unless it had iterated
through all configured indexes and found no installable binary candidates.
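The ``--prefer-binary`` example policy can be sketched as follows. This shows only one of the possible policies, and it assumes that the filenames passed in all belong to one package and have already been filtered for compatibility; ``DistFile`` and ``select_prefer_binary`` are hypothetical names.

```python
from __future__ import annotations

from typing import NamedTuple


class DistFile(NamedTuple):
    """Hypothetical record pairing a distribution filename with its index."""

    filename: str
    index_url: str

    @property
    def is_wheel(self) -> bool:
        return self.filename.endswith(".whl")


def select_prefer_binary(
    ordered_indexes: list[tuple[str, list[str]]]
) -> DistFile | None:
    """Prefer a wheel from any index over an sdist, in index priority order."""
    first_sdist = None
    for index_url, filenames in ordered_indexes:
        for filename in filenames:
            dist = DistFile(filename, index_url)
            if dist.is_wheel:
                return dist  # first wheel in priority order wins
            if first_sdist is None:
                first_sdist = dist  # remember the highest-priority sdist
    # Reached only when no index offered a wheel.
    return first_sdist


ordered_indexes = [
    ("https://internal.example/simple", ["demo-1.0.tar.gz"]),
    ("https://pypi.org/simple", ["demo-1.0-py3-none-any.whl"]),
]
print(select_prefer_binary(ordered_indexes))
```

Under this policy the wheel on the lower-priority index is chosen over the sdist on the higher-priority one. An installer that ranks index order above binary preference would make the opposite choice, which is why documenting the policy matters.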

Mirroring
~~~~~~~~~

As described thus far, the index priority scheme breaks the use case of more
than one index URL serving the same content. Such mirrors may be used with the
intent of ameliorating network issues or otherwise improving reliability. One
approach that installers could take to preserve mirroring functionality while
adding index priority would be to add a notion of user-definable index groups,
where each index in the group is assumed to be equivalent. This is related to
`Poetry's notion of package sources
<https://python-poetry.org/docs/repositories/>`__, except that this would allow
arbitrary numbers of prioritizable groups, and that this would assume members of
a group to be mirrors. Within each group, content could be combined, or each
member could be fetched concurrently. The fastest responding index would then
represent the group.
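A minimal sketch of the index group idea follows, using the "combine content within a group" option. The ``select_from_groups`` function, the data layout, and the mirror URLs are assumptions for illustration; an unreachable mirror is modeled simply as a URL with no entry.

```python
from __future__ import annotations


def select_from_groups(
    groups: list[list[str]],
    versions_by_index: dict[str, set[tuple[int, ...]]],
) -> tuple[int, ...] | None:
    """Treat each group as one logical index; search groups in priority order."""
    for group in groups:
        # Members of a group are assumed to be mirrors, so their contents
        # are combined before the usual selection rule is applied.
        merged = set().union(*(versions_by_index.get(url, set()) for url in group))
        if merged:
            return max(merged)
    return None


groups = [
    ["https://mirror-a.example/simple", "https://mirror-b.example/simple"],
    ["https://pypi.org/simple"],
]
versions_by_index = {
    # mirror-a is absent, standing in for a mirror that is currently down
    "https://mirror-b.example/simple": {(1, 0), (1, 1)},
    "https://pypi.org/simple": {(2, 0)},
}
print(select_from_groups(groups, versions_by_index))
```

The first group still answers the request through its remaining mirror, so priority between groups is preserved while mirrors within a group keep their reliability benefit.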

Backwards Compatibility
=======================

This PEP does not prescribe any changes as mandatory for any installer,
so it only introduces compatibility concerns if tools choose to adopt an
index behavior other than the behavior(s) they currently implement.

This PEP's language does not quite align with existing tools, including
pip and uv. Either this PEP's language can change during review of this PEP, or if
this PEP's language is preferred, other projects could conform to it.
The only goal of proposing these terms is to create a central, common vocabulary
that makes it easier for users to learn about other installers.

As some tools rely on one or the other behavior, there are some possible
issues that may emerge, where tailoring available resources/packages for
a particular behavior may detract from the user experience for people
who rely on the other behavior.

- Different indexes may have different metadata. For example, one cannot assume
  that the metadata for package “something” on index “A” has the same dependencies
  as “something” on index “B”. This breaks fundamental assumptions of version
  priority, but index priority can handle this. When an installer falls through to a
  lower-priority index in the search order, it implies refreshing the package metadata
  from the new index. This is both an improvement and a complication. It is a
  complication in the sense that a cached metadata entry must be keyed by both
  package name and index URL, instead of just package name. It is a potential
  improvement in that different implementation variants of a package can differ in
  dependencies as long as their distributions are separated into different indexes.
- Users may not get updates as they expect when using index priority, because some higher priority
  index has not updated/synchronized with PyPI to get the latest
  packages. If the higher priority index has a valid candidate, newer
  packages will not be found. This will need to be communicated
  verbosely, because it is counter to pip's well-established behavior.
- By adding index priority, an installer will improve the predictability of
  which index will be selected, and index hosts may abuse this as a way of having
  similarly named files that have different contents. With version priority,
  this violates the key package interchangeability assumption, and insanity will ensue.
  Index priority would be more workable, but the situation still
  has great potential for confusion. It would be helpful to develop tools that
  support installers in identifying these confusing issues. These tools could
  operate independently of the installer process, as a means of validating the
  sanity of a set of indexes. Depending on the time cost of these tools, the
  installers could run them as part of their process. Users could, of course,
  ignore the recommendations at their own risk.
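As a rough idea of what such a validation tool might check, the sketch below flags filenames that appear on more than one index with different hashes. No such tool is specified by this PEP: ``find_conflicting_files``, the input layout, and the placeholder digests are assumptions, and a real tool would first have to gather the listings from each index.

```python
from __future__ import annotations


def find_conflicting_files(
    listings: dict[str, dict[str, str]]
) -> dict[str, dict[str, str]]:
    """Return filenames whose hashes differ between indexes.

    ``listings`` maps an index URL to ``{filename: sha256 hex digest}``.
    """
    digests_by_filename: dict[str, dict[str, str]] = {}
    for index_url, files in listings.items():
        for filename, digest in files.items():
            digests_by_filename.setdefault(filename, {})[index_url] = digest
    return {
        filename: per_index
        for filename, per_index in digests_by_filename.items()
        if len(set(per_index.values())) > 1
    }


listings = {
    "https://internal.example/simple": {
        "demo-1.0-py3-none-any.whl": "aaa",  # placeholder digest
        "other-3.1.tar.gz": "ccc",
    },
    "https://pypi.org/simple": {
        "demo-1.0-py3-none-any.whl": "bbb",  # same filename, different content
        "other-3.1.tar.gz": "ccc",
    },
}
for filename, per_index in find_conflicting_files(listings).items():
    print(filename, sorted(per_index))
```

Only the ``demo`` wheel is reported here, because its digest differs between the two indexes, while the ``other`` sdist is identical on both.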

Security Implications
=====================

Index priority creates a mechanism for users to explicitly specify a trust
hierarchy among their indexes. As such, it limits the potential for dependency
confusion attacks. Index priority was rejected by :pep:`708` as a solution for
dependency confusion attacks. This PEP requests that the rejection be
reconsidered, with index priority serving a different purpose. This PEP is
primarily motivated by the desire to support implementation variants, which is
the subject of `another discussion that hopefully leads to a PEP
<https://discuss.python.org/t/selecting-variant-wheels-according-to-a-semi-static-specification/53446>`__.
It is not mutually exclusive with :pep:`708`, nor does it suggest reverting or
withdrawing :pep:`708`. It is an answer to `how we could allow users to choose
which index to use at a more fine grained level than “per install”
<https://github.com/astral-sh/uv/issues/171#issuecomment-1952291242>`__.

For a more thorough discussion of the :pep:`708` rejection of index
priority, please see the `discuss.python.org thread for this PEP
<https://discuss.python.org/t/pep-766-handling-multiple-indexes-index-priority/71589>`__.

How to Teach This
=================

At the outset, the goal is not to convert pip or any other tool to
change its default priority behavior. The best way to teach is perhaps
to watch message boards, GitHub issue trackers and chat channels,
keeping an eye out for problems that index priority could help solve.
There are `several <https://github.com/pypa/pip/issues/8606>`__
`long-standing <https://stackoverflow.com/questions/67253141/python-pip-priority-order-with-index-url-and-extra-index-url>`__
`discussions <https://github.com/pypa/pip/issues/5045>`__
`that <https://discuss.python.org/t/dependency-notation-including-the-index-url/5659>`__
`would <https://github.com/pypa/pip/issues/9612>`__ be good places to
start advertising the concepts. The topics of the two officially
supported behaviors need documentation, and we, the authors of this
PEP, would develop these as part of the review period of this PEP.
These docs would likely consist of additions across several
indexes, cross-linking the concepts between installers. At a
minimum, we expect to add to the
`PyPUG <https://packaging.python.org/en/latest/>`__ and to `pip's
documentation <https://pip.pypa.io/en/stable/cli/pip_install/>`__.

It will be important for installers to advertise the active behavior, especially in
error messaging, which also gives installers a place to point users toward
resources about these behaviors.

uv users are already experiencing index priority. uv `documents this
behavior <https://docs.astral.sh/uv/pip/compatibility/#packages-that-exist-on-multiple-indexes>`__
well, but it is always possible to `improve the
discoverability <https://github.com/astral-sh/uv/issues/4389>`__ of that
documentation from the command line, `where users will actually
encounter the unexpected
behavior <https://github.com/astral-sh/uv/issues/5146>`__.

Reference Implementation
========================

The uv project demonstrates index priority with its default behavior. uv
is implemented in Rust, though, so if a reference implementation for a Python-based tool
is necessary, we, the authors of this PEP, will provide one. For pip in
particular, we see the implementation plan as something like:

- For users who don't use ``--extra-index-url`` or ``--find-links``,
  there will be no change, and no migration is necessary.
- pip users would be able to opt in to the index priority behavior with a
  new config setting in the CLI and in ``pip.conf``. This proposal does not
  recommend any strategy as the default for any installer. It only
  recommends documenting the strategies that a tool provides.
- Enable extra info-level output for any pip operation where more than
  one index is used. In this output, state the current strategy setting,
  and a terse summary of implied behavior, as well as a link to docs
  that describe the different options.
- Add debugging output that verbosely identifies the index being used at
  each step, including where the file is in the configuration hierarchy,
  and where it is being included (via config file, env var, or CLI
  flag).
- Plumb tracking of which index gets used for which
  package/distribution through the entire pip install process. Store
  this information so that it is available to tools like ``pip freeze``.
- Supplement :pep:`751` (lockfiles) with capture of the index where a
  package/distribution came from.

Rejected Ideas
==============
- Tell users to set up a proxy/mirror, such as `devpi <https://github.com/devpi/devpi>`__
  or `Artifactory <https://jfrog.com/help/r/jfrog-artifactory-documentation/pypi-repositories>`__ that
  serves local files if present, and forwards to another server (PyPI)
  if no local files match

  This matches the behavior of this proposal very closely, except that
  this method requires hosting some server, and may be inaccessible or
  not configurable to users in some environments. It is also important
  to consider that for an organization that operates its own index
  (for overcoming PyPI size restrictions, for example), this does not
  solve the need for ``--extra-index-url`` or proxy/mirror for end
  users. That is, organizations get no improvement from this approach
  unless they proxy/mirror PyPI as a whole, and get users to configure
  their proxy/mirror as their sole index.

- Are build tags and/or local version specifiers enough?

  Build tags and local version specifiers will take precedence over
  packages without those tags and/or local version specifiers. In a pool
  of packages, builds that have these additions hosted on a server other
  than PyPI will take priority over packages on PyPI, which rarely use
  build tags and may not use local version specifiers. This approach is
  viable when package providers want to provide their own local
  override, such as `HPC maintainers who provide optimized builds for
  their
  users <https://github.com/ComputeCanada/software-stack/blob/main/pip-which-version.md>`__.
  It is less viable in some ways, such as build tags not showing up in
  ``pip freeze`` metadata, and `local version specifiers not being
  allowed on
  PyPI <https://discuss.python.org/t/lets-permit-local-version-label-in-version-specifiers/22781>`__.
  There is also significant work entailed in building and maintaining
  package collections with local build tag variants.

  https://discuss.python.org/t/dependency-notation-including-the-index-url/5659/21

- What about :pep:`708`? Isn't that
  enough?

  :pep:`708` is aimed specifically at addressing dependency confusion
  attacks, and doesn't address the potential for implementation variants
  among indexes. It is a way of filtering external URLs and encoding an
  allow-list for external indexes in index metadata. It does not change
  the lack of priority or preference among indexes that currently
  exists.

- `Namespacing <https://discuss.python.org/t/dependency-notation-including-the-index-url/5659>`__

  Namespacing is a means of specifying a package such that the Python
  usage of the package does not change, but the package installation
  restricts where the package comes from. :pep:`752` recently proposed a way to
  multiplex a package's owners in a flat package namespace (e.g.
  PyPI) by reserving prefixes as grouping elements. `NPM's concept
  of “scopes” <https://docs.npmjs.com/cli/v10/using-npm/scope>`__ has
  been raised as another good example of how this might look. This PEP
  differs in that it is targeted to multiple indexes, not a flat package
  namespace. The net effect is roughly the same in terms of predictably
  choosing a particular package source, except that the namespacing
  approach relies more on naming packages with these namespace prefixes,
  whereas this PEP would be less granular, pulling in packages on
  whatever higher-priority index the user specifies. The namespacing
  approach relies on all configured indexes treating a given namespace
  similarly, which leaves the usual concern that not all configured
  indexes are trusted equally. The namespace idea is not incompatible
  with this PEP, but it also does not improve expression of trust of
  indexes in the way that this PEP does.

Open Issues
===========
[Any points that are still being decided/discussed.]

Acknowledgements
================

This work was supported financially by NVIDIA through employment of the author.
NVIDIA teammates dramatically improved this PEP with their
input. Astral Software pioneered the behaviors of index priority and thus laid the
foundation of this document. The pip authors deserve great praise for their
consistent direction and patient communication of the version priority behavior,
especially in the face of contentious security concerns.

Copyright
=========

This document is placed in the public domain or under the
CC0-1.0-Universal license, whichever is more permissive.