390 lines
15 KiB
ReStructuredText
390 lines
15 KiB
ReStructuredText
PEP: 618
|
|
Title: Add Optional Length-Checking To zip
|
|
Version: $Revision$
|
|
Last-Modified: $Date$
|
|
Author: Brandt Bucher <brandt@python.org>
|
|
Sponsor: Antoine Pitrou <antoine@python.org>
|
|
BDFL-Delegate: Guido van Rossum <guido@python.org>
|
|
Status: Final
|
|
Type: Standards Track
|
|
Content-Type: text/x-rst
|
|
Created: 01-May-2020
|
|
Python-Version: 3.10
|
|
Post-History: 01-May-2020, 10-May-2020, 16-Jun-2020
|
|
Resolution: https://mail.python.org/archives/list/python-dev@python.org/message/NLWB7FVJGMBBMCF4P3ZKUIE53JPDOWJ3
|
|
|
|
|
|
Abstract
|
|
========
|
|
|
|
This PEP proposes adding an optional ``strict`` boolean keyword
|
|
parameter to the built-in ``zip``. When enabled, a ``ValueError`` is
|
|
raised if one of the arguments is exhausted before the others.
|
|
|
|
|
|
Motivation
|
|
==========
|
|
|
|
It is clear from the author's personal experience and a `survey of the
|
|
standard library <examples_>`_ that much (if not most) ``zip`` usage
|
|
involves iterables that *must* be of equal length. Sometimes this
|
|
invariant is proven true from the context of the surrounding code, but
|
|
often the data being zipped is passed from the caller, sourced
|
|
separately, or generated in some fashion. In any of these cases, the
|
|
default behavior of ``zip`` means that faulty refactoring or logic
|
|
errors could easily result in silently losing data. These bugs are
|
|
not only difficult to diagnose, but difficult to even detect at all.
|
|
|
|
It is easy to come up with simple cases where this could be a problem.
|
|
For example, the following code may work fine when ``items`` is a
|
|
sequence, but silently start producing shortened, mismatched results
|
|
if ``items`` is refactored by the caller to be a consumable iterator::
|
|
|
|
def apply_calculations(items):
|
|
transformed = transform(items)
|
|
for i, t in zip(items, transformed):
|
|
yield calculate(i, t)
|
|
|
|
There are several other ways in which ``zip`` is commonly used.
|
|
Idiomatic tricks are especially susceptible, because they are often
|
|
employed by users who lack a complete understanding of how the code
|
|
works. One example is unpacking into ``zip`` to lazily "unzip" or
|
|
"transpose" nested iterables::
|
|
|
|
>>> x = [[1, 2, 3], ["one" "two" "three"]]
|
|
>>> xt = list(zip(*x))
|
|
|
|
Another is "chunking" data into equal-sized groups::
|
|
|
|
>>> n = 3
|
|
>>> x = range(n ** 2),
|
|
>>> xn = list(zip(*[iter(x)] * n))
|
|
|
|
In the first case, non-rectangular data is usually a logic error. In
|
|
the second case, data with a length that is not a multiple of ``n`` is
|
|
often an error as well. However, both of these idioms will silently
|
|
omit the tail-end items of malformed input.
|
|
|
|
Perhaps most convincingly, the use of ``zip`` in the standard-library
|
|
``ast`` module created a bug in ``literal_eval`` which `silently
|
|
dropped parts of malformed nodes
|
|
<https://bugs.python.org/issue40355>`_::
|
|
|
|
>>> from ast import Constant, Dict, literal_eval
|
|
>>> nasty_dict = Dict(keys=[Constant(None)], values=[])
|
|
>>> literal_eval(nasty_dict) # Like eval("{None: }")
|
|
{}
|
|
|
|
In fact, the author has `counted dozens of other call sites
|
|
<examples_>`_ in Python's standard library and tooling where it
|
|
would be appropriate to enable this new feature immediately.
|
|
|
|
|
|
Rationale
|
|
=========
|
|
|
|
Some critics assert that constant boolean switches are a "code-smell",
|
|
or go against Python's design philosophy. However, Python currently
|
|
contains several examples of boolean keyword parameters on built-in
|
|
functions which are typically called with compile-time constants:
|
|
|
|
- ``compile(..., dont_inherit=True)``
|
|
- ``open(..., closefd=False)``
|
|
- ``print(..., flush=True)``
|
|
- ``sorted(..., reverse=True)``
|
|
|
|
Many more exist in the standard library.
|
|
|
|
The idea and name for this new parameter were `originally proposed
|
|
<https://mail.python.org/archives/list/python-ideas@python.org/message/6GFUADSQ5JTF7W7OGWF7XF2NH2XUTUQM>`_
|
|
by Ram Rachum. The thread received over 100 replies, with the
|
|
alternative "equal" receiving a similar amount of support.
|
|
|
|
The author does not have a strong preference between the two choices,
|
|
though "equal equals" *is* a bit awkward in prose. It may also
|
|
(wrongly) imply some notion of "equality" between the zipped items::
|
|
|
|
>>> z = zip([2.0, 4.0, 6.0], [2, 4, 8], equal=True)
|
|
|
|
|
|
Specification
|
|
=============
|
|
|
|
When the built-in ``zip`` is called with the keyword-only argument
|
|
``strict=True``, the resulting iterator will raise a ``ValueError`` if
|
|
the arguments are exhausted at differing lengths. This error will
|
|
occur at the point when iteration would normally stop today.
|
|
|
|
|
|
Backward Compatibility
|
|
======================
|
|
|
|
This change is fully backward-compatible. ``zip`` currently takes no
|
|
keyword arguments, and the "non-strict" default behavior when
|
|
``strict`` is omitted remains unchanged.
|
|
|
|
|
|
Reference Implementation
|
|
========================
|
|
|
|
The author has drafted a `C implementation
|
|
<https://github.com/python/cpython/pull/20921>`_.
|
|
|
|
An approximate Python translation is::
|
|
|
|
def zip(*iterables, strict=False):
|
|
if not iterables:
|
|
return
|
|
iterators = tuple(iter(iterable) for iterable in iterables)
|
|
try:
|
|
while True:
|
|
items = []
|
|
for iterator in iterators:
|
|
items.append(next(iterator))
|
|
yield tuple(items)
|
|
except StopIteration:
|
|
if not strict:
|
|
return
|
|
if items:
|
|
i = len(items)
|
|
plural = " " if i == 1 else "s 1-"
|
|
msg = f"zip() argument {i+1} is shorter than argument{plural}{i}"
|
|
raise ValueError(msg)
|
|
sentinel = object()
|
|
for i, iterator in enumerate(iterators[1:], 1):
|
|
if next(iterator, sentinel) is not sentinel:
|
|
plural = " " if i == 1 else "s 1-"
|
|
msg = f"zip() argument {i+1} is longer than argument{plural}{i}"
|
|
raise ValueError(msg)
|
|
|
|
Rejected Ideas
|
|
==============
|
|
|
|
Add ``itertools.zip_strict``
|
|
----------------------------
|
|
|
|
This is the alternative with the most support on the Python-Ideas
|
|
mailing list, so it deserves to be discussed in some detail here. It
|
|
does not have any disqualifying flaws, and could work well enough as a
|
|
substitute if this PEP is rejected.
|
|
|
|
With that in mind, this section aims to outline why adding an optional
|
|
parameter to ``zip`` is a smaller change that ultimately does a better
|
|
job of solving the problems motivating this PEP.
|
|
|
|
|
|
Precedent
|
|
'''''''''
|
|
|
|
It seems that a great deal of the motivation driving this alternative
|
|
is that ``zip_longest`` already exists in ``itertools``. However,
|
|
``zip_longest`` is in many ways a much more complicated, specialized
|
|
utility: it takes on the responsibility of filling in missing values,
|
|
a job neither of the other variants needs to concern themselves with.
|
|
|
|
If both ``zip`` and ``zip_longest`` lived alongside each other in
|
|
``itertools`` or as builtins, then adding ``zip_strict`` in the same
|
|
location would indeed be a much stronger argument. However, the new
|
|
"strict" variant is conceptually *much* closer to ``zip`` in interface
|
|
and behavior than ``zip_longest``, while still not meeting the high
|
|
bar of being its own builtin. Given this situation, it seems most
|
|
natural for ``zip`` to grow this new option in-place.
|
|
|
|
|
|
Usability
|
|
'''''''''
|
|
|
|
If ``zip`` is capable of preventing this class of bug, it becomes much
|
|
simpler for users to enable the check at call sites with this
|
|
property. Compare this with importing a drop-in replacement for a
|
|
built-in utility, which feels somewhat heavy just to check a tricky
|
|
condition that should "always" be true.
|
|
|
|
Some have also argued that a new function buried in the standard
|
|
library is somehow more "discoverable" than a keyword parameter on the
|
|
built-in itself. The author does not agree with this assessment.
|
|
|
|
|
|
Maintenance Cost
|
|
''''''''''''''''
|
|
|
|
While implementation should only be a secondary concern when making
|
|
usability improvements, it is important to recognize that adding a new
|
|
utility is significantly more complicated than modifying an existing
|
|
one. The CPython implementation accompanying this PEP is simple and
|
|
has no measurable performance impact on default ``zip`` behavior,
|
|
while adding an entirely new utility to ``itertools`` would require
|
|
either:
|
|
|
|
- Duplicating much of the existing ``zip`` logic, as ``zip_longest``
|
|
already does.
|
|
- Significantly refactoring either ``zip``, ``zip_longest``, or both
|
|
to share a common or inherited implementation (which may impact
|
|
performance).
|
|
|
|
|
|
Add Several "Modes" To Switch Between
|
|
-------------------------------------
|
|
|
|
This option only makes more sense than a binary flag if we anticipate
|
|
having three or more modes. The "obvious" three choices for these
|
|
enumerated or constant modes would be "shortest" (the current ``zip``
|
|
behavior), "strict" (the proposed behavior), and "longest"
|
|
(the ``itertools.zip_longest`` behavior).
|
|
|
|
However, it doesn't seem like adding behaviors other than the current
|
|
default and the proposed "strict" mode is worth the additional
|
|
complexity. The clearest candidate, "longest", would require a new
|
|
``fillvalue`` parameter (which is meaningless for both other modes).
|
|
This mode is also already handled perfectly by
|
|
``itertools.zip_longest``, and adding it would create two ways of
|
|
doing the same thing. It's not clear which would be the "obvious"
|
|
choice: the ``mode`` parameter on the built-in ``zip``, or the
|
|
long-lived namesake utility in ``itertools``.
|
|
|
|
|
|
Add A Method Or Alternate Constructor To The ``zip`` Type
|
|
---------------------------------------------------------
|
|
|
|
Consider the following two options, which have both been proposed::
|
|
|
|
>>> zm = zip(*iters).strict()
|
|
>>> zd = zip.strict(*iters)
|
|
|
|
It's not obvious which one will succeed, or how the other will fail.
|
|
If ``zip.strict`` is implemented as a method, ``zm`` will succeed, but
|
|
``zd`` will fail in one of several confusing ways:
|
|
|
|
- Yield results that aren't wrapped in a tuple (if ``iters`` contains
|
|
just one item, a ``zip`` iterator).
|
|
- Raise a ``TypeError`` for an incorrect argument type (if ``iters``
|
|
contains just one item, not a ``zip`` iterator).
|
|
- Raise a ``TypeError`` for an incorrect number of arguments
|
|
(otherwise).
|
|
|
|
If ``zip.strict`` is implemented as a ``classmethod`` or
|
|
``staticmethod``, ``zd`` will succeed, and ``zm`` will silently yield
|
|
nothing (which is the problem we are trying to avoid in the first
|
|
place).
|
|
|
|
This proposal is further complicated by the fact that CPython's actual
|
|
``zip`` type is currently an undocumented implementation detail. This
|
|
means that choosing one of the above behaviors will effectively "lock
|
|
in" the current implementation (or at least require it to be emulated)
|
|
going forward.
|
|
|
|
|
|
Change The Default Behavior Of ``zip``
|
|
--------------------------------------
|
|
|
|
There is nothing "wrong" with the default behavior of ``zip``, since
|
|
there are many cases where it is indeed the correct way to handle
|
|
unequally-sized inputs. It's extremely useful, for example, when
|
|
dealing with infinite iterators.
|
|
|
|
``itertools.zip_longest`` already exists to service those cases where
|
|
the "extra" tail-end data is still needed.
|
|
|
|
|
|
Accept A Callback To Handle Remaining Items
|
|
-------------------------------------------
|
|
|
|
While able to do basically anything a user could need, this solution
|
|
makes handling the more common cases (like rejecting mismatched
|
|
lengths) unnecessarily complicated and non-obvious.
|
|
|
|
|
|
Raise An ``AssertionError``
|
|
---------------------------
|
|
|
|
There are no built-in functions or types that raise an
|
|
``AssertionError`` as part of their API. Further, the `official
|
|
documentation
|
|
<https://docs.python.org/3.9/library/exceptions.html?highlight=assertionerror#AssertionError>`_
|
|
simply reads (in its entirety):
|
|
|
|
Raised when an ``assert`` statement fails.
|
|
|
|
Since this feature has nothing to do with Python's ``assert``
|
|
statement, raising an ``AssertionError`` here would be inappropriate.
|
|
Users desiring a check that is disabled in optimized mode (like an
|
|
``assert`` statement) can use ``strict=__debug__`` instead.
|
|
|
|
|
|
Add A Similar Feature to ``map``
|
|
--------------------------------
|
|
|
|
This PEP does not propose any changes to ``map``, since the use of
|
|
``map`` with multiple iterable arguments is quite rare. However, this
|
|
PEP's ruling shall serve as precedent such a future discussion (should
|
|
it occur).
|
|
|
|
If rejected, the feature is realistically not worth pursuing. If
|
|
accepted, such a change to ``map`` should not require its own PEP
|
|
(though, like all enhancements, its usefulness should be carefully
|
|
considered). For consistency, it should follow same API and semantics
|
|
debated here for ``zip``.
|
|
|
|
|
|
Do Nothing
|
|
----------
|
|
|
|
This option is perhaps the least attractive.
|
|
|
|
Silently truncated data is a particularly nasty class of bug, and
|
|
hand-writing a robust solution that gets this right `isn't trivial
|
|
<https://stackoverflow.com/questions/32954486/zip-iterators-asserting-for-equal-length-in-python>`_.
|
|
The real-world motivating examples from Python's own standard library
|
|
are evidence that it's *very* easy to fall into the sort of trap that
|
|
this feature aims to avoid.
|
|
|
|
|
|
References
|
|
==========
|
|
|
|
Examples
|
|
--------
|
|
|
|
.. note:: This listing is not exhaustive.
|
|
|
|
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/_pydecimal.py#L3394
|
|
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/_pydecimal.py#L3418
|
|
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/_pydecimal.py#L3435
|
|
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/ast.py#L94-L95
|
|
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/ast.py#L1184
|
|
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/ast.py#L1275
|
|
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/ast.py#L1363
|
|
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/ast.py#L1391
|
|
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/copy.py#L217
|
|
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/csv.py#L142
|
|
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/dis.py#L462
|
|
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/filecmp.py#L142
|
|
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/filecmp.py#L143
|
|
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/inspect.py#L1440
|
|
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/inspect.py#L2095
|
|
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/os.py#L510
|
|
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/plistlib.py#L577
|
|
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/tarfile.py#L1317
|
|
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/tarfile.py#L1323
|
|
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/tarfile.py#L1339
|
|
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/turtle.py#L3015
|
|
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/turtle.py#L3071
|
|
- https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/turtle.py#L3901
|
|
|
|
|
|
Copyright
|
|
=========
|
|
|
|
This document is placed in the public domain or under the
|
|
CC0-1.0-Universal license, whichever is more permissive.
|
|
|
|
|
|
..
|
|
Local Variables:
|
|
mode: indented-text
|
|
indent-tabs-mode: nil
|
|
sentence-end-double-space: t
|
|
fill-column: 70
|
|
coding: utf-8
|
|
End:
|