PEP: 618 Title: Add Optional Length-Checking To zip Version: $Revision$ Last-Modified: $Date$ Author: Brandt Bucher Sponsor: Antoine Pitrou BDFL-Delegate: Guido van Rossum Status: Final Type: Standards Track Content-Type: text/x-rst Created: 01-May-2020 Python-Version: 3.10 Post-History: 01-May-2020, 10-May-2020, 16-Jun-2020 Resolution: https://mail.python.org/archives/list/python-dev@python.org/message/NLWB7FVJGMBBMCF4P3ZKUIE53JPDOWJ3 Abstract ======== This PEP proposes adding an optional ``strict`` boolean keyword parameter to the built-in ``zip``. When enabled, a ``ValueError`` is raised if one of the arguments is exhausted before the others. Motivation ========== It is clear from the author's personal experience and a `survey of the standard library `_ that much (if not most) ``zip`` usage involves iterables that *must* be of equal length. Sometimes this invariant is proven true from the context of the surrounding code, but often the data being zipped is passed from the caller, sourced separately, or generated in some fashion. In any of these cases, the default behavior of ``zip`` means that faulty refactoring or logic errors could easily result in silently losing data. These bugs are not only difficult to diagnose, but difficult to even detect at all. It is easy to come up with simple cases where this could be a problem. For example, the following code may work fine when ``items`` is a sequence, but silently start producing shortened, mismatched results if ``items`` is refactored by the caller to be a consumable iterator:: def apply_calculations(items): transformed = transform(items) for i, t in zip(items, transformed): yield calculate(i, t) There are several other ways in which ``zip`` is commonly used. Idiomatic tricks are especially susceptible, because they are often employed by users who lack a complete understanding of how the code works. One example is unpacking into ``zip`` to lazily "unzip" or "transpose" nested iterables:: >>> x = [[1, 2, 3], ["one" "two" "three"]] >>> xt = list(zip(*x)) Another is "chunking" data into equal-sized groups:: >>> n = 3 >>> x = range(n ** 2), >>> xn = list(zip(*[iter(x)] * n)) In the first case, non-rectangular data is usually a logic error. In the second case, data with a length that is not a multiple of ``n`` is often an error as well. However, both of these idioms will silently omit the tail-end items of malformed input. Perhaps most convincingly, the use of ``zip`` in the standard-library ``ast`` module created a bug in ``literal_eval`` which `silently dropped parts of malformed nodes `_:: >>> from ast import Constant, Dict, literal_eval >>> nasty_dict = Dict(keys=[Constant(None)], values=[]) >>> literal_eval(nasty_dict) # Like eval("{None: }") {} In fact, the author has `counted dozens of other call sites `_ in Python's standard library and tooling where it would be appropriate to enable this new feature immediately. Rationale ========= Some critics assert that constant boolean switches are a "code-smell", or go against Python's design philosophy. However, Python currently contains several examples of boolean keyword parameters on built-in functions which are typically called with compile-time constants: - ``compile(..., dont_inherit=True)`` - ``open(..., closefd=False)`` - ``print(..., flush=True)`` - ``sorted(..., reverse=True)`` Many more exist in the standard library. The idea and name for this new parameter were `originally proposed `_ by Ram Rachum. The thread received over 100 replies, with the alternative "equal" receiving a similar amount of support. The author does not have a strong preference between the two choices, though "equal equals" *is* a bit awkward in prose. It may also (wrongly) imply some notion of "equality" between the zipped items:: >>> z = zip([2.0, 4.0, 6.0], [2, 4, 8], equal=True) Specification ============= When the built-in ``zip`` is called with the keyword-only argument ``strict=True``, the resulting iterator will raise a ``ValueError`` if the arguments are exhausted at differing lengths. This error will occur at the point when iteration would normally stop today. Backward Compatibility ====================== This change is fully backward-compatible. ``zip`` currently takes no keyword arguments, and the "non-strict" default behavior when ``strict`` is omitted remains unchanged. Reference Implementation ======================== The author has drafted a `C implementation `_. An approximate Python translation is:: def zip(*iterables, strict=False): if not iterables: return iterators = tuple(iter(iterable) for iterable in iterables) try: while True: items = [] for iterator in iterators: items.append(next(iterator)) yield tuple(items) except StopIteration: if not strict: return if items: i = len(items) plural = " " if i == 1 else "s 1-" msg = f"zip() argument {i+1} is shorter than argument{plural}{i}" raise ValueError(msg) sentinel = object() for i, iterator in enumerate(iterators[1:], 1): if next(iterator, sentinel) is not sentinel: plural = " " if i == 1 else "s 1-" msg = f"zip() argument {i+1} is longer than argument{plural}{i}" raise ValueError(msg) Rejected Ideas ============== Add ``itertools.zip_strict`` ---------------------------- This is the alternative with the most support on the Python-Ideas mailing list, so it deserves to be discussed in some detail here. It does not have any disqualifying flaws, and could work well enough as a substitute if this PEP is rejected. With that in mind, this section aims to outline why adding an optional parameter to ``zip`` is a smaller change that ultimately does a better job of solving the problems motivating this PEP. Precedent ''''''''' It seems that a great deal of the motivation driving this alternative is that ``zip_longest`` already exists in ``itertools``. However, ``zip_longest`` is in many ways a much more complicated, specialized utility: it takes on the responsibility of filling in missing values, a job neither of the other variants needs to concern themselves with. If both ``zip`` and ``zip_longest`` lived alongside each other in ``itertools`` or as builtins, then adding ``zip_strict`` in the same location would indeed be a much stronger argument. However, the new "strict" variant is conceptually *much* closer to ``zip`` in interface and behavior than ``zip_longest``, while still not meeting the high bar of being its own builtin. Given this situation, it seems most natural for ``zip`` to grow this new option in-place. Usability ''''''''' If ``zip`` is capable of preventing this class of bug, it becomes much simpler for users to enable the check at call sites with this property. Compare this with importing a drop-in replacement for a built-in utility, which feels somewhat heavy just to check a tricky condition that should "always" be true. Some have also argued that a new function buried in the standard library is somehow more "discoverable" than a keyword parameter on the built-in itself. The author does not agree with this assessment. Maintenance Cost '''''''''''''''' While implementation should only be a secondary concern when making usability improvements, it is important to recognize that adding a new utility is significantly more complicated than modifying an existing one. The CPython implementation accompanying this PEP is simple and has no measurable performance impact on default ``zip`` behavior, while adding an entirely new utility to ``itertools`` would require either: - Duplicating much of the existing ``zip`` logic, as ``zip_longest`` already does. - Significantly refactoring either ``zip``, ``zip_longest``, or both to share a common or inherited implementation (which may impact performance). Add Several "Modes" To Switch Between ------------------------------------- This option only makes more sense than a binary flag if we anticipate having three or more modes. The "obvious" three choices for these enumerated or constant modes would be "shortest" (the current ``zip`` behavior), "strict" (the proposed behavior), and "longest" (the ``itertools.zip_longest`` behavior). However, it doesn't seem like adding behaviors other than the current default and the proposed "strict" mode is worth the additional complexity. The clearest candidate, "longest", would require a new ``fillvalue`` parameter (which is meaningless for both other modes). This mode is also already handled perfectly by ``itertools.zip_longest``, and adding it would create two ways of doing the same thing. It's not clear which would be the "obvious" choice: the ``mode`` parameter on the built-in ``zip``, or the long-lived namesake utility in ``itertools``. Add A Method Or Alternate Constructor To The ``zip`` Type --------------------------------------------------------- Consider the following two options, which have both been proposed:: >>> zm = zip(*iters).strict() >>> zd = zip.strict(*iters) It's not obvious which one will succeed, or how the other will fail. If ``zip.strict`` is implemented as a method, ``zm`` will succeed, but ``zd`` will fail in one of several confusing ways: - Yield results that aren't wrapped in a tuple (if ``iters`` contains just one item, a ``zip`` iterator). - Raise a ``TypeError`` for an incorrect argument type (if ``iters`` contains just one item, not a ``zip`` iterator). - Raise a ``TypeError`` for an incorrect number of arguments (otherwise). If ``zip.strict`` is implemented as a ``classmethod`` or ``staticmethod``, ``zd`` will succeed, and ``zm`` will silently yield nothing (which is the problem we are trying to avoid in the first place). This proposal is further complicated by the fact that CPython's actual ``zip`` type is currently an undocumented implementation detail. This means that choosing one of the above behaviors will effectively "lock in" the current implementation (or at least require it to be emulated) going forward. Change The Default Behavior Of ``zip`` -------------------------------------- There is nothing "wrong" with the default behavior of ``zip``, since there are many cases where it is indeed the correct way to handle unequally-sized inputs. It's extremely useful, for example, when dealing with infinite iterators. ``itertools.zip_longest`` already exists to service those cases where the "extra" tail-end data is still needed. Accept A Callback To Handle Remaining Items ------------------------------------------- While able to do basically anything a user could need, this solution makes handling the more common cases (like rejecting mismatched lengths) unnecessarily complicated and non-obvious. Raise An ``AssertionError`` --------------------------- There are no built-in functions or types that raise an ``AssertionError`` as part of their API. Further, the `official documentation `_ simply reads (in its entirety): Raised when an ``assert`` statement fails. Since this feature has nothing to do with Python's ``assert`` statement, raising an ``AssertionError`` here would be inappropriate. Users desiring a check that is disabled in optimized mode (like an ``assert`` statement) can use ``strict=__debug__`` instead. Add A Similar Feature to ``map`` -------------------------------- This PEP does not propose any changes to ``map``, since the use of ``map`` with multiple iterable arguments is quite rare. However, this PEP's ruling shall serve as precedent such a future discussion (should it occur). If rejected, the feature is realistically not worth pursuing. If accepted, such a change to ``map`` should not require its own PEP (though, like all enhancements, its usefulness should be carefully considered). For consistency, it should follow same API and semantics debated here for ``zip``. Do Nothing ---------- This option is perhaps the least attractive. Silently truncated data is a particularly nasty class of bug, and hand-writing a robust solution that gets this right `isn't trivial `_. The real-world motivating examples from Python's own standard library are evidence that it's *very* easy to fall into the sort of trap that this feature aims to avoid. References ========== Examples -------- .. note:: This listing is not exhaustive. - https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/_pydecimal.py#L3394 - https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/_pydecimal.py#L3418 - https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/_pydecimal.py#L3435 - https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/ast.py#L94-L95 - https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/ast.py#L1184 - https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/ast.py#L1275 - https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/ast.py#L1363 - https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/ast.py#L1391 - https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/copy.py#L217 - https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/csv.py#L142 - https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/dis.py#L462 - https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/filecmp.py#L142 - https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/filecmp.py#L143 - https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/inspect.py#L1440 - https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/inspect.py#L2095 - https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/os.py#L510 - https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/plistlib.py#L577 - https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/tarfile.py#L1317 - https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/tarfile.py#L1323 - https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/tarfile.py#L1339 - https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/turtle.py#L3015 - https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/turtle.py#L3071 - https://github.com/python/cpython/blob/27c0d9b54abaa4112d5a317b8aa78b39ad60a808/Lib/turtle.py#L3901 Copyright ========= This document is placed in the public domain or under the CC0-1.0-Universal license, whichever is more permissive. .. Local Variables: mode: indented-text indent-tabs-mode: nil sentence-end-double-space: t fill-column: 70 coding: utf-8 End: