From 11b2de41a7eeb64897494a781274099afcee1102 Mon Sep 17 00:00:00 2001 From: Brandt Bucher Date: Fri, 1 May 2020 10:51:52 -0700 Subject: [PATCH] PEP: Add Optional Length-Checking To zip (#1389) --- pep-0618.rst | 236 +++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 236 insertions(+) create mode 100644 pep-0618.rst diff --git a/pep-0618.rst b/pep-0618.rst new file mode 100644 index 000000000..2fc6dcc75 --- /dev/null +++ b/pep-0618.rst @@ -0,0 +1,236 @@ +PEP: 618 +Title: Add Optional Length-Checking To zip +Version: $Revision$ +Last-Modified: $Date$ +Author: Brandt Bucher +Sponsor: Antoine Pitrou +Status: Draft +Type: Standards Track +Content-Type: text/x-rst +Created: 30-Apr-2020 +Python-Version: 3.9 +Post-History: +Resolution: + + +Abstract +======== + +This PEP proposes adding an optional ``strict`` boolean keyword +argument to the built-in ``zip``. When enabled, a ``ValueError`` is +raised if one of the arguments is exhausted before the others. + + +Motivation +========== + +Many Python users find that most of their ``zip`` usage involves +iterables that *should* be of equal length. Sometimes this invariant +is proven true from the context of the surrounding code, but often the +data being zipped is passed from the caller, sourced separately, or +generated in some fashion. In any of these cases, the default +behavior of ``zip`` means that faulty refactoring or logic errors +could easily result in silently losing data. These bugs are not only +difficult to diagnose, but difficult to even detect at all. + +It is easy to come up with simple cases where this could be a problem. +For example, the following code may work fine when ``items`` is a +sequence, but silently start producing shortened, mismatched results +if ``items`` is refactored by the caller to be a consumable iterator:: + + def apply_calculations(items): + transformed = transform(items) + for x, y in zip(items, transformed): + yield something(x, y) + +There are several other ways in which ``zip`` is commonly used. +Idiomatic tricks are especially susceptible, because they are often +employed by users who lack a complete understanding of how the code +works. One example is unpacking into ``zip`` to lazily "unzip" or +"transpose" nested iterables:: + + >>> x = iter([iter([1, 2, 3]), iter(["one" "two" "three"])]) + >>> xt = list(zip(*x)) + +Another is "chunking" data into equal-sized groups:: + + >>> n = 3 + >>> x = range(n ** 2), + >>> xn = list(zip(*[iter(x)] * n)) + +In the first case, non-rectangular data is usually a logic error. In +the second case, data that is not a multiple of ``n`` is often an +error as well. However, both of these idioms will silently omit the +tail-end items of malformed input. + +Perhaps most convincingly, the current use of ``zip`` in the +standard-library ``ast`` module has created multiple bugs that +`silently drop parts of malformed nodes +`_:: + + >>> from ast import Constant, Dict, literal_eval + >>> nasty_dict = Dict(keys=[Constant("XXX")], values=[]) + >>> literal_eval(nasty_dict) # Like eval('{"XXX": }') + {} + +In fact, the author has counted dozens of other call sites in Python's +standard library and tooling where it would be appropriate to enable +this new feature immediately. + + +Rationale +========= + +Some critics assert that boolean switches are a "code-smell", or go +against Python's design philosophy. However, Python currently +contains several examples of built-in functions with boolean keyword +arguments: + +- ``compile(..., dont_inherit=True)`` +- ``open(..., closefd=False)`` +- ``print(..., flush=True)`` +- ``sorted(..., reverse=True)`` + +Many more exist in the standard library. + +A good rule of thumb is that "mode-switches" which change return types +or significantly alter functionality are indeed an anti-pattern, while +ones which enable or disable complementary checks or functionality are +not. + +The name for this new parameter was taken from the `original message +`_ +suggesting the feature. It received over 100 replies, with nobody +challenging the use of the word "strict". + + +Specification +============= + +When the built-in ``zip`` is called with the keyword-only argument +``strict=True``, the resulting iterator will raise a ``ValueError`` if +the arguments are exhausted at differing lengths. This error will +occur at the point when iteration would normally stop today. + +At most one additional item may be consumed from one of the iterators +when compared to normal ``zip`` usage. + + +Backward Compatibility +====================== + +This change is fully backward-compatible. + + +Reference Implementation +======================== + +The author has `drafted a C implementation +`_. + +An approximate pure-Python translation is:: + + def zip(*iterables, strict=False): + if not iterables: + return + iterators = tuple(iter(iterable) for iterable in iterables) + try: + while True: + items = [] + for iterator in iterators: + items.append(next(iterator)) + yield tuple(items) + except StopIteration: + if not strict: + return + if items: + i = len(items) + 1 + raise ValueError(f"zip() argument {i} is too short") + sentinel = object() + for i, iterator in enumerate(iterators[1:], 2): + if next(iterator, sentinel) is not sentinel: + raise ValueError(f"zip() argument {i} is too long") + + +Rejected Ideas +============== + +Add Additional Flavors Of ``zip`` To ``itertools`` +'''''''''''''''''''''''''''''''''''''''''''''''''' + +Importing a drop-in replacement for a built-in feels too heavy, +especially just to check a tricky condition that should "always" be +true. The goal here is not just to provide a way to catch bugs, but +to also make it easy (even tempting) for a user to enable the check +whenever using ``zip`` at a call site with this property. + +Some have also argued that a new function buried in the standard +library is somehow more "discoverable" than a keyword argument on the +built-in itself. The author does not believe this to be true. + +Another proposed idiom, per-module shadowing of the built-in ``zip`` +with some subtly different variant from ``itertools``, is an +anti-pattern that shouldn't be encouraged. + + +Change The Default Behavior Of ``zip`` +'''''''''''''''''''''''''''''''''''''' + +Support for infinite iterators is generally useful, and there is +nothing "wrong" with the default behavior of ``zip``. Likely, this +backward-incompatible change would break more code than it "fixes". + +``itertools.zip_longest`` already exists to service those cases where +the "extra" tail-end data is still needed. + + +Add A Method Or Alternate Constructor To The ``zip`` Type +''''''''''''''''''''''''''''''''''''''''''''''''''''''''' + +The actual ``zip`` type is an undocumented implementation detail. +Adding additional methods or constructors is really a much larger +change that is not necessary to achieve the stated goal. + + +Raise An ``AssertionError`` Instead Of A ``ValueError`` +''''''''''''''''''''''''''''''''''''''''''''''''''''''' + +There are no built-in functions or types that raise an +``AssertionError`` as part of their API. Further, the `official +documentation +`_ +simply reads (in its entirety): + + Raised when an ``assert`` statement fails. + +Since this feature has nothing to do with Python's ``assert`` +statement, raising an ``AssertionError`` here would be inappropriate. + + +Do Nothing +'''''''''' + +This option is perhaps the least attractive. + +Silently truncated data is a particularly nasty class of bug, and +hand-writing a robust solution that gets this right isn't trivial. The +real-world motivating examples from Python's own standard library are +evidence that it's *very* easy to fall into the sort of trap that this +feature aims to avoid. + + +Copyright +========= + +This document is placed in the public domain or under the +CC0-1.0-Universal license, whichever is more permissive. + + +.. + Local Variables: + mode: indented-text + indent-tabs-mode: nil + sentence-end-double-space: t + fill-column: 70 + coding: utf-8 + End: