PEP: Add Optional Length-Checking To zip (#1389)

This commit is contained in:
Brandt Bucher 2020-05-01 10:51:52 -07:00 committed by GitHub
parent 08a58eccaa
commit 11b2de41a7
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
1 changed files with 236 additions and 0 deletions

236
pep-0618.rst Normal file
View File

@ -0,0 +1,236 @@
PEP: 618
Title: Add Optional Length-Checking To zip
Version: $Revision$
Last-Modified: $Date$
Author: Brandt Bucher <brandtbucher@gmail.com>
Sponsor: Antoine Pitrou <antoine@python.org>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 30-Apr-2020
Python-Version: 3.9
Post-History:
Resolution:
Abstract
========
This PEP proposes adding an optional ``strict`` boolean keyword
argument to the built-in ``zip``. When enabled, a ``ValueError`` is
raised if one of the arguments is exhausted before the others.
Motivation
==========
Many Python users find that most of their ``zip`` usage involves
iterables that *should* be of equal length. Sometimes this invariant
is proven true from the context of the surrounding code, but often the
data being zipped is passed from the caller, sourced separately, or
generated in some fashion. In any of these cases, the default
behavior of ``zip`` means that faulty refactoring or logic errors
could easily result in silently losing data. These bugs are not only
difficult to diagnose, but difficult to even detect at all.
It is easy to come up with simple cases where this could be a problem.
For example, the following code may work fine when ``items`` is a
sequence, but silently start producing shortened, mismatched results
if ``items`` is refactored by the caller to be a consumable iterator::
def apply_calculations(items):
transformed = transform(items)
for x, y in zip(items, transformed):
yield something(x, y)
There are several other ways in which ``zip`` is commonly used.
Idiomatic tricks are especially susceptible, because they are often
employed by users who lack a complete understanding of how the code
works. One example is unpacking into ``zip`` to lazily "unzip" or
"transpose" nested iterables::
>>> x = iter([iter([1, 2, 3]), iter(["one" "two" "three"])])
>>> xt = list(zip(*x))
Another is "chunking" data into equal-sized groups::
>>> n = 3
>>> x = range(n ** 2),
>>> xn = list(zip(*[iter(x)] * n))
In the first case, non-rectangular data is usually a logic error. In
the second case, data that is not a multiple of ``n`` is often an
error as well. However, both of these idioms will silently omit the
tail-end items of malformed input.
Perhaps most convincingly, the current use of ``zip`` in the
standard-library ``ast`` module has created multiple bugs that
`silently drop parts of malformed nodes
<https://bugs.python.org/issue40355>`_::
>>> from ast import Constant, Dict, literal_eval
>>> nasty_dict = Dict(keys=[Constant("XXX")], values=[])
>>> literal_eval(nasty_dict) # Like eval('{"XXX": }')
{}
In fact, the author has counted dozens of other call sites in Python's
standard library and tooling where it would be appropriate to enable
this new feature immediately.
Rationale
=========
Some critics assert that boolean switches are a "code-smell", or go
against Python's design philosophy. However, Python currently
contains several examples of built-in functions with boolean keyword
arguments:
- ``compile(..., dont_inherit=True)``
- ``open(..., closefd=False)``
- ``print(..., flush=True)``
- ``sorted(..., reverse=True)``
Many more exist in the standard library.
A good rule of thumb is that "mode-switches" which change return types
or significantly alter functionality are indeed an anti-pattern, while
ones which enable or disable complementary checks or functionality are
not.
The name for this new parameter was taken from the `original message
<https://mail.python.org/archives/list/python-ideas@python.org/message/6GFUADSQ5JTF7W7OGWF7XF2NH2XUTUQM>`_
suggesting the feature. It received over 100 replies, with nobody
challenging the use of the word "strict".
Specification
=============
When the built-in ``zip`` is called with the keyword-only argument
``strict=True``, the resulting iterator will raise a ``ValueError`` if
the arguments are exhausted at differing lengths. This error will
occur at the point when iteration would normally stop today.
At most one additional item may be consumed from one of the iterators
when compared to normal ``zip`` usage.
Backward Compatibility
======================
This change is fully backward-compatible.
Reference Implementation
========================
The author has `drafted a C implementation
<https://github.com/python/cpython/compare/master...brandtbucher:zip-strict>`_.
An approximate pure-Python translation is::
def zip(*iterables, strict=False):
if not iterables:
return
iterators = tuple(iter(iterable) for iterable in iterables)
try:
while True:
items = []
for iterator in iterators:
items.append(next(iterator))
yield tuple(items)
except StopIteration:
if not strict:
return
if items:
i = len(items) + 1
raise ValueError(f"zip() argument {i} is too short")
sentinel = object()
for i, iterator in enumerate(iterators[1:], 2):
if next(iterator, sentinel) is not sentinel:
raise ValueError(f"zip() argument {i} is too long")
Rejected Ideas
==============
Add Additional Flavors Of ``zip`` To ``itertools``
''''''''''''''''''''''''''''''''''''''''''''''''''
Importing a drop-in replacement for a built-in feels too heavy,
especially just to check a tricky condition that should "always" be
true. The goal here is not just to provide a way to catch bugs, but
to also make it easy (even tempting) for a user to enable the check
whenever using ``zip`` at a call site with this property.
Some have also argued that a new function buried in the standard
library is somehow more "discoverable" than a keyword argument on the
built-in itself. The author does not believe this to be true.
Another proposed idiom, per-module shadowing of the built-in ``zip``
with some subtly different variant from ``itertools``, is an
anti-pattern that shouldn't be encouraged.
Change The Default Behavior Of ``zip``
''''''''''''''''''''''''''''''''''''''
Support for infinite iterators is generally useful, and there is
nothing "wrong" with the default behavior of ``zip``. Likely, this
backward-incompatible change would break more code than it "fixes".
``itertools.zip_longest`` already exists to service those cases where
the "extra" tail-end data is still needed.
Add A Method Or Alternate Constructor To The ``zip`` Type
'''''''''''''''''''''''''''''''''''''''''''''''''''''''''
The actual ``zip`` type is an undocumented implementation detail.
Adding additional methods or constructors is really a much larger
change that is not necessary to achieve the stated goal.
Raise An ``AssertionError`` Instead Of A ``ValueError``
'''''''''''''''''''''''''''''''''''''''''''''''''''''''
There are no built-in functions or types that raise an
``AssertionError`` as part of their API. Further, the `official
documentation
<https://docs.python.org/3.9/library/exceptions.html?highlight=assertionerror#AssertionError>`_
simply reads (in its entirety):
Raised when an ``assert`` statement fails.
Since this feature has nothing to do with Python's ``assert``
statement, raising an ``AssertionError`` here would be inappropriate.
Do Nothing
''''''''''
This option is perhaps the least attractive.
Silently truncated data is a particularly nasty class of bug, and
hand-writing a robust solution that gets this right isn't trivial. The
real-world motivating examples from Python's own standard library are
evidence that it's *very* easy to fall into the sort of trap that this
feature aims to avoid.
Copyright
=========
This document is placed in the public domain or under the
CC0-1.0-Universal license, whichever is more permissive.
..
Local Variables:
mode: indented-text
indent-tabs-mode: nil
sentence-end-double-space: t
fill-column: 70
coding: utf-8
End: