diff --git a/pep-0472.txt b/pep-0472.txt new file mode 100644 index 000000000..848363835 --- /dev/null +++ b/pep-0472.txt @@ -0,0 +1,653 @@ +PEP: 472 +Title: Support for indexing with keyword arguments +Version: $Revision$ +Last-Modified: $Date$ +Author: Stefano Borini, Joseph Martinot-Lagarde +Discussion-To: python-ideas@python.org +Status: Draft +Type: Standards Track +Content-Type: text/x-rst +Created: 24-Jun-2014 +Python-Version: 3.6 +Post-History: 02-Jul-2014 + +Abstract +======== + +This PEP proposes an extension of the indexing operation to support keyword +arguments. Notations in the form ``a[K=3,R=2]`` would become legal syntax. +For future-proofing considerations, ``a[1:2, K=3, R=4]`` are considered and +may be allowed as well, depending on the choice for implementation. In addition +to a change in the parser, the index protocol (``__getitem__``, ``__setitem__`` +and ``__delitem__``) will also potentially require adaptation. + +Motivation +========== + +The indexing syntax carries a strong semantic content, differentiating it from +a method call: it implies referring to a subset of data. We believe this +semantic association to be important, and wish to expand the strategies allowed +to refer to this data. + +As a general observation, the number of indices needed by an indexing operation +depends on the dimensionality of the data: one-dimensional data (e.g. a list) +requires one index (e.g. ``a[3]``), two-dimensional data (e.g. a matrix) requires +two indices (e.g. ``a[2,3]``) and so on. Each index is a selector along one of the +axes of the dimensionality, and the position in the index tuple is the +metainformation needed to associate each index to the corresponding axis. 
+ +The current Python syntax focuses exclusively on position to express the +association to the axes, and also contains syntactic sugar to refer to +non-punctiform selection (slices) + +:: + + >>> a[3] # returns the fourth element of a + >>> a[1:10:2] # slice notation (extract a non-trivial data subset) + >>> a[3,2] # multiple indexes (for multidimensional arrays) + +The additional notation proposed in this PEP would allow notations involving +keyword arguments in the indexing operation, e.g. + +:: + + >>> a[K=3, R=2] + +which would allow referring to axes by conventional names. + +One must additionally consider the extended form that allows both positional +and keyword specification + +:: + + >>> a[3,R=3,K=4] + +This PEP will explore different strategies to enable the use of these notations. + +Use cases +========= + +The following practical use cases present two broad categories of usage of a +keyworded specification: indexing and contextual options. For indexing: + +1. To provide a more communicative meaning to the index, preventing e.g. accidental + inversion of indexes + + :: + + >>> gridValues[x=3, y=5, z=8] + >>> rain[time=0:12, location=location] + +2. In some domains, such as computational physics and chemistry, the use of a + notation such as ``Basis[Z=5]`` is a Domain Specific Language notation to represent + a level of accuracy + + :: + + >>> low_accuracy_energy = computeEnergy(molecule, BasisSet[Z=3]) + + In this case, the index operation would return a basis set at the chosen level + of accuracy (represented by the parameter Z). The rationale for using indexing is that + the BasisSet object could be internally represented as a numeric table, where + rows (the "coefficient" axis, hidden from the user in this example) are associated + with individual elements (e.g.
rows 0:5 contain coefficients for element 1, + rows 5:8 coefficients for element 2) and each column is associated with a given + degree of accuracy ("accuracy" or "Z" axis) so that the first column is low + accuracy, the second column is medium accuracy and so on. With that indexing, + the user would obtain another object representing the contents of the column + of the internal table for accuracy level 3. + +Additionally, the keyword specification can be used as an option contextual to +the indexing. Specifically: + +1. A "default" option allows specifying a default return value when the index + is not present + + :: + + >>> lst = [1, 2, 3] + >>> value = lst[5, default=0] # value is 0 + +2. For a sparse dataset, to specify an interpolation strategy + to infer a missing point from e.g. its surrounding data. + + :: + + >>> value = array[1, 3, interpolate=spline_interpolator] + +3. A unit could be specified with the same mechanism + + :: + + >>> value = array[1, 3, unit="degrees"] + +How the notation is interpreted is up to the implementing class. + +Current implementation +====================== + +Currently, the indexing operation is handled by methods ``__getitem__``, +``__setitem__`` and ``__delitem__``. These methods' signatures accept one argument +for the index (with ``__setitem__`` accepting an additional argument for the set +value). In the following, we will analyze ``__getitem__(self, idx)`` exclusively, +with the same considerations implied for the remaining two methods. + +When an indexing operation is performed, ``__getitem__(self, idx)`` is called. +Traditionally, the full content between square brackets is turned into a single +object passed to argument ``idx``: + + - When a single element is passed, e.g. ``a[2]``, ``idx`` will be ``2``. + - When multiple elements are passed, they must be separated by commas: ``a[2, 3]``. + In this case, ``idx`` will be a tuple ``(2, 3)``. With ``a[2, 3, "hello", {}]``, + ``idx`` will be ``(2, 3, "hello", {})``.
+ - A slicing notation, e.g. ``a[2:10]``, will produce a slice object, or a tuple + containing slice objects if multiple values were passed. + +Except for its unique ability to handle slice notation, the indexing operation +has similarities to a plain method call: it acts like one when invoked with +only one element; if the number of elements is greater than one, the ``idx`` +argument behaves like a ``*args``. However, as stated in the Motivation section, +an indexing operation has the strong semantic implication of extraction of a +subset out of a larger set, which is not automatically associated with a regular +method call unless appropriate naming is chosen. Moreover, its different visual +style is important for readability. + +Specifications +============== + +The implementation should try to preserve the current signature for +``__getitem__``, or modify it in a backward-compatible way. We will present +different alternatives, taking into account the possible cases that need +to be addressed + +:: + + C0. a[1]; a[1,2] # Traditional indexing + C1. a[Z=3] + C2. a[Z=3, R=4] + C3. a[1, Z=3] + C4. a[1, Z=3, R=4] + C5. a[1, 2, Z=3] + C6. a[1, 2, Z=3, R=4] + C7. a[1, Z=3, 2, R=4] # Interposed ordering + +Strategy "Strict dictionary" +---------------------------- + +This strategy acknowledges that ``__getitem__`` is special in accepting only +one object, and the nature of that object must be non-ambiguous in its +specification of the axes: it can be either by order, or by name. As a result +of this assumption, in the presence of keyword arguments, the passed entity is a +dictionary and all labels must be specified. + +:: + + C0. a[1]; a[1,2] -> idx = 1; idx = (1, 2) + C1. a[Z=3] -> idx = {"Z": 3} + C2. a[Z=3, R=4] -> idx = {"Z": 3, "R": 4} + C3. a[1, Z=3] -> raise SyntaxError + C4. a[1, Z=3, R=4] -> raise SyntaxError + C5. a[1, 2, Z=3] -> raise SyntaxError + C6. a[1, 2, Z=3, R=4] -> raise SyntaxError + C7.
a[1, Z=3, 2, R=4] -> raise SyntaxError + +Pros +'''' + +- Strong conceptual similarity between the tuple case and the dictionary case. + In the first case, we are specifying a tuple, so we are naturally defining + a plain set of values separated by commas. In the second, we are specifying a + dictionary, so we are specifying a homogeneous set of key/value pairs, as + in ``dict(Z=3, R=4)``; +- Simple and easy to parse on the ``__getitem__`` side: if it gets a tuple, + determine the axes using positioning. If it gets a dictionary, use + the keywords. +- C interface does not need changes. + +Neutral +''''''' + +- Degeneracy of ``a[{"Z": 3, "R": 4}]`` with ``a[Z=3, R=4]`` means the notation + is syntactic sugar. + +Cons +'''' + +- Very strict. +- Destroys ordering of the passed arguments. Preserving the + order would be possible with an OrderedDict as drafted by PEP-468 [#PEP-468]_. +- Does not allow use cases with mixed positional/keyword arguments such as + ``a[1, 2, default=5]``. + +Strategy "mixed dictionary" +--------------------------- + +This strategy relaxes the above constraint to return a dictionary containing +both numbers and strings as keys. + +:: + + C0. a[1]; a[1,2] -> idx = 1; idx = (1, 2) + C1. a[Z=3] -> idx = {"Z": 3} + C2. a[Z=3, R=4] -> idx = {"Z": 3, "R": 4} + C3. a[1, Z=3] -> idx = { 0: 1, "Z": 3} + C4. a[1, Z=3, R=4] -> idx = { 0: 1, "Z": 3, "R": 4} + C5. a[1, 2, Z=3] -> idx = { 0: 1, 1: 2, "Z": 3} + C6. a[1, 2, Z=3, R=4] -> idx = { 0: 1, 1: 2, "Z": 3, "R": 4} + C7. a[1, Z=3, 2, R=4] -> idx = { 0: 1, "Z": 3, 2: 2, "R": 4} + +Pros +'''' +- Opens for mixed cases. + +Cons +'''' +- Destroys ordering information for string keys. We have no way of saying if + ``"Z"`` in C7 was in position 1 or 3. +- Implies switching from a tuple to a dict as soon as one specified index + has a keyword argument. May be confusing to parse. + +Strategy "named tuple" +----------------------- + +Return a named tuple for ``idx`` instead of a tuple. 
Keyword arguments would +obviously have their stated name as key, and positional arguments would have an +underscore followed by their order: + +:: + + C0. a[1]; a[1,2] -> idx = 1; idx = (_0=1, _1=2) + C1. a[Z=3] -> idx = (Z=3) + C2. a[Z=3, R=2] -> idx = (Z=3, R=2) + C3. a[1, Z=3] -> idx = (_0=1, Z=3) + C4. a[1, Z=3, R=2] -> idx = (_0=1, Z=3, R=2) + C5. a[1, 2, Z=3] -> idx = (_0=1, _1=2, Z=3) + C6. a[1, 2, Z=3, R=4] -> (_0=1, _1=2, Z=3, R=4) + C7. a[1, Z=3, 2, R=4] -> (_0=1, Z=3, _1=2, R=4) + or (_0=1, Z=3, _2=2, R=4) + or raise SyntaxError + +The required typename of the namedtuple could be ``Index`` or the name of the +argument in the function definition. The namedtuple keeps the ordering and is easy to +analyse by using the ``_fields`` attribute. It is backward compatible, provided +that C0 with more than one entry now passes a namedtuple instead of a plain +tuple. + +Pros +'''' +- Looks nice. namedtuple transparently replaces tuple and gracefully + degrades to the old behavior. +- Does not require a change in the C interface + +Cons +'''' +- According to some sources [#namedtuple]_ namedtuple is not well developed. + To include it as such an important object would probably require rework + and improvement; +- The namedtuple fields, and thus the type, will have to change according + to the passed arguments. This can be a performance bottleneck, and makes + it impossible to guarantee that two subsequent index accesses get the same + Index class; +- the ``_n`` "magic" fields are a bit unusual, but IPython already uses them + for result history. +- Python currently has no builtin namedtuple. The current one is available + in the "collections" module in the standard library. +- Unlike in a function call, the two notations ``gridValues[x=3, y=5, z=8]`` + and ``gridValues[3,5,8]`` would not gracefully match if the order is modified + at call time (e.g. we ask for ``gridValues[y=5, z=8, x=3]``).
In a function, + we can pre-define argument names so that keyword arguments are properly + matched. Not so in ``__getitem__``, leaving the task for interpreting and + matching to ``__getitem__`` itself. + + +Strategy "New argument contents" +-------------------------------- + +In the current implementation, when many arguments are passed to ``__getitem__``, +they are grouped in a tuple and this tuple is passed to ``__getitem__`` as the +single argument ``idx``. This strategy keeps the current signature, but expands the +range of variability in type and contents of ``idx`` to more complex representations. + +We identify four possible ways to implement this strategy: + +- **P1**: uses a single dictionary for the keyword arguments. +- **P2**: uses individual single-item dictionaries. +- **P3**: similar to **P2**, but replaces single-item dictionaries with a ``(key, value)`` tuple. +- **P4**: similar to **P2**, but uses a special and additional new object: ``keyword()`` + +Some of these possibilities lead to degenerate notations, i.e. indistinguishable +from an already possible representation. Once again, the proposed notation +becomes syntactic sugar for these representations. + +Under this strategy, the old behavior for C0 is unchanged. + +:: + + C0: a[1] -> idx = 1 # integer + a[1,2] -> idx = (1,2) # tuple + +In C1, we can use either a dictionary or a tuple to represent key and value pair +for the specific indexing entry. We need to have a tuple with a tuple in C1 +because otherwise we cannot differentiate ``a["Z", 3]`` from ``a[Z=3]``. + +:: + + C1: a[Z=3] -> idx = {"Z": 3} # P1/P2 dictionary with single key + or idx = (("Z", 3),) # P3 tuple of tuples + or idx = keyword("Z", 3) # P4 keyword object + +As you can see, notation P1/P2 implies that ``a[Z=3]`` and ``a[{"Z": 3}]`` will +call ``__getitem__`` passing the exact same value, and is therefore syntactic +sugar for the latter. Same situation occurs, although with different index, for +P3. 
Using a keyword object as in P4 would remove this degeneracy. + +For the C2 case: + +:: + + C2. a[Z=3, R=4] -> idx = {"Z": 3, "R": 4} # P1 dictionary/ordereddict + or idx = ({"Z": 3}, {"R": 4}) # P2 tuple of two single-key dict + or idx = (("Z", 3), ("R", 4)) # P3 tuple of tuples + or idx = (keyword("Z", 3), + keyword("R", 4) ) # P4 keyword objects + + +P1 naturally maps to the traditional ``**kwargs`` behavior, however it breaks +the convention that two or more entries for the index produce a tuple. P2 +preserves this behavior, and additionally preserves the order. Preserving the +order would also be possible with an OrderedDict as drafted by PEP-468 [#PEP-468]_. + +The remaining cases are here shown: + +:: + + C3. a[1, Z=3] -> idx = (1, {"Z": 3}) # P1/P2 + or idx = (1, ("Z", 3)) # P3 + or idx = (1, keyword("Z", 3)) # P4 + + C4. a[1, Z=3, R=4] -> idx = (1, {"Z": 3, "R": 4}) # P1 + or idx = (1, {"Z": 3}, {"R": 4}) # P2 + or idx = (1, ("Z", 3), ("R", 4)) # P3 + or idx = (1, keyword("Z", 3), + keyword("R", 4)) # P4 + + C5. a[1, 2, Z=3] -> idx = (1, 2, {"Z": 3}) # P1/P2 + or idx = (1, 2, ("Z", 3)) # P3 + or idx = (1, 2, keyword("Z", 3)) # P4 + + C6. a[1, 2, Z=3, R=4] -> idx = (1, 2, {"Z":3, "R": 4}) # P1 + or idx = (1, 2, {"Z": 3}, {"R": 4}) # P2 + or idx = (1, 2, ("Z", 3), ("R", 4)) # P3 + or idx = (1, 2, keyword("Z", 3), + keyword("R", 4)) # P4 + + C7. a[1, Z=3, 2, R=4] -> idx = (1, 2, {"Z": 3, "R": 4}) # P1. Pack the keyword arguments. Ugly. + or raise SyntaxError # P1. Same behavior as in function calls. + or idx = (1, {"Z": 3}, 2, {"R": 4}) # P2 + or idx = (1, ("Z", 3), 2, ("R", 4)) # P3 + or idx = (1, keyword("Z", 3), + 2, keyword("R", 4)) # P4 + +Pros +'''' +- Signature is unchanged; +- P2/P3 can preserve ordering of keyword arguments as specified at indexing, +- P1 needs an OrderedDict, but would destroy interposed ordering if allowed: + all keyword indexes would be dumped into the dictionary; +- Stays within traditional types: tuples and dicts. Evt. 
OrderedDict; +- Some proposed strategies are similar in behavior to a traditional function call; +- The C interface for ``PyObject_GetItem`` and family would remain unchanged. + +Cons +'''' +- Apparently complex and wasteful; +- Degeneracy in notation (e.g. ``a[Z=3]`` and ``a[{"Z":3}]`` are equivalent and + indistinguishable notations at the ``__[get|set|del]item__`` level). + This behavior may or may not be acceptable. +- For P4, an additional object similar in nature to slice() is needed, + but only to disambiguate the above degeneracy. +- The type and layout of ``idx`` seem to change depending on the whims of the caller; +- May be complex to parse what is passed, especially in the case of tuple of tuples; +- P2 creates many single-key dictionaries as members of a tuple. Looks ugly. + P3 would be lighter and easier to use than the tuple of dicts, and still + preserves order (unlike the regular dict), but would result in clumsy + extraction of keywords. + +Strategy "kwargs argument" +--------------------------- + +``__getitem__`` accepts an optional ``**kwargs`` argument which should be keyword only. +``idx`` also becomes optional to support a case where no non-keyword arguments are allowed. +The signature would then be one of + +:: + + __getitem__(self, idx) + __getitem__(self, idx, **kwargs) + __getitem__(self, **kwargs) + +Applied to our cases, this would produce: + +:: + + C0. a[1,2] -> idx=(1,2); kwargs={} + C1. a[Z=3] -> idx=None ; kwargs={"Z":3} + C2. a[Z=3, R=4] -> idx=None ; kwargs={"Z":3, "R":4} + C3. a[1, Z=3] -> idx=1 ; kwargs={"Z":3} + C4. a[1, Z=3, R=4] -> idx=1 ; kwargs={"Z":3, "R":4} + C5. a[1, 2, Z=3] -> idx=(1,2); kwargs={"Z":3} + C6. a[1, 2, Z=3, R=4] -> idx=(1,2); kwargs={"Z":3, "R":4} + C7. a[1, Z=3, 2, R=4] -> raise SyntaxError # in agreement with function behavior + +Empty indexing ``a[]`` of course remains invalid syntax.
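The mapping listed for the "kwargs argument" strategy can be approximated in current Python by calling ``__getitem__`` explicitly, because ``a[Z=3]`` is not valid syntax. The following is only a sketch under that assumption: the ``Recorder`` class is hypothetical and simply reports what it receives.

```python
class Recorder:
    """Hypothetical class: reports what __getitem__ would receive."""

    def __getitem__(self, idx=None, **kwargs):
        # idx holds the positional part (a single object or a tuple);
        # kwargs holds the keyword part, as in a regular function call.
        return idx, kwargs


a = Recorder()

# Traditional indexing still works and yields empty kwargs (case C0).
c0 = a[1, 2]

# The proposed a[Z=3] and a[1, 2, Z=3, R=4] are emulated with explicit calls
# (cases C1 and C6); the bracket syntax itself is an assumption of the PEP.
c1 = a.__getitem__(Z=3)
c6 = a.__getitem__((1, 2), Z=3, R=4)
```

A class whose ``__getitem__`` does not declare ``**kwargs`` would raise ``TypeError`` on the explicit keyword calls, which is the "fails in an obvious way" behavior this strategy relies on.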
+ +Pros +'''' +- Similar to a function call, evolves naturally from it; +- Use of keyword indexing with an object whose ``__getitem__`` + doesn't accept ``**kwargs`` will fail in an obvious way. + That's not the case for the other strategies. + +Cons +'''' +- It doesn't preserve order, unless an OrderedDict is used; +- Forbids C7, but is it really needed? +- Requires a change in the C interface to pass an additional + PyObject for the keyword arguments. + + +C interface +=========== + +As briefly introduced in the previous analysis, the C interface would +potentially have to change to allow the new feature. Specifically, +``PyObject_GetItem`` and related routines would have to accept an additional +``PyObject *kw`` argument for Strategy "kwargs argument". The remaining +strategies would not require a change in the C function signatures, but the +different nature of the passed object would potentially require adaptation. + +Strategy "named tuple" would behave correctly without any change: the class +returned by the factory function in ``collections`` is a subclass of tuple, +meaning that ``PyTuple_*`` functions can handle the resulting object. + +Alternative Solutions +===================== + +In this section, we present alternative solutions that would work around the +missing feature and make the proposed enhancement not worth implementing. + +Use a method +------------ + +One could keep the indexing as is, and use a traditional ``get()`` method for those +cases where basic indexing is not enough. This is a good point, but as already +reported in the introduction, methods have a different semantic weight from +indexing, and you can't use slices directly in methods. Compare e.g. +``a[1:3, Z=2]`` with ``a.get(slice(1,3), Z=2)``. + +The authors however recognize this argument as compelling, and the advantage +in semantic expressivity of a keyword-based indexing may be offset by a rarely +used feature that does not bring enough benefit and may have limited adoption.
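The trade-off of the method-based alternative can be seen in a short sketch. The ``Series`` class and its ``default`` option are hypothetical, chosen only to illustrate that a method accepts keyword arguments today but needs an explicit ``slice()`` object.

```python
class Series:
    """Hypothetical 1-D container used only for illustration."""

    def __init__(self, values):
        self._values = list(values)

    def __getitem__(self, idx):
        # Bracket indexing: slice syntax is available, keywords are not.
        return self._values[idx]

    def get(self, idx, default=None):
        # Method call: keywords are available, slice syntax is not.
        try:
            return self._values[idx]
        except IndexError:
            return default


s = Series([10, 20, 30, 40])
sliced_by_index = s[1:3]               # slice notation inside brackets
sliced_by_method = s.get(slice(1, 3))  # explicit slice() inside a call
fallback = s.get(9, default=0)         # keyword option on the method
```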
+ +Emulate requested behavior by abusing the slice object +------------------------------------------------------ + +This extremely creative method exploits the slice objects' behavior, provided +that one accepts using strings (or instantiating properly named placeholder +objects for the keys), and using ":" instead of "=". + +:: + + >>> a["K":3] + slice('K', 3, None) + >>> a["K":3, "R":4] + (slice('K', 3, None), slice('R', 4, None)) + >>> + +While clearly smart, this approach does not allow easy inspection of the key/value +pair, it's too clever and esoteric, and does not allow passing a slice as in +``a[K=1:10:2]``. + +However, Tim Delaney comments + + "I really do think that ``a[b=c, d=e]`` should just be syntax sugar for + ``a['b':c, 'd':e]``. It's simple to explain, and gives the greatest backwards + compatibility. In particular, libraries that already abused slices in this + way will just continue to work with the new syntax." + +We think this behavior would produce inconvenient results. The library Pandas uses +strings as labels, allowing notation such as + +:: + + >>> a[:, "A":"F"] + +to extract data from column "A" to column "F". Under the above proposal, this notation +would be equally obtained with + +:: + + >>> a[:, A="F"] + +which is weird and collides with the intended meaning of keyword in indexing, that +is, specifying the axis through conventional names rather than positioning. + +Pass a dictionary as an additional index +---------------------------------------- + +:: + + >>> a[1, 2, {"K": 3}] + +This notation, although less elegant, can already be used and achieves similar +results. It's evident that the proposed Strategy "New argument contents" can be +interpreted as syntactic sugar for this notation.
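Because this notation is already legal, it can be tried directly. The sketch below uses a hypothetical ``Table`` class whose ``__getitem__`` treats a trailing dictionary as the "keyword" part of the index; the splitting convention is an assumption made for illustration, not something the PEP prescribes.

```python
class Table:
    """Hypothetical class: splits a trailing dict from positional indexes."""

    def __getitem__(self, idx):
        # Normalize a single index to a one-element tuple.
        if not isinstance(idx, tuple):
            idx = (idx,)
        # Treat a trailing dict as the keyword part of the index.
        if idx and isinstance(idx[-1], dict):
            return idx[:-1], idx[-1]
        return idx, {}


a = Table()
positional, keywords = a[1, 2, {"K": 3}]
plain = a[5]
```

With this convention a caller cannot pass a dictionary as a genuine last positional index, which is one form of the degeneracy discussed for the "New argument contents" strategy.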
+ +Additional Comments +=================== + +Commenters also expressed the following relevant points: + +Relevance of ordering of keyword arguments +------------------------------------------ + +As part of the discussion of this PEP, it's necessary to decide whether the ordering +information of the keyword arguments is important, and whether indexes and keys can +be ordered in an arbitrary way (e.g. ``a[1,Z=3,2,R=4]``). PEP-468 [#PEP-468]_ +tries to address the first point by proposing the use of an OrderedDict; +however, one would be inclined to accept that keyword arguments in indexing are +equivalent to kwargs in function calls, and therefore as of today equally +unordered, and with the same restrictions. + +Need for homogeneity of behavior +-------------------------------- + +Relative to Strategy "New argument contents", a comment from Ian Cordasco +points out that + + "it would be unreasonable for just one method to behave totally + differently from the standard behaviour in Python. It would be confusing for + only ``__getitem__`` (and ostensibly, ``__setitem__``) to take keyword + arguments but instead of turning them into a dictionary, turn them into + individual single-item dictionaries." We agree with his point; however, it must + be pointed out that ``__getitem__`` is already special in some regards when it + comes to passed arguments. + +Chris Angelico also states: + + "it seems very odd to start out by saying "here, let's give indexing the + option to carry keyword args, just like with function calls", and then come + back and say "oh, but unlike function calls, they're inherently ordered and + carried very differently"." Again, we agree on this point. The most + straightforward strategy to keep homogeneity would be Strategy "kwargs + argument", opening to a ``**kwargs`` argument on ``__getitem__``. + +One of the authors (Stefano Borini) thinks that only the "strict dictionary" +strategy is worth implementing.
It is non-ambiguous, simple, does not +force complex parsing, and addresses the problem of referring to axes either +by position or by name. The "options" use case is probably best handled with +a different approach, and may be irrelevant for this PEP. The alternative +"named tuple" is another valid choice. + +Having .get() become obsolete for indexing with default fallback +---------------------------------------------------------------- + +Introducing a "default" keyword could make ``dict.get()`` obsolete, which would be +replaced by ``d["key", default=3]``. Chris Angelico however states: + + "Currently, you need to write ``__getitem__`` (which raises an exception on + finding a problem) plus something else, e.g. ``get()``, which returns a default + instead. By your proposal, both branches would go inside ``__getitem__``, which + means they could share code; but there still need to be two branches." + +Additionally, Chris continues: + + "There'll be an ad-hoc and fairly arbitrary puddle of names (some will go + ``default=``, others will say that's way too long and go ``def=``, except that + that's a keyword so they'll use ``dflt=`` or something...), unless there's a + strong force pushing people to one consistent name.". + +This argument is valid but it's equally valid for any function call, and is +generally fixed by established convention and documentation. + +On degeneracy of notation +------------------------- + +User Drekin commented: "The case of ``a[Z=3]`` and ``a[{"Z": 3}]`` is similar to +current ``a[1, 2]`` and ``a[(1, 2)]``. Even though one may argue that the parentheses +are actually not part of tuple notation but are just needed because of syntax, +it may look as degeneracy of notation when compared to function call: ``f(1, 2)`` +is not the same thing as ``f((1, 2))``.". + +References +========== + +.. [#keyword-1] "keyword-only args in __getitem__" + (http://article.gmane.org/gmane.comp.python.ideas/27584) + +.. 
[#keyword-2] "Accepting keyword arguments for __getitem__" + (https://mail.python.org/pipermail/python-ideas/2014-June/028164.html) + +.. [#keyword-3] "PEP pre-draft: Support for indexing with keyword arguments" + https://mail.python.org/pipermail/python-ideas/2014-July/028250.html + +.. [#namedtuple] "namedtuple is not as good as it should be" + (https://mail.python.org/pipermail/python-ideas/2013-June/021257.html) + +.. [#PEP-468] "Preserving the order of \*\*kwargs in a function." + http://legacy.python.org/dev/peps/pep-0468/ + +Copyright +========= + +This document has been placed in the public domain. + + + +.. + Local Variables: + mode: indented-text + indent-tabs-mode: nil + sentence-end-double-space: t + fill-column: 70 + End: