Updating PEP to reflect prototype implementation

This commit is contained in:
Greg Wilson 2001-05-07 19:51:10 +00:00
parent 9ba47180ba
commit 125ca2410c
1 changed files with 115 additions and 97 deletions

View File

@ -1,27 +1,30 @@
PEP: 218
Title: Adding a Built-In Set Object Type
Version: $Revision$
Author: gvwilson@nevex.com (Greg Wilson)
Author: gvwilson@ddj.com (Greg Wilson)
Status: Draft
Type: Standards Track
Python-Version: 2.1
Python-Version: 2.2
Created: 31-Jul-2000
Post-History:
Introduction
This PEP proposes adding sets as a built-in type in Python.
This PEP proposes adding a Set module to the standard Python
library, and to then make sets a built-in Python type if that
module is widely used. After explaining why sets are desirable,
and why the common idiom of using dictionaries in their place is
inadequate, we describe how we intend built-in sets to work, and
then how the preliminary Set module will behave. The penultimate
section discusses the mutability (or otherwise) of sets and set
elements, and the solution which the Set module will implement.
The last section then looks at alternatives that were considered,
but discarded.
Rationale
One of Python's greatest strengths as a teaching language is its
clarity. Its syntax and object model are so clean, and so simple,
that it can serve as "executable pseudocode". Anything that makes
it even better suited for this role will help increase its use in
school and college courses.
Sets are a fundamental mathematical structure, and are very
commonly used in algorithm specifications. They are much less
frequently used in implementations, even when they are the "right"
@ -42,19 +45,21 @@ Rationale
dictionaries containing key/value pairs.
Proposal
Long-Term Proposal
We propose adding a set type to Python. This type will be an
unordered collection of unique values, just as a dictionary is an
unordered collection of key/value pairs. Constant sets will be
represented using the usual mathematical notation, so that
"{1, 2, 3}" will be a set of three integers.
The long-term goal of this PEP is to add a built-in set type to
Python. This type will be an unordered collection of unique
values, just as a dictionary is an unordered collection of
key/value pairs. Constant sets will be represented using the
usual mathematical notation, so that "{1, 2, 3}" will be a set of
three integers.
In order to avoid ambiguity, the empty set will be written "{,}",
In order to avoid ambiguity, the empty set will be written "{-}",
rather than "{}" (which is already used to represent empty
dictionaries). We feel that this notation is as reasonable as the
use of "(3,)" to represent single-element tuples; a more radical
strategy is discussed in the "Alternatives" section.
strategy is discussed in the "Alternatives" section, and more
readable than the earlier proposal "{,}".
Iteration and comprehension will be implemented in the obvious
ways, so that:
@ -66,106 +71,119 @@ Proposal
{x**2 for x in S}
will produce a set containing the squares of all elements in S,
Membership will be tested using "in" and "not in", and basic set
operations will be implemented by a mixture of overloaded
operators:
Membership will be tested using "in" and "not in".
| union
& intersection
^ symmetric difference
- asymmetric difference
The binary operators '|', '&', '-', and "^" will implement set
union, intersection, difference, and symmetric difference. Their
in-place equivalents will have the obvious semantics. (We feel
that it is more sensible to overload the bitwise operators '|' and
'&', rather than the arithmetic operators '+' and "*', because
there is no arithmetic equivalent of '^'.)
and methods:
The method "add" will add an element to a set. This is different
from set union, as the following example shows:
S.add(x) Add "x" to the set.
>>> {1, 2, 3} | {4, 5, 6}
{1, 2, 3, 4, 5, 6}
S.update(s) Add all elements of sequence "s" to the set.
>>> {1, 2, 3}.add({4, 5, 6})
{1, 2, 3, {4, 5, 6}}
S.remove(x) Remove "x" from the set. If "x" is not
present, this method raises a LookupError
exception.
Note that we expect that items can also be added to sets using
in-place union of temporaries, i.e. "S |= {x}" instead of
"S.add(x)".
S.discard(x) Remove "x" from the set if it is present, or
do nothing if it is not.
Elements will be deleted from sets using a "remove" method, or
using "del":
S.popitem() Remove and return an arbitrary element,
raising a LookupError if the element is not
present.
>>> S = {1, 2, 3}
>>> S.remove(3)
>>> S
{1, 2}
>>> del S[1]
>>> S
{2}
and one new built-in conversion function:
The "KeyError" exception will be raised if an attempt is made to
remove an element which is not in a set. This definition of "del"
is consistent with that used for dictionaries:
set(x) Create a set containing the elements of the
collection "x".
>>> D = {1:2, 3:4}
>>> del D[1]
>>> D
{3:4}
Notes:
A new method "dict.keyset" will return the keys of a dictionary as
a set. A corresponding method "dict.valueset" will return the
dictionary's values as a set.
1. We propose using the bitwise operators "|&" for intersection
and union. While "+" for union would be intuitive, "*" for
intersection is not (very few of the people asked guessed what
it did correctly).
A built-in converter "set()" will convert any sequence type to a
set; converters such as "list()" and "tuple()" will be extended to
handle sets as input.
2. We considered using "+" to add elements to a set, rather than
"add". However, Guido van Rossum pointed out that "+" is
symmetric for other built-in types (although "*" is not). Use
of "add" will also avoid confusion between that operation and
set union.
3. Sets raise "LookupError" exceptions, rather than "KeyError" or
"ValueError", because set elements are neither keys nor values.
Short-Term Proposal
In order to determine whether there is enough demand for sets to
justify making them a built-in type, and to give users a chance to
try out the semantics we propose for sets, our short-term proposal
is to add a "Set" class to the standard Python library. This
class will have the operators and methods described above; it will
also have named methods corresponding to all of the operations: a
"union" method for "|", and a "union_update" method for "|=", and
so on.
This class will use a dictionary internally to contain set values.
In order to avoid having to duplicate values (e.g. for iteration
through the set), the class will rely on the iterators which are
scheduled to appear in Python 2.2.
Tim Peters believes that the class's constructor should take a
single sequence as an argument, and populate the set with that
sequence's elements. His argument is that in most cases,
programmers will be created sets from pre-existing sequences, so
that common case should be usual. However, this would require
users to remember an extra set of parentheses when initializing a
set with known values:
>>> Set((1, 2, 3, 4)) # case 1
On the other hand, feedback from a small number of novice Python
users (all of whom were very experienced with other languages)
indicates that people will find a "parenthesis-free" syntax more
natural:
>>> Set(1, 2, 3, 4) # case 2
On the other, other hand, if Python does adopt a dictionary-like
notation for sets in the future, then case 2 will become
redundant. We have therefore adopted the first strategy, in which
the initializer takes a single sequence argument.
Open Issues
Mutability
One major issue remains to be resolved: will sets be allowed to
contain mutable values, or will their values be required to
immutable (as dictionary keys are)? The disadvantages of allowing
only immutable values are clear --- if nothing else, it would
prevent users from creating sets of sets.
The most difficult question to resolve in this proposal was
whether sets ought to be able to contain mutable elements. A
dictionary's keys must be immutable in order to support fast,
reliable lookup. While it would be easy to require set elements
to be immutable, this would preclude sets of sets (which are
widely used in graph algorithms and other applications).
However, no efficient implementation of sets of mutable values has
yet been suggested. Hashing approaches will obviously fail (which
is why mutable values are not allowed to be dictionary keys).
Even simple-minded implementations, such as storing the set's
values in a list, can give incorrect results, as the following
example shows:
>>> a = [1, 2]
>>> b = [3, 4]
>>> S = [a, b]
>>> a[0:2] = [3, 4]
>>> S
[[3, 4], [3, 4]]
One way to solve this problem would be to add observer/observable
functionality to every data structure in Python, so that
structures would know to update themselves when their contained
values mutated. This is clearly impractical given the current
code base, and the performance penalties (in both memory and
execution time) would probably be unacceptable anyway.
At Tim Peters' suggestion, we will implement the following
compromise. A set may only contain immutable elements, but is
itself mutable *until* its hash code is calculated. As soon as
that happens, the set is "frozen", i.e. becomes immutable. Thus,
a set may be used as a dictionary key, or as a set element, but
cannot be updated after this is done. Peters reports that this
behavior rarely causes problems in practice.
Alternatives
A more conservative alternative to this proposal would be to add a
new built-in class "Set", rather than adding new syntax for direct
expression of sets. On the positive side, this would not require
any changes to the Python language definition. On the negative
side, people would then not be able to write Python programs using
the same notation as they would use on a whiteboard. We feel that
the more Python supports standard pre-existing notation, the
greater the chances of it being adopted as a teaching language.
A radical alternative to the (admittedly clumsy) notation "{,}" is
to re-define "{}" to be the empty collection, rather than the
empty dictionary. Operations which made this object non-empty
would silently convert it to either a dictionary or a set; it
would then retain that type for the rest of its existence. This
idea was rejected because of its potential impact on existing
Python programs. A similar proposal to modify "dict.keys" and
An alternative to the notation "{-}" for the empty set would be to
re-define "{}" to be the empty collection, rather than the empty
dictionary. Operations which made this object non-empty would
silently convert it to either a dictionary or a set; it would then
retain that type for the rest of its existence. This idea was
rejected because of its potential impact on existing Python
programs. A similar proposal to modify "dict.keys" and
"dict.values" to return sets, rather than lists, was rejected for
the same reasons.