Updating PEP to reflect prototype implementation

2001-05-07 19:51:10 +00:00 · 2001-05-07 19:51:10 +00:00 · 125ca2410c
parent 9ba47180ba
commit 125ca2410c
1 changed files with 115 additions and 97 deletions
--- a/pep-0218.txt
+++ b/pep-0218.txt
@@ -1,27 +1,30 @@
 PEP: 218
 Title: Adding a Built-In Set Object Type
 Version: $Revision$
-Author: gvwilson@nevex.com (Greg Wilson)
+Author: gvwilson@ddj.com (Greg Wilson)
 Status: Draft
 Type: Standards Track
-Python-Version: 2.1
+Python-Version: 2.2
 Created: 31-Jul-2000
 Post-History: 


 Introduction

-    This PEP proposes adding sets as a built-in type in Python.
+    This PEP proposes adding a Set module to the standard Python
+    library, and to then make sets a built-in Python type if that
+    module is widely used.  After explaining why sets are desirable,
+    and why the common idiom of using dictionaries in their place is
+    inadequate, we describe how we intend built-in sets to work, and
+    then how the preliminary Set module will behave.  The penultimate
+    section discusses the mutability (or otherwise) of sets and set
+    elements, and the solution which the Set module will implement.
+    The last section then looks at alternatives that were considered,
+    but discarded.


 Rationale

-    One of Python's greatest strengths as a teaching language is its
-    clarity.  Its syntax and object model are so clean, and so simple,
-    that it can serve as "executable pseudocode".  Anything that makes
-    it even better suited for this role will help increase its use in
-    school and college courses.
-
    Sets are a fundamental mathematical structure, and are very
    commonly used in algorithm specifications.  They are much less
    frequently used in implementations, even when they are the "right"
@@ -42,19 +45,21 @@ Rationale
    dictionaries containing key/value pairs.


-Proposal
+Long-Term Proposal

-    We propose adding a set type to Python.  This type will be an
-    unordered collection of unique values, just as a dictionary is an
-    unordered collection of key/value pairs.  Constant sets will be
-    represented using the usual mathematical notation, so that
-    "{1, 2, 3}" will be a set of three integers.
+    The long-term goal of this PEP is to add a built-in set type to
+    Python.  This type will be an unordered collection of unique
+    values, just as a dictionary is an unordered collection of
+    key/value pairs.  Constant sets will be represented using the
+    usual mathematical notation, so that "{1, 2, 3}" will be a set of
+    three integers.

-    In order to avoid ambiguity, the empty set will be written "{,}",
+    In order to avoid ambiguity, the empty set will be written "{-}",
    rather than "{}" (which is already used to represent empty
    dictionaries).  We feel that this notation is as reasonable as the
    use of "(3,)" to represent single-element tuples; a more radical
-    strategy is discussed in the "Alternatives" section.
+    strategy is discussed in the "Alternatives" section.  We also find
+    "{-}" more readable than the earlier proposal "{,}".

    Iteration and comprehension will be implemented in the obvious
    ways, so that:
@@ -66,106 +71,119 @@ Proposal
        {x**2 for x in S} 

    will produce a set containing the squares of all elements in S.
+    Membership will be tested using "in" and "not in", and basic set
+    operations will be implemented by a mixture of overloaded
+    operators:

-    Membership will be tested using "in" and "not in".
+        |               union
+        &               intersection
+        ^               symmetric difference
+        -               asymmetric difference

-    The binary operators '|', '&', '-', and "^" will implement set
-    union, intersection, difference, and symmetric difference.  Their
-    in-place equivalents will have the obvious semantics.  (We feel
-    that it is more sensible to overload the bitwise operators '|' and
-    '&', rather than the arithmetic operators '+' and "*', because
-    there is no arithmetic equivalent of '^'.)
+    and methods:

-    The method "add" will add an element to a set.  This is different
-    from set union, as the following example shows:
+        S.add(x)        Add "x" to the set.

-        >>> {1, 2, 3} | {4, 5, 6}
-        {1, 2, 3, 4, 5, 6}
+        S.update(s)     Add all elements of sequence "s" to the set.

-        >>> {1, 2, 3}.add({4, 5, 6})
-        {1, 2, 3, {4, 5, 6}}
+        S.remove(x)     Remove "x" from the set.  If "x" is not
+                        present, this method raises a LookupError
+                        exception.

-    Note that we expect that items can also be added to sets using
-    in-place union of temporaries, i.e. "S |= {x}" instead of
-    "S.add(x)".
+        S.discard(x)    Remove "x" from the set if it is present, or
+                        do nothing if it is not.

-    Elements will be deleted from sets using a "remove" method, or
-    using "del":
+        S.popitem()     Remove and return an arbitrary element,
+                        raising a LookupError if the set is
+                        empty.

-        >>> S = {1, 2, 3}
-        >>> S.remove(3)
-        >>> S
-        {1, 2}
-        >>> del S[1]
-        >>> S
-        {2}
+    and one new built-in conversion function:

-    The "KeyError" exception will be raised if an attempt is made to
-    remove an element which is not in a set.  This definition of "del"
-    is consistent with that used for dictionaries:
+        set(x)          Create a set containing the elements of the
+                        collection "x".

-        >>> D = {1:2, 3:4}
-        >>> del D[1]
-        >>> D
-        {3:4}
+    Notes:

-    A new method "dict.keyset" will return the keys of a dictionary as
-    a set.  A corresponding method "dict.valueset" will return the
-    dictionary's values as a set.
+    1. We propose using the bitwise operators "|" and "&" for union
+       and intersection.  While "+" for union would be intuitive, "*"
+       for intersection is not (very few of the people asked guessed
+       correctly what it did).

-    A built-in converter "set()" will convert any sequence type to a
-    set; converters such as "list()" and "tuple()" will be extended to
-    handle sets as input.
+    2. We considered using "+" to add elements to a set, rather than
+       "add".  However, Guido van Rossum pointed out that "+" is
+       symmetric for other built-in types (although "*" is not).  Use
+       of "add" will also avoid confusion between that operation and
+       set union.
+
+    3. Sets raise "LookupError" exceptions, rather than "KeyError" or
+       "ValueError", because set elements are neither keys nor values.
+
+Short-Term Proposal
+
+    In order to determine whether there is enough demand for sets to
+    justify making them a built-in type, and to give users a chance to
+    try out the semantics we propose for sets, our short-term proposal
+    is to add a "Set" class to the standard Python library.  This
+    class will have the operators and methods described above; it will
+    also have named methods corresponding to all of the operations: a
+    "union" method for "|", and a "union_update" method for "|=", and
+    so on.
+
+    This class will use a dictionary internally to contain set values.
+    In order to avoid having to duplicate values (e.g. for iteration
+    through the set), the class will rely on the iterators which are
+    scheduled to appear in Python 2.2.
+
+    Tim Peters believes that the class's constructor should take a
+    single sequence as an argument, and populate the set with that
+    sequence's elements.  His argument is that in most cases,
+    programmers will be creating sets from pre-existing sequences, so
+    that common case should be the easiest to write.  However, this
+    would require users to remember an extra set of parentheses when
+    initializing a set with known values:
+
+    >>> Set((1, 2, 3, 4))       # case 1
+
+    On the other hand, feedback from a small number of novice Python
+    users (all of whom were very experienced with other languages)
+    indicates that people will find a "parenthesis-free" syntax more
+    natural:
+
+    >>> Set(1, 2, 3, 4)         # case 2
+
+    On the other, other hand, if Python does adopt a dictionary-like
+    notation for sets in the future, then case 2 will become
+    redundant.  We have therefore adopted the first strategy, in which
+    the initializer takes a single sequence argument.


-Open Issues
+Mutability

-    One major issue remains to be resolved: will sets be allowed to
-    contain mutable values, or will their values be required to
-    immutable (as dictionary keys are)?  The disadvantages of allowing
-    only immutable values are clear --- if nothing else, it would
-    prevent users from creating sets of sets.
+    The most difficult question to resolve in this proposal was
+    whether sets ought to be able to contain mutable elements.  A
+    dictionary's keys must be immutable in order to support fast,
+    reliable lookup.  While it would be easy to require set elements
+    to be immutable, this would preclude sets of sets (which are
+    widely used in graph algorithms and other applications).

-    However, no efficient implementation of sets of mutable values has
-    yet been suggested.  Hashing approaches will obviously fail (which
-    is why mutable values are not allowed to be dictionary keys).
-    Even simple-minded implementations, such as storing the set's
-    values in a list, can give incorrect results, as the following
-    example shows:
-
-        >>> a = [1, 2]
-        >>> b = [3, 4]
-        >>> S = [a, b]
-        >>> a[0:2] = [3, 4]
-        >>> S
-        [[3, 4], [3, 4]]
-
-    One way to solve this problem would be to add observer/observable
-    functionality to every data structure in Python, so that
-    structures would know to update themselves when their contained
-    values mutated.  This is clearly impractical given the current
-    code base, and the performance penalties (in both memory and
-    execution time) would probably be unacceptable anyway.
+    At Tim Peters' suggestion, we will implement the following
+    compromise.  A set may only contain immutable elements, but is
+    itself mutable *until* its hash code is calculated.  As soon as
+    that happens, the set is "frozen", i.e. becomes immutable.  Thus,
+    a set may be used as a dictionary key, or as a set element, but
+    cannot be updated after this is done.  Peters reports that this
+    behavior rarely causes problems in practice.


 Alternatives

-    A more conservative alternative to this proposal would be to add a
-    new built-in class "Set", rather than adding new syntax for direct
-    expression of sets.  On the positive side, this would not require
-    any changes to the Python language definition.  On the negative
-    side, people would then not be able to write Python programs using
-    the same notation as they would use on a whiteboard.  We feel that
-    the more Python supports standard pre-existing notation, the
-    greater the chances of it being adopted as a teaching language.
-
-    A radical alternative to the (admittedly clumsy) notation "{,}" is
-    to re-define "{}" to be the empty collection, rather than the
-    empty dictionary.  Operations which made this object non-empty
-    would silently convert it to either a dictionary or a set; it
-    would then retain that type for the rest of its existence.  This
-    idea was rejected because of its potential impact on existing
-    Python programs.  A similar proposal to modify "dict.keys" and
+    An alternative to the notation "{-}" for the empty set would be to
+    re-define "{}" to be the empty collection, rather than the empty
+    dictionary.  Operations which made this object non-empty would
+    silently convert it to either a dictionary or a set; it would then
+    retain that type for the rest of its existence.  This idea was
+    rejected because of its potential impact on existing Python
+    programs.  A similar proposal to modify "dict.keys" and
    "dict.values" to return sets, rather than lists, was rejected for
    the same reasons.