PEP: 218
Title: Adding a Built-In Set Object Type
Version: $Revision$
Author: gvwilson@nevex.com (Greg Wilson)
Status: Draft
Type: Standards Track
Python-Version: 2.1
Created: 31-Jul-2000
Post-History: 


Introduction

    This PEP proposes adding sets as a built-in type in Python.


Rationale

    One of Python's greatest strengths as a teaching language is its
    clarity.  Its syntax and object model are so clean, and so simple,
    that it can serve as "executable pseudocode".  Anything that makes
    it even better suited for this role will help increase its use in
    school and college courses.

    Sets are a fundamental mathematical structure, and are very
    commonly used in algorithm specifications.  They are much less
    frequently used in implementations, even when they are the "right"
    structure.  Programmers frequently use lists instead, even when
    the ordering information in lists is irrelevant, and by-value
    lookups are frequent.  (Most medium-sized C programs contain a
    depressing number of start-to-end searches through malloc'd
    vectors to determine whether particular items are present or
    not...)

    Programmers are often told that they can implement sets as
    dictionaries with "don't care" values.  Items can be added to
    these "sets" by assigning the "don't care" value to them;
    membership can be tested using "dict.has_key"; and items can be
    deleted using "del".  However, the other main operations on sets
    (union, intersection, and difference) are not directly supported
    by this representation, since their meaning is ambiguous for
    dictionaries containing key/value pairs.
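
    The dictionary idiom described above can be sketched as follows.
    This is only an illustration: the names "members" and
    "dict_union" are hypothetical, and membership is tested with
    "in", which later Python versions use in place of
    "dict.has_key".

```python
# Emulate a set with a dictionary whose values are never inspected.
members = {}
members["apple"] = None          # add an item by assigning a "don't care" value
members["pear"] = None
has_apple = "apple" in members   # membership test
del members["pear"]              # delete an item

# Union is not provided by the dictionary itself and must be hand-written.
def dict_union(a, b):
    """Return a new "don't care" dictionary holding the keys of a and b."""
    result = {}
    for key in a:
        result[key] = None
    for key in b:
        result[key] = None
    return result

both = dict_union(members, {"plum": None})
```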


Proposal

    We propose adding a set type to Python.  This type will be an
    unordered collection of unique values, just as a dictionary is an
    unordered collection of key/value pairs.  Constant sets will be
    represented using the usual mathematical notation, so that
    "{1, 2, 3}" will be a set of three integers.

    In order to avoid ambiguity, the empty set will be written "{,}",
    rather than "{}" (which is already used to represent empty
    dictionaries).  We feel that this notation is as reasonable as the
    use of "(3,)" to represent single-element tuples; a more radical
    strategy is discussed in the "Alternatives" section.

    Iteration and comprehension will be implemented in the obvious
    ways, so that:

        for x in S:

    will step through the elements of S in arbitrary order, while:

        {x**2 for x in S}

    will produce a set containing the squares of all elements in S.

    Membership will be tested using "in" and "not in".
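
    As a small sketch of the behaviour just described, assuming an
    interpreter in which the proposed brace notation and set
    comprehensions are available (the variable names are
    illustrative only):

```python
S = {-2, -1, 0, 1, 2}

# Iteration visits every element exactly once, in arbitrary order.
total = 0
for x in S:
    total += x

# A set comprehension collapses duplicate results: (-2)**2 == 2**2.
squares = {x**2 for x in S}

# Membership tests.
has_zero = 0 in S
lacks_three = 3 not in S
```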

    The binary operators '|', '&', '-', and '^' will implement set
    union, intersection, difference, and symmetric difference.  Their
    in-place equivalents will have the obvious semantics.  (We feel
    that it is more sensible to overload the bitwise operators '|' and
    '&', rather than the arithmetic operators '+' and '*', because
    there is no arithmetic equivalent of '^'.)
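
    The four operators and one in-place form might behave as in the
    sketch below, assuming an interpreter that accepts the proposed
    notation.  The variable names are illustrative only.

```python
a = {1, 2, 3, 4}
b = {3, 4, 5}

union = a | b                  # elements in either set
intersection = a & b           # elements in both sets
difference = a - b             # elements of a that are not in b
symmetric_difference = a ^ b   # elements in exactly one of the sets

# In-place form: update c rather than binding a new name.
c = {1, 2}
c |= {2, 3}
```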

    The method "add" will add an element to a set.  This is different
    from set union, as the following example shows:

        >>> {1, 2, 3} | {4, 5, 6}
        {1, 2, 3, 4, 5, 6}

        >>> {1, 2, 3}.add({4, 5, 6})
        {1, 2, 3, {4, 5, 6}}
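
    A runnable approximation of this example, assuming an interpreter
    that accepts the proposed notation.  A tuple stands in for the
    inner set, since whether sets may contain sets is an open issue,
    and the return value of "add" is not relied upon:

```python
merged = {1, 2, 3} | {4, 5, 6}   # union merges the elements of both operands

nested = {1, 2, 3}
nested.add((4, 5, 6))            # add inserts its argument as a single element

merged_size = len(merged)        # six separate integers
nested_size = len(nested)        # three integers plus one tuple
```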

    We expect that items can also be added to sets using in-place
    union with a temporary set, i.e. "S |= {x}" instead of
    "S.add(x)".

    Elements will be deleted from sets using a "remove" method, or
    using "del":

        >>> S = {1, 2, 3}
        >>> S.remove(3)
        >>> S
        {1, 2}
        >>> del S[1]
        >>> S
        {2}

    The "KeyError" exception will be raised if an attempt is made to
    remove an element which is not in a set.  This definition of "del"
    is consistent with that used for dictionaries:

        >>> D = {1:2, 3:4}
        >>> del D[1]
        >>> D
        {3: 4}
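
    The removal semantics described above can be sketched with a
    small dictionary-backed class.  "SketchSet" is a hypothetical
    stand-in written for illustration, not the proposed
    implementation:

```python
class SketchSet:
    """Hypothetical stand-in for the proposed set type."""

    def __init__(self, items=()):
        self._items = {}
        for item in items:
            self._items[item] = None

    def remove(self, item):
        # Deleting a missing dictionary key raises KeyError,
        # which is the behaviour proposed for sets.
        del self._items[item]

    def __delitem__(self, item):
        # "del S[x]" removes the element x, mirroring "del D[key]".
        self.remove(item)

    def __contains__(self, item):
        return item in self._items

    def __len__(self):
        return len(self._items)


S = SketchSet([1, 2, 3])
S.remove(3)
del S[1]

try:
    S.remove(42)
    missing_raised = False
except KeyError:
    missing_raised = True
```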

    A new method "dict.keyset" will return the keys of a dictionary as
    a set.  A corresponding method "dict.valueset" will return the
    dictionary's values as a set.

    A built-in converter "set()" will convert any sequence type to a
    set; converters such as "list()" and "tuple()" will be extended to
    handle sets as input.
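
    A rough sketch of these conversions, assuming an interpreter that
    provides the proposed "set()" converter.  The functions "keyset"
    and "valueset" are hypothetical stand-ins for the proposed
    dictionary methods, which are not assumed to exist:

```python
def keyset(d):
    """Hypothetical stand-in for the proposed dict.keyset method."""
    return set(d.keys())

def valueset(d):
    """Hypothetical stand-in for the proposed dict.valueset method."""
    return set(d.values())

D = {1: 2, 3: 4, 5: 4}
keys = keyset(D)        # three distinct keys
values = valueset(D)    # the duplicated value 4 appears only once

from_list = set([3, 1, 3, 2])            # a sequence converted to a set
back_to_list = sorted(list(from_list))   # sorted only to fix the order
as_tuple = tuple(from_list)              # element order is arbitrary
```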


Open Issues

    One major issue remains to be resolved: will sets be allowed to
    contain mutable values, or will their values be required to be
    immutable (as dictionary keys are)?  The disadvantages of allowing
    only immutable values are clear: if nothing else, it would
    prevent users from creating sets of sets.

    However, no efficient implementation of sets of mutable values has
    yet been suggested.  Hashing approaches fail because an element's
    hash value can change when the element is mutated after insertion
    (which is why mutable values are not allowed to be dictionary
    keys).  Even simple-minded implementations, such as storing the
    set's values in a list, can give incorrect results.  In the
    following example, mutating "a" after insertion leaves S holding
    two equal values:

        >>> a = [1, 2]
        >>> b = [3, 4]
        >>> S = [a, b]
        >>> a[0:2] = [3, 4]
        >>> S
        [[3, 4], [3, 4]]
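
    The same failure can be reproduced with a slightly fuller sketch.
    The class "ListSet" is hypothetical: it is a naive list-backed
    set that checks for duplicates only at insertion time.

```python
class ListSet:
    """Hypothetical list-backed set, for illustration only."""

    def __init__(self):
        self.items = []

    def add(self, value):
        # Uniqueness is checked only when the value is inserted.
        if value not in self.items:
            self.items.append(value)


a = [1, 2]
b = [3, 4]
S = ListSet()
S.add(a)
S.add(b)

a[0:2] = [3, 4]   # mutate a after it has been stored

# S now holds two equal values, so it is no longer a valid set.
has_duplicates = S.items[0] == S.items[1]
```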

    One way to solve this problem would be to add observer/observable
    functionality to every data structure in Python, so that
    structures would know to update themselves when their contained
    values mutated.  This is clearly impractical given the current
    code base, and the performance penalties (in both memory and
    execution time) would probably be unacceptable anyway.


Alternatives

    A more conservative alternative to this proposal would be to add a
    new built-in class "Set", rather than adding new syntax for direct
    expression of sets.  On the positive side, this would not require
    any changes to the Python language definition.  On the negative
    side, people would then not be able to write Python programs using
    the same notation as they would use on a whiteboard.  We feel that
    the more Python supports standard pre-existing notation, the
    greater the chances of it being adopted as a teaching language.

    A radical alternative to the (admittedly clumsy) notation "{,}" is
    to re-define "{}" to be the empty collection, rather than the
    empty dictionary.  Operations which made this object non-empty
    would silently convert it to either a dictionary or a set; it
    would then retain that type for the rest of its existence.  This
    idea was rejected because of its potential impact on existing
    Python programs.  A similar proposal to modify "dict.keys" and
    "dict.values" to return sets, rather than lists, was rejected for
    the same reasons.


Copyright

    This document has been placed in the Public Domain.


Local Variables:
mode: indented-text
indent-tabs-mode: nil
End: