PEP: 3103
Title: A Switch/Case Statement
Version: $Revision$
Last-Modified: $Date$
Author: guido@python.org (Guido van Rossum)
Status: Draft
Type: Standards Track
Python-Version: 3.0
Content-Type: text/x-rst
Created: 25-Jun-2006
Post-History: never


Abstract
========

Python-dev has recently seen a flurry of discussion on adding a switch
statement.  In this PEP I'm trying to extract my own preferences from
the smorgasbord of proposals, discussing alternatives and explaining
my choices where I can.  I'll also indicate how strongly I feel about
alternatives I discuss.

This PEP should be seen as an alternative to PEP 275.  My views are
somewhat different from those of that PEP's author, but I'm grateful
for the work done in that PEP.


Rationale
=========

A common programming idiom is to consider an expression and do
different things depending on its value.  This is usually done with a
chain of if/elif tests; I'll refer to this form as the "if/elif
chain".  There are two main motivations to want to introduce new
syntax for this idiom:

- It is repetitive: the variable and the test operator, usually '=='
  or 'in', are repeated in each if/elif branch.

- It is inefficient: when an expression matches the last test value
  (or no test value at all) it is compared to each of the preceding
  test values.

Both of these complaints are relatively mild; there isn't a lot of
readability or performance to be gained by writing this differently.
Yet, some kind of switch statement is found in many languages and it
is not unreasonable to expect that its addition to Python will allow
us to write certain code more cleanly and efficiently than before.
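
As a minimal illustration in present-day Python (the function name and
token values are hypothetical, not taken from any proposal), an if/elif
chain repeats the variable and the operator in every branch, and a
value that matches nothing is compared against every test first:

```python
def describe(token):
    # The name 'token' and the '==' or 'in' operator recur in each branch.
    if token == "+":
        return "add"
    elif token == "-":
        return "subtract"
    elif token in ("*", "/"):
        return "multiply or divide"
    else:
        # Reached only after every earlier comparison has failed.
        return "unknown"
```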

There are forms of dispatch that are not suitable for the proposed
switch statement; for example, when the number of cases is not
statically known, or when it is desirable to place the code for
different cases in different classes or files.


Basic Syntax
============

I'm considering several variants of the syntax first proposed in PEP
275 here.  There are lots of other possibilities, but I don't see that
they add anything.

My current preference is alternative 2.

I should note that all alternatives here have the "implicit break"
property: at the end of the suite for a particular case, the control
flow jumps to the end of the whole switch statement.  There is no way
to pass control from one case to another.  This is in contrast to C,
where an explicit 'break' statement is required to prevent falling
through to the next case.

In all alternatives, the else-suite is optional.  It is more Pythonic
to use 'else' here rather than introducing a new reserved word,
'default', as in C.

Semantics are discussed in the next top-level section.

Alternative 1
-------------

This is the preferred form in PEP 275::

    switch EXPR:
        case EXPR:
            SUITE
        case EXPR:
            SUITE
        ...
        else:
            SUITE

The main downside is that the suites where all the action is are
indented two levels deep.

Alternative 2
-------------

This is Fredrik Lundh's preferred form; it differs by not indenting
the cases::

    switch EXPR:
    case EXPR:
        SUITE
    case EXPR:
        SUITE
    ....
    else:
        SUITE

Alternative 3
-------------

This is the same as alternative 2 but leaves out the colon after the
switch::

    switch EXPR
    case EXPR:
        SUITE
    case EXPR:
        SUITE
    ....
    else:
        SUITE

The hope of this alternative is that it will upset the auto-indent
logic of the average Python-aware text editor less.  But it looks
strange to me.

Alternative 4
-------------

This leaves out the 'case' keyword on the basis that it is redundant::

    switch EXPR:
        EXPR:
            SUITE
        EXPR:
            SUITE
        ...
        else:
            SUITE

Unfortunately now we are forced to indent the case expressions,
because otherwise (at least in the absence of an 'else' keyword) the
parser would have a hard time distinguishing between an unindented
case expression (which continues the switch statement) and an unrelated
statement that starts like an expression (such as an assignment or a
procedure call).  The parser is not smart enough to backtrack once it
sees the colon.  This is my least favorite alternative.


Extended Syntax
===============

There is one additional concern that needs to be addressed
syntactically.  Often two or more values need to be treated the same.
In C, this is done by writing multiple case labels together without any
code between them.  The "fall through" semantics then mean that these
are all handled by the same code.  Since the Python switch will not
have fall-through semantics (which have yet to find a champion) we
need another solution.  Here are some alternatives.

Alternative A
-------------

Use::

    case EXPR:

to match on a single expression; use::

    case EXPR, EXPR, ...:

to match on multiple expressions.  This is interpreted so that if
EXPR is a parenthesized tuple or another expression whose value is a
tuple, the switch expression must equal that tuple, not one of its
elements.  This means that we cannot use a variable to indicate
multiple cases.  While this is also true in C's switch statement, it
is a relatively common occurrence in Python (see for example
sre_compile.py).

Alternative B
-------------

Use::

    case EXPR:

to match on a single expression; use::

    case in EXPR_LIST:

to match on multiple expressions.  If EXPR_LIST is a single
expression, the 'in' forces its interpretation as an iterable (or
something supporting __contains__, in a minority semantics
alternative).  If it is multiple expressions, each of those is
considered for a match.

Alternative C
-------------

Use::

    case EXPR:

to match on a single expression; use::

    case EXPR, EXPR, ...:

to match on multiple expressions (as in alternative A); and use::

    case *EXPR:

to match on the elements of an expression whose value is an iterable.
The latter two cases can be combined, so that the true syntax is more
like this::

    case [*]EXPR, [*]EXPR, ...:

Note that the * notation is similar to the use of prefix * already in
use for variable-length parameter lists and for passing computed
argument lists, and often proposed for value-unpacking (e.g.  "a, b,
*c = X" as an alternative to "(a, b), c = X[:2], X[2:]").

Alternative D
-------------

This is a mixture of alternatives B and C; the syntax is like
alternative B but instead of the 'in' keyword it uses '*'.  This is
more limited, but still allows the same flexibility.  It uses::

    case EXPR:

to match on a single expression and::

    case *EXPR:

to match on the elements of an iterable.  If one wants to specify
multiple matches in one case, one can write this::

    case *(EXPR, EXPR, ...):

or perhaps this (although it's a bit strange because the relative
priority of '*' and ',' is different than elsewhere)::

    case * EXPR, EXPR, ...:

Discussion
----------

Alternatives B, C and D are motivated by the desire to specify
multiple cases with the same treatment using a variable representing a
set (usually a tuple) rather than spelling them out.  The motivation
for this is usually that if one has several switches over the same set
of cases it's a shame to have to spell out all the alternatives each
time.  An additional motivation is to be able to specify *ranges* to
be matched easily and efficiently, similar to Pascal's "1..1000:"
notation.  At the same time we want to prevent the kind of mistake
that is common in exception handling (and which will be addressed in
Python 3000 by changing the syntax of the except clause): writing
"case 1, 2:" where "case (1, 2):" was meant, or vice versa.
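
The two readings can be spelled out with ordinary expressions.  This
is only a sketch of the intended meanings under alternative A; the
variable names are made up for illustration:

```python
x = 2

# What "case 1, 2:" would mean: x equals 1 or x equals 2.
matches_any = x == 1 or x == 2

# What "case (1, 2):" would mean: x equals the tuple itself.
matches_tuple = x == (1, 2)
```

Here the first test succeeds and the second fails, so confusing the
two spellings silently changes which suite runs.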

The case could be made that the need is insufficient for the added
complexity; C doesn't have a way to express ranges either, and it's
used a lot more than Pascal these days.  Also, if a dispatch method
based on dict lookup is chosen as the semantics, large ranges could be
inefficient (consider range(1, sys.maxint)).
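
A rough sketch of why ranges sit badly with dict lookup, assuming a
dispatch dict that holds one entry per matched value (the helper name
and the handler string are hypothetical):

```python
def expand_range_case(dispatch, values, handler):
    # One dict entry per matched value: memory use and setup time
    # grow with the number of values in the range.
    for value in values:
        dispatch[value] = handler


dispatch = {}
expand_range_case(dispatch, range(1, 1001), "small")
```

An if/elif chain can express the same case with the single test
``1 <= x <= 1000``, whatever the size of the range.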

All in all my preferences are (in descending preference) B, A, D', C
where D' is D without the third possibility.


Semantics
=========

There are several issues to review before we can choose the right
semantics.

If/Elif Chain vs. Dict-based Dispatch
-------------------------------------

There are two main schools of thought about the switch statement's
semantics.  School I wants to define the switch statement in terms of
an equivalent if/elif chain.  School II prefers to think of it as a
dispatch on a precomputed dictionary.

The difference is mainly important when either the switch expression
or one of the case expressions is not hashable; school I wants this to
be handled as it would be by an if/elif chain (i.e. hashability of the
expressions involved doesn't matter) while school II is willing to say
that the switch expression and all the case expressions must be
hashable if a switch is to be used; otherwise the user should have
written an if/elif chain.
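
The difference can be seen with today's Python by emulating both
approaches.  The function names and the table below are illustrative
assumptions, not part of either proposal:

```python
def chain_dispatch(x):
    # School I: plain equality tests, so unhashable values are fine.
    if x == [1, 2]:
        return "pair"
    elif x == []:
        return "empty"
    return "other"


def dict_dispatch(x, table):
    # School II: the lookup hashes x, so an unhashable x (such as a
    # list) raises TypeError instead of falling through.
    return table.get(x, "other")


table = {(1, 2): "pair", (): "empty"}
```

Calling ``chain_dispatch([1, 2])`` selects a branch, while
``dict_dispatch([1, 2], table)`` raises TypeError because lists are
not hashable.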

There's also a difference of opinion regarding the treatment of
duplicate cases (i.e. two or more cases with the same match
expression).  School I wants to treat this the same as an if/elif
chain would treat it (i.e. the first match wins and the code for the
second match is silently unreachable); school II generally wants this
to be an error at the time the switch is frozen.
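
A sketch of the two treatments of duplicates.  The ``freeze`` helper
and the choice of ValueError are assumptions made for illustration;
no proposal fixes the exception type here:

```python
def chain(x):
    # School I: the first match wins.
    if x == 1:
        return "first"
    elif x == 1:
        # Silently unreachable.
        return "second"
    return None


def freeze(cases):
    """Build a dispatch dict from (value, handler) pairs."""
    table = {}
    for value, handler in cases:
        if value in table:
            # School II: a duplicate is trapped when the dict is built.
            raise ValueError("duplicate case: %r" % (value,))
        table[value] = handler
    return table
```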

There's also a school III which states that the definition of a switch
statement should be in terms of an equivalent if/elif chain, with the
exception that all the expressions must be hashable.

School I believes that the if/elif chain is the only reasonable,
surprise-free way of defining switch semantics, and that optimizations as
suggested by PEP 275's Solution 1 are sufficient to make most common
uses fast.

School II sees nothing but trouble in that approach: in an if/elif
chain, the test "x == y" might well be comparing two unhashable values
(e.g. two lists); even "x == 1" could be comparing a user-defined
class instance that is not hashable but happens to define equality to
integers.  Worse, the hash function might have a bug or a side effect;
if we generate code that believes the hash, a buggy hash might
generate an incorrect match, and if we generate code that catches
errors in the hash to fall back on an if/elif chain, we might hide
genuine bugs.  In addition, school II sees little value in allowing
cases involving unhashable values; after all if the user expects such
values, they can just as easily write an if/elif chain.  School II
also doesn't believe that it's fair to allow dead code due to
overlapping cases to occur unflagged, when the dict-based dispatch
implementation makes it so easy to trap this.
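
One way a faulty hash can make dict-based dispatch disagree with an
if/elif chain is sketched below.  The ``Sloppy`` class is hypothetical
and deliberately buggy; in this instance the dict lookup misses a
value that the equality test accepts:

```python
class Sloppy:
    """Hypothetical class whose __hash__ disagrees with its __eq__."""

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        return self.value == other

    def __hash__(self):
        # Bug: equal objects must hash equally, so this should be
        # hash(self.value).
        return 0


table = {1: "one"}
key = Sloppy(1)

equal_by_chain = (key == 1)    # what an if/elif chain would test
found_by_dict = table.get(key)  # what a dict-based dispatch would find
```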

School III admits the problems with making hash() optional, but still
believes that the true semantics should be defined by an if/elif chain
even if the implementation should be allowed to use dict-based
dispatch as an optimization.  This means that duplicate cases must be
resolved by always choosing the first case, making the second case
undiagnosed dead code.

Personally, I'm in school II: I believe that the dict-based dispatch
is the one true implementation for switch statements and that we
should face the limitations and benefits up front.

When to Freeze the Dispatch Dict
--------------------------------

For the supporters of school II (dict-based dispatch), the next big
dividing issue is when to create the dict used for switching.  I call
this "freezing the dict".

The main problem that makes this interesting is the observation that
Python doesn't have named compile-time constants.  What is
conceptually a constant, such as re.IGNORECASE, is a variable to the
compiler, and there's nothing to stop crooked code from modifying its
value.
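
A small demonstration of the point, using a hypothetical stand-in
namespace rather than the real ``re`` module (the values are
arbitrary):

```python
import types

# Stand-in for a module that exports a "constant".
flags = types.SimpleNamespace(IGNORECASE=2)

before = flags.IGNORECASE
flags.IGNORECASE = 99  # nothing prevents this rebinding at run time
after = flags.IGNORECASE
```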

Option 1
''''''''

The most limiting option is to freeze the dict in the compiler.  This
would require that the case expressions are all literals or
compile-time expressions involving only literals and operators whose
semantics are known to the compiler, since with the current state of
Python's dynamic semantics and single-module compilation, there is no
hope for the compiler to know with sufficient certainty the values of
any variables occurring in such expressions.  This is widely though
not universally considered too restrictive.

Raymond Hettinger is the main advocate of this approach.  He proposes
a syntax where only a single literal of certain types is allowed as
the case expression.  It has the advantage of being unambiguous and
easy to implement.

My main complaint about this is that by disallowing "named constants"
we force programmers to give up good habits.  Named constants are
introduced in most languages to solve the problem of "magic numbers"
occurring in the source code.  For example, sys.maxint is a lot more
readable than 2147483647.  Raymond proposes to use string literals
instead of named "enums", observing that the string literal's content
can be the name that the constant would otherwise have.  Thus, we
could write "case 'IGNORECASE':" instead of "case re.IGNORECASE:".
However, if there is a spelling error in the string literal, the case
will silently be ignored, and who knows when the bug will be detected.
If there is a spelling error in a NAME, however, the error will be
caught as soon as it is evaluated.  Also, sometimes the constants are
externally defined (e.g. when parsing a file format like JPEG) and we
can't easily choose appropriate string values.  Using an explicit
mapping dict sounds like a poor hack.
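
The difference in failure modes can be sketched as follows; the
``Flags`` class and both functions are hypothetical and emulate the
dispatch with an ordinary dict:

```python
class Flags:
    # Hypothetical named constant.
    IGNORECASE = 2


def with_strings(flag):
    table = {"IGNORECASE": "ignore case"}
    # A misspelled string simply fails to match any case.
    return table.get(flag, "no match")


def with_misspelled_name():
    # A misspelled name fails loudly as soon as it is evaluated.
    return Flags.IGNORCASE
```

Here ``with_strings("IGNORCASE")`` quietly returns "no match", while
``with_misspelled_name()`` raises AttributeError.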

Option 2
''''''''

The oldest proposal to deal with this is to freeze the dispatch dict
the first time the switch is executed.  At this point we can assume
that all the named "constants" (constant in the programmer's mind,
though not to the compiler) used as case expressions are defined --
otherwise an if/elif chain would have little chance of success either.
Assuming the switch will be executed many times, doing some extra work
the first time pays back quickly by very quick dispatch times later.
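
The first-use rule can be approximated in current Python with a
helper that builds its dict lazily.  This is only a sketch; the class
name and its interface are assumptions, not a proposed API:

```python
class FirstUseSwitch:
    """Hypothetical helper: build the dispatch dict on the first call."""

    def __init__(self, make_cases):
        # make_cases is a callable returning (value, handler) pairs.
        self._make_cases = make_cases
        self._table = None

    def __call__(self, x, default=None):
        if self._table is None:
            # The case expressions are evaluated once, here.
            self._table = dict(self._make_cases())
        return self._table.get(x, default)


evaluations = []


def cases():
    evaluations.append("evaluated")
    return [(1, "one"), (2, "two")]


switch = FirstUseSwitch(cases)
results = [switch(1), switch(2), switch(3, "other")]
```

After three dispatches the case expressions have been evaluated only
once; every later call is a single dict lookup.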

A mostly theoretical objection to this option is that there is no
obvious object where the dispatch dict can be stored.  It can't be
stored on the code object, which is supposed to be immutable; it can't
be stored on the function object, since many function objects may be
created for the same function (e.g. for nested functions).  In
practice, I'm sure that something can be found; it could be stored in
a section of the code object that's not considered when comparing two
code objects or when pickling or marshalling a code object; or all
switches could be stored in a dict indexed by weak references to code
objects.
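
The last idea might look roughly like this in a Python version whose
code objects support weak references (3.2 and later).  The registry
and helper names are hypothetical:

```python
import weakref

# Hypothetical registry: code object -> frozen dispatch dict.
_frozen = weakref.WeakKeyDictionary()


def dispatch_table_for(func, make_cases):
    code = func.__code__
    table = _frozen.get(code)
    if table is None:
        table = dict(make_cases())
        _frozen[code] = table
    return table


def make_adder(n):
    def add(x):
        return x + n
    return add


# Two function objects created from the same nested definition
# share one code object, and therefore one frozen dict.
add1 = make_adder(1)
add2 = make_adder(2)
first = dispatch_table_for(add1, lambda: [(1, "one")])
second = dispatch_table_for(add2, lambda: [(2, "two")])
```

The second call returns the dict frozen by the first and never
evaluates its own cases.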

Another objection is that the first-use rule allows obfuscated code
like this::

    def foo(x, y):
        switch x:
        case y:
            print 42

To the untrained eye (not familiar with Python) this code would be
equivalent to this::

    def foo(x, y):
        if x == y:
            print 42

but that's not what it does (unless it is always called with the same
value as the second argument).  This has been addressed by suggesting
that the case expressions should not be allowed to reference local
variables.  But this is somewhat arbitrary.
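
The surprising behavior can be reproduced by emulating the first-use
rule by hand.  In this sketch the module-level dict stands in for
wherever the frozen dispatch dict would be stored, and the function
returns 42 instead of printing it:

```python
_frozen_cases = {}


def foo(x, y):
    # Emulates "switch x: case y: ..." under the first-use rule:
    # the case value is captured from y on the first call only.
    if "table" not in _frozen_cases:
        _frozen_cases["table"] = {y: 42}
    return _frozen_cases["table"].get(x)


first = foo(1, 1)   # freezes the single case as 1
second = foo(2, 2)  # x == y, but the frozen case is still 1
third = foo(1, 5)   # x != y, yet the frozen case matches
```

Only the first call behaves like ``if x == y``; the second returns
None and the third returns 42.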

A final objection is that in a multi-threaded application, the
first-use rule requires intricate locking in order to guarantee the
correct semantics.  (The first-use rule suggests a promise that side
effects of case expressions are incurred exactly once.)
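
A minimal sketch of the kind of locking this implies, assuming a
hand-written helper rather than any interpreter support (all names
are hypothetical):

```python
import threading


class LockedFirstUseSwitch:
    """Hypothetical helper: freeze the dispatch dict exactly once."""

    def __init__(self, make_cases):
        self._make_cases = make_cases
        self._table = None
        self._lock = threading.Lock()

    def __call__(self, x, default=None):
        if self._table is None:
            with self._lock:
                # Re-check: another thread may have frozen the dict
                # while this one was waiting for the lock.
                if self._table is None:
                    self._table = dict(self._make_cases())
        return self._table.get(x, default)


calls = []


def cases():
    calls.append(1)  # side effect of evaluating the case expressions
    return [(1, "one")]


switch = LockedFirstUseSwitch(cases)
threads = [threading.Thread(target=switch, args=(1,)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Without the lock and the second check, two threads arriving at the
same time could both evaluate the case expressions.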

Option 3
''''''''

TBD


Copyright
=========

This document has been placed in the public domain.


..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End: