PEP: 635
Title: Structural Pattern Matching: Motivation and Rationale
Version: $Revision$
Last-Modified: $Date$
Author: Tobias Kohn, Guido van Rossum
BDFL-Delegate:
Discussions-To: Python-Dev
Status: Draft
Type: Informational
Content-Type: text/x-rst
Created: 12-Sep-2020
Python-Version: 3.10
Post-History:
Resolution:


Abstract
========

**NOTE:** This draft is incomplete and not intended for review yet.
We're checking it into the peps repo for the convenience of the authors.

This PEP provides the motivation and rationale for PEP 634
("Structural Pattern Matching: Specification").  First-time readers
are encouraged to start with PEP 636, which provides a gentler
introduction to the concepts, syntax and semantics of patterns.


Motivation
==========

(Structural) pattern matching syntax is found in many languages, from
Haskell, Erlang and Scala to Elixir and Ruby.  (A proposal for
JavaScript is also under consideration.)

Python already supports a limited form of this through sequence
unpacking assignments, which the new proposal leverages.

Several other common Python idioms are also relevant:

- The ``if ... elif ... elif ... else`` idiom is often used to find
  out the type or shape of an object in an ad-hoc fashion, using one
  or more checks like ``isinstance(x, cls)``, ``hasattr(x, "attr")``,
  ``len(x) == n`` or ``"key" in x`` as guards to select an applicable
  block.  The block can then assume ``x`` supports the interface
  checked by the guard.  For example::

    if isinstance(x, tuple) and len(x) == 2:
        host, port = x
        mode = "http"
    elif isinstance(x, tuple) and len(x) == 3:
        host, port, mode = x
    # Etc.

  Code like this is more elegantly rendered using ``match``::

    match x:
        case host, port:
            mode = "http"
        case host, port, mode:
            pass
        # Etc.

- AST traversal code often looks for nodes matching a given pattern,
  for example the code to detect a node of the shape "A + B * C" might
  look like this::

    if (isinstance(node, BinOp) and node.op == "+"
            and isinstance(node.right, BinOp) and node.right.op == "*"):
        a, b, c = node.left, node.right.left, node.right.right
        # Handle a + b*c

  Using ``match`` this becomes more readable::

    match node:
        case BinOp("+", a, BinOp("*", b, c)):
            # Handle a + b*c

- TODO: Other compelling examples?

We believe that adding pattern matching to Python will enable Python
users to write cleaner, more readable code for examples like those
above, and many others.

Pattern matching and OO
-----------------------

Pattern matching is complementary to the object-oriented paradigm.
Using OO and inheritance we can easily define a method on a base class
that defines default behavior for a specific operation on that class,
and we can override this default behavior in subclasses.  We can also
use the Visitor pattern to separate actions from data.

But this is not sufficient for all situations.  For example, a code
generator may consume an AST, and have many operations where the
generated code needs to vary based not just on the class of a node,
but also on the value of some class attributes, like the ``BinOp``
example above.  The Visitor pattern is insufficiently flexible for
this: it can only select based on the class.

For a complete example, see
https://github.com/gvanrossum/patma/blob/master/examples/expr.py#L231

TODO: Could we say more here?

Pattern and functional style
----------------------------

Most Python applications and libraries are not written in a consistent
OO style -- unlike Java, Python encourages defining functions at the
top-level of a module, and for simple data structures, tuples (or
named tuples or lists) and dictionaries are often used exclusively or
mixed with classes or data classes.

Pattern matching is particularly suitable for picking apart such data
structures.
As an extreme example, it's easy to write code that picks
apart a JSON data structure using ``match``.

TODO: Example code.


Rationale
=========

TBD.

This section should provide the rationale for individual design decisions.
It takes the place of "Rejected ideas" in the standard PEP format.
It is organized in sections corresponding to the specification (PEP 634).


Overview and terminology
------------------------


The ``match`` statement
-----------------------

The match statement evaluates an expression to produce a subject, finds the
first pattern that matches the subject and executes the associated block
of code.  Syntactically, the match statement thus takes an expression and
a sequence of case clauses, where each case clause comprises a pattern and
a block of code.

Since case clauses comprise a block of code, they adhere to the existing
indentation scheme with the syntactic structure of
``<keyword> ...: <(indented) block>``, which in turn makes each of them a
(compound) statement.  The chosen keyword ``case`` reflects its widespread
use in pattern matching languages, ignoring those languages that use other
syntactic means such as a symbol like ``|`` because it would not fit
established Python structures.  The syntax of patterns following the
keyword is discussed below.

Given that the case clauses follow the structure of a compound statement,
the match statement naturally becomes a compound statement as well,
following the same syntactic structure.  This naturally leads to
``match <expr>: <case_clause>+``.  Note that the match statement determines
a quasi-scope in which the evaluated subject is kept alive (although not in
a local variable), similar to how a with statement might keep a resource
alive during execution of its block.  Furthermore, control flows from the
match statement to a case clause and then leaves the block of the match
statement.  The block of the match statement thus has both syntactic and
semantic meaning.

Various suggestions have sought to eliminate or avoid the naturally arising
"double indentation" of a case clause's code block.  Unfortunately, all such
proposals of *flat indentation schemes* come at the expense of violating
Python's established structural paradigm, leading to additional syntactic
rules:

- *Unindented case clauses.*
  The idea is to align case clauses with the ``match``, i.e.::

    match expression:
    case pattern_1:
        ...
    case pattern_2:
        ...

  This may look awkward to the eye of a Python programmer, because
  everywhere else a colon is followed by an indent.  The ``match`` would
  neither follow the syntactic scheme of simple nor composite statements
  but rather establish a category of its own.

- *Putting the expression on a separate line after ``match``.*
  The idea is to use the expression yielding the subject as a statement
  to avoid the singularity of ``match`` having no actual block despite
  the colons::

    match:
        expression
    case pattern_1:
        ...
    case pattern_2:
        ...

  This was ultimately rejected because the first block would be another
  novelty in Python's grammar: a block whose only content is a single
  expression rather than a sequence of statements.  Attempts to amend this
  issue by adding or repurposing yet another keyword along the lines of
  ``match: return expression`` did not yield any satisfactory solution.

Although flat indentation would save some horizontal space, the cost of
increased complexity or unusual rules is too high.  It would also complicate
life for simple-minded code editors.  Finally, the horizontal space issue can
be alleviated by allowing "half-indent" (i.e. two spaces instead of four)
for match statements.

In sample programs using match, written as part of the development of this
PEP, a noticeable improvement in code brevity is observed, more than making
up for the additional indentation level.


*Statement vs. Expression.*  Some suggestions centered around the idea of
making ``match`` an expression rather than a statement.
However, this would fit poorly with Python's statement-oriented nature
and lead to unusually long and complex expressions with the need to invent
new syntactic constructs or break well established syntactic rules.  An
obvious consequence of ``match`` as an expression would be that case clauses
could no longer have arbitrary blocks of code attached, but only a single
expression.  Overall, the slight simplification in some special use cases
could in no way offset these strong limitations.


Match semantics
~~~~~~~~~~~~~~~

The patterns of different case clauses might overlap in that more than
one case clause would match a given subject.  The first-to-match rule
ensures that the selection of a case clause for a given subject is
unambiguous.  Furthermore, case clauses can have increasingly general
patterns matching wider classes of subjects.  The first-to-match rule
then ensures that the most precise pattern can be chosen (although it
is the programmer's responsibility to order the case clauses correctly).

In a statically typed language, the match statement would be compiled to
a decision tree to select a matching pattern quickly and very efficiently.
This would, however, require that all patterns be purely declarative and
static, running against the established dynamic semantics of Python.  The
proposed semantics thus represent a path incorporating the best of both
worlds: patterns are tried in a strictly sequential order so that each
case clause constitutes an actual statement.  At the same time, we allow
the interpreter to cache any information about the subject or change the
order in which subpatterns are tried.  In other words: if the interpreter
has found that the subject is not an instance of a class ``C``, it can
directly skip case clauses testing for this again, without having to
perform repeated instance-checks.  If a guard stipulates that a variable
``x`` must be positive, say (i.e.
``if x > 0``), the interpreter might check this directly after binding
``x`` and before any further subpatterns are considered.


*Binding and scoping.*  In many pattern matching implementations, each
case clause would establish a separate scope of its own.  Variables bound
by a pattern would then only be visible inside the corresponding case block.
In Python, however, this does not make sense.  Establishing separate scopes
would essentially mean that each case clause is a separate function without
direct access to the variables in the surrounding scope (without having to
resort to ``nonlocal`` that is).  Moreover, a case clause could no longer
influence any surrounding control flow through standard statements such as
``return`` or ``break``.  Hence, such strict scoping would lead to
unintuitive and surprising behavior.

A direct consequence of this is that any variable bindings outlive the
respective case or match statements.  Even patterns that only match a
subject partially might bind local variables (this is, in fact, necessary
for guards to function properly).  However, this escaping of variable
bindings is in line with existing Python structures such as for loops and
with statements.


.. _patterns:

Patterns
--------

Patterns fulfill two purposes: they impose (structural) constraints on
the subject and they specify which data values should be extracted from
the subject and bound to variables.  In iterable unpacking, which can be
seen as a prototype to pattern matching in Python, there is only one
*structural pattern* to express sequences while there is a rich set of
*binding patterns* to assign a value to a specific variable or field.
Full pattern matching differs from this in that there is more variety
in structural patterns but only a minimum of binding patterns.

Patterns differ from assignment targets (as in iterable unpacking) in that
they impose additional constraints on the structure of the subject and in
that a subject might safely fail to match a specific pattern at any point
(in iterable unpacking, this constitutes an error).  The latter means that
patterns should avoid side effects wherever possible, including binding
values to attributes or subscripts.

A cornerstone of pattern matching is the possibility of arbitrarily
*nesting patterns*.  The nesting allows for expressing deep
tree structures (for an example of nested class patterns, see the motivation
section above) as well as alternatives.

Although the structural patterns might superficially look like expressions,
it is important to keep in mind that there is a clear distinction.  In fact,
no pattern is or contains an expression.  It is more productive to think of
patterns as declarative elements similar to the formal parameters in a
function definition.


Walrus patterns
~~~~~~~~~~~~~~~


OR patterns
~~~~~~~~~~~

The OR pattern allows you to combine 'structurally equivalent' alternatives
into a new pattern, i.e. several patterns can share a common handler.  If any
one of an OR pattern's subpatterns matches the given subject, the entire OR
pattern succeeds.

Statically typed languages prohibit the binding of names (capture patterns)
inside an OR pattern because of potential conflicts concerning the types of
variables.  As a dynamically typed language, Python can be less restrictive
here and allow capture patterns inside OR patterns.  However, each subpattern
must bind the same set of variables so as not to leave potentially undefined
names.  With two alternatives ``P | Q``, this means that if *P* binds the
variables *u* and *v*, *Q* must bind exactly the same variables *u* and *v*.

There was some discussion on whether to use the bar ``|`` or the keyword
``or`` in order to separate alternatives.
The OR pattern does not fully fit
the existing semantics and usage of either of these two symbols.  However,
``|`` is the symbol of choice in all programming languages with support of
the OR pattern and is even used in that capacity for regular expressions in
Python as well.  Moreover, ``|`` is not only used for bitwise OR, but also
for set unions and dict merging (:pep:`584`).

Other alternatives were considered as well, but none of these would allow
OR-patterns to be nested inside other patterns:

- *Using a comma*::

    case 401, 403, 404:
        print("Some HTTP error")

  This looks too much like a tuple -- we would have to find a different way
  to spell tuples, and the construct would have to be parenthesized inside
  the argument list of a class pattern.  In general, commas already have many
  different meanings in Python; we shouldn't add more.

- *Using stacked cases*::

    case 401:
    case 403:
    case 404:
        print("Some HTTP error")

  This is how this would be done in *C*, using its fall-through semantics
  for cases.  However, we don't want to mislead people into thinking that
  match/case uses fall-through semantics (which are a common source of bugs
  in *C*).  Also, this would be a novel indentation pattern, which might make
  it harder to support in IDEs and such (it would break the simple rule "add
  an indentation level after a line ending in a colon").  Finally, this
  would not support OR patterns nested inside other patterns.

- *Using ``case in`` followed by a comma-separated list*::

    case in 401, 403, 404:
        print("Some HTTP error")

  This would not work for OR patterns nested inside other patterns, like::

    case Point(0|1, 0|1):
        print("A corner of the unit square")


*AND and NOT patterns.*
This proposal defines an OR-pattern (``|``) to match one of several
alternatives; why not also an AND-pattern (``&``) or even a NOT-pattern
(``!``)?  Especially given that some other languages (``F#`` for example)
support AND-patterns.

However, it is not clear how useful this would be.
The semantics for
matching dictionaries, objects and sequences already incorporates an implicit
'and': all attributes and elements mentioned must be present for the match to
succeed.  Guard conditions can also support many of the use cases that a
hypothetical 'and' operator would be used for.

A negation of a match pattern using the operator ``!`` as a prefix would match
exactly if the pattern itself does not match.  For instance, ``!(3 | 4)``
would match anything except ``3`` or ``4``.  However, there is evidence from
other languages that this is rarely useful and primarily used as double
negation ``!!`` to control variable scopes and prevent variable bindings
(which does not apply to Python).

In the end, it was decided that this would make the syntax more complex
without adding a significant benefit.

Example::

    def simplify(expr):
        match expr:
            case ('/', 0, 0):
                return expr
            case ('*' | '/', 0, _):
                return 0
            case ('+' | '-', x, 0) | ('+', 0, x) | ('*', 1, x) | ('*' | '/', x, 1):
                return x
        return expr


.. _capture_pattern:

Capture Patterns
~~~~~~~~~~~~~~~~

Capture patterns take on the form of a name that accepts any value and binds
it to a (local) variable (unless the name is declared as ``nonlocal`` or
``global``).  In that sense, a simple capture pattern is basically equivalent
to a parameter in a function definition (when the function is called, each
parameter binds the respective argument to a local variable in the function's
scope).

A name used for a capture pattern must not coincide with another capture
pattern in the same pattern.  This, again, is similar to parameters, which
equally require each parameter name to be unique within the list of
parameters.  It differs, however, from iterable unpacking assignment, where
the repeated use of a variable name as target is permissible (e.g.,
``x, x = 1, 2``).  The rationale for not supporting ``(x, x)`` in patterns
is its ambiguous reading: it could be seen as in iterable unpacking where
only the second binding to ``x`` survives.
But it could be equally seen as
expressing a tuple with two equal elements (which comes with its own issues).
Should the need arise, then it is still possible to introduce support for
repeated use of names later on.

There were calls to explicitly mark capture patterns and thus identify them
as binding targets.  According to that idea, a capture pattern would be
written as, e.g. ``?x`` or ``$x``.  The aim of such explicit capture markers
is to let an unmarked name be a constant value pattern (see below).  However,
this is based on the misconception that pattern matching was an extension of
*switch* statements, placing the emphasis on fast switching based on
(ordinal) values.  Such a *switch* statement has indeed been proposed for
Python before (see :pep:`275` and :pep:`3103`).  Pattern matching, on the
other hand, builds a generalized concept of iterable unpacking.  Binding
values extracted from a data structure is at the very core of the concept
and hence the most common use case.  Explicit markers for capture patterns
would thus betray the objective of the proposed pattern matching syntax and
simplify a secondary use case at the expense of additional syntactic clutter
for core cases.

Example::

    def average(*args):
        match args:
            case [x, y]:           # captures the two elements of a sequence
                return (x + y) / 2
            case [x]:              # captures the only element of a sequence
                return x
            case []:
                return 0
            case x:                # captures the entire sequence
                return sum(x) / len(x)


.. _wildcard_pattern:

Wildcard Pattern
~~~~~~~~~~~~~~~~

The wildcard pattern is a special case of a 'capture' pattern: it accepts
any value, but does not bind it to a variable.  The idea behind this rule
is to support repeated use of the wildcard in patterns.  While ``(x, x)``
is an error, ``(_, _)`` is legal.

Particularly in larger (sequence) patterns, it is important to allow the
pattern to concentrate on values with actual significance while ignoring
anything else.
Without a wildcard, it would become necessary to 'invent'
a number of local variables, which would be bound but never used.  Even
when sticking to naming conventions and using e.g. ``_1, _2, _3`` to name
irrelevant values, say, this still introduces visual clutter and can hurt
performance (compare the sequence pattern ``(x, y, *z)`` to ``(_, y, *_)``,
where the ``*z`` forces the interpreter to copy a potentially very long
sequence, whereas the second version simply compiles to code along the
lines of ``y = seq[1]``).

There has been much discussion about the choice of the underscore ``_``
as a wildcard pattern, i.e. making this one name non-binding.  However, the
underscore is already heavily used as an 'ignore value' marker in iterable
unpacking.  Since the wildcard pattern ``_`` never binds, this use of the
underscore does not interfere with other uses such as inside the REPL or
the ``gettext`` module.

It has been proposed to use ``...`` (i.e., the ellipsis token) or ``*``
(star) as a wildcard.  However, both these look as if an arbitrary number
of items is omitted::

    case [a, ..., z]: ...
    case [a, *, z]: ...

Both look like they would match a sequence of two or more items,
capturing the first and last values.

A single wildcard clause (i.e. ``case _:``) is semantically equivalent to
an ``else:``.  It accepts any subject without binding it to a variable or
performing any other operation.  However, the wildcard pattern is, in
contrast to ``else``, usable as a subpattern in nested patterns.

Finally, note that the underscore is used as a wildcard pattern in *every*
programming language with pattern matching that we could find
(including *C#*, *Elixir*, *Erlang*, *F#*, *Grace*, *Haskell*,
*Mathematica*, *OCaml*, *Ruby*, *Rust*, *Scala*, *Swift*, and *Thorn*).
Keeping in mind that many users of Python also work with other programming
languages, have prior experience when learning Python, or move on to
other languages after having learnt Python, we find that such well
established standards are important and relevant with respect to
readability and learnability.  In our view, concerns that this wildcard
means that a regular name receives special treatment are not strong
enough to introduce syntax that would make Python special.

Example::

    def is_closed(sequence):
        match sequence:
            case [_]:               # any sequence with a single element
                return True
            case [start, *_, end]:  # a sequence with at least two elements
                return start == end
            case _:                 # anything
                return False


.. _literal_pattern:

Literal Patterns
~~~~~~~~~~~~~~~~

Literal patterns are a convenient way for imposing constraints on the
value of a subject, rather than its type or structure.  Literal patterns
even allow you to emulate a switch statement using pattern matching.

Generally, the subject is compared to a literal pattern by means of standard
equality (``x == y`` in Python syntax).  Consequently, the literal patterns
``1.0`` and ``1`` match exactly the same set of objects, i.e. ``case 1.0:``
and ``case 1:`` are fully interchangeable.  In principle, ``True`` would also
match the same set of objects because ``True == 1`` holds.  However, we
believe that many users would be surprised finding that ``case True:``
matched the object ``1.0``, resulting in some subtle bugs and convoluted
workarounds.  We therefore adopted the rule that the three singleton
objects ``None``, ``False`` and ``True`` match by identity (``x is y`` in
Python syntax) rather than equality.  Hence, ``case True:`` will match only
``True`` and nothing else.  Note that ``case 1:`` would still match ``True``,
though, because the literal pattern ``1`` works by equality and not identity.

Early ideas to induce a hierarchy on numbers so that ``case 1.0`` would
match both the integer ``1`` and the floating point number ``1.0``, whereas
``case 1:`` would only match the integer ``1`` were eventually dropped in
favor of the simpler and consistent rule based on equality.  Moreover, any
additional checks whether the subject is an instance of ``numbers.Integral``
would come at a high runtime cost to introduce what would essentially be
novel in Python.  When needed, the explicit syntax ``case int(1):`` might
be used.

Recall that literal patterns are *not* expressions, but directly denote a
specific value or object.  From a syntactical point of view, we have to
ensure that negative and complex numbers can equally be used as patterns,
although they are not atomic literal values (i.e. the seeming literal value
``-3+4j`` would syntactically be an expression of the form
``BinOp(UnaryOp('-', 3), '+', 4j)``, but as expressions are not part of
patterns, we added syntactic support for such complex value literals without
having to resort to full expressions).  Interpolated *f*-strings, on the
other hand, are not literal values, despite their appearance and can
therefore not be used as literal patterns (string concatenation, however,
is supported).

Literal patterns not only occur as patterns in their own right, but also
as keys in *mapping patterns*.

Example::

    def simplify(expr):
        match expr:
            case ('+', 0, x):
                return x
            case ('+' | '-', x, 0):
                return x
            case ('and', True, x):
                return x
            case ('and', False, x):
                return False
            case ('or', False, x):
                return x
            case ('or', True, x):
                return True
            case ('not', ('not', x)):
                return x
        return expr


.. _constant_value_pattern:

Constant Value Patterns
~~~~~~~~~~~~~~~~~~~~~~~

It is good programming style to use named constants for parametric values or
to clarify the meaning of particular values.  Clearly, it would be desirable
to also write ``case (HttpStatus.OK, body):`` rather than
``case (200, body):``, for example.
The main issue that arises here is how to
distinguish capture patterns (variables) from constant value patterns.  The
general discussion surrounding this issue has brought forward a plethora of
options, which we cannot all fully list here.

Strictly speaking, constant value patterns are not really necessary, but
could be implemented using guards, i.e.
``case (status, body) if status == HttpStatus.OK:``.  Nonetheless, the
convenience of constant value patterns is unquestioned and obvious.

The observation that constants tend to be written in uppercase letters or
collected in enumeration-like namespaces suggests possible rules to discern
constants syntactically.  However, the idea of using upper vs. lower case as
a marker has been met with scepticism since there is no similar precedent
in core Python (although it is common in other languages).  We therefore only
adopted the rule that any dotted name (i.e. attribute access) is to be
interpreted as a constant value pattern like ``HttpStatus.OK``
above.  This precludes, in particular, local variables from acting as
constants.

Global variables can only be directly used as constants when defined in other
modules, although there are workarounds to access the current module as a
namespace as well.  A proposed rule to use a leading dot (e.g.
``.CONSTANT``) for that purpose was criticised because it was felt that the
dot would not be a visible-enough marker for that purpose.  Partly inspired
by use cases in other programming languages, a number of different
markers/sigils were proposed (such as ``^CONSTANT``, ``$CONSTANT``,
``==CONSTANT``, ``CONSTANT?``, or the word enclosed in backticks), although
there was no obvious or natural choice.  The current proposal therefore
leaves the discussion and possible introduction of such a 'constant' marker
for future PEPs.

Distinguishing the semantics of a name based on whether it is a global
variable (i.e.
the compiler would treat global variables as constants rather
than capture patterns) leads to various issues.  The addition or alteration
of a global variable in the module could have unintended side effects on
patterns.  Moreover, pattern matching could not be used directly inside a
module's scope because all variables would be global, making capture
patterns impossible.

Example::

    def handle_reply(reply):
        match reply:
            case (HttpStatus.OK, MimeType.TEXT, body):
                process_text(body)
            case (HttpStatus.OK, MimeType.APPL_ZIP, body):
                text = deflate(body)
                process_text(text)
            case (HttpStatus.MOVED_PERMANENTLY, new_URI):
                resend_request(new_URI)
            case (HttpStatus.NOT_FOUND):
                raise ResourceNotFound()


Group Patterns
~~~~~~~~~~~~~~

Allowing users to explicitly specify the grouping is particularly helpful
in case of OR patterns.


.. _sequence_pattern:

Sequence Patterns
~~~~~~~~~~~~~~~~~

Sequence patterns follow as closely as possible the already established
syntax and semantics of iterable unpacking.  Of course, subpatterns take
the place of assignment targets (variables, attributes and subscripts).
Moreover, the sequence pattern only matches a carefully selected set of
possible subjects, whereas iterable unpacking can be applied to any
iterable.

- As in iterable unpacking, we do not distinguish between 'tuple' and
  'list' notation.  ``[a, b, c]``, ``(a, b, c)`` and ``a, b, c`` are all
  equivalent.  While this means we have a redundant notation and checking
  specifically for lists or tuples requires more effort (e.g.
  ``case list([a, b, c])``), we mimic iterable unpacking as much as
  possible.

- A starred pattern will capture a sub-sequence of arbitrary length,
  mirroring iterable unpacking as well.  Only one starred item may be
  present in any sequence pattern.  In theory, patterns such as ``(*_, 3, *_)``
  could be understood as expressing any sequence containing the value ``3``.
  In practice, however, this would only work for a very narrow set of use
  cases and lead to inefficient backtracking or even ambiguities otherwise.

- The sequence pattern does *not* iterate through an iterable subject.  All
  elements are accessed through subscripting and slicing, and the subject
  must be an instance of ``collections.abc.Sequence`` (including, in
  particular, lists and tuples, but excluding strings and bytes, as well as
  sets and dictionaries).

A sequence pattern cannot just iterate through any iterable object.  The
consumption of elements from the iteration would have to be undone if the
overall pattern fails, which is not possible.

Relying on ``len()`` and subscripting and slicing alone does not work to
identify sequences because sequences share the protocol with more general
maps (dictionaries) in this regard.  It would be surprising if a sequence
pattern also matched dictionaries or other custom objects that implement
the mapping protocol (i.e. ``__getitem__``).  The interpreter therefore
performs an instance check to ensure that the subject in question really
is a sequence (of known type).

String and bytes objects have a dual nature: they are both 'atomic' objects
in their own right, as well as sequences (with a strongly recursive nature
in that a string is a sequence of strings).  The typical behavior and use
cases for strings and bytes are different enough from those of tuples and
lists to warrant a clear distinction.  It is in fact often unintuitive and
unintended that strings pass for sequences, as evidenced by regular questions
and complaints.  Strings and bytes are therefore not matched by a sequence
pattern, limiting the sequence pattern to a very specific understanding of
'sequence'.


.. _mapping_pattern:

Mapping Patterns
~~~~~~~~~~~~~~~~

Dictionaries or mappings in general are one of the most important and most
widely used data structures in Python.
In contrast to sequences, mappings are
built for fast direct access to arbitrary elements (identified by a key).
In most use cases an element is retrieved from a dictionary by a known key
without regard for any ordering or other key-value pairs stored in the same
dictionary.  Particularly common are string keys.

The mapping pattern reflects the common usage of dictionary lookup: it allows
the user to extract some values from a mapping by means of constant/known
keys and have the values match given subpatterns.  Moreover, the mapping
pattern does not check for the presence of additional keys.  Should it be
necessary to impose an upper bound on the mapping and ensure that no
additional keys are present, then the usual double-star-pattern ``**rest``
can be used.  The special case ``**_`` with a wildcard, however, is not
supported as it would not have any effect, but might lead to a wrong
understanding of the mapping pattern's semantics.

To avoid overly expensive matching algorithms, keys must be literals or
constant values.

Example::

    def change_red_to_blue(json_obj):
        match json_obj:
            case { 'color': ('red' | '#FF0000') }:
                json_obj['color'] = 'blue'
            case { 'children': children }:
                for child in children:
                    change_red_to_blue(child)


.. _class_pattern:

Class Patterns
~~~~~~~~~~~~~~

Class patterns fulfill two purposes: checking whether a given subject is
indeed an instance of a specific class and extracting data from specific
attributes of the subject.  A quick survey revealed that ``isinstance()``
is indeed one of the most often used functions in Python in terms of
static occurrences in programs.  Such instance checks typically precede
a subsequent access to information stored in the object, or a possible
manipulation thereof.
A typical pattern might be along the lines of::

  def traverse_tree(node):
      if isinstance(node, Node):
          traverse_tree(node.left)
          traverse_tree(node.right)
      elif isinstance(node, Leaf):
          print(node.value)

In many cases, however, class patterns occur nested as in the example
given in the motivation::

  if (isinstance(node, BinOp) and node.op == "+"
          and isinstance(node.right, BinOp) and node.right.op == "*"):
      a, b, c = node.left, node.right.left, node.right.right
      # Handle a + b*c

The class pattern lets you concisely specify both an instance check and
the relevant attributes (with possible further constraints). It is thereby
very tempting to write, e.g., ``case Node(left, right):`` in the first case
above and ``case Leaf(value):`` in the second. While this indeed works well
for languages with strict algebraic data types, it is problematic with the
structure of Python objects.

When dealing with general Python objects, we face a potentially very large
number of unordered attributes: an instance of ``Node`` contains a large
number of attributes (most of which are 'special methods' such as, e.g.,
``__repr__``). Moreover, the interpreter cannot reliably deduce which of
the attributes comes first and which comes second. For an object that
represents a circle, say, there is no inherently obvious ordering of the
attributes ``x``, ``y`` and ``radius``.

We envision two possibilities for dealing with this issue: either
explicitly name the attributes of interest or provide an additional mapping
that tells the interpreter which attributes to extract and in which order.
Both approaches are supported. Moreover, explicitly naming the attributes
of interest lets you further specify the required structure of an object;
if an object lacks an attribute specified by the pattern, the match fails.

- Attributes that are explicitly named pick up the syntax of named
  arguments. If an object of class ``Node`` has two attributes ``left``
  and ``right`` as above, the pattern ``Node(left=x, right=y)`` will
  extract the values of both attributes and assign them to ``x`` and ``y``,
  respectively. The data flow from left to right seems unusual, but is in
  line with mapping patterns and has precedents such as assignments via
  ``as`` in *with*- or *import*-statements.

  Naming the attributes in question explicitly will be mostly used for more
  complex cases where the positional form (below) is insufficient.

- The class attribute ``__match_args__`` specifies a number of attributes
  together with their ordering, allowing class patterns to rely on
  positional sub-patterns without having to explicitly name the attributes
  in question. This is particularly handy for smaller objects or instances
  of data classes, where the attributes of interest are rather obvious and
  often have a well-defined ordering. In a way, ``__match_args__`` is
  similar to the declaration of formal parameters, which allows calling
  functions with positional arguments rather than naming all the
  parameters.

The syntax of class patterns is based on the idea that de-construction
mirrors the syntax of construction. This is already the case in virtually
any Python construct, be it assignment targets, function definitions or
iterable unpacking. In all these cases, we find that the syntax for
sending and that for receiving 'data' are virtually identical.

- Assignment targets such as variables, attributes and subscripts:
  ``foo.bar[2] = foo.bar[3]``;

- Function definitions: a function defined with ``def foo(x, y, z=6)``
  is called as, e.g., ``foo(123, y=45)``, where the actual arguments
  provided at the call site are matched against the formal parameters
  at the definition site;

- Iterable unpacking: ``a, b = b, a`` or ``[a, b] = [b, a]`` or
  ``(a, b) = (b, a)``, just to name a few equivalent possibilities.

Using the same syntax for reading and writing, l- and r-values, or
construction and de-construction is widely accepted for its benefits in
thinking about data, its flow and manipulation. This equally extends to
the explicit construction of instances, where class patterns ``c(p, q)``
deliberately mirror the syntax of creating instances.


History and Context
===================

Pattern matching emerged in the late 1970s in the form of tuple unpacking
and as a means to handle recursive data structures such as linked lists or
trees (object-oriented languages usually use the visitor pattern for
handling recursive data structures). The early proponents of pattern
matching organised structured data in 'tagged tuples' rather than
``struct`` as in *C* or the objects introduced later. A node in a binary
tree would, for instance, be a tuple with two elements for the left and
right branches, respectively, and a ``Node`` tag, written as
``Node(left, right)``. In Python we would probably put the tag inside the
tuple as ``('Node', left, right)`` or define a data class ``Node`` to
achieve the same effect.

Using modern syntax, a depth-first tree traversal would then be written as
follows::

  def traverse_tree(node):
      match node:
          case Node(left, right):
              traverse_tree(left)
              traverse_tree(right)
          case Leaf(value):
              handle(value)

The notion of handling recursive data structures with pattern matching
immediately gave rise to the idea of handling more general recursive
'patterns' (i.e. recursion beyond recursive data structures) with pattern
matching. Pattern matching would thus also be used to define recursive
functions such as::

  def fib(arg):
      match arg:
          case 0:
              return 1
          case 1:
              return 1
          case n:
              return fib(n-1) + fib(n-2)

As pattern matching was repeatedly integrated into new and emerging
programming languages, its syntax slightly evolved and expanded. The
first two cases in the ``fib`` example above could be written more
succinctly as ``case 0 | 1:`` with ``|`` denoting alternative patterns.
Moreover, the underscore ``_`` was widely adopted as a wildcard, a filler
where neither the structure nor the value of parts of a pattern were of
substance. Since the underscore is already frequently used in equivalent
capacity in Python's iterable unpacking (e.g.,
``_, _, third, *_ = something``), we kept these universal standards.

It is noteworthy that the concept of pattern matching has always been
closely linked to the concept of functions. The different case clauses
have always been considered as something like semi-independent functions
where pattern variables take on the role of parameters. This becomes most
apparent when pattern matching is written as an overloaded function, along
the lines of (Standard ML)::

  fun fib 0 = 1
    | fib 1 = 1
    | fib n = fib (n-1) + fib (n-2)

Even though such a strict separation of case clauses into independent
functions does not make sense in Python, we find that patterns share many
syntactic rules with parameters, such as binding arguments to unqualified
names only, or that variable/parameter names must not be repeated within a
particular pattern/function.

With its emphasis on abstraction and encapsulation, object-oriented
programming posed a serious challenge to pattern matching. In short: in
object-oriented programming, we can no longer view objects as tagged
tuples. The arguments passed into the constructor do not necessarily
specify the attributes or fields of the objects. Moreover, there is no
longer a strict ordering of an object's fields, and some of the fields
might be private and thus inaccessible. And on top of this, the given
object might actually be an instance of a subclass with a slightly
different structure.

To address this challenge, patterns became increasingly independent of the
original tuple constructors. In a pattern like ``Node(left, right)``,
``Node`` is no longer a passive tag, but rather a function that can
actively check for any given object whether it has the right structure and
extract a ``left`` and ``right`` field.
In other words: the ``Node``-tag becomes a function that transforms an
object into a tuple or returns some failure indicator if it is not
possible.

In Python, we simply use ``isinstance()`` together with the
``__match_args__`` field of a class to check whether an object has the
correct structure and then transform some of its attributes into a tuple.
For the ``Node`` example above, for instance, we would have
``__match_args__ = ('left', 'right')`` to indicate that these two
attributes should be extracted to form the tuple. That is,
``case Node(x, y)`` would first check whether a given object is an
instance of ``Node`` and then assign ``left`` to ``x`` and ``right`` to
``y``, respectively.

Paying tribute to Python's dynamic nature with 'duck typing', however, we
also added a more direct way to specify the presence of, or constraints
on, specific attributes. Instead of ``Node(x, y)`` you could also write
``object(left=x, right=y)``, effectively eliminating the ``isinstance()``
check and thus supporting any object with ``left`` and ``right``
attributes. Or you would combine these ideas to write ``Node(right=y)``
so as to require an instance of ``Node`` but only extract the value of the
``right`` attribute.


Copyright
=========

This document is placed in the public domain or under the
CC0-1.0-Universal license, whichever is more permissive.


..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End: