diff --git a/pep-0617.rst b/pep-0617.rst new file mode 100644 index 000000000..f4d10d647 --- /dev/null +++ b/pep-0617.rst @@ -0,0 +1,740 @@ +PEP: 0617 +Title: New PEG parser for CPython +Version: $Revision$ +Last-Modified: $Date$ +Author: Guido van Rossum , + Pablo Galindo , + Lysandros Nikolaou +Discussions-To: Python-Dev +Status: Draft +Type: Standards Track +Content-Type: text/x-rst +Created: 24-March-2020 + + +======== +Overview +======== + +This PEP proposes to replace the current LL(1)-based parser of CPython +with a new PEG-based parser. This new parser will allow eliminating the multiple +"hacks" that exist in the current grammar to circumvent the LL(1)-limitation +while substantially reducing the maintenance costs in some areas related to the +compiling pipeline such as the grammar, the parser and the AST generation. The new PEG +parser will also lift the LL(1) restriction over the current Python grammar. + +=========================== +Background on LL(1) parsers +=========================== + +The current Python grammar is an LL(1)-based grammar. A grammar can be said to be +LL(1) if it can be parsed by an LL(1) parser, which in turn is defined as a +top-down parser that parses the input from left to right, performing leftmost +derivation of the sentence, and can only use one token of lookahead when parsing a +sentence. The traditional approach to construct or generate an LL(1) parser is to +produce a *parse table* which encodes the possible transitions between all possible +states of the parser. These tables are normally constructed from the *first sets* +and the *follow sets* of the grammar: + +* Given a rule, the *first set* are the collection of all terminals that can occur + first in a full derivation of that rule. Intuitively this helps the parser decide + among multiple alternatives if a rule can have multiple possibilities. For + instance, given the rule :: + + rule: A | B + + if only ``A`` can start with the terminal *a* and only ``B`` can start with the + terminal *b* and the parser sees the token *b* when parsing this rule, it knows + that it needs to follow the non-terminal ``B``. + +* Given a rule, the *follow set* are the collection of terminals that can appear + immediately to the right of that rule in a partial derivation. Intuitively this + solves the problem in which a rule can expand to the empty string. For instance, + given this rule:: + + rule: A 'b' + + if the parser has the token *b* and the rule A can only start with the token *a* + we know it is an invalid program but if A can be expanded also to the empty string + (called an ε-production) then we can consume the next token, 'b'. Therefore, *b* + is in the *follow set* of ``A``. + + +The Python grammar does not allow ε-productions so the *follow sets* are not +needed when creating the parse tables. Currently, in CPython, a parser generator +program reads the grammar and produces a parsing table representing a set of +deterministic finite automata (DFA) that can be included in a C program, the +parser, which is a pushdown automaton that uses this data to produce a Concrete +Syntax Tree (CST) sometimes known directly as a "parse tree". In this process, the +*first sets* are used indirectly when generating the DFAs. + +LL(1) parsers and grammars are usually known for being efficient and simple to +implement and generate the but the reality is that expressing some constructs +currently present in the Python language is notably difficult or impossible with +such a restriction. As LL(1) parsers can only look one token ahead to distinguish +possibilities, some rules in the grammar may be ambiguous. For instance the rule:: + + rule: A | B + +is ambiguous if the *first sets* of both ``A`` and ``B`` have some elements in +common. This is because if the parser sees a token in the input +program that both *A* and *B* can start with it is impossible for it to deduce +which option to expand as no further token of the program can be examined to +disambiguate. As will be shown later in this document, the current LL(1)-based +grammar suffers a lot from this scenario. + +Also, it is relevant to note (as other sections of this document will deal with this +concept) that a given grammar cannot be LL(1) if it is left recursive. A grammar is +left-recursive if and only if there exists a nonterminal that can derive to a +sentential form with itself as the leftmost symbol. For instance this rule:: + + rule: rule 'a' + +is left-recursive because the rule can be expanded to an expression that starts +with itself. As will be described later, left-recursion can be very useful to +express some desired properties directly in the grammar and the lack of +it can lead to some undesired scenarios. + +========================= +Background on PEG parsers +========================= + +A PEG (Parsing Expression Grammar) grammar differs from a context-free grammar +(like the current one) in the fact that the way it is written more closely +reflects how the parser will operate when parsing it. The fundamental technical +difference is that the choice operator is ordered. This means that when writing:: + + rule: A | B | C + +a context-free-grammar parser (like an LL(1) parser) will generate constructions +that given an input string will *deduce* which alternative (``A``, ``B`` or ``C``) +must be expanded, while a PEG parser will check if the first alternative succeeds +and only if it fails, will it continue with the second or the third one in the +order in which they are written. This makes the choice operator not commutative. + +Unlike LL(1) parsers PEG-based parsers cannot be ambiguous: if a string parses, +it has exactly one valid parse tree. This means that a PEG-based parser cannot +suffer from the ambiguity problems described in the previous section. + +PEG parsers are usually constructed as a recursive descent parser in which every +rule in the grammar corresponds to a function in the program implementing the +parser and the parsing expression (the "expansion" or "definition" of the rule) +represents the "code" in said function. Each parsing function conceptually takes +an input string as its argument, and yields one of the following results: + +* A "success" result. This result indicates that the expression can be parsed by + that rule and the function may optionally move forward or consume one or more + characters of the input string supplied to it. +* A "failure" result, in which case no input is consumed. + +Notice that "failure" results do not imply that the program is incorrect or a +parsing failure because as the choice operator is ordered, a "failure" result +merely indicates "try the following option". A direct implementation of a PEG +parser as a recursive descent parser will present exponential time performance in +the worst case as compared with LL(1) parsers, PEG parsers have infinite lookahead +(this means that they can consider an arbitrary number of tokens before deciding +for a rule). Usually, PEG parsers avoid this exponential time complexity with a +technique called "packrat parsing" [1]_ which not only loads the entire +program in memory before parsing it but also allows the parser to backtrack +arbitrarily. This is made efficient by memoizing the rules already matched for +each position. The cost of the memoization cache is that the parser will naturally +use more memory than a simple LL(1) parser, which normally are table-based. We +will explain later in this document why we consider this cost acceptable. + +========= +Rationale +========= + +In this section, we describe a list of problems that are present in the current parser +machinery in CPython that motivates the need for a new parser. + +--------------------------------- +Some rules are not actually LL(1) +--------------------------------- + +Although the Python grammar is technically an LL(1) grammar (because is parsed by +an LL(1) parser) several rules are not LL(1) and several workarounds are +implemented in the grammar and in other parts of CPython to deal with this. For +example, consider the rule for assignment expressions:: + + namedexpr_test: NAME [':=' test] + +This simple rule is not compatible with the Python grammar as *NAME* is among the +elements of the *first set* of the rule *test*. To work around this limitation the +actual rule that appears in the current grammar is:: + + namedexpr_test: test [':=' test] + +Which is a much broader rule than the previous one allowing constructs like ``[x +for x in y] := [1,2,3]``. The way the rule is limited to its desired form is by +disallowing these unwanted constructions when transforming the parse tree to the +abstract syntax tree. This is not only inelegant but a considerable maintenance +burden as it forces the AST creation routines and the compiler into a situation in +which they need to know how to separate valid programs from invalid programs, +which should be a responsibility solely of the parser. This also leads to the +actual grammar file not reflecting correctly what the *actual* grammar is (that +is, the collection of all valid Python programs). + +Similar workarounds appear in multiple other rules of the current grammar. +Sometimes this problem is unsolvable. For instance, `bpo-12782: Multiple context +expressions do not support parentheses for continuation across lines +`_ shows how making an LL(1) rule that supports +writing:: + + with ( + open("a_really_long_foo") as foo, + open("a_really_long_baz") as baz, + open("a_really_long_bar") as bar + ): + ... + +is not possible since the first sets of the grammar items that can +appear as context managers include the open parenthesis, making the rule +ambiguous. This rule is not only consistent with other parts of the language (like +the rule for multiple imports), but is also very useful to auto-formatting tools, +as parenthesized groups are normally used to group elements to be +formatted together (in the same way the tools operate on the contents of lists, +sets...). + +----------------------- +Complicated AST parsing +----------------------- + +Another problem of the current parser is that there is a huge coupling between the +AST generation routines and the particular shape of the produced parse trees. This +makes the code for generating the AST especially complicated as many actions and +choices are implicit. For instance, the AST generation code knows what +alternatives of a certain rule are produced based on the number of child nodes +present in a given parse node. This makes the code difficult to follow as this +property is not directly related to the grammar file and is influenced by +implementation details. As a result of this, a considerable amount of the AST +generation code needs to deal with inspecting and reasoning about the particular +shape of the parse trees that it receives. + +---------------------- +Lack of left recursion +---------------------- + +As described previously, a limitation of LL(1) grammars is that they cannot allow +left-recursion. This makes writing some rules very unnatural and far from how +programmers normally think about the program. For instance this construct (a simpler +variation of several rules present in the current grammar):: + + expr: expr '+' term | term + +cannot be parsed by an LL(1) parser. The traditional remedy is to rewrite the +grammar to circumvent the problem:: + + expr: term ('+' term)* + +The problem that appears with this form is that the parse tree is forced to have a +very unnatural shape. This is because with this rule, for the input program ``a + +b + c`` the parse tree will be flattened (``['a', '+', 'b', '+', 'c']``) and must +be post-processed to construct a left-recursive parse tree (``[['a', '+', 'b'], +'+', 'c']``). Being forced to write the second rule not only leads to the parse +tree not correctly reflecting the desired associativity, but also imposes further +pressure on later compilation stages to detect and post-process these cases. + +----------------------- +Intermediate parse tree +----------------------- + +The last problem present in the current parser is the intermediate creation of a +parse tree or Concrete Syntax Tree that is later transformed to an Abstract Syntax +Tree. Although the construction of a CST is very common in parser and compiler +pipelines, in CPython this intermediate CST is not used by anything else (it is +only indirectly exposed by the *parser* module and a surprisingly small part of +the code in the CST production is reused in the module). Which is worse: the whole +tree is kept in memory, keeping many branches that consist of chains of nodes with +a single child. This has shown to consume a considerable ammount of memory (for +instance in `bpo-26451: Excessive peak memory consumption by the Python +parser `_). + +Having to produce an intermediate result between the grammar and the AST is not only +undesirable but also makes the AST generation step much more complicated, raising +considerably the maintenance burden. + +=========================== +The new proposed PEG parser +=========================== + +The new proposed PEG parser contains the following pieces: + +* A parser generator that can read a grammar file and produce a PEG parser + written in Python or C that can parse said grammar. + +* A PEG meta-grammar that automatically generates a Python parser that is used + for the parser generator itself (this means that there are no manually-written + parsers). + +* A generated parser (using the parser generator) that can directly produce C and + Python AST objects. + +-------------- +Left recursion +-------------- + +PEG parsers normally do not support left recursion but we have implemented a +technique similar to the one described in Medeiros et al. [2]_ but using the +memoization cache instead of static variables. This approach is closer to the one +described in Warth et al. [3]_. This allows us to write not only simple left-recursive +rules but also more complicated rules that involve indirect left-recursion like:: + + rule1: rule2 | 'a' + rule2: rule3 | 'b' + rule3: rule1 | 'c' + +and "hidden left-recursion" like:: + + rule: 'optional'? rule '@' some_other_rule + +------ +Syntax +------ + +The grammar consists of a sequence of rules of the form: :: + + rule_name: expression + +Optionally, a type can be included right after the rule name, which +specifies the return type of the C or Python function corresponding to +the rule: :: + + rule_name[return_type]: expression + +If the return type is omitted, then a ``void *`` is returned in C and an +``Any`` in Python. + +The full meta-grammar for the grammars supported by the PEG generator is: + +:: + + start[Grammar]: grammar ENDMARKER { grammar } + + grammar[Grammar]: + | metas rules { Grammar(rules, metas) } + | rules { Grammar(rules, []) } + + metas[MetaList]: + | meta metas { [meta] + metas } + | meta { [meta] } + + meta[MetaTuple]: + | "@" NAME NEWLINE { (name.string, None) } + | "@" a=NAME b=NAME NEWLINE { (a.string, b.string) } + | "@" NAME STRING NEWLINE { (name.string, literal_eval(string.string)) } + + rules[RuleList]: + | rule rules { [rule] + rules } + | rule { [rule] } + + rule[Rule]: + | rulename ":" alts NEWLINE INDENT more_alts DEDENT { + Rule(rulename[0], rulename[1], Rhs(alts.alts + more_alts.alts)) } + | rulename ":" NEWLINE INDENT more_alts DEDENT { Rule(rulename[0], rulename[1], more_alts) } + | rulename ":" alts NEWLINE { Rule(rulename[0], rulename[1], alts) } + + rulename[RuleName]: + | NAME '[' type=NAME '*' ']' {(name.string, type.string+"*")} + | NAME '[' type=NAME ']' {(name.string, type.string)} + | NAME {(name.string, None)} + + alts[Rhs]: + | alt "|" alts { Rhs([alt] + alts.alts)} + | alt { Rhs([alt]) } + + more_alts[Rhs]: + | "|" alts NEWLINE more_alts { Rhs(alts.alts + more_alts.alts) } + | "|" alts NEWLINE { Rhs(alts.alts) } + + alt[Alt]: + | items '$' action { Alt(items + [NamedItem(None, NameLeaf('ENDMARKER'))], action=action) } + | items '$' { Alt(items + [NamedItem(None, NameLeaf('ENDMARKER'))], action=None) } + | items action { Alt(items, action=action) } + | items { Alt(items, action=None) } + + items[NamedItemList]: + | named_item items { [named_item] + items } + | named_item { [named_item] } + + named_item[NamedItem]: + | NAME '=' ~ item {NamedItem(name.string, item)} + | item {NamedItem(None, item)} + | it=lookahead {NamedItem(None, it)} + + lookahead[LookaheadOrCut]: + | '&' ~ atom {PositiveLookahead(atom)} + | '!' ~ atom {NegativeLookahead(atom)} + | '~' {Cut()} + + item[Item]: + | '[' ~ alts ']' {Opt(alts)} + | atom '?' {Opt(atom)} + | atom '*' {Repeat0(atom)} + | atom '+' {Repeat1(atom)} + | sep=atom '.' node=atom '+' {Gather(sep, node)} + | atom {atom} + + atom[Plain]: + | '(' ~ alts ')' {Group(alts)} + | NAME {NameLeaf(name.string) } + | STRING {StringLeaf(string.string)} + + # Mini-grammar for the actions + + action[str]: "{" ~ target_atoms "}" { target_atoms } + + target_atoms[str]: + | target_atom target_atoms { target_atom + " " + target_atoms } + | target_atom { target_atom } + + target_atom[str]: + | "{" ~ target_atoms "}" { "{" + target_atoms + "}" } + | NAME { name.string } + | NUMBER { number.string } + | STRING { string.string } + | "?" { "?" } + | ":" { ":" } + + +Grammar Expressions +~~~~~~~~~~~~~~~~~~~ + +``# comment`` +''''''''''''' + +Python-style comments. + +``e1 e2`` +''''''''' + +Match e1, then match e2. + +:: + + rule_name: first_rule second_rule + +.. _e1-e2-1: + +``e1 | e2`` +''''''''''' + +Match e1 or e2. + +The first alternative can also appear on the line after the rule name +for formatting purposes. In that case, a \| must be used before the +first alternative, like so: + +:: + + rule_name[return_type]: + | first_alt + | second_alt + +``( e )`` +''''''''' + +Match e. + +:: + + rule_name: (e) + +A slightly more complex and useful example includes using the grouping +operator together with the repeat operators: + +:: + + rule_name: (e1 e2)* + +``[ e ] or e?`` +''''''''''''''' + +Optionally match e. + +:: + + rule_name: [e] + +A more useful example includes defining that a trailing comma is +optional: + +:: + + rule_name: e (',' e)* [','] + +.. _e-1: + +``e*`` +'''''' + +Match zero or more occurrences of e. + +:: + + rule_name: (e1 e2)* + +.. _e-2: + +``e+`` +'''''' + +Match one or more occurrences of e. + +:: + + rule_name: (e1 e2)+ + +``s.e+`` +'''''''' + +Match one or more occurrences of e, separated by s. The generated parse +tree does not include the separator. This is otherwise identical to +``(e (s e)*)``. + +:: + + rule_name: ','.e+ + +.. _e-3: + +``&e`` +'''''' + +Succeed if e can be parsed, without consuming any input. + +.. _e-4: + +``!e`` +'''''' + +Fail if e can be parsed, without consuming any input. + +An example taken from the proposed Python grammar specifies that a primary +consists of an atom, which is not followed by a ``.`` or a ``(`` or a +``[``: + +:: + + primary: atom !'.' !'(' !'[' + +.. _e-5: + +``~`` +'''''' + +Commit to the current alternative, even if it fails to parse. + +:: + + rule_name: '(' ~ some_rule ')' | some_alt + +In this example, if a left parenthesis is parsed, then the other +alternative won’t be considered, even if some_rule or ‘)’ fail to be +parsed. + +Variables in the Grammar +~~~~~~~~~~~~~~~~~~~~~~~~ + +A subexpression can be named by preceding it with an identifier and an +``=`` sign. The name can then be used in the action (see below), like this: :: + + rule_name[return_type]: '(' a=some_other_rule ')' { a } + +--------------- +Grammar actions +--------------- + +To avoid the intermediate steps that obscure the relationship between the +grammar and the AST generation the proposed PEG parser allows directly generating +AST nodes for a rule via grammar actions. Grammar actions are C expressions that +are evaluated when a grammar rule is successfully parsed. This allows to directly +describe how the AST is composed in the grammar itself, making it more clear and +maintainable. This AST generation process is supported by the use of some helper +functions that factor out common AST object manipulations and some other required +operations that are not directly related to the grammar. + +To indicate these actions each alternative can be followed by a the action code +inside curly-braces, which specifies the return value of the alternative::: + + rule_name[return_type]: + | first_alt1 first_alt2 { first_alt1 } + | second_alt1 second_alt2 { second_alt1 } + +If the action is omitted and C code is being generated, then there are two +different possibilities: 1. If there’s a single name in the alternative, this gets +returned. 2. If not, a dummy name object gets returned (this case should be avoided). + +If Python code is being generated, then a list with all the parsed +expressions get returned if no action is specified (this is meant for debugging). + +As an illustrative example this simple grammar file allows to directly generate a full +parser that can parse simple aritmetic expressions and that returns a valid Python AST: + +:: + + start[mod_ty]: a=stmt* $ { Module(a, NULL, p->arena) } + stmt[stmt_ty]: a=expr_stmt { a } + expr_stmt[stmt_ty]: a=expression NEWLINE { _Py_Expr(a, EXTRA) } + expression[expr_ty]: ( l=expression '+' r=term { _Py_BinOp(l, Add, r, EXTRA) } + | l=expression '-' r=term { _Py_BinOp(l, Sub, r, EXTRA) } + | t=term { t } + ) + term[expr_ty]: ( l=term '*' r=factor { _Py_BinOp(l, Mult, r, EXTRA } + | l=term '/' r=factor { _Py_BinOp(l, Div, r, EXTRA) } + | f=factor { f } + ) + factor[expr_ty]: ('(' e=expression ')' { e } + | a=atom { a } + ) + atom[expr_ty]: ( n=NAME { n } + | n=NUMBER { n } + | s=STRING { s } + ) + +here ``EXTRA`` is a macro that expands to ``start_lineno, start_col_offset, +end_lineno, end_col_offset, p->arena``, being the values for this variables +automatically injected by the parser; ``p`` points to an object +that holds on to all state for the parser. + +============== +Migration plan +============== + +This section describes the migration plan when porting to the new PEG-based parser +if this PEP is accepted. The migration will be executed in a series of steps that allow +initially to fallback to the previous parser if needed: + +1. Before Python 3.9 beta 1, include the new PEG-based parser machinery in CPython + with a command-line flag and environment variable that allows switching between + the new and the old parsers together with explicit APIs that allow invoking the + new and the old parsers independently. At this step, all Python APIs like ``ast.parse`` + and ``compile`` will use the parser set by the flags or the environment variable and + the default parser will be the current parser. + +2. After Python 3.9 Beta 1 the default parser will be the new parser. + +3. Between Python 3.9 and Python 3.10, the old parser and related code (like the + "parser" module) will be kept until a new Python release happens (Python 3.10). In + the meanwhile and until the old parser is removed, **no new Python Grammar + addition will be added that requires the peg parser**. This means that the grammar + will be kept LL(1) until the old parser is removed. + +4. In Python 3.10, remove the old parser, the command-line flag, the environment + variable and the "parser" module and related code. + +========================== +Performance and validation +========================== + +We have done extensive timing and validation of the new parser, and +this gives us confidence that the new parser is of high enough quality +to replace the current parser. + +---------- +Validation +---------- + +To start with validation, we regularly compile the entire Python 3.8 +stdlib and compare every aspect of the resulting AST with that +produced by the standard compiler. (In the process we found a few bugs +in the standard parser's treatment of line and column numbers, which +we have all fixed upstream via a series of issues and PRs.) + +We have also occasionally compiled a much larger codebase (the 100 +most popular packages on PyPI) and this has helped us find a (very) +few additional bugs in the new parser. + +(One area we have not explored extensively is rejection of all wrong +programs. We have unit tests that check for a certain number of +explicit rejections, but more work could be done, e.g. by using a +fuzzer that inserts random subtle bugs into existing code. We're open +to help in this area.) + +----------- +Performance +----------- + +We have tuned the performance of the new parser to come within 10% of +the current parser both in speed and memory consumption. While the +PEG/packrat parsing algorithm inherently consumes more memory than the +current LL(1) parser, we have an advantage because we don't construct +an intermediate CST. + +Below are some benchmarks. These are focused on compiling source code +to bytecode, because this is the most realistic situation. Returning +an AST to Python code is not as representative, because the process to +convert the *internal* AST (only accessible to C code) to an +*external* AST (an instance of ``ast.AST``) takes more time than the +parser itself. + +All measurements reported here are done on a recent MacBook Pro, +taking the median of three runs. No particular care was taken to stop +other applications running on the same machine. + +The first timings are for our canonical test file, which has 100,000 +lines endlessly repeating the following three lines:: + + 1 + 2 + 4 + 5 + 6 + 7 + 8 + 9 + 10 + ((((((11 * 12 * 13 * 14 * 15 + 16 * 17 + 18 * 19 * 20)))))) + 2*3 + 4*5*6 + 12 + (2 * 3 * 4 * 5 + 6 + 7 * 8) + +- Just parsing and throwing away the internal AST takes 1.16 seconds + with a max RSS of 681 MiB. + +- Parsing and converting to ``ast.AST`` takes 6.34 seconds, max RSS + 1029 MiB. + +- Parsing and compiling to bytecode takes 1.28 seconds, max RSS 681 + MiB. + +- With the current parser, parsing and compiling takes 1.44 seconds, + max RSS 836 MiB. + +For this particular test file, the new parser is faster and uses less +memory than the current parser (compare the last two bullets). + +We also did timings with a more realistic payload, the entire Python +3.8 stdlib. This payload consists of 1,641 files, 749,570 lines, +27,622,497 bytes. (Though 11 files can't be compiled by any Python 3 +parser due to encoding issues, sometimes intentional.) + +- Compiling and throwing away the internal AST took 2.141 seconds. + That's 350,040 lines/sec, or 12,899,367 bytes/sec. The max RSS was + 74 MiB (the largest file in the stdlib is much smaller than out + canonical test file). + +- Compiling to bytecode took 3.290 seconds. That's 227,861 lines/sec, + or 8,396,942 bytes/sec. Max RSS 77 MiB. + +- Compiling to bytecode using the current parser took 3.367 seconds. + That's 222,620 lines/sec, or 8,203,780 bytes/sec. Max RSS 70 MiB. + +Comparing the last two bullets we find that the new parser is slightly +faster but uses slightly (about 10%) more memory. We believe this is +acceptable. (Also, there are probably some more tweaks we can make to +reduce memory usage.) + +========== +References +========== + +.. [#GUIDO_PEG] + Guido's series on PEG parsing + https://medium.com/@gvanrossum_83706/peg-parsing-series-de5d41b2ed60 + +.. [1] Ford, Bryan + http://pdos.csail.mit.edu/~baford/packrat/thesis + +.. [2] Medeiros et al. + https://arxiv.org/pdf/1509.02439v1.pdf + +.. [3] Warth et al. + http://web.cs.ucla.edu/~todd/research/pepm08.pdf + + +========= +Copyright +========= + +This document has been placed in the public domain.