PEP: 0617
Title: New PEG parser for CPython
Version: $Revision$
Last-Modified: $Date$
Author: Guido van Rossum <guido@python.org>,
        Pablo Galindo <pablogsal@python.org>,
        Lysandros Nikolaou <lisandrosnik@gmail.com>
Discussions-To: Python-Dev <python-dev@python.org>
Status: Accepted
Type: Standards Track
Content-Type: text/x-rst
Created: 24-March-2020
Python-Version: 3.9
Post-History: 02-Apr-2020

========
Overview
========

This PEP proposes replacing the current LL(1)-based parser of CPython
with a new PEG-based parser. This new parser would allow the elimination of multiple
"hacks" that exist in the current grammar to circumvent the LL(1)-limitation.
It would substantially reduce the maintenance costs in some areas related to the
compiling pipeline such as the grammar, the parser and the AST generation. The new PEG
parser will also lift the LL(1) restriction on the current Python grammar.

===========================
Background on LL(1) parsers
===========================

The current Python grammar is an LL(1)-based grammar. A grammar can be said to be
LL(1) if it can be parsed by an LL(1) parser, which in turn is defined as a
top-down parser that parses the input from left to right, performing leftmost
derivation of the sentence, with just one token of lookahead.
The traditional approach to constructing or generating an LL(1) parser is to
produce a *parse table* which encodes the possible transitions between all possible
states of the parser. These tables are normally constructed from the *first sets*
and the *follow sets* of the grammar:

* Given a rule, the *first set* is the collection of all terminals that can occur
  first in a full derivation of that rule. Intuitively, this helps the parser decide
  among the alternatives in a rule. For
  instance, given the rule::

      rule: A | B

  if only ``A`` can start with the terminal *a* and only ``B`` can start with the
  terminal *b* and the parser sees the token *b* when parsing this rule, it knows
  that it needs to follow the non-terminal ``B``.

* An extension to this simple idea is needed when a rule may expand to the empty string.
  Given a rule, the *follow set* is the collection of terminals that can appear
  immediately to the right of that rule in a partial derivation. Intuitively, this
  solves the problem of the empty alternative. For instance,
  given this rule::

      rule: A 'b'

  if the parser has the token *b* and the non-terminal ``A`` can only start
  with the token *a*, then the parser can tell that this is an invalid program.
  But if ``A`` could expand to the empty string (called an ε-production),
  then the parser would recognise a valid empty ``A``,
  since the next token *b* is in the *follow set* of ``A``.
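
As a rough illustration of how *first sets* can be computed, the following Python
sketch iterates to a fixed point over a toy grammar. The grammar encoding (a
dictionary mapping rule names to lists of alternatives) and the name
``first_sets`` are assumptions made for this example only; this is not the code
used by CPython's parser generator.

.. code-block:: python

    EPSILON = ""  # marker used in this sketch for the empty string

    def first_sets(grammar):
        """Compute first sets by iterating until nothing changes.

        ``grammar`` maps each non-terminal to a list of alternatives, and each
        alternative is a list of symbols.  Symbols that are not keys of
        ``grammar`` are treated as terminals; an empty alternative ``[]``
        stands for an epsilon-production.
        """
        first = {name: set() for name in grammar}
        changed = True
        while changed:
            changed = False
            for name, alternatives in grammar.items():
                for alt in alternatives:
                    before = len(first[name])
                    nullable_prefix = True
                    for symbol in alt:
                        if symbol not in grammar:  # terminal: it starts the alt
                            first[name].add(symbol)
                            nullable_prefix = False
                            break
                        first[name] |= first[symbol] - {EPSILON}
                        if EPSILON not in first[symbol]:
                            nullable_prefix = False
                            break
                    if nullable_prefix:  # every symbol may be empty
                        first[name].add(EPSILON)
                    if len(first[name]) != before:
                        changed = True
        return first

    # The ``rule: A | B`` example from the text, with A -> 'a' and B -> 'b'.
    toy = {"rule": [["A"], ["B"]], "A": [["a"]], "B": [["b"]]}
    toy_sets = first_sets(toy)

    # The ``rule: A 'b'`` example, where A may also be empty.
    nullable = {"rule": [["A", "b"]], "A": [["a"], []]}
    nullable_sets = first_sets(nullable)

In the second toy grammar the terminal ``b`` ends up in the first set of
``rule`` only because ``A`` can be empty, which is the situation the *follow
sets* are meant to handle.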

The current Python grammar does not contain ε-productions, so the *follow sets* are not
needed when creating the parse tables. Currently, in CPython, a parser generator
program reads the grammar and produces a parsing table representing a set of
deterministic finite automata (DFA) that can be included in a C program, the
parser. The parser is a pushdown automaton that uses this data to produce a Concrete
Syntax Tree (CST), sometimes known directly as a "parse tree". In this process, the
*first sets* are used indirectly when generating the DFAs.

LL(1) parsers and grammars are usually efficient and simple to implement
and generate. However, it is not possible, under the LL(1) restriction,
to express certain common constructs in a way natural to the language
designer and the reader. This includes some in the Python language.

As LL(1) parsers can only look one token ahead to distinguish
possibilities, some rules in the grammar may be ambiguous. For instance the rule::

    rule: A | B

is ambiguous if the *first sets* of both ``A`` and ``B`` have some elements in
common. When the parser sees a token in the input
program that both ``A`` and ``B`` can start with, it is impossible for it to deduce
which option to expand, as no further token of the program can be examined to
disambiguate.
The rule may be transformed to equivalent LL(1) rules, but then it may
be harder for a human reader to grasp its meaning.
Examples later in this document show that the current LL(1)-based
grammar suffers a lot from this scenario.

Another broad class of rules precluded by LL(1) is left-recursive rules.
A rule is left-recursive if it can derive to a
sentential form with itself as the leftmost symbol. For instance this rule::

    rule: rule 'a'

is left-recursive because the rule can be expanded to an expression that starts
with itself. As will be described later, left-recursion is the natural way to
express certain desired language properties directly in the grammar.

=========================
Background on PEG parsers
=========================

A PEG (Parsing Expression Grammar) grammar differs from a context-free grammar
(like the current one) in the fact that the way it is written more closely
reflects how the parser will operate when parsing it. The fundamental technical
difference is that the choice operator is ordered. This means that when writing::

    rule: A | B | C

a context-free-grammar parser (like an LL(1) parser) will generate constructions
that given an input string will *deduce* which alternative (``A``, ``B`` or ``C``)
must be expanded, while a PEG parser will check if the first alternative succeeds
and only if it fails, will it continue with the second or the third one in the
order in which they are written. This makes the choice operator not commutative.

Unlike LL(1) parsers, PEG-based parsers cannot be ambiguous: if a string parses,
it has exactly one valid parse tree. This means that a PEG-based parser cannot
suffer from the ambiguity problems described in the previous section.

PEG parsers are usually constructed as a recursive descent parser in which every
rule in the grammar corresponds to a function in the program implementing the
parser and the parsing expression (the "expansion" or "definition" of the rule)
represents the "code" in said function. Each parsing function conceptually takes
an input string as its argument, and yields one of the following results:

* A "success" result. This result indicates that the expression can be parsed by
  that rule and the function may optionally move forward or consume one or more
  characters of the input string supplied to it.
* A "failure" result, in which case no input is consumed.

Notice that "failure" results do not imply that the program is incorrect or that
parsing has failed: since the choice operator is ordered, a "failure" result
merely indicates "try the following option". A direct implementation of a PEG
parser as a recursive descent parser will present exponential time performance in
the worst case as compared with LL(1) parsers, because PEG parsers have infinite lookahead
(this means that they can consider an arbitrary number of tokens before deciding
for a rule). Usually, PEG parsers avoid this exponential time complexity with a
technique called "packrat parsing" [1]_ which not only loads the entire
program in memory before parsing it but also allows the parser to backtrack
arbitrarily. This is made efficient by memoizing the rules already matched for
each position. The cost of the memoization cache is that the parser will naturally
use more memory than a simple LL(1) parser, which is normally table-based. We
will explain later in this document why we consider this cost acceptable.
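
The ideas of ordered choice, backtracking and memoization can be illustrated
with a minimal Python sketch. It assumes a hypothetical toy grammar
(``start: item 'x' | item 'y'`` and ``item: 'a'``) over a list of string tokens;
the class ``ToyParser``, the ``memoize`` decorator and the counter
``item_evaluations`` are names invented for this example and are not part of the
proposed parser.

.. code-block:: python

    import functools

    def memoize(method):
        """Cache (result, end position) for each (rule, start position)."""
        @functools.wraps(method)
        def wrapper(self):
            key = (method.__name__, self.pos)
            if key not in self.memo:
                result = method(self)
                self.memo[key] = (result, self.pos)
            result, self.pos = self.memo[key]
            return result
        return wrapper

    class ToyParser:
        """Recursive descent parser for: start: item 'x' | item 'y'; item: 'a'."""

        def __init__(self, tokens):
            self.tokens = tokens
            self.pos = 0
            self.memo = {}
            self.item_evaluations = 0  # how often the body of item() really ran

        def expect(self, token):
            if self.pos < len(self.tokens) and self.tokens[self.pos] == token:
                self.pos += 1
                return token
            return None

        @memoize
        def start(self):
            mark = self.pos
            # First alternative: item 'x'
            if (item := self.item()) is not None and self.expect("x"):
                return [item, "x"]
            self.pos = mark  # a "failure" consumes no input
            # Second alternative, tried only because the first one failed.
            if (item := self.item()) is not None and self.expect("y"):
                return [item, "y"]
            self.pos = mark
            return None

        @memoize
        def item(self):
            self.item_evaluations += 1
            return self.expect("a")

    parser = ToyParser(["a", "y"])
    tree = parser.start()

When parsing ``a y`` the first alternative fails after ``item`` has already been
matched; the second alternative then reuses the cached result for ``item`` at
position 0 instead of parsing it again.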

=========
Rationale
=========

In this section, we describe a list of problems that are present in the current parser
machinery in CPython that motivate the need for a new parser.

---------------------------------
Some rules are not actually LL(1)
---------------------------------

Although the Python grammar is technically an LL(1) grammar (because it is parsed by
an LL(1) parser) several rules are not LL(1) and several workarounds are
implemented in the grammar and in other parts of CPython to deal with this. For
example, consider the rule for assignment expressions::

    namedexpr_test: NAME [':=' test]

This simple rule is not compatible with the Python grammar as *NAME* is among the
elements of the *first set* of the rule *test*. To work around this limitation the
actual rule that appears in the current grammar is::

    namedexpr_test: test [':=' test]

This is a much broader rule than the previous one, allowing constructs like
``[x for x in y] := [1,2,3]``. The way the rule is limited to its desired form is by
disallowing these unwanted constructions when transforming the parse tree to the
abstract syntax tree. This is not only inelegant but a considerable maintenance
burden as it forces the AST creation routines and the compiler into a situation in
which they need to know how to separate valid programs from invalid programs,
which should be a responsibility solely of the parser. This also leads to the
actual grammar file not reflecting correctly what the *actual* grammar is (that
is, the collection of all valid Python programs).
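
Whichever stage performs the rejection, its effect is observable from Python
code. The following sketch assumes Python 3.8 or later (where assignment
expressions exist); the helper ``compiles`` is a name invented for this example.

.. code-block:: python

    def compiles(source):
        """Return True if ``source`` is accepted as a Python program."""
        try:
            compile(source, "<example>", "exec")
        except SyntaxError:
            return False
        return True

    # A NAME on the left-hand side is the intended form of the rule.
    accepted = compiles("(x := [1, 2, 3])")

    # The broader ``test [':=' test]`` rule matches this, but it is rejected.
    rejected = compiles("([x for x in y] := [1, 2, 3])")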

Similar workarounds appear in multiple other rules of the current grammar.
Sometimes this problem is unsolvable. For instance, `bpo-12782: Multiple context
expressions do not support parentheses for continuation across lines
<http://bugs.python.org/issue12782>`_ shows how making an LL(1) rule that supports
writing::

    with (
        open("a_really_long_foo") as foo,
        open("a_really_long_baz") as baz,
        open("a_really_long_bar") as bar
    ):
        ...

is not possible since the first sets of the grammar items that can
appear as context managers include the open parenthesis, making the rule
ambiguous. This rule is not only consistent with other parts of the language (like
the rule for multiple imports), but is also very useful to auto-formatting tools,
as parenthesized groups are normally used to group elements to be
formatted together (in the same way the tools operate on the contents of lists,
sets...).

-----------------------
Complicated AST parsing
-----------------------

Another problem of the current parser is that there is a huge coupling between the
AST generation routines and the particular shape of the produced parse trees. This
makes the code for generating the AST especially complicated as many actions and
choices are implicit. For instance, the AST generation code knows what
alternatives of a certain rule are produced based on the number of child nodes
present in a given parse node. This makes the code difficult to follow as this
property is not directly related to the grammar file and is influenced by
implementation details. As a result of this, a considerable amount of the AST
generation code needs to deal with inspecting and reasoning about the particular
shape of the parse trees that it receives.

----------------------
Lack of left recursion
----------------------

As described previously, a limitation of LL(1) grammars is that they cannot allow
left-recursion. This makes writing some rules very unnatural and far from how
programmers normally think about the program. For instance this construct (a simpler
variation of several rules present in the current grammar)::

    expr: expr '+' term | term

cannot be parsed by an LL(1) parser. The traditional remedy is to rewrite the
grammar to circumvent the problem::

    expr: term ('+' term)*

The problem that appears with this form is that the parse tree is forced to have a
very unnatural shape. This is because with this rule, for the input program
``a + b + c`` the parse tree will be flattened (``['a', '+', 'b', '+', 'c']``) and must
be post-processed to construct a left-recursive parse tree
(``[['a', '+', 'b'], '+', 'c']``). Being forced to write the second rule not only leads to the parse
tree not correctly reflecting the desired associativity, but also imposes further
pressure on later compilation stages to detect and post-process these cases.
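
The post-processing step mentioned above can be pictured with a small Python
sketch. The function ``fold_left`` is a hypothetical helper written for this
illustration, operating on the list representation used in the text; it is not
taken from CPython's AST generation code.

.. code-block:: python

    def fold_left(flat):
        """Rebuild a left-nested tree from a flattened operand/operator list."""
        tree = flat[0]
        # Walk the remaining (operator, operand) pairs from left to right.
        for i in range(1, len(flat), 2):
            operator, operand = flat[i], flat[i + 1]
            tree = [tree, operator, operand]
        return tree

    nested = fold_left(["a", "+", "b", "+", "c"])

With a left-recursive rule the parser would produce the nested shape directly
and no such helper would be needed.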

-----------------------
Intermediate parse tree
-----------------------

The last problem present in the current parser is the intermediate creation of a
parse tree or Concrete Syntax Tree that is later transformed to an Abstract Syntax
Tree. Although the construction of a CST is very common in parser and compiler
pipelines, in CPython this intermediate CST is not used by anything else (it is
only indirectly exposed by the *parser* module and a surprisingly small part of
the code in the CST production is reused in the module). What is worse, the whole
tree is kept in memory, keeping many branches that consist of chains of nodes with
a single child. This has been shown to consume a considerable amount of memory (for
instance in `bpo-26415: Excessive peak memory consumption by the Python
parser <https://bugs.python.org/issue26415>`_).

Having to produce an intermediate result between the grammar and the AST is not only
undesirable but also makes the AST generation step much more complicated,
considerably raising the maintenance burden.

===========================
The new proposed PEG parser
===========================

The new proposed PEG parser contains the following pieces:

* A parser generator that can read a grammar file and produce a PEG parser
  written in Python or C that can parse said grammar.

* A PEG meta-grammar that automatically generates a Python parser that is used
  for the parser generator itself (this means that there are no manually-written
  parsers).

* A generated parser (using the parser generator) that can directly produce C and
  Python AST objects.

--------------
Left recursion
--------------

PEG parsers normally do not support left recursion but we have implemented a
technique similar to the one described in Medeiros et al. [2]_ but using the
memoization cache instead of static variables. This approach is closer to the one
described in Warth et al. [3]_. This allows us to write not only simple left-recursive
rules but also more complicated rules that involve indirect left-recursion like::

    rule1: rule2 | 'a'
    rule2: rule3 | 'b'
    rule3: rule1 | 'c'

and "hidden left-recursion" like::

    rule: 'optional'? rule '@' some_other_rule
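
The general idea of supporting left recursion through the memoization cache can
be sketched in Python as follows. This is only an illustration of "growing" a
cached result for the simple rule ``expr: expr '+' term | term``; the decorator
``memoize_left_rec`` and the class ``ExprParser`` are names assumed for this
example, the sketch handles direct left recursion only, and it should not be
read as the actual implementation used by the proposed parser.

.. code-block:: python

    def memoize_left_rec(method):
        """Memoize a directly left-recursive rule by growing a seed result."""
        def wrapper(self):
            key = (method.__name__, self.pos)
            if key in self.memo:
                result, self.pos = self.memo[key]
                return result
            start = self.pos
            # Prime the cache with a failure so the recursive call terminates.
            last_result, last_pos = None, start
            self.memo[key] = (last_result, last_pos)
            while True:
                self.pos = start
                result = method(self)
                end = self.pos
                # Stop as soon as a new attempt does not consume more input.
                if result is None or end <= last_pos:
                    break
                last_result, last_pos = result, end
                self.memo[key] = (last_result, last_pos)
            self.pos = last_pos
            return last_result
        return wrapper

    class ExprParser:
        """Toy parser for: expr: expr '+' term | term; term: any non-'+' token."""

        def __init__(self, tokens):
            self.tokens = tokens
            self.pos = 0
            self.memo = {}

        def expect(self, token):
            if self.pos < len(self.tokens) and self.tokens[self.pos] == token:
                self.pos += 1
                return token
            return None

        def term(self):
            if self.pos < len(self.tokens) and self.tokens[self.pos] != "+":
                self.pos += 1
                return self.tokens[self.pos - 1]
            return None

        @memoize_left_rec
        def expr(self):
            mark = self.pos
            # First alternative: expr '+' term
            if ((left := self.expr()) is not None
                    and self.expect("+")
                    and (right := self.term()) is not None):
                return [left, "+", right]
            self.pos = mark
            # Second alternative: term
            if (single := self.term()) is not None:
                return single
            self.pos = mark
            return None

    expr_parser = ExprParser(["a", "+", "b", "+", "c"])
    expr_tree = expr_parser.expr()

Each pass through the loop re-runs the rule with a longer cached result for the
recursive call, so the resulting tree is left-nested without any
post-processing.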

------
Syntax
------

The grammar consists of a sequence of rules of the form::

    rule_name: expression

Optionally, a type can be included right after the rule name, which
specifies the return type of the C or Python function corresponding to
the rule::

    rule_name[return_type]: expression

If the return type is omitted, then a ``void *`` is returned in C and an
``Any`` in Python.

Grammar Expressions
~~~~~~~~~~~~~~~~~~~

``# comment``
'''''''''''''

Python-style comments.

``e1 e2``
'''''''''

Match e1, then match e2.

::

    rule_name: first_rule second_rule

.. _e1-e2-1:

``e1 | e2``
'''''''''''

Match e1 or e2.

The first alternative can also appear on the line after the rule name
for formatting purposes. In that case, a \| must be used before the
first alternative, like so:

::

    rule_name[return_type]:
        | first_alt
        | second_alt

``( e )``
'''''''''

Match e.

::

    rule_name: (e)

A slightly more complex and useful example includes using the grouping
operator together with the repeat operators:

::

    rule_name: (e1 e2)*

``[ e ] or e?``
'''''''''''''''

Optionally match e.

::

    rule_name: [e]

A more useful example includes defining that a trailing comma is
optional:

::

    rule_name: e (',' e)* [',']

.. _e-1:

``e*``
''''''

Match zero or more occurrences of e.

::

    rule_name: (e1 e2)*

.. _e-2:

``e+``
''''''

Match one or more occurrences of e.

::

    rule_name: (e1 e2)+

``s.e+``
''''''''

Match one or more occurrences of e, separated by s. The generated parse
tree does not include the separator. This is otherwise identical to
``(e (s e)*)``.

::

    rule_name: ','.e+

.. _e-3:

``&e``
''''''

Succeed if e can be parsed, without consuming any input.

.. _e-4:

``!e``
''''''

Fail if e can be parsed, without consuming any input.

An example taken from the proposed Python grammar specifies that a primary
consists of an atom, which is not followed by a ``.`` or a ``(`` or a
``[``:

::

    primary: atom !'.' !'(' !'['
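
One possible way to picture the two lookahead operators is as helpers that run
a parsing function and then restore the input position regardless of the
outcome. The Python sketch below applies this to the ``primary`` rule above over
a list of string tokens; the class ``LookaheadParser`` and its method names are
assumptions made for this illustration and do not describe the generated parser.

.. code-block:: python

    class LookaheadParser:
        """Toy parser for: primary: atom !'.' !'(' !'['"""

        def __init__(self, tokens):
            self.tokens = tokens
            self.pos = 0

        def expect(self, token):
            if self.pos < len(self.tokens) and self.tokens[self.pos] == token:
                self.pos += 1
                return token
            return None

        def atom(self):
            # In this sketch an atom is any identifier-like token.
            if self.pos < len(self.tokens) and self.tokens[self.pos].isidentifier():
                self.pos += 1
                return self.tokens[self.pos - 1]
            return None

        def positive_lookahead(self, func, *args):
            """``&e``: succeed if ``e`` parses, but never consume input."""
            mark = self.pos
            ok = func(*args) is not None
            self.pos = mark
            return ok

        def negative_lookahead(self, func, *args):
            """``!e``: succeed if ``e`` does not parse; never consume input."""
            return not self.positive_lookahead(func, *args)

        def primary(self):
            mark = self.pos
            atom = self.atom()
            if atom is not None and all(
                self.negative_lookahead(self.expect, token) for token in ".(["
            ):
                return atom
            self.pos = mark
            return None

    plain = LookaheadParser(["x", "+"])
    plain_result = plain.primary()

    call = LookaheadParser(["x", "("])
    call_result = call.primary()

In the first case only the atom is consumed and the ``+`` is left for the
caller; in the second case the rule fails and the position is reset.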

.. _e-5:

``~``
''''''

Commit to the current alternative, even if it fails to parse.

::

    rule_name: '(' ~ some_rule ')' | some_alt

In this example, if a left parenthesis is parsed, then the other
alternative won't be considered, even if ``some_rule`` or ``')'`` fail to be
parsed.

Variables in the Grammar
~~~~~~~~~~~~~~~~~~~~~~~~

A subexpression can be named by preceding it with an identifier and an
``=`` sign. The name can then be used in the action (see below), like this::

    rule_name[return_type]: '(' a=some_other_rule ')' { a }

---------------
Grammar actions
---------------

To avoid the intermediate steps that obscure the relationship between the
grammar and the AST generation the proposed PEG parser allows directly
generating AST nodes for a rule via grammar actions. Grammar actions are
language-specific expressions that are evaluated when a grammar rule is
successfully parsed. These expressions can be written in Python or C
depending on the desired output of the parser generator. This means that if
one would want to generate a parser in Python and another in C, two grammar
files should be written, each one with a different set of actions, keeping
everything else apart from said actions identical in both files. As an
example of a grammar with Python actions, the piece of the parser generator
that parses grammar files is bootstrapped from a meta-grammar file with
Python actions that generate the grammar tree as a result of the parsing.

In the specific case of the new proposed PEG grammar for Python, having
actions allows the composition of the AST to be described directly in the grammar
itself, making it clearer and more maintainable. This AST generation process is
supported by the use of some helper functions that factor out common AST
object manipulations and some other required operations that are not directly
related to the grammar.

To indicate these actions each alternative can be followed by the action code
inside curly-braces, which specifies the return value of the alternative::

    rule_name[return_type]:
        | first_alt1 first_alt2 { first_alt1 }
        | second_alt1 second_alt2 { second_alt1 }

If the action is omitted and C code is being generated, then there are two
different possibilities:

1. If there's a single name in the alternative, this gets returned.
2. If not, a dummy name object gets returned (this case should be avoided).

If the action is omitted and Python code is being generated, then a list
with all the parsed expressions gets returned (this is meant for debugging).

The full meta-grammar for the grammars supported by the PEG generator is:

::

    start[Grammar]: grammar ENDMARKER { grammar }

    grammar[Grammar]:
        | metas rules { Grammar(rules, metas) }
        | rules { Grammar(rules, []) }

    metas[MetaList]:
        | meta metas { [meta] + metas }
        | meta { [meta] }

    meta[MetaTuple]:
        | "@" NAME NEWLINE { (name.string, None) }
        | "@" a=NAME b=NAME NEWLINE { (a.string, b.string) }
        | "@" NAME STRING NEWLINE { (name.string, literal_eval(string.string)) }

    rules[RuleList]:
        | rule rules { [rule] + rules }
        | rule { [rule] }

    rule[Rule]:
        | rulename ":" alts NEWLINE INDENT more_alts DEDENT {
              Rule(rulename[0], rulename[1], Rhs(alts.alts + more_alts.alts)) }
        | rulename ":" NEWLINE INDENT more_alts DEDENT { Rule(rulename[0], rulename[1], more_alts) }
        | rulename ":" alts NEWLINE { Rule(rulename[0], rulename[1], alts) }

    rulename[RuleName]:
        | NAME '[' type=NAME '*' ']' {(name.string, type.string+"*")}
        | NAME '[' type=NAME ']' {(name.string, type.string)}
        | NAME {(name.string, None)}

    alts[Rhs]:
        | alt "|" alts { Rhs([alt] + alts.alts)}
        | alt { Rhs([alt]) }

    more_alts[Rhs]:
        | "|" alts NEWLINE more_alts { Rhs(alts.alts + more_alts.alts) }
        | "|" alts NEWLINE { Rhs(alts.alts) }

    alt[Alt]:
        | items '$' action { Alt(items + [NamedItem(None, NameLeaf('ENDMARKER'))], action=action) }
        | items '$' { Alt(items + [NamedItem(None, NameLeaf('ENDMARKER'))], action=None) }
        | items action { Alt(items, action=action) }
        | items { Alt(items, action=None) }

    items[NamedItemList]:
        | named_item items { [named_item] + items }
        | named_item { [named_item] }

    named_item[NamedItem]:
        | NAME '=' ~ item {NamedItem(name.string, item)}
        | item {NamedItem(None, item)}
        | it=lookahead {NamedItem(None, it)}

    lookahead[LookaheadOrCut]:
        | '&' ~ atom {PositiveLookahead(atom)}
        | '!' ~ atom {NegativeLookahead(atom)}
        | '~' {Cut()}

    item[Item]:
        | '[' ~ alts ']' {Opt(alts)}
        | atom '?' {Opt(atom)}
        | atom '*' {Repeat0(atom)}
        | atom '+' {Repeat1(atom)}
        | sep=atom '.' node=atom '+' {Gather(sep, node)}
        | atom {atom}

    atom[Plain]:
        | '(' ~ alts ')' {Group(alts)}
        | NAME {NameLeaf(name.string) }
        | STRING {StringLeaf(string.string)}

    # Mini-grammar for the actions

    action[str]: "{" ~ target_atoms "}" { target_atoms }

    target_atoms[str]:
        | target_atom target_atoms { target_atom + " " + target_atoms }
        | target_atom { target_atom }

    target_atom[str]:
        | "{" ~ target_atoms "}" { "{" + target_atoms + "}" }
        | NAME { name.string }
        | NUMBER { number.string }
        | STRING { string.string }
        | "?" { "?" }
        | ":" { ":" }

As an illustrative example, this simple grammar file allows directly
generating a full parser that can parse simple arithmetic expressions and that
returns a valid C-based Python AST:

::

    start[mod_ty]: a=expr_stmt* $ { Module(a, NULL, p->arena) }
    expr_stmt[stmt_ty]: a=expr NEWLINE { _Py_Expr(a, EXTRA) }
    expr[expr_ty]:
        | l=expr '+' r=term { _Py_BinOp(l, Add, r, EXTRA) }
        | l=expr '-' r=term { _Py_BinOp(l, Sub, r, EXTRA) }
        | t=term { t }

    term[expr_ty]:
        | l=term '*' r=factor { _Py_BinOp(l, Mult, r, EXTRA) }
        | l=term '/' r=factor { _Py_BinOp(l, Div, r, EXTRA) }
        | f=factor { f }

    factor[expr_ty]:
        | '(' e=expr ')' { e }
        | a=atom { a }

    atom[expr_ty]:
        | n=NAME { n }
        | n=NUMBER { n }
        | s=STRING { s }

Here ``EXTRA`` is a macro that expands to ``start_lineno, start_col_offset,
end_lineno, end_col_offset, p->arena``, those being variables automatically
injected by the parser; ``p`` points to an object that holds on to all state
for the parser.

A similar grammar written to target Python AST objects:

::

    start: expr NEWLINE? ENDMARKER { ast.Expression(expr) }
    expr:
        | expr '+' term { ast.BinOp(expr, ast.Add(), term) }
        | expr '-' term { ast.BinOp(expr, ast.Sub(), term) }
        | term { term }

    term:
        | l=term '*' r=factor { ast.BinOp(l, ast.Mult(), r) }
        | term '/' factor { ast.BinOp(term, ast.Div(), factor) }
        | factor { factor }

    factor:
        | '(' expr ')' { expr }
        | atom { atom }

    atom:
        | NAME { ast.Name(id=name.string, ctx=ast.Load()) }
        | NUMBER { ast.Constant(value=ast.literal_eval(number.string)) }
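
To make the result of such actions concrete, the following Python sketch builds
by hand the tree that the actions above would be expected to return for the
input ``1 + 2 * 3`` and then evaluates it. No parser generator is involved; the
tree is written out manually for illustration, and the file name passed to
``compile`` is arbitrary.

.. code-block:: python

    import ast

    # Hand-built equivalent of the nodes the toy actions would create for
    # "1 + 2 * 3": the multiplication binds tighter, so it is the right operand.
    tree = ast.Expression(
        ast.BinOp(
            ast.Constant(value=1),
            ast.Add(),
            ast.BinOp(ast.Constant(value=2), ast.Mult(), ast.Constant(value=3)),
        )
    )

    # The toy actions do not record line or column information, so fill it in
    # before handing the tree to the compiler.
    ast.fix_missing_locations(tree)
    value = eval(compile(tree, "<toy grammar>", "eval"))

The nesting of the ``ast.BinOp`` nodes mirrors the structure of the ``expr`` and
``term`` rules, which is what makes the grammar a direct description of the AST.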

==============
Migration plan
==============

This section describes the migration plan when porting to the new PEG-based parser
if this PEP is accepted. The migration will be executed in a series of steps that
initially allow falling back to the previous parser if needed:

1. Starting with Python 3.9 alpha 6, include the new PEG-based parser machinery in CPython
   with a command-line flag and environment variable that allows switching between
   the new and the old parsers together with explicit APIs that allow invoking the
   new and the old parsers independently. At this step, all Python APIs like ``ast.parse``
   and ``compile`` will use the parser set by the flags or the environment variable and
   the default parser will be the new PEG-based parser.

2. Between Python 3.9 and Python 3.10, the old parser and related code (like the
   "parser" module) will be kept until a new Python release happens (Python 3.10). In
   the meantime and until the old parser is removed, **no additions will be made to
   the Python grammar that require the PEG parser**. This means that the grammar
   will be kept LL(1) until the old parser is removed.

3. In Python 3.10, remove the old parser, the command-line flag, the environment
   variable and the "parser" module and related code.

==========================
Performance and validation
==========================

We have done extensive timing and validation of the new parser, and
this gives us confidence that the new parser is of high enough quality
to replace the current parser.

----------
Validation
----------

To start with validation, we regularly compile the entire Python 3.8
stdlib and compare every aspect of the resulting AST with that
produced by the standard compiler. (In the process we found a few bugs
in the standard parser's treatment of line and column numbers, all of
which we have fixed upstream via a series of issues and PRs.)
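
The comparison described above can be sketched roughly as follows. This is
a minimal illustration rather than the tooling actually used for this PEP:
the helper names ``fingerprint`` and ``mismatches`` are hypothetical, and
both parsers are assumed to be available as callables with the same
signature as ``ast.parse``.

```python
import ast
from pathlib import Path


def fingerprint(source, parse):
    """Dump the tree, including line and column attributes.

    If the parser rejects the input, return the exception type name so
    that two parsers rejecting the same file still compare equal.
    """
    try:
        return ast.dump(parse(source), include_attributes=True)
    except (SyntaxError, ValueError) as exc:
        return type(exc).__name__


def mismatches(root, parse_a=ast.parse, parse_b=ast.parse):
    """Return the .py files under root on which the two parsers disagree."""
    differing = []
    for path in sorted(Path(root).rglob("*.py")):
        # Read bytes so that encoding declarations are handled by the parser.
        source = path.read_bytes()
        if fingerprint(source, parse_a) != fingerprint(source, parse_b):
            differing.append(path)
    return differing
```

Passing ``include_attributes=True`` to ``ast.dump`` matters here, since
without it differences in line and column numbers would go unnoticed.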

We have also occasionally compiled a much larger codebase (the approx.
3800 most popular packages on PyPI) and this has helped us find a (very)
few additional bugs in the new parser.

(One area we have not explored extensively is rejection of all wrong
programs. We have unit tests that check for a certain number of
explicit rejections, but more work could be done, e.g. by using a
fuzzer that inserts random subtle bugs into existing code. We welcome
help in this area.)
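
As an illustration of the fuzzing idea, here is a small differential
sketch, not an existing tool. It applies one random single-character
mutation at a time to a valid program and records the mutants that one
parser accepts and the other rejects. The names ``mutate``, ``accepts``
and ``disagreements`` are hypothetical, and both parsers are assumed to
be callables that raise ``SyntaxError`` (or ``ValueError``) on invalid
input.

```python
import ast
import random


def mutate(source, rng):
    """Delete, duplicate or replace one character at a random position."""
    i = rng.randrange(len(source))
    choice = rng.choice(("delete", "duplicate", "replace"))
    if choice == "delete":
        return source[:i] + source[i + 1:]
    if choice == "duplicate":
        return source[:i] + source[i] + source[i:]
    return source[:i] + rng.choice("()[]{}:,=+-*") + source[i + 1:]


def accepts(source, parse):
    """Return True if the parser accepts the source, False if it rejects it."""
    try:
        parse(source)
    except (SyntaxError, ValueError):
        return False
    return True


def disagreements(source, parse_a, parse_b, rounds=1000, seed=0):
    """Collect mutants on which the two parsers give different verdicts."""
    rng = random.Random(seed)  # seeded so that a run can be reproduced
    found = []
    for _ in range(rounds):
        mutant = mutate(source, rng)
        if accepts(mutant, parse_a) != accepts(mutant, parse_b):
            found.append(mutant)
    return found
```

A differential check of this kind only shows where two parsers disagree;
it cannot tell which of the two verdicts is the correct one.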

-----------
Performance
-----------

We have tuned the performance of the new parser to come within 10% of
the current parser both in speed and memory consumption. While the
PEG/packrat parsing algorithm inherently consumes more memory than the
current LL(1) parser, we have an advantage because we don't construct
an intermediate CST.

Below are some benchmarks. These are focused on compiling source code
to bytecode, because this is the most realistic situation. Returning
an AST to Python code is not as representative, because the process to
convert the *internal* AST (only accessible to C code) to an
*external* AST (an instance of ``ast.AST``) takes more time than the
parser itself.

All measurements reported here were made on a recent MacBook Pro,
taking the median of three runs. No particular care was taken to stop
other applications running on the same machine.

The first timings are for our canonical test file, which has 100,000
lines, repeating the following three lines over and over::

    1 + 2 + 4 + 5 + 6 + 7 + 8 + 9 + 10 + ((((((11 * 12 * 13 * 14 * 15 + 16 * 17 + 18 * 19 * 20))))))
    2*3 + 4*5*6
    12 + (2 * 3 * 4 * 5 + 6 + 7 * 8)

- Just parsing and throwing away the internal AST takes 1.16 seconds
  with a max RSS of 681 MiB.

- Parsing and converting to ``ast.AST`` takes 6.34 seconds, max RSS
  1029 MiB.

- Parsing and compiling to bytecode takes 1.28 seconds, max RSS 681
  MiB.

- With the current parser, parsing and compiling takes 1.44 seconds,
  max RSS 836 MiB.

For this particular test file, the new parser is faster and uses less
memory than the current parser (compare the last two bullets).
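
A measurement of the "parsing and compiling to bytecode" kind can be
approximated with the sketch below. It is only an illustration under
stated assumptions, not the benchmark harness used for the numbers
above: ``make_payload`` and ``median_compile_time`` are hypothetical
helpers, timing uses ``time.perf_counter``, and memory usage is not
measured at all.

```python
import statistics
import time

# The three lines that the canonical test file repeats.
CANONICAL_LINES = (
    "1 + 2 + 4 + 5 + 6 + 7 + 8 + 9 + 10 + "
    "((((((11 * 12 * 13 * 14 * 15 + 16 * 17 + 18 * 19 * 20))))))",
    "2*3 + 4*5*6",
    "12 + (2 * 3 * 4 * 5 + 6 + 7 * 8)",
)


def make_payload(n_lines=100_000):
    """Build a source string of n_lines lines cycling through the three lines."""
    lines = [CANONICAL_LINES[i % 3] for i in range(n_lines)]
    return "\n".join(lines) + "\n"


def median_compile_time(source, runs=3):
    """Compile the source to bytecode several times; return the median seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        compile(source, "<canonical>", "exec")
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


if __name__ == "__main__":
    print(f"{median_compile_time(make_payload()):.2f} s")
```

Absolute numbers from such a script depend on the machine and on the
Python version, so they are not directly comparable to the figures
listed above.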

We also did timings with a more realistic payload, the entire Python
3.8 stdlib. This payload consists of 1,641 files, 749,570 lines,
27,622,497 bytes. (Though 11 files can't be compiled by any Python 3
parser due to encoding issues, sometimes intentional.)

- Compiling and throwing away the internal AST took 2.141 seconds.
  That's 350,040 lines/sec, or 12,899,367 bytes/sec. The max RSS was
  74 MiB (the largest file in the stdlib is much smaller than our
  canonical test file).

- Compiling to bytecode took 3.290 seconds. That's 227,861 lines/sec,
  or 8,396,942 bytes/sec. Max RSS 77 MiB.

- Compiling to bytecode using the current parser took 3.367 seconds.
  That's 222,620 lines/sec, or 8,203,780 bytes/sec. Max RSS 70 MiB.

Comparing the last two bullets we find that the new parser is slightly
faster but uses slightly (about 10%) more memory. We believe this is
acceptable. (Also, there are probably some more tweaks we can make to
reduce memory usage.)

=====================
Rejected Alternatives
=====================

We did not seriously consider alternative ways to implement the new
parser, but here's a brief discussion of LALR(1).

Thirty years ago the first author decided to go his own way with
Python's parser rather than using LALR(1), which was the industry
standard at the time (e.g. Bison and Yacc). The reasons were
primarily emotional (gut feelings, intuition), based on past experience
using Yacc in other projects, where grammar development took more
effort than anticipated (in part due to shift-reduce conflicts). A
specific criticism of Bison and Yacc that still holds is that their
meta-grammar (the notation used to feed the grammar into the parser
generator) does not support EBNF conveniences like
``[optional_clause]`` or ``(repeated_clause)*``. Using a custom
parser generator, a syntax tree matching the structure of the grammar
could be generated automatically, and with EBNF that tree could match
the "human-friendly" structure of the grammar.

Other variants of LR were not considered, nor was LL (e.g. ANTLR).
PEG was selected because it was easy to understand given a basic
understanding of recursive-descent parsing.

==========
References
==========

.. [1] Ford, Bryan
   http://pdos.csail.mit.edu/~baford/packrat/thesis

.. [2] Medeiros et al.
   https://arxiv.org/pdf/1207.0443.pdf

.. [3] Warth et al.
   http://web.cs.ucla.edu/~todd/research/pepm08.pdf

.. [#GUIDO_PEG]
   Guido's series on PEG parsing
   https://medium.com/@gvanrossum_83706/peg-parsing-series-de5d41b2ed60

=========
Copyright
=========

This document has been placed in the public domain.