PEP 617: Clean-up of LL(1) section (#1363)

Numerous small changes to improve readability of the LL(1) background: idiomatic usage improved, sentences split, clearer expression of certain ideas. Emphasis that the LL(1) constraint obscures the meaning of the grammar.

Fixes #1362
Jeff Allen 2020-04-07 20:57:39 +01:00 committed by GitHub
parent aac58d4c99
commit adb5173eb1
1 changed files with 38 additions and 33 deletions


@@ -17,12 +17,12 @@ Post-History: 02-Apr-2020
 Overview
 ========
-This PEP proposes to replace the current LL(1)-based parser of CPython
-with a new PEG-based parser. This new parser will allow eliminating the multiple
-"hacks" that exist in the current grammar to circumvent the LL(1)-limitation
-while substantially reducing the maintenance costs in some areas related to the
+This PEP proposes replacing the current LL(1)-based parser of CPython
+with a new PEG-based parser. This new parser would allow the elimination of multiple
+"hacks" that exist in the current grammar to circumvent the LL(1)-limitation.
+It would substantially reduce the maintenance costs in some areas related to the
 compiling pipeline such as the grammar, the parser and the AST generation. The new PEG
-parser will also lift the LL(1) restriction over the current Python grammar.
+parser will also lift the LL(1) restriction on the current Python grammar.
 ===========================
 Background on LL(1) parsers
@@ -31,15 +31,15 @@ Background on LL(1) parsers
 The current Python grammar is an LL(1)-based grammar. A grammar can be said to be
 LL(1) if it can be parsed by an LL(1) parser, which in turn is defined as a
 top-down parser that parses the input from left to right, performing leftmost
-derivation of the sentence, and can only use one token of lookahead when parsing a
-sentence. The traditional approach to construct or generate an LL(1) parser is to
+derivation of the sentence, with just one token of lookahead.
+The traditional approach to constructing or generating an LL(1) parser is to
 produce a *parse table* which encodes the possible transitions between all possible
 states of the parser. These tables are normally constructed from the *first sets*
 and the *follow sets* of the grammar:
-* Given a rule, the *first set* are the collection of all terminals that can occur
-first in a full derivation of that rule. Intuitively this helps the parser decide
-among multiple alternatives if a rule can have multiple possibilities. For
+* Given a rule, the *first set* is the collection of all terminals that can occur
+first in a full derivation of that rule. Intuitively, this helps the parser decide
+among the alternatives in a rule. For
 instance, given the rule ::
 rule: A | B
@@ -48,53 +48,58 @@ and the *follow sets* of the grammar:
 terminal *b* and the parser sees the token *b* when parsing this rule, it knows
 that it needs to follow the non-terminal ``B``.
-* Given a rule, the *follow set* are the collection of terminals that can appear
-immediately to the right of that rule in a partial derivation. Intuitively this
-solves the problem in which a rule can expand to the empty string. For instance,
+* An extension to this simple idea is needed when a rule may expand to the empty string.
+Given a rule, the *follow set* is the collection of terminals that can appear
+immediately to the right of that rule in a partial derivation. Intuitively, this
+solves the problem of the empty alternative. For instance,
 given this rule::
 rule: A 'b'
-if the parser has the token *b* and the rule A can only start with the token *a*
-we know it is an invalid program but if A can be expanded also to the empty string
-(called an ε-production) then we can consume the next token, 'b'. Therefore, *b*
-is in the *follow set* of ``A``.
+if the parser has the token *b* and the non-terminal ``A`` can only start
+with the token *a*, then the parser can tell that this is an invalid program.
+But if ``A`` could expand to the empty string (called an ε-production),
+then the parser would recognise a valid empty ``A``,
+since the next token *b* is in the *follow set* of ``A``.
-The Python grammar does not allow ε-productions so the *follow sets* are not
+The current Python grammar does not contain ε-productions, so the *follow sets* are not
 needed when creating the parse tables. Currently, in CPython, a parser generator
 program reads the grammar and produces a parsing table representing a set of
 deterministic finite automata (DFA) that can be included in a C program, the
-parser, which is a pushdown automaton that uses this data to produce a Concrete
+parser. The parser is a pushdown automaton that uses this data to produce a Concrete
 Syntax Tree (CST) sometimes known directly as a "parse tree". In this process, the
 *first sets* are used indirectly when generating the DFAs.
-LL(1) parsers and grammars are usually known for being efficient and simple to
-implement and generate, but the reality is that expressing some constructs
-currently present in the Python language is notably difficult or impossible with
-such a restriction. As LL(1) parsers can only look one token ahead to distinguish
+LL(1) parsers and grammars are usually efficient and simple to implement
+and generate. However, it is not possible, under the LL(1) restriction,
+to express certain common constructs in a way natural to the language
+designer and the reader. This includes some in the Python language.
+As LL(1) parsers can only look one token ahead to distinguish
 possibilities, some rules in the grammar may be ambiguous. For instance the rule::
 rule: A | B
 is ambiguous if the *first sets* of both ``A`` and ``B`` have some elements in
-common. This is because if the parser sees a token in the input
-program that both *A* and *B* can start with it is impossible for it to deduce
-which option to expand as no further token of the program can be examined to
-disambiguate. As will be shown later in this document, the current LL(1)-based
+common. When the parser sees a token in the input
+program that both *A* and *B* can start with, it is impossible for it to deduce
+which option to expand, as no further token of the program can be examined to
+disambiguate.
+The rule may be transformed to equivalent LL(1) rules, but then it may
+be harder for a human reader to grasp its meaning.
+Examples later in this document show that the current LL(1)-based
 grammar suffers a lot from this scenario.
-Also, it is relevant to note (as other sections of this document will deal with this
-concept) that a given grammar cannot be LL(1) if it is left recursive. A grammar is
-left-recursive if and only if there exists a nonterminal that can derive to a
+Another broad class of rules precluded by LL(1) is left-recursive rules.
+A rule is left-recursive if it can derive to a
 sentential form with itself as the leftmost symbol. For instance this rule::
 rule: rule 'a'
 is left-recursive because the rule can be expanded to an expression that starts
-with itself. As will be described later, left-recursion can be very useful to
-express some desired properties directly in the grammar and the lack of
-it can lead to some undesired scenarios.
+with itself. As will be described later, left-recursion is the natural way to
+express certain desired language properties directly in the grammar.
 =========================
 Background on PEG parsers