PEP 617: Clean-up of LL(1) section (#1363)
Numerous small changes to improve readability of the LL(1) background: idiomatic usage improved, sentences split, clearer expression of certain ideas. Emphasis that the LL(1) constraint obscures the meaning of the grammar. Fixes #1362
parent
aac58d4c99
commit
adb5173eb1
71
pep-0617.rst
|
@ -17,12 +17,12 @@ Post-History: 02-Apr-2020
|
||||||
Overview
|
Overview
|
||||||
========
|
========
|
||||||
|
|
||||||
This PEP proposes to replace the current LL(1)-based parser of CPython
|
This PEP proposes replacing the current LL(1)-based parser of CPython
|
||||||
with a new PEG-based parser. This new parser will allow eliminating the multiple
|
with a new PEG-based parser. This new parser would allow the elimination of multiple
|
||||||
"hacks" that exist in the current grammar to circumvent the LL(1)-limitation
|
"hacks" that exist in the current grammar to circumvent the LL(1)-limitation.
|
||||||
while substantially reducing the maintenance costs in some areas related to the
|
It would substantially reduce the maintenance costs in some areas related to the
|
||||||
compiling pipeline such as the grammar, the parser and the AST generation. The new PEG
|
compiling pipeline such as the grammar, the parser and the AST generation. The new PEG
|
||||||
parser will also lift the LL(1) restriction over the current Python grammar.
|
parser will also lift the LL(1) restriction on the current Python grammar.
|
||||||
|
|
||||||
===========================
|
===========================
|
||||||
Background on LL(1) parsers
|
Background on LL(1) parsers
|
||||||
|
@ -31,15 +31,15 @@ Background on LL(1) parsers
|
||||||
The current Python grammar is an LL(1)-based grammar. A grammar can be said to be
|
The current Python grammar is an LL(1)-based grammar. A grammar can be said to be
|
||||||
LL(1) if it can be parsed by an LL(1) parser, which in turn is defined as a
|
LL(1) if it can be parsed by an LL(1) parser, which in turn is defined as a
|
||||||
top-down parser that parses the input from left to right, performing leftmost
|
top-down parser that parses the input from left to right, performing leftmost
|
||||||
derivation of the sentence, and can only use one token of lookahead when parsing a
|
derivation of the sentence, with just one token of lookahead.
|
||||||
sentence. The traditional approach to construct or generate an LL(1) parser is to
|
The traditional approach to constructing or generating an LL(1) parser is to
|
||||||
produce a *parse table* which encodes the possible transitions between all possible
|
produce a *parse table* which encodes the possible transitions between all possible
|
||||||
states of the parser. These tables are normally constructed from the *first sets*
|
states of the parser. These tables are normally constructed from the *first sets*
|
||||||
and the *follow sets* of the grammar:
|
and the *follow sets* of the grammar:
|
||||||
|
|
||||||
* Given a rule, the *first set* are the collection of all terminals that can occur
|
* Given a rule, the *first set* is the collection of all terminals that can occur
|
||||||
first in a full derivation of that rule. Intuitively this helps the parser decide
|
first in a full derivation of that rule. Intuitively, this helps the parser decide
|
||||||
among multiple alternatives if a rule can have multiple possibilities. For
|
among the alternatives in a rule. For
|
||||||
instance, given the rule ::
|
instance, given the rule ::
|
||||||
|
|
||||||
rule: A | B
|
rule: A | B
|
||||||
|
@ -48,53 +48,58 @@ and the *follow sets* of the grammar:
|
||||||
terminal *b* and the parser sees the token *b* when parsing this rule, it knows
|
terminal *b* and the parser sees the token *b* when parsing this rule, it knows
|
||||||
that it needs to follow the non-terminal ``B``.
|
that it needs to follow the non-terminal ``B``.
|
||||||
|
|
||||||
* Given a rule, the *follow set* are the collection of terminals that can appear
|
* An extension to this simple idea is needed when a rule may expand to the empty string.
|
||||||
immediately to the right of that rule in a partial derivation. Intuitively this
|
Given a rule, the *follow set* is the collection of terminals that can appear
|
||||||
solves the problem in which a rule can expand to the empty string. For instance,
|
immediately to the right of that rule in a partial derivation. Intuitively, this
|
||||||
|
solves the problem of the empty alternative. For instance,
|
||||||
given this rule::
|
given this rule::
|
||||||
|
|
||||||
rule: A 'b'
|
rule: A 'b'
|
||||||
|
|
||||||
if the parser has the token *b* and the rule A can only start with the token *a*
|
if the parser has the token *b* and the non-terminal ``A`` can only start
|
||||||
we know it is an invalid program but if A can be expanded also to the empty string
|
with the token *a*, then the parser can tell that this is an invalid program.
|
||||||
(called an ε-production) then we can consume the next token, 'b'. Therefore, *b*
|
But if ``A`` could expand to the empty string (called an ε-production),
|
||||||
is in the *follow set* of ``A``.
|
then the parser would recognise a valid empty ``A``,
|
||||||
|
since the next token *b* is in the *follow set* of ``A``.
|
||||||
|
|
||||||
|
|
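The follow-set bullet above can be illustrated with a minimal sketch of a one-token-lookahead parser for ``rule: A 'b'``. The names ``FIRST_A``, ``FOLLOW_A``, ``parse_rule`` and the ``a_nullable`` flag are hypothetical and chosen only for this illustration; the sets are written out by hand rather than computed, and ``A`` is assumed to be ``'a'`` with an optional empty alternative.

```python
# Hand-written sets for the toy grammar (assumptions for illustration only):
#   rule: A 'b'
#   A:    'a'        (plus an empty alternative when a_nullable is True)
FIRST_A = {"a"}
FOLLOW_A = {"b"}  # 'b' can appear immediately to the right of A


def parse_rule(tokens, a_nullable):
    """Return True if `tokens` matches  rule: A 'b'  using one token of lookahead."""
    pos = 0
    lookahead = tokens[pos] if pos < len(tokens) else None

    if lookahead in FIRST_A:
        # The lookahead can start A, so A consumes it.
        pos += 1
    elif a_nullable and lookahead in FOLLOW_A:
        # A has an empty alternative and the lookahead may follow A,
        # so A matches the empty string and consumes nothing.
        pass
    else:
        # Neither case applies: the input is invalid.
        return False

    # After A, exactly one 'b' must remain.
    return tokens[pos:] == ["b"]
```

With ``a_nullable=False`` the input ``['b']`` is rejected, while with ``a_nullable=True`` it is accepted because *b* is in the follow set of ``A``.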
||||||
The Python grammar does not allow ε-productions so the *follow sets* are not
|
The current Python grammar does not contain ε-productions, so the *follow sets* are not
|
||||||
needed when creating the parse tables. Currently, in CPython, a parser generator
|
needed when creating the parse tables. Currently, in CPython, a parser generator
|
||||||
program reads the grammar and produces a parsing table representing a set of
|
program reads the grammar and produces a parsing table representing a set of
|
||||||
deterministic finite automata (DFA) that can be included in a C program, the
|
deterministic finite automata (DFA) that can be included in a C program, the
|
||||||
parser, which is a pushdown automaton that uses this data to produce a Concrete
|
parser. The parser is a pushdown automaton that uses this data to produce a Concrete
|
||||||
Syntax Tree (CST) sometimes known directly as a "parse tree". In this process, the
|
Syntax Tree (CST) sometimes known directly as a "parse tree". In this process, the
|
||||||
*first sets* are used indirectly when generating the DFAs.
|
*first sets* are used indirectly when generating the DFAs.
|
||||||
|
|
||||||
LL(1) parsers and grammars are usually known for being efficient and simple to
|
LL(1) parsers and grammars are usually efficient and simple to implement
|
||||||
implement and generate, but the reality is that expressing some constructs
|
and generate. However, it is not possible, under the LL(1) restriction,
|
||||||
currently present in the Python language is notably difficult or impossible with
|
to express certain common constructs in a way natural to the language
|
||||||
such a restriction. As LL(1) parsers can only look one token ahead to distinguish
|
designer and the reader. This includes some in the Python language.
|
||||||
|
|
||||||
|
As LL(1) parsers can only look one token ahead to distinguish
|
||||||
possibilities, some rules in the grammar may be ambiguous. For instance the rule::
|
possibilities, some rules in the grammar may be ambiguous. For instance the rule::
|
||||||
|
|
||||||
rule: A | B
|
rule: A | B
|
||||||
|
|
||||||
is ambiguous if the *first sets* of both ``A`` and ``B`` have some elements in
|
is ambiguous if the *first sets* of both ``A`` and ``B`` have some elements in
|
||||||
common. This is because if the parser sees a token in the input
|
common. When the parser sees a token in the input
|
||||||
program that both *A* and *B* can start with it is impossible for it to deduce
|
program that both *A* and *B* can start with, it is impossible for it to deduce
|
||||||
which option to expand as no further token of the program can be examined to
|
which option to expand, as no further token of the program can be examined to
|
||||||
disambiguate. As will be shown later in this document, the current LL(1)-based
|
disambiguate.
|
||||||
|
The rule may be transformed to equivalent LL(1) rules, but then it may
|
||||||
|
be harder for a human reader to grasp its meaning.
|
||||||
|
Examples later in this document show that the current LL(1)-based
|
||||||
grammar suffers a lot from this scenario.
|
grammar suffers a lot from this scenario.
|
||||||
|
|
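The first-set ambiguity described above can be checked mechanically. The following is a minimal sketch, not the CPython parser generator: the ``GRAMMAR`` dictionary, its symbols, and the helpers ``first_set`` and ``ll1_conflicts`` are hypothetical, and the grammar is assumed to contain no ε-productions (as the text states for the current Python grammar).

```python
# Hypothetical toy grammar. Each non-terminal maps to a list of alternatives,
# and each alternative is a list of symbols. Any symbol that is not a key of
# the dictionary is treated as a terminal.
GRAMMAR = {
    "rule": [["A"], ["B"]],
    "A": [["a", "x"]],
    "B": [["a", "y"], ["b"]],
}


def first_set(symbol, grammar, path=frozenset()):
    """Terminals that can begin a derivation of `symbol` (no ε-productions)."""
    if symbol not in grammar:
        return {symbol}  # a terminal starts with itself
    if symbol in path:
        return set()  # already being expanded on this path (left recursion)
    result = set()
    for alternative in grammar[symbol]:
        # Without ε-productions only the first symbol of an alternative matters.
        result |= first_set(alternative[0], grammar, path | {symbol})
    return result


def ll1_conflicts(name, grammar):
    """Terminals that can begin more than one alternative of rule `name`."""
    seen = set()
    clashes = set()
    for alternative in grammar[name]:
        firsts = first_set(alternative[0], grammar)
        clashes |= firsts & seen
        seen |= firsts
    return clashes
```

In this toy grammar both ``A`` and ``B`` can start with *a*, so ``ll1_conflicts("rule", GRAMMAR)`` reports that token: a parser with one token of lookahead cannot choose between the two alternatives.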
||||||
Also, it is relevant to note (as other sections of this document will deal with this
|
Another broad class of rules precluded by LL(1) is left-recursive rules.
|
||||||
concept) that a given grammar cannot be LL(1) if it is left recursive. A grammar is
|
A rule is left-recursive if it can derive to a
|
||||||
left-recursive if and only if there exists a nonterminal that can derive to a
|
|
||||||
sentential form with itself as the leftmost symbol. For instance this rule::
|
sentential form with itself as the leftmost symbol. For instance this rule::
|
||||||
|
|
||||||
rule: rule 'a'
|
rule: rule 'a'
|
||||||
|
|
||||||
is left-recursive because the rule can be expanded to an expression that starts
|
is left-recursive because the rule can be expanded to an expression that starts
|
||||||
with itself. As will be described later, left-recursion can be very useful to
|
with itself. As will be described later, left-recursion is the natural way to
|
||||||
express some desired properties directly in the grammar and the lack of
|
express certain desired language properties directly in the grammar.
|
||||||
it can lead to some undesired scenarios.
|
|
||||||
|
|
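Why a top-down parser cannot handle left recursion directly can be seen in a small sketch. It assumes a base alternative that the rule above does not show, giving ``rule: rule 'a' | 'a'``; the function names are hypothetical and the code is a naive hand-written recursive descent, not how CPython's table-driven parser is implemented.

```python
def parse_rule_left_recursive(tokens, pos=0):
    """Naive top-down translation of  rule: rule 'a' | 'a'  (base case assumed)."""
    # The first alternative begins by parsing `rule` again at the same
    # position, so the function recurses before consuming any token.
    end = parse_rule_left_recursive(tokens, pos)
    if end is not None and tokens[end:end + 1] == ["a"]:
        return end + 1
    if tokens[pos:pos + 1] == ["a"]:
        return pos + 1
    return None


def parse_rule_iterative(tokens, pos=0):
    """LL(1)-friendly rewrite of the same language:  rule: 'a' ('a')*  ."""
    if tokens[pos:pos + 1] != ["a"]:
        return None
    pos += 1
    while tokens[pos:pos + 1] == ["a"]:
        pos += 1
    return pos  # index just past the last matched token


def recurses_forever(parser, tokens):
    """Report whether `parser` exhausts the Python call stack on `tokens`."""
    try:
        parser(tokens)
    except RecursionError:
        return True
    return False
```

The left-recursive version never reaches its base case, whereas the rewritten version accepts the same inputs. The rewrite no longer mirrors the shape of the original rule, which is the readability cost the text describes.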
||||||
=========================
|
=========================
|
||||||
Background on PEG parsers
|
Background on PEG parsers
|
||||||
|
|