PEP 617: Clean-up of LL(1) section (#1363)
Numerous small changes to improve readability of the LL(1) background: idiomatic usage improved, sentences split, clearer expression of certain ideas. Emphasis that the LL(1) constraint obscures the meaning of the grammar. Fixes #1362
parent aac58d4c99
commit adb5173eb1

pep-0617.rst (71 lines changed)
@@ -17,12 +17,12 @@ Post-History: 02-Apr-2020
 Overview
 ========

-This PEP proposes to replace the current LL(1)-based parser of CPython
-with a new PEG-based parser. This new parser will allow eliminating the multiple
-"hacks" that exist in the current grammar to circumvent the LL(1)-limitation
-while substantially reducing the maintenance costs in some areas related to the
+This PEP proposes replacing the current LL(1)-based parser of CPython
+with a new PEG-based parser. This new parser would allow the elimination of multiple
+"hacks" that exist in the current grammar to circumvent the LL(1)-limitation.
+It would substantially reduce the maintenance costs in some areas related to the
 compiling pipeline such as the grammar, the parser and the AST generation. The new PEG
-parser will also lift the LL(1) restriction over the current Python grammar.
+parser will also lift the LL(1) restriction on the current Python grammar.

 ===========================
 Background on LL(1) parsers
@@ -31,15 +31,15 @@ Background on LL(1) parsers
 The current Python grammar is an LL(1)-based grammar. A grammar can be said to be
 LL(1) if it can be parsed by an LL(1) parser, which in turn is defined as a
 top-down parser that parses the input from left to right, performing leftmost
-derivation of the sentence, and can only use one token of lookahead when parsing a
-sentence. The traditional approach to construct or generate an LL(1) parser is to
+derivation of the sentence, with just one token of lookahead.
+The traditional approach to constructing or generating an LL(1) parser is to
 produce a *parse table* which encodes the possible transitions between all possible
 states of the parser. These tables are normally constructed from the *first sets*
 and the *follow sets* of the grammar:

-* Given a rule, the *first set* are the collection of all terminals that can occur
-  first in a full derivation of that rule. Intuitively this helps the parser decide
-  among multiple alternatives if a rule can have multiple possibilities. For
+* Given a rule, the *first set* is the collection of all terminals that can occur
+  first in a full derivation of that rule. Intuitively, this helps the parser decide
+  among the alternatives in a rule. For
   instance, given the rule ::

     rule: A | B
@@ -48,53 +48,58 @@ and the *follow sets* of the grammar:
   terminal *b* and the parser sees the token *b* when parsing this rule, it knows
   that it needs to follow the non-terminal ``B``.

-* Given a rule, the *follow set* are the collection of terminals that can appear
-  immediately to the right of that rule in a partial derivation. Intuitively this
-  solves the problem in which a rule can expand to the empty string. For instance,
+* An extension to this simple idea is needed when a rule may expand to the empty string.
+  Given a rule, the *follow set* is the collection of terminals that can appear
+  immediately to the right of that rule in a partial derivation. Intuitively, this
+  solves the problem of the empty alternative. For instance,
   given this rule::

     rule: A 'b'

-  if the parser has the token *b* and the rule A can only start with the token *a*
-  we know it is an invalid program but if A can be expanded also to the empty string
-  (called an ε-production) then we can consume the next token, 'b'. Therefore, *b*
-  is in the *follow set* of ``A``.
+  if the parser has the token *b* and the non-terminal ``A`` can only start
+  with the token *a*, then the parser can tell that this is an invalid program.
+  But if ``A`` could expand to the empty string (called an ε-production),
+  then the parser would recognise a valid empty ``A``,
+  since the next token *b* is in the *follow set* of ``A``.


-The Python grammar does not allow ε-productions so the *follow sets* are not
+The current Python grammar does not contain ε-productions, so the *follow sets* are not
 needed when creating the parse tables. Currently, in CPython, a parser generator
 program reads the grammar and produces a parsing table representing a set of
 deterministic finite automata (DFA) that can be included in a C program, the
-parser, which is a pushdown automaton that uses this data to produce a Concrete
+parser. The parser is a pushdown automaton that uses this data to produce a Concrete
 Syntax Tree (CST) sometimes known directly as a "parse tree". In this process, the
 *first sets* are used indirectly when generating the DFAs.

-LL(1) parsers and grammars are usually known for being efficient and simple to
-implement and generate, but the reality is that expressing some constructs
-currently present in the Python language is notably difficult or impossible with
-such a restriction. As LL(1) parsers can only look one token ahead to distinguish
+LL(1) parsers and grammars are usually efficient and simple to implement
+and generate. However, it is not possible, under the LL(1) restriction,
+to express certain common constructs in a way natural to the language
+designer and the reader. This includes some in the Python language.
+
+As LL(1) parsers can only look one token ahead to distinguish
 possibilities, some rules in the grammar may be ambiguous. For instance the rule::

   rule: A | B

 is ambiguous if the *first sets* of both ``A`` and ``B`` have some elements in
-common. This is because if the parser sees a token in the input
-program that both *A* and *B* can start with it is impossible for it to deduce
-which option to expand as no further token of the program can be examined to
-disambiguate. As will be shown later in this document, the current LL(1)-based
+common. When the parser sees a token in the input
+program that both *A* and *B* can start with, it is impossible for it to deduce
+which option to expand, as no further token of the program can be examined to
+disambiguate.
+The rule may be transformed to equivalent LL(1) rules, but then it may
+be harder for a human reader to grasp its meaning.
+Examples later in this document show that the current LL(1)-based
 grammar suffers a lot from this scenario.

-Also, it is relevant to note (as other sections of this document will deal with this
-concept) that a given grammar cannot be LL(1) if it is left recursive. A grammar is
-left-recursive if and only if there exists a nonterminal that can derive to a
+Another broad class of rules precluded by LL(1) is left-recursive rules.
+A rule is left-recursive if it can derive to a
 sentential form with itself as the leftmost symbol. For instance this rule::

   rule: rule 'a'

 is left-recursive because the rule can be expanded to an expression that starts
-with itself. As will be described later, left-recursion can be very useful to
-express some desired properties directly in the grammar and the lack of
-it can lead to some undesired scenarios.
+with itself. As will be described later, left-recursion is the natural way to
+express certain desired language properties directly in the grammar.

 =========================
 Background on PEG parsers
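
Reviewer's aside on the LL(1) background revised above: the two restrictions the new wording emphasises (alternatives whose *first sets* overlap, and left-recursive rules) can be checked mechanically. The sketch below is only an illustration under stated assumptions. The toy grammar and the names ``GRAMMAR``, ``first_set``, ``ll1_conflicts`` and ``is_left_recursive`` are hypothetical and are not part of CPython's parser generator. It also assumes a grammar without ε-productions, as the text describes for the current Python grammar, so *follow sets* are not computed.

```python
# Toy grammar (an assumption for illustration only): each non-terminal maps to
# a list of alternatives, and each alternative is a non-empty list of symbols.
# Symbols that are not keys of the dict are treated as terminals.
GRAMMAR = {
    "rule": [["A"], ["B"]],
    "A": [["a", "c"], ["b"]],
    "B": [["b", "d"]],
}


def first_set(grammar, symbol, _active=frozenset()):
    """Return the terminals that can begin a derivation of `symbol`."""
    if symbol not in grammar:
        return {symbol}  # a terminal is its own first set
    if symbol in _active:
        return set()  # already being expanded: a cycle adds nothing new
    active = _active | {symbol}
    result = set()
    for alternative in grammar[symbol]:
        # Without ε-productions only the first symbol of an alternative matters.
        result |= first_set(grammar, alternative[0], active)
    return result


def ll1_conflicts(grammar, rule):
    """Return the terminals that can start more than one alternative of `rule`."""
    seen, clashes = set(), set()
    for alternative in grammar[rule]:
        firsts = first_set(grammar, alternative[0])
        clashes |= seen & firsts
        seen |= firsts
    return clashes


def is_left_recursive(grammar, rule):
    """Return True if `rule` can derive a form with itself as leftmost symbol."""
    stack, visited = [rule], set()
    while stack:
        current = stack.pop()
        for alternative in grammar[current]:
            head = alternative[0]
            if head == rule:
                return True
            if head in grammar and head not in visited:
                visited.add(head)
                stack.append(head)
    return False


if __name__ == "__main__":
    print(sorted(first_set(GRAMMAR, "A")))
    print(sorted(ll1_conflicts(GRAMMAR, "rule")))
    print(is_left_recursive({"rule": [["rule", "a"]]}, "rule"))
```

In this toy grammar both ``A`` and ``B`` can start with the token *b*, so ``ll1_conflicts`` reports *b* for ``rule: A | B``, which is the situation the revised paragraph calls ambiguous. The last call mirrors the ``rule: rule 'a'`` example from the diff.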