PEP 617: Clean-up of LL(1) section (#1363)

Numerous small changes to improve readability of the LL(1) background: idiomatic usage improved, sentences split, clearer expression of certain ideas. Emphasis that the LL(1) constraint obscures the meaning of the grammar.

Fixes #1362
Jeff Allen 2020-04-07 20:57:39 +01:00 committed by GitHub
parent aac58d4c99
commit adb5173eb1
1 changed files with 38 additions and 33 deletions

@@ -17,12 +17,12 @@ Post-History: 02-Apr-2020
Overview
========
This PEP proposes to replace the current LL(1)-based parser of CPython
with a new PEG-based parser. This new parser will allow eliminating the multiple
"hacks" that exist in the current grammar to circumvent the LL(1)-limitation
while substantially reducing the maintenance costs in some areas related to the
This PEP proposes replacing the current LL(1)-based parser of CPython
with a new PEG-based parser. This new parser would allow the elimination of multiple
"hacks" that exist in the current grammar to circumvent the LL(1)-limitation.
It would substantially reduce the maintenance costs in some areas related to the
compiling pipeline such as the grammar, the parser and the AST generation. The new PEG
parser will also lift the LL(1) restriction over the current Python grammar.
parser will also lift the LL(1) restriction on the current Python grammar.
===========================
Background on LL(1) parsers
@@ -31,15 +31,15 @@ Background on LL(1) parsers
The current Python grammar is an LL(1)-based grammar. A grammar can be said to be
LL(1) if it can be parsed by an LL(1) parser, which in turn is defined as a
top-down parser that parses the input from left to right, performing leftmost
derivation of the sentence, and can only use one token of lookahead when parsing a
sentence. The traditional approach to construct or generate an LL(1) parser is to
derivation of the sentence, with just one token of lookahead.
The traditional approach to constructing or generating an LL(1) parser is to
produce a *parse table* which encodes the possible transitions between all possible
states of the parser. These tables are normally constructed from the *first sets*
and the *follow sets* of the grammar:
* Given a rule, the *first set* are the collection of all terminals that can occur
first in a full derivation of that rule. Intuitively this helps the parser decide
among multiple alternatives if a rule can have multiple possibilities. For
* Given a rule, the *first set* is the collection of all terminals that can occur
first in a full derivation of that rule. Intuitively, this helps the parser decide
among the alternatives in a rule. For
instance, given the rule ::
rule: A | B
@@ -48,53 +48,58 @@ and the *follow sets* of the grammar:
terminal *b* and the parser sees the token *b* when parsing this rule, it knows
that it needs to follow the non-terminal ``B``.
* Given a rule, the *follow set* are the collection of terminals that can appear
immediately to the right of that rule in a partial derivation. Intuitively this
solves the problem in which a rule can expand to the empty string. For instance,
* An extension to this simple idea is needed when a rule may expand to the empty string.
Given a rule, the *follow set* is the collection of terminals that can appear
immediately to the right of that rule in a partial derivation. Intuitively, this
solves the problem of the empty alternative. For instance,
given this rule::
rule: A 'b'
if the parser has the token *b* and the rule A can only start with the token *a*
we know it is an invalid program but if A can be expanded also to the empty string
(called an ε-production) then we can consume the next token, 'b'. Therefore, *b*
is in the *follow set* of ``A``.
if the parser has the token *b* and the non-terminal ``A`` can only start
with the token *a*, then the parser can tell that this is an invalid program.
But if ``A`` could expand to the empty string (called an ε-production),
then the parser would recognise a valid empty ``A``,
since the next token *b* is in the *follow set* of ``A``.
The Python grammar does not allow ε-productions so the *follow sets* are not
The current Python grammar does not contain ε-productions, so the *follow sets* are not
needed when creating the parse tables. Currently, in CPython, a parser generator
program reads the grammar and produces a parsing table representing a set of
deterministic finite automata (DFA) that can be included in a C program, the
parser, which is a pushdown automaton that uses this data to produce a Concrete
parser. The parser is a pushdown automaton that uses this data to produce a Concrete
Syntax Tree (CST) sometimes known directly as a "parse tree". In this process, the
*first sets* are used indirectly when generating the DFAs.
LL(1) parsers and grammars are usually known for being efficient and simple to
implement and generate, but the reality is that expressing some constructs
currently present in the Python language is notably difficult or impossible with
such a restriction. As LL(1) parsers can only look one token ahead to distinguish
LL(1) parsers and grammars are usually efficient and simple to implement
and generate. However, it is not possible, under the LL(1) restriction,
to express certain common constructs in a way natural to the language
designer and the reader. This includes some in the Python language.
As LL(1) parsers can only look one token ahead to distinguish
possibilities, some rules in the grammar may be ambiguous. For instance the rule::
rule: A | B
is ambiguous if the *first sets* of both ``A`` and ``B`` have some elements in
common. This is because if the parser sees a token in the input
program that both *A* and *B* can start with it is impossible for it to deduce
which option to expand as no further token of the program can be examined to
disambiguate. As will be shown later in this document, the current LL(1)-based
common. When the parser sees a token in the input
program that both *A* and *B* can start with, it is impossible for it to deduce
which option to expand, as no further token of the program can be examined to
disambiguate.
The rule may be transformed to equivalent LL(1) rules, but then it may
be harder for a human reader to grasp its meaning.
Examples later in this document show that the current LL(1)-based
grammar suffers a lot from this scenario.
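
The first-set conflict described above can be illustrated with a minimal sketch.
It is not CPython's parser generator: the toy grammar, the dictionary
representation and the helper names ``first_sets`` and ``ll1_conflicts`` are all
assumptions made for this example, and the grammar is assumed to contain no
ε-productions, so *follow sets* are not needed.

```python
from typing import Dict, List, Set

# Hypothetical representation: a rule name maps to a list of alternatives,
# and each alternative is a non-empty list of symbols. A symbol that is a
# key of the grammar is a non-terminal; any other symbol is a terminal.
Grammar = Dict[str, List[List[str]]]


def first_sets(grammar: Grammar) -> Dict[str, Set[str]]:
    """Compute first sets by fixed-point iteration (no ε-productions)."""
    first: Dict[str, Set[str]] = {name: set() for name in grammar}
    changed = True
    while changed:
        changed = False
        for name, alternatives in grammar.items():
            for alt in alternatives:
                head = alt[0]
                new = first[head] if head in grammar else {head}
                if not new <= first[name]:
                    first[name] |= new
                    changed = True
    return first


def ll1_conflicts(grammar: Grammar) -> Dict[str, Set[str]]:
    """Map each rule to the terminals shared by two or more alternatives."""
    first = first_sets(grammar)
    conflicts: Dict[str, Set[str]] = {}
    for name, alternatives in grammar.items():
        seen: Set[str] = set()
        clash: Set[str] = set()
        for alt in alternatives:
            head = alt[0]
            alt_first = set(first[head]) if head in grammar else {head}
            clash |= seen & alt_first
            seen |= alt_first
        if clash:
            conflicts[name] = clash
    return conflicts


# Toy grammar (an assumption for illustration): both A and B can start
# with the terminal "a", so "rule: A | B" cannot be decided with one token.
GRAMMAR: Grammar = {
    "rule": [["A"], ["B"]],
    "A": [["a", "x"], ["c"]],
    "B": [["a", "y"], ["b"]],
}

print(ll1_conflicts(GRAMMAR))  # {'rule': {'a'}}
```

In this toy grammar the parser cannot choose between ``A`` and ``B`` on seeing
*a*. The two alternatives would have to be factored by hand into a shared
prefix, which is the kind of transformation that obscures the rule's meaning.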
Also, it is relevant to note (as other sections of this document will deal with this
concept) that a given grammar cannot be LL(1) if it is left recursive. A grammar is
left-recursive if and only if there exists a nonterminal that can derive to a
Another broad class of rules precluded by LL(1) is left-recursive rules.
A rule is left-recursive if it can derive to a
sentential form with itself as the leftmost symbol. For instance this rule::
rule: rule 'a'
is left-recursive because the rule can be expanded to an expression that starts
with itself. As will be described later, left-recursion can be very useful to
express some desired properties directly in the grammar and the lack of
it can lead to some undesired scenarios.
with itself. As will be described later, left-recursion is the natural way to
express certain desired language properties directly in the grammar.
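
As a rough sketch of the definition above, the following function reports which
rules of a toy grammar can derive a form with themselves as the leftmost symbol.
The grammar representation and the name ``left_recursive_rules`` are assumptions
for this example only. The grammar is assumed to have no ε-productions, so only
the first symbol of each alternative needs to be followed.

```python
from typing import Dict, List, Set

# Hypothetical representation: rule name -> list of alternatives, each a
# non-empty list of symbols. Keys of the grammar are non-terminals.
Grammar = Dict[str, List[List[str]]]


def left_recursive_rules(grammar: Grammar) -> Set[str]:
    """Return the rules that can reach themselves in leftmost position."""
    result: Set[str] = set()
    for start in grammar:
        stack = [start]
        seen: Set[str] = set()
        while stack:
            current = stack.pop()
            for alt in grammar[current]:
                head = alt[0]
                if head == start:
                    # 'start' derives a form beginning with 'start'.
                    result.add(start)
                if head in grammar and head not in seen:
                    seen.add(head)
                    stack.append(head)
    return result


# Toy grammar: "rule" is directly left-recursive (the second alternative
# is an added base case, assumed for illustration); "p" and "q" are
# indirectly left-recursive; "expr" is right-recursive only.
GRAMMAR: Grammar = {
    "rule": [["rule", "a"], ["a"]],
    "p": [["q", "a"]],
    "q": [["p", "b"], ["c"]],
    "expr": [["term", "+", "expr"], ["term"]],
    "term": [["x"]],
}

print(sorted(left_recursive_rules(GRAMMAR)))  # ['p', 'q', 'rule']
```

A naive top-down parser that expands ``rule`` by first trying to parse ``rule``
again would recurse without consuming any input, which is why such rules cannot
be used directly in an LL(1) grammar.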
=========================
Background on PEG parsers