PEP 617: Clean-up of LL(1) section (#1363)

Numerous small changes to improve readability of the LL(1) background: idiomatic usage improved, sentences split, clearer expression of certain ideas. Emphasis that the LL(1) constraint obscures the meaning of the grammar. Fixes #1362
2020-04-07 20:57:39 +01:00 · adb5173eb1
parent aac58d4c99
commit adb5173eb1
1 changed file with 38 additions and 33 deletions
--- a/pep-0617.rst
+++ b/pep-0617.rst
@@ -17,12 +17,12 @@ Post-History: 02-Apr-2020
 Overview
 ========

-This PEP proposes to replace the current LL(1)-based parser of CPython
-with a new PEG-based parser. This new parser will allow eliminating the multiple
-"hacks" that exist in the current grammar to circumvent the LL(1)-limitation
-while substantially reducing the maintenance costs in some areas related to the
+This PEP proposes replacing the current LL(1)-based parser of CPython
+with a new PEG-based parser. This new parser would allow the elimination of multiple
+"hacks" that exist in the current grammar to circumvent the LL(1)-limitation.
+It would substantially reduce the maintenance costs in some areas related to the
 compiling pipeline such as the grammar, the parser and the AST generation. The new PEG
-parser will also lift the LL(1) restriction over the current Python grammar.
+parser will also lift the LL(1) restriction on the current Python grammar.

 ===========================
 Background on LL(1) parsers
@@ -31,15 +31,15 @@ Background on LL(1) parsers
 The current Python grammar is an LL(1)-based grammar. A grammar can be said to be
 LL(1) if it can be parsed by an LL(1) parser, which in turn is defined as a
 top-down parser that parses the input from left to right, performing leftmost
-derivation of the sentence, and can only use one token of lookahead when parsing a
-sentence. The traditional approach to construct or generate an LL(1) parser is to
+derivation of the sentence, with just one token of lookahead.
+The traditional approach to constructing or generating an LL(1) parser is to
 produce a *parse table* which encodes the possible transitions between all possible
 states of the parser. These tables are normally constructed from the *first sets*
 and the *follow sets* of the grammar:

-* Given a rule, the *first set* are the collection of all terminals that can occur
-  first in a full derivation of that rule. Intuitively this helps the parser decide
-  among multiple alternatives if a rule can have multiple possibilities. For
+* Given a rule, the *first set* is the collection of all terminals that can occur
+  first in a full derivation of that rule. Intuitively, this helps the parser decide
+  among the alternatives in a rule. For
  instance, given the rule ::

      rule: A | B
@@ -48,53 +48,58 @@ and the *follow sets* of the grammar:
  terminal *b* and the parser sees the token *b* when parsing this rule, it knows
  that it needs to follow the non-terminal ``B``.

-* Given a rule, the *follow set* are the collection of terminals that can appear
-  immediately to the right of that rule in a partial derivation. Intuitively this
-  solves the problem in which a rule can expand to the empty string. For instance,
+* An extension to this simple idea is needed when a rule may expand to the empty string.
+  Given a rule, the *follow set* is the collection of terminals that can appear
+  immediately to the right of that rule in a partial derivation. Intuitively, this
+  solves the problem of the empty alternative. For instance,
  given this rule::

    rule: A 'b'

-  if the parser has the token *b* and the rule A can only start with the token *a*
-  we know it is an invalid program but if A can be expanded also to the empty string
-  (called an ε-production) then we can consume the next token, 'b'. Therefore, *b*
-  is in the *follow set*  of ``A``.
+  if the parser has the token *b* and the non-terminal ``A`` can only start
+  with the token *a*, then the parser can tell that this is an invalid program.
+  But if ``A`` could expand to the empty string (called an ε-production),
+  then the parser would recognise a valid empty ``A``,
+  since the next token *b* is in the *follow set* of ``A``.


-The Python grammar does not allow ε-productions so the *follow sets* are not
+The current Python grammar does not contain ε-productions, so the *follow sets* are not
 needed when creating the parse tables. Currently, in CPython, a parser generator
 program reads the grammar and produces a parsing table representing a set of
 deterministic finite automata (DFA) that can be included in a C program, the
-parser, which is a pushdown automaton that uses this data to produce a Concrete
+parser. The parser is a pushdown automaton that uses this data to produce a Concrete
 Syntax Tree (CST) sometimes known directly as a "parse tree". In this process, the
 *first sets* are used indirectly when generating the DFAs.

-LL(1) parsers and grammars are usually known for being efficient and simple to
-implement and generate, but the reality is that expressing some constructs
-currently present in the Python language is notably difficult or impossible with
-such a restriction. As LL(1) parsers can only look one token ahead to distinguish
+LL(1) parsers and grammars are usually efficient and simple to implement
+and generate. However, it is not possible, under the LL(1) restriction,
+to express certain common constructs in a way natural to the language
+designer and the reader. This includes some in the Python language.
+
+As LL(1) parsers can only look one token ahead to distinguish
 possibilities, some rules in the grammar may be ambiguous. For instance the rule::

    rule: A | B

 is ambiguous if the *first sets* of both ``A`` and ``B`` have some elements in
-common. This is because if the parser sees a token in the input
-program that both *A* and *B* can start with it is impossible for it to deduce
-which option to expand as no further token of the program can be examined to
-disambiguate. As will be shown later in this document, the current LL(1)-based
+common. When the parser sees a token in the input
+program that both *A* and *B* can start with, it is impossible for it to deduce
+which option to expand, as no further token of the program can be examined to
+disambiguate.
+The rule may be transformed to equivalent LL(1) rules, but then it may
+be harder for a human reader to grasp its meaning.
+Examples later in this document show that the current LL(1)-based
 grammar suffers a lot from this scenario.

-Also, it is relevant to note (as other sections of this document will deal with this
-concept) that a given grammar cannot be LL(1) if it is left recursive. A grammar is
-left-recursive if and only if there exists a nonterminal that can derive to a
+Another broad class of rules precluded by LL(1) is left-recursive rules.
+A rule is left-recursive if it can derive to a
 sentential form with itself as the leftmost symbol. For instance this rule::

    rule: rule 'a'

 is left-recursive because the rule can be expanded to an expression that starts
-with itself. As will be described later, left-recursion can be very useful to
-express some desired properties directly in the grammar and the lack of
-it can lead to some undesired scenarios.
+with itself. As will be described later, left-recursion is the natural way to
+express certain desired language properties directly in the grammar.

 =========================
 Background on PEG parsers