PEP 617: Clean-up of LL(1) section (#1363)

Numerous small changes to improve readability of the LL(1) background: idiomatic usage improved, sentences split, clearer expression of certain ideas. Emphasis that the LL(1) constraint obscures the meaning of the grammar. Fixes #1362
2020-04-07 20:57:39 +01:00 · 2020-04-07 20:57:39 +01:00 · adb5173eb1
parent aac58d4c99
commit adb5173eb1
1 changed files with 38 additions and 33 deletions
--- a/pep-0617.rst
+++ b/pep-0617.rst
@@ -17,12 +17,12 @@ Post-History: 02-Apr-2020
 Overview
 ========
-This PEP proposes to replace the current LL(1)-based parser of CPython
+This PEP proposes replacing the current LL(1)-based parser of CPython
-with a new PEG-based parser. This new parser will allow eliminating the multiple
+with a new PEG-based parser. This new parser would allow the elimination of multiple
-"hacks" that exist in the current grammar to circumvent the LL(1)-limitation
+"hacks" that exist in the current grammar to circumvent the LL(1)-limitation.
-while substantially reducing the maintenance costs in some areas related to the
+It would substantially reduce the maintenance costs in some areas related to the
 compiling pipeline such as the grammar, the parser and the AST generation. The new PEG
-parser will also lift the LL(1) restriction over the current Python grammar.
+parser will also lift the LL(1) restriction on the current Python grammar.
 ===========================
 Background on LL(1) parsers
@@ -31,15 +31,15 @@ Background on LL(1) parsers
 The current Python grammar is an LL(1)-based grammar. A grammar can be said to be
 LL(1) if it can be parsed by an LL(1) parser, which in turn is defined as a
 top-down parser that parses the input from left to right, performing leftmost
-derivation of the sentence, and can only use one token of lookahead when parsing a
+derivation of the sentence, with just one token of lookahead.
-sentence. The traditional approach to construct or generate an LL(1) parser is to
+The traditional approach to constructing or generating an LL(1) parser is to
 produce a *parse table* which encodes the possible transitions between all possible
 states of the parser. These tables are normally constructed from the *first sets*
 and the *follow sets* of the grammar:
-* Given a rule, the *first set* are the collection of all terminals that can occur
+* Given a rule, the *first set* is the collection of all terminals that can occur
-  first in a full derivation of that rule. Intuitively this helps the parser decide
+  first in a full derivation of that rule. Intuitively, this helps the parser decide
-  among multiple alternatives if a rule can have multiple possibilities. For
+  among the alternatives in a rule. For
  instance, given the rule ::
      rule: A | B
@@ -48,53 +48,58 @@ and the *follow sets* of the grammar:
  terminal *b* and the parser sees the token *b* when parsing this rule, it knows
  that it needs to follow the non-terminal ``B``.
-* Given a rule, the *follow set* are the collection of terminals that can appear
+* An extension to this simple idea is needed when a rule may expand to the empty string.
-  immediately to the right of that rule in a partial derivation. Intuitively this
+  Given a rule, the *follow set* is the collection of terminals that can appear
-  solves the problem in which a rule can expand to the empty string. For instance,
+  immediately to the right of that rule in a partial derivation. Intuitively, this
+  solves the problem of the empty alternative. For instance,
  given this rule::
    rule: A 'b'
-  if the parser has the token *b* and the rule A can only start with the token *a*
+  if the parser has the token *b* and the non-terminal ``A`` can only start
-  we know it is an invalid program but if A can be expanded also to the empty string
+  with the token *a*, then the parser can tell that this is an invalid program.
-  (called an ε-production) then we can consume the next token, 'b'. Therefore, *b*
+  But if ``A`` could expand to the empty string (called an ε-production),
-  is in the *follow set*  of ``A``.
+  then the parser would recognise a valid empty ``A``,
+  since the next token *b* is in the *follow set*  of ``A``.
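The *first set* and *follow set* definitions above can be sketched as a small fixed-point computation. This is an illustrative sketch only, not the code of CPython's parser generator: the dictionary-based grammar encoding, the ``EPSILON`` and ``"$"`` (end of input) markers, and all function names are assumptions made for this example.

```python
EPSILON = "ε"  # assumed marker for "can expand to the empty string"


def first_of_sequence(symbols, first, grammar):
    """First set of a sequence of symbols, given the first sets known so far."""
    result = set()
    for sym in symbols:
        if sym not in grammar:           # a terminal starts the sequence
            result.add(sym)
            return result
        result |= first[sym] - {EPSILON}
        if EPSILON not in first[sym]:    # this non-terminal cannot vanish
            return result
    result.add(EPSILON)                  # every symbol can vanish (or none exist)
    return result


def first_sets(grammar):
    """Iterate until no first set grows any further."""
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, alternatives in grammar.items():
            for alt in alternatives:
                new = first_of_sequence(alt, first, grammar)
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first


def follow_sets(grammar, first, start):
    """Terminals that can appear immediately to the right of each non-terminal."""
    follow = {nt: set() for nt in grammar}
    follow[start].add("$")               # assumed end-of-input marker
    changed = True
    while changed:
        changed = False
        for nt, alternatives in grammar.items():
            for alt in alternatives:
                for i, sym in enumerate(alt):
                    if sym not in grammar:
                        continue         # follow sets are only kept for non-terminals
                    rest = first_of_sequence(alt[i + 1:], first, grammar)
                    new = rest - {EPSILON}
                    if EPSILON in rest:  # nothing (visible) after sym in this rule
                        new |= follow[nt]
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True
    return follow


# The example from the text, "rule: A 'b'", with A allowed to be 'a' or empty.
grammar = {"rule": [("A", "b")], "A": [("a",), ()]}
first = first_sets(grammar)
follow = follow_sets(grammar, first, "rule")
print(first["A"], follow["A"])
```

With this encoding, *b* ends up in the follow set of ``A``, which is what lets a parser accept an empty ``A`` when the next token is *b*.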
-The Python grammar does not allow ε-productions so the *follow sets* are not
+The current Python grammar does not contain ε-productions, so the *follow sets* are not
 needed when creating the parse tables. Currently, in CPython, a parser generator
 program reads the grammar and produces a parsing table representing a set of
 deterministic finite automata (DFA) that can be included in a C program, the
-parser, which is a pushdown automaton that uses this data to produce a Concrete
+parser. The parser is a pushdown automaton that uses this data to produce a Concrete
 Syntax Tree (CST) sometimes known directly as a "parse tree". In this process, the
 *first sets* are used indirectly when generating the DFAs.
-LL(1) parsers and grammars are usually known for being efficient and simple to
+LL(1) parsers and grammars are usually efficient and simple to implement
-implement and generate, but the reality is that expressing some constructs
+and generate. However, it is not possible, under the LL(1) restriction,
-currently present in the Python language is notably difficult or impossible with
+to express certain common constructs in a way natural to the language
-such a restriction. As LL(1) parsers can only look one token ahead to distinguish
+designer and the reader. This includes some in the Python language.
+As LL(1) parsers can only look one token ahead to distinguish
 possibilities, some rules in the grammar may be ambiguous. For instance the rule::
    rule: A | B
 is ambiguous if the *first sets* of both ``A`` and ``B`` have some elements in
-common. This is because if the parser sees a token in the input
+common. When the parser sees a token in the input
-program that both *A* and *B* can start with it is impossible for it to deduce
+program that both *A* and *B* can start with, it is impossible for it to deduce
-which option to expand as no further token of the program can be examined to
+which option to expand, as no further token of the program can be examined to
-disambiguate. As will be shown later in this document, the current LL(1)-based
+disambiguate.
+The rule may be transformed to equivalent LL(1) rules, but then it may
+be harder for a human reader to grasp its meaning.
+Examples later in this document show that the current LL(1)-based
 grammar suffers a lot from this scenario.
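The overlap condition described above can be checked mechanically. The following is a minimal sketch, assuming the first set of each alternative has already been computed; the function name and the sample first sets are hypothetical and chosen only to mirror ``rule: A | B``.

```python
from itertools import combinations


def ll1_conflicts(alternative_firsts):
    """Map each pair of alternatives to the tokens both of them can start with."""
    conflicts = {}
    pairs = combinations(alternative_firsts.items(), 2)
    for (name_a, first_a), (name_b, first_b) in pairs:
        shared = first_a & first_b
        if shared:                       # one token of lookahead cannot choose
            conflicts[(name_a, name_b)] = shared
    return conflicts


# Hypothetical first sets for the alternatives of "rule: A | B".
disjoint = {"A": {"a"}, "B": {"b"}}
overlapping = {"A": {"a", "c"}, "B": {"b", "c"}}

print(ll1_conflicts(disjoint))       # no conflict: the next token decides
print(ll1_conflicts(overlapping))    # A and B can both start with "c"
```

A non-empty result means the rule has to be rewritten (for example by factoring out the common prefix) before an LL(1) parser can handle it.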
-Also, it is relevant to note (as other sections of this document will deal with this
+Another broad class of rules precluded by LL(1) is left-recursive rules.
-concept) that a given grammar cannot be LL(1) if it is left recursive. A grammar is
+A rule is left-recursive if it can derive to a
-left-recursive if and only if there exists a nonterminal that can derive to a
 sentential form with itself as the leftmost symbol. For instance this rule::
    rule: rule 'a'
 is left-recursive because the rule can be expanded to an expression that starts
-with itself. As will be described later, left-recursion can be very useful to
+with itself. As will be described later, left-recursion is the natural way to
-express some desired properties directly in the grammar and the lack of
+express certain desired language properties directly in the grammar.
-it can lead to some undesired scenarios.
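Left recursion, as defined above, can be detected by following the leftmost symbol of each alternative. The sketch below is an assumption-laden illustration: it relies on the grammar having no ε-productions (as the text states for the current Python grammar), and the grammar encoding and function name are invented for this example.

```python
def left_recursive_rules(grammar):
    """Return the non-terminals that can derive a form starting with themselves.

    Assumes no alternative is empty, so only the first symbol of each
    alternative can be the leftmost symbol of a derivation.
    """
    # Direct edges: non-terminals that appear first in some alternative.
    starts_with = {
        nt: {alt[0] for alt in alternatives if alt and alt[0] in grammar}
        for nt, alternatives in grammar.items()
    }
    result = set()
    for nt in grammar:
        seen, stack = set(), list(starts_with[nt])
        while stack:                     # depth-first walk over leftmost symbols
            sym = stack.pop()
            if sym == nt:                # reached ourselves: left-recursive
                result.add(nt)
                break
            if sym not in seen:
                seen.add(sym)
                stack.extend(starts_with[sym])
    return result


grammar = {
    "rule": [("rule", "a")],                      # direct, as in the text
    "expr": [("term", "+", "expr"), ("term",)],   # right-recursive only
    "term": [("x",)],
    "p": [("q", "a")],                            # indirect: p -> q -> p
    "q": [("p", "b"), ("c",)],
}
print(sorted(left_recursive_rules(grammar)))
```

The walk also reports indirect left recursion (``p`` and ``q`` here), which is ruled out by LL(1) in the same way as the direct case.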
 =========================
 Background on PEG parsers