PEP 701: Incorporate more feedback from the discussion thread (#2974)

parent 558d2066c0
commit 32eab9398d

pep-0701.rst (141 lines changed)
@@ -271,17 +271,19 @@ New tokens

 Three new tokens are introduced: ``FSTRING_START``, ``FSTRING_MIDDLE`` and
-``FSTRING_END``. This PEP does not mandate the precise definitions of these tokens
-as different lexers may have different implementations that may be more efficient
-than the ones proposed here given the context of the particular implementation. However,
-the following definitions are provided as a reference so that the reader can have a
-better understanding of the proposed grammar changes and how the tokens are used:
+``FSTRING_END``. Different lexers may have different implementations that may be
+more efficient than the ones proposed here given the context of the particular
+implementation. However, the following definitions will be used as part of the
+public APIs of CPython (such as the ``tokenize`` module) and are also provided
+as a reference so that the reader can have a better understanding of the
+proposed grammar changes and how the tokens are used:

-* ``FSTRING_START``: This token includes f-string character (``f``/``F``) and the open quote(s).
-* ``FSTRING_MIDDLE``: This token includes the text between the opening quote
-  and the first expression brace (``{``) and the text between two expression braces (``}`` and ``{``).
-* ``FSTRING_END``: This token includes everything after the last expression brace (or the whole literal part
-  if no expression exists) until the closing quote.
+* ``FSTRING_START``: This token includes the f-string prefix (``f``/``F``/``fr``) and the opening quote(s).
+* ``FSTRING_MIDDLE``: This token includes a portion of text inside the string that's not part of the
+  expression part and isn't an opening or closing brace. This can include the text between the opening quote
+  and the first expression brace (``{``), the text between two expression braces (``}`` and ``{``) and the text
+  between the last expression brace (``}``) and the closing quote.
+* ``FSTRING_END``: This token includes the closing quote.

 These tokens are always string parts and they are semantically equivalent to the
 ``STRING`` token with the restrictions specified. These tokens must be produced by the lexer
@@ -292,7 +294,7 @@ differently to the one used by the PEG parser).

 As an example::

-    f'some words {a+b} more words {c+d} final words'
+    f'some words {a+b:.3f} more words {c+d=} final words'

 will be tokenized as::
@@ -302,33 +304,88 @@ will be tokenized as::

     NAME - 'a'
     PLUS - '+'
     NAME - 'b'
+    OP - ':'
+    FSTRING_MIDDLE - '.3f'
     RBRACE - '}'
     FSTRING_MIDDLE - ' more words '
     LBRACE - '{'
     NAME - 'c'
     PLUS - '+'
     NAME - 'd'
+    OP - '='
     RBRACE - '}'
-    FSTRING_END - ' final words' (without the end quote)
+    FSTRING_MIDDLE - ' final words'
+    FSTRING_END - "'"

 while ``f"""some words"""`` will be tokenized simply as::

     FSTRING_START - 'f"""'
-    FSTRING_END - 'some words'
+    FSTRING_MIDDLE - 'some words'
+    FSTRING_END - '"""'

-One way existing lexers can be adapted to emit these tokens is to incorporate a stack of "lexer modes"
-or to use a stack of different lexers. This is because the lexer needs to switch from "regular Python
-lexing" to "f-string lexing" when it encounters an f-string start token and as f-strings can be nested,
-the context needs to be preserved until the f-string closes. Also, the "lexer mode" inside an f-string
-expression part needs to behave as a "super-set" of the regular Python lexer (as it needs to be able to
-switch back to f-string lexing when it encounters the ``}`` terminator for the expression part as well
-as handling f-string formatting and debug expressions). Of course, as mentioned before, is not possible to
-provide a precise specification of how this should be done as it will depend on the specific implementation
-and nature of the lexer to be changed.
+.. _701-tokenize-changes:

-The specifics of how (or if) the ``tokenize`` module will emit these tokens (or others) and what
-is included in the emitted tokens are left out of this document and must be decided later in a regular
-CPython issue.
+Changes to the tokenize module
+------------------------------
+
+The :mod:`tokenize` module will be adapted to emit these tokens as described in the previous section
+when parsing f-strings so tools can take advantage of this new tokenization schema and avoid having
+to implement their own f-string tokenizer and parser.

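For illustration, assuming an interpreter that already implements this PEP
(CPython 3.12 or later), the new tokens can be observed directly with the
:mod:`tokenize` module; note that ``tokenize`` reports the braces as generic
``OP`` tokens rather than the ``LBRACE``/``RBRACE`` names used above::

    import io
    import tokenize

    # Requires CPython 3.12+; earlier versions emit the whole f-string
    # as a single STRING token instead of the new FSTRING_* tokens.
    code = "f'some words {a+b:.3f} more words {c+d=} final words'\n"
    for tok in tokenize.generate_tokens(io.StringIO(code).readline):
        print(tokenize.tok_name[tok.type], repr(tok.string))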

+How to produce these new tokens
+-------------------------------
+
+One way existing lexers can be adapted to emit these tokens is to incorporate a
+stack of "lexer modes" or to use a stack of different lexers. This is because
+the lexer needs to switch from "regular Python lexing" to "f-string lexing" when
+it encounters an f-string start token and, as f-strings can be nested, the
+context needs to be preserved until the f-string closes. Also, the "lexer mode"
+inside an f-string expression part needs to behave as a "super-set" of the
+regular Python lexer (as it needs to be able to switch back to f-string lexing
+when it encounters the ``}`` terminator for the expression part as well as
+handling f-string formatting and debug expressions). For reference, here is a
+draft of the algorithm to modify a CPython-like tokenizer to emit these new
+tokens:
+
+1. If the lexer detects that an f-string is starting (by detecting the letter
+   'f/F' and one of the possible quotes), keep advancing until a valid quote is
+   detected (one of ``"``, ``"""``, ``'`` or ``'''``) and emit a
+   ``FSTRING_START`` token with the contents captured (the 'f/F' and the
+   starting quote). Push a new tokenizer mode to the tokenizer mode stack for
+   "F-string tokenization". Go to step 2.
+2. Keep consuming tokens until one of the following is encountered:
+
+   * A closing quote equal to the opening quote.
+   * An opening brace (``{``) or a closing brace (``}``) that is not immediately
+     followed by another opening/closing brace.
+
+   In all cases, if the character buffer is not empty, emit a ``FSTRING_MIDDLE``
+   token with the contents captured so far, but transform any double
+   opening/closing braces into single opening/closing braces. Now, proceed as
+   follows depending on the character encountered:
+
+   * If a closing quote matching the opening quote is encountered, go to step 4.
+   * If an opening bracket (not immediately followed by another opening bracket)
+     is encountered, go to step 3.
+   * If a closing bracket (not immediately followed by another closing bracket)
+     is encountered, emit a token for the closing bracket and go to step 2.
+
+3. Push a new tokenizer mode to the tokenizer mode stack for "Regular Python
+   tokenization within f-string" and proceed to tokenize with it. This mode
+   tokenizes as the "Regular Python tokenization" until a ``!``, ``:`` or ``=``
+   character is encountered, or until a ``}`` character is encountered with the
+   same level of nesting as the opening bracket token that was pushed when we
+   entered the f-string part. Using this mode, emit tokens until one of the stop
+   points is reached. When this happens, emit the corresponding token for the
+   stopping character encountered, pop the current tokenizer mode from the
+   tokenizer mode stack and go to step 2.
+4. Emit a ``FSTRING_END`` token with the contents captured and pop the current
+   tokenizer mode (corresponding to "F-string tokenization") and go back to
+   "Regular Python mode".
+
+Of course, as mentioned before, it is not possible to provide a precise
+specification of how this should be done for an arbitrary tokenizer as it will
+depend on the specific implementation and nature of the lexer to be changed.
+
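To make the mode-stack idea concrete, here is a minimal toy sketch (an
illustration only, not CPython's implementation) that handles just
single-quoted, non-nested f-strings with no format specs or debug
expressions::

    import re

    def toy_fstring_tokens(src):
        """Tokenize a single-quoted f-string with a stack of lexer modes.

        Deliberately simplified: no nested f-strings, no quote reuse, no
        '!'/':'/'=' handling, and a very reduced "expression" lexer.
        """
        assert src.startswith("f'") and src.endswith("'")
        tokens = [("FSTRING_START", src[:2])]
        modes = ["fstring"]          # the stack of lexer modes
        buf = ""                     # character buffer for FSTRING_MIDDLE
        i, end = 2, len(src) - 1     # skip the prefix; stop at the closing quote
        while i < end:
            if modes[-1] == "fstring":
                if src[i:i + 2] in ("{{", "}}"):
                    buf += src[i]    # doubled brace becomes one literal brace
                    i += 2
                elif src[i] == "{":
                    if buf:
                        tokens.append(("FSTRING_MIDDLE", buf))
                        buf = ""
                    tokens.append(("LBRACE", "{"))
                    modes.append("expr")   # switch to "regular Python" lexing
                    i += 1
                else:
                    buf += src[i]
                    i += 1
            else:                    # "expr" mode: a very reduced Python lexer
                if src[i] == "}":
                    tokens.append(("RBRACE", "}"))
                    modes.pop()      # back to f-string lexing
                    i += 1
                elif src[i].isspace():
                    i += 1
                else:
                    m = re.match(r"\w+|.", src[i:])
                    first = m.group()[0]
                    kind = "NAME" if first.isalpha() or first == "_" else "OP"
                    tokens.append((kind, m.group()))
                    i += m.end()
        if buf:
            tokens.append(("FSTRING_MIDDLE", buf))
        tokens.append(("FSTRING_END", "'"))
        return tokens

    for kind, text in toy_fstring_tokens("f'some words {a+b} more words'"):
        print(kind, repr(text))

Running the sketch reproduces the token stream shown earlier for the simple
parts of the example (``FSTRING_START``, the two middles, the expression
tokens and ``FSTRING_END``), which is exactly what the mode stack buys: the
f-string mode only needs to find braces and quotes, while the pushed
expression mode can reuse ordinary Python lexing.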

 Consequences of the new grammar
 -------------------------------

@@ -340,11 +397,22 @@ All restrictions mentioned in the PEP are lifted from f-string literals, as expl

 * Backslashes may now appear within expressions just like anywhere else in
   Python code. In case of strings nested within f-string literals, escape sequences are
   expanded when the innermost string is evaluated.
-* Comments, using the ``#`` character, are possible only in multi-line f-string literals,
-  since comments are terminated by the end of the line (which makes closing a
-  single-line f-string literal impossible). Comments in multi-line f-string literals require
-  the closing ``{`` of the expression part to be present in a different line as the one the
-  comment is in.
+* New lines are now allowed within expression brackets. This means that these are now allowed::
+
+      >>> x = 1
+      >>> f"___{
+      ...     x
+      ... }___"
+      '___1___'
+
+      >>> f"___{(
+      ...     x
+      ... )}___"
+      '___1___'
+
+* Comments, using the ``#`` character, are allowed within the expression part of an f-string.
+  Note that comments require the closing bracket (``}``) of the expression part to be present on
+  a different line than the one the comment is on; otherwise it will be ignored as part of the comment.

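For illustration, assuming an implementation of this PEP, the lifted
backslash and comment restrictions can be exercised directly (the expected
outputs follow from the rules above)::

    >>> words = ["hello", "world"]
    >>> f"{'\n'.join(words)}"      # a backslash inside the expression part
    'hello\nworld'
    >>> f"""{
    ...     1 + 1  # a comment inside the expression part
    ... }"""
    '2'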
 .. _701-considerations-of-quote-reuse:

@@ -423,8 +491,11 @@ Here are some of the arguments in favour:

 Backwards Compatibility
 =======================

-This PEP is backwards compatible: any valid Python code will continue to
-be valid if this PEP is implemented and it will not change semantically.
+This PEP does not introduce any backwards incompatible syntactic or semantic changes
+to the Python language. However, the :mod:`tokenize` module (a quasi-public part of the standard
+library) will need to be updated to support the new f-string tokens (to allow tool authors
+to correctly tokenize f-strings). See :ref:`701-tokenize-changes` for more details regarding
+how the public API of ``tokenize`` will be affected.
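A sketch of how a tool author might feature-detect the new tokenization at
runtime (this detection idiom is an illustration, not something mandated by
the PEP)::

    import token

    # The new token types only exist on interpreters implementing this PEP.
    HAS_FSTRING_TOKENS = all(
        hasattr(token, name)
        for name in ("FSTRING_START", "FSTRING_MIDDLE", "FSTRING_END")
    )
    print(HAS_FSTRING_TOKENS)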

 How to Teach This
 =================

@@ -499,6 +570,12 @@ Rejected Ideas

     >>> f'Useless use of lambdas: { (lambda x: x*2) }'

+#. We have decided to disallow (for the time being) using escaped braces (``\{`` and ``\}``)
+   in addition to the ``{{`` and ``}}`` syntax. Although the authors of the PEP believe that
+   allowing escaped braces is a good idea, we have decided not to include it in this PEP, as it is not strictly
+   necessary for the formalization of f-strings proposed here, and it can be
+   added independently in a regular CPython issue.

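For context, the ``{{``/``}}`` syntax that remains the only way to produce
literal braces (this has worked since f-strings were introduced)::

    >>> f"{{escaped}} braces and {1 + 1}"
    '{escaped} braces and 2'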

 Open Issues
 ===========