PEP 701: Incorporate more feedback from the discussion thread (#2974)

parent 558d2066c0
commit 32eab9398d

pep-0701.rst (141 lines changed)
@@ -271,17 +271,19 @@ New tokens

 Three new tokens are introduced: ``FSTRING_START``, ``FSTRING_MIDDLE`` and
-``FSTRING_END``. This PEP does not mandate the precise definitions of these tokens
-as different lexers may have different implementations that may be more efficient
-than the ones proposed here given the context of the particular implementation. However,
-the following definitions are provided as a reference so that the reader can have a
-better understanding of the proposed grammar changes and how the tokens are used:
+``FSTRING_END``. Different lexers may have different implementations that may be
+more efficient than the ones proposed here given the context of the particular
+implementation. However, the following definitions will be used as part of the
+public APIs of CPython (such as the ``tokenize`` module) and are also provided
+as a reference so that the reader can have a better understanding of the
+proposed grammar changes and how the tokens are used:

-* ``FSTRING_START``: This token includes f-string character (``f``/``F``) and the open quote(s).
-* ``FSTRING_MIDDLE``: This token includes the text between the opening quote
-  and the first expression brace (``{``) and the text between two expression braces (``}`` and ``{``).
-* ``FSTRING_END``: This token includes everything after the last expression brace (or the whole literal part
-  if no expression exists) until the closing quote.
+* ``FSTRING_START``: This token includes the f-string prefix (``f``/``F``/``fr``) and the opening quote(s).
+* ``FSTRING_MIDDLE``: This token includes a portion of text inside the string that's not part of the
+  expression part and isn't an opening or closing brace. This can include the text between the opening quote
+  and the first expression brace (``{``), the text between two expression braces (``}`` and ``{``) and the text
+  between the last expression brace (``}``) and the closing quote.
+* ``FSTRING_END``: This token includes the closing quote.

 These tokens are always string parts and they are semantically equivalent to the
 ``STRING`` token with the restrictions specified. These tokens must be produced by the lexer
@@ -292,7 +294,7 @@ differently to the one used by the PEG parser).

 As an example::

-    f'some words {a+b} more words {c+d} final words'
+    f'some words {a+b:.3f} more words {c+d=} final words'

 will be tokenized as::
@@ -302,33 +304,88 @@ will be tokenized as::

     NAME - 'a'
     PLUS - '+'
     NAME - 'b'
+    OP - ':'
+    FSTRING_MIDDLE - '.3f'
     RBRACE - '}'
     FSTRING_MIDDLE - ' more words '
     LBRACE - '{'
     NAME - 'c'
     PLUS - '+'
     NAME - 'd'
+    OP - '='
     RBRACE - '}'
-    FSTRING_END - ' final words' (without the end quote)
+    FSTRING_MIDDLE - ' final words'
+    FSTRING_END - "'"

 while ``f"""some words"""`` will be tokenized simply as::

     FSTRING_START - 'f"""'
-    FSTRING_END - 'some words'
+    FSTRING_MIDDLE - 'some words'
+    FSTRING_END - '"""'

-One way existing lexers can be adapted to emit these tokens is to incorporate a stack of "lexer modes"
-or to use a stack of different lexers. This is because the lexer needs to switch from "regular Python
-lexing" to "f-string lexing" when it encounters an f-string start token and as f-strings can be nested,
-the context needs to be preserved until the f-string closes. Also, the "lexer mode" inside an f-string
-expression part needs to behave as a "super-set" of the regular Python lexer (as it needs to be able to
-switch back to f-string lexing when it encounters the ``}`` terminator for the expression part as well
-as handling f-string formatting and debug expressions). Of course, as mentioned before, is not possible to
-provide a precise specification of how this should be done as it will depend on the specific implementation
-and nature of the lexer to be changed.
+.. _701-tokenize-changes:

-The specifics of how (or if) the ``tokenize`` module will emit these tokens (or others) and what
-is included in the emitted tokens are left out of this document and must be decided later in a regular
-CPython issue.
+Changes to the tokenize module
+------------------------------
+
+The :mod:`tokenize` module will be adapted to emit these tokens as described in the previous section
+when parsing f-strings so tools can take advantage of this new tokenization schema and avoid having
+to implement their own f-string tokenizer and parser.

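For illustration, assuming an interpreter that already implements this PEP
(CPython 3.12 or later), the new tokens can be observed directly with the
:mod:`tokenize` module; note that ``tokenize`` reports the braces as generic
``OP`` tokens rather than the ``LBRACE``/``RBRACE`` names used above::

    import io
    import tokenize

    # Requires CPython 3.12+; earlier versions emit the whole f-string
    # as a single STRING token instead of the new FSTRING_* tokens.
    code = "f'some words {a+b:.3f} more words {c+d=} final words'\n"
    for tok in tokenize.generate_tokens(io.StringIO(code).readline):
        print(tokenize.tok_name[tok.type], repr(tok.string))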

+How to produce these new tokens
+-------------------------------
+
+One way existing lexers can be adapted to emit these tokens is to incorporate a
+stack of "lexer modes" or to use a stack of different lexers. This is because
+the lexer needs to switch from "regular Python lexing" to "f-string lexing" when
+it encounters an f-string start token and, as f-strings can be nested, the
+context needs to be preserved until the f-string closes. Also, the "lexer mode"
+inside an f-string expression part needs to behave as a "super-set" of the
+regular Python lexer (as it needs to be able to switch back to f-string lexing
+when it encounters the ``}`` terminator for the expression part as well as
+handling f-string formatting and debug expressions). For reference, here is a
+draft of the algorithm to modify a CPython-like tokenizer to emit these new
+tokens:
+
+1. If the lexer detects that an f-string is starting (by detecting the letter
+   'f/F' and one of the possible quotes), keep advancing until a valid quote is
+   detected (one of ``"``, ``"""``, ``'`` or ``'''``) and emit a
+   ``FSTRING_START`` token with the contents captured (the 'f/F' and the
+   starting quote). Push a new tokenizer mode to the tokenizer mode stack for
+   "F-string tokenization". Go to step 2.
+2. Keep consuming tokens until one of the following is encountered:
+
+   * A closing quote equal to the opening quote.
+   * An opening brace (``{``) or a closing brace (``}``) that is not immediately
+     followed by another opening/closing brace.
+
+   In all cases, if the character buffer is not empty, emit a ``FSTRING_MIDDLE``
+   token with the contents captured so far, but transform any double
+   opening/closing braces into single opening/closing braces. Now, proceed as
+   follows depending on the character encountered:
+
+   * If a closing quote matching the opening quote is encountered, go to step 4.
+   * If an opening bracket (not immediately followed by another opening bracket)
+     is encountered, go to step 3.
+   * If a closing bracket (not immediately followed by another closing bracket)
+     is encountered, emit a token for the closing bracket and go to step 2.
+
+3. Push a new tokenizer mode to the tokenizer mode stack for "Regular Python
+   tokenization within f-string" and proceed to tokenize with it. This mode
+   tokenizes as the "Regular Python tokenization" until a ``!``, ``:`` or ``=``
+   character is encountered, or until a ``}`` character is encountered with the
+   same level of nesting as the opening bracket token that was pushed when we
+   entered the f-string part. Using this mode, emit tokens until one of the stop
+   points is reached. When this happens, emit the corresponding token for the
+   stopping character encountered, pop the current tokenizer mode from the
+   tokenizer mode stack and go to step 2.
+4. Emit a ``FSTRING_END`` token with the contents captured and pop the current
+   tokenizer mode (corresponding to "F-string tokenization") and go back to
+   "Regular Python mode".
+
+Of course, as mentioned before, it is not possible to provide a precise
+specification of how this should be done for an arbitrary tokenizer as it will
+depend on the specific implementation and nature of the lexer to be changed.
+
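To make the mode-stack idea concrete, here is a minimal toy sketch (an
illustration only, not CPython's implementation) that handles just
single-quoted, non-nested f-strings with no format specs or debug
expressions::

    import re

    def toy_fstring_tokens(src):
        """Tokenize a single-quoted f-string with a stack of lexer modes.

        Deliberately simplified: no nested f-strings, no quote reuse, no
        '!'/':'/'=' handling, and a very reduced "expression" lexer.
        """
        assert src.startswith("f'") and src.endswith("'")
        tokens = [("FSTRING_START", src[:2])]
        modes = ["fstring"]          # the stack of lexer modes
        buf = ""                     # character buffer for FSTRING_MIDDLE
        i, end = 2, len(src) - 1     # skip the prefix; stop at the closing quote
        while i < end:
            if modes[-1] == "fstring":
                if src[i:i + 2] in ("{{", "}}"):
                    buf += src[i]    # doubled brace becomes one literal brace
                    i += 2
                elif src[i] == "{":
                    if buf:
                        tokens.append(("FSTRING_MIDDLE", buf))
                        buf = ""
                    tokens.append(("LBRACE", "{"))
                    modes.append("expr")   # switch to "regular Python" lexing
                    i += 1
                else:
                    buf += src[i]
                    i += 1
            else:                    # "expr" mode: a very reduced Python lexer
                if src[i] == "}":
                    tokens.append(("RBRACE", "}"))
                    modes.pop()      # back to f-string lexing
                    i += 1
                elif src[i].isspace():
                    i += 1
                else:
                    m = re.match(r"\w+|.", src[i:])
                    first = m.group()[0]
                    kind = "NAME" if first.isalpha() or first == "_" else "OP"
                    tokens.append((kind, m.group()))
                    i += m.end()
        if buf:
            tokens.append(("FSTRING_MIDDLE", buf))
        tokens.append(("FSTRING_END", "'"))
        return tokens

    for kind, text in toy_fstring_tokens("f'some words {a+b} more words'"):
        print(kind, repr(text))

Running the sketch reproduces the token stream shown earlier for the simple
parts of the example (``FSTRING_START``, the two middles, the expression
tokens and ``FSTRING_END``), which is exactly what the mode stack buys: the
f-string mode only needs to find braces and quotes, while the pushed
expression mode can reuse ordinary Python lexing.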

 Consequences of the new grammar
 -------------------------------

@@ -340,11 +397,22 @@ All restrictions mentioned in the PEP are lifted from f-string literals, as expl

 * Backslashes may now appear within expressions just like anywhere else in
   Python code. In case of strings nested within f-string literals, escape sequences are
   expanded when the innermost string is evaluated.
-* Comments, using the ``#`` character, are possible only in multi-line f-string literals,
-  since comments are terminated by the end of the line (which makes closing a
-  single-line f-string literal impossible). Comments in multi-line f-string literals require
-  the closing ``{`` of the expression part to be present in a different line as the one the
-  comment is in.
+* New lines are now allowed within expression brackets. This means that these are now allowed::
+
+      >>> x = 1
+      >>> f"___{
+      ...     x
+      ... }___"
+      '___1___'
+
+      >>> f"___{(
+      ...     x
+      ... )}___"
+      '___1___'
+
+* Comments, using the ``#`` character, are allowed within the expression part of an f-string.
+  Note that comments require the closing bracket (``}``) of the expression part to be present on
+  a different line than the one the comment is on; otherwise it will be ignored as part of the comment.

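For illustration, assuming an implementation of this PEP, the lifted
backslash and comment restrictions can be exercised directly (the expected
outputs follow from the rules above)::

    >>> words = ["hello", "world"]
    >>> f"{'\n'.join(words)}"      # a backslash inside the expression part
    'hello\nworld'
    >>> f"""{
    ...     1 + 1  # a comment inside the expression part
    ... }"""
    '2'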
 .. _701-considerations-of-quote-reuse:

@@ -423,8 +491,11 @@ Here are some of the arguments in favour:

 Backwards Compatibility
 =======================

-This PEP is backwards compatible: any valid Python code will continue to
-be valid if this PEP is implemented and it will not change semantically.
+This PEP does not introduce any backwards incompatible syntactic or semantic changes
+to the Python language. However, the :mod:`tokenize` module (a quasi-public part of the standard
+library) will need to be updated to support the new f-string tokens (to allow tool authors
+to correctly tokenize f-strings). See :ref:`701-tokenize-changes` for more details regarding
+how the public API of ``tokenize`` will be affected.
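A sketch of how a tool author might feature-detect the new tokenization at
runtime (this detection idiom is an illustration, not something mandated by
the PEP)::

    import token

    # The new token types only exist on interpreters implementing this PEP.
    HAS_FSTRING_TOKENS = all(
        hasattr(token, name)
        for name in ("FSTRING_START", "FSTRING_MIDDLE", "FSTRING_END")
    )
    print(HAS_FSTRING_TOKENS)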

 How to Teach This
 =================

@@ -499,6 +570,12 @@ Rejected Ideas

     >>> f'Useless use of lambdas: { (lambda x: x*2) }'

+#. We have decided to disallow (for the time being) using escaped braces (``\{`` and ``\}``)
+   in addition to the ``{{`` and ``}}`` syntax. Although the authors of the PEP believe that
+   allowing escaped braces is a good idea, we have decided not to include it in this PEP, as it is not strictly
+   necessary for the formalization of f-strings proposed here, and it can be
+   added independently in a regular CPython issue.

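For context, the ``{{``/``}}`` syntax that remains the only way to produce
literal braces (this has worked since f-strings were introduced)::

    >>> f"{{escaped}} braces and {1 + 1}"
    '{escaped} braces and 2'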

 Open Issues
 ===========