PEP: 701
Title: Syntactic formalization of f-strings
Author: Pablo Galindo <pablogsal@python.org>,
        Batuhan Taskaya <batuhan@python.org>,
        Lysandros Nikolaou <lisandrosnik@gmail.com>
Discussions-To:
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 15-Nov-2022
Python-Version: 3.12

Abstract
========

This document proposes to lift some of the restrictions originally formulated in
:pep:`498` and to provide a formalized grammar for f-strings that can be
integrated into the parser directly. The proposed syntactic formalization of
f-strings will have some small side effects on how f-strings are parsed and
interpreted, allowing for a considerable number of advantages for end users and
library developers, while also dramatically reducing the maintenance cost of
the code dedicated to parsing f-strings.

Motivation
==========

When f-strings were originally introduced in :pep:`498`, the specification was
provided without a formal grammar for f-strings. Additionally, the
specification contains several restrictions that were imposed so that the parsing of
f-strings could be implemented in CPython without modifying the existing
lexer. These limitations have been recognized previously and previous attempts
have been made to lift them in :pep:`536`, but `none of this work was ever implemented`_.
Some of these limitations (collected originally by :pep:`536`) are:

#. It is impossible to use the quote character delimiting the f-string
   within the expression portion::

       >>> f'Magic wand: { bag['wand'] }'
                                 ^
       SyntaxError: invalid syntax

#. A previously considered way around it would lead to escape sequences
   in executed code and is prohibited in f-strings::

       >>> f'Magic wand { bag[\'wand\'] } string'
       SyntaxError: f-string expression portion cannot include a backslash

#. Comments are forbidden even in multi-line f-strings::

       >>> f'''A complex trick: {
       ... bag['bag']  # recursive bags!
       ... }'''
       SyntaxError: f-string expression part cannot include '#'

#. Arbitrary nesting of expressions without expansion of escape sequences is
   available in every single other language employing a string interpolation
   method that uses expressions instead of just variable names, `per Wikipedia`_.

These limitations serve no purpose from a language user perspective and
can be lifted by giving f-literals a regular grammar without exceptions
and implementing it using dedicated parse code.

The other issue that f-strings have is that the current implementation in
CPython relies on tokenizing f-strings as ``STRING`` tokens and post-processing
these tokens. This has the following problems:

#. It adds a considerable maintenance cost to the CPython parser. This is because
   the parsing code needs to be written by hand, which has historically led to a
   considerable number of inconsistencies and bugs. Writing and maintaining parsing
   code by hand in C has always been considered error-prone and dangerous as it needs
   to deal with a lot of manual memory management over the original lexer buffers.

#. The f-string parsing code is not able to use the new improved error message mechanisms
   that the new PEG parser, originally introduced in :pep:`617`, has allowed. The
   improvements that these error messages brought have been greatly celebrated but
   unfortunately f-strings cannot benefit from them because they are parsed in a
   separate piece of the parsing machinery. This is especially unfortunate, since
   there are several syntactical features of f-strings that can be confusing due
   to the different implicit tokenization that happens inside the expression
   part (for instance ``f"{y:=3}"`` is not an assignment expression).

#. Other Python implementations have no way to know if they have implemented
   f-strings correctly because, contrary to other language features, they are not
   part of the :ref:`official Python grammar <f-strings>`.
   This is important because several prominent
   alternative implementations are using CPython's PEG parser, `such as PyPy`_,
   and/or are basing their grammars on the official PEG grammar. The
   fact that f-strings use a separate parser prevents these alternative implementations
   from leveraging the official grammar and benefiting from improvements in error messages derived
   from the grammar.

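The tokenization pitfall mentioned in the second point above can be demonstrated on any modern Python (the walrus form needs 3.8+). Inside a replacement field, a top-level ``:`` starts the format specification, so in ``f"{y:=3}"`` the ``=3`` is a fill/width format spec, not an assignment expression. A minimal sketch:

```python
y = 5

# Top-level ':' starts a format spec. '=3' means: '=' alignment
# (pad after the sign), minimum field width 3.
formatted = f"{y:=3}"
print(repr(formatted))  # '  5' -- NOT an assignment expression

# An actual assignment expression requires explicit parentheses:
walrus = f"{(y := 3)}"
print(repr(walrus))  # '3', and y is now rebound to 3
```

This ambiguity only exists because the text after ``:`` is tokenized as a plain string rather than as Python tokens.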
A version of this proposal was originally `discussed on Python-Dev`_ and
`presented at the Python Language Summit 2022`_, where it was enthusiastically
received.

Rationale
=========

By building on top of the new Python PEG parser (:pep:`617`), this PEP proposes
to redefine "f-strings", especially emphasizing the clear separation of the
string component and the expression (or replacement, ``{...}``) component. :pep:`498`
summarizes the syntactical part of "f-strings" as the following:

    In Python source code, an f-string is a literal string, prefixed with 'f', which
    contains expressions inside braces. The expressions are replaced with their values.

However, :pep:`498` also contained a formal list of exclusions on what
can or cannot be contained inside the expression component (primarily due to the
limitations of the existing parser). By clearly establishing the formal grammar, we
now also have the ability to define the expression component of an f-string as truly "any
applicable Python expression" (in that particular context) without being bound
by the limitations imposed by the details of our implementation.

The formalization effort and the premise above also have a significant benefit for
Python programmers due to their ability to simplify and eliminate obscure
limitations. This reduces the mental burden and the cognitive complexity of
f-string literals (as well as the Python language in general).

#. The expression component can include any string literal that a normal Python expression
   can include. This opens up the possibility of nesting string literals (formatted or
   not) inside the expression component of an f-string with the same quote type (and length)::

       >>> f"These are the things: {", ".join(things)}"

       >>> f"{source.removesuffix(".py")}.c: $(srcdir)/{source}"

       >>> f"{f"{f"infinite"}"}" + " " + f"{f"nesting!!!"}"

   This choice not only allows for a more consistent and predictable behavior of what can be
   placed in f-strings but also provides an intuitive way to manipulate string literals in a
   more flexible way without having to fight the limitations of the implementation.

#. Another issue that has felt unintuitive to most is the lack of support for backslashes
   within the expression component of an f-string. One example that keeps coming up is including
   a newline character in the expression part for joining containers. For example::

       >>> a = ["hello", "world"]
       >>> f"{'\n'.join(a)}"
         File "<stdin>", line 1
           f"{'\n'.join(a)}"
                            ^
       SyntaxError: f-string expression part cannot include a backslash

   A common workaround for this was to either assign the newline to an intermediate variable or
   pre-create the whole string prior to creating the f-string::

       >>> a = ["hello", "world"]
       >>> joined = '\n'.join(a)
       >>> f"{joined}"
       'hello\nworld'

   It only feels natural to allow backslashes in the expression part now that the new PEG parser
   can easily support it::

       >>> a = ["hello", "world"]
       >>> f"{'\n'.join(a)}"
       'hello\nworld'

#. Before the changes proposed in this document, there was no explicit limit on
   how deeply f-strings can be nested, but the fact that string quotes cannot be reused
   inside the expression component of f-strings made it impossible to nest
   f-strings arbitrarily. In fact, this is the most deeply nested f-string that can be
   written::

       >>> f"""{f'''{f'{f"{1+1}"}'}'''}"""
       '2'

   As this PEP allows placing **any** valid Python expression inside the
   expression component of the f-strings, it is now possible to reuse quotes and
   therefore it is possible to nest f-strings arbitrarily::

       >>> f"{f"{f"{f"{f"{f"{1+1}"}"}"}"}"}"
       '2'

   Although this is just a consequence of allowing arbitrary expressions, the
   authors of this PEP do not believe that this is a fundamental benefit and we
   have decided that the language specification will not explicitly mandate that
   this nesting can be arbitrary. This is because allowing arbitrarily-deep
   nesting imposes a lot of extra complexity on the lexer implementation
   (particularly as lexer/parser pipelines need to allow "untokenizing" to
   support the f-string debugging expressions, and this is especially taxing when
   arbitrary nesting is allowed). Implementations are therefore free to impose a
   limit on the nesting depth if they need to. Note that this is not an uncommon
   situation, as the CPython implementation already imposes several limits all
   over the place, including a limit on the nesting depth of parentheses and
   brackets, a limit on the nesting of blocks, a limit on the number of
   branches in ``if`` statements, a limit on the number of expressions in
   star-unpacking, etc.

Specification
=============

The formal proposed PEG grammar specification for f-strings is (see :pep:`617`
for details on the syntax):

.. code-block:: peg

    fstring
        | FSTRING_START fstring_middle* FSTRING_END
    fstring_middle
        | fstring_replacement_field
        | FSTRING_MIDDLE
    fstring_replacement_field
        | '{' (yield_expr | star_expressions) "="? [ "!" NAME ] [ ':' fstring_format_spec* ] '}'
    fstring_format_spec
        | FSTRING_MIDDLE
        | fstring_replacement_field

This PEP leaves the level of f-string nesting allowed up to the implementation.
This means that limiting nesting is **not part of the language specification**
but also that the language specification **doesn't mandate arbitrary nesting**.

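Note that ``fstring_format_spec`` in the grammar above can itself contain an ``fstring_replacement_field``; this is what makes parametrized format specifications work. A small illustrative sketch, runnable on current Python (the variable names are arbitrary):

```python
value = 12.34567
width = 10
precision = 4

# The format spec after ':' contains nested replacement fields:
# {width} and {precision} are evaluated first, yielding the spec '10.4f'.
result = f"{value:{width}.{precision}f}"
print(repr(result))  # '   12.3457'
```

The nested fields are ordinary expressions, so any expression yielding a valid spec fragment may appear there.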
Three new tokens are introduced:

* ``FSTRING_START``: This token includes the f-string prefix (``f``/``F``) and the opening quote(s).
* ``FSTRING_MIDDLE``: This token includes the text between the opening quote
  and the first expression brace (``{``) and the text between two expression braces (``}`` and ``{``).
* ``FSTRING_END``: This token includes everything after the last expression brace (or the whole literal part
  if no expression exists) until the closing quote.

These tokens are always string parts and they are semantically equivalent to the
``STRING`` token with the restrictions specified. These tokens must be produced by the lexer
when lexing f-strings. This means that **the tokenizer cannot produce a single token for f-strings anymore**.
How the lexer emits these tokens is **not specified**, as this will heavily depend on every
implementation (even the Python version of the lexer in the standard library is
implemented differently from the one used by the PEG parser).

As an example::

    f'some words {a+b} more words {c+d} final words'

will be tokenized as::

    FSTRING_START - "f'"
    FSTRING_MIDDLE - 'some words '
    LBRACE - '{'
    NAME - 'a'
    PLUS - '+'
    NAME - 'b'
    RBRACE - '}'
    FSTRING_MIDDLE - ' more words '
    LBRACE - '{'
    NAME - 'c'
    PLUS - '+'
    NAME - 'd'
    RBRACE - '}'
    FSTRING_END - ' final words' (without the end quote)

while ``f"""some words"""`` will be tokenized simply as::

    FSTRING_START - 'f"""'
    FSTRING_END - 'some words'

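On a CPython where this PEP has been implemented (3.12+), the new tokens can be observed with the standard-library ``tokenize`` module; on earlier versions the whole f-string comes back as a single ``STRING`` token. The exact token boundaries (for instance whether trailing text lands in ``FSTRING_MIDDLE`` or ``FSTRING_END``) are an implementation detail, as the PEP notes above. A minimal probe:

```python
import io
import tokenize

source = "f'some words {a+b} more words'\n"

# Token names as produced by the pure-Python tokenizer. On 3.12+ this
# list starts with FSTRING_START and contains FSTRING_MIDDLE tokens;
# on earlier versions the f-string is one STRING token.
names = [
    tokenize.tok_name[tok.type]
    for tok in tokenize.generate_tokens(io.StringIO(source).readline)
]
print(names)
```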
All restrictions mentioned in the PEP are lifted from f-literals, as explained below:

* Expression portions may now contain strings delimited with the same kind of
  quote that is used to delimit the f-literal.
* Backslashes may now appear within expressions just like anywhere else in
  Python code. In case of strings nested within f-literals, escape sequences are
  expanded when the innermost string is evaluated.
* Comments, using the ``#`` character, are possible only in multi-line f-literals,
  since comments are terminated by the end of the line (which makes closing a
  single-line f-literal impossible).

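The lifted restrictions can be probed at runtime by compiling the new constructs from source strings, so the probe itself stays importable on older interpreters (a sketch; the snippets are hypothetical examples, and on pre-3.12 parsers each one raises ``SyntaxError``):

```python
# Sources using constructs legalized by this PEP. They are kept as
# strings so this file parses on any Python version; whether they
# compile depends on the interpreter's parser.
quote_reuse = '''f"{", ".join(["a", "b"])}"'''
backslash = r'''f"{'\n'.join(['a', 'b'])}"'''
comment = 'f"""{ 1 + 1  # a comment\n}"""'

for name, src in [("quote reuse", quote_reuse),
                  ("backslash", backslash),
                  ("comment", comment)]:
    try:
        print(name, "->", repr(eval(src)))
    except SyntaxError:
        print(name, "-> SyntaxError (parser predates this PEP)")
```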
Backwards Compatibility
=======================

This PEP is backwards compatible: any valid Python code will continue to
be valid if this PEP is implemented, and it will not change semantically.

How to Teach This
=================

As the concept of f-strings is already ubiquitous in the Python community, there is
no fundamental need for users to learn anything new. However, as the formalized grammar
allows some new possibilities, it is important that the formal grammar is added to the
documentation and explained in detail, explicitly mentioning what constructs are possible,
since this PEP aims to avoid confusion.

It is also beneficial to provide users with a simple framework for understanding what can
be placed inside an f-string expression. In this case the authors think that this work will
make it even simpler to explain this aspect of the language, since it can be summarized as:

    You can place any valid Python expression inside an f-string expression.

With the changes in this PEP, there is no need to clarify that string quotes are
limited to being different from the quotes of the enclosing string, because this is
now allowed: as an arbitrary Python string can contain any possible choice of
quotes, so can any f-string expression. Additionally, there is no need to clarify
that certain things, such as comments, newline characters or backslashes, are
not allowed in the expression part because of implementation restrictions.

The only "surprising" difference is that, as f-strings allow specifying a
format, expressions that contain a ``:`` character at the top level still need to be
enclosed in parentheses. This is not new to this work, but it is important to
emphasize that this restriction is still in place. This allows for an easier
modification of the summary:

    You can place any valid Python expression inside
    an f-string expression, and everything after a ``:`` character at the top level will
    be identified as a format specification.

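The top-level ``:`` rule can be demonstrated on any current Python (the names below are arbitrary): a colon nested inside brackets or parentheses does not start a format specification, while a bare top-level colon does.

```python
a = [10, 20, 30, 40]

# Colon inside brackets: a slice, not a format spec.
print(f"{a[1:3]}")    # [20, 30]

# Top-level colon: everything after it is the format spec.
print(f"{a[1]:>6}")   # right-aligned in a 6-character field

# Expressions with their own top-level colon (e.g. lambdas)
# must be parenthesized.
print(f"{(lambda x: x * 2)(4)}")
```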
Reference Implementation
========================

A reference implementation can be found in the implementation_ fork.

Rejected Ideas
==============

#. We have decided not to lift the restriction that expression portions
   containing a top-level ``':'`` or ``'!'`` need to be wrapped in
   parentheses, e.g.::

       >>> f'Useless use of lambdas: { lambda x: x*2 }'
       SyntaxError: unexpected EOF while parsing

   The reason is that this would introduce a considerable amount of
   complexity for no real benefit. This is due to the fact that the ``:`` character
   normally separates the f-string format specification. This format specification
   is currently tokenized as a string. As the tokenizer MUST tokenize what's on the
   right of the ``:`` as either a string or a stream of tokens, this won't allow the
   parser to differentiate between the different semantics, as that would require the
   tokenizer to backtrack and produce a different set of tokens (that is, first try
   as a stream of tokens, and if that fails, try as a string for a format specifier).

   As there is no fundamental advantage in being able to allow lambdas and similar
   expressions at the top level, we have decided to keep the restriction that these must
   be parenthesized if needed::

       >>> f'Useless use of lambdas: { (lambda x: x*2) }'

Open Issues
===========

None yet

Footnotes
=========

.. _official Python grammar: https://docs.python.org/3/reference/lexical_analysis.html#formatted-string-literals

.. _none of this work was ever implemented: https://mail.python.org/archives/list/python-dev@python.org/thread/N43O4KNLZW4U7YZC4NVPCETZIVRDUVU2/#NM2A37THVIXXEYR4J5ZPTNLXGGUNFRLZ

.. _such as PyPy: https://foss.heptapod.net/pypy/pypy/-/commit/fe120f89bf07e64a41de62b224e4a3d80e0fe0d4/pipelines?ref=branch%2Fpy3.9

.. _discussed on Python-Dev: https://mail.python.org/archives/list/python-dev@python.org/thread/54N3MOYVBDSJQZTU6MTCPLUPIFSDN5IS/#SAYU6SMP4KT7G7AQ6WVQYUDOSZPKHJMS

.. _presented at the Python Language Summit 2022: https://pyfound.blogspot.com/2022/05/the-2022-python-language-summit-f.html

.. _per Wikipedia: https://en.wikipedia.org/wiki/String_interpolation#Examples

.. _implementation: https://github.com/we-like-parsers/cpython/tree/fstring-grammar

Copyright
=========

This document is placed in the public domain or under the
CC0-1.0-Universal license, whichever is more permissive.