PEP: 701
Title: Syntactic formalization of f-strings
Author: Pablo Galindo <pablogsal@python.org>,
        Batuhan Taskaya <batuhan@python.org>,
        Lysandros Nikolaou <lisandrosnik@gmail.com>
Discussions-To:
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 15-Nov-2022
Python-Version: 3.12

Abstract
========

This document proposes to lift some of the restrictions originally formulated in
:pep:`498` and to provide a formalized grammar for f-strings that can be
integrated into the parser directly. The proposed syntactic formalization of
f-strings will have some small side-effects on how f-strings are parsed and
interpreted, allowing for a considerable number of advantages for end users and
library developers, while also dramatically reducing the maintenance cost of
the code dedicated to parsing f-strings.

Motivation
==========

When f-strings were originally introduced in :pep:`498`, the specification was
provided without a formal grammar for f-strings. Additionally, the
specification contains several restrictions that were imposed so that the parsing of
f-strings could be implemented into CPython without modifying the existing
lexer. These limitations have been recognized previously and attempts
have been made to lift them in :pep:`536`, but `none of this work was ever implemented`_.
Some of these limitations (collected originally by :pep:`536`) are:

#. It is impossible to use the quote character delimiting the f-string
   within the expression portion::

      >>> f'Magic wand: { bag['wand'] }'
                              ^
      SyntaxError: invalid syntax

#. A previously considered way around it would lead to escape sequences
   in executed code and is prohibited in f-strings::

      >>> f'Magic wand { bag[\'wand\'] } string'
      SyntaxError: f-string expression portion cannot include a backslash

#. Comments are forbidden even in multi-line f-strings::

      >>> f'''A complex trick: {
      ... bag['bag']  # recursive bags!
      ... }'''
      SyntaxError: f-string expression part cannot include '#'

#. Arbitrary nesting of expressions without expansion of escape sequences is
   available in every single other language employing a string interpolation
   method that uses expressions instead of just variable names, `per Wikipedia`_.

These limitations serve no purpose from a language user perspective and
can be lifted by giving f-literals a regular grammar without exceptions
and implementing it using dedicated parse code.

The other issue that f-strings have is that the current implementation in
CPython relies on tokenising f-strings as ``STRING`` tokens and a post-processing of
these tokens. This has the following problems:

#. It adds a considerable maintenance cost to the CPython parser. This is because
   the parsing code needs to be written by hand, which has historically led to a
   considerable number of inconsistencies and bugs. Writing and maintaining parsing
   code by hand in C has always been considered error prone and dangerous as it needs
   to deal with a lot of manual memory management over the original lexer buffers.

#. The f-string parsing code is not able to use the new improved error message mechanisms
   that the new PEG parser, originally introduced in :pep:`617`, has allowed. The
   improvements that these error messages brought have been greatly celebrated but
   unfortunately f-strings cannot benefit from them because they are parsed in a
   separate piece of the parsing machinery. This is especially unfortunate, since
   there are several syntactical features of f-strings that can be confusing due
   to the different implicit tokenization that happens inside the expression
   part (for instance ``f"{y:=3}"`` is not an assignment expression; see the
   example after this list).

#. Other Python implementations have no way to know if they have implemented
   f-strings correctly because, contrary to other language features, they are not
   part of the :ref:`official Python grammar <f-strings>`.
   This is important because several prominent
   alternative implementations are using CPython's PEG parser, `such as PyPy`_,
   and/or are basing their grammars on the official PEG grammar. The
   fact that f-strings use a separate parser prevents these alternative implementations
   from leveraging the official grammar and benefiting from improvements in error messages derived
   from the grammar.
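
To illustrate the tokenization confusion mentioned in the second point above,
consider the following example (added here purely for illustration; both
snippets behave the same before and after this PEP)::

   >>> y = 10
   >>> f"{y:=3}"       # the top-level ":" starts a format spec, so "=3" pads y to width 3
   ' 10'
   >>> f"{(y := 3)}"   # only when parenthesized is this an assignment expression
   '3'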

A version of this proposal was originally `discussed on Python-Dev`_ and
`presented at the Python Language Summit 2022`_ where it was enthusiastically
received.

Rationale
=========

By building on top of the new Python PEG Parser (:pep:`617`), this PEP proposes
to redefine “f-strings”, especially emphasizing the clear separation of the
string component and the expression (or replacement, ``{...}``) component. :pep:`498`
summarizes the syntactical part of “f-strings” as the following:

   In Python source code, an f-string is a literal string, prefixed with ‘f’, which
   contains expressions inside braces. The expressions are replaced with their values.

However, :pep:`498` also contained a formal list of exclusions on what
can or cannot be contained inside the expression component (primarily due to the
limitations of the existing parser). By clearly establishing the formal grammar, we
now also have the ability to define the expression component of an f-string as truly "any
applicable Python expression" (in that particular context) without being bound
by the limitations imposed by the details of our implementation.

The formalization effort and the premise above also have a significant benefit for
Python programmers, as they simplify and eliminate obscure
limitations. This reduces the mental burden and the cognitive complexity of
f-string literals (as well as of the Python language in general).

#. The expression component can include any string literal that a normal Python expression
   can include. This opens up the possibility of nesting string literals (formatted or
   not) inside the expression component of an f-string with the same quote type (and length)::

      >>> f"These are the things: {", ".join(things)}"

      >>> f"{source.removesuffix(".py")}.c: $(srcdir)/{source}"

      >>> f"{f"{f"infinite"}"}" + " " + f"{f"nesting!!!"}"

   This choice not only allows for a more consistent and predictable behavior of what can be
   placed in f-strings but provides an intuitive way to manipulate string literals in a
   more flexible way without having to fight the limitations of the implementation.

#. Another issue that has felt unintuitive to most users is the lack of support for backslashes
   within the expression component of an f-string. One example that keeps coming up is including
   a newline character in the expression part for joining containers. For example::

      >>> a = ["hello", "world"]
      >>> f"{'\n'.join(a)}"
        File "<stdin>", line 1
          f"{'\n'.join(a)}"
                          ^
      SyntaxError: f-string expression part cannot include a backslash

   A common work-around for this was to either assign the newline to an intermediate variable or
   pre-create the whole string prior to creating the f-string::

      >>> a = ["hello", "world"]
      >>> joined = '\n'.join(a)
      >>> f"{joined}"
      'hello\nworld'

   It only feels natural to allow backslashes in the expression part now that the new PEG parser
   can easily support it::

      >>> a = ["hello", "world"]
      >>> f"{'\n'.join(a)}"
      'hello\nworld'

#. Before the changes proposed in this document, there was no explicit limit on
   how f-strings can be nested, but the fact that string quotes cannot be reused
   inside the expression component of f-strings made it impossible to nest
   f-strings arbitrarily. In fact, this is the most nested f-string that can be
   written::

      >>> f"""{f'''{f'{f"{1+1}"}'}'''}"""
      '2'

   As this PEP allows placing **any** valid Python expression inside the
   expression component of the f-strings, it is now possible to reuse quotes and
   it is therefore possible to nest f-strings arbitrarily::

      >>> f"{f"{f"{f"{f"{f"{1+1}"}"}"}"}"}"
      '2'

   Although this is just a consequence of allowing arbitrary expressions, the
   authors of this PEP do not believe that this is a fundamental benefit and we
   have decided that the language specification will not explicitly mandate that
   this nesting can be arbitrary. This is because allowing arbitrarily-deep
   nesting imposes a lot of extra complexity on the lexer implementation
   (particularly as lexer/parser pipelines need to allow "untokenizing" to
   support the f-string debugging expressions, and this is especially taxing when
   arbitrary nesting is allowed). Implementations are therefore free to impose a
   limit on the nesting depth if they need to. Note that this is not an uncommon
   situation, as the CPython implementation already imposes several limits all
   over the place, including a limit on the nesting depth of parentheses and
   brackets, a limit on the nesting of blocks, a limit on the number of
   branches in ``if`` statements, a limit on the number of expressions in
   star-unpacking, etc.

Specification
=============

The formal proposed PEG grammar specification for f-strings is (see :pep:`617`
for details on the syntax):

.. code-block:: peg

   fstring
      | FSTRING_START fstring_middle* FSTRING_END
   fstring_middle
      | fstring_replacement_field
      | FSTRING_MIDDLE
   fstring_replacement_field
      | '{' (yield_expr | star_expressions) "="? [ "!" NAME ] [ ':' fstring_format_spec* ] '}'
   fstring_format_spec:
      | FSTRING_MIDDLE
      | fstring_replacement_field
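
As an informal illustration (added here for clarity and not part of the grammar
itself), a single replacement field can exercise the optional ``"="``, ``"!"``
and ``':'`` components of ``fstring_replacement_field``::

   >>> name = "world"
   >>> f"{name=}"          # the optional "=" produces a self-documenting form
   "name='world'"
   >>> f"{name!r:>10}"     # a "!" conversion followed by a ':' format specification
   "   'world'"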

This PEP leaves up to the implementation the level of f-string nesting allowed.
This means that limiting nesting is **not part of the language specification**
but also the language specification **doesn't mandate arbitrary nesting**.

Three new tokens are introduced:

* ``FSTRING_START``: This token includes the f-string prefix character (``f``/``F``) and the opening quote(s).
* ``FSTRING_MIDDLE``: This token includes the text between the opening quote
  and the first expression brace (``{``) and the text between two expression braces (``}`` and ``{``).
* ``FSTRING_END``: This token includes everything after the last expression brace (or the whole literal part
  if no expression exists) until the closing quote.

These tokens are always string parts and they are semantically equivalent to the
``STRING`` token with the restrictions specified. These tokens must be produced by the lexer
when lexing f-strings. This means that **the tokenizer cannot produce a single token for f-strings anymore**. How
the lexer emits these tokens is **not specified**, as this will heavily depend on every
implementation (even the Python version of the lexer in the standard library is
implemented differently from the one used by the PEG parser).

As an example::

    f'some words {a+b} more words {c+d} final words'

will be tokenized as::

    FSTRING_START - "f'"
    FSTRING_MIDDLE - 'some words '
    LBRACE - '{'
    NAME - 'a'
    PLUS - '+'
    NAME - 'b'
    RBRACE - '}'
    FSTRING_MIDDLE - ' more words '
    LBRACE - '{'
    NAME - 'c'
    PLUS - '+'
    NAME - 'd'
    RBRACE - '}'
    FSTRING_END - ' final words' (without the end quote)

while ``f"""some words"""`` will be tokenized simply as::

    FSTRING_START - 'f"""'
    FSTRING_END - 'some words'
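
The following is a minimal sketch (not part of the specification) of how such a
token stream could be inspected, assuming an implementation exposes the new
tokens through the standard ``tokenize`` module, as the reference implementation
is expected to do::

    import io
    import tokenize

    # Under the new grammar the tokenizer is expected to emit FSTRING_START,
    # FSTRING_MIDDLE and FSTRING_END instead of a single STRING token.
    source = "f'some words {a+b} more words {c+d} final words'"
    for token in tokenize.generate_tokens(io.StringIO(source).readline):
        print(tokenize.tok_name[token.type], repr(token.string))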

All the restrictions mentioned in the Motivation section are lifted from f-literals,
as explained below (and illustrated in the example following this list):

* Expression portions may now contain strings delimited with the same kind of
  quote that is used to delimit the f-literal.
* Backslashes may now appear within expressions just like anywhere else in
  Python code. In case of strings nested within f-literals, escape sequences are
  expanded when the innermost string is evaluated.
* Comments, using the ``#`` character, are possible only in multi-line f-literals,
  since comments are terminated by the end of the line (which makes closing a
  single-line f-literal impossible).
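
For example, the comment example from the Motivation section is now accepted,
since the f-literal spans multiple lines (the output shown is the behaviour
expected from the reference implementation)::

   >>> bag = {'bag': 'a bag'}
   >>> f'''A complex trick: {
   ... bag['bag']  # recursive bags!
   ... }'''
   'A complex trick: a bag'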

Backwards Compatibility
=======================

This PEP is backwards compatible: any valid Python code will continue to
be valid if this PEP is implemented and it will not change semantically.

How to Teach This
=================

As the concept of f-strings is already ubiquitous in the Python community, there is
no fundamental need for users to learn anything new. However, as the formalized grammar
allows some new possibilities, it is important that the formal grammar is added to the
documentation and explained in detail, explicitly mentioning which constructs are now
possible, since this PEP aims to avoid confusion.

It is also beneficial to provide users with a simple framework for understanding what can
be placed inside an f-string expression. In this case the authors think that this work will
make it even simpler to explain this aspect of the language, since it can be summarized as:

   You can place any valid Python expression inside an f-string expression.

With the changes in this PEP, there is no need to clarify that string quotes are
limited to be different from the quotes of the enclosing string, because this is
now allowed: as an arbitrary Python string can contain any possible choice of
quotes, so can any f-string expression. Additionally there is no need to clarify
that certain things are not allowed in the expression part because of
implementation restrictions, such as comments, newline characters or
backslashes.

The only "surprising" difference is that as f-strings allow specifying a
|
|||
|
format, expressions that allow a ``:`` character at the top level still need to be
|
|||
|
enclosed in parenthesis. This is not new to this work, but it is important to
|
|||
|
emphasize that this restriction is still in place. This allows for an easier
|
|||
|
modification of the summary:
|
|||
|
|
|||
|
You can place any valid Python expression inside
|
|||
|
an f-string expression, and everything after a ``:`` character at the top level will
|
|||
|
be identified as a format specification.
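
For example (an illustration added here for clarity; this behaviour is the same
before and after this PEP)::

   >>> value = 42
   >>> f"{value:>10}"                 # everything after the top-level ':' is a format spec
   '        42'
   >>> f"{(lambda x: x * 2)(value)}"  # a lambda must be parenthesized because of its ':'
   '84'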

Reference Implementation
========================

A reference implementation can be found in the implementation_ fork.

Rejected Ideas
==============

#. We have decided not to lift the restriction that some expression portions
   need to wrap ``':'`` and ``'!'`` in parentheses at the top level, e.g.::

      >>> f'Useless use of lambdas: { lambda x: x*2 }'
      SyntaxError: unexpected EOF while parsing

   The reason is that this would introduce a considerable amount of
   complexity for no real benefit. This is due to the fact that the ``:`` character
   normally separates the f-string format specification. This format specification
   is currently tokenized as a string. As the tokenizer MUST tokenize what's on the
   right of the ``:`` as either a string or a stream of tokens, this won't allow the
   parser to differentiate between the different semantics as that would require the
   tokenizer to backtrack and produce a different set of tokens (that is, first try
   as a stream of tokens, and if it fails, try as a string for a format specifier).
   The example at the end of this item illustrates how everything after a
   top-level ``:`` is handed over as the format specification.

   As there is no fundamental advantage in being able to allow lambdas and similar
   expressions at the top level, we have decided to keep the restriction that these must
   be parenthesized if needed::

      >>> f'Useless use of lambdas: { (lambda x: x*2) }'
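
   To make the format-specification behaviour concrete, the following
   illustration (added here for clarity; it behaves the same before and after
   this PEP) shows that everything after the top-level ``:`` is passed as a
   plain string to the object's ``__format__`` method::

      >>> class Spec:
      ...     def __format__(self, spec):
      ...         return f"format spec was {spec!r}"
      ...
      >>> f"{Spec(): x * 2 }"
      "format spec was ' x * 2 '"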

Open Issues
===========

None yet


Footnotes
=========

.. _official Python grammar: https://docs.python.org/3/reference/lexical_analysis.html#formatted-string-literals

.. _none of this work was ever implemented: https://mail.python.org/archives/list/python-dev@python.org/thread/N43O4KNLZW4U7YZC4NVPCETZIVRDUVU2/#NM2A37THVIXXEYR4J5ZPTNLXGGUNFRLZ

.. _such as PyPy: https://foss.heptapod.net/pypy/pypy/-/commit/fe120f89bf07e64a41de62b224e4a3d80e0fe0d4/pipelines?ref=branch%2Fpy3.9

.. _discussed on Python-Dev: https://mail.python.org/archives/list/python-dev@python.org/thread/54N3MOYVBDSJQZTU6MTCPLUPIFSDN5IS/#SAYU6SMP4KT7G7AQ6WVQYUDOSZPKHJMS

.. _presented at the Python Language Summit 2022: https://pyfound.blogspot.com/2022/05/the-2022-python-language-summit-f.html

.. _per Wikipedia: https://en.wikipedia.org/wiki/String_interpolation#Examples

.. _implementation: https://github.com/we-like-parsers/cpython/tree/fstring-grammar

Copyright
=========

This document is placed in the public domain or under the
CC0-1.0-Universal license, whichever is more permissive.