PEP: 723
Title: Embedding pyproject.toml in single-file scripts
Author: Ofek Lev <ofekmeister@gmail.com>
Sponsor: Adam Turner <python@quite.org.uk>
PEP-Delegate: Brett Cannon <brett@python.org>
Discussions-To: https://discuss.python.org/t/31151
Status: Draft
Type: Standards Track
Topic: Packaging
Content-Type: text/x-rst
Created: 04-Aug-2023
Post-History: `04-Aug-2023 <https://discuss.python.org/t/30979>`__,
              `06-Aug-2023 <https://discuss.python.org/t/31151>`__
Replaces: 722


Abstract
========

This PEP specifies a metadata format that can be embedded in single-file Python
scripts to assist launchers, IDEs and other external tools which may need to
interact with such scripts.


Motivation
==========

Python is routinely used as a scripting language, with Python scripts as a
(better) alternative to shell scripts, batch files, etc. When Python code is
structured as a script, it is usually stored as a single file and does not
expect the availability of any other local code that may be used for imports.
As such, it is possible to share with others over arbitrary text-based means
such as email, a URL to the script, or even a chat window. Code that is
structured like this may live as a single file forever, never becoming a
full-fledged project with its own directory and ``pyproject.toml`` file.

An issue that users encounter with this approach is that there is no standard
mechanism to define metadata for tools whose job it is to execute such scripts.
For example, a tool that runs a script may need to know which dependencies are
required or the supported version(s) of Python.

There is currently no standard tool that addresses this issue, and this PEP
does *not* attempt to define one. However, any tool that *does* address this
issue will need to know what the runtime requirements of scripts are. By
defining a standard format for storing such metadata, existing tools, as well
as any future tools, will be able to obtain that information without requiring
users to include tool-specific metadata in their scripts.


Rationale
=========

This PEP defines a mechanism for embedding metadata *within the script itself*,
and not in an external file.

We choose to follow the latest developments of other modern packaging
ecosystems (namely `Rust`__ and `Go`__) by embedding the existing
`metadata standard <pyproject metadata_>`_ that is used to describe
projects.

__ https://github.com/rust-lang/rfcs/blob/master/text/3424-cargo-script.md
__ https://github.com/erning/gorun

The format is intended to bridge the gap between different types of users
of Python. Knowledge of how to write project metadata will be directly
transferable to all use cases, whether writing a script or maintaining a
project that is distributed via PyPI. Additionally, users will benefit from
seamless interoperability with tools that are already familiar with the format.

One of the central themes we discovered from the recent
`packaging survey <https://discuss.python.org/t/22420>`__ is that users have
begun getting frustrated with the lack of unification regarding both tooling
and specs. Adding yet another way to define metadata, even for a currently
unsatisfied use case, would further fragment the community.

One use case that this PEP supports, and that other formats may preclude, is
transitioning a script to a directory-type project. A user may be rapidly
prototyping locally or in a remote REPL environment and then decide to
transition to a more formal project if their idea works out. This intermediate
script stage is also very useful for producing fully reproducible bug reports.
By using the same metadata format, the user can simply copy and paste the
metadata into a ``pyproject.toml`` file and continue working without having to
learn a new format. More likely still, tooling will eventually support this
transformation with a single command.
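
As a rough illustration of such a transformation, the sketch below copies the
embedded document into a ``pyproject.toml`` file. It assumes the
``__pyproject__`` format defined in the Specification section; the
``export_pyproject`` name and its arguments are hypothetical and not part of
this PEP.

.. code:: python

    import re
    from pathlib import Path

    # Same pattern as the regular expression in the Reference Implementation.
    EMBEDDED = re.compile(r'(?ms)^__pyproject__ *= *"""\\?$(.+?)^"""$')


    def export_pyproject(script: Path, project_dir: Path) -> Path:
        """Write the metadata embedded in ``script`` to a ``pyproject.toml``."""
        match = EMBEDDED.search(script.read_text(encoding="utf-8"))
        if match is None:
            raise ValueError(f"{script} has no embedded metadata")

        project_dir.mkdir(parents=True, exist_ok=True)
        target = project_dir / "pyproject.toml"
        # The captured text starts with the newline after the opening quotes.
        target.write_text(match.group(1).lstrip("\n"), encoding="utf-8")
        return target

Fields that this PEP makes optional, such as ``name`` and ``version``, would
still need to be supplied before the result is a complete project.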


Specification
=============

Any Python script may assign a variable named ``__pyproject__`` to a multi-line
*double-quoted* string (``"""``) containing a valid TOML document. The opening
of the string MUST be on the same line as the assignment. The closing of the
string MUST be on a line by itself, and MUST NOT be indented.

The TOML document MUST NOT contain multi-line double-quoted strings, as that
would conflict with the Python string containing the document. Single-quoted
multi-line TOML strings may be used instead.
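
To make the quoting rule concrete, the following sketch builds a small script
whose embedded document uses a single-quoted multi-line TOML string and parses
it with ``tomllib`` (Python 3.11 or higher). The script contents and the
``description`` value are made up for illustration.

.. code:: python

    import re
    import tomllib

    # Built line by line to keep the nested quoting readable.
    script = "\n".join([
        '__pyproject__ = """',
        "[project]",
        "description = '''",
        "Fetches the PEP index.",
        "Prints the first ten titles.",
        "'''",
        '"""',
        "",
    ])

    match = re.search(r'(?ms)^__pyproject__ *= *"""\\?$(.+?)^"""$', script)
    metadata = tomllib.loads(match.group(1))
    # TOML trims the newline that directly follows the opening quotes.
    description = metadata["project"]["description"]

Because the TOML string is delimited by ``'''``, it never contains the ``"""``
sequence that would terminate the surrounding Python string.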

Tools reading embedded metadata MAY respect the standard Python encoding
declaration. If they choose not to do so, they MUST process the file as UTF-8.
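
One possible way to implement both options is sketched below. The
``read_script_text`` name and its ``respect_declaration`` flag are
hypothetical; ``tokenize.open`` is the standard library helper that honors an
encoding declaration.

.. code:: python

    import tokenize


    def read_script_text(path: str, respect_declaration: bool = True) -> str:
        """Return the text of a script for metadata extraction."""
        if respect_declaration:
            # tokenize.open() detects an encoding declaration or BOM and
            # falls back to UTF-8 when neither is present.
            with tokenize.open(path) as f:
                return f.read()
        # Tools that ignore the declaration process the file as UTF-8.
        with open(path, encoding="utf-8") as f:
            return f.read()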

This document MAY include the ``[project]`` and ``[tool]`` tables but MUST NOT
define the ``[build-system]`` table. The ``[build-system]`` table MAY be
allowed in a future PEP that standardizes how backends are to build
distributions from single file scripts.

The ``[project]`` table differs from the
`project metadata standard <pyproject metadata_>`_ in the following ways:

* The ``name`` and ``version`` fields are not required and MAY be defined
  dynamically by tools if the user does not define them.
* These fields do not need to be listed in the ``dynamic`` array.
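
A tool might apply the table rules above roughly as follows. This is only a
sketch: the ``normalize`` function is hypothetical, and the default values it
picks (the file stem as the name and ``0.0.0`` as the version) are assumptions,
since this PEP leaves that choice to tools.

.. code:: python

    from pathlib import Path


    def normalize(metadata: dict, script: Path) -> dict:
        """Validate embedded metadata and fill in optional ``[project]`` fields."""
        if "build-system" in metadata:
            raise ValueError("embedded metadata must not define [build-system]")

        project = dict(metadata.get("project", {}))
        # Hypothetical defaults; tools are free to choose other values.
        project.setdefault("name", script.stem)
        project.setdefault("version", "0.0.0")
        return {**metadata, "project": project}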

Tools that do not run scripts MAY choose to read from their expected
``[tool]`` sub-table. If a single-file script is not the sole input to a tool,
then behavior SHOULD NOT be altered based on the embedded metadata. For
example, if a linter is invoked with the path to a directory, it SHOULD behave
the same as if no files had embedded metadata.
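
The "sole input" rule could be checked along these lines. Both helper names
and the ``demo-linter`` table mentioned afterwards are hypothetical, and the
sketch assumes the tool receives its inputs as a list of paths.

.. code:: python

    from pathlib import Path


    def uses_embedded_metadata(inputs: list[str]) -> bool:
        """Return True only when a single script file is the sole input."""
        return len(inputs) == 1 and Path(inputs[0]).is_file()


    def tool_table(metadata: dict, tool: str) -> dict:
        """Return the ``[tool.<tool>]`` sub-table, or an empty dict if absent."""
        return metadata.get("tool", {}).get(tool, {})

A linter invoked on a directory would skip the lookup entirely, while one
invoked on a single script could call ``tool_table(metadata, "demo-linter")``
after parsing the embedded document.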

Example
-------

The following is an example of a script with an embedded ``pyproject.toml``:

.. code:: python

    __pyproject__ = """
    [project]
    requires-python = ">=3.11"
    dependencies = [
      "requests<3",
      "rich",
    ]
    """

    import requests
    from rich.pretty import pprint

    resp = requests.get("https://peps.python.org/api/peps.json")
    data = resp.json()
    pprint([(k, v["title"]) for k, v in data.items()][:10])

The following is an example of a single-file Rust project that embeds its
equivalent of ``pyproject.toml``, which is called ``Cargo.toml``:

.. code:: rust

    #!/usr/bin/env cargo

    //! ```cargo
    //! [dependencies]
    //! regex = "1.8.0"
    //! ```

    use regex::Regex;

    fn main() {
        let re = Regex::new(r"^\d{4}-\d{2}-\d{2}$").unwrap();
        println!("Did our date match? {}", re.is_match("2014-01-01"));
    }

One important thing to note is that the metadata is embedded in a comment
mostly for introspection since Rust documentation is generated from comments.
Another is that users rarely edit dependencies manually, but rather use the
Cargo package manager.

We argue that our choice, in comparison to the Rust format, is easier to read
and easier for humans to edit because the contents start at the beginning of
lines and so precisely match the contents of a ``pyproject.toml`` file. It is
also easier for tools to parse and modify this continuous block of text, which
was `one of the concerns`__ raised in the Rust pre-RFC.

__ https://github.com/epage/cargo-script-mvs/blob/main/0000-cargo-script.md#embedded-manifest-format

Reference Implementation
========================

This regular expression may be used to parse the metadata:

.. code:: text

   (?ms)^__pyproject__ *= *"""\\?$(.+?)^"""$

In circumstances where there is a discrepancy between the regular expression
and the text specification, the text specification takes precedence.

The following is an example of how to read the metadata on Python 3.11 or
higher.

.. code:: python

    import re, tomllib

    def read(script: str) -> dict | None:
        match = re.search(r'(?ms)^__pyproject__ *= *"""\\?$(.+?)^"""$', script)
        return tomllib.loads(match.group(1)) if match else None
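
As a usage sketch, the snippet below repeats the ``read`` function so that it
is self-contained and applies it to two made-up scripts: one that follows the
specification and one whose closing quotes are indented, which the regular
expression does not match.

.. code:: python

    import re, tomllib

    def read(script: str) -> dict | None:
        match = re.search(r'(?ms)^__pyproject__ *= *"""\\?$(.+?)^"""$', script)
        return tomllib.loads(match.group(1)) if match else None

    valid = '__pyproject__ = """\n[project]\ndependencies = ["rich"]\n"""\n'
    indented = '__pyproject__ = """\n[project]\n    """\n'

    print(read(valid))     # {'project': {'dependencies': ['rich']}}
    print(read(indented))  # None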

Tools such as package managers or dependency update automation in CI often
edit dependencies. The following is a crude example of modifying the content
using the ``tomlkit`` library.

.. code:: python

    import re, tomlkit

    def add(script: str, dependency: str) -> str:
        match = re.search(r'(?ms)^__pyproject__ *= *"""\\?$(.+?)^"""$', script)
        if match is None:
            raise ValueError('script has no embedded metadata')

        config = tomlkit.parse(match.group(1))
        config['project']['dependencies'].append(dependency)

        # Replace only the TOML document and keep the rest of the script as is.
        start, end = match.span(1)
        return script[:start] + tomlkit.dumps(config) + script[end:]

Note that this example uses a library that preserves TOML formatting. This is
by no means a requirement for editing but rather a "nice to have", especially
since embedded comments are unlikely.


Backwards Compatibility
=======================

At the time of writing, the ``__pyproject__`` variable only appears five times
`on GitHub`__ and four of those belong to a user who appears to already be
using this PEP's exact format.

__ https://github.com/search?q=__pyproject__&type=code

For example, `this script`__ uses ``matplotlib`` and ``pandas`` to plot a
timeseries. It is a good example of a script that you would see in the wild:
self-contained and short.

__ https://github.com/cjolowicz/scripts/blob/31c61e7dad8d17e0070b080abee68f4f505da211/python/plot_timeseries.py

This user's tooling invokes scripts by creating a project at runtime using the
embedded metadata and then uses an entry point that references the main
function.

This PEP allows this user's tooling to remove that extra step of indirection.

After writing a draft, this PEP's author discovered through private messages
that this pattern is also used in the wild by others.


Security Implications
=====================

If a script containing embedded metadata is run using a tool that automatically
installs dependencies, this could cause arbitrary code to be downloaded and
installed in the user's environment.

The risk here is part of the functionality of the tool being used to run the
script, and as such should already be addressed by the tool itself. The only
additional risk introduced by this PEP arises when an untrusted script with
embedded metadata is run, in which case a potentially malicious dependency
might be installed. This risk is addressed by the normal good practice of
reviewing code before running it.


How to Teach This
=================

Since the format chosen is the same as the official metadata standard, we can
have a page that describes how to embed the metadata in scripts. To learn
about the metadata itself, we can direct users to the living document that
describes `project metadata <pyproject metadata_>`_.

We will document that the name and version fields in the ``[project]`` table
may be elided for simplicity. Additionally, we will have guidance (perhaps
temporary) explaining that single-file scripts cannot be built into a wheel
and therefore you would never see the associated ``[build-system]`` metadata.

Finally, we may want to list some tools that support this PEP's format.


Recommendations
===============

Tools that support managing different versions of Python should attempt to use
the highest available version of Python that is compatible with the script's
``requires-python`` metadata, if defined.
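
The selection could be sketched as follows. This is a deliberately simplified,
standard-library-only illustration: it handles only plain comparison clauses on
numeric versions and does not pad versions (so ``==3.11`` would not match
``3.11.0``). A real tool would more likely rely on a full version specifier
implementation such as the third-party ``packaging`` library. The
``pick_interpreter`` name and the list of available versions are hypothetical.

.. code:: python

    import operator
    import re

    # Simplified subset of the version specifier operators.
    _OPS = {
        ">=": operator.ge,
        "<=": operator.le,
        "==": operator.eq,
        "!=": operator.ne,
        ">": operator.gt,
        "<": operator.lt,
    }
    _CLAUSE = re.compile(r"\s*(>=|<=|==|!=|>|<)\s*(\d+(?:\.\d+)*)\s*$")


    def _release(version: str) -> tuple[int, ...]:
        return tuple(int(part) for part in version.split("."))


    def _matches(version: str, requires_python: str) -> bool:
        candidate = _release(version)
        for clause in requires_python.split(","):
            parsed = _CLAUSE.match(clause)
            if parsed is None:
                raise ValueError(f"unsupported specifier clause: {clause!r}")
            op, bound = parsed.groups()
            if not _OPS[op](candidate, _release(bound)):
                return False
        return True


    def pick_interpreter(available: list[str], requires_python: str | None) -> str | None:
        """Return the highest compatible version, or None if there is none."""
        compatible = [
            version for version in available
            if requires_python is None or _matches(version, requires_python)
        ]
        return max(compatible, key=_release, default=None)

For example, with ``3.9.18``, ``3.11.4`` and ``3.12.1`` available, a script
declaring ``requires-python = ">=3.11"`` would be run with ``3.12.1``.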


Rejected Ideas
==============

Why not limit to specific metadata fields?
------------------------------------------

By limiting the metadata to a specific set of fields, for example just
``dependencies``, we would prevent legitimate use cases both known and unknown.
The following are examples of known use cases:

* ``requires-python``: For tools that support managing Python installations,
  this allows users to target specific versions of Python for new syntax
  or standard library functionality.
* ``version``: It is quite common to version scripts for persistence even when
  using a VCS like Git. When not using a VCS, versioning is even more common.
  For example, the author has been in multiple time-sensitive debugging
  sessions with customers where, due to the air-gapped nature of the
  environment, the only way to transfer the script was via email or copying
  and pasting it into a chat window. In these cases, versioning is invaluable
  to ensure that the customer is using the latest (or a specific) version of
  the script.
* ``description``: For scripts that don't need an argument parser, or if the
  author has never used one, tools can treat this as help text which can be
  shown to the user.

By not allowing the ``[tool]`` section, we would prevent tools, script runners
in particular, from allowing users to configure behavior. For example, a script
runner may support configuration that instructs it to run scripts in
containers, for situations in which there is no cross-platform support for a
dependency or in which the setup is too complex for the average user, such as
when Nvidia drivers are required. Support for situations like this would allow
users to proceed with what they want to do, whereas otherwise they may stop at
that point altogether.


Why not use a comment block resembling requirements.txt?
--------------------------------------------------------

This PEP considers there to be different types of users for whom Python code
would live as single-file scripts:

* Non-programmers who are just using Python as a scripting language to achieve
  a specific task. These users are unlikely to be familiar with concepts of
  operating systems like shebang lines or the ``PATH`` environment variable.
  Some examples:

  * The average person, perhaps at a workplace, who wants to write a script to
    automate something for efficiency or to reduce tedium
  * Someone doing data science or machine learning in industry or academia who
    wants to write a script to analyze some data or for research purposes.
    These users are special in that, although they have limited programming
    knowledge, they learn from sources like StackOverflow and blogs that have a
    programming bent and are increasingly likely to be part of communities that
    share knowledge and code. Therefore, a non-trivial number of these users
    will have some familiarity with things like Git(Hub), Jupyter, HuggingFace,
    etc.
* Non-programmers who manage operating systems e.g. a sysadmin. These users are
  able to set up ``PATH``, for example, but are unlikely to be familiar with
  Python concepts like virtual environments. These users often operate in
  isolation and have limited need to gain exposure to tools intended for
  sharing like Git.
* Programmers who manage operating systems/infrastructure e.g. SREs. These
  users are not very likely to be familiar with Python concepts like virtual
  environments, but are likely to be familiar with Git and most often use it
  to version control everything required to manage infrastructure like Python
  scripts and Kubernetes config.
* Programmers who write scripts primarily for themselves. These users over time
  accumulate a great number of scripts in various languages that they use to
  automate their workflow and often store them in a single directory that is
  potentially version controlled for persistence. Non-Windows users may set
  up each Python script with a shebang line pointing to the desired Python
  executable or script runner.

This PEP argues that reusing our TOML-based metadata format is the best for
each category of user and that the block comment is only approachable for
those who have familiarity with ``requirements.txt``, which represents a
small subset of users.

* For the average person automating a task or the data scientist, they are
  already starting with zero context and are unlikely to be familiar with
  either TOML or ``requirements.txt``. These users will very likely rely on
  snippets found online via a search engine or utilize AI in the form
  of a chat bot or direct code completion software. Searching for Python
  metadata formatting will lead them to the TOML-based format that already
  exists which they can reuse. The author tested GitHub Copilot with this
  PEP and it already supports auto-completion of fields and dependencies.
  In contrast, a new format may take years of being trained on the Internet
  for models to learn.

  Additionally, these users are most susceptible to formatting quirks and
  syntax errors. TOML is a well-defined format with existing online
  validators that features assignment that is compatible with Python
  expressions and has no strict indenting rules. The block comment format
  on the other hand could be easily malformed by forgetting the colon, for
  example, and debugging why it's not working with a search engine would be
  a difficult task for such a user.
* For the sysadmin types, they are as unlikely as the previously described
  users to be familiar with TOML or ``requirements.txt``. For either format
  they would have to read documentation. They would likely be more comfortable
  with TOML since they are used to structured data formats and there would be
  less perceived magic in their systems.

  Additionally, for maintenance of their systems ``__pyproject__`` would be
  much easier to search for from a shell than a block comment with potentially
  numerous extensions over time.
* For the SRE types, they are likely to be familiar with TOML already from
  other projects that they might have to work with like configuring the
  `GitLab Runner`__ or `Cloud Native Buildpacks`__.

  __ https://docs.gitlab.com/runner/configuration/advanced-configuration.html
  __ https://buildpacks.io/docs/reference/config/

  These users are responsible for the security of their systems and most likely
  have security scanners set up to automatically open PRs to update versions
  of dependencies. Such automated tools like Dependabot would have a much
  easier time using existing TOML libraries than writing their own custom
  parser for a block comment format.
* For the programmer types, they are more likely to be familiar with TOML
  than to have ever seen a ``requirements.txt`` file, unless they are a
  Python programmer who has had previous experience with writing applications.
  In the case of experience with the requirements format, it necessarily means
  that they are at least somewhat familiar with the ecosystem and therefore
  it is safe to assume they know what TOML is.

  Another benefit of this PEP to these users is that their IDEs like Visual
  Studio Code would be able to provide TOML syntax highlighting much more
  easily than if each had to write custom logic for this feature.

Additionally, the block comment format goes against the recommendation of
:pep:`8`:

    Each line of a block comment starts with a ``#`` and a single space (unless
    it is indented text inside the comment). [...] Paragraphs inside a block
    comment are separated by a line containing a single ``#``.

Linters and IDE auto-formatters that respect this long-time recommendation
would fail by default. The following uses the example from :pep:`722`:

.. code:: bash

    $ flake8 .
    .\script.py:3:1: E266 too many leading '#' for block comment
    .\script.py:4:1: E266 too many leading '#' for block comment
    .\script.py:5:1: E266 too many leading '#' for block comment


Why not consider scripts as projects without wheels?
----------------------------------------------------

There is `an ongoing discussion <pyproject without wheels_>`_ about how to
use ``pyproject.toml`` for projects that are not intended to be built as
wheels. This PEP considers the discussion only tangentially related.

The use case described in that thread is primarily talking about projects that
represent applications like a Django app or a Flask app. These projects are
often installed on each server in a virtual environment with strict dependency
pinning e.g. a lock file with some sort of hash checking. Such projects would
never be distributed as a wheel (except for maybe a transient editable one
that is created when doing ``pip install -e .``).

In contrast, scripts are managed loosely by their runners and would almost
always have relaxed dependency constraints. Additionally, to reduce
friction associated with managing small projects there may be a future
in which there is a standard prescribed way to ship projects that are in
the form of a single file. The author of the Rust RFC for embedding metadata
`mentioned to us <https://discuss.python.org/t/29905/179>`__ that they are
actively looking into that based on user feedback.

Why not just set up a Python project with a ``pyproject.toml``?
---------------------------------------------------------------

Again, a key issue here is that the target audience for this proposal is people
writing scripts which aren't intended for distribution. Sometimes scripts will
be "shared", but this is far more informal than "distribution" - it typically
involves sending a script via an email with some written instructions on how to
run it, or passing someone a link to a GitHub gist.

Expecting such users to learn the complexities of Python packaging is a
significant step up in complexity, and would almost certainly give the
impression that "Python is too hard for scripts".

In addition, if the expectation here is that the ``pyproject.toml`` will
somehow be designed for running scripts in place, that's a new feature of the
standard that doesn't currently exist. At a minimum, this isn't a reasonable
suggestion until the `current discussion on Discourse
<pyproject without wheels_>`_ about using ``pyproject.toml`` for projects that
won't be distributed as wheels is resolved. And even then, it doesn't address
the "sending someone a script in a gist or email" use case.

Why not use a requirements file for dependencies?
-------------------------------------------------

Putting your requirements in a requirements file doesn't require a PEP. You
can do that right now, and in fact it's quite likely that many ad hoc solutions
do this. However, without a standard, there's no way of knowing how to locate a
script's dependency data. And furthermore, the requirements file format is
pip-specific, so tools relying on it are depending on a pip implementation
detail.

So in order to make a standard, two things would be required:

1. A standardised replacement for the requirements file format.
2. A standard for how to locate the requirements file for a given script.

The first item is a significant undertaking. It has been discussed on a number
of occasions, but so far no-one has attempted to actually do it. The most
likely approach would be for standards to be developed for individual use cases
currently addressed with requirements files. One option here would be for this
PEP to simply define a new file format which is simply a text file containing
:pep:`508` requirements, one per line. That would just leave the question of
how to locate that file.

The "obvious" solution here would be to do something like name the file the
same as the script, but with a ``.reqs`` extension (or something similar).
However, this still requires *two* files, where currently only a single file is
needed, and as such, does not match the "better batch file" model (shell
scripts and batch files are typically self-contained). It requires the
developer to remember to keep the two files together, and this may not always
be possible. For example, system administration policies may require that *all*
files in a certain directory are executable (the Linux filesystem standards
require this of ``/usr/bin``, for example). And some methods of sharing a
script (for example, publishing it on a text file sharing service like GitHub's
gist, or a corporate intranet) may not allow for deriving the location of an
associated requirements file from the script's location (tools like ``pipx``
support running a script directly from a URL, so "download and unpack a zip of
the script and its dependencies" may not be an appropriate requirement).

Essentially, though, the issue here is that there is an explicitly stated
requirement that the format supports storing dependency data *in the script
file itself*. Solutions that don't do that are simply ignoring that
requirement.

Why not use (possibly restricted) Python syntax?
------------------------------------------------

This would typically involve storing metadata as multiple special variables,
such as the following.

.. code:: python

    __requires_python__ = ">=3.11"
    __dependencies__ = [
        "requests",
        "click",
    ]

The most significant problem with this proposal is that it requires all
consumers of the dependency data to implement a Python parser. Even if the
syntax is restricted, the *rest* of the script will use the full Python syntax,
and trying to define a syntax which can be successfully parsed in isolation
from the surrounding code is likely to be extremely difficult and error-prone.

Furthermore, Python's syntax changes in every release. If extracting dependency
data needs a Python parser, the parser will need to know which version of
Python the script is written for, and the overhead for a generic tool of having
a parser that can handle *multiple* versions of Python is unsustainable.
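
To illustrate the cost, the sketch below extracts the ``__dependencies__``
variable from the example above with the standard library ``ast`` module. The
``extract_dependencies`` function is hypothetical. Note that the *whole* script
must parse under the interpreter running the tool, even though only one
assignment is of interest.

.. code:: python

    import ast


    def extract_dependencies(script: str) -> list[str] | None:
        """Return the value of ``__dependencies__`` if the script assigns it."""
        # Raises SyntaxError if any part of the script fails to parse with
        # the grammar of the interpreter running this tool.
        tree = ast.parse(script)
        for node in tree.body:
            if not isinstance(node, ast.Assign):
                continue
            for target in node.targets:
                if isinstance(target, ast.Name) and target.id == "__dependencies__":
                    return ast.literal_eval(node.value)
        return None

A script whose metadata is perfectly valid but whose body uses syntax that the
tool's interpreter does not accept, such as a Python 2 ``print`` statement,
cannot be processed at all with this approach.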

With this approach there is the potential to clutter scripts with many
variables as new extensions get added. Additionally, intuiting which metadata
fields correspond to which variable names would cause confusion for users.

It is worth noting, though, that the ``pip-run`` utility does implement (an
extended form of) this approach. `Further discussion <pip-run issue_>`_ of
the ``pip-run`` design is available on the project's issue tracker.

What about local dependencies?
------------------------------

These can be handled without needing special metadata and tooling, simply by
adding the location of the dependencies to ``sys.path``. This PEP simply isn't
needed for this case. If, on the other hand, the "local dependencies" are
actual distributions which are published locally, they can be specified as
usual with a :pep:`508` requirement, and the local package index specified when
running a tool by using the tool's UI for that.
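
For the ``sys.path`` case, a minimal sketch might look like the following. The
``add_local_dependencies`` helper is hypothetical, as is any directory layout
it is used with.

.. code:: python

    import sys
    from pathlib import Path


    def add_local_dependencies(directory: Path) -> None:
        """Make modules in ``directory`` importable by the running script."""
        location = str(directory.resolve())
        if location not in sys.path:
            # Prepend so that local modules take precedence.
            sys.path.insert(0, location)

A script could call this near the top, for instance with
``Path(__file__).parent / "libs"``, before importing its local modules.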

Open Issues
===========

None at this point.


References
==========

.. _pyproject metadata: https://packaging.python.org/en/latest/specifications/declaring-project-metadata/
.. _pip-run issue: https://github.com/jaraco/pip-run/issues/44
.. _pyproject without wheels: https://discuss.python.org/t/projects-that-arent-meant-to-generate-a-wheel-and-pyproject-toml/29684


Copyright
=========

This document is placed in the public domain or under the
CC0-1.0-Universal license, whichever is more permissive.