diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS index 9cd7f698f..286a782e1 100644 --- a/.github/CODEOWNERS +++ b/.github/CODEOWNERS @@ -600,6 +600,7 @@ pep-0719.rst @Yhg1s pep-0720.rst @FFY00 pep-0721.rst @encukou pep-0722.rst @pfmoore +pep-0723.rst @AA-Turner # ... # pep-0754.txt # ... diff --git a/pep-0723.rst b/pep-0723.rst new file mode 100644 index 000000000..cef338337 --- /dev/null +++ b/pep-0723.rst @@ -0,0 +1,572 @@ +PEP: 723 +Title: Embedding pyproject.toml in single-file scripts +Author: Ofek Lev +Sponsor: Adam Turner +PEP-Delegate: Brett Cannon +Discussions-To: https://discuss.python.org/t/31151 +Status: Draft +Type: Standards Track +Topic: Packaging +Content-Type: text/x-rst +Created: 04-Aug-2023 +Post-History: `04-Aug-2023 `__, + `06-Aug-2023 `__, +Replaces: 722 + + +Abstract +======== + +This PEP specifies a metadata format that can be embedded in single-file Python +scripts to assist launchers, IDEs and other external tools which may need to +interact with such scripts. + + +Motivation +========== + +Python is routinely used as a scripting language, with Python scripts as a +(better) alternative to shell scripts, batch files, etc. When Python code is +structured as a script, it is usually stored as a single file and does not +expect the availability of any other local code that may be used for imports. +As such, it is possible to share with others over arbitrary text-based means +such as email, a URL to the script, or even a chat window. Code that is +structured like this may live as a single file forever, never becoming a +full-fledged project with its own directory and ``pyproject.toml`` file. + +An issue that users encounter with this approach is that there is no standard +mechanism to define metadata for tools whose job it is to execute such scripts. +For example, a tool that runs a script may need to know which dependencies are +required or the supported version(s) of Python. + +There is currently no standard tool that addresses this issue, and this PEP +does *not* attempt to define one. However, any tool that *does* address this +issue will need to know what the runtime requirements of scripts are. By +defining a standard format for storing such metadata, existing tools, as well +as any future tools, will be able to obtain that information without requiring +users to include tool-specific metadata in their scripts. + + +Rationale +========= + +This PEP defines a mechanism for embedding metadata *within the script itself*, +and not in an external file. + +We choose to follow the latest developments of other modern packaging +ecosystems (namely `Rust`__ and `Go`__) by embedding the existing +`metadata standard `_ that is used to describe +projects. + +__ https://github.com/rust-lang/rfcs/blob/master/text/3424-cargo-script.md +__ https://github.com/erning/gorun + +The format is intended to bridge the gap between different types of users +of Python. Knowledge of how to write project metadata will be directly +transferable to all use cases, whether writing a script or maintaining a +project that is distributed via PyPI. Additionally, users will benefit from +seamless interoperability with tools that are already familiar with the format. + +One of the central themes we discovered from the recent +`packaging survey `__ is that users have +begun getting frustrated with the lack of unification regarding both tooling +and specs. Adding yet another way to define metadata, even for a currently +unsatisfied use case, would further fragment the community. + +A use case that this PEP wishes to support that other formats may preclude is +a script that desires to transition to a directory-type project. A user may +be rapidly prototyping locally or in a remote REPL environment and then decide +to transition to a more formal project if their idea works out. This +intermediate script stage would be very useful to have fully reproducible bug +reports. By using the same metadata format, the user can simply copy and paste +the metadata into a ``pyproject.toml`` file and continue working without having +to learn a new format. More likely, even, is that tooling will eventually +support this transformation with a single command. + + +Specification +============= + +Any Python script may assign a variable named ``__pyproject__`` to a multi-line +*double-quoted* string (``"""``) containing a valid TOML document. The opening +of the string MUST be on the same line as the assignment. The closing of the +string MUST be on a line by itself, and MUST NOT be indented. + +The TOML document MUST NOT contain multi-line double-quoted strings, as that +would conflict with the Python string containing the document. Single-quoted +multi-line TOML strings may be used instead. + +Tools reading embedded metadata MAY respect the standard Python encoding +declaration. If they choose not to do so, they MUST process the file as UTF-8. + +This document MAY include the ``[project]`` and ``[tool]`` tables but MUST NOT +define the ``[build-system]`` table. The ``[build-system]`` table MAY be +allowed in a future PEP that standardizes how backends are to build +distributions from single file scripts. + +The ``[project]`` table differs in the following ways: + +* The ``name`` and ``version`` fields are not required and MAY be defined + dynamically by tools if the user does not define them +* These fields do not need to be listed in the ``dynamic`` array + +Non-script running tools MAY choose to read from their expected ``[tool]`` +sub-table. If a single-file script is not the sole input to a tool then +behavior SHOULD NOT be altered based on the embedded metadata. For example, +if a linter is invoked with the path to a directory, it SHOULD behave the same +as if zero files had embedded metadata. + +Example +------- + +The following is an example of a script with an embedded ``pyproject.toml``: + +.. code:: python + + __pyproject__ = """ + [project] + requires-python = ">=3.11" + dependencies = [ + "requests<3", + "rich", + ] + """ + + import requests + from rich.pretty import pprint + + resp = requests.get("https://peps.python.org/api/peps.json") + data = resp.json() + pprint([(k, v["title"]) for k, v in data.items()][:10]) + +The following is an example of a single-file Rust project that embeds their +version of ``pyproject.toml``, which is called ``Cargo.toml``: + +.. code:: rust + + #!/usr/bin/env cargo + + //! ```cargo + //! [dependencies] + //! regex = "1.8.0" + //! ``` + + fn main() { + let re = Regex::new(r"^\d{4}-\d{2}-\d{2}$").unwrap(); + println!("Did our date match? {}", re.is_match("2014-01-01")); + } + +One important thing to note is that the metadata is embedded in a comment +mostly for introspection since Rust documentation is generated from comments. +Another is that users rarely edit dependencies manually, but rather use their +Cargo package manager. + +We argue that our choice, in comparison to the Rust format, is easier to read +and provides easier edits for humans by virtue of the contents starting at the +beginning of lines so would precisely match the contents of a +``pyproject.toml`` file. It is also is easier for tools to parse and modify +this continuous block of text which was `one of the concerns`__ raised in the +Rust pre-RFC. + +__ https://github.com/epage/cargo-script-mvs/blob/main/0000-cargo-script.md#embedded-manifest-format + +Reference Implementation +======================== + +This regular expression may be used to parse the metadata: + +.. code:: text + + (?ms)^__pyproject__ *= *"""\\?$(.+?)^"""$ + +In circumstances where there is a discrepancy between the regular expression +and the text specification, the text specification takes precedence. + +The following is an example of how to read the metadata on Python 3.11 or +higher. + +.. code:: python + + import re, tomllib + + def read(script: str) -> dict | None: + match = re.search(r'(?ms)^__pyproject__ *= *"""\\?$(.+?)^"""$', script) + return tomllib.loads(match.group(1)) if match else None + +Often tools will edit dependencies like package managers or dependency update +automation in CI. The following is a crude example of modifying the content +using the ``tomlkit`` library. + +.. code:: python + + import re, tomlkit + + def add(script: str, dependency: str) -> str: + match = re.search(r'(?ms)^__pyproject__ *= *"""\\?$(.+?)^"""$', script) + config = tomlkit.parse(match.group(1)) + config['project']['dependencies'].append(dependency) + + start, end = match.span(1) + return script[:start] + tomlkit.dumps(config) + script[end:] + +Note that this example used a library that preserves TOML formatting. This is +not a requirement for editing by any means but rather is a "nice to have" +especially since there are unlikely to be embedded comments. + + +Backwards Compatibility +======================= + +At the time of writing, the ``__pyproject__`` variable only appears five times +`on GitHub`__ and four of those belong to a user who appears to already be +using this PEP's exact format. + +__ https://github.com/search?q=__pyproject__&type=code + +For example, `this script`__ uses ``matplotlib`` and ``pandas`` to plot a +timeseries. It is a good example of a script that you would see in the wild: +self-contained and short. + +__ https://github.com/cjolowicz/scripts/blob/31c61e7dad8d17e0070b080abee68f4f505da211/python/plot_timeseries.py + +This user's tooling invokes scripts by creating a project at runtime using the +embedded metadata and then uses an entry point that references the main +function. + +This PEP allows this user's tooling to remove that extra step of indirection. + +This PEP's author has discovered after writing a draft that this pattern is +used in the wild by others (sent private messages). + + +Security Implications +===================== + +If a script containing embedded metadata is ran using a tool that automatically +installs dependencies, this could cause arbitrary code to be downloaded and +installed in the user's environment. + +The risk here is part of the functionality of the tool being used to run the +script, and as such should already be addressed by the tool itself. The only +additional risk introduced by this PEP is if an untrusted script with a +embedded metadata is run, when a potentially malicious dependency might be +installed. This risk is addressed by the normal good practice of reviewing code +before running it. + + +How to Teach This +================= + +Since the format chosen is the same as the official metadata standard, we can +have a page that describes how to embed the metadata in scripts and to learn +about metadata itself direct users to the living document that describes +`project metadata `_. + +We will document that the name and version fields in the ``[project]`` table +may be elided for simplicity. Additionally, we will have guidance (perhaps +temporary) explaining that single-file scripts cannot be built into a wheel +and therefore you would never see the associated ``[build-system]`` metadata. + +Finally, we may want to list some tools that support this PEP's format. + + +Recommendations +=============== + +Tools that support managing different versions of Python should attempt to use +the highest available version of Python that is compatible with the script's +``requires-python`` metadata, if defined. + + +Rejected Ideas +============== + +Why not limit to specific metadata fields? +------------------------------------------ + +By limiting the metadata to a specific set of fields, for example just +``dependencies``, we would prevent legitimate use cases both known and unknown. +The following are examples of known use cases: + +* ``requires-python``: For tools that support managing Python installations, + this allows users to target specific versions of Python for new syntax + or standard library functionality. +* ``version``: It is quite common to version scripts for persistence even when + using a VCS like Git. When not using a VCS it is even more common to version, + for example the author has been in multiple time sensitive debugging sessions + with customers where due to the airgapped nature of the environment, the only + way to transfer the script was via email or copying and pasting it into a + chat window. In these cases, versioning is invaluable to ensure that the + customer is using the latest (or a specific) version of the script. +* ``description``: For scripts that don't need an argument parser, or if the + author has never used one, tools can treat this as help text which can be + shown to the user. + +By not allowing the ``[tool]`` section, we would prevent especially script +runners from allowing users to configure behavior. For example, a script runner +may support configuration instructing to run scripts in containers for +situations in which there is no cross-platform support for a dependency or if +the setup is too complex for the average user like when requiring Nvidia +drivers. Situations like this would allow users to proceed with what they want +to do whereas otherwise they may stop at that point altogether. + + +Why not use a comment block resembling requirements.txt? +-------------------------------------------------------- + +This PEP considers there to be different types of users for whom Python code +would live as single-file scripts: + +* Non-programmers who are just using Python as a scripting language to achieve + a specific task. These users are unlikely to be familiar with concepts of + operating systems like shebang lines or the ``PATH`` environment variable. + Some examples: + + * The average person, perhaps at a workplace, who wants to write a script to + automate something for efficiency or to reduce tedium + * Someone doing data science or machine learning in industry or academia who + wants to write a script to analyze some data or for research purposes. + These users are special in that, although they have limited programming + knowledge, they learn from sources like StackOverflow and blogs that have a + programming bent and are increasingly likely to be part of communities that + share knowledge and code. Therefore, a non-trivial number of these users + will have some familiarity with things like Git(Hub), Jupyter, HuggingFace, + etc. +* Non-programmers who manage operating systems e.g. a sysadmin. These users are + able to set up ``PATH``, for example, but are unlikely to be familiar with + Python concepts like virtual environments. These users often operate in + isolation and have limited need to gain exposure to tools intended for + sharing like Git. +* Programmers who manage operating systems/infrastructure e.g. SREs. These + users are not very likely to be familiar with Python concepts like virtual + environments, but are likely to be familiar with Git and most often use it + to version control everything required to manage infrastructure like Python + scripts and Kubernetes config. +* Programmers who write scripts primarily for themselves. These users over time + accumulate a great number of scripts in various languages that they use to + automate their workflow and often store them in a single directory, that is + potentially version controlled for persistence. Non-Windows users may set + up each Python script with a shebang line pointing to the desired Python + executable or script runner. + +This PEP argues that reusing our TOML-based metadata format is the best for +each category of user and that the block comment is only approachable for +those who have familiarity with ``requirements.txt``, which represents a +small subset of users. + +* For the average person automating a task or the data scientist, they are + already starting with zero context and are unlikely to be familiar with + TOML nor ``requirements.txt``. These users will very likely rely on + snippets found online via a search engine or utilize AI in the form + of a chat bot or direct code completion software. Searching for Python + metadata formatting will lead them to the TOML-based format that already + exists which they can reuse. The author tested GitHub Copilot with this + PEP and it already supports auto-completion of fields and dependencies. + In contrast, a new format may take years of being trained on the Internet + for models to learn. + + Additionally, these users are most susceptible to formatting quirks and + syntax errors. TOML is a well-defined format with existing online + validators that features assignment that is compatible with Python + expressions and has no strict indenting rules. The block comment format + on the other hand could be easily malformed by forgetting the colon, for + example, and debugging why it's not working with a search engine would be + a difficult task for such a user. +* For the sysadmin types, they are equally unlikely as the previously described + users to be familiar with TOML or ``requirements.txt``. For either format + they would have to read documentation. They would likely be more comfortable + with TOML since they are used to structured data formats and there would be + less perceived magic in their systems. + + Additionally, for maintenance of their systems ``__pyproject__`` would be + much easier to search for from a shell than a block comment with potentially + numerous extensions over time. +* For the SRE types, they are likely to be familiar with TOML already from + other projects that they might have to work with like configuring the + `GitLab Runner`__ or `Cloud Native Buildpacks`__. + + __ https://docs.gitlab.com/runner/configuration/advanced-configuration.html + __ https://buildpacks.io/docs/reference/config/ + + These users are responsible for the security of their systems and most likely + have security scanners set up to automatically open PRs to update versions + of dependencies. Such automated tools like Dependabot would have a much + easier time using existing TOML libraries than writing their own custom + parser for a block comment format. +* For the programmer types, they are more likely to be familiar with TOML + than they have ever seen a ``requirements.txt`` file, unless they are a + Python programmer who has had previous experience with writing applications. + In the case of experience with the requirements format, it necessarily means + that they are at least somewhat familiar with the ecosystem and therefore + it is safe to assume they know what TOML is. + + Another benefit of this PEP to these users is that their IDEs like Visual + Studio Code would be able to provide TOML syntax highlighting much more + easily than each writing custom logic for this feature. + +Additionally, the block comment format goes against the recommendation of +:pep:`8`: + + Each line of a block comment starts with a ``#`` and a single space (unless + it is indented text inside the comment). [...] Paragraphs inside a block + comment are separated by a line containing a single ``#``. + +Linters and IDE auto-formatters that respect this long-time recommendation +would fail by default. The following uses the example from :pep:`722`: + +.. code:: bash + + $ flake8 . + .\script.py:3:1: E266 too many leading '#' for block comment + .\script.py:4:1: E266 too many leading '#' for block comment + .\script.py:5:1: E266 too many leading '#' for block comment + + +Why not consider scripts as projects without wheels? +---------------------------------------------------- + +There is `an ongoing discussion `_ about how to +use ``pyproject.toml`` for projects that are not intended to be built as +wheels. This PEP considers the discussion only tangentially related. + +The use case described in that thread is primarily talking about projects that +represent applications like a Django app or a Flask app. These projects are +often installed on each server in a virtual environment with strict dependency +pinning e.g. a lock file with some sort of hash checking. Such projects would +never be distributed as a wheel (except for maybe a transient editable one +that is created when doing ``pip install -e .``). + +In contrast, scripts are managed loosely by its runner and would almost +always have relaxed dependency constraints. Additionally, to reduce +friction associated with managing small projects there may be a future +in which there is a standard prescribed way to ship projects that are in +the form of a single file. The author of the Rust RFC for embedding metadata +`mentioned to us `__ that they are +actively looking into that based on user feedback. + +Why not just set up a Python project with a ``pyproject.toml``? +--------------------------------------------------------------- + +Again, a key issue here is that the target audience for this proposal is people +writing scripts which aren't intended for distribution. Sometimes scripts will +be "shared", but this is far more informal than "distribution" - it typically +involves sending a script via an email with some written instructions on how to +run it, or passing someone a link to a GitHub gist. + +Expecting such users to learn the complexities of Python packaging is a +significant step up in complexity, and would almost certainly give the +impression that "Python is too hard for scripts". + +In addition, if the expectation here is that the ``pyproject.toml`` will +somehow be designed for running scripts in place, that's a new feature of the +standard that doesn't currently exist. At a minimum, this isn't a reasonable +suggestion until the `current discussion on Discourse +`_ about using ``pyproject.toml`` for projects that +won't be distributed as wheels is resolved. And even then, it doesn't address +the "sending someone a script in a gist or email" use case. + +Why not use a requirements file for dependencies? +------------------------------------------------- + +Putting your requirements in a requirements file, doesn't require a PEP. You +can do that right now, and in fact it's quite likely that many adhoc solutions +do this. However, without a standard, there's no way of knowing how to locate a +script's dependency data. And furthermore, the requirements file format is +pip-specific, so tools relying on it are depending on a pip implementation +detail. + +So in order to make a standard, two things would be required: + +1. A standardised replacement for the requirements file format. +2. A standard for how to locate the requiements file for a given script. + +The first item is a significant undertaking. It has been discussed on a number +of occasions, but so far no-one has attempted to actually do it. The most +likely approach would be for standards to be developed for individual use cases +currently addressed with requirements files. One option here would be for this +PEP to simply define a new file format which is simply a text file containing +:pep:`508` requirements, one per line. That would just leave the question of +how to locate that file. + +The "obvious" solution here would be to do something like name the file the +same as the script, but with a ``.reqs`` extension (or something similar). +However, this still requires *two* files, where currently only a single file is +needed, and as such, does not match the "better batch file" model (shell +scripts and batch files are typically self-contained). It requires the +developer to remember to keep the two files together, and this may not always +be possible. For example, system administration policies may require that *all* +files in a certain directory are executable (the Linux filesystem standards +require this of ``/usr/bin``, for example). And some methods of sharing a +script (for example, publishing it on a text file sharing service like Github's +gist, or a corporate intranet) may not allow for deriving the location of an +associated requirements file from the script's location (tools like ``pipx`` +support running a script directly from a URL, so "download and unpack a zip of +the script and itsdependencies" may not be an appropriate requirement). + +Essentially, though, the issue here is that there is an explicitly stated +requirement that the format supports storing dependency data *in the script +file itself*. Solutions that don't do that are simply ignoring that +requirement. + +Why not use (possibly restricted) Python syntax? +------------------------------------------------ + +This would typically involve storing metadata as multiple special variables, +such as the following. + +.. code:: python + + __requires_python__ = ">=3.11" + __dependencies__ = [ + "requests", + "click", + ] + +The most significant problem with this proposal is that it requires all +consumers of the dependency data to implement a Python parser. Even if the +syntax is restricted, the *rest* of the script will use the full Python syntax, +and trying to define a syntax which can be successfully parsed in isolation +from the surrounding code is likely to be extremely difficult and error-prone. + +Furthermore, Python's syntax changes in every release. If extracting dependency +data needs a Python parser, the parser will need to know which version of +Python the script is written for, and the overhead for a generic tool of having +a parser that can handle *multiple* versions of Python is unsustainable. + +With this approach there is the potential to clutter scripts with many +variables as new extensions get added. Additionally, intuiting which metadata +fields correspond to which variable names would cause confusion for users. + +It is worth noting, though, that the ``pip-run`` utility does implement (an +extended form of) this approach. `Further discussion `_ of +the ``pip-run`` design is available on the project's issue tracker. + +What about local dependencies? +------------------------------ + +These can be handled without needing special metadata and tooling, simply by +adding the location of the dependencies to ``sys.path``. This PEP simply isn't +needed for this case. If, on the other hand, the "local dependencies" are +actual distributions which are published locally, they can be specified as +usual with a :pep:`508` requirement, and the local package index specified when +running a tool by using the tool's UI for that. + +Open Issues +=========== + +None at this point. + + +References +========== + +.. _pyproject metadata: https://packaging.python.org/en/latest/specifications/declaring-project-metadata/ +.. _pip-run issue: https://github.com/jaraco/pip-run/issues/44 +.. _pyproject without wheels: https://discuss.python.org/t/projects-that-arent-meant-to-generate-a-wheel-and-pyproject-toml/29684 + + +Copyright +========= + +This document is placed in the public domain or under the +CC0-1.0-Universal license, whichever is more permissive.