From 189134e403bec15828879ef328656502dcfb431a Mon Sep 17 00:00:00 2001
From: Paul Moore <p.f.moore@gmail.com>
Date: Tue, 8 Aug 2023 17:26:32 +0100
Subject: [PATCH] PEP 722: Further revisions (gh-3282)

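The parsing rules in this revision (first header line wins, the leading
"#" is stripped, " # " starts an inline comment, empty lines are ignored)
can be illustrated with a small sketch. This is not the PEP's reference
implementation: the names parse_dependency_block, HEADER and
LOOKS_LIKE_REQUIREMENT are hypothetical, the function takes the script
text as a string rather than a filename, and the final check is only the
loose "name, then one of [ @ ; ( < ! = > ~" test suggested under
Recommendations, not full PEP 508 validation.

```python
import re

# Header line regex, as given in the Specification text of this patch.
HEADER = re.compile(r"(?i)^#\s+script\s+dependencies:\s*$")

# Loose sanity check (an approximation, assumed for illustration): a
# PEP 508 style name, then end of line or one of [ @ ; ( < ! = > ~
LOOKS_LIKE_REQUIREMENT = re.compile(
    r"(?i)^[A-Z0-9](?:[A-Z0-9._-]*[A-Z0-9])?\s*(?:$|[\[@;(<!=>~])"
)


def parse_dependency_block(text):
    """Return the specifiers from the first dependency block in text."""
    lines = iter(text.splitlines())
    # Skip everything up to and including the first header line.
    for line in lines:
        if HEADER.match(line):
            break
    else:
        return []  # no dependency block in this script

    deps = []
    for line in lines:
        if not line.startswith("#"):
            break  # the first non-comment line ends the block
        line = line[1:]                          # 1. strip the initial '#'
        line = line.split(" # ", maxsplit=1)[0]  # 2. drop any inline comment
        line = line.strip()                      # 3. trim whitespace
        if not line:                             # 4. ignore empty lines
            continue
        if not LOOKS_LIKE_REQUIREMENT.match(line):  # 5. loose validity check
            raise ValueError(f"not a dependency specifier: {line!r}")
        deps.append(line)
    return deps
```

Run against the example script from the Specification section, this
should return "requests", "rich" and the "pip @ https://..." requirement
with its "#sha1=" fragment intact, since the fragment has no surrounding
spaces.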
---
 pep-0722.rst | 425 +++++++++++++++++++++++++++++++--------------------
 1 file changed, 262 insertions(+), 163 deletions(-)

diff --git a/pep-0722.rst b/pep-0722.rst
index 006381b44..493ebcfca 100644
--- a/pep-0722.rst
+++ b/pep-0722.rst
@@ -49,11 +49,10 @@ Because a key requirement is writing single-file scripts, and simple sharing by
 giving someone a copy of the script, the PEP defines a mechanism for embedding
 dependency data *within the script itself*, and not in an external file.
 
-We define the concept of a *metadata block* that contains information about a
-script. The only type of metadata defined here is dependency information, but
-making the concept general allows expansion in the future, should it be needed.
+We define the concept of a *dependency block* that contains information about
+what 3rd party packages a script depends on.
 
-In order to identify metadata blocks, the script can simply be read as a text
+In order to identify dependency blocks, the script can simply be read as a text
 file. This is deliberate, as Python syntax changes over time, so attempting to
 parse the script as Python code would require choosing a specific version of
 Python syntax. Also, it is likely that at least some tools will not be written
@@ -62,7 +61,7 @@ burden.
 
 However, to avoid needing changes to core Python, the format is designed to
 appear as comments to the Python parser. It is possible to write code where a
-metadata block is *not* interpreted as a comment (for example, by embedding it
+dependency block is *not* interpreted as a comment (for example, by embedding it
 in a Python multi-line string), but such uses are discouraged and can easily be
 avoided assuming you are not deliberately trying to create a pathological
 example.
@@ -74,42 +73,41 @@ commonly-used approach.
 Specification
 =============
 
-Any Python script may contain one or more *metadata blocks*. Metadata blocks are
-identified by reading the script *as a text file* (i.e., the file is not parsed
-as Python source code), looking for contiguous blocks of lines that start with
-the identifying characters ``##``. Whitespace is not allowed before the
-identifying ``##``. More than one metadata block may exist in a Python file.
+The content of this section will be published in the Python Packaging user
+guide, PyPA Specifications section, as a document with the title "Embedding
+Metadata in Script Files".
 
-Tools reading metadata blocks MAY respect the standard Python encoding
+Any Python script may contain a *dependency block*. The dependency block is
+identified by reading the script *as a text file* (i.e., the file is not parsed
+as Python source code), looking for the first line of the form::
+
+   # Script Dependencies:
+
+The hash character must be at the start of the line with no preceding whitespace.
+The text "Script Dependencies" is recognised regardless of case, and the spaces
+represent arbitrary whitespace (although at least one space must be present). The
+following regular expression recognises the dependency block header line::
+
+    (?i)^#\s+script\s+dependencies:\s*$
+
+Tools reading the dependency block MAY respect the standard Python encoding
 declaration. If they choose not to do so, they MUST process the file as UTF-8.
 
-Within a metadata block, whitespace after the ``##`` and at the end of the line
-is ignored, and blank lines are ignored. The first line of a metadata block is
-special, and identifies the type of block. This line MUST contain a colon
-character, ``:``. If the colon is not present, the block is not considered to be
-a metadata block, and tools MUST ignore it. The block type is all of the text on
-the initial line, up to the colon. There must be no whitespace before the colon.
-The initial line MAY contain text after the colon. How this is interpreted
-depends on the block type. Block types MUST be treated as case sensitive.
+After the header line, all lines in the file up to the first line that doesn't
+start with a ``#`` sign are considered *dependency lines* and are treated as
+follows:
 
-The interpretation of any lines in a metadata block after the initial
-identifying line, is defined by the type of block.
+1. The initial ``#`` sign is stripped.
+2. If the line contains the character sequence " # " (SPACE HASH SPACE), then
+   those characters and any subsequent characters are discarded. This allows
+   dependency blocks to contain inline comments.
+3. Whitespace at the start and end of the remaining text is discarded.
+4. If the line is now empty, it is ignored.
+5. The content of the line MUST now be a valid :pep:`508` dependency specifier.
 
-Tools MUST ignore any blocks with types they do not handle.
-
-Block types starting with the characters ``X-`` are reserved for the user, and
-MUST NOT be given a meaning in any future standard.
-
-Otherwise, the only defined block type is ``Script Dependencies``. For this
-block type,
-
-1. Text after the colon on the initial line is NOT allowed.
-2. All subsequent lines MUST contain :pep:`508` requirement
-   specifiers, one per line.
-
-There SHOULD only be a single ``Script Dependencies`` block in the file. Tools
-consuming dependency data MAY simply process the first such block found. This
-avoids the need for tools to process more data than is necessary.
+The requirement for spaces before and after the ``#`` in an inline comment is
+necessary to distinguish inline comments from part of a :pep:`508` URL specifier
+(which can contain a hash, but without surrounding whitespace).
 
 Consumers MUST validate that at a minimum, all dependencies start with a
 ``name`` as defined in :pep:`508`, and they MAY validate that all dependencies
@@ -123,9 +121,13 @@ The following is an example of a script with an embedded dependency block::
 
     # In order to run, this script needs the following 3rd party libraries
     #
-    ## Script Dependencies:
-    ##    requests
-    ##    rich
+    # Script Dependencies:
+    #    requests
+    #    rich     # Needed for the output
+    #
+    #    # Not needed - just to show that fragments in URLs do not
+    #    # get treated as comments
+    #    pip @ https://github.com/pypa/pip/archive/1.3.1.zip#sha1=da9234ee9982d4bbb3c72346a6de940a148ea686
 
     import requests
     from rich.pretty import pprint
@@ -138,18 +140,12 @@ The following is an example of a script with an embedded dependency block::
 Backwards Compatibility
 =======================
 
-As metadata blocks take the form of a structured comment, they can be added
+As dependency blocks take the form of a structured comment, they can be added
 without altering the meaning of existing code.
 
 It is possible that a comment may already exist which matches the form of a
-metadata block. While the use of a double ``#`` prefix is intended to minimise
-this risk, it is still possible.
-
-Because tools must ignore unrecognised metadata types, the only potential issue
-we need to consider is script dependencies. In that case, a tool might read the
-wrong dependencies. In practice, though, this is unlikely to happen, as (a) the
-header text (``Script Dependencies:``) is fairly unusual, and (b) any following
-lines are unlikely to conform to :pep:`508` unless they *are* dependencies.
+dependency block. While the identifying header text, "Script Dependencies", is
+chosen to minimise this risk, it is still possible.
 
 In the rare case where an existing comment would be interpreted incorrectly as a
 dependency block, this can be addressed by adding an actual dependency block
@@ -176,9 +172,7 @@ How to Teach This
 
 The format is intended to be close to how a developer might already specify
 script dependencies in an explanatory comment. The required structure is
-deliberately minimal, and the concept of using a special comment marker (``##``
-in this case) is not unusual (the "shebang" line in a Unix shell script is an
-example).
+deliberately minimal, so that formatting rules are easy to learn.
 
 Users will need to know how to write Python dependency specifiers. This is
 covered by :pep:`508`, but for simple examples (which is expected to be the norm
@@ -207,12 +201,20 @@ Recommendations
 This section is non-normative and simply describes "good practices" when using
 metadata blocks.
 
-Scripts should, in general, place metadata blocks at the top of the file,
+While it is permitted for tools to do minimal validation of requirements, in
+practice they should do as much "sanity check" validation as possible, even if
+they cannot do a full check for :pep:`508` syntax. This helps to ensure that
+dependency blocks that are not correctly terminated are reported early. A good
+compromise between the minimal approach of checking just that the requirement
+starts with a name, and full :pep:`508` validation, is to check for a bare name,
+or a name followed by optional whitespace, and then one of ``[`` (extra), ``@``
+(urlspec), ``;`` (marker) or one of ``(<!=>~`` (version).
+
+Scripts should, in general, place the dependency block at the top of the file,
 either immediately after any shebang line, or straight after the script
-docstring. In particular, the metadata block should always be placed before
+docstring. In particular, the dependency block should always be placed before
 any executable code in the file. This makes it easy for the human reader to
-locate the metadata block, and allows tools to only read the minimum necessary
-to identify them.
+locate it.
 
 
 Reference Implementation
@@ -221,57 +223,34 @@ Reference Implementation
 Code to implement this proposal in Python is fairly straightforward, so the
 reference implementation can be included here.
 
-A parser that reads *only* the script dependency metadata.
-
 .. code:: python
 
+   import re
    import tokenize
    from packaging.requirements import Requirement
 
-   DEPENDENCY_BLOCK_MARKER = "Script Dependencies:"
-
+   DEPENDENCY_BLOCK_MARKER = r"(?i)^#\s+script\s+dependencies:\s*$"
+
    def read_dependency_block(filename):
-      # Use the tokenize module to handle any encoding declaration.
-      with tokenize.open(filename) as f:
-         for line in f:
-               if line.startswith("##"):
-                  line = line[2:].strip()
-                  if line == DEPENDENCY_BLOCK_MARKER:
-                     for line in f:
-                        if not line.startswith("##"):
-                              break
-                        line = line[2:].strip()
-                        if not line:
-                              continue
-                        # Try to convert to a requirement. This will raise
-                        # an error if the line is not a PEP 508 requirement
-                        yield Requirement(line)
-                     break
-
-A full metadata block parser that returns all metadata blocks in a script.
-
-.. code:: python
-
-   import tokenize
-   from packaging.requirements import Requirement
-
-   def read_metadata_blocks(filename):
-      # Use the tokenize module to handle any encoding declaration.
-      with tokenize.open(filename) as f:
-         for line in f:
-               if line.startswith("##"):
-                  block_type, sep, extra = line[2:].strip().partition(":")
-                  if not sep:
-                     continue
-                  block_data = []
-                  for line in f:
-                     if not line.startswith("##"):
+       # Use the tokenize module to handle any encoding declaration.
+       with tokenize.open(filename) as f:
+           for line in f:
+               if re.match(DEPENDENCY_BLOCK_MARKER, line):
+                   for line in f:
+                       if not line.startswith("#"):
                            break
-                     line = line[2:].strip()
-                     if not line:
-                           continue
-                     block_data.append(line)
-                  yield block_type, extra, block_data
+                       # Remove comments. An inline comment is introduced by
+                       # a hash, which must be preceded and followed by a
+                       # space. The initial hash will be skipped as it has
+                       # no space before it.
+                       line = line.split(" # ", maxsplit=1)[0]
+                       line = line[1:].strip()
+                       if not line:
-                           break
+                           continue
+                       # Try to convert to a requirement. This will raise
+                       # an error if the line is not a PEP 508 requirement
+                       yield Requirement(line)
+                   break
 
 A format similar to the one proposed here is already supported `in pipx
 <https://github.com/pypa/pipx/pull/916>`__ and in `pip-run
@@ -284,32 +263,97 @@ Rejected Ideas
 Why not include other metadata?
 -------------------------------
 
-The "metadata block" format is designed to allow additional metadata types, but
-none are defined at this time. Currently, the only data used by tools is
-dependency information, and therefore this is the only information required by
-this standard. If, in future, a need is identified for other data to be
-standardised, adding further metadata types is straightforward.
+The core use case addressed by this proposal is that of identifying what
+dependencies a standalone script needs in order to run successfully. This is a
+common real-world issue that is currently solved by script runner tools, using
+implementation-specific ways of storing the data. Standardising the storage
+format improves interoperability by not tying the script to a particular
+runner.
 
-By reserving metadata types starting with ``X-``, the specification allows
-experimentation with additional data *before* standardising.
+While it is arguable that other forms of metadata could be useful in a
+standalone script, the need is largely theoretical at this point. In practical
+terms, scripts either don't use other metadata, or they store it in existing,
+widely used (and therefore de facto standard) formats. For example, scripts
+needing README style text typically use the standard Python module docstring,
+and scripts wanting to declare a version use the common convention of having a
+``__version__`` variable.
 
-Two particular cases are a script version number, and the version of Python
-needed to run the script.
+One case which was raised during the discussion on this PEP was the ability to
+declare a minimum Python version that a script needed to run, by analogy with
+the ``Requires-Python`` core metadata item for packages. Unlike packages,
+scripts are normally only run by one user or in one environment, in contexts
+where multiple versions of Python are uncommon. The need for this metadata is
+therefore much less critical in the case of scripts. As further evidence of
+this, the two key script runners currently available, ``pipx`` and ``pip-run``,
+do not offer a means of including this data in a script.
 
-In the case of the version number, there are no known tools that try to extract
-version information from scripts, so there is no immediate benefit to having the
-version as metadata, rather than, for example, as a normal comment or a
-``__version__`` attribute (see :pep:`396`). If it becomes common for tools to
-want to introspect script versions, this could be added at a later date.
+Creating a standard "metadata container" format would unify the various
+approaches, but in practical terms there is no real need for unification, and
+the disruption would either delay adoption, or more likely simply mean script
+authors would ignore the standard.
 
-In the case of the Python version, existing tools provide a means for the *user*
-to specify what Python interpreter to use when running the script (for example,
-``pipx run`` provides the ``--python`` command line option), but they do not
-typically allow the *script* to define a version range, and then automatically
-pick an interpreter based on that. Having a "supported version" for a script may
-allow the tool to provide better error messages when run with an inappropriate
-interpreter, but currently, this is largely a theoretical benefit. Again, it is
-something that can be added later if it becomes a commonly requested feature.
+This proposal therefore chooses to focus just on the one use case where there is
+a clear need for something, and no existing standard or common practice.
+
+
+Why not use a marker per line?
+------------------------------
+
+Rather than using a comment block with a header, another possibility would be to
+use a marker on each line, something like::
+
+   # Script-Dependency: requests
+   # Script-Dependency: click
+
+While this makes it easier to parse lines individually, it has a number of
+issues. The first is simply that it's rather verbose, and less readable. This is
+clearly affected by the chosen keyword, but all of the suggested options were
+(in the author's opinion) less readable than the block comment form.
+
+More importantly, this form *by design* makes it impossible to require that the
+dependency specifiers are all together in a single block. As a result, it's not
+possible for a human reader, without a careful check of the whole file, to be
+sure that they have identified all of the dependencies. See the question below,
+"Why not allow multiple dependency blocks and merge them?", for further
+discussion of this problem.
+
+Finally, as the reference implementation demonstrates, parsing the "comment
+block" form isn't, in practice, significantly more difficult than parsing this
+form.
+
+
+Why not use a distinct form of comment for the dependency block?
+----------------------------------------------------------------
+
+A previous version of this proposal used ``##`` to identify dependency blocks.
+Unfortunately, however, the flake8 linter implements a rule requiring that
+comments must have a space after the initial ``#`` sign. While the PEP author
+considers that rule misguided, it is on by default and as a result would cause
+checks to fail when faced with a dependency block.
+
+Furthermore, the ``black`` formatter, although it allows the ``##`` form, does
+add a space after the ``#`` for most other forms of comment. This means that if
+we chose an alternative like ``#%``, automatic reformatting would corrupt the
+dependency block. Forms including a space, like ``# #`` are possible, but less
+natural for the average user (omitting the space is an obvious mistake to make).
+
+While it is possible that linters and formatters could be changed to recognise
+the new standard, the benefit of having a dedicated prefix did not seem
+sufficient to justify the transition cost, or the risk that users might be using
+older tools.
+
+
+Why not allow multiple dependency blocks and merge them?
+--------------------------------------------------------
+
+Because it's too easy for the human reader to miss the fact that there's a
+second dependency block. This could simply result in the script runner
+unexpectedly downloading extra packages, or it could even be a way to smuggle
+malicious packages onto a user's machine (by "hiding" a second dependency block
+in the body of the script).
+
+While the principle of "don't run untrusted code" applies here, the benefits
+aren't sufficient to be worth the risk.
 
 
 Why not use a more standard data format (e.g., TOML)?
@@ -347,10 +391,9 @@ And finally, there will be tools that expect to *write* dependency data into
 scripts -- for example, an IDE with a feature that automatically adds an import
 and a dependency specifier when you reference a library function. While
 libraries exist that allow editing TOML data, they are not always good at
-preserving the user's layout, which could include comments, specific formatting,
-etc. Even if libraries exist which do an effective job at this, expecting all
-tools to use such a library is a significant imposition on code supporting this
-PEP.
+preserving the user's layout. Even if libraries exist which do an effective job
+at this, expecting all tools to use such a library is a significant imposition
+on code supporting this PEP.
 
 By choosing a simple, line-based format with no quoting rules, dependency data
 is easy to read (for humans and tools) and easy to write. The format doesn't
@@ -358,6 +401,45 @@ have the flexibility of something like TOML, but the use case simply doesn't
 demand that sort of flexibility.
 
 
+Why not use (possibly restricted) Python syntax?
+------------------------------------------------
+
+This would typically involve storing the dependencies as a (runtime) list
+variable with a conventional name, such as::
+
+    __requires__ = [
+        "requests",
+        "click",
+    ]
+
+Other suggestions include a static multi-line string, or including the
+dependencies in the script's docstring.
+
+The most significant problem with this proposal is that it requires all
+consumers of the dependency data to implement a Python parser. Even if the
+syntax is restricted, the *rest* of the script will use the full Python syntax,
+and trying to define a syntax which can be successfully parsed in isolation from
+the surrounding code is likely to be extremely difficult and error-prone.
+
+Furthermore, Python's syntax changes in every release. If extracting dependency
+data needs a Python parser, the parser will need to know which version of Python
+the script is written for, and the overhead for a generic tool of having a
+parser that can handle *multiple* versions of Python is unsustainable.
+
+Even if the above issues could be addressed, the format would give the
+impression that the data could be altered at runtime. However, this is not the
+case in general, and code that tries to do so will encounter unexpected and
+confusing behaviour.
+
+And finally, there is no evidence that having dependency data available at
+runtime is of any practical use. Should such a use be found, it is simple enough
+to get the data by parsing the source - ``read_dependency_block(__file__)``.
+
+It is worth noting, though, that the ``pip-run`` utility does implement (an
+extended form of) this approach. `Further discussion <pip-run issue_>`_ of
+the ``pip-run`` design is available on the project's issue tracker.
+
+
 Why not embed a ``pyproject.toml`` file in the script?
 ------------------------------------------------------
 
@@ -413,6 +495,60 @@ existing solutions are likely to be unwelcome to that audience, and could easily
 result in people simply continuing to use existing adhoc solutions, and ignoring
 the standard that was intended to make their lives easier.
 
+Why not infer the requirements from import statements?
+------------------------------------------------------
+
+The idea would be to automatically recognize ``import`` statements in the source
+file and turn them into a list of requirements.
+
+However, this is infeasible for several reasons. First, the points above about
+the necessity to keep the syntax easily parsable, for all Python versions, also
+by tools written in other languages, apply equally here.
+
+Second, PyPI and other package repositories conforming to the Simple Repository
+API do not provide a mechanism to resolve package names from the module names
+that are imported (see also `this related discussion <import-names_>`_).
+
+Third, even if repositories did offer this information, the same import name may
+correspond to several packages on PyPI. One might object that disambiguating
+which package is wanted would only be needed if there are several projects
+providing the same import name. However, this would make it easy for anyone to
+unintentionally or malevolently break working scripts, by uploading a package to
+PyPI providing an import name that is the same as an existing project. The
+alternative where, among the candidates, the first package to have been
+registered on the index is chosen, would be confusing in case a popular package
+is developed with the same import name as an existing obscure package, and even
+harmful if the existing package is malware intentionally uploaded with a
+sufficiently generic import name that has a high probability of being reused.
+
+A related idea would be to attach the requirements as comments to the import
+statements instead of gathering them in a block, with a syntax such as::
+
+  import numpy as np # requires: numpy
+  import rich # requires: rich
+
+This still suffers from parsing difficulties. Also, where to place the comment
+in the case of multiline imports is ambiguous and may look ugly::
+
+   from PyQt5.QtWidgets import (
+       QCheckBox, QComboBox, QDialog, QDialogButtonBox,
+       QGridLayout, QLabel, QSpinBox, QTextEdit
+   ) # requires: PyQt5
+
+Furthermore, this syntax cannot behave as might be intuitively expected
+in all situations. Consider::
+
+  import platform
+  if platform.system() == "Windows":
+      import pywin32 # requires: pywin32
+
+Here, the user's intent is that the package is only required on Windows, but
+this cannot be understood by the script runner (the correct way to write
+it would be ``requires: pywin32 ; sys_platform == 'win32'``).
+
+(Thanks to Jean Abou-Samra for the clear discussion of this point.)
+
+
 Why not just set up a Python project with a ``pyproject.toml``?
 ---------------------------------------------------------------
 
@@ -476,44 +612,6 @@ Essentially, though, the issue here is that there is an explicitly stated
 requirement that the format supports storing dependency data *in the script file
 itself*. Solutions that don't do that are simply ignoring that requirement.
 
-Why not use (possibly restricted) Python syntax?
-------------------------------------------------
-
-This would typically involve storing the dependencies as a (runtime) list
-variable with a conventional name, such as::
-
-    __requires__ = [
-        "requests",
-        "click",
-    ]
-
-Other suggestions include a static multi-line string, or including the
-dependencies in the script's docstring.
-
-The most significant problem with this proposal is that it requires all
-consumers of the dependency data to implement a Python parser. Even if the
-syntax is restricted, the *rest* of the script will use the full Python syntax,
-and trying to define a syntax which can be successfully parsed in isolation from
-the surrounding code is likely to be extremely difficult and error-prone.
-
-Furthermore, Python's syntax changes in every release. If extracting dependency
-data needs a Python parser, the parser will need to know which version of Python
-the script is written for, and the overhead for a generic tool of having a
-parser that can handle *multiple* versions of Python is unsustainable.
-
-Even if the above issues could be addressed, the format would give the
-impression that the data could be altered at runtime. However, this is not the
-case in general, and code that tries to do so will encounter unexpected and
-confusing behaviour.
-
-And finally, there is no evidence that having dependency data available at
-runtime is of any practical use. Should such a use be found, it is simple enough
-to get the data by parsing the source - ``read_dependency_block(__file__)``.
-
-It is worth noting, though, that the ``pip-run`` utility does implement (an
-extended form of) this approach. `Further discussion <pip-run issue_>`_ of
-the ``pip-run`` design is available on the project's issue tracker.
-
 Should scripts be able to specify a package index?
 --------------------------------------------------
 
@@ -555,6 +653,7 @@ References
 .. _pip-run issue: https://github.com/jaraco/pip-run/issues/44
 .. _language survey: https://dbohdan.com/scripts-with-dependencies
 .. _pyproject without wheels: https://discuss.python.org/t/projects-that-arent-meant-to-generate-a-wheel-and-pyproject-toml/29684
+.. _import-names: https://discuss.python.org/t/record-the-top-level-names-of-a-wheel-in-metadata/29494
 
 Copyright
 =========