PEP 680: "tomllib" Support for parsing TOML in the Standard Library (#2218)

Co-authored-by: Taneli Hukkinen <3275109+hukkin@users.noreply.github.com> Co-authored-by: Petr Viktorin <encukou@gmail.com>
2022-01-10 12:55:30 -08:00 · 2022-01-10 12:55:30 -08:00 · 5056f2a964
parent 8b9859a142
commit 5056f2a964
1 changed files with 501 additions and 0 deletions
--- a/pep-0680.rst
+++ b/pep-0680.rst
@@ -0,0 +1,501 @@
+PEP: 680
+Title: tomllib: Support for parsing TOML in the Standard Library
+Author: Taneli Hukkinen, Shantanu Jain <hauntsaninja at gmail.com>
+Sponsor: Petr Viktorin <encukou@gmail.com>
+Discussions-To: https://discuss.python.org/t/adopting-recommending-a-toml-parser/4068
+Status: Draft
+Type: Standards Track
+Content-Type: text/x-rst
+Created: 01-Jan-2022
+Python-Version: 3.11
+Post-History: 1900-01-01
+
+
+Abstract
+========
+
+This proposes adding a module, ``tomllib``, to the standard library for
+parsing TOML (Tom's Obvious Minimal Language,
+`https://toml.io <https://toml.io/en/>`_).
+
+
+Motivation
+==========
+
+The TOML format is the format of choice for Python packaging, as evidenced by
+:pep:`517`, :pep:`518` and :pep:`621`. Including TOML support in the standard
+library helps avoid bootstrapping problems for Python build tools. Currently
+most Python build tools need to vendor a TOML parsing library.
+
+Python tools are increasingly configurable via TOML, for example: ``black``,
+``mypy``, ``pytest``, ``tox``, ``pylint``, ``isort``. Those that are not, such
+as ``flake8``, cite the lack of standard library support as a `main reason why
+<https://github.com/PyCQA/flake8/issues/234#issuecomment-812800657>`_.
+
+Given the special place TOML already has in the Python ecosystem, it makes sense
+for this to be an included battery.
+
+Finally, TOML as a format is increasingly popular (some reasons for this are
+outlined in :pep:`518`). Hence this is likely to be a generally useful addition,
+even looking beyond the needs of Python packaging and Python tooling: various
+Python TOML libraries have about 2000 reverse dependencies on PyPI. For
+comparison, ``requests`` has about 28k reverse dependencies.
+
+
+Rationale
+=========
+
+This PEP proposes basing the standard library support for reading TOML on the
+third party library ``tomli``
+(`github.com/hukkin/tomli <https://github.com/hukkin/tomli>`_).
+
+Many projects have recently switched to using ``tomli``, for example, ``pip``,
+``build``, ``pytest``, ``mypy``, ``black``, ``flit``, ``coverage``,
+``setuptools-scm``, ``cibuildwheel``.
+
+``tomli`` is actively maintained and well-tested. ``tomli`` is about 800 lines
+of code with 100% test coverage and passes all tests in a test suite `proposed
+as the official TOML compliance test suite
+<https://github.com/toml-lang/compliance/pull/8>`_, as well as `the more
+established BurntSushi/toml-test suite
+<https://github.com/BurntSushi/toml-test>`_.
+
+
+Specification
+=============
+
+A new module ``tomllib`` with the following functions will be added:
+
+.. code-block::
+
+   def load(fp: SupportsRead[bytes], /, *, parse_float: Callable[[str], Any] = ...) -> dict[str, Any]: ...
+   def loads(s: str, /, *, parse_float: Callable[[str], Any] = ...) -> dict[str, Any]: ...
+
+``tomllib.load`` deserializes a binary file containing a
+TOML document to a Python dict.
+The ``fp`` argument must have a ``read()`` method with the same API as
+``io.RawIOBase.read()``.
+
+``tomllib.loads`` deserializes a str instance containing a TOML document
+to a Python dict.
+
+``parse_float`` is a function that takes a string representing a TOML float and
+returns a Python object (similar to ``parse_float`` in ``json.load``). For
+example, it could be a function returning a ``decimal.Decimal`` in cases where
+precision is important. By default, TOML floats are parsed as instances of the
+Python ``float`` type.
+
+The returned object contains only basic Python objects (``str``, ``int``,
+``bool``, ``float``, ``datetime.{datetime,date,time}``, ``list``, ``dict`` with
+string keys), and the results of ``parse_float``.
+
+``tomllib.TOMLDecodeError`` is raised in the case of invalid TOML.
+
+Note that this PEP does not propose ``tomllib.dump`` or ``tomllib.dumps``
+functions, see `<Including an API for writing TOML_>`_ for details.
+
+
+Maintenance Implications
+========================
+
+Stability of TOML
+-----------------
+
+The release of TOML v1 in January 2021 indicates stability. Empirically, TOML
+has proven to be a stable format even prior to the release of TOML v1. From the
+`changelog <https://github.com/toml-lang/toml/blob/master/CHANGELOG.md>`_, we
+see TOML has had no major changes since April 2020 and has had two releases in
+the last five years.
+
+In the event of changes to the TOML specification, we could treat minor
+revisions as bug fixes and update the implementation in place. In the event of
+major breaking changes, we should preserve support for TOML v1.
+
+Maintainability of proposed implementation
+------------------------------------------
+
+The proposed implementation (``tomli``) is in pure Python, well tested and
+weighs under 1000 lines of code. It is minimalist, offering a smaller API
+surface area than other TOML implementations.
+
+The author of ``tomli`` is willing to help integrate ``tomli`` into the standard
+library and help maintain it, `as per this post
+<https://github.com/hukkin/tomli/issues/141#issuecomment-998018972>`__.
+Petr Viktorin has indicated willingness to maintain a read API,
+`as per this post
+<https://discuss.python.org/t/adopting-recommending-a-toml-parser/4068/88>`__.
+
+Rewriting the parser in C is not deemed necessary at this time. It's rare for
+TOML parsing to be a bottleneck in applications. Users with higher performance
+needs can use a third party library (as is already often the case with JSON,
+despite a stdlib extension module).
+
+TOML support being a slippery slope for other things
+----------------------------------------------------
+
+As discussed in the Motivation section, TOML holds a special place in the
+Python ecosystem. This chief reason to include TOML in the standard library does
+not apply to other formats, such as YAML or MessagePack.
+
+In addition, the simplicity of TOML can serve as a dividing line: YAML, for
+example, is large and complicated.
+
+An API for writing TOML may, however, be added in a future PEP.
+
+
+Backwards Compatibility
+=======================
+
+This proposal has no backwards compatibility issues within the stdlib, as it
+describes a new module.
+Any existing third-party module named ``tomllib`` will break, as
+``import tomllib`` will import the standard library module.
+However, ``tomllib`` is not registered on PyPI, so it is unlikely that such
+a module is widely used.
+
+Note that we avoid using the more straightforward name ``toml``, to avoid
+backwards compatibility implications for users who have pinned versions of the
+current ``toml`` PyPI package. For more details, see `<Alternative names for
+module_>`_.
+
+
+Security Implications
+=====================
+
+Errors in the implementation could cause potential security issues.
+The parser's output is limited to simple data types; inability to load
+arbitrary classes avoids security issues common in more "powerful" formats like
+pickle and YAML. Also, the implementation will be in pure Python, which reduces
+security issues endemic to C, such as buffer overflows.
+
+
+How to Teach This
+=================
+
+The API of ``tomllib`` mimics that of other well-established file format
+libraries, such as ``json`` and ``pickle``. The lack of a ``dump`` function will
+be explained in the documentation, with a link to relevant third-party libraries
+(``tomlkit``, ``tomli-w``, ``pytomlpp``).
+
+
+Reference Implementation
+========================
+
+The proposed implementation can be found at https://github.com/hukkin/tomli
+
+
+Rejected Ideas
+==============
+
+Basing on another TOML implementation
+-------------------------------------
+
+Potential alternatives include:
+
+* ``tomlkit``.
+  ``tomlkit`` is well established, actively maintained and supports TOML v1. An
+  important difference is that ``tomlkit`` supports style roundtripping. As a
+  result, it has a more complex API and implementation (about 5x as much code as
+  ``tomli``). The author does not believe that ``tomlkit`` is a good choice for
+  the standard library.
+
+* ``toml``.
+  ``toml`` is a widely used library. However, it is not actively maintained,
+  does not support TOML v1 and has several known bugs. Its API is more complex
+  than that of ``tomli``. It has some very limited and mostly unused ability to
+  preserve style through an undocumented decoder API. It has the ability to
+  customise output style through a complicated encoder API. For more details on
+  API differences to this PEP, refer to `Appendix A`_.
+
+* ``pytomlpp``.
+  ``pytomlpp`` is a Python wrapper for the C++ project ``toml++``. Pure Python
+  libraries are easier to maintain than extension modules.
+
+* ``rtoml``.
+  ``rtoml`` is a Python wrapper for the Rust project ``toml-rs`` and hence has
+  similar shortcomings to ``pytomlpp``.
+  In addition, it does not support TOML v1.
+
+* Writing from scratch.
+  It's unclear what we would get from this: ``tomli`` meets our needs and the
+  author is willing to help with its inclusion in the standard library.
+
+Including an API for writing TOML
+---------------------------------
+
+There are several reasons to not include an API for writing TOML:
+
+The ability to write TOML is not needed for the use cases that motivate this
+PEP: for core Python packaging use cases or for tools that need to read
+configuration.
+
+Use cases that involve editing TOML (as opposed to writing brand new TOML) are
+better served by a style preserving library. TOML is intended as human-readable
+and human-editable configuration, so it's important to preserve human markup,
+such as comments and formatting. This requires a parser whose output includes
+style-related metadata, making it impractical to output plain Python types like
+``str`` and ``dict``. Designing such an API is complicated.
+
+But even without considering style preservation, there are too many degrees of
+freedom in how to design a write API. For example, it is unclear how much
+control to allow users over output formatting, over serialization of custom
+types, and over input and output validation. While there are reasonable choices
+on how to resolve these, the nature of the standard library is such that one
+only gets one chance to get things right.
+
+Currently no CPython core developers have expressed willingness to maintain a
+write API or sponsor a PEP that includes a write API. Since it is hard to change
+or remove something in the standard library, it is safer to err on the side of
+exclusion and potentially revisit later.
+
+So, writing TOML is left to third-party libraries. If a good API and relevant
+use cases for it are found later, it can be added in a future PEP.
+
+
+Assorted API details
+--------------------
+
+Types accepted by the first argument of ``tomllib.load``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The ``toml`` library on PyPI allows passing paths (and lists of path-like
+objects, ignoring missing files and merging the documents into a single object).
+Doing this would be inconsistent with ``json.load``, ``pickle.load``, etc. If we
+agree consistency with other stdlib modules is desirable, allowing paths is
+somewhat out of scope for this PEP. This can easily and explicitly be worked
+around in user code, or a third-party library.
+
+The proposed API takes a binary file, while ``toml.load`` takes a text file and
+``json.load`` takes either. Using a binary file allows us to a) ensure UTF-8 is
+the encoding used, and b) avoid incorrectly parsing single carriage returns as
+valid TOML due to universal newlines.
+
+Type accepted by the first argument of ``tomllib.loads``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+While ``tomllib.load`` takes a binary file, ``tomllib.loads`` takes
+a text string. This may seem inconsistent at first.
+
+Quoting the TOML v1.0.0 specification:
+
+    A TOML file must be a valid UTF-8 encoded Unicode document.
+
+``tomllib.loads`` does not intend to load a TOML file, but rather the
+document that the file stores. The most natural representation of
+a Unicode document in Python is ``str``, not ``bytes``.
+
+It is possible to add ``bytes`` support in the future if needed, but
+we are not aware of any use cases for it.
+
+Controlling the type of mappings returned by ``tomllib.load[s]``
+----------------------------------------------------------------
+
+The ``toml`` library on PyPI supports a ``_dict`` argument, which works
+similarly to the ``object_hook`` argument in ``json.load[s]``. There are several
+uses of ``_dict`` found on https://grep.app; however, almost all of them pass
+``_dict=OrderedDict``, which should be unnecessary as of Python 3.7. We found
+two instances of legitimate use: in one case, a custom class was passed for
+friendlier KeyErrors; in the other, the custom class had several additional
+lookup and mutation methods (e.g. to help resolve dotted keys).
+
+Such an argument is not necessary for the core use cases outlined in the
+motivation section. The absence of this can be pretty easily worked around using
+a wrapper class, transformer function, or a third-party library. Finally,
+support could be added later in a backward compatible way.
+
+
+Removing support for ``parse_float`` in ``tomllib.load[s]``
+-----------------------------------------------------------
+
+This option is not strictly necessary, since TOML floats are "IEEE 754 binary64
+values", which is what ``float`` is on most architectures. Using
+``decimal.Decimal`` thus gives users extra precision not promised by the TOML
+format. However, in the experience of the author of ``tomli``, this is useful in
+scientific and financial applications. TOML-facing users may include
+non-developers who are not aware of the limits of double-precision floats.
+
+There are also niche architectures where the Python ``float`` is not an IEEE 754
+binary64. The ``parse_float`` argument allows users to achieve correct TOML
+semantics even on such architectures.
+
+
+Alternative names for module
+----------------------------
+
+Ideally, we would be able to use the ``toml`` module name.
+
+However, the ``toml`` package on PyPI is widely used, so there are backward
+compatibility concerns. Since the standard library takes precedence over third
+party packages, any API incompatibilities would break users who have pinned
+versions of ``toml`` when they upgrade Python.
+
+To further clarify, the user pins are the specific concern here. Even if we were
+able to get control over the ``toml`` PyPI package and repurpose it as a
+standard library backport, we would still break users who have pinned to
+versions of the current ``toml`` package. This is unfortunate, since pinning
+would likely be a common response to breaking changes introduced by repurposing
+the ``toml`` package as a backport (that is incompatible with today's ``toml``).
+
+There are several API incompatibilities between ``toml`` and the API proposed in
+this PEP, listed in `Appendix A`_.
+
+Finally, the ``toml`` package on PyPI is not actively maintained and `we have
+been unable to contact the author <https://github.com/uiri/toml/issues/361>`_,
+so action here would likely have to be taken without the author's consent.
+
+This PEP proposes ``tomllib``. This mirrors ``plistlib`` (another file format
+module in the standard library), as well as several others such as ``pathlib``,
+``graphlib``, etc.
+
+Other considered names include:
+
+* ``tomlparser``. This mirrors ``configparser``, but is perhaps slightly less
+  appropriate if we include a write API in the future.
+* ``tomli``. This assumes we use ``tomli`` as the basis for implementation.
+* ``toml`` under some namespace, such as ``parser.toml``. However, this is
+  awkward, especially so since existing libraries like ``json``, ``pickle``,
+  ``marshal``, ``html`` etc. would not be included in the namespace.
+
+
+TODO: Random things
+===================
+
+Previous discussion:
+
+* https://bugs.python.org/issue40059
+* https://mail.python.org/archives/list/python-ideas@python.org/thread/IWJ3I32A4TY6CIVQ6ONPEBPWP4TOV2V7/
+* https://mail.python.org/pipermail/python-dev/2019-May/157405.html
+* https://github.com/hukkin/tomli/issues/141
+* https://discuss.python.org/t/adopting-recommending-a-toml-parser/4068/84
+
+Useful https://grep.app searches (note, ignore vendored):
+
+* toml.load[s] usage https://grep.app/search?q=toml.load&filter[lang][0]=Python
+* toml.dump[s] usage https://grep.app/search?q=toml.dump&filter[lang][0]=Python
+* TomlEncoder subclasses https://grep.app/search?q=TomlEncoder%29%3A&filter[lang][0]=Python
+
+
+.. _Appendix A:
+
+Appendix A: Differences between proposed API and ``toml``
+=========================================================
+
+This appendix covers the differences between the API proposed in this PEP and
+that of the third party package ``toml``. These differences are relevant to
+understanding the amount of breakage we could expect if we used the ``toml``
+name for the standard library module, as well as to better understand the design
+space. Note that this list might not be exhaustive.
+
+#. This PEP currently proposes not to include a write API. That is, there will
+   be no equivalent of ``toml.dump`` or ``toml.dumps``.
+
+   Discussed at `<Including an API for writing TOML_>`_.
+
+   If we included a write API, it would be relatively simple to convert most
+   code that uses ``toml`` to use the API proposed in this PEP (acknowledging
+   that that is very different from a compatible API).
+
+   A significant fraction of ``toml`` users rely on this.
+
+#. Different first argument of ``toml.load``
+
+   ``toml.load`` has the following signature:
+
+   .. code-block::
+
+       def load(
+           f: Union[SupportsRead[str], str, bytes, list[PathLike | str | bytes]],
+           _dict: Type[MutableMapping[str, Any]] = ...,
+           decoder: TomlDecoder = ...,
+       ) -> MutableMapping[str, Any]: ...
+
+   This is pretty different from the first argument proposed in this PEP: ``SupportsRead[bytes]``.
+
+   Recapping the reasons for this, previously mentioned at
+   `<Types accepted by the first argument of tomllib.load_>`_:
+
+   * Allowing passing of paths (and lists of path-like objects, ignoring missing
+     files and merging the documents into a single object) is inconsistent with
+     other similar functions in the standard library.
+   * Using ``SupportsRead[bytes]`` allows us to a) ensure UTF-8 is the encoding used,
+     and b) avoid incorrectly parsing single carriage returns as valid TOML due to
+     universal newlines. TOML specifies file encoding and valid newline
+     sequences, and hence is simply a stricter format than what text file objects
+     represent.
+
+   A significant fraction of ``toml`` users rely on this.
+
+#. Errors
+
+   ``toml`` raises ``TomlDecodeError`` vs the proposed PEP 8 compliant
+   ``TOMLDecodeError``.
+
+   A significant fraction of ``toml`` users rely on this.
+
+#. ``toml.load[s]`` accepts a ``_dict`` argument
+
+   Discussed at `<Controlling the type of mappings returned by tomllib.load[s]_>`_.
+
+   As discussed, almost all usage consists of ``_dict=OrderedDict``, which is
+   not necessary in Python 3.7 and later.
+
+#. ``toml.load[s]`` support an undocumented ``decoder`` argument
+
+   It seems the intended use case is for an implementation of comment
+   preservation. The information recorded is not sufficient to roundtrip the
+   TOML document preserving style, the implementation has known bugs, the
+   feature is undocumented and I could only find one instance of its use on
+   https://grep.app.
+
+   The ``toml.TomlDecoder`` interface exposed is not simple, containing nine methods.
+   See `here <https://github.com/uiri/toml/blob/3f637dba5f68db63d4b30967fedda51c82459471/toml/decoder.pyi#L36>`__.
+
+   Users are probably better served by a more complete implementation of style
+   preserving parsing and writing.
+
+#. ``toml.dump[s]`` support an ``encoder`` argument
+
+   Note that we currently propose not to include a write API; however, if that
+   were to change, these differences would likely become relevant.
+
+   This enables two use cases, a) control over how custom types should be
+   serialized, b) control over how output should be formatted.
+
+   The first use case is reasonable; however, I could only find two instances of
+   this on https://grep.app. One of these two instances used this ability to add
+   support for dumping ``decimal.Decimal`` (which a potential standard library
+   implementation would support out of the box).
+
+   If needed, this use case could be well served by the equivalent of the
+   ``default`` argument in ``json.dump``.
+
+   The second use case is enabled by allowing users to specify subclasses of
+   ``toml.TomlEncoder`` and overriding methods to specify parts of the TOML
+   writing process. The API consists of five methods and exposes a lot of
+   implementation detail. See `here <https://github.com/uiri/toml/blob/3f637dba5f68db63d4b30967fedda51c82459471/toml/encoder.pyi#L9>`__.
+
+   There is some usage of the ``encoder`` API on https://grep.app; however, it
+   likely accounts for a tiny fraction of overall usage of ``toml``.
+
+#. Timezones
+
+   ``toml`` uses and exposes custom ``toml.tz.TomlTz`` timezone objects. The
+   proposed implementation uses ``datetime.timezone`` objects from the standard
+   library.
+
+
+Copyright
+=========
+
+This document is placed in the public domain or under the
+CC0-1.0-Universal license, whichever is more permissive.
+
+
+
+..
+    Local Variables:
+    mode: indented-text
+    indent-tabs-mode: nil
+    sentence-end-double-space: t
+    fill-column: 70
+    coding: utf-8
+    End: