PEP 680: "tomllib" Support for parsing TOML in the Standard Library (#2218)

Co-authored-by: Taneli Hukkinen <3275109+hukkin@users.noreply.github.com>
Co-authored-by: Petr Viktorin <encukou@gmail.com>
Shantanu 2022-01-10 12:55:30 -08:00 committed by GitHub
parent 8b9859a142
commit 5056f2a964
1 changed files with 501 additions and 0 deletions

PEP: 680
Title: tomllib: Support for parsing TOML in the Standard Library
Author: Taneli Hukkinen, Shantanu Jain <hauntsaninja at gmail.com>
Sponsor: Petr Viktorin <encukou@gmail.com>
Discussions-To: https://discuss.python.org/t/adopting-recommending-a-toml-parser/4068
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 01-Jan-2022
Python-Version: 3.11
Post-History: 1900-01-01
Abstract
========
This proposes adding a module, ``tomllib``, to the standard library for
parsing TOML (Tom's Obvious Minimal Language,
`https://toml.io <https://toml.io/en/>`_).
Motivation
==========
The TOML format is the format of choice for Python packaging, as evidenced by
:pep:`517`, :pep:`518` and :pep:`621`. Including TOML support in the standard
library helps avoid bootstrapping problems for Python build tools. Currently
most Python build tools need to vendor a TOML parsing library.
Python tools are increasingly configurable via TOML, for example: ``black``,
``mypy``, ``pytest``, ``tox``, ``pylint``, ``isort``. Those that are not, such
as ``flake8``, cite the lack of standard library support as a `main reason why
<https://github.com/PyCQA/flake8/issues/234#issuecomment-812800657>`_.
Given the special place TOML already has in the Python ecosystem, it makes sense
for this to be an included battery.
Finally, TOML as a format is increasingly popular (some reasons for this are
outlined in :pep:`518`). Hence this is likely to be a generally useful addition,
even looking beyond the needs of Python packaging and Python tooling: various
Python TOML libraries have about 2000 reverse dependencies on PyPI. For
comparison, ``requests`` has about 28k reverse dependencies.
Rationale
=========
This PEP proposes basing the standard library support for reading TOML on the
third party library ``tomli``
(`github.com/hukkin/tomli <https://github.com/hukkin/tomli>`_).
Many projects have recently switched to using ``tomli``, for example, ``pip``,
``build``, ``pytest``, ``mypy``, ``black``, ``flit``, ``coverage``,
``setuptools-scm``, ``cibuildwheel``.
``tomli`` is actively maintained and well-tested. ``tomli`` is about 800 lines
of code with 100% test coverage and passes all tests in a test suite `proposed
as the official TOML compliance test suite
<https://github.com/toml-lang/compliance/pull/8>`_, as well as `the more
established BurntSushi/toml-test suite
<https://github.com/BurntSushi/toml-test>`_.
Specification
=============
A new module ``tomllib`` with the following functions will be added:

.. code-block::

    def load(fp: SupportsRead[bytes], /, *, parse_float: Callable[[str], Any] = ...) -> dict[str, Any]: ...
    def loads(s: str, /, *, parse_float: Callable[[str], Any] = ...) -> dict[str, Any]: ...

``tomllib.load`` deserializes a binary file containing a
TOML document to a Python dict.
The ``fp`` argument must have a ``read()`` method with the same API as
``io.RawIOBase.read()``.
``tomllib.loads`` deserializes a str instance containing a TOML document
to a Python dict.
``parse_float`` is a function that takes a string representing a TOML float and
returns a Python object (similar to ``parse_float`` in ``json.load``). For
example, it could return a ``decimal.Decimal`` in cases where precision is
important. By default, TOML floats are parsed as instances of the Python
``float`` type.
The returned object contains only basic Python objects (``str``, ``int``,
``bool``, ``float``, ``datetime.{datetime,date,time}``, ``list``, ``dict`` with
string keys), and the results of ``parse_float``.
``tomllib.TOMLDecodeError`` is raised in the case of invalid TOML.
Note that this PEP does not propose ``tomllib.dump`` or ``tomllib.dumps``
functions; see `<Including an API for writing TOML_>`_ for details.
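
As a rough illustration of the API specified above, the following sketch parses
the same document from a string and from a binary file object, and catches the
error raised for invalid input. It assumes an interpreter that ships the
proposed ``tomllib`` module; the TOML content and the variable names are made up
for the example.

.. code-block:: python

    import io
    import tomllib

    TOML_TEXT = """
    title = "example"

    [server]
    port = 8080
    tags = ["a", "b"]
    """

    # loads() takes a str containing a TOML document.
    config = tomllib.loads(TOML_TEXT)

    # load() takes a binary file object; io.BytesIO stands in for
    # open(path, "rb") so that the sketch needs no file on disk.
    same_config = tomllib.load(io.BytesIO(TOML_TEXT.encode("utf-8")))

    # Invalid TOML (here, a key without a value) raises TOMLDecodeError.
    try:
        tomllib.loads("key = ")
        error = None
    except tomllib.TOMLDecodeError as exc:
        error = exc

Both calls return a plain ``dict`` with string keys, so the result can be used
like any other nested Python data structure.
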
Maintenance Implications
========================
Stability of TOML
-----------------
The release of TOML v1 in January 2021 indicates stability. Empirically, TOML
has proven to be a stable format even prior to the release of TOML v1. From the
`changelog <https://github.com/toml-lang/toml/blob/master/CHANGELOG.md>`_, we
see TOML has had no major changes since April 2020 and has had two releases in
the last five years.
In the event of changes to the TOML specification, we could treat minor
revisions as bug fixes and update the implementation in place. In the event of
major breaking changes, we should preserve support for TOML v1.
Maintainability of proposed implementation
------------------------------------------
The proposed implementation (``tomli``) is in pure Python, well tested and
weighs under 1000 lines of code. It is minimalist, offering a smaller API
surface area than other TOML implementations.
The author of ``tomli`` is willing to help integrate ``tomli`` into the standard
library and help maintain it, `as per this post
<https://github.com/hukkin/tomli/issues/141#issuecomment-998018972>`__.
Petr Viktorin has indicated willingness to maintain a read API,
`as per this post
<https://discuss.python.org/t/adopting-recommending-a-toml-parser/4068/88>`__.
Rewriting the parser in C is not deemed necessary at this time. It's rare for
TOML parsing to be a bottleneck in applications. Users with higher performance
needs can use a third party library (as is already often the case with JSON,
despite a stdlib extension module).
TOML support a slippery slope for other things
----------------------------------------------
As discussed in the Motivation section, TOML holds a special place in the Python
ecosystem. This chief reason to include TOML in the standard library does not
apply to other formats, such as YAML or MessagePack.
In addition, the simplicity of TOML can help serve as a dividing line: YAML, for
example, is large and complicated.
An API for writing TOML may, however, be added in a future PEP.
Backwards Compatibility
=======================
This proposal has no backwards compatibility issues within the stdlib, as it
describes a new module.
Any existing third-party module named ``tomllib`` will break, as
``import tomllib`` will import the standard library module.
However, ``tomllib`` is not registered on PyPI, so it is unlikely that such
a module is widely used.
Note that we avoid using the more straightforward name ``toml``, to avoid
backwards compatibility implications for users who have pinned versions of the
current ``toml`` PyPI package. For more details, see `<Alternative names for
module_>`_.
Security Implications
=====================
Errors in the implementation could cause potential security issues.
The parser's output is limited to simple data types; inability to load
arbitrary classes avoids security issues common in more "powerful" formats like
pickle and YAML. Also, the implementation will be in pure Python, which reduces
security issues endemic to C, such as buffer overflows.
How to Teach This
=================
The API of ``tomllib`` mimics that of other well-established file format
libraries, such as ``json`` and ``pickle``. The lack of a ``dump`` function will
be explained in the documentation, with a link to relevant third-party libraries
(``tomlkit``, ``tomli-w``, ``pytomlpp``).
Reference Implementation
========================
The proposed implementation can be found at https://github.com/hukkin/tomli
Rejected Ideas
==============
Basing on another TOML implementation
-------------------------------------
Potential alternatives include:

* ``tomlkit``.
  ``tomlkit`` is well established, actively maintained and supports TOML v1. An
  important difference is that ``tomlkit`` supports style roundtripping. As a
  result, it has a more complex API and implementation (about 5x as much code as
  ``tomli``). The author of ``tomlkit`` does not believe that it is a good
  choice for the standard library.

* ``toml``.
  ``toml`` is a widely used library. However, it is not actively maintained,
  does not support TOML v1 and has several known bugs. Its API is more complex
  than that of ``tomli``. It has some very limited and mostly unused ability to
  preserve style through an undocumented decoder API. It has the ability to
  customise output style through a complicated encoder API. For more details on
  API differences to this PEP, refer to `Appendix A`_.

* ``pytomlpp``.
  ``pytomlpp`` is a Python wrapper for the C++ project ``toml++``. Pure Python
  libraries are easier to maintain than extension modules.

* ``rtoml``.
  ``rtoml`` is a Python wrapper for the Rust project ``toml-rs`` and hence has
  similar shortcomings to ``pytomlpp``.
  In addition, it does not support TOML v1.

* Writing from scratch.
  It's unclear what we would get from this: ``tomli`` meets our needs and the
  author is willing to help with its inclusion in the standard library.

Including an API for writing TOML
---------------------------------
There are several reasons to not include an API for writing TOML:
The ability to write TOML is not needed for the use cases that motivate this
PEP: for core Python packaging use cases or for tools that need to read
configuration.
Use cases that involve editing TOML (as opposed to writing brand new TOML) are
better served by a style preserving library. TOML is intended as human-readable
and human-editable configuration, so it's important to preserve human markup,
such as comments and formatting. This requires a parser whose output includes
style-related metadata, making it impractical to output plain Python types like
``str`` and ``dict``. Designing such an API is complicated.
But even without considering style preservation, there are too many degrees of
freedom in how to design a write API. For example, how much control to allow
users over output formatting, over serialization of custom types, and over input
and output validation. While there are reasonable choices on how to resolve
these, the nature of the standard library is such that one only gets one chance
to get things right.
Currently no CPython core developers have expressed willingness to maintain a
write API or sponsor a PEP that includes a write API. Since it is hard to change
or remove something in the standard library, it is safer to err on the side of
exclusion and potentially revisit later.
So, writing TOML is left to third-party libraries. If a good API and relevant
use cases for it are found later, it can be added in a future PEP.
Assorted API details
--------------------
Types accepted by the first argument of ``tomllib.load``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The ``toml`` library on PyPI allows passing paths (and lists of path-like
objects, ignoring missing files and merging the documents into a single object).
Doing this would be inconsistent with ``json.load``, ``pickle.load``, etc. If we
agree consistency with other stdlib modules is desirable, allowing paths is
somewhat out of scope for this PEP. This can easily and explicitly be worked
around in user code, or a third-party library.
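
A minimal sketch of such a workaround in user code follows. The helper
``load_paths`` is hypothetical and not part of this proposal; it assumes the
proposed ``tomllib`` module is available and that a shallow merge, in which
later files override earlier top-level keys, is what the caller wants.

.. code-block:: python

    import tomllib
    from pathlib import Path


    def load_paths(paths):
        """Hypothetical helper: parse each existing file and merge the results.

        Missing files are skipped. The merge is shallow: a top-level key from a
        later file replaces the same key from an earlier file.
        """
        merged = {}
        for path in map(Path, paths):
            try:
                # tomllib.load requires a file opened in binary mode.
                with path.open("rb") as fp:
                    merged.update(tomllib.load(fp))
            except FileNotFoundError:
                continue
        return merged

Keeping the path handling in the caller makes the choices about missing files
and merge order explicit, rather than fixing them in the standard library.
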
The proposed API takes a binary file, while ``toml.load`` takes a text file and
``json.load`` takes either. Using a binary file allows us to a) ensure UTF-8 is
the encoding used, and b) avoid incorrectly parsing single carriage returns as
valid TOML due to universal newlines.
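
The universal newlines concern can be seen with the standard ``io`` module
alone. In this sketch, which uses made-up input bytes, a text-mode wrapper with
default newline handling rewrites a lone carriage return into a line feed before
any parser could see it, while a binary read preserves the bytes as stored.

.. code-block:: python

    import io

    # A lone carriage return separates the two statements. TOML accepts only
    # LF or CRLF as a newline, so this document is not valid TOML.
    raw = b"a = 1\rb = 2\n"

    # Text mode with the default newline handling (universal newlines)
    # translates the lone "\r" into "\n" while reading.
    text = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8").read()

    # Binary mode returns the bytes exactly as stored.
    binary = io.BytesIO(raw).read()

After the text-mode read the invalid separator is indistinguishable from a valid
one, which is why a parser fed from a text file cannot reject such input.
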
Type accepted by the first argument of ``tomllib.loads``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
While ``tomllib.load`` takes a binary file, ``tomllib.loads`` takes
a text string. This may seem inconsistent at first.
Quoting the TOML v1.0.0 specification:

    A TOML file must be a valid UTF-8 encoded Unicode document.

``tomllib.loads`` does not intend to load a TOML file, but rather the
document that the file stores. The most natural representation of
a Unicode document in Python is ``str``, not ``bytes``.
It is possible to add ``bytes`` support in the future if needed, but
we are not aware of any use cases for it.
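
For callers that do hold ``bytes``, decoding explicitly is a one-line step. The
sketch below assumes the proposed ``tomllib`` module; the payload is invented
for the example.

.. code-block:: python

    import tomllib

    # UTF-8 encoded bytes, for instance read from a socket or an archive.
    payload = b'name = "caf\xc3\xa9"\n'

    # Decode as UTF-8 (the only encoding TOML permits), then parse the str.
    document = tomllib.loads(payload.decode("utf-8"))
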
Controlling the type of mappings returned by ``tomllib.load[s]``
----------------------------------------------------------------
The ``toml`` library on PyPI supports a ``_dict`` argument, which works
similarly to the ``object_hook`` argument in ``json.load[s]``. There are several
uses of ``_dict`` found on https://grep.app, however, almost all of them are
passing ``_dict=OrderedDict``, which should be unnecessary as of Python 3.7. We
found two instances of legitimate use: in one case, a custom class was passed
for friendlier KeyErrors, in another case, the custom class had several
additional lookup and mutation methods (e.g. to help resolve dotted keys).
Such an argument is not necessary for the core use cases outlined in the
motivation section. The absence of this can be pretty easily worked around using
a wrapper class, transformer function, or a third-party library. Finally,
support could be added later in a backward compatible way.
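
One possible shape for the transformer-function workaround is sketched below.
``FriendlyDict`` and ``convert`` are hypothetical names chosen for illustration,
and the ``parsed`` dict stands in for the output of ``tomllib.loads`` so that
the example does not depend on the new module.

.. code-block:: python

    class FriendlyDict(dict):
        """Hypothetical mapping that raises a more descriptive KeyError."""

        def __missing__(self, key):
            raise KeyError(f"{key!r} not found; available keys: {sorted(self)}")


    def convert(obj, mapping_type=FriendlyDict):
        """Recursively rebuild parsed TOML data using ``mapping_type`` for tables."""
        if isinstance(obj, dict):
            return mapping_type(
                (key, convert(value, mapping_type)) for key, value in obj.items()
            )
        if isinstance(obj, list):
            return [convert(value, mapping_type) for value in obj]
        return obj


    # Stand-in for the plain dict returned by tomllib.loads(...).
    parsed = {"tool": {"name": "demo", "paths": [{"src": "a"}]}}
    converted = convert(parsed)

The conversion costs one extra pass over the parsed data, which is unlikely to
matter for configuration-sized documents.
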
Removing support for ``parse_float`` in ``tomllib.load[s]``
-----------------------------------------------------------
This option is not strictly necessary, since TOML floats are "IEEE 754 binary64
values", which is ``float`` on most architectures. Using ``decimal.Decimal``
thus allows users extra precision not promised by the TOML format. However, in
the experience of the author of ``tomli``, this is useful in scientific and
financial applications. TOML-facing users may include non-developers who are not
aware of the limits of double-precision float.

There are also niche architectures where the Python ``float`` is not an IEEE 754
binary64. The ``parse_float`` argument allows users to achieve correct TOML
semantics even on such architectures.
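
A short sketch of the ``parse_float`` hook, assuming the proposed ``tomllib``
module is available; the document contents are invented for the example.

.. code-block:: python

    import tomllib
    from decimal import Decimal

    doc = "price = 0.1\ncount = 3\n"

    # Default behaviour: TOML floats become Python floats.
    as_float = tomllib.loads(doc)

    # With parse_float=Decimal, the float's source text is passed to Decimal,
    # so no binary rounding takes place. Integers are unaffected.
    as_decimal = tomllib.loads(doc, parse_float=Decimal)

Only values that are TOML floats go through the hook; ``count`` remains an
``int`` in both results.
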
Alternative names for module
----------------------------
Ideally, we would be able to use the ``toml`` module name.
However, the ``toml`` package on PyPI is widely used, so there are backward
compatibility concerns. Since the standard library takes precedence over third
party packages, users who have pinned versions of ``toml`` would be broken by
any API incompatibilities when upgrading Python versions.
To further clarify, the user pins are the specific concern here. Even if we were
able to get control over the ``toml`` PyPI package and repurpose it as a
standard library backport, we would still break users who have pinned to
versions of the current ``toml`` package. This is unfortunate, since pinning
would likely be a common response to breaking changes introduced by repurposing
the ``toml`` package as a backport (that is incompatible with today's ``toml``).
There are several API incompatibilities between ``toml`` and the API proposed in
this PEP, listed in `Appendix A`_.
Finally, the ``toml`` package on PyPI is not actively maintained and `we have
been unable to contact the author <https://github.com/uiri/toml/issues/361>`__,
so action here would likely have to be taken without the author's consent.
This PEP proposes ``tomllib``. This mirrors ``plistlib`` (another file format
module in the standard library), as well as several others such as ``pathlib``,
``graphlib``, etc.
Other considered names include:

* ``tomlparser``. This mirrors ``configparser``, but is perhaps slightly less
  appropriate if we include a write API in the future.
* ``tomli``. This assumes we use ``tomli`` as the basis for implementation.
* ``toml`` under some namespace, such as ``parser.toml``. However, this is
  awkward, especially so since existing libraries like ``json``, ``pickle``,
  ``marshal``, ``html`` etc. would not be included in the namespace.

TODO: Random things
===================
Previous discussion:
* https://bugs.python.org/issue40059
* https://mail.python.org/archives/list/python-ideas@python.org/thread/IWJ3I32A4TY6CIVQ6ONPEBPWP4TOV2V7/
* https://mail.python.org/pipermail/python-dev/2019-May/157405.html
* https://github.com/hukkin/tomli/issues/141
* https://discuss.python.org/t/adopting-recommending-a-toml-parser/4068/84
Useful https://grep.app searches (note, ignore vendored):
* toml.load[s] usage https://grep.app/search?q=toml.load&filter[lang][0]=Python
* toml.dump[s] usage https://grep.app/search?q=toml.dump&filter[lang][0]=Python
* TomlEncoder subclasses https://grep.app/search?q=TomlEncoder%29%3A&filter[lang][0]=Python
.. _Appendix A:
Appendix A: Differences between proposed API and ``toml``
=========================================================
This appendix covers the differences between the API proposed in this PEP and
that of the third party package ``toml``. These differences are relevant to
understanding the amount of breakage we could expect if we used the ``toml``
name for the standard library module, as well as to better understand the design
space. Note that this list might not be exhaustive.
#. This PEP currently proposes not to include a write API. That is, there will
be no equivalent of ``toml.dump`` or ``toml.dumps``.
Discussed at `<Including an API for writing TOML_>`_.
If we included a write API, it would be relatively simple to convert most
code that uses ``toml`` to use the API proposed in this PEP (acknowledging
that that is very different from a compatible API).
A significant fraction of ``toml`` users rely on this.
#. Different first argument of ``toml.load``
``toml.load`` has the following signature:

.. code-block::

    def load(
        f: Union[SupportsRead[str], str, bytes, list[PathLike | str | bytes]],
        _dict: Type[MutableMapping[str, Any]] = ...,
        decoder: TomlDecoder = ...,
    ) -> MutableMapping[str, Any]: ...

This is pretty different from the first argument proposed in this PEP: ``SupportsRead[bytes]``.
Recapping the reasons for this, previously mentioned at
`<Types accepted by the first argument of tomllib.load_>`_:
* Allowing passing of paths (and lists of path-like objects, ignoring missing
files and merging the documents into a single object) is inconsistent with
other similar functions in the standard library.
* Using ``SupportsRead[bytes]`` allows us to a) ensure UTF-8 is the encoding used,
  and b) avoid incorrectly parsing single carriage returns as valid TOML due to
  universal newlines. TOML specifies file encoding and valid newline
  sequences, and hence is simply a stricter format than what text file objects
  represent.
A significant fraction of ``toml`` users rely on this.
#. Errors
``toml`` raises ``TomlDecodeError`` vs the proposed PEP 8 compliant
``TOMLDecodeError``.
A significant fraction of ``toml`` users rely on this.
#. ``toml.load[s]`` accepts a ``_dict`` argument
Discussed at `<Controlling the type of mappings returned by tomllib.load[s]_>`_.
As discussed, almost all usage consists of ``_dict=OrderedDict``, which is
not necessary in Python 3.7 and later.
#. ``toml.load[s]`` support an undocumented ``decoder`` argument
It seems the intended use case is for an implementation of comment
preservation. The information recorded is not sufficient to roundtrip the
TOML document preserving style, the implementation has known bugs, the
feature is undocumented and I could only find one instance of its use on
https://grep.app.
The ``toml.TomlDecoder`` interface exposed is not simple, containing nine methods.
See `here <https://github.com/uiri/toml/blob/3f637dba5f68db63d4b30967fedda51c82459471/toml/decoder.pyi#L36>`__.
Users are probably better served by a more complete implementation of style
preserving parsing and writing.
#. ``toml.dump[s]`` support an ``encoder`` argument
Note that we currently propose not to include a write API, however if that
were to change, these differences would likely become relevant.
This enables two use cases, a) control over how custom types should be
serialized, b) control over how output should be formatted.
The first use case is reasonable, however, I could only find two instances of
this on https://grep.app. One of these two instances used this ability to add
support for dumping ``decimal.Decimal`` (which a potential standard library
implementation would support out of the box).
If needed, this use case could be well served by the equivalent of the
``default`` argument in ``json.dump``.
The second use case is enabled by allowing users to specify subclasses of
``toml.TomlEncoder`` and overriding methods to specify parts of the TOML
writing process. The API consists of five methods and exposes a lot of
implementation detail. See `here <https://github.com/uiri/toml/blob/3f637dba5f68db63d4b30967fedda51c82459471/toml/encoder.pyi#L9>`__.
There is some usage of the ``encoder`` API on https://grep.app, however, it
likely accounts for a tiny fraction of overall usage of ``toml``.
#. Timezones
``toml`` uses and exposes custom ``toml.tz.TomlTz`` timezone objects. The
proposed implementation uses ``datetime.timezone`` objects from the standard
library.
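
To illustrate the timezone difference noted in the last item, the sketch below
parses an offset date-time and inspects its ``tzinfo``. It assumes the proposed
``tomllib`` module is available; the key name and the timestamp are made up.

.. code-block:: python

    import datetime
    import tomllib

    doc = tomllib.loads("released = 1979-05-27T00:32:00-07:00\n")

    # The offset is carried by a standard-library datetime.timezone object
    # rather than by a parser-specific tzinfo class.
    tz = doc["released"].tzinfo
    offset = doc["released"].utcoffset()
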
Copyright
=========
This document is placed in the public domain or under the
CC0-1.0-Universal license, whichever is more permissive.
..
Local Variables:
mode: indented-text
indent-tabs-mode: nil
sentence-end-double-space: t
fill-column: 70
coding: utf-8
End: