PEP 680: "tomllib" Support for parsing TOML in the Standard Library (#2218)
Co-authored-by: Taneli Hukkinen <3275109+hukkin@users.noreply.github.com> Co-authored-by: Petr Viktorin <encukou@gmail.com>
PEP: 680
|
||||
Title: tomllib: Support for parsing TOML in the Standard Library
|
||||
Author: Taneli Hukkinen, Shantanu Jain <hauntsaninja at gmail.com>
|
||||
Sponsor: Petr Viktorin <encukou@gmail.com>
|
||||
Discussions-To: https://discuss.python.org/t/adopting-recommending-a-toml-parser/4068
|
||||
Status: Draft
|
||||
Type: Standards Track
|
||||
Content-Type: text/x-rst
|
||||
Created: 01-Jan-2022
|
||||
Python-Version: 3.11
|
||||
Post-History: 1900-01-01
|
||||
|
||||
|
||||
Abstract
|
||||
========
|
||||
|
||||
This PEP proposes adding a module, ``tomllib``, to the standard library for
|
||||
parsing TOML (Tom's Obvious Minimal Language,
|
||||
`https://toml.io <https://toml.io/en/>`_).
|
||||
|
||||
|
||||
Motivation
|
||||
==========
|
||||
|
||||
The TOML format is the format of choice for Python packaging, as evidenced by
|
||||
:pep:`517`, :pep:`518` and :pep:`621`. Including TOML support in the standard
|
||||
library helps avoid bootstrapping problems for Python build tools. Currently
|
||||
most Python build tools need to vendor a TOML parsing library.
|
||||
|
||||
Python tools are increasingly configurable via TOML, for example: ``black``,
|
||||
``mypy``, ``pytest``, ``tox``, ``pylint``, ``isort``. Those that are not, such
|
||||
as ``flake8``, cite the lack of standard library support as a `main reason why
|
||||
<https://github.com/PyCQA/flake8/issues/234#issuecomment-812800657>`_.
|
||||
|
||||
Given the special place TOML already has in the Python ecosystem, it makes sense
|
||||
for this to be an included battery.
|
||||
|
||||
Finally, TOML as a format is increasingly popular (some reasons for this are
|
||||
outlined in PEP 518). Hence this is likely to be a generally useful addition,
|
||||
even looking beyond the needs of Python packaging and Python tooling: various
|
||||
Python TOML libraries have about 2000 reverse dependencies on PyPI. For
|
||||
comparison, ``requests`` has about 28k reverse dependencies.
|
||||
|
||||
|
||||
Rationale
|
||||
=========
|
||||
|
||||
This PEP proposes basing the standard library support for reading TOML on the
|
||||
third party library ``tomli``
|
||||
(`github.com/hukkin/tomli <https://github.com/hukkin/tomli>`_).
|
||||
|
||||
Many projects have recently switched to using ``tomli``, for example, ``pip``,
|
||||
``build``, ``pytest``, ``mypy``, ``black``, ``flit``, ``coverage``,
|
||||
``setuptools-scm``, ``cibuildwheel``.
|
||||
|
||||
``tomli`` is actively maintained and well-tested. ``tomli`` is about 800 lines
|
||||
of code with 100% test coverage and passes all tests in a test suite `proposed
|
||||
as the official TOML compliance test suite
|
||||
<https://github.com/toml-lang/compliance/pull/8>`_, as well as `the more
|
||||
established BurntSushi/toml-test suite
|
||||
<https://github.com/BurntSushi/toml-test>`_.
|
||||
|
||||
|
||||
Specification
|
||||
=============
|
||||
|
||||
A new module ``tomllib`` with the following functions will be added:
|
||||
|
||||
.. code-block::

    def load(fp: SupportsRead[bytes], /, *, parse_float: Callable[[str], Any] = ...) -> dict[str, Any]: ...
    def loads(s: str, /, *, parse_float: Callable[[str], Any] = ...) -> dict[str, Any]: ...
|
||||
|
||||
``tomllib.load`` deserializes a binary file containing a
|
||||
TOML document to a Python dict.
|
||||
The ``fp`` argument must have a ``read()`` method with the same API as
|
||||
``io.RawIOBase.read()``.
|
||||
|
||||
``tomllib.loads`` deserializes a str instance containing a TOML document
|
||||
to a Python dict.
|
||||
|
||||
``parse_float`` is a function that takes a string representing a TOML float and
|
||||
returns a Python object (similar to ``parse_float`` in ``json.load``). For
example, a function returning a ``decimal.Decimal`` can be passed in cases where
precision is important. By default, TOML floats are parsed as Python ``float``
values.
|
||||
|
||||
The returned object contains only basic Python objects (``str``, ``int``,
|
||||
``bool``, ``float``, ``datetime.{datetime,date,time}``, ``list``, ``dict`` with
|
||||
string keys), and the results of ``parse_float``.
|
||||
|
||||
``tomllib.TOMLDecodeError`` is raised in the case of invalid TOML.
|
||||
|
||||
Note that this PEP does not propose ``tomllib.dump`` or ``tomllib.dumps``
|
||||
functions, see `<Including an API for writing TOML_>`_ for details.
|
||||
|
||||
|
||||
Maintenance Implications
|
||||
========================
|
||||
|
||||
Stability of TOML
|
||||
-----------------
|
||||
|
||||
The release of TOML v1 in January 2021 indicates stability. Empirically, TOML
|
||||
has proven to be a stable format even prior to the release of TOML v1. From the
|
||||
`changelog <https://github.com/toml-lang/toml/blob/master/CHANGELOG.md>`_, we
|
||||
see TOML has had no major changes since April 2020 and has had two releases in
|
||||
the last five years.
|
||||
|
||||
In the event of changes to the TOML specification, we could treat minor
|
||||
revisions as bug fixes and update the implementation in place. In the event of
|
||||
major breaking changes, we should preserve support for TOML v1.
|
||||
|
||||
Maintainability of proposed implementation
|
||||
------------------------------------------
|
||||
|
||||
The proposed implementation (``tomli``) is in pure Python, well tested and
|
||||
weighs under 1000 lines of code. It is minimalist, offering a smaller API
|
||||
surface area than other TOML implementations.
|
||||
|
||||
The author of ``tomli`` is willing to help integrate ``tomli`` into the standard
|
||||
library and help maintain it, `as per this post
|
||||
<https://github.com/hukkin/tomli/issues/141#issuecomment-998018972>`__.
|
||||
Petr Viktorin has indicated willingness to maintain a read API,
|
||||
`as per this post
|
||||
<https://discuss.python.org/t/adopting-recommending-a-toml-parser/4068/88>`__.
|
||||
|
||||
Rewriting the parser in C is not deemed necessary at this time. It's rare for
|
||||
TOML parsing to be a bottleneck in applications. Users with higher performance
|
||||
needs can use a third party library (as is already often the case with JSON,
|
||||
despite a stdlib extension module).
|
||||
|
||||
TOML support a slippery slope for other things
|
||||
----------------------------------------------
|
||||
|
||||
As discussed in the Motivation section, TOML holds a special place in the Python ecosystem.
|
||||
This chief reason to include TOML in the standard library does not apply to
|
||||
other formats, such as YAML or MessagePack.
|
||||
|
||||
In addition, the simplicity of TOML can help serve as a dividing line: YAML,
for example, is large and complicated.
|
||||
|
||||
An API for writing TOML may, however, be added in a future PEP.
|
||||
|
||||
|
||||
Backwards Compatibility
|
||||
=======================
|
||||
|
||||
This proposal has no backwards compatibility issues within the stdlib, as it
|
||||
describes a new module.
|
||||
Any existing third-party module named ``tomllib`` will break, as
|
||||
``import tomllib`` will import the standard library module.
|
||||
However, ``tomllib`` is not registered on PyPI, so it is unlikely that such
|
||||
a module is widely used.
|
||||
|
||||
Note that we avoid using the more straightforward name ``toml``, to avoid
|
||||
backwards compatibility implications for users who have pinned versions of the
|
||||
current ``toml`` PyPI package. For more details, see `<Alternative names for
|
||||
module_>`_.
|
||||
|
||||
|
||||
Security Implications
|
||||
=====================
|
||||
|
||||
Errors in the implementation could cause potential security issues.
|
||||
The parser's output is limited to simple data types; inability to load
|
||||
arbitrary classes avoids security issues common in more "powerful" formats like
|
||||
pickle and YAML. Also, the implementation will be in pure Python, which reduces
|
||||
security issues endemic to C, such as buffer overflows.
|
||||
|
||||
|
||||
How to Teach This
|
||||
=================
|
||||
|
||||
The API of ``tomllib`` mimics that of other well-established file format
|
||||
libraries, such as ``json`` and ``pickle``. The lack of a ``dump`` function will
|
||||
be explained in the documentation, with a link to relevant third-party libraries
|
||||
(``tomlkit``, ``tomli-w``, ``pytomlpp``).
|
||||
|
||||
|
||||
Reference Implementation
|
||||
========================
|
||||
|
||||
The proposed implementation can be found at https://github.com/hukkin/tomli
|
||||
|
||||
|
||||
Rejected Ideas
|
||||
==============
|
||||
|
||||
Basing on another TOML implementation
|
||||
-------------------------------------
|
||||
|
||||
Potential alternatives include:
|
||||
|
||||
* ``tomlkit``.
  ``tomlkit`` is well established, actively maintained and supports TOML v1. An
  important difference is that ``tomlkit`` supports style roundtripping. As a
  result, it has a more complex API and implementation (about 5x as much code as
  ``tomli``). The author does not believe that ``tomlkit`` is a good choice for
  the standard library.

* ``toml``.
  ``toml`` is a widely used library. However, it is not actively maintained,
  does not support TOML v1 and has several known bugs. Its API is more complex
  than that of ``tomli``. It has some very limited and mostly unused ability to
  preserve style through an undocumented decoder API. It has the ability to
  customise output style through a complicated encoder API. For more details on
  API differences to this PEP, refer to `Appendix A`_.

* ``pytomlpp``.
  ``pytomlpp`` is a Python wrapper for the C++ project ``toml++``. Pure Python
  libraries are easier to maintain than extension modules.

* ``rtoml``.
  ``rtoml`` is a Python wrapper for the Rust project ``toml-rs`` and hence has
  similar shortcomings to ``pytomlpp``.
  In addition, it does not support TOML v1.

* Writing from scratch.
  It's unclear what we would get from this: ``tomli`` meets our needs and the
  author is willing to help with its inclusion in the standard library.
|
||||
|
||||
Including an API for writing TOML
|
||||
---------------------------------
|
||||
|
||||
There are several reasons to not include an API for writing TOML:
|
||||
|
||||
The ability to write TOML is not needed for the use cases that motivate this
|
||||
PEP: for core Python packaging use cases or for tools that need to read
|
||||
configuration.
|
||||
|
||||
Use cases that involve editing TOML (as opposed to writing brand new TOML) are
|
||||
better served by a style preserving library. TOML is intended as human-readable
|
||||
and human-editable configuration, so it's important to preserve human markup,
|
||||
such as comments and formatting. This requires a parser whose output includes
|
||||
style-related metadata, making it impractical to output plain Python types like
|
||||
``str`` and ``dict``. Designing such an API is complicated.
|
||||
|
||||
But even without considering style preservation, there are too many degrees of
|
||||
freedom in how to design a write API. For example, how much control to allow
|
||||
users over output formatting, over serialization of custom types, and over input
|
||||
and output validation. While there are reasonable choices on how to resolve
|
||||
these, the nature of the standard library is such that one only gets one chance
|
||||
to get things right.
|
||||
|
||||
Currently no CPython core developers have expressed willingness to maintain a
|
||||
write API or sponsor a PEP that includes a write API. Since it is hard to change
|
||||
or remove something in the standard library, it is safer to err on the side of
|
||||
exclusion and potentially revisit later.
|
||||
|
||||
So, writing TOML is left to third-party libraries. If a good API and relevant
|
||||
use cases for it are found later, it can be added in a future PEP.
|
||||
|
||||
|
||||
Assorted API details
|
||||
--------------------
|
||||
|
||||
Types accepted by the first argument of ``tomllib.load``
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
The ``toml`` library on PyPI allows passing paths (and lists of path-like
|
||||
objects, ignoring missing files and merging the documents into a single object).
|
||||
Doing this would be inconsistent with ``json.load``, ``pickle.load``, etc. If we
|
||||
agree consistency with other stdlib modules is desirable, allowing paths is
|
||||
somewhat out of scope for this PEP. This can easily and explicitly be worked
|
||||
around in user code, or a third-party library.
|
||||
|
||||
The proposed API takes a binary file, while ``toml.load`` takes a text file and
|
||||
``json.load`` takes either. Using a binary file allows us to a) ensure utf-8 is
|
||||
the encoding used, b) avoid incorrectly parsing single carriage returns as valid
|
||||
TOML due to universal newlines.
|
||||
|
||||
Type accepted by the first argument of ``tomllib.loads``
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
While ``tomllib.load`` takes a binary file, ``tomllib.loads`` takes
|
||||
a text string. This may seem inconsistent at first.
|
||||
|
||||
Quoting the TOML v1.0.0 specification:

    A TOML file must be a valid UTF-8 encoded Unicode document.
|
||||
|
||||
``tomllib.loads`` does not intend to load a TOML file, but rather the
|
||||
document that the file stores. The most natural representation of
|
||||
a Unicode document in Python is ``str``, not ``bytes``.
|
||||
|
||||
It is possible to add ``bytes`` support in the future if needed, but
|
||||
we are not aware of any use cases for it.
|
||||
|
||||
Controlling the type of mappings returned by ``tomllib.load[s]``
|
||||
----------------------------------------------------------------
|
||||
|
||||
The ``toml`` library on PyPI supports a ``_dict`` argument, which works
|
||||
similarly to the ``object_hook`` argument in ``json.load[s]``. There are several
|
||||
uses of ``_dict`` found on https://grep.app; however, almost all of them are
|
||||
passing ``_dict=OrderedDict``, which should be unnecessary as of Python 3.7. We
|
||||
found two instances of legitimate use: in one case, a custom class was passed
|
||||
for friendlier KeyErrors, in another case, the custom class had several
|
||||
additional lookup and mutation methods (e.g. to help resolve dotted keys).
|
||||
|
||||
Such an argument is not necessary for the core use cases outlined in the
|
||||
motivation section. The absence of this can be pretty easily worked around using
|
||||
a wrapper class, transformer function, or a third-party library. Finally,
|
||||
support could be added later in a backward compatible way.
|
||||
|
||||
|
||||
Removing support for ``parse_float`` in ``tomllib.load[s]``
|
||||
-----------------------------------------------------------
|
||||
|
||||
This option is not strictly necessary, since TOML floats are "IEEE 754 binary64
|
||||
values", which is ``float`` on most architectures. Using ``decimal.Decimal``
|
||||
thus allows users extra precision not promised by the TOML format. However, in
|
||||
the author of ``tomli``'s experience, this is useful in scientific and financial
|
||||
applications. TOML-facing users may include non-developers who are not aware of
|
||||
the limits of double-precision float.
|
||||
|
||||
There are also niche architectures where the Python ``float`` is not an IEEE 754
|
||||
binary64. The ``parse_float`` argument allows users to achieve correct TOML
|
||||
semantics even on such architectures.
|
||||
|
||||
|
||||
Alternative names for module
|
||||
----------------------------
|
||||
|
||||
Ideally, we would be able to use the ``toml`` module name.
|
||||
|
||||
However, the ``toml`` package on PyPI is widely used, so there are backward
|
||||
compatibility concerns. Since the standard library takes precedence over third
|
||||
party packages, users who have pinned versions of ``toml`` would be broken when
|
||||
upgrading Python versions by any API incompatibilities.
|
||||
|
||||
To further clarify, the user pins are the specific concern here. Even if we were
|
||||
able to get control over the ``toml`` PyPI package and repurpose it as a
|
||||
standard library backport, we would still break users who have pinned to
|
||||
versions of the current ``toml`` package. This is unfortunate, since pinning
|
||||
would likely be a common response to breaking changes introduced by repurposing
|
||||
the ``toml`` package as a backport (that is incompatible with today's ``toml``).
|
||||
|
||||
There are several API incompatibilities between ``toml`` and the API proposed in
|
||||
this PEP, listed in `Appendix A`_.
|
||||
|
||||
Finally, the ``toml`` package on PyPI is not actively maintained and `we have
|
||||
been unable to contact the author <https://github.com/uiri/toml/issues/361>`_,
|
||||
so action here would likely have to be taken without the author's consent.
|
||||
|
||||
This PEP proposes ``tomllib``. This mirrors ``plistlib`` (another file format
|
||||
module in the standard library), as well as several others such as ``pathlib``,
|
||||
``graphlib``, etc.
|
||||
|
||||
Other considered names include:
|
||||
|
||||
* ``tomlparser``. This mirrors ``configparser``, but is perhaps slightly less
  appropriate if we include a write API in the future.
* ``tomli``. This assumes we use ``tomli`` as the basis for implementation.
* ``toml`` under some namespace, such as ``parser.toml``. However, this is
  awkward, especially so since existing libraries like ``json``, ``pickle``,
  ``marshal``, ``html`` etc. would not be included in the namespace.
|
||||
|
||||
|
||||
TODO: Random things
|
||||
===================
|
||||
|
||||
Previous discussion:
|
||||
|
||||
* https://bugs.python.org/issue40059
|
||||
* https://mail.python.org/archives/list/python-ideas@python.org/thread/IWJ3I32A4TY6CIVQ6ONPEBPWP4TOV2V7/
|
||||
* https://mail.python.org/pipermail/python-dev/2019-May/157405.html
|
||||
* https://github.com/hukkin/tomli/issues/141
|
||||
* https://discuss.python.org/t/adopting-recommending-a-toml-parser/4068/84
|
||||
|
||||
Useful https://grep.app searches (note, ignore vendored):
|
||||
|
||||
* toml.load[s] usage https://grep.app/search?q=toml.load&filter[lang][0]=Python
|
||||
* toml.dump[s] usage https://grep.app/search?q=toml.dump&filter[lang][0]=Python
|
||||
* TomlEncoder subclasses https://grep.app/search?q=TomlEncoder%29%3A&filter[lang][0]=Python
|
||||
|
||||
|
||||
.. _Appendix A:
|
||||
|
||||
Appendix A: Differences between proposed API and ``toml``
|
||||
=========================================================
|
||||
|
||||
This appendix covers the differences between the API proposed in this PEP and
|
||||
that of the third party package ``toml``. These differences are relevant to
|
||||
understanding the amount of breakage we could expect if we used the ``toml``
|
||||
name for the standard library module, as well as to better understand the design
|
||||
space. Note that this list might not be exhaustive.
|
||||
|
||||
#. This PEP currently proposes not to include a write API. That is, there will
|
||||
be no equivalent of ``toml.dump`` or ``toml.dumps``.
|
||||
|
||||
Discussed at `<Including an API for writing TOML_>`_.
|
||||
|
||||
If we included a write API, it would be relatively simple to convert most
|
||||
code that uses ``toml`` to use the API proposed in this PEP (acknowledging
|
||||
that that is very different from a compatible API).
|
||||
|
||||
A significant fraction of ``toml`` users rely on this.
|
||||
|
||||
#. Different first argument of ``toml.load``
|
||||
|
||||
``toml.load`` has the following signature:
|
||||
|
||||
.. code-block::

    def load(
        f: Union[SupportsRead[str], str, bytes, list[PathLike | str | bytes]],
        _dict: Type[MutableMapping[str, Any]] = ...,
        decoder: TomlDecoder = ...,
    ) -> MutableMapping[str, Any]: ...
|
||||
|
||||
This is pretty different from the first argument proposed in this PEP: ``SupportsRead[bytes]``.
|
||||
|
||||
Recapping the reasons for this, previously mentioned at
|
||||
`<Types accepted by the first argument of tomllib.load_>`_:
|
||||
|
||||
* Allowing passing of paths (and lists of path-like objects, ignoring missing
|
||||
files and merging the documents into a single object) is inconsistent with
|
||||
other similar functions in the standard library.
|
||||
* Using ``SupportsRead[bytes]`` allows us to a) ensure utf-8 is the encoding used,
|
||||
b) avoid incorrectly parsing single carriage returns as valid TOML due to
|
||||
universal newlines. TOML specifies file encoding and valid newline
|
||||
sequences, and hence is simply a stricter format than what text file objects
|
||||
represent.
|
||||
|
||||
A significant fraction of ``toml`` users rely on this.
|
||||
|
||||
#. Errors
|
||||
|
||||
``toml`` raises ``TomlDecodeError``, whereas this PEP proposes the PEP 8 compliant
``TOMLDecodeError``.
|
||||
|
||||
A significant fraction of ``toml`` users rely on this.
|
||||
|
||||
#. ``toml.load[s]`` accepts a ``_dict`` argument
|
||||
|
||||
Discussed at `<Controlling the type of mappings returned by tomllib.load[s]_>`_.
|
||||
|
||||
As discussed, almost all usage consists of ``_dict=OrderedDict``, which is
|
||||
not necessary in Python 3.7 and later.
|
||||
|
||||
#. ``toml.load[s]`` support an undocumented ``decoder`` argument
|
||||
|
||||
It seems the intended use case is for an implementation of comment
|
||||
preservation. The information recorded is not sufficient to roundtrip the
|
||||
TOML document preserving style, the implementation has known bugs, the
|
||||
feature is undocumented and I could only find one instance of its use on
|
||||
https://grep.app.
|
||||
|
||||
The ``toml.TomlDecoder`` interface exposed is not simple, containing nine methods.
|
||||
See `here <https://github.com/uiri/toml/blob/3f637dba5f68db63d4b30967fedda51c82459471/toml/decoder.pyi#L36>`__.
|
||||
|
||||
Users are probably better served by a more complete implementation of style
|
||||
preserving parsing and writing.
|
||||
|
||||
#. ``toml.dump[s]`` support an ``encoder`` argument
|
||||
|
||||
Note that we currently propose not to include a write API; however, if that
|
||||
were to change, these differences would likely become relevant.
|
||||
|
||||
This enables two use cases, a) control over how custom types should be
|
||||
serialized, b) control over how output should be formatted.
|
||||
|
||||
The first use case is reasonable; however, I could only find two instances of
|
||||
this on https://grep.app. One of these two instances used this ability to add
|
||||
support for dumping ``decimal.Decimal`` (which a potential standard library
|
||||
implementation would support out of the box).
|
||||
|
||||
If needed, this use case could be well served by the equivalent of the
|
||||
``default`` argument in ``json.dump``.
|
||||
|
||||
The second use case is enabled by allowing users to specify subclasses of
|
||||
``toml.TomlEncoder`` and overriding methods to specify parts of the TOML
|
||||
writing process. The API consists of five methods and exposes a lot of
|
||||
implementation detail. See `here <https://github.com/uiri/toml/blob/3f637dba5f68db63d4b30967fedda51c82459471/toml/encoder.pyi#L9>`__.
|
||||
|
||||
There is some usage of the ``encoder`` API on https://grep.app; however, it
|
||||
likely accounts for a tiny fraction of overall usage of ``toml``.
|
||||
|
||||
#. Timezones
|
||||
|
||||
``toml`` uses and exposes custom ``toml.tz.TomlTz`` timezone objects. The
|
||||
proposed implementation uses ``datetime.timezone`` objects from the standard
|
||||
library.
|
||||
|
||||
|
||||
Copyright
|
||||
=========
|
||||
|
||||
This document is placed in the public domain or under the
|
||||
CC0-1.0-Universal license, whichever is more permissive.
|
||||
|
||||
|
||||
|
||||
..
|
||||
Local Variables:
|
||||
mode: indented-text
|
||||
indent-tabs-mode: nil
|
||||
sentence-end-double-space: t
|
||||
fill-column: 70
|
||||
coding: utf-8
|
||||
End:
|