502 lines
21 KiB
ReStructuredText
502 lines
21 KiB
ReStructuredText
PEP: 680
|
||
Title: tomllib: Support for parsing TOML in the Standard Library
|
||
Author: Taneli Hukkinen, Shantanu Jain <hauntsaninja at gmail.com>
|
||
Sponsor: Petr Viktorin <encukou@gmail.com>
|
||
Discussions-To: https://discuss.python.org/t/adopting-recommending-a-toml-parser/4068
|
||
Status: Draft
|
||
Type: Standards Track
|
||
Content-Type: text/x-rst
|
||
Created: 01-Jan-2022
|
||
Python-Version: 3.11
|
||
Post-History: 1900-01-01
|
||
|
||
|
||
Abstract
|
||
========
|
||
|
||
This proposes adding a module, ``tomllib``, to the standard library for
|
||
parsing TOML (Tom's Obvious Minimal Language,
|
||
`https://toml.io <https://toml.io/en/>`_).
|
||
|
||
|
||
Motivation
|
||
==========
|
||
|
||
The TOML format is the format of choice for Python packaging, as evidenced by
|
||
:pep:`517`, :pep:`518` and :pep:`621`. Including TOML support in the standard
|
||
library helps avoid bootstrapping problems for Python build tools. Currently
|
||
most Python build tools need to vendor a TOML parsing library.
|
||
|
||
Python tools are increasingly configurable via TOML, for examples: ``black``,
|
||
``mypy``, ``pytest``, ``tox``, ``pylint``, ``isort``. Those that are not, such
|
||
as ``flake8``, cite the lack of standard library support as a `main reason why
|
||
<https://github.com/PyCQA/flake8/issues/234#issuecomment-812800657>`_.
|
||
|
||
Given the special place TOML already has in the Python ecosystem, it makes sense
|
||
for this to be an included battery.
|
||
|
||
Finally, TOML as a format is increasingly popular (some reasons for this are
|
||
outlined in PEP 518). Hence this is likely to be a generally useful addition,
|
||
even looking beyond the needs of Python packaging and Python tooling: various
|
||
Python TOML libraries have about 2000 reverse dependencies on PyPI. For
|
||
comparison, ``requests`` has about 28k reverse dependencies.
|
||
|
||
|
||
Rationale
|
||
=========
|
||
|
||
This PEP proposes basing the standard library support for reading TOML on the
|
||
third party library ``tomli``
|
||
(`github.com/hukkin/tomli <https://github.com/hukkin/tomli>`_).
|
||
|
||
Many projects have recently switched to using ``tomli``, for example, ``pip``,
|
||
``build``, ``pytest``, ``mypy``, ``black``, ``flit``, ``coverage``,
|
||
``setuptools-scm``, ``cibuildwheel``.
|
||
|
||
``tomli`` is actively maintained and well-tested. ``tomli`` is about 800 lines
|
||
of code with 100% test coverage and passes all tests in a test suite `proposed
|
||
as the official TOML compliance test suite
|
||
<https://github.com/toml-lang/compliance/pull/8>`_, as well as `the more
|
||
established BurntSushi/toml-test suite
|
||
<https://github.com/BurntSushi/toml-test>`_.
|
||
|
||
|
||
Specification
|
||
=============
|
||
|
||
A new module ``tomllib`` with the following functions will be added:
|
||
|
||
.. code-block::
|
||
|
||
def load(fp: SupportsRead[bytes], /, *, parse_float: Callable[[str], Any] = ...) -> dict[str, Any]: ...
|
||
def loads(s: str, /, *, parse_float: Callable[[str], Any] = ...) -> dict[str, Any]: ...
|
||
|
||
``tomllib.load`` deserializes a binary file containing a
|
||
TOML document to a Python dict.
|
||
The ``fp`` argument must have a ``read()`` method with the same API as
|
||
``io.RawIOBase.read()``.
|
||
|
||
``tomllib.loads`` deserializes a str instance containing a TOML document
|
||
to a Python dict.
|
||
|
||
``parse_float`` is a function that takes a string representing a TOML float and
|
||
returns a Python object (similar to ``parse_float`` in ``json.load``). For
|
||
example, a function returning a ``decimal.Decimal`` in cases where precision is
|
||
important. By default, TOML floats are represented as ``float`` type.
|
||
|
||
The returned object contains only basic Python objects (``str``, ``int``,
|
||
``bool``, ``float``, ``datetime.{datetime,date,time}``, ``list``, ``dict`` with
|
||
string keys), and the results of ``parse_float``.
|
||
|
||
``tomllib.TOMLDecodeError`` is raised in the case of invalid TOML.
|
||
|
||
Note that this PEP does not propose ``tomllib.dump`` or ``tomllib.dumps``
|
||
functions, see `<Including an API for writing TOML_>`_ for details.
|
||
|
||
|
||
Maintenance Implications
|
||
========================
|
||
|
||
Stability of TOML
|
||
-----------------
|
||
|
||
The release of TOML v1 in January 2021 indicates stability. Empirically, TOML
|
||
has proven to be a stable format even prior to the release of TOML v1. From the
|
||
`changelog <https://github.com/toml-lang/toml/blob/master/CHANGELOG.md>`_, we
|
||
see TOML has had no major changes since April 2020 and has had two releases in
|
||
the last five years.
|
||
|
||
In the event of changes to the TOML specification, we could treat minor
|
||
revisions as bug fixes and update the implementation in place. In the event of
|
||
major breaking changes, we should preserve support for TOML v1.
|
||
|
||
Maintainability of proposed implementation
|
||
------------------------------------------
|
||
|
||
The proposed implementation (``tomli``) is in pure Python, well tested and
|
||
weighs under 1000 lines of code. It is minimalist, offering a smaller API
|
||
surface area than other TOML implementations.
|
||
|
||
The author of ``tomli`` is willing to help integrate ``tomli`` into the standard
|
||
library and help maintain it, `as per this post
|
||
<https://github.com/hukkin/tomli/issues/141#issuecomment-998018972>`__.
|
||
Petr Viktorin has indicated willingness to maintain a read API,
|
||
`as per this post
|
||
<https://discuss.python.org/t/adopting-recommending-a-toml-parser/4068/88>`__.
|
||
|
||
Rewriting the parser in C is not deemed necessary at this time. It's rare for
|
||
TOML parsing to be a bottleneck in applications. Users with higher performance
|
||
needs can use a third party library (as is already often the case with JSON,
|
||
despite a stdlib extension module).
|
||
|
||
TOML support a slippery slope for other things
|
||
----------------------------------------------
|
||
|
||
As discussed in motivations, TOML holds a special place in the Python ecosystem.
|
||
This chief reason to include TOML in the standard library does not apply to
|
||
other formats, such as YAML or MessagePack.
|
||
|
||
In addition, the simplicity of TOML can help serve as a dividing line, for
|
||
example, YAML is large and complicated.
|
||
|
||
Including an API for writing TOML may, however, be added in a future PEP.
|
||
|
||
|
||
Backwards Compatibility
|
||
=======================
|
||
|
||
This proposal has no backwards compatibility issues within the stdlib, as it
|
||
describes a new module.
|
||
Any existing third-party module named ``tomllib`` will break, as
|
||
``import tomllib`` will import standard library module.
|
||
However, ``tomllib`` is not registered on PyPI, so it is unlikely that such
|
||
a module is widely used.
|
||
|
||
Note that we avoid using the more straightforward name ``toml``, to avoid
|
||
backwards compatibility implications for users who have pinned versions of the
|
||
current ``toml`` PyPI package. For more details, see `<Alternative names for
|
||
module_>`_.
|
||
|
||
|
||
Security Implications
|
||
=====================
|
||
|
||
Errors in the implementation could cause potential security issues.
|
||
The parser's output is limited to simple data types; inability to load
|
||
arbitrary classes avoids security issues common in more "powerful" formats like
|
||
pickle and YAML. Also, the implementation will be in pure Python, which reduces
|
||
security issues endemic to C, such as buffer overflows.
|
||
|
||
|
||
How to Teach This
|
||
=================
|
||
|
||
The API of ``tomllib`` mimics that of other well-established file format
|
||
libraries, such as ``json`` and ``pickle``. The lack of a ``dump`` function will
|
||
be explained in the documentation, with a link to relevant third-party libraries
|
||
(``tomlkit``, ``tomli-w``, ``pytomlpp``).
|
||
|
||
|
||
Reference Implementation
|
||
========================
|
||
|
||
The proposed implementation can be found at https://github.com/hukkin/tomli
|
||
|
||
|
||
Rejected Ideas
|
||
==============
|
||
|
||
Basing on another TOML implementation
|
||
-------------------------------------
|
||
|
||
Potential alternatives include:
|
||
|
||
* ``tomlkit``.
|
||
``tomlkit`` is well established, actively maintained and supports TOML v1. An
|
||
important difference is that ``tomlkit`` supports style roundtripping. As a
|
||
result, it has a more complex API and implementation (about 5x as much code as
|
||
``tomli``). The author does not believe that ``tomlkit`` is a good choice for
|
||
the standard library.
|
||
|
||
* ``toml``.
|
||
``toml`` is a widely used library. However, it is not actively maintained,
|
||
does not support TOML v1 and has several known bugs. Its API is more complex
|
||
than that of ``tomli``. It has some very limited and mostly unused ability to
|
||
preserve style through an undocumented decoder API. It has the ability to
|
||
customise output style through a complicated encoder API. For more details on
|
||
API differences to this PEP, refer to `Appendix A`_.
|
||
|
||
* ``pytomlpp``.
|
||
``pytomlpp`` is a Python wrapper for the C++ project ``toml++``. Pure Python
|
||
libraries are easier to maintain than extension modules.
|
||
|
||
* ``rtoml``.
|
||
``rtoml`` is a Python wrapper for the Rust project ``toml-rs`` and hence has
|
||
similar shortcomings to ``pytomlpp``.
|
||
In addition, it does not support TOML v1.
|
||
|
||
* Writing from scratch.
|
||
It's unclear what we would get from this: ``tomli`` meets our needs and the
|
||
author is willing to help with its inclusion in the standard library.
|
||
|
||
Including an API for writing TOML
|
||
---------------------------------
|
||
|
||
There are several reasons to not include an API for writing TOML:
|
||
|
||
The ability to write TOML is not needed for the use cases that motivate this
|
||
PEP: for core Python packaging use cases or for tools that need to read
|
||
configuration.
|
||
|
||
Use cases that involve editing TOML (as opposed to writing brand new TOML) are
|
||
better served by a style preserving library. TOML is intended as human-readable
|
||
and human-editable configuration, so it's important to preserve human markup,
|
||
such as comments and formatting. This requires a parser whose output includes
|
||
style-related metadata, making it impractical to output plain Python types like
|
||
``str`` and ``dict``. Designing such an API is complicated.
|
||
|
||
But even without considering style preservation, there are too many degrees of
|
||
freedom in how to design a write API. For example, how much control to allow
|
||
users over output formatting, over serialization of custom types, and over input
|
||
and output validation. While there are reasonable choices on how to resolve
|
||
these, the nature of the standard library is such that one only gets one chance
|
||
to get things right.
|
||
|
||
Currently no CPython core developers have expressed willingness to maintain a
|
||
write API or sponsor a PEP that includes a write API. Since it is hard to change
|
||
or remove something in the standard library, it is safer to err on the side of
|
||
exclusion and potentially revisit later.
|
||
|
||
So, writing TOML is left to third-party libraries. If a good API and relevant
|
||
use cases for it are found later, it can be added in a future PEP.
|
||
|
||
|
||
Assorted API details
|
||
--------------------
|
||
|
||
Types accepted by the first argument of ``tomllib.load``
|
||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||
|
||
The ``toml`` library on PyPI allows passing paths (and lists of path-like
|
||
objects, ignoring missing files and merging the documents into a single object).
|
||
Doing this would be inconsistent with ``json.load``, ``pickle.load``, etc. If we
|
||
agree consistency with other stdlib modules is desirable, allowing paths is
|
||
somewhat out of scope for this PEP. This can easily and explicitly be worked
|
||
around in user code, or a third-party library.
|
||
|
||
The proposed API takes a binary file, while ``toml.load`` takes a text file and
|
||
``json.load`` takes either. Using a binary file allows us to a) ensure utf-8 is
|
||
the encoding used, b) avoid incorrectly parsing single carriage returns as valid
|
||
TOML due to universal newlines.
|
||
|
||
Type accepted by the first argument of ``tomllib.loads``
|
||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||
|
||
While ``tomllib.load`` takes a binary file, ``tomllib.loads`` takes
|
||
a text string. This may seem inconsistent at first.
|
||
|
||
Quoting TOML v1.0.0 specification:
|
||
|
||
> A TOML file must be a valid UTF-8 encoded Unicode document.
|
||
|
||
``tomllib.loads`` does not intend to load a TOML file, but rather the
|
||
document that the file stores. The most natural representation of
|
||
a Unicode document in Python is ``str``, not ``bytes``.
|
||
|
||
It is possible to add ``bytes`` support in the future if needed, but
|
||
we are not aware of any use cases for it.
|
||
|
||
Controlling the type of mappings returned by ``tomllib.load[s]``
|
||
----------------------------------------------------------------
|
||
|
||
The ``toml`` library on PyPI supports a ``_dict`` argument, which works
|
||
similarly to the ``object_hook`` argument in ``json.load[s]``. There are several
|
||
uses of ``_dict`` found on https://grep.app, however, almost all of them are
|
||
passing ``_dict=OrderedDict``, which should be unnecessary as of Python 3.7. We
|
||
found two instances of legitimate use: in one case, a custom class was passed
|
||
for friendlier KeyErrors, in another case, the custom class had several
|
||
additional lookup and mutation methods (e.g. to help resolve dotted keys).
|
||
|
||
Such an argument is not necessary for the core use cases outlined in the
|
||
motivation section. The absence of this can be pretty easily worked around using
|
||
a wrapper class, transformer function, or a third-party library. Finally,
|
||
support could be added later in a backward compatible way.
|
||
|
||
|
||
Removing support for ``parse_float`` in ``tomllib.load[s]``
|
||
-----------------------------------------------------------
|
||
|
||
This option is not strictly necessary, since TOML floats are "IEEE 754 binary64
|
||
values", which is ``float`` on most architectures. Using ``decimal.Decimal``
|
||
thus allows users extra precision not promised by the TOML format. However, in
|
||
the author of ``tomli``'s experience, this is useful in scientific and financial
|
||
applications. TOML-facing users may include non-developers who are not aware of
|
||
the limits of double-precision float.
|
||
|
||
There are also niche architectures where the Python ``float`` is not a IEEE-754
|
||
binary64. The ``parse_float`` argument allows users to achieve correct TOML
|
||
semantics even on such architectures.
|
||
|
||
|
||
Alternative names for module
|
||
----------------------------
|
||
|
||
Ideally, we would be able to use the ``toml`` module name.
|
||
|
||
However, the ``toml`` package on PyPI is widely used, so there are backward
|
||
compatibility concerns. Since the standard library takes precedence over third
|
||
party packages, users who have pinned versions of ``toml`` would be broken when
|
||
upgrading Python versions by any API incompatibilities.
|
||
|
||
To further clarify, the user pins are the specific concern here. Even if we were
|
||
able to get control over the ``toml`` PyPI package and repurpose it as a
|
||
standard library backport, we would still break users who have pinned to
|
||
versions of the current ``toml`` package. This is unfortunate, since pinning
|
||
would likely be a common response to breaking changes introduced by repurposing
|
||
the ``toml`` package as a backport (that is incompatible with today's ``toml``).
|
||
|
||
There are several API incompatibilities between ``toml`` and the API proposed in
|
||
this PEP, listed in `Appendix A`_.
|
||
|
||
Finally, the ``toml`` package on PyPI is not actively maintained and `we have
|
||
been unable to contact the author <https://github.com/uiri/toml/issues/361>`__,
|
||
so action here would likely have to be taken without the author's consent.
|
||
|
||
This PEP proposes ``tomllib``. This mirrors ``plistlib`` (another file format
|
||
module in the standard library), as well as several others such as ``pathlib``,
|
||
``graphlib``, etc.
|
||
|
||
Other considered names include:
|
||
|
||
* ``tomlparser``. This mirrors ``configparser``, but is perhaps slightly less
|
||
appropriate if we include a write API in the future.
|
||
* ``tomli``. This assumes we use ``tomli`` as the basis for implementation.
|
||
* ``toml`` under some namespace, such as ``parser.toml``. However, this is
|
||
awkward, especially so since existing libraries like ``json``, ``pickle``,
|
||
``marshal``, ``html`` etc. would not be included in the namespace.
|
||
|
||
|
||
TODO: Random things
|
||
===================
|
||
|
||
Previous discussion:
|
||
|
||
* https://bugs.python.org/issue40059
|
||
* https://mail.python.org/archives/list/python-ideas@python.org/thread/IWJ3I32A4TY6CIVQ6ONPEBPWP4TOV2V7/
|
||
* https://mail.python.org/pipermail/python-dev/2019-May/157405.html
|
||
* https://github.com/hukkin/tomli/issues/141
|
||
* https://discuss.python.org/t/adopting-recommending-a-toml-parser/4068/84
|
||
|
||
Useful https://grep.app searches (note, ignore vendored):
|
||
|
||
* toml.load[s] usage https://grep.app/search?q=toml.load&filter[lang][0]=Python
|
||
* toml.dump[s] usage https://grep.app/search?q=toml.dump&filter[lang][0]=Python
|
||
* TomlEncoder subclasses https://grep.app/search?q=TomlEncoder%29%3A&filter[lang][0]=Python
|
||
|
||
|
||
.. _Appendix A:
|
||
|
||
Appendix A: Differences between proposed API and ``toml``
|
||
=========================================================
|
||
|
||
This appendix covers the differences between the API proposed in this PEP and
|
||
that of the third party package ``toml``. These differences are relevant to
|
||
understanding the amount of breakage we could expect if we used the ``toml``
|
||
name for the standard library module, as well as to better understand the design
|
||
space. Note that this list might not be exhaustive.
|
||
|
||
#. This PEP currently proposes not to include a write API. That is, there will
|
||
be no equivalent of ``toml.dump`` or ``toml.dumps``.
|
||
|
||
Discussed at `<Including an API for writing TOML_>`_.
|
||
|
||
If we included a write API, it would be relatively simple to convert most
|
||
code that uses ``toml`` to use the API proposed in this PEP (acknowledging
|
||
that that is very different from a compatible API).
|
||
|
||
A significant fraction of ``toml`` users rely on this.
|
||
|
||
#. Different first argument of ``toml.load``
|
||
|
||
``toml.load`` has the following signature:
|
||
|
||
.. code-block::
|
||
|
||
def load(
|
||
f: Union[SupportsRead[str], str, bytes, list[PathLike | str | bytes]],
|
||
_dict: Type[MutableMapping[str, Any]] = ...,
|
||
decoder: TomlDecoder = ...,
|
||
) -> MutableMapping[str, Any]: ...
|
||
|
||
This is pretty different from the first argument proposed in this PEP: ``SupportsRead[bytes]``.
|
||
|
||
Recapping the reasons for this, previously mentioned at
|
||
`<Types accepted by the first argument of tomllib.load_>`_:
|
||
|
||
* Allowing passing of paths (and lists of path-like objects, ignoring missing
|
||
files and merging the documents into a single object) is inconsistent with
|
||
other similar functions in the standard library.
|
||
* Using ``SupportsRead[bytes]`` allows us to a) ensure utf-8 is the encoding used,
|
||
b) avoid incorrectly parsing single carriage returns as valid TOML due to
|
||
universal newlines. TOML specifies file encoding and valid newline
|
||
sequences, and hence is simply stricter format than what text file objects
|
||
represent.
|
||
|
||
A significant fraction of ``toml`` users rely on this.
|
||
|
||
#. Errors
|
||
|
||
``toml`` raises ``TomlDecodeError`` vs the proposed PEP 8 compliant
|
||
``TOMLDecodeError``.
|
||
|
||
A significant fraction of ``toml`` users rely on this.
|
||
|
||
#. ``toml.load[s]`` accepts a ``_dict`` argument
|
||
|
||
Discussed at `<Controlling the type of mappings returned by tomllib.load[s]_>`_.
|
||
|
||
As discussed, almost all usage consists of ``_dict=OrderedDict``, which is
|
||
not necessary in Python 3.7 and later.
|
||
|
||
#. ``toml.load[s]`` support an undocumented ``decoder`` argument
|
||
|
||
It seems the intended use case is for an implementation of comment
|
||
preservation. The information recorded is not sufficient to roundtrip the
|
||
TOML document preserving style, the implementation has known bugs, the
|
||
feature is undocumented and I could only find one instance of its use on
|
||
https://grep.app.
|
||
|
||
The `toml.TomlDecoder interface <https://github.com/uiri/toml/blob/3f637dba5f68db63d4b30967fedda51c82459471/toml/decoder.pyi#L36>`__
|
||
exposed is not simple, containing nine methods.
|
||
|
||
Users are probably better served by a more complete implementation of style
|
||
preserving parsing and writing.
|
||
|
||
#. ``toml.dump[s]`` support an ``encoder`` argument
|
||
|
||
Note that we currently propose not to include a write API, however if that
|
||
were to change, these differences would likely become relevant.
|
||
|
||
This enables two use cases, a) control over how custom types should be
|
||
serialized, b) control over how output should be formatted.
|
||
|
||
The first use case is reasonable, however, I could only find two instances of
|
||
this on https://grep.app. One of these two instances used this ability to add
|
||
support for dumping ``decimal.Decimal`` (which a potential standard library
|
||
implementation would support out of the box).
|
||
|
||
If needed, this use case could be well served by the equivalent of the
|
||
``default`` argument in ``json.dump``.
|
||
|
||
The second use case is enabled by allowing users to specify subclasses of
|
||
`toml.TomlEncoder <https://github.com/uiri/toml/blob/3f637dba5f68db63d4b30967fedda51c82459471/toml/encoder.pyi#L9>`__
|
||
and overriding methods to specify parts of the TOML writing process. The API
|
||
consists of five methods and exposes a lot of implementation detail.
|
||
|
||
There is some usage of the ``encoder`` API on https://grep.app, however, it
|
||
likely accounts for a tiny fraction of overall usage of ``toml``.
|
||
|
||
#. Timezones
|
||
|
||
``toml`` uses and exposes custom ``toml.tz.TomlTz`` timezone objects. The
|
||
proposed implementation uses ``datetime.timezone`` objects from the standard
|
||
library.
|
||
|
||
|
||
Copyright
|
||
=========
|
||
|
||
This document is placed in the public domain or under the
|
||
CC0-1.0-Universal license, whichever is more permissive.
|
||
|
||
|
||
|
||
..
|
||
Local Variables:
|
||
mode: indented-text
|
||
indent-tabs-mode: nil
|
||
sentence-end-double-space: t
|
||
fill-column: 70
|
||
coding: utf-8
|
||
End:
|