PEP: 706
Title: Filter for tarfile.extractall
Author: Petr Viktorin <encukou@gmail.com>
Discussions-To: https://discuss.python.org/t/23903
Status: Final
Type: Standards Track
Content-Type: text/x-rst
Created: 09-Feb-2023
Python-Version: 3.12
Post-History: `25-Jan-2023 <https://discuss.python.org/t/23149>`__,
`15-Feb-2023 <https://discuss.python.org/t/23903>`__,
Resolution: https://discuss.python.org/t/23903/10
.. canonical-doc:: :ref:`tarfile documentation <python:tarfile-extraction-filter>`
Abstract
========
The extraction methods in :external+py3.11:mod:`tarfile` gain a ``filter`` argument,
which allows rejecting files or modifying metadata as the archive is extracted.
Three built-in named filters are provided, aimed at limiting features that
might be surprising or dangerous.
These can be used as-is, or serve as a base for custom filters.
After a deprecation period, a strict (but safer) filter will become the default.
Motivation
==========
The ``tar`` format is used for several use cases, many of which have different
needs. For example:
- A backup of a UNIX workstation should faithfully preserve all kinds of
details like file permissions, symlinks to system configuration, and various
kinds of special files.
- When unpacking a data bundle, it's much more important that the unpacking
will not have unintended consequences like exposing a password file by
symlinking it to a public place.
To support all its use cases, the ``tar`` format has many features.
In many cases, it's best to ignore or disallow some of them when extracting
an archive.
Python allows extracting ``tar`` archives using
:external+py3.11:meth:`tarfile.TarFile.extractall`, whose docs warn to
*never extract archives from untrusted sources without prior inspection*.
However, it's not clear what kind of inspection should be done.
Indeed, it's quite tricky to do such an inspection correctly.
As a result, many people don't bother, or do the check incorrectly, resulting in
security issues such as `CVE-2007-4559`_.
Since :external+py3.11:mod:`tarfile` was first written, it's become more
accepted that warnings in documentation are not enough.
Whenever possible, an unsafe operation should be *explicitly requested*;
potentially dangerous operations should *look* dangerous.
However, ``TarFile.extractall`` looks benign in a code review.
Tarfile extraction is also exposed via :external+py3.11:func:`shutil.unpack_archive`,
which allows the user to not care about the kind of archive they're
dealing with.
The API is very inviting for extracting archives without prior inspection,
even though the docs again warn against it.
It has been argued that Python is not wrong -- it behaves exactly as
documented -- but that's beside the point.
Let's improve the situation rather than assign/avoid blame.
Python and its docs are the best place to improve things.
Rationale
=========
How do we improve things?
Unfortunately, we will need to change the defaults, which implies
breaking backwards compatibility. :external+py3.11:meth:`TarFile.extractall <tarfile.TarFile.extractall>`
is what people reach for when they need to extract a tarball.
Its default behaviour needs to change.
What would be the best behaviour? That depends on the use case.
So, we'll add several general “policies” to control extraction.
They are based on *use cases*, and ideally they should have straightforward
security implications:
- Current behavior: trusting the archive. Suitable e.g. as a building block
for libraries that do the check themselves, or extracting an archive you just
made yourself.
- Unpacking a UNIX archive: roughly following GNU ``tar``, e.g. stripping
leading ``/`` from filenames.
- Unpacking a general data archive: the :external+py3.11:func:`shutil.unpack_archive`
use case,
where it's not important to preserve details specific to ``tar`` or
Unix-like filesystems.
After a deprecation period, the last option -- the most limited
but most secure one -- will become the default.
Even with better general defaults, users should still verify the archives
they extract, and perhaps modify some of the metadata.
Superficially, the following looks like a reasonable way to do this today:
* Call :external+py3.11:meth:`TarFile.getmembers <tarfile.TarFile.getmembers>`
* Verify or modify each member's :external+py3.11:class:`~tarfile.TarInfo`
* Pass the result to ``extractall``'s ``members``
However, there are some issues with this approach:
- It's possible to modify ``TarInfo`` objects, but the changes to them
affect all subsequent operations on the same ``TarFile`` object.
This behavior is fine for most uses, but despite that, it would be very
surprising if ``TarFile.extractall`` did this by default.
- Calling ``getmembers`` can be expensive and it
`requires a seekable archive <https://github.com/python/cpython/issues/45385#issuecomment-1255615199>`__.
- When verifying members in advance, it may be necessary to track how each
member would have changed the filesystem, e.g. how symlinks are being set up.
This is hard. We can't expect users to do it.
To solve these issues we'll:
- Provide a supported way to “clone” and modify ``TarInfo`` objects.
A ``replace`` method, similar to :external+py3.11:func:`dataclasses.replace`
or :external+py3.11:meth:`namedtuple._replace <collections.somenamedtuple._replace>`
should do the trick.
- Provide a “filter” hook in ``extractall``'s loop that can modify or discard
members before they are processed.
- Require that this hook is called just before extracting each member,
so it can scan the *current* state of the disk. This will greatly simplify
the implementation of policies (both in stdlib and user code),
at the cost of not being able to do a precise “dry run”.
The hook API will be very similar to the existing ``filter`` argument
for :external+py3.11:meth:`TarFile.add <tarfile.TarFile.add>`.
We'll also name it ``filter``.
(In some cases “policy” would be a more fitting name,
but the API can be used for more than security policies.)
The built-in policies/filters described above will be implemented using the
public filter API, so they can be used as building blocks or examples.
Setting a precedent
-------------------
If and when other libraries for archive extraction, such as :external+py3.11:mod:`zipfile`,
gain similar functionality, they should mimic this API as much as it's
reasonable.
To enable this for simple cases, the built-in filters will have string names;
e.g. users can pass ``filter='data'`` instead of a specific function that deals
with :external+py3.11:class:`~tarfile.TarInfo` objects.
The :external+py3.11:func:`shutil.unpack_archive` function will get a
``filter`` argument, which it will pass to ``extractall``.
Adding a function-based API that would work across archive formats is
out of scope of this PEP.
Full disclosure & redistributor info
------------------------------------
The PEP author works for Red Hat, a redistributor of Python with different
security needs and support periods than CPython in general.
Such redistributors may want to carry vendor patches to:
* Allow configuring the defaults system-wide, and
* Change the default as soon as possible, even in older Python versions.
The proposal makes this easy to do, and it allows users to query
the settings.
Specification
=============
Modifying and forgetting member metadata
----------------------------------------
The :external+py3.11:class:`~tarfile.TarInfo` class will gain a new method,
``replace()``, which will work similarly to ``dataclasses.replace``.
It will return a copy of the ``TarInfo`` object with attributes
replaced as specified by keyword-only arguments:
* ``name``
* ``mtime``
* ``mode``
* ``linkname``
* ``uid``
* ``gid``
* ``uname``
* ``gname``
Any of these, except ``name`` and ``linkname``, will be allowed to be set
to ``None``.
When ``extract`` or ``extractall`` encounters such a ``None``, it will not
set that piece of metadata.
(If ``uname`` or ``gname`` is ``None``, it will fall back to ``uid`` or ``gid``
as if the name wasn't found.)
When ``addfile`` or ``tobuf`` encounters such a ``None``, it will raise a
``ValueError``.
When ``list`` encounters such a ``None``, it will print a placeholder string.
The documentation will mention why the method is there:
``TarInfo`` objects retrieved from :external+py3.11:meth:`TarFile.getmembers <tarfile.TarFile.getmembers>`
are “live”; modifying them directly will affect subsequent unrelated
operations.
Filters
-------
:external+py3.11:meth:`TarFile.extract <tarfile.TarFile.extract>` and
:external+py3.11:meth:`TarFile.extractall <tarfile.TarFile.extractall>` methods
will grow a ``filter`` keyword-only parameter,
which takes a callable that can be called as::

    filter(member: TarInfo, path: str, /) -> TarInfo | None

where ``member`` is the member to be extracted, and ``path`` is the path to
where the archive is extracted (i.e., it'll be the same for every member).
When used it will be called on each member as it is extracted,
and extraction will work with the result.
If it returns ``None``, the member will be skipped.
The function can also raise an exception.
This can, depending on ``TarFile.errorlevel``,
abort the extraction or cause the member to be skipped.
.. note::

   If extraction is aborted, the archive may be left partially
   extracted. It is the user's responsibility to clean up.
We will also provide a set of defaults for common use cases.
In addition to a function, the ``filter`` argument can be one
of the following strings:
* ``'fully_trusted'``: Current behavior: honor the metadata as is.
  Should be used if the user trusts the archive completely, or implements their
  own complex verification.

* ``'tar'``: Roughly follow defaults of the GNU ``tar`` command
  (when run as a normal user):

  * Strip leading ``'/'`` and ``os.sep`` from filenames.
  * Refuse to extract files with absolute paths (after the ``/`` stripping
    above, e.g. ``C:/foo`` on Windows).
  * Refuse to extract files whose absolute path (after following symlinks)
    would end up outside the destination.
    (Note that GNU ``tar`` instead delays creating some links.)
  * Clear high mode bits (setuid, setgid, sticky) and group/other write bits
    (:external+py3.11:data:`S_IWGRP|S_IWOTH <stat.S_IWGRP>`).
    (This is an approximation of GNU ``tar``'s default, which limits the mode
    by the current ``umask`` setting.)

* ``'data'``: Extract a "data" archive, disallowing common attack vectors
  but limiting functionality.
  In particular, many features specific to UNIX-style filesystems (or
  equivalently, to the ``tar`` archive format) are ignored, making this a good
  filter for cross-platform archives.
  In addition to ``tar``:

  * Refuse to extract links (hard or soft) that link to absolute paths.
  * Refuse to extract links (hard or soft) which end up linking to a path
    outside of the destination.
    (On systems that don't support links, ``tarfile`` will, in most cases,
    fall back to creating regular files.
    This proposal doesn't change that behaviour.)
  * Refuse to extract device files (including pipes).
  * For regular files and hard links:

    * Set the owner read and write permissions (:external+py3.11:data:`S_IRUSR|S_IWUSR <stat.S_IRUSR>`).
    * Remove the group & other *executable* permission (:external+py3.11:data:`S_IXGRP|S_IXOTH <stat.S_IXGRP>`)
      if the owner doesn't have it (:external+py3.11:data:`~stat.S_IXUSR`).

  * For other files (directories), ignore mode entirely (set it to ``None``).
  * Ignore user and group info (set ``uid``, ``gid``, ``uname``, ``gname``
    to ``None``).

Any other string will cause a ``ValueError``.
The corresponding filter functions will be available as
``tarfile.fully_trusted_filter()``, ``tarfile.tar_filter()``, etc., so
they can be easily used in custom policies.
Note that these filters never return ``None``.
Skipping members this way is a feature for user-defined filters.
Defaults and their configuration
--------------------------------
:external+py3.11:class:`~tarfile.TarFile` will gain a new attribute,
``extraction_filter``, to allow configuring the default filter.
By default it will be ``None``, but users can set it to a callable
that will be used if the ``filter`` argument is missing or ``None``.
.. note::

   String names won't be accepted here. That would encourage code like
   ``my_tarfile.extraction_filter = 'data'``.
   On Python versions without this feature, this would do nothing,
   silently ignoring a security-related request.

If both the argument and attribute are ``None``:
* In Python 3.12-3.13, a ``DeprecationWarning`` will be emitted and
extraction will use the ``'fully_trusted'`` filter.
* In Python 3.14+, it will use the ``'data'`` filter.
Applications and system integrators may wish to change ``extraction_filter``
of the ``TarFile`` class itself to set a global default.
When using a function, they will generally want to wrap it in ``staticmethod()``
to prevent injection of a ``self`` argument.
Subclasses of ``TarFile`` can also override ``extraction_filter``.
FilterError
-----------
A new exception, ``FilterError``, will be added to the :external+py3.11:mod:`tarfile`
module.
It'll have several new subclasses, one for each of the refusal reasons above.
``FilterError``'s ``tarinfo`` attribute will contain the relevant ``TarInfo``.
In the lists above, “refusing” to extract a file means that a ``FilterError``
will be raised.
As with other extraction errors, if the ``TarFile.errorlevel``
is 1 or more, this will abort the extraction; with ``errorlevel=0`` the error
will be logged and the member will be ignored, but extraction will continue.
Note that ``extractall()`` may leave the archive partially extracted;
it is the user's responsibility to clean up.
Errorlevel, and fatal/non-fatal errors
--------------------------------------
Currently, :external+py3.11:class:`~tarfile.TarFile` has an *errorlevel*
argument/attribute, which specifies how errors are handled:
- With ``errorlevel=0``, documentation says that “all errors are ignored
  when using :external+py3.11:meth:`~tarfile.TarFile.extract` and
  :external+py3.11:meth:`~tarfile.TarFile.extractall`”.
  The code only ignores *non-fatal* and *fatal* errors (see below),
  so, for example, you still get ``TypeError`` if you pass ``None`` as the
  destination path.

- With ``errorlevel=1`` (the default), all *non-fatal* errors are ignored.
  (They may be logged to ``sys.stderr`` by setting the *debug*
  argument/attribute.)

  Which errors are *non-fatal* is not defined in documentation, but code treats
  ``ExtractError`` as such. Specifically, it's these issues:

  - “unable to resolve link inside archive” (raised on systems that do not
    support symlinks)
  - “fifo/special devices not supported by system” (not used for failures if
    the system supports these, e.g. for a ``PermissionError``)
  - “could not change owner/mode/modification time”

  Note that, for example, *file name too long* or *out of disk space* don't
  qualify.
  The *non-fatal* errors are not very likely to appear on a Unix-like system.

- With ``errorlevel=2``, all errors are raised, including *fatal* ones.
  Which errors are *fatal* is, again, not defined; in practice it's
  ``OSError``.

A filter refusing to extract a member does not fit neatly into the
*fatal*/*non-fatal* categories.
- This PEP does not change existing behavior. (Ideas for improvements are
welcome in `Discourse topic 25970 <https://discuss.python.org/t/25970>`_.)
- When a filter refuses to extract a member, the error should not pass
silently by default.
To satisfy this, ``FilterError`` will be considered a *fatal* error, that is,
it'll be ignored only with ``errorlevel=0``.
Users that want to ignore ``FilterError`` but not other *fatal* errors should
create a custom filter function, and call another filter in a ``try`` block.
Hints for further verification
------------------------------
Even with the proposed changes, :external+py3.11:mod:`tarfile` will not be
suited for extracting untrusted files without prior inspection.
Among other issues, the proposed policies don't prevent denial-of-service
attacks.
Users should do additional checks.
New docs will tell users to consider:
* extracting to a new empty directory,
* using external (e.g. OS-level) limits on disk, memory and CPU usage,
* checking filenames against an allow-list of characters (to filter out control
characters, confusables, etc.),
* checking that filenames have expected extensions (discouraging files that
execute when you “click on them”, or extension-less files like Windows
special device names),
* limiting the number of extracted files, total size of extracted data,
and size of individual files,
* checking for files that would be shadowed on case-insensitive filesystems.
Also, the docs will note that:
* tar files commonly contain multiple versions of the same file: later ones are
expected to overwrite earlier ones on extraction,
* ``tarfile`` does not protect against issues with “live” data, e.g. an attacker
tinkering with the destination directory while extracting (or adding) is
going on (see the `GNU tar manual <https://www.gnu.org/software/tar/manual/html_node/Live-untrusted-data.html#Live-untrusted-data>`__
for more info).
This list is not comprehensive, but the documentation is a good place to
collect such general tips.
It can be moved into a separate document if it grows too long or if it needs to
be consolidated with :external+py3.11:mod:`zipfile` or :external+py3.11:mod:`shutil`
(which is out of scope for this proposal).
.. _706-offset:
TarInfo identity, and ``offset``
--------------------------------
With filters that use ``replace()``, the ``TarInfo`` objects handled
by the extraction machinery will not necessarily be the same objects
as those present in ``members``.
This may affect ``TarInfo`` subclasses that override methods like
``makelink`` and rely on object identity.
Such code can switch to comparing ``offset``, the position of the member
header inside the file.
Note that both the overridable methods and ``offset`` are only
documented in source comments.
tarfile CLI
-----------
The CLI (``python -m tarfile``) will gain a ``--filter`` option
that will take the name of one of the provided default filters.
It won't be possible to specify a custom filter function.
If ``--filter`` is not given, the CLI will use the default filter
(``'fully_trusted'`` with a deprecation warning now, and ``'data'`` from
Python 3.14 on).
There will be no short option. (``-f`` would be confusingly similar to
the filename option of GNU ``tar``.)
Other archive libraries
-----------------------
If and when other archive libraries, such as :external+py3.11:mod:`zipfile`,
grow similar functionality, their extraction functions should use a ``filter``
argument that takes, at least, the strings ``'fully_trusted'`` (which should
disable any security precautions) and ``'data'`` (which should avoid features
that might surprise users).
Standardizing a function-based filter API is out of scope of this PEP.
Shutil
------
:external+py3.11:func:`shutil.unpack_archive` will gain a ``filter`` argument.
If it's given, it will be passed to the underlying extraction function.
Passing it for a ``zip`` archive will fail for now (until :external+py3.11:mod:`zipfile`
gains a ``filter`` argument, if it ever does).
If ``filter`` is not specified (or left as ``None``), it won't be passed
on, so extracting a tarball will use the default filter
(``'fully_trusted'`` with a deprecation warning now, and ``'data'`` from
Python 3.14 on).
Complex filters
---------------
Note that some user-defined filters need, for example,
to count extracted members or do post-processing.
This requires a more complex API than a ``filter`` callable.
However, that complex API need not be exposed to ``tarfile``.
For example, with a hypothetical ``StatefulFilter`` users would write::

    with StatefulFilter() as filter_func:
        my_tar.extractall(path, filter=filter_func)

A simple ``StatefulFilter`` example will be added to the docs.
.. note::

   The need for stateful filters is a reason against allowing
   registration of custom filter names in addition to ``'fully_trusted'``,
   ``'tar'`` and ``'data'``.
   With such a mechanism, API for (at least) set-up and tear-down would need
   to be set in stone.

Backwards Compatibility
=======================
The default behavior of :external+py3.11:meth:`TarFile.extract <tarfile.TarFile.extract>`
and :external+py3.11:meth:`TarFile.extractall <tarfile.TarFile.extractall>`
will change, after raising ``DeprecationWarning`` for 2 releases
(shortest deprecation period allowed in Python's
:pep:`backwards compatibility policy <387>`).
Additionally, code that relies on :external+py3.11:class:`tarfile.TarInfo`
object identity may break, see :ref:`706-offset`.
Backporting & Forward Compatibility
===================================
This feature may be backported to older versions of Python.
In CPython, we don't add warnings to patch releases, so the default
filter should be changed to ``'fully_trusted'`` in backports.
Other than that, *all* of the changes to ``tarfile`` should be backported, so
``hasattr(tarfile, 'data_filter')`` becomes a reliable check for all
of the new functionality.
Note that CPython's usual policy is to avoid adding new APIs in security
backports.
This feature does not make sense without a new API
(``TarFile.extraction_filter`` and the ``filter`` argument),
so we'll make an exception.
(See `Discourse comment 23149/16 <https://discuss.python.org/t/23149/16>`__
for details.)
Here are examples of code that takes into account that ``tarfile`` may or may
not have the proposed feature.
When copying these snippets, note that setting ``extraction_filter``
will affect subsequent operations.
* Fully trusted archive::

    my_tarfile.extraction_filter = (lambda member, path: member)
    my_tarfile.extractall()

* Use the ``'data'`` filter if available, but revert to Python 3.11 behavior
  (``'fully_trusted'``) if this feature is not available::

    my_tarfile.extraction_filter = getattr(tarfile, 'data_filter',
                                           (lambda member, path: member))
    my_tarfile.extractall()

  (This is an unsafe operation, so it should be spelled out explicitly,
  ideally with a comment.)

* Use the ``'data'`` filter; *fail* if it is not available::

    my_tarfile.extractall(filter=tarfile.data_filter)

  or::

    my_tarfile.extraction_filter = tarfile.data_filter
    my_tarfile.extractall()

* Use the ``'data'`` filter; *warn* if it is not available::

    if hasattr(tarfile, 'data_filter'):
        my_tarfile.extractall(filter='data')
    else:
        # remove this when no longer needed
        warn_the_user('Extracting may be unsafe; consider updating Python')
        my_tarfile.extractall()

Security Implications
=====================
This proposal improves security, at the expense of backwards compatibility.
In particular, it will help users avoid `CVE-2007-4559`_.
How to Teach This
=================
The API, usage notes and tips for further verification will be added to
the documentation.
These should be usable for users who are familiar with archives in general, but
not with the specifics of UNIX filesystems nor the related security issues.
Reference Implementation
========================
See `pull request #102953 <https://github.com/python/cpython/pull/102953>`_ on GitHub.
Rejected Ideas
==============
SafeTarFile
-----------
An initial idea from Lars Gustäbel was to provide a separate class that
implements security checks (see `gh-65308`_).
There are two major issues with this approach:
* The name is misleading. General archive operations can never be made “safe”
from all kinds of unwanted behavior, without impacting legitimate use cases.
* It does not solve the problem of unsafe defaults.
However, many of the ideas behind SafeTarFile were reused in this PEP.
Add absolute_path option to tarfile
-----------------------------------
Issue `gh-73974`_ asks for adding an ``absolute_path`` option to extraction
methods. This would be a minimal change to formally resolve `CVE-2007-4559`_.
It doesn't go far enough to protect the unaware, nor to empower the diligent
and curious.
Other names for the ``'tar'`` filter
------------------------------------
The ``'tar'`` filter exposes features specific to UNIX-like filesystems,
so it could be named ``'unix'``.
Or ``'unix-like'``, ``'nix'``, ``'*nix'``, ``'posix'``?
Feature-wise, *tar format* and *UNIX-like filesystem* are essentially
equivalent, so ``tar`` is a good name.
Possible Further Work
=====================
Adding filters to zipfile and shutil.unpack_archive
---------------------------------------------------
For consistency, :external+py3.11:mod:`zipfile` and
:external+py3.11:func:`shutil.unpack_archive` could gain support
for a ``filter`` argument.
However, this would require research that this PEP's author can't promise
for Python 3.12.
Filters for ``zipfile`` would probably not help security.
Zip is used primarily for cross-platform data bundles, and correspondingly,
:external+py3.11:meth:`ZipFile.extract <zipfile.ZipFile.extract>`'s defaults
are already similar to what a ``'data'`` filter would do.
A ``'fully_trusted'`` filter, which would *newly allow* absolute paths and
``..`` path components, might not be useful for much except
a unified ``unpack_archive`` API.
Filters should be useful for use cases other than security, but those
would usually need custom filter functions, and those would need an API that works
with both :external+py3.11:class:`~tarfile.TarInfo` and
:external+py3.11:class:`~zipfile.ZipInfo`.
That is *definitely* out of scope of this PEP.
If only this PEP is implemented and nothing changes for ``zipfile``,
the effect for callers of ``unpack_archive`` is that the default
for *tar* files is changing from ``'fully_trusted'`` to
the more appropriate ``'data'``.
In the interim period, Python 3.12-3.13 will emit ``DeprecationWarning``.
That's annoying, but there are several ways to handle it: e.g. add a
``filter`` argument conditionally, set ``TarFile.extraction_filter``
globally, or ignore/suppress the warning until Python 3.14.
Also, since many calls to ``unpack_archive`` are likely to be unsafe,
there's hope that the ``DeprecationWarning`` will often turn out to be
a helpful hint to review affected code.
Thanks
======
This proposal is based on prior work and discussions by many people,
in particular Lars Gustäbel, Gregory P. Smith, Larry Hastings, Joachim Wagner,
Jan Matejek, Jakub Wilk, Daniel Garcia, Lumír Balhar, Miro Hrončok,
and many others.
References
==========
.. _CVE-2007-4559: https://nvd.nist.gov/vuln/detail/CVE-2007-4559
.. _gh-65308: https://github.com/python/cpython/issues/65308
.. _gh-73974: https://github.com/python/cpython/issues/73974
Copyright
=========
This document is placed in the public domain or under the
CC0-1.0-Universal license, whichever is more permissive.