2023-02-15 07:32:31 -05:00
|
|
|
|
PEP: 706
|
|
|
|
|
Title: Filter for tarfile.extractall
|
|
|
|
|
Author: Petr Viktorin <encukou@gmail.com>
|
2023-02-15 10:43:24 -05:00
|
|
|
|
Discussions-To: https://discuss.python.org/t/23903
|
2023-05-10 06:54:39 -04:00
|
|
|
|
Status: Final
|
2023-02-15 07:32:31 -05:00
|
|
|
|
Type: Standards Track
|
|
|
|
|
Content-Type: text/x-rst
|
|
|
|
|
Created: 09-Feb-2023
|
|
|
|
|
Python-Version: 3.12
|
|
|
|
|
Post-History: `25-Jan-2023 <https://discuss.python.org/t/23149>`__,
|
2023-02-15 10:43:24 -05:00
|
|
|
|
`15-Feb-2023 <https://discuss.python.org/t/23903>`__,
|
2023-03-27 07:34:08 -04:00
|
|
|
|
Resolution: https://discuss.python.org/t/23903/10
|
2023-02-15 07:32:31 -05:00
|
|
|
|
|
2023-05-10 06:54:39 -04:00
|
|
|
|
.. canonical-doc:: :ref:`tarfile documentation <python:tarfile-extraction-filter>`
|
|
|
|
|
|
2023-02-15 07:32:31 -05:00
|
|
|
|
|
|
|
|
|
Abstract
|
|
|
|
|
========
|
|
|
|
|
|
|
|
|
|
The extraction methods in :external+py3.11:mod:`tarfile` gain a ``filter`` argument,
|
|
|
|
|
which allows rejecting files or modifying metadata as the archive is extracted.
|
|
|
|
|
Three built-in named filters are provided, aimed at limiting features that
|
|
|
|
|
might be surprising or dangerous.
|
|
|
|
|
These can be used as-is, or serve as a base for custom filters.
|
|
|
|
|
|
|
|
|
|
After a deprecation period, a strict (but safer) filter will become the default.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Motivation
|
|
|
|
|
==========
|
|
|
|
|
|
|
|
|
|
The ``tar`` format is used for several use cases, many of which have different
|
|
|
|
|
needs. For example:
|
|
|
|
|
|
|
|
|
|
- A backup of a UNIX workstation should faithfully preserve all kinds of
|
|
|
|
|
details like file permissions, symlinks to system configuration, and various
|
|
|
|
|
kinds of special files.
|
|
|
|
|
- When unpacking a data bundle, it’s much more important that the unpacking
|
|
|
|
|
will not have unintended consequences – like exposing a password file by
|
|
|
|
|
symlinking it to a public place.
|
|
|
|
|
|
|
|
|
|
To support all its use cases, the ``tar`` format has many features.
|
|
|
|
|
In many cases, it's best to ignore or disallow some of them when extracting
|
|
|
|
|
an archive.
|
|
|
|
|
|
|
|
|
|
Python allows extracting ``tar`` archives using
|
|
|
|
|
:external+py3.11:meth:`tarfile.TarFile.extractall`, whose docs warn to
|
|
|
|
|
*never extract archives from untrusted sources without prior inspection*.
|
|
|
|
|
However, it’s not clear what kind of inspection should be done.
|
|
|
|
|
Indeed, it’s quite tricky to do such an inspection correctly.
|
|
|
|
|
As a result, many people don’t bother, or do the check incorrectly, resulting in
|
|
|
|
|
security issues such as `CVE-2007-4559`_.
|
|
|
|
|
|
|
|
|
|
Since :external+py3.11:mod:`tarfile` was first written, it's become more
|
|
|
|
|
accepted that warnings in documentation are not enough.
|
|
|
|
|
Whenever possible, an unsafe operation should be *explicitly requested*;
|
|
|
|
|
potentially dangerous operations should *look* dangerous.
|
|
|
|
|
However, ``TarFile.extractall`` looks benign in a code review.
|
|
|
|
|
|
|
|
|
|
Tarfile extraction is also exposed via :external+py3.11:func:`shutil.unpack_archive`,
|
|
|
|
|
which allows the user to not care about the kind of archive they're
|
|
|
|
|
dealing with.
|
|
|
|
|
The API is very inviting for extracting archives without prior inspection,
|
|
|
|
|
even though the docs again warn against it.
|
|
|
|
|
|
|
|
|
|
It has been argued that Python is not wrong -- it behaves exactly as
|
|
|
|
|
documented -- but that's beside the point.
|
|
|
|
|
Let's improve the situation rather than assign/avoid blame.
|
|
|
|
|
Python and its docs are the best place to improve things.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Rationale
|
|
|
|
|
=========
|
|
|
|
|
|
|
|
|
|
How do we improve things?
|
|
|
|
|
Unfortunately, we will need to change the defaults, which implies
|
|
|
|
|
breaking backwards compatibility. :external+py3.11:meth:`TarFile.extractall <tarfile.TarFile.extractall>`
|
|
|
|
|
is what people reach for when they need to extract a tarball.
|
|
|
|
|
Its default behaviour needs to change.
|
|
|
|
|
|
|
|
|
|
What would be the best behaviour? That depends on the use case.
|
|
|
|
|
So, we'll add several general “policies” to control extraction.
|
|
|
|
|
They are based on *use cases*, and ideally they should have straightforward
|
|
|
|
|
security implications:
|
|
|
|
|
|
|
|
|
|
- Current behavior: trusting the archive. Suitable e.g. as a building block
|
|
|
|
|
for libraries that do the check themselves, or extracting an archive you just
|
|
|
|
|
made yourself.
|
|
|
|
|
- Unpacking a UNIX archive: roughly following GNU ``tar``, e.g. stripping
|
|
|
|
|
leading ``/`` from filenames.
|
|
|
|
|
- Unpacking a general data archive: the :external+py3.11:func:`shutil.unpack_archive`
|
|
|
|
|
use case,
|
|
|
|
|
where it's not important to preserve details specific to ``tar`` or
|
|
|
|
|
Unix-like filesystems.
|
|
|
|
|
|
|
|
|
|
After a deprecation period, the last option -- the most limited
|
|
|
|
|
but most secure one -- will become the default.
|
|
|
|
|
|
|
|
|
|
Even with better general defaults, users should still verify the archives
|
|
|
|
|
they extract, and perhaps modify some of the metadata.
|
|
|
|
|
Superficially, the following looks like a reasonable way to do this today:
|
|
|
|
|
|
|
|
|
|
* Call :external+py3.11:meth:`TarFile.getmembers <tarfile.TarFile.getmembers>`
|
|
|
|
|
* Verify or modify each member's :external+py3.11:class:`~tarfile.TarInfo`
|
|
|
|
|
* Pass the result to ``extractall``'s ``members``
|
|
|
|
|
|
|
|
|
|
However, there are some issues with this approach:
|
|
|
|
|
|
|
|
|
|
- It's possible to modify ``TarInfo`` objects, but the changes to them
|
|
|
|
|
affect all subsequent operations on the same ``TarFile`` object.
|
|
|
|
|
This behavior is fine for most uses, but despite that, it would be very
|
|
|
|
|
surprising if ``TarFile.extractall`` did this by default.
|
|
|
|
|
- Calling ``getmembers`` can be expensive and it
|
|
|
|
|
`requires a seekable archive <https://github.com/python/cpython/issues/45385#issuecomment-1255615199>`__.
|
|
|
|
|
- When verifying members in advance, it may be necessary to track how each
|
|
|
|
|
member would have changed the filesystem, e.g. how symlinks are being set up.
|
|
|
|
|
This is hard. We can't expect users to do it.
|
|
|
|
|
|
|
|
|
|
To solve these issues we'll:
|
|
|
|
|
|
|
|
|
|
- Provide a supported way to “clone” and modify ``TarInfo`` objects.
|
|
|
|
|
A ``replace`` method, similar to :external+py3.11:func:`dataclasses.replace`
|
|
|
|
|
or :external+py3.11:meth:`namedtuple._replace <collections.somenamedtuple._replace>`
|
|
|
|
|
should do the trick.
|
|
|
|
|
- Provide a “filter” hook in ``extractall``'s loop that can modify or discard
|
|
|
|
|
members before they are processed.
|
|
|
|
|
- Require that this hook is called just before extracting each member,
|
|
|
|
|
so it can scan the *current* state of the disk. This will greatly simplify
|
|
|
|
|
the implementation of policies (both in stdlib and user code),
|
|
|
|
|
at the cost of not being able to do a precise “dry run”.
|
|
|
|
|
|
|
|
|
|
The hook API will be very similar to the existing ``filter`` argument
|
|
|
|
|
for :external+py3.11:meth:`TarFile.add <tarfile.TarFile.add>`.
|
|
|
|
|
We'll also name it ``filter``.
|
|
|
|
|
(In some cases “policy” would be a more fitting name,
|
|
|
|
|
but the API can be used for more than security policies.)
|
|
|
|
|
|
|
|
|
|
The built-in policies/filters described above will be implemented using the
|
|
|
|
|
public filter API, so they can be used as building blocks or examples.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Setting a precedent
|
|
|
|
|
-------------------
|
|
|
|
|
|
|
|
|
|
If and when other libraries for archive extraction, such as :external+py3.11:mod:`zipfile`,
|
|
|
|
|
gain similar functionality, they should mimic this API as much as it's
|
|
|
|
|
reasonable.
|
|
|
|
|
|
|
|
|
|
To enable this for simple cases, the built-in filters will have string names;
|
|
|
|
|
e.g. users can pass ``filter='data'`` instead of a specific function that deals
|
|
|
|
|
with :external+py3.11:class:`~tarfile.TarInfo` objects.
|
|
|
|
|
|
|
|
|
|
The :external+py3.11:func:`shutil.unpack_archive` function will get a
|
|
|
|
|
``filter`` argument, which it will pass to ``extractall``.
|
|
|
|
|
|
|
|
|
|
Adding function-based API that would work across archive formats is
|
|
|
|
|
out of scope of this PEP.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Full disclosure & redistributor info
|
|
|
|
|
------------------------------------
|
|
|
|
|
|
|
|
|
|
The PEP author works for Red Hat, a redistributor of Python with different
|
|
|
|
|
security needs and support periods than CPython in general.
|
|
|
|
|
Such redistributors may want to carry vendor patches to:
|
|
|
|
|
|
|
|
|
|
* Allow configuring the defaults system-wide, and
|
|
|
|
|
* Change the default as soon as possible, even in older Python versions.
|
|
|
|
|
|
|
|
|
|
The proposal makes this easy to do, and it allows users to query
|
|
|
|
|
the settings.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Specification
|
|
|
|
|
=============
|
|
|
|
|
|
|
|
|
|
Modifying and forgetting member metadata
|
|
|
|
|
----------------------------------------
|
|
|
|
|
|
|
|
|
|
The :external+py3.11:class:`~tarfile.TarInfo` class will gain a new method,
|
|
|
|
|
``replace()``, which will work similarly to ``dataclasses.replace``.
|
|
|
|
|
It will return a copy of the ``TarInfo`` object with attributes
|
|
|
|
|
replaced as specified by keyword-only arguments:
|
|
|
|
|
|
|
|
|
|
* ``name``
|
|
|
|
|
* ``mtime``
|
|
|
|
|
* ``mode``
|
|
|
|
|
* ``linkname``
|
|
|
|
|
* ``uid``
|
|
|
|
|
* ``gid``
|
|
|
|
|
* ``uname``
|
|
|
|
|
* ``gname``
|
|
|
|
|
|
|
|
|
|
Any of these, except ``name`` and ``linkname``, will be allowed to be set
|
|
|
|
|
to ``None``.
|
|
|
|
|
When ``extract`` or ``extractall`` encounters such a ``None``, it will not
|
|
|
|
|
set that piece of metadata.
|
|
|
|
|
(If ``uname`` or ``gname`` is ``None``, it will fall back to ``uid`` or ``gid``
|
|
|
|
|
as if the name wasn't found.)
|
|
|
|
|
When ``addfile`` or ``tobuf`` encounters such a ``None``, it will raise a
|
|
|
|
|
``ValueError``.
|
|
|
|
|
When ``list`` encounters such a ``None``, it will print a placeholder string.
|
|
|
|
|
|
|
|
|
|
The documentation will mention why the method is there:
|
|
|
|
|
``TarInfo`` objects retrieved from :external+py3.11:meth:`TarFile.getmembers <tarfile.TarFile.getmembers>`
|
|
|
|
|
are “live”; modifying them directly will affect subsequent unrelated
|
|
|
|
|
operations.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Filters
|
|
|
|
|
-------
|
|
|
|
|
|
|
|
|
|
:external+py3.11:meth:`TarFile.extract <tarfile.TarFile.extract>` and
|
|
|
|
|
:external+py3.11:meth:`TarFile.extractall <tarfile.TarFile.extractall>` methods
|
|
|
|
|
will grow a ``filter`` keyword-only parameter,
|
2023-03-06 10:31:52 -05:00
|
|
|
|
which takes a callable that can be called as::
|
2023-02-15 07:32:31 -05:00
|
|
|
|
|
|
|
|
|
filter(/, member: TarInfo, path: str) -> TarInfo|None
|
|
|
|
|
|
|
|
|
|
where ``member`` is the member to be extracted, and ``path`` is the path to
|
|
|
|
|
where the archive is extracted (i.e., it'll be the same for every member).
|
|
|
|
|
|
|
|
|
|
When used it will be called on each member as it is extracted,
|
|
|
|
|
and extraction will work with the result.
|
|
|
|
|
If it returns ``None``, the member will be skipped.
|
|
|
|
|
|
|
|
|
|
The function can also raise an exception.
|
|
|
|
|
This can, depending on ``TarFile.errorlevel``,
|
|
|
|
|
abort the extraction or cause the member to be skipped.
|
|
|
|
|
|
2023-03-06 10:31:52 -05:00
|
|
|
|
.. note::
|
|
|
|
|
|
|
|
|
|
If extraction is aborted, the archive may be left partially
|
|
|
|
|
extracted. It is the user’s responsibility to clean up.
|
|
|
|
|
|
2023-02-15 07:32:31 -05:00
|
|
|
|
We will also provide a set of defaults for common use cases.
|
|
|
|
|
In addition to a function, the ``filter`` argument can be one
|
|
|
|
|
of the following strings:
|
|
|
|
|
|
|
|
|
|
* ``'fully_trusted'``: Current behavior: honor the metadata as is.
|
|
|
|
|
Should be used if the user trusts the archive completely, or implements their
|
|
|
|
|
own complex verification.
|
|
|
|
|
* ``'tar'``: Roughly follow defaults of the GNU ``tar`` command
|
|
|
|
|
(when run as a normal user):
|
|
|
|
|
|
|
|
|
|
* Strip leading ``'/'`` and ``os.sep`` from filenames
|
|
|
|
|
* Refuse to extract files with absolute paths (after the ``/`` stripping
|
|
|
|
|
above, e.g. ``C:/foo`` on Windows).
|
|
|
|
|
* Refuse to extract files whose absolute path (after following symlinks)
|
|
|
|
|
would end up outside the destination.
|
|
|
|
|
(Note that GNU ``tar`` instead delays creating some links.)
|
|
|
|
|
* Clear high mode bits (setuid, setgid, sticky) and group/other write bits
|
|
|
|
|
(:external+py3.11:data:`S_IWGRP|S_IWOTH <stat.S_IWGRP>`).
|
|
|
|
|
(This is an approximation of GNU ``tar``'s default, which limits the mode
|
|
|
|
|
by the current ``umask`` setting.)
|
|
|
|
|
|
|
|
|
|
* ``'data'``: Extract a "data" archive, disallowing common attack vectors
|
|
|
|
|
but limiting functionality.
|
|
|
|
|
In particular, many features specific to UNIX-style filesystems (or
|
|
|
|
|
equivalently, to the ``tar`` archive format) are ignored, making this a good
|
|
|
|
|
filter for cross-platform archives.
|
|
|
|
|
In addition to ``tar``:
|
|
|
|
|
|
|
|
|
|
* Refuse to extract links (hard or soft) that link to absolute paths.
|
|
|
|
|
* Refuse to extract links (hard or soft) which end up linking to a path
|
|
|
|
|
outside of the destination.
|
|
|
|
|
(On systems that don't support links, ``tarfile`` will, in most cases,
|
|
|
|
|
fall back to creating regular files.
|
|
|
|
|
This proposal doesn't change that behaviour.)
|
|
|
|
|
* Refuse to extract device files (including pipes).
|
|
|
|
|
* For regular files and hard links:
|
|
|
|
|
|
|
|
|
|
* Set the owner read and write permissions (:external+py3.11:data:`S_IRUSR|S_IWUSR <stat.S_IRUSR>`).
|
|
|
|
|
* Remove the group & other *executable* permission (:external+py3.11:data:`S_IXGRP|S_IXOTH <stat.S_IXGRP>`)
|
|
|
|
|
if the owner doesn't have it (:external+py3.11:data:`~stat.S_IXUSR`).
|
|
|
|
|
|
|
|
|
|
* For other files (directories), ignore mode entirely (set it to ``None``).
|
|
|
|
|
* Ignore user and group info (set ``uid``, ``gid``, ``uname``, ``gname``
|
|
|
|
|
to ``None``).
|
|
|
|
|
|
2023-03-06 10:31:52 -05:00
|
|
|
|
Any other string will cause a ``ValueError``.
|
|
|
|
|
|
2023-02-15 07:32:31 -05:00
|
|
|
|
The corresponding filter functions will be available as
|
|
|
|
|
``tarfile.fully_trusted_filter()``, ``tarfile.tar_filter()``, etc., so
|
|
|
|
|
they can be easily used in custom policies.
|
|
|
|
|
|
|
|
|
|
Note that these filters never return ``None``.
|
|
|
|
|
Skipping members this way is a feature for user-defined filters.
|
|
|
|
|
|
|
|
|
|
Defaults and their configuration
|
|
|
|
|
--------------------------------
|
|
|
|
|
|
|
|
|
|
:external+py3.11:class:`~tarfile.TarFile` will gain a new attribute,
|
|
|
|
|
``extraction_filter``, to allow configuring the default filter.
|
|
|
|
|
By default it will be ``None``, but users can set it to a callable
|
|
|
|
|
that will be used if the ``filter`` argument is missing or ``None``.
|
|
|
|
|
|
|
|
|
|
.. note::
|
|
|
|
|
|
|
|
|
|
String names won't be accepted here. That would encourage code like
|
|
|
|
|
``my_tarfile.extraction_filter = 'data'``.
|
|
|
|
|
On Python versions without this feature, this would do nothing,
|
|
|
|
|
silently ignoring a security-related request.
|
|
|
|
|
|
|
|
|
|
If both the argument and attribute are ``None``:
|
|
|
|
|
|
2023-02-22 11:19:58 -05:00
|
|
|
|
* In Python 3.12-3.13, a ``DeprecationWarning`` will be emitted and
|
|
|
|
|
extraction will use the ``'fully_trusted'`` filter.
|
|
|
|
|
* In Python 3.14+, it will use the ``'data'`` filter.
|
2023-02-15 07:32:31 -05:00
|
|
|
|
|
|
|
|
|
Applications and system integrators may wish to change ``extraction_filter``
|
|
|
|
|
of the ``TarFile`` class itself to set a global default.
|
|
|
|
|
When using a function, they will generally want to wrap it in ``staticmethod()``
|
|
|
|
|
to prevent injection of a ``self`` argument.
|
|
|
|
|
|
|
|
|
|
Subclasses of ``TarFile`` can also override ``extraction_filter``.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
FilterError
|
|
|
|
|
-----------
|
|
|
|
|
|
|
|
|
|
A new exception, ``FilterError``, will be added to the :external+py3.11:mod:`tarfile`
|
|
|
|
|
module.
|
|
|
|
|
It'll have several new subclasses, one for each of the refusal reasons above.
|
|
|
|
|
``FilterError``'s ``member`` attribute will contain the relevant ``TarInfo``.
|
|
|
|
|
|
|
|
|
|
In the lists above, “refusing" to extract a file means that a ``FilterError``
|
|
|
|
|
will be raised.
|
|
|
|
|
As with other extraction errors, if the ``TarFile.errorlevel``
|
|
|
|
|
is 1 or more, this will abort the extraction; with ``errorlevel=0`` the error
|
|
|
|
|
will be logged and the member will be ignored, but extraction will continue.
|
|
|
|
|
Note that ``extractall()`` may leave the archive partially extracted;
|
|
|
|
|
it is the user's responsibility to clean up.
|
|
|
|
|
|
|
|
|
|
|
2023-05-10 06:54:39 -04:00
|
|
|
|
Errorlevel, and fatal/non-fatal errors
|
|
|
|
|
--------------------------------------
|
|
|
|
|
|
|
|
|
|
Currently, :external+py3.11:class:`~tarfile.TarFile` has an *errorlevel*
|
|
|
|
|
argument/attribute, which specifies how errors are handled:
|
|
|
|
|
|
|
|
|
|
- With ``errorlevel=0``, documentation says that “all errors are ignored
|
|
|
|
|
when using :external+py3.11:meth:`~tarfile.TarFile.extract` and
|
|
|
|
|
:external+py3.11:meth:`~tarfile.TarFile.extractall`”.
|
|
|
|
|
The code only ignores *non-fatal* and *fatal* errors (see below),
|
|
|
|
|
so, for example, you still get ``TypeError`` if you pass ``None`` as the
|
|
|
|
|
destination path.
|
|
|
|
|
- With ``errorlevel=1`` (the default), all *non-fatal* errors are ignored.
|
|
|
|
|
(They may be logged to ``sys.stderr`` by setting the *degug*
|
|
|
|
|
argument/attribute.)
|
|
|
|
|
Which errors are *non-fatal* is not defined in documentation, but code treats
|
|
|
|
|
``ExtractionError`` as such. Specifically, it's these issues:
|
|
|
|
|
|
|
|
|
|
- “unable to resolve link inside archive” (raised on systems that do not
|
|
|
|
|
support symlinks)
|
|
|
|
|
- “fifo/special devices not supported by system” (not used for failures if
|
|
|
|
|
the system supports these, e.g. for a ``PermissionError``)
|
|
|
|
|
- “could not change owner/mode/modification time”
|
|
|
|
|
|
|
|
|
|
Note that, for example, *file name too long* or *out of disk space* don't
|
|
|
|
|
qualify.
|
|
|
|
|
The *non-fatal* errors are not very likely to appear on a Unix-like system.
|
|
|
|
|
- With ``errorlevel=2``, all errors are raised, including *fatal* ones.
|
|
|
|
|
Which errors are *fatal* is, again, not defined; in practice it's
|
|
|
|
|
``OSError``.
|
|
|
|
|
|
|
|
|
|
A filter refusing to extract a member does not fit neatly into the
|
|
|
|
|
*fatal*/*non-fatal* categories.
|
|
|
|
|
|
|
|
|
|
- This PEP does not change existing behavior. (Ideas for improvements are
|
|
|
|
|
welcome in `Discourse topic 25970 <https://discuss.python.org/t/25970>`_.)
|
|
|
|
|
- When a filter refuses to extract a member, the error should not pass
|
|
|
|
|
silently by default.
|
|
|
|
|
|
|
|
|
|
To satisfy this, ``FilterError`` will be considered a *fatal* error, that is,
|
|
|
|
|
it'll be ignored only with ``errorlevel=0``.
|
|
|
|
|
|
|
|
|
|
Users that want to ignore ``FilterError`` but not other *fatal* errors should
|
|
|
|
|
create a custom filter function, and call another filter in a ``try`` block.
|
|
|
|
|
|
|
|
|
|
|
2023-02-15 07:32:31 -05:00
|
|
|
|
Hints for further verification
|
|
|
|
|
------------------------------
|
|
|
|
|
|
|
|
|
|
Even with the proposed changes, :external+py3.11:mod:`tarfile` will not be
|
|
|
|
|
suited for extracting untrusted files without prior inspection.
|
|
|
|
|
Among other issues, the proposed policies don't prevent denial-of-service
|
|
|
|
|
attacks.
|
|
|
|
|
Users should do additional checks.
|
|
|
|
|
|
|
|
|
|
New docs will tell users to consider:
|
|
|
|
|
|
|
|
|
|
* extracting to a new empty directory,
|
|
|
|
|
* using external (e.g. OS-level) limits on disk, memory and CPU usage,
|
|
|
|
|
* checking filenames against an allow-list of characters (to filter out control
|
|
|
|
|
characters, confusables, etc.),
|
|
|
|
|
* checking that filenames have expected extensions (discouraging files that
|
|
|
|
|
execute when you “click on them”, or extension-less files like Windows
|
|
|
|
|
special device names),
|
2023-02-22 11:19:58 -05:00
|
|
|
|
* limiting the number of extracted files, total size of extracted data,
|
2023-02-15 07:32:31 -05:00
|
|
|
|
and size of individual files,
|
|
|
|
|
* checking for files that would be shadowed on case-insensitive filesystems.
|
|
|
|
|
|
|
|
|
|
Also, the docs will note that:
|
|
|
|
|
|
|
|
|
|
* tar files commonly contain multiple versions of the same file: later ones are
|
|
|
|
|
expected to overwrite earlier ones on extraction,
|
|
|
|
|
* ``tarfile`` does not protect against issues with “live” data, e.g. an attacker
|
|
|
|
|
tinkering with the destination directory while extracting (or adding) is
|
|
|
|
|
going on (see the `GNU tar manual <https://www.gnu.org/software/tar/manual/html_node/Live-untrusted-data.html#Live-untrusted-data>`__
|
|
|
|
|
for more info).
|
|
|
|
|
|
|
|
|
|
This list is not comprehensive, but the documentation is a good place to
|
|
|
|
|
collect such general tips.
|
|
|
|
|
It can be moved into a separate document if grows too long or if it needs to
|
|
|
|
|
be consolidated with :external+py3.11:mod:`zipfile` or :external+py3.11:mod:`shutil`
|
|
|
|
|
(which is out of scope for this proposal).
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
.. _706-offset:
|
|
|
|
|
|
|
|
|
|
TarInfo identity, and ``offset``
|
|
|
|
|
--------------------------------
|
|
|
|
|
|
|
|
|
|
With filters that use ``replace()``, the ``TarInfo`` objects handled
|
|
|
|
|
by the extraction machinery will not necessarily be the same objects
|
|
|
|
|
as those present in ``members``.
|
|
|
|
|
This may affect ``TarInfo`` subclasses that override methods like
|
|
|
|
|
``makelink`` and rely on object identity.
|
|
|
|
|
|
|
|
|
|
Such code can switch to comparing ``offset``, the position of the member
|
|
|
|
|
header inside the file.
|
|
|
|
|
|
|
|
|
|
Note that both the overridable methods and ``offset`` are only
|
|
|
|
|
documented in source comments.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
tarfile CLI
|
|
|
|
|
-----------
|
|
|
|
|
|
|
|
|
|
The CLI (``python -m tarfile``) will gain a ``--filter`` option
|
2023-02-22 11:19:58 -05:00
|
|
|
|
that will take the name of one of the provided default filters.
|
2023-02-15 07:32:31 -05:00
|
|
|
|
It won't be possible to specify a custom filter function.
|
|
|
|
|
|
|
|
|
|
If ``--filter`` is not given, the CLI will use the default filter
|
2023-02-22 11:19:58 -05:00
|
|
|
|
(``'fully_trusted'`` with a deprecation warning now, and ``'data'`` from
|
|
|
|
|
Python 3.14 on).
|
2023-02-15 07:32:31 -05:00
|
|
|
|
|
|
|
|
|
There will be no short option. (``-f`` would be confusingly similar to
|
|
|
|
|
the filename option of GNU ``tar``.)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Other archive libraries
|
|
|
|
|
-----------------------
|
|
|
|
|
|
|
|
|
|
If and when other archive libraries, such as :external+py3.11:mod:`zipfile`,
|
|
|
|
|
grow similar functionality, their extraction functions should use a ``filter``
|
|
|
|
|
argument that takes, at least, the strings ``'fully_trusted'`` (which should
|
|
|
|
|
disable any security precautions) and ``'data'`` (which should avoid features
|
|
|
|
|
that might surprise users).
|
|
|
|
|
|
|
|
|
|
Standardizing a function-based filter API is out of scope of this PEP.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Shutil
|
|
|
|
|
------
|
|
|
|
|
|
|
|
|
|
:external+py3.11:func:`shutil.unpack_archive` will gain a ``filter`` argument.
|
|
|
|
|
If it's given, it will be passed to the underlying extraction function.
|
|
|
|
|
Passing it for a ``zip`` archive will fail for now (until :external+py3.11:mod:`zipfile`
|
|
|
|
|
gains a ``filter`` argument, if it ever does).
|
|
|
|
|
|
|
|
|
|
If ``filter`` is not specified (or left as ``None``), it won't be passed
|
|
|
|
|
on, so extracting a tarball will use the default filter
|
2023-02-22 11:19:58 -05:00
|
|
|
|
(``'fully_trusted'`` with a deprecation warning now, and ``'data'`` from
|
|
|
|
|
Python 3.14 on).
|
|
|
|
|
|
2023-02-15 07:32:31 -05:00
|
|
|
|
|
|
|
|
|
Complex filters
|
|
|
|
|
---------------
|
|
|
|
|
|
|
|
|
|
Note that some user-defined filters need, for example,
|
|
|
|
|
to count extracted members of do post-processing.
|
|
|
|
|
This requires a more complex API than a ``filter`` callable.
|
|
|
|
|
However, that complex API need not be exposed to ``tarfile``.
|
|
|
|
|
For example, with a hypothetical ``StatefulFilter`` users would write::
|
|
|
|
|
|
|
|
|
|
with StatefulFilter() as filter_func:
|
|
|
|
|
my_tar.extract(path, filter=filter_func)
|
|
|
|
|
|
|
|
|
|
A simple ``StatefulFilter`` example will be added to the docs.
|
|
|
|
|
|
2023-02-22 11:19:58 -05:00
|
|
|
|
.. note::
|
|
|
|
|
|
|
|
|
|
The need for stateful filters is a reason against allowing
|
|
|
|
|
registration of custom filter names in addition to ``'fully_trusted'``,
|
|
|
|
|
``'tar'`` and ``'data'``.
|
|
|
|
|
With such a mechanism, API for (at least) set-up and tear-down would need
|
|
|
|
|
to be set in stone.
|
|
|
|
|
|
2023-02-15 07:32:31 -05:00
|
|
|
|
|
|
|
|
|
Backwards Compatibility
|
|
|
|
|
=======================
|
|
|
|
|
|
|
|
|
|
The default behavior of :external+py3.11:meth:`TarFile.extract <tarfile.TarFile.extract>`
|
|
|
|
|
and :external+py3.11:meth:`TarFile.extractall <tarfile.TarFile.extractall>`
|
|
|
|
|
will change, after raising ``DeprecationWarning`` for 2 releases
|
|
|
|
|
(shortest deprecation period allowed in Python's
|
|
|
|
|
:pep:`backwards compatibility policy <387>`).
|
|
|
|
|
|
|
|
|
|
Additionally, code that relies on :external+py3.11:class:`tarfile.TarInfo`
|
|
|
|
|
object identity may break, see :ref:`706-offset`.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Backporting & Forward Compatibility
|
|
|
|
|
===================================
|
|
|
|
|
|
|
|
|
|
This feature may be backported to older versions of Python.
|
|
|
|
|
|
|
|
|
|
In CPython, we don't add warnings to patch releases, so the default
|
|
|
|
|
filter should be changed to ``'fully_trusted'`` in backports.
|
|
|
|
|
|
|
|
|
|
Other than that, *all* of the changes to ``tarfile`` should be backported, so
|
|
|
|
|
``hasattr(tarfile, 'data_filter')`` becomes a reliable check for all
|
|
|
|
|
of the new functionality.
|
|
|
|
|
|
|
|
|
|
Note that CPython's usual policy is to avoid adding new APIs in security
|
|
|
|
|
backports.
|
|
|
|
|
This feature does not make sense without a new API
|
|
|
|
|
(``TarFile.extraction_filter`` and the ``filter`` argument),
|
|
|
|
|
so we'll make an exception.
|
|
|
|
|
(See `Discourse comment 23149/16 <https://discuss.python.org/t/23149/16>`__
|
|
|
|
|
for details.)
|
|
|
|
|
|
|
|
|
|
Here are examples of code that takes into account that ``tarfile`` may or may
|
2023-02-21 09:11:43 -05:00
|
|
|
|
not have the proposed feature.
|
2023-02-15 07:32:31 -05:00
|
|
|
|
|
|
|
|
|
When copying these snippets, note that setting ``extraction_filter``
|
|
|
|
|
will affect subsequent operations.
|
|
|
|
|
|
|
|
|
|
* Fully trusted archive::
|
|
|
|
|
|
|
|
|
|
my_tarfile.extraction_filter = (lambda member, path: member)
|
|
|
|
|
my_tarfile.extractall()
|
|
|
|
|
|
|
|
|
|
* Use the ``'data'`` filter if available, but revert to Python 3.11 behavior
|
|
|
|
|
(``'fully_trusted'``) if this feature is not available::
|
|
|
|
|
|
|
|
|
|
my_tarfile.extraction_filter = getattr(tarfile, 'data_filter',
|
|
|
|
|
(lambda member, path: member))
|
|
|
|
|
my_tarfile.extractall()
|
|
|
|
|
|
|
|
|
|
(This is an unsafe operation, so it should be spelled out explicitly,
|
|
|
|
|
ideally with a comment.)
|
|
|
|
|
|
|
|
|
|
* Use the ``'data'`` filter; *fail* if it is not available::
|
|
|
|
|
|
|
|
|
|
my_tarfile.extractall(filter=tarfile.data_filter)
|
|
|
|
|
|
|
|
|
|
or::
|
|
|
|
|
|
|
|
|
|
my_tarfile.extraction_filter = tarfile.data_filter
|
|
|
|
|
my_tarfile.extractall()
|
|
|
|
|
|
|
|
|
|
* Use the ``'data'`` filter; *warn* if it is not available::
|
|
|
|
|
|
|
|
|
|
if hasattr(tarfile, 'data_filter'):
|
|
|
|
|
my_tarfile.extractall(filter='data')
|
|
|
|
|
else:
|
|
|
|
|
# remove this when no longer needed
|
|
|
|
|
warn_the_user('Extracting may be unsafe; consider updating Python')
|
|
|
|
|
my_tarfile.extractall()
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Security Implications
|
|
|
|
|
=====================
|
|
|
|
|
|
|
|
|
|
This proposal improves security, at the expense of backwards compatibility.
|
|
|
|
|
In particular, it will help users avoid `CVE-2007-4559`_.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
How to Teach This
|
|
|
|
|
=================
|
|
|
|
|
|
|
|
|
|
The API, usage notes and tips for further verification will be added to
|
|
|
|
|
the documentation.
|
2023-05-01 15:00:36 -04:00
|
|
|
|
These should be usable for users who are familiar with archives in general, but
|
2023-02-15 07:32:31 -05:00
|
|
|
|
not with the specifics of UNIX filesystems nor the related security issues.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Reference Implementation
|
|
|
|
|
========================
|
|
|
|
|
|
2023-03-27 07:34:08 -04:00
|
|
|
|
See `pull request #102953 <https://github.com/python/cpython/pull/102953>`_ on GitHub.
|
2023-02-15 07:32:31 -05:00
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Rejected Ideas
|
|
|
|
|
==============
|
|
|
|
|
|
|
|
|
|
SafeTarFile
|
|
|
|
|
-----------
|
|
|
|
|
|
|
|
|
|
An initial idea from Lars Gustäbel was to provide a separate class that
|
|
|
|
|
implements security checks (see `gh-65308`_).
|
|
|
|
|
There are two major issues with this approach:
|
|
|
|
|
|
|
|
|
|
* The name is misleading. General archive operations can never be made “safe”
|
|
|
|
|
from all kinds of unwanted behavior, without impacting legitimate use cases.
|
|
|
|
|
* It does not solve the problem of unsafe defaults.
|
|
|
|
|
|
|
|
|
|
However, many of the ideas behind SafeTarFile were reused in this PEP.
|
|
|
|
|
|
|
|
|
|
Add absolute_path option to tarfile
|
|
|
|
|
-----------------------------------
|
|
|
|
|
|
|
|
|
|
Issue `gh-73974`_ asks for adding an ``absolute_path`` option to extraction
|
|
|
|
|
methods. This would be a minimal change to formally resolve `CVE-2007-4559`_.
|
2023-05-01 15:00:36 -04:00
|
|
|
|
It doesn't go far enough to protect the unaware, nor to empower the diligent
|
2023-02-15 07:32:31 -05:00
|
|
|
|
and curious.
|
|
|
|
|
|
|
|
|
|
Other names for the ``'tar'`` filter
|
|
|
|
|
------------------------------------
|
|
|
|
|
|
|
|
|
|
The ``'tar'`` filter exposes features specific to UNIX-like filesystems,
|
|
|
|
|
so it could be named ``'unix'``.
|
|
|
|
|
Or ``'unix-like'``, ``'nix'``, ``'*nix'``, ``'posix'``?
|
|
|
|
|
|
2023-02-15 19:49:24 -05:00
|
|
|
|
Feature-wise, *tar format* and *UNIX-like filesystem* are essentially
|
2023-02-15 07:32:31 -05:00
|
|
|
|
equivalent, so ``tar`` is a good name.
|
|
|
|
|
|
|
|
|
|
|
2023-02-22 11:19:58 -05:00
|
|
|
|
Possible Further Work
|
|
|
|
|
=====================
|
2023-02-15 07:32:31 -05:00
|
|
|
|
|
2023-02-22 11:19:58 -05:00
|
|
|
|
Adding filters to zipfile and shutil.unpack_archive
|
|
|
|
|
---------------------------------------------------
|
|
|
|
|
|
|
|
|
|
For consistency, :external+py3.11:mod:`zipfile` and
|
|
|
|
|
:external+py3.11:func:`shutil.unpack_archive` could gain support
|
|
|
|
|
for a ``filter`` argument.
|
|
|
|
|
However, this would require research that this PEP's author can't promise
|
|
|
|
|
for Python 3.12.
|
|
|
|
|
|
|
|
|
|
Filters for ``zipfile`` would probably not help security.
|
|
|
|
|
Zip is used primarily for cross-platform data bundles, and correspondingly,
|
|
|
|
|
:external+py3.11:meth:`ZipFile.extract <zipfile.ZipFile.extract>`'s defaults
|
|
|
|
|
are already similar to what a ``'data'`` filter would do.
|
|
|
|
|
A ``'fully_trusted'`` filter, which would *newly allow* absolute paths and
|
|
|
|
|
``..`` path components, might not be useful for much except
|
|
|
|
|
a unified ``unpack_archive`` API.
|
|
|
|
|
|
|
|
|
|
Filters should be useful for use cases other than security, but those
|
|
|
|
|
would usually need custom filter functions, and those would need API that works
|
|
|
|
|
with both :external+py3.11:class:`~tarfile.TarInfo` and
|
|
|
|
|
:external+py3.11:class:`~zipfile.ZipInfo`.
|
|
|
|
|
That is *definitely* out of scope of this PEP.
|
|
|
|
|
|
|
|
|
|
If only this PEP is implemented and nothing changes for ``zipfile``,
|
|
|
|
|
the effect for callers of ``unpack_archive`` is that the default
|
|
|
|
|
for *tar* files is changing from ``'fully_trusted'`` to
|
|
|
|
|
the more appropriate ``'data'``.
|
|
|
|
|
In the interim period, Python 3.12-3.13 will emit ``DeprecationWarning``.
|
|
|
|
|
That's annoying, but there are several ways to handle it: e.g. add a
|
|
|
|
|
``filter`` argument conditionally, set ``TarFile.extraction_filter``
|
|
|
|
|
globally, or ignore/suppress the warning until Python 3.14.
|
|
|
|
|
|
|
|
|
|
Also, since many calls to ``unpack_archive`` are likely to be unsafe,
|
|
|
|
|
there's hope that the ``DeprecationWarning`` will often turn out to be
|
|
|
|
|
a helpful hint to review affected code.
|
2023-02-15 07:32:31 -05:00
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Thanks
|
|
|
|
|
======
|
|
|
|
|
|
|
|
|
|
This proposal is based on prior work and discussions by many people,
|
|
|
|
|
in particular Lars Gustäbel, Gregory P. Smith, Larry Hastings, Joachim Wagner,
|
|
|
|
|
Jan Matejek, Jakub Wilk, Daniel Garcia, Lumír Balhar, Miro Hrončok,
|
|
|
|
|
and many others.
|
|
|
|
|
|
|
|
|
|
References
|
|
|
|
|
==========
|
|
|
|
|
|
|
|
|
|
.. _CVE-2007-4559: https://nvd.nist.gov/vuln/detail/CVE-2007-4559
|
|
|
|
|
|
|
|
|
|
.. _gh-65308: https://github.com/python/cpython/issues/65308
|
|
|
|
|
|
|
|
|
|
.. _gh-73974: https://github.com/python/cpython/issues/73974
|
|
|
|
|
|
|
|
|
|
Copyright
|
|
|
|
|
=========
|
|
|
|
|
|
|
|
|
|
This document is placed in the public domain or under the
|
|
|
|
|
CC0-1.0-Universal license, whichever is more permissive.
|