PEP 675: Updates (#2282)

2022-01-31 19:20:47 -08:00 · 2022-01-31 19:20:47 -08:00 · 6f11197188
parent e43f567e93
commit 6f11197188
1 changed files with 329 additions and 36 deletions
--- a/pep-0675.rst
+++ b/pep-0675.rst
@@ -82,7 +82,7 @@ the AST or by other semantic pattern-matching. These tools, however,
 preclude common idioms like storing a large multi-line query in a
 variable before executing it, adding literal string modifiers to the
 query based on some conditions, or transforming the query string using
-a function. (We survey existing tools in the "Rejected Alternatives"
+a function. (We survey existing tools in the `Rejected Alternatives`_
 section.) For example, many tools will detect a false positive issue
 in this benign snippet:

@@ -112,7 +112,7 @@ generalization of the ``Literal["foo"]`` type from :pep:`586`.
 A string of type
 ``Literal[str]`` cannot contain user-controlled data. Thus, any API
 that only accepts ``Literal[str]`` will be immune to injection
-vulnerabilities (with pragmatic `limitations <Appendix B:
+vulnerabilities (with `pragmatic limitations <Appendix B:
 Limitations_>`_).

 Since we want the ``sqlite3`` ``execute`` method to disallow strings
@@ -202,9 +202,9 @@ heuristics, such as regex-filtering for obviously malicious payloads,
 there will always be a way to work around them (perfectly
 distinguishing good and bad queries reduces to the halting problem).

-Static approaches like checking the AST to see if the query string is
-a literal string expression cannot tell when a string is assigned to
-an intermediate variable or when it is transformed by a benign
+Static approaches, such as checking the AST to see if the query string
+is a literal string expression, cannot tell when a string is assigned
+to an intermediate variable or when it is transformed by a benign
 function. This makes them overly restrictive.
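
To make the limitation of AST-only checks concrete, here is a minimal sketch of such a check. The function name ``non_literal_execute_calls`` and the sample snippet are hypothetical and are not taken from any existing tool; the sketch only assumes that the check looks at the first argument of calls to a method named ``execute``.

```python
import ast


def non_literal_execute_calls(source: str) -> list[int]:
    """Return line numbers of ``execute`` calls whose first argument
    is not written directly as a string literal."""
    flagged = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "execute"
            and node.args
        ):
            first = node.args[0]
            is_str_literal = isinstance(first, ast.Constant) and isinstance(
                first.value, str
            )
            if not is_str_literal:
                flagged.append(node.lineno)
    return flagged


# Hypothetical benign code: the query is a literal stored in a variable.
BENIGN_SNIPPET = '''
query = "SELECT * FROM users"
conn.execute(query)
conn.execute("SELECT 1")
'''

flagged_lines = non_literal_execute_calls(BENIGN_SNIPPET)
print(flagged_lines)
```

The call on line 3 of the snippet is flagged even though ``query`` can only ever hold a literal, which is the kind of false positive the paragraph above describes.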

 The type checker, surprisingly, does better than both because it has
@@ -300,6 +300,7 @@ if they evaluate to the same value (``str``), such as
 Type Inference
 ==============

+.. _inferring_literal_str:

 Inferring ``Literal[str]``
 --------------------------
@@ -327,6 +328,10 @@ following cases:
  has type ``Literal[str]`` if and only if ``s`` and the arguments have
  types compatible with ``Literal[str]``.

++ Literal-preserving methods: In `Appendix C <appendix_C_>`_, we have
+  provided an exhaustive list of ``str`` methods that preserve the
+  ``Literal[str]`` type.
+
 In all other cases, if one or more of the composed values has a
 non-literal type ``str``, the composition of types will have type
 ``str``. For example, if ``s`` has type ``str``, then ``"hello" + s``
@@ -337,10 +342,6 @@ checkers.
 methods from ``str``. So, if we have a variable ``s`` of type
 ``Literal[str]``, it is safe to write ``s.startswith("hello")``.

-Note that, beyond the few composition rules mentioned above, this PEP
-doesn't change inference for other ``str`` methods such as
-``literal_string.upper()``.
-
 Some type checkers refine the type of a string when doing an equality
 check:

@@ -366,7 +367,7 @@ See the examples below to help clarify the above rules:
    s: str = literal_string  # OK

    literal_string: Literal[str] = s  # Error: Expected Literal[str], got str.
-    literal_string: Literal[str] = "hello" # OK
+    literal_string: Literal[str] = "hello"  # OK


    def expect_literal_str(s: Literal[str]) -> None: ...
@@ -577,11 +578,10 @@ Rejected Alternatives
 Why not use tool X?
 -------------------

-Focusing solely on the example of preventing SQL injection, tooling to
-catch this kind of issue seems to come in three flavors: AST based,
-function level analysis, and taint flow analysis.
+Tools to catch issues such as SQL injection seem to come in three
+flavors: AST-based, function-level analysis, and taint flow analysis.

-**AST based tools include Bandit**: `Bandit
+**AST-based tools**: `Bandit
 <https://github.com/PyCQA/bandit/blob/aac3f16f45648a7756727286ba8f8f0cf5e7d408/bandit/plugins/django_sql_injection.py#L102>`_
 has a plugin to warn when SQL queries are not literal
 strings. The problem is that many perfectly safe SQL
@@ -630,7 +630,7 @@ handles it with no burden on the programmer:

    # Example usage
    data_to_insert = {
-        "column_1": value_1, # Note: values are not literals
+        "column_1": value_1,  # Note: values are not literals
        "column_2": value_2,
        "column_3": value_3,
    }
@@ -650,6 +650,14 @@ on to library users instead of allowing the libraries themselves to
 specify precisely how their APIs must be called (as is possible with
 ``Literal[str]``).

+One final reason to prefer using a new type over a dedicated tool is
+that type checkers are more widely used than dedicated security
+tooling; for example, mypy was downloaded `over 7 million times
+<https://www.pypistats.org/packages/mypy>`_ in January 2022, versus `fewer than
+2 million times <https://www.pypistats.org/packages/bandit>`_ for
+Bandit. Having security protections built right into type checkers
+will mean that more developers benefit from them.
+

 Why not use a ``NewType`` for ``str``?
 --------------------------------------
@@ -748,27 +756,8 @@ The implementation simply extends the type checker with
 ``Literal[str]`` as a supertype of literal string types.

 To support composition via addition, join, etc., it was sufficient to
-overload the stubs for ``str`` in Pyre's copy of typeshed. For
-example, we replaced ``str`` ``__add__``:
+overload the stubs for ``str`` in Pyre's copy of typeshed.

-::
-
-    # Before:
-    def __add__(self, s: str) -> str: ...
-
-    # After:
-    @overload
-    def __add__(self: Literal[str], other: Literal[str]) -> Literal[str]: ...
-    @overload
-    def __add__(self, other: str) -> str: ...
-
-This means that addition of non-literal string types remains to have
-type ``str``. The only change is that addition of literal string types
-now produces ``Literal[str]``.
-
-One implementation strategy is to update the official Typeshed `stub
-<https://github.com/python/typeshed/blob/aa7e277adb9049e24ea3434fc9848defbfa87673/stdlib/builtins.pyi#L420>`_
-for ``str`` with these changes.

 Appendix A: Other Uses
 ======================
@@ -868,6 +857,40 @@ the ``Template`` API to only accept ``Literal[str]``:
        def __init__(self, source: Literal[str]): ...


+Logging Format String Injection
+-------------------------------
+
+Logging frameworks often allow their input strings to contain
+formatting directives. At its worst, allowing users to control the
+logged string has led to `CVE-2021-44228
+<https://nvd.nist.gov/vuln/detail/CVE-2021-44228>`_ (colloquially
+known as ``log4shell``), which has been described as the `"most
+critical vulnerability of the last decade"
+<https://www.theguardian.com/technology/2021/dec/10/software-flaw-most-critical-vulnerability-log-4-shell>`_.
+While no Python frameworks are currently known to be vulnerable to a
+similar attack, the built-in logging framework does provide formatting
+options that are vulnerable to denial-of-service attacks from
+externally controlled logging strings. The following example
+illustrates a simple denial-of-service scenario:
+
+::
+
+    external_string = "%(foo)999999999s"
+    ...
+    # Tries to add > 1GB of whitespace to the logged string:
+    logger.info(f'Received: {external_string}', some_dict)
+
+This kind of attack could be prevented by requiring that the format
+string passed to the logger be a ``Literal[str]`` and that all
+externally controlled data be passed separately as arguments (as
+proposed in `Issue 46200 <https://bugs.python.org/issue46200>`_):
+
+::
+
+    def info(msg: Literal[str], *args: object) -> None:
+        ...
+
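The difference between embedding external data in the format string and passing it as an argument can be observed with the standard ``logging`` module. The sketch below is illustrative only: the helper ``render`` and the logger name ``pep675_demo`` are hypothetical, and the field width is reduced to 20 so the example stays cheap to run.

```python
import io
import logging


def render(msg, *args):
    """Log ``msg`` with ``args`` at INFO level and return the formatted text."""
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(message)s"))
    logger = logging.getLogger("pep675_demo")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep the demo output out of the root logger
    logger.addHandler(handler)
    try:
        logger.info(msg, *args)
    finally:
        logger.removeHandler(handler)
    return stream.getvalue().rstrip("\n")


# Stand-in for attacker-controlled input (width 20 instead of 999999999).
external_string = "%(foo)20s"
some_dict = {"foo": "x"}

# Unsafe: the external data becomes part of the format string, so its
# directive is interpreted and padding is inserted.
unsafe = render(f"Received: {external_string}", some_dict)

# Safer: the format string is a literal and the external data is an argument,
# so the directive is logged verbatim.
safe = render("Received: %s", external_string)

print(repr(unsafe))
print(repr(safe))
```

With a ``Literal[str]`` annotation on the format-string parameter, a type checker would reject the first call because an f-string containing a ``str`` value is not a literal string.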
+
 Appendix B: Limitations
 =======================

@@ -913,6 +936,275 @@ is documentation, which is easily ignored and often not seen. With
 ``Literal[str]``, API misuse requires conscious thought and artifacts
 in the code that reviewers and future developers can notice.

+.. _appendix_C:
+
+Appendix C: ``str`` methods that preserve ``Literal[str]``
+==========================================================
+
+The ``str`` class has several methods that would benefit from
+``Literal[str]``. For example, users might expect
+``"hello".capitalize()`` to have the type ``Literal[str]`` similar to
+the other examples we have seen in the `Inferring Literal[str]
+<inferring_literal_str_>`_ section. Inferring the type ``Literal[str]``
+is correct because the string is not an arbitrary user-supplied
+string: we know that it has the type ``Literal["Hello"]``, which is
+compatible with ``Literal[str]``. In other words, the ``capitalize``
+method preserves the ``Literal[str]`` type. There are several other
+``str`` methods that preserve ``Literal[str]``.
+
+We propose updating the stub for ``str`` in typeshed so that the
+methods are overloaded with the ``Literal[str]``-preserving
+versions. This means type checkers do not have to hardcode
+``Literal[str]`` behavior for each method. It also lets us easily
+support new methods in the future by updating the typeshed stub.
+
+For example, to preserve literal types for the ``capitalize`` method,
+we would change the stub as below:
+
+::
+
+    # before
+    def capitalize(self) -> str: ...
+
+    # after
+    @overload
+    def capitalize(self: Literal[str]) -> Literal[str]: ...
+    @overload
+    def capitalize(self) -> str: ...
+
+The downside of changing the ``str`` stub is that the stub becomes
+more complicated and can make error messages harder to
+understand. Type checkers may need to special-case ``str`` to make
+error messages understandable for users.
+
+Below is an exhaustive list of ``str`` methods which, when called as
+indicated with arguments of type ``Literal[str]``, must be treated as
+returning a ``Literal[str]``. If this PEP is accepted, we will update
+these method signatures in typeshed:
+
+::
+
+    @overload
+    def capitalize(self: Literal[str]) -> Literal[str]: ...
+    @overload
+    def capitalize(self) -> str: ...
+
+    @overload
+    def casefold(self: Literal[str]) -> Literal[str]: ...
+    @overload
+    def casefold(self) -> str: ...
+
+    @overload
+    def center(self: Literal[str], __width: SupportsIndex, __fillchar: Literal[str] = ...) -> Literal[str]: ...
+    @overload
+    def center(self, __width: SupportsIndex, __fillchar: str = ...) -> str: ...
+
+    if sys.version_info >= (3, 8):
+        @overload
+        def expandtabs(self: Literal[str], tabsize: SupportsIndex = ...) -> Literal[str]: ...
+        @overload
+        def expandtabs(self, tabsize: SupportsIndex = ...) -> str: ...
+
+    else:
+        @overload
+        def expandtabs(self: Literal[str], tabsize: int = ...) -> Literal[str]: ...
+        @overload
+        def expandtabs(self, tabsize: int = ...) -> str: ...
+
+    @overload
+    def format(self: Literal[str], *args: Literal[str], **kwargs: Literal[str]) -> Literal[str]: ...
+    @overload
+    def format(self, *args: str, **kwargs: str) -> str: ...
+
+    @overload
+    def join(self: Literal[str], __iterable: Iterable[Literal[str]]) -> Literal[str]: ...
+    @overload
+    def join(self, __iterable: Iterable[str]) -> str: ...
+
+    @overload
+    def ljust(self: Literal[str], __width: SupportsIndex, __fillchar: Literal[str] = ...) -> Literal[str]: ...
+    @overload
+    def ljust(self, __width: SupportsIndex, __fillchar: str = ...) -> str: ...
+
+    @overload
+    def lower(self: Literal[str]) -> Literal[str]: ...
+    @overload
+    def lower(self) -> str: ...
+
+    @overload
+    def lstrip(self: Literal[str], __chars: Literal[str] | None = ...) -> Literal[str]: ...
+    @overload
+    def lstrip(self, __chars: str | None = ...) -> str: ...
+
+    @overload
+    def partition(self: Literal[str], __sep: Literal[str]) -> tuple[Literal[str], Literal[str], Literal[str]]: ...
+    @overload
+    def partition(self, __sep: str) -> tuple[str, str, str]: ...
+
+    @overload
+    def replace(self: Literal[str], __old: Literal[str], __new: Literal[str], __count: SupportsIndex = ...) -> Literal[str]: ...
+    @overload
+    def replace(self, __old: str, __new: str, __count: SupportsIndex = ...) -> str: ...
+
+    if sys.version_info >= (3, 9):
+        @overload
+        def removeprefix(self: Literal[str], __prefix: Literal[str]) -> Literal[str]: ...
+        @overload
+        def removeprefix(self, __prefix: str) -> str: ...
+
+        @overload
+        def removesuffix(self: Literal[str], __suffix: Literal[str]) -> Literal[str]: ...
+        @overload
+        def removesuffix(self, __suffix: str) -> str: ...
+
+    @overload
+    def rjust(self: Literal[str], __width: SupportsIndex, __fillchar: Literal[str] = ...) -> Literal[str]: ...
+    @overload
+    def rjust(self, __width: SupportsIndex, __fillchar: str = ...) -> str: ...
+
+    @overload
+    def rpartition(self: Literal[str], __sep: Literal[str]) -> tuple[Literal[str], Literal[str], Literal[str]]: ...
+    @overload
+    def rpartition(self, __sep: str) -> tuple[str, str, str]: ...
+
+    @overload
+    def rsplit(self: Literal[str], sep: Literal[str] | None = ..., maxsplit: SupportsIndex = ...) -> list[Literal[str]]: ...
+    @overload
+    def rsplit(self, sep: str | None = ..., maxsplit: SupportsIndex = ...) -> list[str]: ...
+
+    @overload
+    def rstrip(self: Literal[str], __chars: Literal[str] | None = ...) -> Literal[str]: ...
+    @overload
+    def rstrip(self, __chars: str | None = ...) -> str: ...
+
+    @overload
+    def split(self: Literal[str], sep: Literal[str] | None = ..., maxsplit: SupportsIndex = ...) -> list[Literal[str]]: ...
+    @overload
+    def split(self, sep: str | None = ..., maxsplit: SupportsIndex = ...) -> list[str]: ...
+
+    @overload
+    def splitlines(self: Literal[str], keepends: bool = ...) -> list[Literal[str]]: ...
+    @overload
+    def splitlines(self, keepends: bool = ...) -> list[str]: ...
+
+    @overload
+    def strip(self: Literal[str], __chars: Literal[str] | None = ...) -> Literal[str]: ...
+    @overload
+    def strip(self, __chars: str | None = ...) -> str: ...
+
+    @overload
+    def swapcase(self: Literal[str]) -> Literal[str]: ...
+    @overload
+    def swapcase(self) -> str: ...
+
+    @overload
+    def title(self: Literal[str]) -> Literal[str]: ...
+    @overload
+    def title(self) -> str: ...
+
+    @overload
+    def upper(self: Literal[str]) -> Literal[str]: ...
+    @overload
+    def upper(self) -> str: ...
+
+    @overload
+    def zfill(self: Literal[str], __width: SupportsIndex) -> Literal[str]: ...
+    @overload
+    def zfill(self, __width: SupportsIndex) -> str: ...
+
+    @overload
+    def __add__(self: Literal[str], __s: Literal[str]) -> Literal[str]: ...
+    @overload
+    def __add__(self, __s: str) -> str: ...
+
+    @overload
+    def __iter__(self: Literal[str]) -> Iterator[Literal[str]]: ...
+    @overload
+    def __iter__(self) -> Iterator[str]: ...
+
+    @overload
+    def __mod__(self: Literal[str], __x: Union[Literal[str], Tuple[Literal[str], ...]]) -> Literal[str]: ...
+    @overload
+    def __mod__(self, __x: Union[str, Tuple[str, ...]]) -> str: ...
+
+    @overload
+    def __mul__(self: Literal[str], __n: SupportsIndex) -> Literal[str]: ...
+    @overload
+    def __mul__(self, __n: SupportsIndex) -> str: ...
+
+    @overload
+    def __repr__(self: Literal[str]) -> Literal[str]: ...
+    @overload
+    def __repr__(self) -> str: ...
+
+    @overload
+    def __rmul__(self: Literal[str], __n: SupportsIndex) -> Literal[str]: ...
+    @overload
+    def __rmul__(self, __n: SupportsIndex) -> str: ...
+
+    @overload
+    def __str__(self: Literal[str]) -> Literal[str]: ...
+    @overload
+    def __str__(self) -> str: ...
+
+
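As a rough illustration of how these overloads would be used, the sketch below composes a query using only methods from the list above. The function ``build_select`` is hypothetical, and the type comments describe what a checker implementing this proposal would be expected to infer; the ``Literal[str]`` spelling follows this PEP and is not accepted by checkers that lack support for it. Annotations are deferred so the code runs regardless.

```python
from __future__ import annotations

from typing import Literal


def build_select(table: Literal[str], columns: list[Literal[str]]) -> Literal[str]:
    """Hypothetical helper that builds a query from literal pieces only."""
    # join: Literal[str] separator and Iterable[Literal[str]] -> Literal[str]
    column_list = ", ".join(columns)
    # format: Literal[str] template and Literal[str] arguments -> Literal[str]
    query = "  SELECT {} FROM {}  ".format(column_list, table)
    # strip: Literal[str] receiver -> Literal[str]
    return query.strip()


query = build_select("users", ["id", "name"])
print(query)
```

If any argument had type ``str`` instead, each call would fall through to the second overload and the result would be inferred as ``str``.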
+Appendix D: Guidelines for using ``Literal[str]`` in Stubs
+==========================================================
+
+Libraries that do not contain type annotations within their source may
+specify type stubs in Typeshed. Libraries written in other languages,
+such as those for machine learning, may also provide Python type
+stubs. In both cases, the type checker cannot verify that the type
+annotations match the source code and must trust the type stub. Thus,
+authors of type stubs need to be careful when using ``Literal[str]``
+since a function may falsely appear to be safe when it is not.
+
+We recommend the following guidelines for using ``Literal[str]`` in stubs:
+
++ If the stub is for a function, we recommend using ``Literal[str]``
+  in the return type of the function or of its overloads only if all
+  the corresponding arguments have literal types (i.e.,
+  ``Literal[str]`` or ``Literal["a", "b"]``).
+
+  ::
+
+      # OK
+      @overload
+      def my_transform(x: Literal[str], y: Literal["a", "b"]) -> Literal[str]: ...
+      @overload
+      def my_transform(x: str, y: str) -> str: ...
+
+      # Not OK
+      @overload
+      def my_transform(x: Literal[str], y: str) -> Literal[str]: ...
+      @overload
+      def my_transform(x: str, y: str) -> str: ...
+
++ If the stub is for a ``staticmethod``, we recommend the same
+  guideline as above.
+
++ If the stub is for any other kind of method, we recommend against
+  using ``Literal[str]`` in the return type of the method or any of
+  its overloads. This is because, even if all the explicit arguments
+  have type ``Literal[str]``, the object itself may be created using
+  user data and thus the return type may be user-controlled.
+
++ If the stub is for a class attribute or global variable, we also
+  recommend against using ``Literal[str]`` because untyped code
+  may write arbitrary values to the attribute.
+
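The guideline about methods can be illustrated with a small runnable sketch. The class ``Greeter`` and its method ``greet`` are hypothetical and stand in for a library whose source the type checker cannot see.

```python
class Greeter:
    """Hypothetical library class whose state may come from user input."""

    def __init__(self, name: str) -> None:
        self._name = name

    def greet(self, prefix: str) -> str:
        # Even when ``prefix`` is a string literal, the result embeds
        # ``self._name``, so a stub for this method should not declare
        # the return type as ``Literal[str]``.
        return prefix + self._name


# Stand-in for data received from an untrusted source.
user_supplied = "Robert'); DROP TABLE students;--"
result = Greeter(user_supplied).greet("Hello, ")
print(result)
```

Here the only explicit argument is a literal, yet the returned string contains user-controlled data through the object's state.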
+However, we leave the final call to the library author. They may use
+``Literal[str]`` if they feel confident that the string returned by
+the method or function or the string stored in the attribute is
+guaranteed to have a literal type, i.e., the string is created by
+applying only literal-preserving ``str`` operations to a string
+literal.
+
+Note that these guidelines do not apply to inline type annotations
+since the type checker can verify that, say, a method returning
+``Literal[str]`` does in fact return an expression of that type.
+
+
 Resources
 =========

@@ -936,7 +1228,8 @@ Thanks

 Thanks to the following people for their feedback on the PEP:

-Edward Qiu, Jia Chen, Shannon Zhu, Gregory P. Smith, Никита Соболев, and Shengye Wan
+Edward Qiu, Jia Chen, Shannon Zhu, Gregory P. Smith, Никита Соболев,
+CAM Gerlach, and Shengye Wan

 Copyright
 =========