PEP 675: Arbitrary literal strings (#2167)
parent 7d8c2a104a
commit 21f6993114

@@ -529,6 +529,7 @@ pep-0671.rst @rosuav
pep-0672.rst @encukou
pep-0673.rst @jellezijlstra
pep-0674.rst @vstinner
pep-0675.rst @jellezijlstra
# ...
# pep-0754.txt
# ...

@@ -0,0 +1,950 @@
PEP: 675
Title: Arbitrary Literal Strings
Version: $Revision$
Last-Modified: $Date$
Author: Pradeep Kumar Srinivasan <gohanpra@gmail.com>, Graham Bleaney <gbleaney@gmail.com>
Sponsor: Jelle Zijlstra <jelle.zijlstra@gmail.com>
Discussions-To: Typing-Sig <typing-sig@python.org>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 30-Nov-2021
Python-Version: 3.11
Post-History:

Abstract
========

There is currently no way to specify that a function parameter can be
of any literal string type; we have to specify the precise literal
string, such as ``Literal["foo"]``. This PEP introduces a supertype of
literal string types: ``Literal[str]``. This allows a function to
accept arbitrary literal string types such as ``Literal["foo"]`` or
``Literal["bar"]``.

Motivation
==========

A common security vulnerability is for a program to include
user-controlled data in a command it executes. For example, a naive
way to look up a user record from a database is to accept a user id
and insert it into a predefined SQL query:

::

    def query_user(conn: Connection, user_id: str) -> User:
        query = f"SELECT * FROM data WHERE user_id = {user_id}"
        conn.execute(query)

    query_user(conn, "user123")  # OK.

However, the user-controlled data ``user_id`` is being mixed with the
SQL command string, which means a malicious user could run arbitrary
SQL commands:

::

    # Delete the table.
    query_user(conn, "user123; DROP TABLE data;")

    # Fetch all users (since 1 = 1 is always true).
    query_user(conn, "user123 OR 1 = 1")

To prevent such SQL injection attacks, SQL APIs offer parameterized
queries, which separate the executed query from user-controlled data
and make it impossible to run arbitrary queries. For example, with
`sqlite3 <https://docs.python.org/3/library/sqlite3.html>`_, our
original function would be written safely as a query with parameters:

::

    def query_user(conn: Connection, user_id: str) -> User:
        query = "SELECT * FROM data WHERE user_id = ?"
        conn.execute(query, (user_id,))

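As an illustrative sketch (not part of the proposal), the script below
contrasts the two styles against an in-memory ``sqlite3`` database. The
table contents, the quoting of ``user_id`` in the unsafe query, and the
helper names ``query_user_unsafe`` and ``query_user_safe`` are
assumptions made for this demo only.

```python
import sqlite3

# In-memory database with made-up rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (user_id TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO data VALUES (?, ?)",
    [("user123", "Alice"), ("user456", "Bob")],
)


def query_user_unsafe(conn: sqlite3.Connection, user_id: str) -> list:
    # The user-controlled value becomes part of the SQL text itself.
    query = f"SELECT name FROM data WHERE user_id = '{user_id}' ORDER BY name"
    return conn.execute(query).fetchall()


def query_user_safe(conn: sqlite3.Connection, user_id: str) -> list:
    # The SQL text is a fixed literal; the value travels separately.
    query = "SELECT name FROM data WHERE user_id = ? ORDER BY name"
    return conn.execute(query, (user_id,)).fetchall()


# Hypothetical attacker input that closes the quote and adds a tautology.
payload = "nobody' OR '1' = '1"

leaked = query_user_unsafe(conn, payload)   # matches every row
nothing = query_user_safe(conn, payload)    # treated as an odd-looking id
```

With the parameterized version the payload is compared as plain data,
so it matches no row.
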
The problem is that there is no way to enforce this
discipline. sqlite3's own `documentation
<https://docs.python.org/3/library/sqlite3.html>`_ can only admonish
the reader to not dynamically build the ``sql`` argument from external
input; the API's authors cannot express that through the type
system. Users can (and often do) still use a convenient f-string as
before and leave their code vulnerable to SQL injection.

Existing tools, such as the popular security linter `Bandit
<https://github.com/PyCQA/bandit/blob/aac3f16f45648a7756727286ba8f8f0cf5e7d408/bandit/plugins/django_sql_injection.py#L102>`_,
attempt to detect unsafe external data used in SQL APIs, by inspecting
the AST or by other semantic pattern-matching. These tools, however,
preclude common idioms like storing a large multi-line query in a
variable before executing it, adding literal string modifiers to the
query based on some conditions, or transforming the query string using
a function. (We survey existing tools in the "Rejected Alternatives"
section.) For example, many tools will report a false positive for
this benign snippet:

::

    def query_data(conn: Connection, user_id: str, limit: bool) -> None:
        query = """
            SELECT
                user.name,
                user.age
            FROM data
            WHERE user_id = ?
        """
        if limit:
            query += " LIMIT 1"

        conn.execute(query, (user_id,))

We want to forbid harmful execution of user-controlled data while
still allowing benign idioms like the above and not requiring extra
user work.

To meet this goal, we introduce the ``Literal[str]`` type, which only
accepts string values that are known to be made of literals. This is a
generalization of the ``Literal["foo"]`` type from `PEP 586
<https://www.python.org/dev/peps/pep-0586/>`_. A string of type
``Literal[str]`` cannot contain user-controlled data. Thus, any API
that only accepts ``Literal[str]`` will be immune to injection
vulnerabilities (with pragmatic `limitations <Appendix B:
Limitations_>`_).

Since we want the ``sqlite3`` ``execute`` method to disallow strings
built with user input, we would make its `typeshed stub
<https://github.com/python/typeshed/blob/1c88ceeee924ec6cfe05dd4865776b49fec299e6/stdlib/sqlite3/dbapi2.pyi#L153>`_
accept a ``sql`` query that is of type ``Literal[str]``:

::

    def execute(self, sql: Literal[str], parameters: Iterable[str] = ...) -> Cursor: ...

This successfully forbids our unsafe SQL example. The variable
``query`` below is inferred to have type ``str``, since it is created
from a format string using ``user_id``, and cannot be passed to
``execute``:

::

    def query_user(conn: Connection, user_id: str) -> User:
        query = f"SELECT * FROM data WHERE user_id = {user_id}"
        conn.execute(query)
        # Error: Expected Literal[str], got str.

The method remains flexible enough to allow our more complicated
example:

::

    def query_data(conn: Connection, user_id: str, limit: bool) -> None:
        # This is a literal string.
        query = """
            SELECT
                user.name,
                user.age
            FROM data
            WHERE user_id = ?
        """

        if limit:
            # Still has type Literal[str] because we added a literal string.
            query += " LIMIT 1"

        conn.execute(query, (user_id,))  # OK

Notice that the user did not have to change their SQL code at all. The
type checker was able to infer the literal string type and complain
only in case of violations. The ``Literal[str]`` type is also useful
in other cases where we want strict command-data separation, such as
when building shell commands or when rendering a string into an HTML
response without escaping (e.g., via Django's ``mark_safe``
function). Overall, this combination of strictness and flexibility
makes it easy to enforce safer API usage in sensitive code without
burdening users.

Usage statistics
----------------

In a sample of open-source projects using ``sqlite3``, we found that
``conn.execute`` was called `~67%
<https://grep.app/search?q=conn%5C.execute%5C%28%5Cs%2A%5B%27%22%5D&regexp=true&filter[lang][0]=Python>`_
of the time with a safe string literal and `~33%
<https://grep.app/search?current=3&q=conn%5C.execute%5C%28%5Ba-zA-Z_%5D%2B%5C%29&regexp=true&filter[lang][0]=Python>`_
of the time with an unsafe, dynamically-built local variable. Using
this PEP's literal string type along with a type checker would have
prevented ``execute`` from being called in such an unsafe manner.

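For readers who want to see what the two searches match, the sketch
below URL-decodes the ``q`` parameters from the links above and applies
them with Python's ``re`` module. The sample source lines and the
``classify`` helper are made up for illustration; note that the second
pattern only detects a bare variable argument and cannot by itself tell
whether that variable was built safely.

```python
import re
from urllib.parse import unquote

# The regular expressions exactly as they appear (percent-encoded) in the
# two search links.
LITERAL_CALL = re.compile(unquote("conn%5C.execute%5C%28%5Cs%2A%5B%27%22%5D"))
VARIABLE_CALL = re.compile(unquote("conn%5C.execute%5C%28%5Ba-zA-Z_%5D%2B%5C%29"))


def classify(line: str) -> str:
    """Label a source line by which search pattern (if any) it matches."""
    if LITERAL_CALL.search(line):
        return "literal"
    if VARIABLE_CALL.search(line):
        return "variable"
    return "other"


# Made-up sample lines, one per category.
samples = [
    'conn.execute("SELECT * FROM data")',
    "conn.execute(query)",
    "conn.execute(query, params)",
]
labels = [classify(line) for line in samples]
```
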
Rationale
=========

Firstly, why use *types* to prevent security vulnerabilities?

Warning users in documentation is insufficient: most users either
never see these warnings or ignore them. Using an existing dynamic or
static analysis approach is too restrictive: these prevent natural
idioms, as we saw in the `Motivation`_ section (and will discuss more
extensively in the `Rejected Alternatives`_ section). The typing-based
approach in this PEP strikes a user-friendly balance between
strictness and flexibility.

Runtime approaches do not work because, at runtime, the query string
is a plain ``str``. While we could prevent some exploits using
heuristics, such as regex-filtering for obviously malicious payloads,
there will always be a way to work around them (perfectly
distinguishing good and bad queries reduces to the halting problem).

Static approaches like checking the AST to see if the query string is
a literal string expression cannot tell when a string is assigned to
an intermediate variable or when it is transformed by a benign
function. This makes them overly restrictive.

The type checker, surprisingly, does better than both because it has
access to information not available in the runtime or static analysis
approaches. Specifically, the type checker can tell us whether an
expression has a literal string type, say ``Literal["foo"]``. The type
checker already propagates types across variable assignments or
function calls.

In the current type system itself, if the SQL or shell command
execution function only accepted three possible input strings, our job
would be done. We would just say:

::

    def execute(query: Literal["foo", "bar", "baz"]) -> None: ...

But, of course, ``execute`` can accept *any* possible query. How do we
ensure that the query does not contain an arbitrary, user-controlled
string?

We want to specify that the value must be of some type
``Literal[<...>]`` where ``<...>`` is some string. This is what
``Literal[str]`` represents. ``Literal[str]`` is the "supertype" of
all literal string types. Any particular literal string such as
``Literal["foo"]`` or ``Literal["bar"]`` is compatible with
``Literal[str]``, but not the other way around. The "supertype" of
``Literal[str]`` itself is ``str``. So, ``Literal[str]`` itself is
compatible with ``str``, but not the other way around. In effect, this
PEP just introduces a type in the type hierarchy between
``Literal["foo"]`` and ``str``.

Note that a ``Union`` of literal types is naturally compatible with
``Literal[str]`` because each element of the ``Union`` is individually
compatible with ``Literal[str]``. So, ``Literal["foo", "bar"]`` is
compatible with ``Literal[str]``.

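The hierarchy described above can be modelled with a small runtime
sketch. This is only a toy for building intuition, not how a type
checker is implemented; the helper names ``is_literal_string_type`` and
``is_compatible`` are hypothetical, and the code relies on the fact
that ``Literal[str]`` can be constructed and introspected at runtime.

```python
from typing import Any, Literal, get_args, get_origin


def is_literal_string_type(tp: Any) -> bool:
    """True for Literal[str] and for Literal[...] whose members are all strings."""
    if get_origin(tp) is not Literal:
        return False
    return all(arg is str or isinstance(arg, str) for arg in get_args(tp))


def is_compatible(source: Any, target: Any) -> bool:
    """Toy check: may a value of type `source` be used where `target` is expected?"""
    if target is str:
        # str accepts plain str and every literal string type.
        return source is str or is_literal_string_type(source)
    if get_origin(target) is Literal and get_args(target) == (str,):
        # Literal[str] accepts any literal string type, but not plain str.
        return is_literal_string_type(source)
    if get_origin(target) is Literal and get_origin(source) is Literal:
        # A specific Literal target accepts only the values it lists.
        return set(get_args(source)) <= set(get_args(target))
    return source == target
```

For instance, ``is_compatible(Literal["foo", "bar"], Literal[str])``
holds under this model, while ``is_compatible(str, Literal[str])`` does
not.
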
However, recall that we don't just want to represent exact literal
queries. We also want to support composition of two literal strings,
such as ``query + " LIMIT 1"``. This too is possible with the above
concept. If ``x`` and ``y`` are two values of type ``Literal[str]``,
then ``x + y`` will also be of type compatible with
``Literal[str]``. We can reason about this by looking at specific
instances such as ``Literal["foo"]`` and ``Literal["bar"]``; the value
of the added string ``x + y`` can only be ``"foobar"``, which has type
``Literal["foobar"]`` and is thus compatible with
``Literal[str]``. The same reasoning applies when ``x`` and ``y`` are
unions of literal types; the result of pairwise adding any two literal
types from ``x`` and ``y`` respectively is a literal type, which means
that the overall result is a ``Union`` of literal types and is thus
compatible with ``Literal[str]``.

In this way, we are able to leverage Python's concept of a ``Literal``
string type to specify that our API can only accept strings that are
known to be constructed from literals. More specific details follow in
the remaining sections.

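The pairwise-addition argument can be worked through concretely. The
sketch below enumerates every concatenation of members drawn from two
unions of literal types and packs the results back into a ``Literal``.
The helper ``add_literal_unions`` is hypothetical and exists only to
illustrate the reasoning; a type checker would not enumerate values
this way.

```python
from itertools import product
from typing import Any, Literal, get_args


def add_literal_unions(x: Any, y: Any) -> Any:
    """Return the Literal type containing every pairwise concatenation."""
    pairs = product(get_args(x), get_args(y))
    # Subscripting Literal with a tuple is the same as listing its members.
    return Literal[tuple(left + right for left, right in pairs)]


combined = add_literal_unions(Literal["foo", "bar"], Literal["a", "b"])
members = get_args(combined)
```

Every member of ``combined`` is itself a string literal, which is why
the sum remains compatible with ``Literal[str]``.
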
Valid Locations for ``Literal[str]``
=========================================

``Literal[str]`` can be used where any other type can be used:

::

    variable_annotation: Literal[str]

    def my_function(literal_string: Literal[str]) -> Literal[str]: ...

    class Foo:
        my_attribute: Literal[str]

    type_argument: List[Literal[str]]

    T = TypeVar("T", bound=Literal[str])

It can be nested within unions of ``Literal`` types:

::

    union: Literal["hello", Literal[str]]
    union2: Literal["hello", str]
    union3: Literal[str, 4]

    nested_literal_string: Literal[Literal[str]]

The restrictions on the parameters of ``Literal`` are the same as in
`PEP 586 <https://www.python.org/dev/peps/pep-0586/>`_. The only legal
parameter is the literal value ``str``. Other values are rejected even
if they evaluate to the same value (``str``), such as
``Literal[(lambda x: x)(str)]``.

Type Inference
==============


Inferring ``Literal[str]``
--------------------------

Any literal string type is compatible with ``Literal[str]``. For
example, ``x: Literal[str] = "foo"`` is valid because ``"foo"`` is
inferred to be of type ``Literal["foo"]``.

As per the `Rationale`_, we also infer ``Literal[str]`` in the
following cases:

+ Addition: ``x + y`` is of type ``Literal[str]`` if both ``x`` and
  ``y`` are compatible with ``Literal[str]``.

+ Joining: ``sep.join(xs)`` is of type ``Literal[str]`` if ``sep``'s
  type is compatible with ``Literal[str]`` and ``xs``'s type is
  compatible with ``Iterable[Literal[str]]``.

+ In-place addition: If ``s`` has type ``Literal[str]`` and ``x`` has
  type compatible with ``Literal[str]``, then ``s += x`` preserves
  ``s``'s type as ``Literal[str]``.

+ String formatting: An f-string has type ``Literal[str]`` if and only
  if its constituent expressions are literal strings. ``s.format(...)``
  has type ``Literal[str]`` if and only if ``s`` and the arguments have
  types compatible with ``Literal[str]``.

In all other cases, if one or more of the composed values has a
non-literal type ``str``, the composition of types will have type
``str``. For example, if ``s`` has type ``str``, then ``"hello" + s``
has type ``str``. This matches the pre-existing behavior of type
checkers.

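The composition rules above share one shape: the result is
``Literal[str]`` only when every participating operand is. A minimal
model of that rule, using a hypothetical ``Kind`` enum and
``infer_*`` helpers that are not part of any type checker, might look
like this:

```python
from enum import Enum
from typing import Iterable


class Kind(Enum):
    """The two string types distinguished by this toy model."""

    LITERAL = "Literal[str]"
    STR = "str"


def _all_literal(kinds: Iterable[Kind]) -> bool:
    return all(kind is Kind.LITERAL for kind in kinds)


def infer_add(x: Kind, y: Kind) -> Kind:
    # Covers both ``x + y`` and ``x += y``.
    return Kind.LITERAL if _all_literal([x, y]) else Kind.STR


def infer_join(sep: Kind, items: Iterable[Kind]) -> Kind:
    # ``sep.join(items)``: the separator and every item must be literal.
    return Kind.LITERAL if _all_literal([sep, *items]) else Kind.STR


def infer_format(template: Kind, args: Iterable[Kind]) -> Kind:
    # ``template.format(*args)`` or an f-string with these expressions.
    return Kind.LITERAL if _all_literal([template, *args]) else Kind.STR
```

A single ``str`` operand anywhere in the composition is enough to make
the result ``str``.
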
``Literal[str]`` is compatible with the type ``str``. It inherits all
methods from ``str``. So, if we have a variable ``s`` of type
``Literal[str]``, it is safe to write ``s.startswith("hello")``.

Note that, beyond the few composition rules mentioned above, this PEP
doesn't change inference for other ``str`` methods such as
``literal_string.upper()``.

Some type checkers refine the type of a string when doing an equality
check:

::

    def foo(s: str) -> None:
        if s == "bar":
            reveal_type(s)  # => Literal["bar"]

Such a refined type in the if-block is also compatible with
``Literal[str]`` because its type is ``Literal["bar"]``.

Examples
--------

See the examples below to help clarify the above rules:

::

    literal_string: Literal[str]
    s: str = literal_string  # OK

    literal_string: Literal[str] = s  # Error: Expected Literal[str], got str.
    literal_string: Literal[str] = "hello"  # OK


    def expect_literal_str(s: Literal[str]) -> None: ...

Addition of literal strings:

::

    expect_literal_str("foo" + "bar")  # OK
    expect_literal_str(literal_string + "bar")  # OK
    literal_string2: Literal[str]
    expect_literal_str(literal_string + literal_string2)  # OK
    plain_str: str
    expect_literal_str(literal_string + plain_str)  # Not OK.

Join using literal strings:

::

    expect_literal_str(",".join(["foo", "bar"]))  # OK
    expect_literal_str(literal_string.join(["foo", "bar"]))  # OK
    expect_literal_str(literal_string.join([literal_string, literal_string2]))  # OK
    xs: List[Literal[str]]
    expect_literal_str(literal_string.join(xs))  # OK
    expect_literal_str(plain_str.join([literal_string, literal_string2]))
    # Not OK because the separator has type ``str``.

In-place addition using literal strings:

::

    literal_string += "foo"  # OK
    literal_string += literal_string2  # OK
    literal_string += plain_str  # Not OK

Format strings using literal strings:

::

    literal_name: Literal[str]
    expect_literal_str(f"hello {literal_name}")
    # OK because it is composed from literal strings.

    expect_literal_str("hello {}".format(literal_name))  # OK

    expect_literal_str(f"hello")  # OK

    expect_literal_str(f"hello {username}")
    # NOT OK. The format-string is constructed from ``username``,
    # which has type ``str``.

    expect_literal_str("hello {}".format(username))  # Not OK

Other literal types, such as literal integers, are not compatible with ``Literal[str]``:

::

    some_int: int
    expect_literal_str(some_int)  # Error: Expected Literal[str], got int.

    literal_one: Literal[1] = 1
    expect_literal_str(literal_one)  # Error: Expected Literal[str], got Literal[1].

We can call functions on literal strings:

::

    def add_limit(query: Literal[str]) -> Literal[str]:
        return query + " LIMIT 1"

    def my_query(query: Literal[str], user_id: str) -> None:
        sql_connection().execute(add_limit(query), (user_id,))  # OK

Conditional statements and expressions work as expected:

::

    def return_literal_str() -> Literal[str]:
        return "foo" if condition1() else "bar"  # OK

    def return_literal_str2(literal_str: Literal[str]) -> Literal[str]:
        return "foo" if condition1() else literal_str  # OK

    def return_literal_str3() -> Literal[str]:
        if condition1():
            result: Literal["foo"] = "foo"
        else:
            result: Literal[str] = "bar"

        return result  # OK

Interaction with TypeVars and Generics
--------------------------------------

TypeVars can be bound to ``Literal[str]``:

::

    from typing import Literal, TypeVar

    TLiteral = TypeVar("TLiteral", bound=Literal[str])

    def literal_identity(s: TLiteral) -> TLiteral:
        return s

    hello: Literal["hello"] = "hello"
    y = literal_identity(hello)
    reveal_type(y)  # => Literal["hello"]

    s: Literal[str]
    y2 = literal_identity(s)
    reveal_type(y2)  # => Literal[str]

    s_error: str
    literal_identity(s_error)
    # Error: Expected TLiteral (bound to Literal[str]), got str.

``Literal[str]`` can be used as type arguments for generic classes:

::

    class Container(Generic[T]):
        def __init__(self, value: T) -> None:
            self.value = value

    literal_str: Literal[str] = "hello"
    x: Container[Literal[str]] = Container(literal_str)  # OK

    s: str
    x_error: Container[Literal[str]] = Container(s)  # Not OK

Standard containers like ``List`` work as expected:

::

    xs: List[Literal[str]] = ["foo", "bar", "baz"]

Interactions with Overloads
---------------------------

Literal strings and overloads do not need to interact in a special
way: the existing rules work fine. ``Literal[str]`` can be used as a
fallback overload where a specific ``Literal["foo"]`` type does not
match:

::

    @overload
    def foo(x: Literal["foo"]) -> int: ...
    @overload
    def foo(x: Literal[str]) -> bool: ...
    @overload
    def foo(x: str) -> str: ...

    x1: int = foo("foo")  # First overload.
    x2: bool = foo("bar")  # Second overload.
    s: str
    x3: str = foo(s)  # Third overload.

Backwards Compatibility
-----------------------

As PEP 586 `mentions
<https://www.python.org/dev/peps/pep-0586/#backwards-compatibility>`_,
type checkers "should feel free to experiment with more sophisticated
inference techniques". So, if the type checker infers a literal string
type for an unannotated variable that is initialized with a literal
string, the following example should be OK:

::

    x = "hello"
    expect_literal_str(x)
    # OK, because x is inferred to have type ``Literal["hello"]``.

This enables precise type checking of idiomatic SQL query code without
annotating the code at all (as seen in the `Motivation`_ section
example).

However, like PEP 586, this PEP does not mandate the above inference
strategy. In case the type checker doesn't infer ``x`` to have type
``Literal["hello"]``, users can aid the type checker by explicitly
annotating it as ``x: Literal[str]``:

::

    x: Literal[str] = "hello"
    expect_literal_str(x)

Runtime behavior
================

This PEP does not change the runtime behavior of ``Literal``.

Backwards compatibility
=======================

``Literal[str]`` is acceptable at runtime, so
this doesn't require any changes to the Python runtime itself. PEP 586
already backports ``Literal``, so this PEP does not need to change it.

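A short sketch of what "acceptable at runtime" means in practice:
``Literal[str]`` can be constructed and introspected with the existing
``typing`` helpers, and, like any annotation, it is not enforced when
the function is called. The function body and the ``dynamic`` value
below are made up for the demonstration.

```python
from typing import Literal, get_args, get_origin


def expect_literal_str(s: Literal[str]) -> str:
    # The annotation is metadata only; nothing is checked at call time.
    return s


# Built at runtime, so a type checker following this PEP would see ``str``.
dynamic = "".join(["he", "llo"])
result = expect_literal_str(dynamic)  # No runtime error is raised.

origin = get_origin(Literal[str])
args = get_args(Literal[str])
```

Only a static type checker distinguishes the two kinds of string; the
interpreter treats both as plain ``str`` objects.
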
Rejected Alternatives
=====================

Why not use tool X?
-------------------

Focusing solely on the example of preventing SQL injection, tooling to
catch this kind of issue seems to come in three flavors: AST-based,
function-level analysis, and taint flow analysis.

**AST-based tools include Bandit**: `Bandit
<https://github.com/PyCQA/bandit/blob/aac3f16f45648a7756727286ba8f8f0cf5e7d408/bandit/plugins/django_sql_injection.py#L102>`_
has a plugin to warn when SQL queries are not literal
strings. The problem is that many perfectly safe SQL
queries are dynamically built out of string literals, as shown in the
`Motivation`_ section. At the
AST level, the resultant SQL query is not going to appear as a string
literal anymore and is thus indistinguishable from a potentially
malicious string. To use these tools would require significantly
restricting developers' ability to build SQL queries. ``Literal[str]``
can provide similar safety guarantees with fewer restrictions.

**Semgrep and pyanalyze**: Semgrep supports a more sophisticated
function-level analysis, including `constant propagation
<https://semgrep.dev/docs/writing-rules/data-flow/#constant-propagation>`_
within a function. This allows us to prevent injection attacks while
permitting some forms of safe dynamic SQL queries within a
function. `pyanalyze
<https://github.com/quora/pyanalyze/blob/afcb58cd3e967e4e3fea9e57bb18b6b1d9d42ed7/README.md#extending-pyanalyze>`_
has a similar extension. But neither handles function calls that
construct and return safe SQL queries. For example, in the code sample
below, ``build_insert_query`` is a helper function to create a query
that inserts multiple values into the corresponding columns. Semgrep
and pyanalyze forbid this natural usage whereas ``Literal[str]``
handles it with no burden on the programmer:

::

    def build_insert_query(
        table: Literal[str],
        insert_columns: Iterable[Literal[str]],
    ) -> Literal[str]:
        sql = "INSERT INTO " + table

        # Materialize the iterable so that its length can be taken.
        columns = list(insert_columns)
        column_clause = ", ".join(columns)
        value_clause = ", ".join(["?"] * len(columns))

        sql += f" ({column_clause}) VALUES ({value_clause})"
        return sql

    def insert_data(
        conn: Connection,
        kvs_to_insert: Dict[Literal[str], str]
    ) -> None:
        query = build_insert_query("data", kvs_to_insert.keys())
        # sqlite3 expects a sequence of parameters, not a dict view.
        conn.execute(query, tuple(kvs_to_insert.values()))

    # Example usage
    data_to_insert = {
        "column_1": value_1,  # Note: values are not literals
        "column_2": value_2,
        "column_3": value_3,
    }
    insert_data(conn, data_to_insert)

**Taint flow analysis**: Tools such as `Pysa
<https://pyre-check.org/docs/pysa-basics/>`_ or `CodeQL
<https://codeql.github.com/>`_ are capable of tracking data flowing
from a user-controlled input into a SQL query. These tools are
powerful but involve considerable overhead in setting up the tool in
CI, defining "taint" sinks and sources, and teaching developers how to
use them. They also usually take longer to run than a type checker
(minutes instead of seconds), which means feedback is not
immediate. Finally, they move the burden of preventing vulnerabilities
on to library users instead of allowing the libraries themselves to
specify precisely how their APIs must be called (as is possible with
``Literal[str]``).

Why not use a ``NewType`` for ``str``?
--------------------------------------

Any API for which ``Literal[str]`` would be suitable could instead be
updated to accept a different type created within the Python type
system, such as ``NewType("SafeSQL", str)``:

::

    SafeSQL = NewType("SafeSQL", str)

    def execute(self, sql: SafeSQL, parameters: Iterable[str] = ...) -> Cursor: ...

    execute(SafeSQL("SELECT * FROM data WHERE user_id = ?"), (user_id,))  # OK

    user_query: str
    execute(user_query)  # Error: Expected SafeSQL, got str.

Having to create a new type to call an API might give some developers
pause and encourage more caution, but it doesn't guarantee that
developers won't just turn a user-controlled string into the new type,
and pass it into the modified API anyway:

::

    query = f"SELECT * FROM data WHERE user_id = {user_id}"
    execute(SafeSQL(query))  # No error!

We are back to square one with the problem of preventing arbitrary
inputs to ``SafeSQL``. This is not a theoretical concern
either. Django uses the above approach with ``SafeString`` and
`mark_safe
<https://docs.djangoproject.com/en/dev/_modules/django/utils/safestring/#SafeString>`_. Issues
such as `CVE-2020-13596
<https://github.com/django/django/commit/2dd4d110c159d0c81dff42eaead2c378a0998735>`_
show how this technique can `fail
<https://nvd.nist.gov/vuln/detail/CVE-2020-13596>`_.

Also note that this requires invasive changes to the source code
(wrapping the query with ``SafeSQL``) whereas ``Literal[str]``
requires no such changes. Users can remain oblivious to it as long as
they pass in literal strings to sensitive APIs.

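The weakness of the ``NewType`` approach is easy to observe at runtime:
calling the new type simply hands back its argument. The sketch below
uses a made-up ``user_id`` payload; nothing is executed against a
database.

```python
from typing import NewType

SafeSQL = NewType("SafeSQL", str)

# Hypothetical attacker-supplied value.
user_id = "user123 OR 1 = 1"
query = f"SELECT * FROM data WHERE user_id = {user_id}"

# Wrapping performs no validation and no conversion: the very same
# ``str`` object comes back, tainted contents included.
wrapped = SafeSQL(query)
```

Because ``wrapped`` is the original string, the "safe" label carries no
guarantee beyond the developer's intent.
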
Why not try to emulate Trusted Types?
-------------------------------------

`Trusted Types
<https://w3c.github.io/webappsec-trusted-types/dist/spec/>`_ is a W3C
specification for preventing DOM-based Cross Site Scripting (XSS). XSS
occurs when dangerous browser APIs accept raw user-controlled
strings. The specification modifies these APIs to accept only the
"Trusted Types" returned by designated sanitizing functions. These
sanitizing functions must take in a potentially malicious string and
validate it or render it benign somehow, for example by verifying that
it is a valid URL or HTML-encoding it.

It can be tempting to assume porting the concept of Trusted Types to
Python could solve the problem. The fundamental difference, however,
is that the output of a Trusted Types sanitizer is usually intended
*to not be executable code*. Thus it's easy to HTML encode the input,
strip out dangerous tags, or otherwise render it inert. With a SQL
query or shell command, the end result *still needs to be executable
code*. There is no way to write a sanitizer that can reliably figure
out which parts of an input string are benign and which ones are
potentially malicious.

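To make the contrast concrete, here is how little it takes to render
HTML input inert with the standard library's ``html.escape``. The
``user_input`` value is a made-up example; no comparable one-line
transformation exists for a SQL query that must remain executable.

```python
import html

# Hypothetical user-supplied markup.
user_input = "<script>alert(1)</script>"

# Escaping replaces the markup characters with entities, so a browser
# would display the text instead of running it.
escaped = html.escape(user_input)
```
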
Runtime Checkable ``Literal[str]``
----------------------------------

The ``Literal[str]`` concept could be extended beyond static type
checking to be a runtime checkable property of ``str`` objects. This
would provide some benefits, such as allowing frameworks to raise
errors on dynamic strings. Such runtime errors would be a more robust
defense mechanism than type errors, which can potentially be
suppressed, ignored, or never even seen if the author does not use a
type checker.

This extension to the ``Literal[str]`` concept would dramatically
increase the scope of the proposal by requiring changes to one of the
most fundamental types in Python. While runtime taint checking on
strings has been `considered <https://bugs.python.org/issue500698>`_
and `attempted <https://github.com/felixgr/pytaint>`_ in the past, and
others may consider it in the future, such extensions are out of scope
for this PEP.

Reference Implementation
========================

This is implemented in Pyre v0.9.8 and is actively being used.

The implementation simply extends the type checker with
``Literal[str]`` as a supertype of literal string types.

To support composition via addition, join, etc., it was sufficient to
overload the stubs for ``str`` in Pyre's copy of typeshed. For
example, we replaced ``str`` ``__add__``:

::

    # Before:
    def __add__(self, s: str) -> str: ...

    # After:
    @overload
    def __add__(self: Literal[str], other: Literal[str]) -> Literal[str]: ...
    @overload
    def __add__(self, other: str) -> str: ...

This means that addition of non-literal string types continues to have
type ``str``. The only change is that addition of literal string types
now produces ``Literal[str]``.

One implementation strategy is to update the official Typeshed `stub
<https://github.com/python/typeshed/blob/aa7e277adb9049e24ea3434fc9848defbfa87673/stdlib/builtins.pyi#L420>`_
for ``str`` with these changes.

Appendix A: Other Uses
|
||||
======================
|
||||
|
||||
To simplify the discussion and require minimal security knowledge, we
|
||||
focused on SQL injections throughout the PEP. ``Literal[str]``,
|
||||
however, can also be used to prevent many other kinds of `injection
|
||||
vulnerabilities <https://owasp.org/www-community/Injection_Flaws>`_.
|
||||
|
Command Injection
-----------------

APIs such as ``subprocess.run`` accept a string which can be run as a
shell command:

::

    subprocess.run(f"echo 'Hello {name}'", shell=True)

If attacker-controlled data is included in the command string, a
command injection vulnerability exists and malicious operations can be
run. For example, a value of ``' && rm -rf / #`` would result in the
following destructive command being run:

::

    echo 'Hello ' && rm -rf / #'

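The string substitution can be checked without executing anything. The
sketch below assumes a hypothetical helper ``build_command`` that only
builds the command string from the example above and never passes it
to a shell:

```python
def build_command(name: str) -> str:
    # Same f-string as in the ``subprocess.run`` example; the command is
    # only constructed here, never executed.
    return f"echo 'Hello {name}'"


benign = build_command("World")
malicious = build_command("' && rm -rf / #")

print(benign)     # echo 'Hello World'
print(malicious)  # echo 'Hello ' && rm -rf / #'
```

The attacker's leading quote closes the quoted argument early, and the
trailing ``#`` comments out the quote that the template appends.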
This vulnerability could be prevented by updating ``run`` to only
accept ``Literal[str]`` when used in ``shell=True`` mode. Here is one
simplified stub:

::

    def run(command: Literal[str], *args: str, shell: bool=...): ...

Cross Site Scripting (XSS)
--------------------------

Most popular Python web frameworks, such as Django, use a templating
engine to produce HTML from user data. These templating languages
auto-escape user data before inserting it into the HTML template and
thus prevent cross site scripting (XSS) vulnerabilities.

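As a rough illustration of what auto-escaping achieves, the sketch
below uses the standard library's ``html.escape`` as a stand-in. This
is an assumption made for the example: it is not how Django or Jinja2
implement escaping internally, and the variable names are hypothetical.

```python
import html

# Hypothetical attacker-supplied value.
user_input = "<script>alert(1)</script>"

# Escaping turns markup characters into HTML entities, so the browser
# displays the text instead of executing it.
escaped = html.escape(user_input)
page = f"<p>Hello {escaped}</p>"

print(escaped)  # &lt;script&gt;alert(1)&lt;/script&gt;
```

The resulting page contains no ``<script>`` element, only its escaped
textual form.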
But a common way to `bypass auto-escaping
<https://django.readthedocs.io/en/stable/ref/templates/language.html#how-to-turn-it-off>`_
and render HTML as-is is to use functions like ``mark_safe`` in
`Django
<https://docs.djangoproject.com/en/dev/ref/utils/#django.utils.safestring.mark_safe>`_
or ``do_mark_safe`` in `Jinja2
<https://github.com/pallets/jinja/blob/main/src/jinja2/filters.py#L1264>`_,
which can cause XSS vulnerabilities:

::

    dangerous_string = django.utils.safestring.mark_safe(f"<script>{user_input}</script>")
    return dangerous_string

This vulnerability could be prevented by updating ``mark_safe`` to
only accept ``Literal[str]``:

::

    def mark_safe(s: Literal[str]) -> str: ...

Server Side Template Injection (SSTI)
-------------------------------------

Templating frameworks such as Jinja allow Python expressions which
will be evaluated and substituted into the rendered result:

::

    template_str = "There are {{ values | length }} values: {{ values }}"
    template = jinja2.Template(template_str)
    template.render(values=[1, 2])
    # Result: "There are 2 values: [1, 2]"

If an attacker controls all or part of the template string, they can
insert expressions which execute arbitrary code and `compromise
<https://www.onsecurity.io/blog/server-side-template-injection-with-jinja2/>`_
the application:

::

    malicious_str = "{{''.__class__.__base__.__subclasses__()[408]('rm -rf /',shell=True)}}"
    template = jinja2.Template(malicious_str)
    template.render()
    # Result: The shell command 'rm -rf /' is run

Template injection exploits like this could be prevented by updating
the ``Template`` API to only accept ``Literal[str]``:

::

    class Template:
        def __init__(self, source: Literal[str]): ...


Appendix B: Limitations
=======================

There are a number of ways ``Literal[str]`` could still fail to
prevent users from passing strings built from non-literal data to an
API:

1. If the developer does not use a type checker or does not add type
   annotations, then violations will go uncaught.

2. ``cast(Literal[str], non_literal_str)`` could be used to lie to the
   type checker and allow a dynamic string value to masquerade as a
   ``Literal[str]``. The same goes for a variable that has type ``Any``.

3. Comments such as ``# type: ignore`` could be used to ignore
   warnings about non-literal strings.

4. Trivial functions could be constructed to convert a ``str`` to a
   ``Literal[str]``:

   ::

       def make_literal(s: str) -> Literal[str]:
           letters: Dict[str, Literal[str]] = {
               "A": "A",
               "B": "B",
               ...
           }
           output: List[Literal[str]] = [letters[c] for c in s]
           return "".join(output)


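Limitations 2 and 4 can be reproduced at runtime with a short sketch.
The helper name ``launder_with_cast`` is hypothetical, and the lookup
table is deliberately cut down to two letters so the example stays
runnable. ``Literal[str]`` is the spelling proposed in this PEP and has
no runtime effect:

```python
from typing import Dict, List, Literal, cast


def launder_with_cast(s: str) -> Literal[str]:
    # cast() does nothing at runtime; it only changes what the type
    # checker believes about the value.
    return cast(Literal[str], s)


def make_literal(s: str) -> Literal[str]:
    # Two-letter table for brevity; any character outside it raises
    # KeyError in this sketch.
    letters: Dict[str, Literal[str]] = {"A": "A", "B": "B"}
    output: List[Literal[str]] = [letters[c] for c in s]
    return "".join(output)


# Stand-in for a dynamically built, non-literal string.
user_input = "".join(["A", "B", "B", "A"])

laundered = launder_with_cast(user_input)
rebuilt = make_literal(user_input)
```

In both cases the dynamic value is returned unchanged under a
``Literal[str]`` annotation, which is why these escape hatches are
visible artifacts for reviewers rather than something the type system
can rule out.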
We could mitigate the above using linting, code review, etc., but
ultimately a clever, malicious developer attempting to circumvent the
protections offered by ``Literal[str]`` will always succeed. The
important thing to remember is that ``Literal[str]`` is not intended
to protect against *malicious* developers; it is meant to protect
against benign developers accidentally using sensitive APIs in a
dangerous way (without getting in their way otherwise).

Without ``Literal[str]``, the best enforcement tool API authors have
is documentation, which is easily ignored and often not seen. With
``Literal[str]``, API misuse requires conscious thought and artifacts
in the code that reviewers and future developers can notice.

Resources
=========

Literal String Types in Scala
-----------------------------

Scala `uses
<https://www.scala-lang.org/api/2.13.x/scala/Singleton.html>`_
``Singleton`` as the supertype for singleton types, which includes
literal string types such as ``"foo"``. ``Singleton`` is Scala's
generalized analogue of this PEP's ``Literal[str]``.

Tamer Abdulradi showed how Scala's literal string types can be used
for "Preventing SQL injection at compile time" in the Scala Days talk
`Literal types: What are they good for?
<https://slideslive.com/38907881/literal-types-what-they-are-good-for>`_
(slides 52 to 68).

Thanks
------

Thanks to the following people for their feedback on the PEP:

Edward Qiu, Jia Chen, Shannon Zhu, Gregory P. Smith, Никита Соболев, and Shengye Wan

Copyright
=========

This document is placed in the public domain or under the
CC0-1.0-Universal license, whichever is more permissive.


..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End: