PEP 675: Arbitrary literal strings (#2167)

Pradeep Kumar 2021-12-01 09:57:41 -08:00 committed by GitHub
parent 7d8c2a104a
commit 21f6993114
2 changed files with 951 additions and 0 deletions

1
.github/CODEOWNERS vendored

@ -529,6 +529,7 @@ pep-0671.rst @rosuav
pep-0672.rst @encukou
pep-0673.rst @jellezijlstra
pep-0674.rst @vstinner
pep-0675.rst @jellezijlstra
# ...
# pep-0754.txt
# ...

950
pep-0675.rst Normal file

@ -0,0 +1,950 @@
PEP: 675
Title: Arbitrary Literal Strings
Version: $Revision$
Last-Modified: $Date$
Author: Pradeep Kumar Srinivasan <gohanpra@gmail.com>, Graham Bleaney <gbleaney@gmail.com>
Sponsor: Jelle Zijlstra <jelle.zijlstra@gmail.com>
Discussions-To: Typing-Sig <typing-sig@python.org>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 30-Nov-2021
Python-Version: 3.11
Post-History:
Abstract
========
There is currently no way to specify that a function parameter can be
of any literal string type; we have to specify the precise literal
string, such as ``Literal["foo"]``. This PEP introduces a supertype of
literal string types: ``Literal[str]``. This allows a function to
accept arbitrary literal string types such as ``Literal["foo"]`` or
``Literal["bar"]``.
Motivation
==========
A common security vulnerability is for a program to include
user-controlled data in a command it executes. For example, a naive
way to look up a user record from a database is to accept a user id
and insert it into a predefined SQL query:
::

    def query_user(conn: Connection, user_id: str) -> User:
        query = f"SELECT * FROM data WHERE user_id = {user_id}"
        conn.execute(query)

    query_user(conn, "user123")  # OK.

However, the user-controlled data ``user_id`` is being mixed with the
SQL command string, which means a malicious user could run arbitrary
SQL commands:
::
# Delete the table.
query_user(conn, "user123; DROP TABLE data;")
# Fetch all users (since 1 = 1 is always true).
query_user(conn, "user123 OR 1 = 1")
To prevent such SQL injection attacks, SQL APIs offer parameterized
queries, which separate the executed query from user-controlled data
and make it impossible to run arbitrary queries. For example, with
`sqlite3 <https://docs.python.org/3/library/sqlite3.html>`_, our
original function would be written safely as a query with parameters:
::

    def query_user(conn: Connection, user_id: str) -> User:
        query = "SELECT * FROM data WHERE user_id = ?"
        conn.execute(query, (user_id,))

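The difference between the two styles can be observed with an in-memory
database. The following is a minimal sketch, not part of the proposal: the
table contents, the helper names ``make_db``, ``query_user_unsafe`` and
``query_user_safe``, and the quoted payload are assumptions made for
illustration (the interpolated value is quoted here because ``user_id`` is
stored as text).

```python
import sqlite3
from typing import List, Tuple


def make_db() -> sqlite3.Connection:
    # Hypothetical two-row table used only for this demonstration.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE data (user_id TEXT, name TEXT)")
    conn.executemany(
        "INSERT INTO data VALUES (?, ?)",
        [("user123", "Alice"), ("user456", "Bob")],
    )
    return conn


def query_user_unsafe(conn: sqlite3.Connection, user_id: str) -> List[Tuple[str]]:
    # User data is spliced into the SQL text itself.
    query = f"SELECT name FROM data WHERE user_id = '{user_id}'"
    return conn.execute(query).fetchall()


def query_user_safe(conn: sqlite3.Connection, user_id: str) -> List[Tuple[str]]:
    # The SQL text is a fixed literal; user data travels separately.
    query = "SELECT name FROM data WHERE user_id = ?"
    return conn.execute(query, (user_id,)).fetchall()


conn = make_db()
payload = "nobody' OR '1' = '1"

unsafe_rows = query_user_unsafe(conn, payload)  # condition is always true
safe_rows = query_user_safe(conn, payload)      # payload is compared as data
```

With the f-string version the payload rewrites the ``WHERE`` clause and every
row comes back; with the parameterized version the same payload is treated as
an ordinary (non-matching) user id.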
The problem is that there is no way to enforce this
discipline. sqlite3's own `documentation
<https://docs.python.org/3/library/sqlite3.html>`_ can only admonish
the reader to not dynamically build the ``sql`` argument from external
input; the API's authors cannot express that through the type
system. Users can (and often do) still use a convenient f-string as
before and leave their code vulnerable to SQL injection.
Existing tools, such as the popular security linter `Bandit
<https://github.com/PyCQA/bandit/blob/aac3f16f45648a7756727286ba8f8f0cf5e7d408/bandit/plugins/django_sql_injection.py#L102>`_,
attempt to detect unsafe external data used in SQL APIs, by inspecting
the AST or by other semantic pattern-matching. These tools, however,
preclude common idioms like storing a large multi-line query in a
variable before executing it, adding literal string modifiers to the
query based on some conditions, or transforming the query string using
a function. (We survey existing tools in the "Rejected Alternatives"
section.) For example, many tools will report a false positive for
this benign snippet:
::

    def query_data(conn: Connection, user_id: str, limit: bool) -> None:
        query = """
            SELECT
                user.name,
                user.age
            FROM data
            WHERE user_id = ?
        """
        if limit:
            query += " LIMIT 1"

        conn.execute(query, (user_id,))

We want to forbid harmful execution of user-controlled data while
still allowing benign idioms like the above and not requiring extra
user work.
To meet this goal, we introduce the ``Literal[str]`` type, which only
accepts string values that are known to be made of literals. This is a
generalization of the ``Literal["foo"]`` type from `PEP 586
<https://www.python.org/dev/peps/pep-0586/>`_. A string of type
``Literal[str]`` cannot contain user-controlled data. Thus, any API
that only accepts ``Literal[str]`` will be immune to injection
vulnerabilities (with pragmatic `limitations <Appendix B:
Limitations_>`_).
Since we want the ``sqlite3`` ``execute`` method to disallow strings
built with user input, we would make its `typeshed stub
<https://github.com/python/typeshed/blob/1c88ceeee924ec6cfe05dd4865776b49fec299e6/stdlib/sqlite3/dbapi2.pyi#L153>`_
accept a ``sql`` query that is of type ``Literal[str]``:
::
def execute(self, sql: Literal[str], parameters: Iterable[str] = ...) -> Cursor: ...
This successfully forbids our unsafe SQL example. The variable
``query`` below is inferred to have type ``str``, since it is created
from a format string using ``user_id``, and cannot be passed to
``execute``:
::

    def query_user(conn: Connection, user_id: str) -> User:
        query = f"SELECT * FROM data WHERE user_id = {user_id}"
        conn.execute(query)
        # Error: Expected Literal[str], got str.

The method remains flexible enough to allow our more complicated
example:
::

    def query_data(conn: Connection, user_id: str, limit: bool) -> None:
        # This is a literal string.
        query = """
            SELECT
                user.name,
                user.age
            FROM data
            WHERE user_id = ?
        """

        if limit:
            # Still has type Literal[str] because we added a literal string.
            query += " LIMIT 1"

        conn.execute(query, (user_id,))  # OK

Notice that the user did not have to change their SQL code at all. The
type checker was able to infer the literal string type and complain
only in case of violations. The ``Literal[str]`` type is also useful
in other cases where we want strict command-data separation, such as
when building shell commands or when rendering a string into an HTML
response without escaping (e.g. via Django's ``mark_safe``
function). Overall, this combination of strictness and flexibility
makes it easy to enforce safer API usage in sensitive code without
burdening users.
Usage statistics
----------------
In a sample of open-source projects using ``sqlite3``, we found that
``conn.execute`` was called `~67%
<https://grep.app/search?q=conn%5C.execute%5C%28%5Cs%2A%5B%27%22%5D&regexp=true&filter[lang][0]=Python>`_
of the time with a safe string literal and `~33%
<https://grep.app/search?current=3&q=conn%5C.execute%5C%28%5Ba-zA-Z_%5D%2B%5C%29&regexp=true&filter[lang][0]=Python>`_
of the time with an unsafe, dynamically-built local variable. Using
this PEP's literal string type along with a type checker would have
prevented ``execute`` from being called in such an unsafe manner.
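For readers who want to see what the two searches match, the sketch below
applies the regular expressions decoded from the two grep.app URLs above to a
few source lines. The sample lines are invented for illustration and are not
taken from the surveyed projects.

```python
import re

# Patterns decoded from the two grep.app search URLs cited above.
LITERAL_ARG = re.compile(r"""conn\.execute\(\s*['"]""")
VARIABLE_ARG = re.compile(r"conn\.execute\([a-zA-Z_]+\)")

# Hypothetical source lines, invented for this example.
samples = [
    'conn.execute("SELECT * FROM data")',
    "conn.execute(query)",
    "conn.execute(query, (user_id,))",
]

literal_hits = [s for s in samples if LITERAL_ARG.search(s)]
variable_hits = [s for s in samples if VARIABLE_ARG.search(s)]
```

Note that both patterns are textual approximations: the second one matches
only a call whose sole argument is a bare name, and it cannot tell whether
that name was itself bound to a literal.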
Rationale
=========
Firstly, why use *types* to prevent security vulnerabilities?
Warning users in documentation is insufficient - most users either
never see these warnings or ignore them. Using an existing dynamic or
static analysis approach is too restrictive - these prevent natural
idioms, as we saw in the `Motivation`_ section (and will discuss more
extensively in the `Rejected Alternatives`_ section). The typing-based
approach in this PEP strikes a user-friendly balance between
strictness and flexibility.
Runtime approaches do not work because, at runtime, the query string
is a plain ``str``. While we could prevent some exploits using
heuristics, such as regex-filtering for obviously malicious payloads,
there will always be a way to work around them (perfectly
distinguishing good and bad queries reduces to the halting problem).
Static approaches like checking the AST to see if the query string is
a literal string expression cannot tell when a string is assigned to
an intermediate variable or when it is transformed by a benign
function. This makes them overly restrictive.
The type checker, surprisingly, does better than both because it has
access to information not available in the runtime or static analysis
approaches. Specifically, the type checker can tell us whether an
expression has a literal string type, say ``Literal["foo"]``. The type
checker already propagates types across variable assignments or
function calls.
In the current type system itself, if the SQL or shell command
execution function only accepted three possible input strings, our job
would be done. We would just say:
::
def execute(query: Literal["foo", "bar", "baz"]) -> None: ...
But, of course, ``execute`` can accept *any* possible query. How do we
ensure that the query does not contain an arbitrary, user-controlled
string?
We want to specify that the value must be of some type
``Literal[<...>]`` where ``<...>`` is some string. This is what
``Literal[str]`` represents. ``Literal[str]`` is the "supertype" of
all literal string types. Any particular literal string such as
``Literal["foo"]`` or ``Literal["bar"]`` is compatible with
``Literal[str]``, but not the other way around. The "supertype" of
``Literal[str]`` itself is ``str``. So, ``Literal[str]`` itself is
compatible with ``str``, but not the other way around. In effect, this
PEP just introduces a type in the type hierarchy between
``Literal["foo"]`` and ``str``.
Note that a ``Union`` of literal types is naturally compatible with
``Literal[str]`` because each element of the ``Union`` is individually
compatible with ``Literal[str]``. So, ``Literal["foo", "bar"]`` is
compatible with ``Literal[str]``.
However, recall that we don't just want to represent exact literal
queries. We also want to support composition of two literal strings,
such as ``query + " LIMIT 1"``. This too is possible with the above
concept. If ``x`` and ``y`` are two values of type ``Literal[str]``,
then ``x + y`` will also be of type compatible with
``Literal[str]``. We can reason about this by looking at specific
instances such as ``Literal["foo"]`` and ``Literal["bar"]``; the value
of the added string ``x + y`` can only be ``"foobar"``, which has type
``Literal["foobar"]`` and is thus compatible with
``Literal[str]``. The same reasoning applies when ``x`` and ``y`` are
unions of literal types; the result of pairwise adding any two literal
types from ``x`` and ``y`` respectively is a literal type, which means
that the overall result is a ``Union`` of literal types and is thus
compatible with ``Literal[str]``.
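The pairwise argument can be checked mechanically by modelling a union of
literal string types as a finite set of values. This is only a toy model of
the reasoning, not how a type checker represents types; the function name
``add_literal_unions`` and the sample values are assumptions.

```python
from itertools import product
from typing import FrozenSet


def add_literal_unions(xs: FrozenSet[str], ys: FrozenSet[str]) -> FrozenSet[str]:
    """Model ``x + y`` where x and y each range over a union of literals."""
    return frozenset(x + y for x, y in product(xs, ys))


x_values = frozenset({"foo", "bar"})    # models Literal["foo", "bar"]
y_values = frozenset({" ASC", " DESC"})  # models Literal[" ASC", " DESC"]

# Every possible result is itself a known literal string, so the sum is
# again a (larger) union of literal types.
sum_values = add_literal_unions(x_values, y_values)
```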
In this way, we are able to leverage Python's concept of a ``Literal``
string type to specify that our API can only accept strings that are
known to be constructed from literals. More specific details follow in
the remaining sections.
Valid Locations for ``Literal[str]``
=========================================
``Literal[str]`` can be used where any other type can be used:
::

    variable_annotation: Literal[str]

    def my_function(literal_string: Literal[str]) -> Literal[str]: ...

    class Foo:
        my_attribute: Literal[str]

    type_argument: List[Literal[str]]

    T = TypeVar("T", bound=Literal[str])

It can be nested within unions of ``Literal`` types:
::
union: Literal["hello", Literal[str]]
union2: Literal["hello", str]
union3: Literal[str, 4]
nested_literal_string: Literal[Literal[str]]
The restrictions on the parameters of ``Literal`` are the same as in
`PEP 586 <https://www.python.org/dev/peps/pep-0586/>`_. The only legal
parameter is the literal value ``str``. Other values are rejected even
if they evaluate to the same value (``str``), such as
``Literal[(lambda x: x)(str)]``.
Type Inference
==============
Inferring ``Literal[str]``
--------------------------
Any literal string type is compatible with ``Literal[str]``. For
example, ``x: Literal[str] = "foo"`` is valid because ``"foo"`` is
inferred to be of type ``Literal["foo"]``.
As per the `Rationale`_, we also infer ``Literal[str]`` in the
following cases:
+ Addition: ``x + y`` is of type ``Literal[str]`` if both ``x`` and
``y`` are compatible with ``Literal[str]``.
+ Joining: ``sep.join(xs)`` is of type ``Literal[str]`` if ``sep``'s
type is compatible with ``Literal[str]`` and ``xs``'s type is
compatible with ``Iterable[Literal[str]]``.
+ In-place addition: If ``s`` has type ``Literal[str]`` and ``x`` has
type compatible with ``Literal[str]``, then ``s += x`` preserves
``s``'s type as ``Literal[str]``.
+ String formatting: An f-string has type ``Literal[str]`` if and only
if its constituent expressions are literal strings. ``s.format(...)``
has type ``Literal[str]`` if and only if ``s`` and the arguments have
types compatible with ``Literal[str]``.
In all other cases, if one or more of the composed values has a
non-literal type ``str``, the composition of types will have type
``str``. For example, if ``s`` has type ``str``, then ``"hello" + s``
has type ``str``. This matches the pre-existing behavior of type
checkers.
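The composition rules above all follow one pattern: the result is
``Literal[str]`` only when every participating string is. The sketch below
encodes that pattern as a small toy model. The ``Kind`` enum and the
``infer_*`` helpers are hypothetical names invented here; they are not part
of any type checker.

```python
from enum import Enum
from typing import Iterable, Sequence


class Kind(Enum):
    LITERAL = "Literal[str]"
    STR = "str"


def combine(kinds: Iterable[Kind]) -> Kind:
    # Literal only if every part is literal; otherwise fall back to str.
    return Kind.LITERAL if all(k is Kind.LITERAL for k in kinds) else Kind.STR


def infer_add(left: Kind, right: Kind) -> Kind:
    return combine([left, right])


def infer_join(sep: Kind, items: Sequence[Kind]) -> Kind:
    return combine([sep, *items])


def infer_format(template: Kind, args: Sequence[Kind]) -> Kind:
    return combine([template, *args])


literal_plus_literal = infer_add(Kind.LITERAL, Kind.LITERAL)
literal_plus_str = infer_add(Kind.LITERAL, Kind.STR)
str_separator_join = infer_join(Kind.STR, [Kind.LITERAL, Kind.LITERAL])
literal_format = infer_format(Kind.LITERAL, [Kind.LITERAL])
```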
``Literal[str]`` is compatible with the type ``str``. It inherits all
methods from ``str``. So, if we have a variable ``s`` of type
``Literal[str]``, it is safe to write ``s.startswith("hello")``.
Note that, beyond the few composition rules mentioned above, this PEP
doesn't change inference for other ``str`` methods such as
``literal_string.upper()``.
Some type checkers refine the type of a string when doing an equality
check:
::

    def foo(s: str) -> None:
        if s == "bar":
            reveal_type(s)  # => Literal["bar"]

Such a refined type in the if-block is also compatible with
``Literal[str]`` because its type is ``Literal["bar"]``.
Examples
--------
See the examples below to help clarify the above rules:
::
literal_string: Literal[str]
s: str = literal_string # OK
literal_string: Literal[str] = s # Error: Expected Literal[str], got str.
literal_string: Literal[str] = "hello" # OK
def expect_literal_str(s: Literal[str]) -> None: ...
Addition of literal strings:
::
expect_literal_str("foo" + "bar") # OK
expect_literal_str(literal_string + "bar") # OK
literal_string2: Literal[str]
expect_literal_str(literal_string + literal_string2) # OK
plain_str: str
expect_literal_str(literal_string + plain_str) # Not OK.
Join using literal strings:
::
expect_literal_str(",".join(["foo", "bar"])) # OK
expect_literal_str(literal_string.join(["foo", "bar"])) # OK
expect_literal_str(literal_string.join([literal_string, literal_string2])) # OK
xs: List[Literal[str]]
expect_literal_str(literal_string.join(xs)) # OK
expect_literal_str(plain_str.join([literal_string, literal_string2]))
# Not OK because the separator has type ``str``.
In-place addition using literal strings:
::
literal_string += "foo" # OK
literal_string += literal_string2 # OK
literal_string += plain_str # Not OK
Format strings using literal strings:
::
literal_name: Literal[str]
expect_literal_str(f"hello {literal_name}")
# OK because it is composed from literal strings.
expect_literal_str("hello {}".format(literal_name)) # OK
expect_literal_str(f"hello") # OK
expect_literal_str(f"hello {username}")
# NOT OK. The format-string is constructed from ``username``,
# which has type ``str``.
expect_literal_str("hello {}".format(username)) # Not OK
Other literal types, such as literal integers, are not compatible with ``Literal[str]``:
::
some_int: int
expect_literal_str(some_int) # Error: Expected Literal[str], got int.
literal_one: Literal[1] = 1
expect_literal_str(literal_one) # Error: Expected Literal[str], got Literal[1].
We can call functions on literal strings:
::

    def add_limit(query: Literal[str]) -> Literal[str]:
        return query + " LIMIT 1"

    def my_query(query: Literal[str], user_id: str) -> None:
        sql_connection().execute(add_limit(query), (user_id,))  # OK

Conditional statements and expressions work as expected:
::

    def return_literal_str() -> Literal[str]:
        return "foo" if condition1() else "bar"  # OK

    def return_literal_str2(literal_str: Literal[str]) -> Literal[str]:
        return "foo" if condition1() else literal_str  # OK

    def return_literal_str3() -> Literal[str]:
        if condition1():
            result: Literal["foo"] = "foo"
        else:
            result: Literal[str] = "bar"

        return result  # OK

Interaction with TypeVars and Generics
--------------------------------------
TypeVars can be bound to ``Literal[str]``:
::

    from typing import Literal, TypeVar

    TLiteral = TypeVar("TLiteral", bound=Literal[str])

    def literal_identity(s: TLiteral) -> TLiteral:
        return s

    hello: Literal["hello"] = "hello"
    y = literal_identity(hello)
    reveal_type(y)  # => Literal["hello"]

    s: Literal[str]
    y2 = literal_identity(s)
    reveal_type(y2)  # => Literal[str]

    s_error: str
    literal_identity(s_error)
    # Error: Expected TLiteral (bound to Literal[str]), got str.

``Literal[str]`` can be used as type arguments for generic classes:
::

    class Container(Generic[T]):
        def __init__(self, value: T) -> None:
            self.value = value

    literal_str: Literal[str] = "hello"
    x: Container[Literal[str]] = Container(literal_str)  # OK

    s: str
    x_error: Container[Literal[str]] = Container(s)  # Not OK

Standard containers like ``List`` work as expected:
::
xs: List[Literal[str]] = ["foo", "bar", "baz"]
Interactions with Overloads
---------------------------
Literal strings and overloads do not need to interact in a special
way: the existing rules work fine. ``Literal[str]`` can be used as a
fallback overload where a specific ``Literal["foo"]`` type does not
match:
::
@overload
def foo(x: Literal["foo"]) -> int: ...
@overload
def foo(x: Literal[str]) -> bool: ...
@overload
def foo(x: str) -> str: ...
x1: int = foo("foo") # First overload.
x2: bool = foo("bar") # Second overload.
s: str
x3: str = foo(s) # Third overload.
Backwards Compatibility
-----------------------
As PEP 586 `mentions
<https://www.python.org/dev/peps/pep-0586/#backwards-compatibility>`_,
type checkers "should feel free to experiment with more sophisticated
inference techniques". So, if the type checker infers a literal string
type for an unannotated variable that is initialized with a literal
string, the following example should be OK:
::
x = "hello"
expect_literal_str(x)
# OK, because x is inferred to have type ``Literal["hello"]``.
This enables precise type checking of idiomatic SQL query code without
annotating the code at all (as seen in the `Motivation`_ section
example).
However, like PEP 586, this PEP does not mandate the above inference
strategy. In case the type checker doesn't infer ``x`` to have type
``Literal["hello"]``, users can aid the type checker by explicitly
annotating it as ``x: Literal[str]``:
::
x: Literal[str] = "hello"
expect_literal_str(x)
Runtime behavior
================
This PEP does not change the runtime behavior of ``Literal``.
Backwards compatibility
=======================
``Literal[str]`` is already accepted at runtime, so this PEP requires no
changes to the Python runtime itself. PEP 586 already backports
``Literal``, so this PEP does not need to change it.
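A quick way to confirm that the annotation is inert at runtime is to inspect
it with the standard ``typing`` helpers. This is a small sketch under the
draft's ``Literal[str]`` spelling; the function ``expect_literal_str`` is the
illustrative name used elsewhere in this PEP, given a trivial body here.

```python
from typing import Literal, get_args, get_origin

LiteralStr = Literal[str]  # constructing the type raises no error


def expect_literal_str(s: Literal[str]) -> str:
    # Annotations are not enforced at runtime; only a type checker objects.
    return s


origin = get_origin(LiteralStr)
args = get_args(LiteralStr)

# A dynamically built string is still accepted when the program runs.
dynamic = "".join(["he", "llo"])
result = expect_literal_str(dynamic)
```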
Rejected Alternatives
=====================
Why not use tool X?
-------------------
Focusing solely on the example of preventing SQL injection, tooling to
catch this kind of issue seems to come in three flavors: AST based,
function level analysis, and taint flow analysis.
**AST based tools include Bandit**: `Bandit
<https://github.com/PyCQA/bandit/blob/aac3f16f45648a7756727286ba8f8f0cf5e7d408/bandit/plugins/django_sql_injection.py#L102>`_
has a plugin to warn when SQL queries are not literal
strings. The problem is that many perfectly safe SQL
queries are dynamically built out of string literals, as shown in the
`Motivation`_ section. At the
AST level, the resultant SQL query is not going to appear as a string
literal anymore and is thus indistinguishable from a potentially
malicious string. To use these tools would require significantly
restricting developers' ability to build SQL queries. ``Literal[str]``
can provide similar safety guarantees with fewer restrictions.
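To make the limitation concrete, here is a toy AST check in the spirit of
such tools. It is not Bandit's actual plugin: the function
``non_literal_execute_calls`` and the sample source are assumptions written
for this illustration. It flags any ``.execute(...)`` call whose first
argument is not a string constant.

```python
import ast
from typing import List


def non_literal_execute_calls(source: str) -> List[int]:
    """Return line numbers of execute() calls whose first argument is not a string literal."""
    flagged = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "execute"
            and node.args
        ):
            first = node.args[0]
            is_literal = isinstance(first, ast.Constant) and isinstance(first.value, str)
            if not is_literal:
                flagged.append(node.lineno)
    return sorted(flagged)


# A benign query built only from literals, stored in a variable first.
benign = '''
query = "SELECT name FROM data WHERE user_id = ?"
if limit:
    query += " LIMIT 1"
conn.execute(query, (user_id,))
conn.execute("SELECT 1")
'''

flagged_lines = non_literal_execute_calls(benign)
```

The call on line 5 of the sample is flagged even though ``query`` is built
only from literals, because at the AST level the argument is just a name.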
**Semgrep and pyanalyze**: Semgrep supports a more sophisticated
function level analysis, including `constant propagation
<https://semgrep.dev/docs/writing-rules/data-flow/#constant-propagation>`_
within a function. This allows us to prevent injection attacks while
permitting some forms of safe dynamic SQL queries within a
function. `pyanalyze
<https://github.com/quora/pyanalyze/blob/afcb58cd3e967e4e3fea9e57bb18b6b1d9d42ed7/README.md#extending-pyanalyze>`_
has a similar extension. But neither handles function calls that
construct and return safe SQL queries. For example, in the code sample
below, ``build_insert_query`` is a helper function to create a query
that inserts multiple values into the corresponding columns. Semgrep
and pyanalyze forbid this natural usage whereas ``Literal[str]``
handles it with no burden on the programmer:
::

    def build_insert_query(
        table: Literal[str],
        insert_columns: Iterable[Literal[str]],
    ) -> Literal[str]:
        sql = "INSERT INTO " + table

        column_clause = ", ".join(insert_columns)
        value_clause = ", ".join(["?"] * len(insert_columns))

        sql += f" ({column_clause}) VALUES ({value_clause})"
        return sql

    def insert_data(
        conn: Connection,
        kvs_to_insert: Dict[Literal[str], str]
    ) -> None:
        query = build_insert_query("data", kvs_to_insert.keys())
        conn.execute(query, kvs_to_insert.values())

    # Example usage
    data_to_insert = {
        "column_1": value_1,  # Note: values are not literals
        "column_2": value_2,
        "column_3": value_3,
    }
    insert_data(conn, data_to_insert)

**Taint flow analysis**: Tools such as `Pysa
<https://pyre-check.org/docs/pysa-basics/>`_ or `CodeQL
<https://codeql.github.com/>`_ are capable of tracking data flowing
from a user controlled input into a SQL query. These tools are
powerful but involve considerable overhead in setting up the tool in
CI, defining "taint" sinks and sources, and teaching developers how to
use them. They also usually take longer to run than a type checker
(minutes instead of seconds), which means feedback is not
immediate. Finally, they move the burden of preventing vulnerabilities
on to library users instead of allowing the libraries themselves to
specify precisely how their APIs must be called (as is possible with
``Literal[str]``).
Why not use a ``NewType`` for ``str``?
--------------------------------------
Any API for which ``Literal[str]`` would be suitable could instead be
updated to accept a different type created within the Python type
system, such as ``NewType("SafeSQL", str)``:
::
SafeSQL = NewType("SafeSQL", str)
def execute(self, sql: SafeSQL, parameters: Iterable[str] = ...) -> Cursor: ...
execute(SafeSQL("SELECT * FROM data WHERE user_id = ?"), user_id) # OK
user_query: str
execute(user_query) # Error: Expected SafeSQL, got str.
Having to create a new type to call an API might give some developers
pause and encourage more caution, but it doesn't guarantee that
developers won't just turn a user controlled string into the new type,
and pass it into the modified API anyway:
::
query = f"SELECT * FROM data WHERE user_id = {user_id}"
execute(SafeSQL(query)) # No error!
We are back to square one with the problem of preventing arbitrary
inputs to ``SafeSQL``. This is not a theoretical concern
either. Django uses the above approach with ``SafeString`` and
`mark_safe
<https://docs.djangoproject.com/en/dev/_modules/django/utils/safestring/#SafeString>`_. Issues
such as `CVE-2020-13596
<https://github.com/django/django/commit/2dd4d110c159d0c81dff42eaead2c378a0998735>`_
show how this technique can `fail
<https://nvd.nist.gov/vuln/detail/CVE-2020-13596>`_.
Also note that this requires invasive changes to the source code
(wrapping the query with ``SafeSQL``) whereas ``Literal[str]``
requires no such changes. Users can remain oblivious to it as long as
they pass in literal strings to sensitive APIs.
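The weakness of the ``NewType`` approach is easy to see at runtime, where the
wrapper is an identity function. The sketch below is illustrative only:
``execute`` is a hypothetical stand-in that returns the SQL it would run, and
the ``user_id`` value is invented.

```python
from typing import NewType

SafeSQL = NewType("SafeSQL", str)


def execute(sql: SafeSQL) -> str:
    # Hypothetical stand-in for a database call; returns the SQL it would run.
    return sql


user_id = "user123 OR 1 = 1"  # attacker-controlled value
query = f"SELECT * FROM data WHERE user_id = {user_id}"

# Wrapping satisfies the type checker but performs no validation at all.
wrapped = SafeSQL(query)
executed = execute(wrapped)
```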
Why not try to emulate Trusted Types?
-------------------------------------
`Trusted Types
<https://w3c.github.io/webappsec-trusted-types/dist/spec/>`_ is a W3C
specification for preventing DOM-based Cross Site Scripting (XSS). XSS
occurs when dangerous browser APIs accept raw user-controlled
strings. The specification modifies these APIs to accept only the
"Trusted Types" returned by designated sanitizing functions. These
sanitizing functions must take in a potentially malicious string and
validate it or render it benign somehow, for example by verifying that
it is a valid URL or HTML-encoding it.
It can be tempting to assume porting the concept of Trusted Types to
Python could solve the problem. The fundamental difference, however,
is that the output of a Trusted Types sanitizer is usually intended
*to not be executable code*. Thus it's easy to HTML encode the input,
strip out dangerous tags, or otherwise render it inert. With a SQL
query or shell command, the end result *still needs to be executable
code*. There is no way to write a sanitizer that can reliably figure
out which parts of an input string are benign and which ones are
potentially malicious.
Runtime Checkable ``Literal[str]``
----------------------------------
The ``Literal[str]`` concept could be extended beyond static type
checking to be a runtime checkable property of ``str`` objects. This
would provide some benefits, such as allowing frameworks to raise
errors on dynamic strings. Such runtime errors would be a more robust
defense mechanism than type errors, which can potentially be
suppressed, ignored, or never even seen if the author does not use a
type checker.
This extension to the ``Literal[str]`` concept would dramatically
increase the scope of the proposal by requiring changes to one of the
most fundamental types in Python. While runtime taint checking on
strings has been `considered <https://bugs.python.org/issue500698>`_
and `attempted <https://github.com/felixgr/pytaint>`_ in the past, and
others may consider it in the future, such extensions are out of scope
for this PEP.
Reference Implementation
========================
This is implemented in Pyre v0.9.8 and is actively being used.
The implementation simply extends the type checker with
``Literal[str]`` as a supertype of literal string types.
To support composition via addition, join, etc., it was sufficient to
overload the stubs for ``str`` in Pyre's copy of typeshed. For
example, we replaced ``str`` ``__add__``:
::
# Before:
def __add__(self, s: str) -> str: ...
# After:
@overload
def __add__(self: Literal[str], other: Literal[str]) -> Literal[str]: ...
@overload
def __add__(self, other: str) -> str: ...
This means that addition of non-literal string types continues to have
type ``str``. The only change is that addition of literal string types
now produces ``Literal[str]``.
One implementation strategy is to update the official Typeshed `stub
<https://github.com/python/typeshed/blob/aa7e277adb9049e24ea3434fc9848defbfa87673/stdlib/builtins.pyi#L420>`_
for ``str`` with these changes.
Appendix A: Other Uses
======================
To simplify the discussion and require minimal security knowledge, we
focused on SQL injections throughout the PEP. ``Literal[str]``,
however, can also be used to prevent many other kinds of `injection
vulnerabilities <https://owasp.org/www-community/Injection_Flaws>`_.
Command Injection
-----------------
APIs such as ``subprocess.run`` accept a string which can be run as a
shell command:
::
subprocess.run(f"echo 'Hello {name}'", shell=True)
If attacker controlled data is included in the command string, a
command injection vulnerability exists and malicious operations can be
run. For example, a value of ``' && rm -rf / #`` would result in the
following destructive command being run:
::
echo 'Hello ' && rm -rf / #'
This vulnerability could be prevented by updating ``run`` to only
accept ``Literal[str]`` when used in ``shell=True`` mode. Here is one
simplified stub:
::
def run(command: Literal[str], *args: str, shell: bool=...): ...
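The command string from the example above can be reproduced without running
anything. In this sketch, ``build_greeting_command`` and ``run_shell`` are
hypothetical helpers introduced for illustration; ``run_shell`` is a thin
wrapper (not the real ``subprocess.run`` stub) that a type checker
implementing this PEP would only allow to be called with literal strings. It
is defined but deliberately never invoked here.

```python
import subprocess
from typing import Literal


def build_greeting_command(name: str) -> str:
    # Mixes user-controlled data into the shell command text.
    return f"echo 'Hello {name}'"


def run_shell(command: Literal[str]) -> "subprocess.CompletedProcess[str]":
    # Hypothetical wrapper: only literal strings should reach the shell.
    return subprocess.run(command, shell=True, capture_output=True, text=True)


benign = build_greeting_command("World")
injected = build_greeting_command("' && rm -rf / #")

# run_shell("echo 'Hello World'")  would be accepted by the type checker.
# run_shell(injected)              would be rejected: the argument has type str.
```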
Cross Site Scripting (XSS)
--------------------------
Most popular Python web frameworks, such as Django, use a templating
engine to produce HTML from user data. These templating languages
auto-escape user data before inserting it into the HTML template and
thus prevent cross site scripting (XSS) vulnerabilities.
But a common way to `bypass auto-escaping
<https://django.readthedocs.io/en/stable/ref/templates/language.html#how-to-turn-it-off>`_
and render HTML as-is is to use functions like ``mark_safe`` in
`Django
<https://docs.djangoproject.com/en/dev/ref/utils/#django.utils.safestring.mark_safe>`_
or ``do_mark_safe`` in `Jinja2
<https://github.com/pallets/jinja/blob/main/src/jinja2/filters.py#L1264>`_,
which cause XSS vulnerabilities:
::
dangerous_string = django.utils.safestring.mark_safe(f"<script>{user_input}</script>")
return dangerous_string
This vulnerability could be prevented by updating ``mark_safe`` to
only accept ``Literal[str]``:
::
def mark_safe(s: Literal[str]) -> str: ...
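The contrast between escaped and unescaped rendering can be sketched with the
standard library alone. Here ``html.escape`` stands in for a template
engine's auto-escaping, and ``mark_safe_literal`` is a hypothetical function
that mimics the proposed signature; neither is Django's actual
implementation.

```python
import html
from typing import Literal


def render_comment(user_input: str) -> str:
    # Stand-in for auto-escaping: user data is escaped before insertion.
    return "<p>" + html.escape(user_input) + "</p>"


def mark_safe_literal(s: Literal[str]) -> str:
    # Hypothetical analogue of mark_safe: returns the markup unescaped.
    return s


payload = "<script>alert(1)</script>"

escaped_page = render_comment(payload)
trusted_markup = mark_safe_literal("<b>Welcome</b>")  # literal: accepted
# mark_safe_literal(f"<b>{payload}</b>") would be rejected by a type checker
# implementing this PEP, because the argument has type str.
```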
Server Side Template Injection (SSTI)
-------------------------------------
Templating frameworks such as Jinja allow Python expressions which
will be evaluated and substituted into the rendered result:
::
template_str = "There are {{ len(values) }} values: {{ values }}"
template = jinja2.Template(template_str)
template.render(values=[1, 2])
# Result: "There are 2 values: [1, 2]"
If an attacker controls all or part of the template string, they can
insert expressions which execute arbitrary code and `compromise
<https://www.onsecurity.io/blog/server-side-template-injection-with-jinja2/>`_
the application:
::
malicious_str = "{{''.__class__.__base__.__subclasses__()[408]('rm - rf /',shell=True)}}"
template = jinja2.Template(malicious_str)
template.render()
# Result: The shell command 'rm - rf /' is run
Template injection exploits like this could be prevented by updating
the ``Template`` API to only accept ``Literal[str]``:
::

    class Template:
        def __init__(self, source: Literal[str]): ...

Appendix B: Limitations
=======================
There are a number of ways ``Literal[str]`` could still fail to
prevent users from passing strings built from non-literal data to an
API:
1. If the developer does not use a type checker or does not add type
annotations, then violations will go uncaught.
2. ``cast(Literal[str], non_literal_str)`` could be used to lie to the
type checker and allow a dynamic string value to masquerade as a
``Literal[str]``. The same goes for a variable that has type ``Any``.
3. Comments such as ``# type: ignore`` could be used to ignore
warnings about non-literal strings.
4. Trivial functions could be constructed to convert a ``str`` to a
``Literal[str]``:
::

    def make_literal(s: str) -> Literal[str]:
        letters: Dict[str, Literal[str]] = {
            "A": "A",
            "B": "B",
            ...
        }
        output: List[Literal[str]] = [letters[c] for c in s]
        return "".join(output)

We could mitigate the above using linting, code review, etc., but
ultimately a clever, malicious developer attempting to circumvent the
protections offered by ``Literal[str]`` will always succeed. The
important thing to remember is that ``Literal[str]`` is not intended
to protect against *malicious* developers; it is meant to protect
against benign developers accidentally using sensitive APIs in a
dangerous way (without getting in their way otherwise).
Without ``Literal[str]``, the best enforcement tool API authors have
is documentation, which is easily ignored and often not seen. With
``Literal[str]``, API misuse requires conscious thought and artifacts
in the code that reviewers and future developers can notice.
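The escape hatches in items 2 and 3 all leave a visible artifact in the
source while doing nothing at runtime. The sketch below shows them side by
side; ``read_untrusted`` is a hypothetical stand-in for user-controlled
input, and ``expect_literal_str`` is given a trivial body for the
demonstration.

```python
from typing import Any, Literal, cast


def expect_literal_str(s: Literal[str]) -> str:
    return s


def read_untrusted() -> str:
    # Hypothetical stand-in for user-controlled input.
    return "user123; DROP TABLE data;"


tainted = read_untrusted()

# Item 2: cast() returns its argument unchanged and silences the checker.
laundered = cast(Literal[str], tainted)
via_cast = expect_literal_str(laundered)

# Item 2: a value typed as Any is compatible with every type.
as_any: Any = tainted
via_any = expect_literal_str(as_any)

# Item 3: an explicit suppression comment hides the reported error.
via_ignore = expect_literal_str(tainted)  # type: ignore
```

Each bypass is greppable (``cast``, ``Any``, ``# type: ignore``), which is
what makes misuse noticeable in review.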
Resources
=========
Literal String Types in Scala
-----------------------------
Scala `uses
<https://www.scala-lang.org/api/2.13.x/scala/Singleton.html>`_
``Singleton`` as the supertype for singleton types, which includes
literal string types such as ``"foo"``. ``Singleton`` is Scala's
generalized analogue of this PEP's ``Literal[str]``.
In his Scala Days talk `Literal types: What are they good for?
<https://slideslive.com/38907881/literal-types-what-they-are-good-for>`_
(slides 52 to 68), Tamer Abdulradi showed how Scala's literal string
types can be used for "Preventing SQL injection at compile time".
Thanks
------
Thanks to the following people for their feedback on the PEP:
Edward Qiu, Jia Chen, Shannon Zhu, Gregory P. Smith, Никита Соболев, and Shengye Wan.
Copyright
=========
This document is placed in the public domain or under the
CC0-1.0-Universal license, whichever is more permissive.
..
Local Variables:
mode: indented-text
indent-tabs-mode: nil
sentence-end-double-space: t
fill-column: 70
coding: utf-8
End: