PEP 672: Unicode-related Security Considerations for Python (#2129)
This commit is contained in:
parent
1acde901b8
commit
943bf873ed
|
@ -525,6 +525,8 @@ pep-0665.rst @brettcannon
|
|||
pep-0667.rst @markshannon
|
||||
pep-0668.rst @dstufft
|
||||
pep-0670.rst @vstinner @erlend-aasland
|
||||
# pep-0671.rst
|
||||
pep-0672.rst @encukou
|
||||
# ...
|
||||
# pep-0754.txt
|
||||
# ...
|
||||
|
|
|
@ -0,0 +1,405 @@
|
|||
PEP: 672
|
||||
Title: Unicode-related Security Considerations for Python
|
||||
Author: Petr Viktorin <encukou@gmail.com>
|
||||
Status: Active
|
||||
Type: Informational
|
||||
Content-Type: text/x-rst
|
||||
Created: 01-Nov-2021
|
||||
Post-History: 01-Nov-2021
|
||||
|
||||
Abstract
|
||||
========
|
||||
|
||||
This document explains possible ways to misuse Unicode to write Python
|
||||
programs that appear to do something else than they actually do.
|
||||
|
||||
This document does not give any recommendations and solutions.
|
||||
|
||||
|
||||
Introduction
|
||||
============
|
||||
|
||||
`Unicode`_ is a system for handling all kinds of written language.
|
||||
It aims to allow any character from any human language to be
|
||||
used. Python code may consist of almost all valid Unicode characters.
|
||||
While this allows programmers from all around the world to express themselves,
|
||||
it also allows writing code that is potentially confusing to readers.
|
||||
|
||||
It is possible to misuse Python's Unicode-related features to write code that
|
||||
*appears* to do something else than what it does.
|
||||
Evildoers could take advantage of this to trick code reviewers into
|
||||
accepting malicious code.
|
||||
|
||||
The possible issues generally can't be solved in Python itself without
|
||||
excessive restrictions of the language.
|
||||
They should be solved in code edirors and review tools
|
||||
(such as *diff* displays), by enforcing project-specific policies,
|
||||
and by raising awareness of individual programmers.
|
||||
|
||||
This document purposefully does not give any solutions
|
||||
or recommendations: it is rather a list of things to keep in mind.
|
||||
|
||||
This document is specific to Python.
|
||||
For general security considerations in Unicode text, see [tr36]_ and [tr39]_.
|
||||
|
||||
|
||||
Acknowledgement
|
||||
===============
|
||||
|
||||
Investigation for this document was prompted by `CVE-2021-42574`_,
|
||||
*Trojan Source Attacks*, reported by Nicholas Boucher and Ross Anderson,
|
||||
which focuses on Bidirectional override characters and homoglyphs in a variety
|
||||
of programming languages.
|
||||
|
||||
|
||||
Confusing Features
|
||||
==================
|
||||
|
||||
This section lists some Unicode-related features that can be surprising
|
||||
or misusable.
|
||||
|
||||
|
||||
ASCII-only Considerations
|
||||
-------------------------
|
||||
|
||||
ASCII is a subset of Unicode, consisting of the most common symbols, numbers,
|
||||
Latin letters and control characters.
|
||||
|
||||
While issues with the ASCII character set are generally well understood,
|
||||
the're presented here to help better understanding of the non-ASCII cases.
|
||||
|
||||
Confusables and Typos
|
||||
'''''''''''''''''''''
|
||||
|
||||
Some characters look alike.
|
||||
Before the age of computers, many mechanical typewriters lacked the keys for
|
||||
the digits ``0`` and ``1``: users typed ``O`` (capital o) and ``l``
|
||||
(lowercase L) instead. Human readers could tell them apart by context only.
|
||||
In programming languages, however, distinction between digits and letters is
|
||||
critical -- and most fonts designed for programmers make it easy to tell them
|
||||
apart.
|
||||
|
||||
Similarly, in fonts designed for human languages, the uppercase “I” and
|
||||
lowercase “l” can look similar. Or the letters “rn” may be virtually
|
||||
indistinguishable from the single letter “m”.
|
||||
Again, programmers' fonts make these pairs of *confusables*
|
||||
noticeably different.
|
||||
|
||||
However, what is “noticeably” different always depends on the context.
|
||||
Humans tend to ignore details in longer identifiers: the variable name
|
||||
``accessibi1ity_options`` can still look indistinguishable from
|
||||
``accessibility_options``, while they are distinct for the compiler.
|
||||
The same can be said for plain typos: most humans will not notice the typo in
|
||||
``responsbility_chain_delegate``.
|
||||
|
||||
Control Characters
|
||||
''''''''''''''''''
|
||||
|
||||
Python generally considers all ``CR`` (``\r``), ``LF`` (``\n``), and ``CR-LF``
|
||||
pairs (``\r\n``) as an end of line characters.
|
||||
Most code editors do as well, but there are editors that display “non-native”
|
||||
line endings as unknown characters (or nothing at all), rather than ending
|
||||
the line, displaying this example::
|
||||
|
||||
# Don't call this function:
|
||||
fire_the_missiles()
|
||||
|
||||
as a harmless comment like::
|
||||
|
||||
# Don't call this function:⬛fire_the_missiles()
|
||||
|
||||
CPython may treat the control character NUL (``\0``) as end of input,
|
||||
but many editors simply skip it, possibly showing code that Python will not
|
||||
run as a regular part of a file.
|
||||
|
||||
Some characters can be used to hide/overwrite other characters when source is
|
||||
listed in common terminals. For example:
|
||||
|
||||
* BS (``\b``, Backspace) moves the cursor back, so the character after it
|
||||
will overwrite the character before.
|
||||
* CR (``\r``, carriage return) moves the cursor to the start of line,
|
||||
subsequent characters overwrite the start of the line.
|
||||
* SUB (``\x1A``, Ctrl+Z) means “End of text” on Windows. Some programs
|
||||
(such as ``type``) ignore the rest of the file after it.
|
||||
* ESC (``\x1B``) commonly initiates escape codes which allow arbitrary
|
||||
control of the terminal.
|
||||
|
||||
|
||||
Confusable Characters in Identifiers
|
||||
------------------------------------
|
||||
|
||||
Python is not limited to ASCII.
|
||||
It allows characters of all scripts – Latin letters to ancient Egyptian
|
||||
hieroglyphs – in identifiers (such as variable names).
|
||||
See :pep:`3131` for details and rationale.
|
||||
Only “letters and numbers” are allowed, so while ``γάτα`` is a valid Python
|
||||
identifier, ``🐱`` is not. (See `Identifiers and keywords`_ for details.)
|
||||
|
||||
Non-printing control characters are also not allowed in identifiers.
|
||||
|
||||
However, within the allowed set there is a large number of “confusables”.
|
||||
For example, the uppercase versions of the Latin ``b``, Greek ``β`` (Beta), and
|
||||
Cyrillic ``в`` (Ve) often look identical: ``B``, ``Β`` and ``В``, respectively.
|
||||
|
||||
This allows identifiers that look the same to humans, but not to Python.
|
||||
For example, all of the following are distinct identifiers:
|
||||
|
||||
* ``scope`` (Latin, ASCII-only)
|
||||
* ``scоpe`` (wih a Cyrillic ``о``)
|
||||
* ``scοpe`` (with a Greek ``ο``)
|
||||
* ``ѕсоре`` (all Cyrillic letters)
|
||||
|
||||
Additionally, some letters can look like non-letters:
|
||||
|
||||
* The letter for the Hawaiian *ʻokina* looks like an apostrophe;
|
||||
``ʻHelloʻ`` is a Python identifier, not a string.
|
||||
* The East Asian word for *ten* looks like a plus sign,
|
||||
so ``十= 10`` is a complete Python statement. (The “十” is a word: “ten”
|
||||
rather than “10”.)
|
||||
|
||||
.. note::
|
||||
|
||||
The converse also applies – some symbols look like letters – but since
|
||||
Python does not allow arbitrary symbols in identifiers, this is not an
|
||||
issue.
|
||||
|
||||
|
||||
Confusable Digits
|
||||
------------------
|
||||
|
||||
Numeric literals in Python only use the ASCII digits 0-9 (and non-digits such
|
||||
as ``.`` or ``e``).
|
||||
|
||||
However, when numbers are converted from strings, such as in the ``int`` and
|
||||
``float`` constructors or by the ``str.format`` method, any decimal digit
|
||||
can be used. For example ``߅`` (``NKO DIGIT FIVE``) or ``௫``
|
||||
(``TAMIL DIGIT FIVE``) work as the digit ``5``.
|
||||
|
||||
Some scripts include digits that look similar to ASCII ones, but have a
|
||||
different value. For example::
|
||||
|
||||
>>> int('৪୨')
|
||||
42
|
||||
>>> '{٥}'.format('zero', 'one', 'two', 'three', 'four', 'five')
|
||||
five
|
||||
|
||||
|
||||
Bidirectional Text
|
||||
------------------
|
||||
|
||||
Some scripts, such as Hebrew or Arabic, are written right-to-left.
|
||||
Phrases in such scripts interact with nearby text in ways that can be
|
||||
surprising to people who aren't familiar with these writing systems and their
|
||||
computer representation.
|
||||
|
||||
The exact process is complicated, and explained in Unicode Standard Annex #9,
|
||||
`Unicode Bidirectional Algorithm`_.
|
||||
|
||||
Consider the following code, which assigns a 100-character string to
|
||||
the variable ``s``::
|
||||
|
||||
s = "X" * 100 # "X" is assigned
|
||||
|
||||
When the ``X`` is replaced by the Hebrew letter ``א``, the line becomes::
|
||||
|
||||
s = "א" * 100 # "א" is assigned
|
||||
|
||||
This command still assigns a 100-character string to ``s``, but
|
||||
when displayed as general text following the Bidirectional Algorithm
|
||||
(e.g. in a browser), it appears as ``s = "א"`` followed by a comment.
|
||||
|
||||
Other surprising examples include:
|
||||
|
||||
* In the statement ``ערך = 23``, the variable ``ערך`` is set to the integer 23.
|
||||
|
||||
* In the statement ``قيمة = ערך``, the variable ``قيمة`` is set
|
||||
to the value of ``ערך``.
|
||||
|
||||
* In the statement ``قيمة - (ערך ** 2)``, the value of ``ערך`` is squared and
|
||||
then subtracted from ``قيمة``.
|
||||
The *opening* parenthesis is displayed as ``)``.
|
||||
|
||||
|
||||
|
||||
Bidirectional Marks, Embeddings, Overrides and Isolates
|
||||
-------------------------------------------------------
|
||||
|
||||
Default reordering rules do not always yield the intended direction of text, so
|
||||
Unicode provides several ways to alter it.
|
||||
|
||||
The most basic are **directional marks**, which are invisible but affect text
|
||||
as a left-to-right (or right-to-left) character would.
|
||||
Continuing with the ``s = "X"`` example above, in the next example the ``X`` is
|
||||
replaced by the Latin ``x`` followed or preceded by a
|
||||
right-to-left mark (``U+200F``). This assigns a 200-character string to ``s``
|
||||
(100 copies of ``x`` interspersed with 100 invisible marks),
|
||||
but under Unicode rules for general text, it is rendered as ``s = "x"``
|
||||
followed by an ASCII-only comment::
|
||||
|
||||
s = "x" * 100 # "x" is assigned
|
||||
|
||||
The directional **embedding**, **override** and **isolate** characters
|
||||
are also invisible, but affect the ordering of all text after them until either
|
||||
ended by a dedicated character, or until the end of line.
|
||||
(Unicode specifies the effect to last until the end of a “paragraph” (see
|
||||
`Unicode Bidirectional Algorithm`_),
|
||||
but allows tools to interpret newline characters as paragraph ends
|
||||
(see Unicode `Newline Guidelines`_). Most code editors and terminals do so.)
|
||||
|
||||
These characters essentially allow arbitrary reordering of the text that
|
||||
follows them. Python only allows them in strings and comments, which does limit
|
||||
their potential (especially in combination with the fact that Python's comments
|
||||
always extend to the end of a line), but it doesn't render them harmless.
|
||||
|
||||
|
||||
Normalizing identifiers
|
||||
-----------------------
|
||||
|
||||
Python strings are collections of *Unicode codepoints*, not “characters”.
|
||||
|
||||
For reasons like compatibility with earlier encodings, Unicode often has
|
||||
several ways to encode what is essentially a single “character”.
|
||||
For example, all are these different ways of writing ``Å`` as a Python string,
|
||||
each of which is unequal to the others.
|
||||
|
||||
* ``"\N{LATIN CAPITAL LETTER A WITH RING ABOVE}"`` (1 codepoint)
|
||||
* ``"\N{LATIN CAPITAL LETTER A}\N{COMBINING RING ABOVE}"`` (2 codepoints)
|
||||
* ``"\N{ANGSTROM SIGN}"`` (1 codepoint, but different)
|
||||
|
||||
For another example, the ligature ``fi`` has a dedicated Unicode codepoint,
|
||||
even though it has the same meaning as the two letters ``fi``.
|
||||
|
||||
Also, common letters frequently have several distinct variations.
|
||||
Unicode provides them for contexts where the difference has some semantic
|
||||
meaning, like mathematics. For example, some variations of ``n`` are:
|
||||
|
||||
* ``n`` (LATIN SMALL LETTER N)
|
||||
* ``𝐧`` (MATHEMATICAL BOLD SMALL N)
|
||||
* ``𝘯`` (MATHEMATICAL SANS-SERIF ITALIC SMALL N)
|
||||
* ``n`` (FULLWIDTH LATIN SMALL LETTER N)
|
||||
* ``ⁿ`` (SUPERSCRIPT LATIN SMALL LETTER N)
|
||||
|
||||
Unicode includes algorithms to *normalize* variants like these to a single
|
||||
form, and Python identifiers are normalized.
|
||||
(There are several normal forms; Python uses ``NFKC``.)
|
||||
|
||||
For example, ``xn`` and ``xⁿ`` are the same identifier in Python::
|
||||
|
||||
>>> xⁿ = 8
|
||||
>>> xn
|
||||
8
|
||||
|
||||
… as is ``fi`` and ``fi``, and as are the different ways to encode ``Å``.
|
||||
|
||||
This normalization applies *only* to identifiers, however.
|
||||
Functions that treat strings as identifiers, such as ``getattr``,
|
||||
do not perform normalization::
|
||||
|
||||
>>> class Test:
|
||||
... def finalize(self):
|
||||
... print('OK')
|
||||
...
|
||||
>>> Test().finalize()
|
||||
OK
|
||||
>>> Test().finalize()
|
||||
OK
|
||||
>>> getattr(Test(), 'finalize')
|
||||
Traceback (most recent call last):
|
||||
...
|
||||
AttributeError: 'Test' object has no attribute 'finalize'
|
||||
|
||||
This also applies when importing:
|
||||
|
||||
* ``import finalization`` performs normalization, and looks for a file
|
||||
named ``finalization.py`` (and other ``finalization.*`` files).
|
||||
* ``importlib.import_module("finalization")`` does not normalize,
|
||||
so it looks for a file named ``finalization.py``.
|
||||
|
||||
Some filesystems independently apply normalization and/or case folding.
|
||||
On some systems, ``finalization.py``, ``finalization.py`` and
|
||||
``FINALIZATION.py`` are three distinct filenames; on others, some or all
|
||||
of these name the same file.
|
||||
|
||||
|
||||
Source Encoding
|
||||
---------------
|
||||
|
||||
The encoding of Python source files is given by a specific regex on the first
|
||||
two lines of a file, as per `Encoding declarations`_.
|
||||
This mechanism is very liberal in what it accepts, and thus easy to obfuscate.
|
||||
|
||||
This can be misused in combination with Python-specific special-purpose
|
||||
encodings (see `Text Encodings`_).
|
||||
For example, with ``encoding: unicode_escape``, characters like
|
||||
quotes or braces can be hidden in an (f-)string, with many tools (syntax
|
||||
highlighters, linters, etc.) considering them part of the string.
|
||||
For example::
|
||||
|
||||
# For writing Japanese, you don't need an editor that supports
|
||||
# UTF-8 source encoding: unicode_escape sequences work just as well.
|
||||
|
||||
import os
|
||||
|
||||
message = '''
|
||||
This is "Hello World" in Japanese:
|
||||
\u3053\u3093\u306b\u3061\u306f\u7f8e\u3057\u3044\u4e16\u754c
|
||||
|
||||
This runs `echo WHOA` in your shell:
|
||||
\u0027\u0027\u0027\u002c\u0028\u006f\u0073\u002e
|
||||
\u0073\u0079\u0073\u0074\u0065\u006d\u0028
|
||||
\u0027\u0065\u0063\u0068\u006f\u0020\u0057\u0048\u004f\u0041\u0027
|
||||
\u0029\u0029\u002c\u0027\u0027\u0027
|
||||
'''
|
||||
|
||||
Here, ``encoding: unicode_escape`` in the initial comment is an encoding
|
||||
declaration. The ``unicode_escape`` encoding instructs Python to treat
|
||||
``\u0027`` as a single quote (which can start/end a string), ``\u002c`` as
|
||||
a comma (punctuator), etc.
|
||||
|
||||
|
||||
Open Issues
|
||||
===========
|
||||
|
||||
We should probably write and publish:
|
||||
|
||||
* Recommendations for Text Editors and Code Tools
|
||||
* Recommendations for Programmers and Teams
|
||||
* Possible Improvements in Python
|
||||
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
.. _Unicode: https://home.unicode.org/
|
||||
.. _`Unicode Bidirectional Algorithm`:
|
||||
http://www.unicode.org/reports/tr9/
|
||||
.. _`Newline Guidelines`:
|
||||
http://www.unicode.org/versions/Unicode14.0.0/ch05.pdf#G10213
|
||||
.. [tr36] Unicode Technical Report #36: Unicode Security Considerations
|
||||
http://www.unicode.org/reports/tr36/
|
||||
.. [tr39] Unicode® Technical Standard #39: Unicode Security Mechanisms
|
||||
http://www.unicode.org/reports/tr39/
|
||||
.. _CVE-2021-42574:
|
||||
https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-42574
|
||||
.. _`Encoding declarations`: https://docs.python.org/3/reference/lexical_analysis.html#encoding-declarations
|
||||
.. _`Identifiers and keywords`: https://docs.python.org/3/reference/lexical_analysis.html#identifiers
|
||||
.. _`Text Encodings`: https://docs.python.org/3/library/codecs.html#text-encodings
|
||||
|
||||
|
||||
Copyright
|
||||
=========
|
||||
|
||||
This document is placed in the public domain or under the
|
||||
CC0-1.0-Universal license, whichever is more permissive.
|
||||
|
||||
|
||||
|
||||
..
|
||||
Local Variables:
|
||||
mode: indented-text
|
||||
indent-tabs-mode: nil
|
||||
sentence-end-double-space: t
|
||||
fill-column: 70
|
||||
coding: utf-8
|
||||
End:
|
||||
|
Loading…
Reference in New Issue