2021-11-02 12:05:06 -04:00
|
|
|
|
PEP: 672
|
|
|
|
|
Title: Unicode-related Security Considerations for Python
|
|
|
|
|
Author: Petr Viktorin <encukou@gmail.com>
|
|
|
|
|
Status: Active
|
|
|
|
|
Type: Informational
|
|
|
|
|
Content-Type: text/x-rst
|
|
|
|
|
Created: 01-Nov-2021
|
|
|
|
|
Post-History: 01-Nov-2021
|
|
|
|
|
|
|
|
|
|
Abstract
|
|
|
|
|
========
|
|
|
|
|
|
|
|
|
|
This document explains possible ways to misuse Unicode to write Python
|
|
|
|
|
programs that appear to do something else than they actually do.
|
|
|
|
|
|
|
|
|
|
This document does not give any recommendations and solutions.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Introduction
|
|
|
|
|
============
|
|
|
|
|
|
|
|
|
|
`Unicode`_ is a system for handling all kinds of written language.
|
|
|
|
|
It aims to allow any character from any human language to be
|
|
|
|
|
used. Python code may consist of almost all valid Unicode characters.
|
|
|
|
|
While this allows programmers from all around the world to express themselves,
|
|
|
|
|
it also allows writing code that is potentially confusing to readers.
|
|
|
|
|
|
|
|
|
|
It is possible to misuse Python's Unicode-related features to write code that
|
|
|
|
|
*appears* to do something else than what it does.
|
|
|
|
|
Evildoers could take advantage of this to trick code reviewers into
|
|
|
|
|
accepting malicious code.
|
|
|
|
|
|
|
|
|
|
The possible issues generally can't be solved in Python itself without
|
|
|
|
|
excessive restrictions of the language.
|
2021-12-13 12:15:18 -05:00
|
|
|
|
They should be solved in code editors and review tools
|
2021-11-02 12:05:06 -04:00
|
|
|
|
(such as *diff* displays), by enforcing project-specific policies,
|
|
|
|
|
and by raising awareness of individual programmers.
|
|
|
|
|
|
|
|
|
|
This document purposefully does not give any solutions
|
|
|
|
|
or recommendations: it is rather a list of things to keep in mind.
|
|
|
|
|
|
|
|
|
|
This document is specific to Python.
|
|
|
|
|
For general security considerations in Unicode text, see [tr36]_ and [tr39]_.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Acknowledgement
|
|
|
|
|
===============
|
|
|
|
|
|
|
|
|
|
Investigation for this document was prompted by `CVE-2021-42574`_,
|
|
|
|
|
*Trojan Source Attacks*, reported by Nicholas Boucher and Ross Anderson,
|
|
|
|
|
which focuses on Bidirectional override characters and homoglyphs in a variety
|
|
|
|
|
of programming languages.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Confusing Features
|
|
|
|
|
==================
|
|
|
|
|
|
|
|
|
|
This section lists some Unicode-related features that can be surprising
|
|
|
|
|
or misusable.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
ASCII-only Considerations
|
|
|
|
|
-------------------------
|
|
|
|
|
|
|
|
|
|
ASCII is a subset of Unicode, consisting of the most common symbols, numbers,
|
|
|
|
|
Latin letters and control characters.
|
|
|
|
|
|
|
|
|
|
While issues with the ASCII character set are generally well understood,
|
|
|
|
|
the're presented here to help better understanding of the non-ASCII cases.
|
|
|
|
|
|
|
|
|
|
Confusables and Typos
|
|
|
|
|
'''''''''''''''''''''
|
|
|
|
|
|
|
|
|
|
Some characters look alike.
|
|
|
|
|
Before the age of computers, many mechanical typewriters lacked the keys for
|
|
|
|
|
the digits ``0`` and ``1``: users typed ``O`` (capital o) and ``l``
|
|
|
|
|
(lowercase L) instead. Human readers could tell them apart by context only.
|
|
|
|
|
In programming languages, however, distinction between digits and letters is
|
|
|
|
|
critical -- and most fonts designed for programmers make it easy to tell them
|
|
|
|
|
apart.
|
|
|
|
|
|
|
|
|
|
Similarly, in fonts designed for human languages, the uppercase “I” and
|
|
|
|
|
lowercase “l” can look similar. Or the letters “rn” may be virtually
|
|
|
|
|
indistinguishable from the single letter “m”.
|
|
|
|
|
Again, programmers' fonts make these pairs of *confusables*
|
|
|
|
|
noticeably different.
|
|
|
|
|
|
|
|
|
|
However, what is “noticeably” different always depends on the context.
|
|
|
|
|
Humans tend to ignore details in longer identifiers: the variable name
|
|
|
|
|
``accessibi1ity_options`` can still look indistinguishable from
|
|
|
|
|
``accessibility_options``, while they are distinct for the compiler.
|
|
|
|
|
The same can be said for plain typos: most humans will not notice the typo in
|
|
|
|
|
``responsbility_chain_delegate``.
|
|
|
|
|
|
|
|
|
|
Control Characters
|
|
|
|
|
''''''''''''''''''
|
|
|
|
|
|
|
|
|
|
Python generally considers all ``CR`` (``\r``), ``LF`` (``\n``), and ``CR-LF``
|
|
|
|
|
pairs (``\r\n``) as an end of line characters.
|
|
|
|
|
Most code editors do as well, but there are editors that display “non-native”
|
|
|
|
|
line endings as unknown characters (or nothing at all), rather than ending
|
|
|
|
|
the line, displaying this example::
|
|
|
|
|
|
|
|
|
|
# Don't call this function:
|
|
|
|
|
fire_the_missiles()
|
|
|
|
|
|
|
|
|
|
as a harmless comment like::
|
|
|
|
|
|
|
|
|
|
# Don't call this function:⬛fire_the_missiles()
|
|
|
|
|
|
|
|
|
|
CPython may treat the control character NUL (``\0``) as end of input,
|
|
|
|
|
but many editors simply skip it, possibly showing code that Python will not
|
|
|
|
|
run as a regular part of a file.
|
|
|
|
|
|
|
|
|
|
Some characters can be used to hide/overwrite other characters when source is
|
|
|
|
|
listed in common terminals. For example:
|
|
|
|
|
|
|
|
|
|
* BS (``\b``, Backspace) moves the cursor back, so the character after it
|
|
|
|
|
will overwrite the character before.
|
|
|
|
|
* CR (``\r``, carriage return) moves the cursor to the start of line,
|
|
|
|
|
subsequent characters overwrite the start of the line.
|
|
|
|
|
* SUB (``\x1A``, Ctrl+Z) means “End of text” on Windows. Some programs
|
|
|
|
|
(such as ``type``) ignore the rest of the file after it.
|
|
|
|
|
* ESC (``\x1B``) commonly initiates escape codes which allow arbitrary
|
|
|
|
|
control of the terminal.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Confusable Characters in Identifiers
|
|
|
|
|
------------------------------------
|
|
|
|
|
|
|
|
|
|
Python is not limited to ASCII.
|
|
|
|
|
It allows characters of all scripts – Latin letters to ancient Egyptian
|
|
|
|
|
hieroglyphs – in identifiers (such as variable names).
|
|
|
|
|
See :pep:`3131` for details and rationale.
|
|
|
|
|
Only “letters and numbers” are allowed, so while ``γάτα`` is a valid Python
|
|
|
|
|
identifier, ``🐱`` is not. (See `Identifiers and keywords`_ for details.)
|
|
|
|
|
|
|
|
|
|
Non-printing control characters are also not allowed in identifiers.
|
|
|
|
|
|
|
|
|
|
However, within the allowed set there is a large number of “confusables”.
|
|
|
|
|
For example, the uppercase versions of the Latin ``b``, Greek ``β`` (Beta), and
|
|
|
|
|
Cyrillic ``в`` (Ve) often look identical: ``B``, ``Β`` and ``В``, respectively.
|
|
|
|
|
|
|
|
|
|
This allows identifiers that look the same to humans, but not to Python.
|
|
|
|
|
For example, all of the following are distinct identifiers:
|
|
|
|
|
|
|
|
|
|
* ``scope`` (Latin, ASCII-only)
|
2021-11-15 10:36:12 -05:00
|
|
|
|
* ``scоpe`` (with a Cyrillic ``о``)
|
2021-11-02 12:05:06 -04:00
|
|
|
|
* ``scοpe`` (with a Greek ``ο``)
|
|
|
|
|
* ``ѕсоре`` (all Cyrillic letters)
|
|
|
|
|
|
|
|
|
|
Additionally, some letters can look like non-letters:
|
|
|
|
|
|
|
|
|
|
* The letter for the Hawaiian *ʻokina* looks like an apostrophe;
|
|
|
|
|
``ʻHelloʻ`` is a Python identifier, not a string.
|
|
|
|
|
* The East Asian word for *ten* looks like a plus sign,
|
|
|
|
|
so ``十= 10`` is a complete Python statement. (The “十” is a word: “ten”
|
|
|
|
|
rather than “10”.)
|
|
|
|
|
|
|
|
|
|
.. note::
|
|
|
|
|
|
|
|
|
|
The converse also applies – some symbols look like letters – but since
|
|
|
|
|
Python does not allow arbitrary symbols in identifiers, this is not an
|
|
|
|
|
issue.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Confusable Digits
|
|
|
|
|
------------------
|
|
|
|
|
|
|
|
|
|
Numeric literals in Python only use the ASCII digits 0-9 (and non-digits such
|
|
|
|
|
as ``.`` or ``e``).
|
|
|
|
|
|
|
|
|
|
However, when numbers are converted from strings, such as in the ``int`` and
|
|
|
|
|
``float`` constructors or by the ``str.format`` method, any decimal digit
|
|
|
|
|
can be used. For example ``߅`` (``NKO DIGIT FIVE``) or ``௫``
|
|
|
|
|
(``TAMIL DIGIT FIVE``) work as the digit ``5``.
|
|
|
|
|
|
|
|
|
|
Some scripts include digits that look similar to ASCII ones, but have a
|
|
|
|
|
different value. For example::
|
|
|
|
|
|
|
|
|
|
>>> int('৪୨')
|
|
|
|
|
42
|
|
|
|
|
>>> '{٥}'.format('zero', 'one', 'two', 'three', 'four', 'five')
|
|
|
|
|
five
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Bidirectional Text
|
|
|
|
|
------------------
|
|
|
|
|
|
|
|
|
|
Some scripts, such as Hebrew or Arabic, are written right-to-left.
|
|
|
|
|
Phrases in such scripts interact with nearby text in ways that can be
|
|
|
|
|
surprising to people who aren't familiar with these writing systems and their
|
|
|
|
|
computer representation.
|
|
|
|
|
|
|
|
|
|
The exact process is complicated, and explained in Unicode Standard Annex #9,
|
|
|
|
|
`Unicode Bidirectional Algorithm`_.
|
|
|
|
|
|
|
|
|
|
Consider the following code, which assigns a 100-character string to
|
|
|
|
|
the variable ``s``::
|
|
|
|
|
|
|
|
|
|
s = "X" * 100 # "X" is assigned
|
|
|
|
|
|
|
|
|
|
When the ``X`` is replaced by the Hebrew letter ``א``, the line becomes::
|
|
|
|
|
|
|
|
|
|
s = "א" * 100 # "א" is assigned
|
|
|
|
|
|
|
|
|
|
This command still assigns a 100-character string to ``s``, but
|
|
|
|
|
when displayed as general text following the Bidirectional Algorithm
|
|
|
|
|
(e.g. in a browser), it appears as ``s = "א"`` followed by a comment.
|
|
|
|
|
|
|
|
|
|
Other surprising examples include:
|
|
|
|
|
|
|
|
|
|
* In the statement ``ערך = 23``, the variable ``ערך`` is set to the integer 23.
|
|
|
|
|
|
|
|
|
|
* In the statement ``قيمة = ערך``, the variable ``قيمة`` is set
|
|
|
|
|
to the value of ``ערך``.
|
|
|
|
|
|
|
|
|
|
* In the statement ``قيمة - (ערך ** 2)``, the value of ``ערך`` is squared and
|
|
|
|
|
then subtracted from ``قيمة``.
|
|
|
|
|
The *opening* parenthesis is displayed as ``)``.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Bidirectional Marks, Embeddings, Overrides and Isolates
|
|
|
|
|
-------------------------------------------------------
|
|
|
|
|
|
|
|
|
|
Default reordering rules do not always yield the intended direction of text, so
|
|
|
|
|
Unicode provides several ways to alter it.
|
|
|
|
|
|
|
|
|
|
The most basic are **directional marks**, which are invisible but affect text
|
|
|
|
|
as a left-to-right (or right-to-left) character would.
|
|
|
|
|
Continuing with the ``s = "X"`` example above, in the next example the ``X`` is
|
|
|
|
|
replaced by the Latin ``x`` followed or preceded by a
|
|
|
|
|
right-to-left mark (``U+200F``). This assigns a 200-character string to ``s``
|
|
|
|
|
(100 copies of ``x`` interspersed with 100 invisible marks),
|
|
|
|
|
but under Unicode rules for general text, it is rendered as ``s = "x"``
|
|
|
|
|
followed by an ASCII-only comment::
|
|
|
|
|
|
|
|
|
|
s = "x" * 100 # "x" is assigned
|
|
|
|
|
|
|
|
|
|
The directional **embedding**, **override** and **isolate** characters
|
|
|
|
|
are also invisible, but affect the ordering of all text after them until either
|
|
|
|
|
ended by a dedicated character, or until the end of line.
|
|
|
|
|
(Unicode specifies the effect to last until the end of a “paragraph” (see
|
|
|
|
|
`Unicode Bidirectional Algorithm`_),
|
|
|
|
|
but allows tools to interpret newline characters as paragraph ends
|
|
|
|
|
(see Unicode `Newline Guidelines`_). Most code editors and terminals do so.)
|
|
|
|
|
|
|
|
|
|
These characters essentially allow arbitrary reordering of the text that
|
|
|
|
|
follows them. Python only allows them in strings and comments, which does limit
|
|
|
|
|
their potential (especially in combination with the fact that Python's comments
|
|
|
|
|
always extend to the end of a line), but it doesn't render them harmless.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Normalizing identifiers
|
|
|
|
|
-----------------------
|
|
|
|
|
|
|
|
|
|
Python strings are collections of *Unicode codepoints*, not “characters”.
|
|
|
|
|
|
|
|
|
|
For reasons like compatibility with earlier encodings, Unicode often has
|
|
|
|
|
several ways to encode what is essentially a single “character”.
|
2021-12-13 12:15:18 -05:00
|
|
|
|
For example, all these are different ways of writing ``Å`` as a Python string,
|
2021-11-02 12:05:06 -04:00
|
|
|
|
each of which is unequal to the others.
|
|
|
|
|
|
|
|
|
|
* ``"\N{LATIN CAPITAL LETTER A WITH RING ABOVE}"`` (1 codepoint)
|
|
|
|
|
* ``"\N{LATIN CAPITAL LETTER A}\N{COMBINING RING ABOVE}"`` (2 codepoints)
|
|
|
|
|
* ``"\N{ANGSTROM SIGN}"`` (1 codepoint, but different)
|
|
|
|
|
|
|
|
|
|
For another example, the ligature ``fi`` has a dedicated Unicode codepoint,
|
|
|
|
|
even though it has the same meaning as the two letters ``fi``.
|
|
|
|
|
|
|
|
|
|
Also, common letters frequently have several distinct variations.
|
|
|
|
|
Unicode provides them for contexts where the difference has some semantic
|
|
|
|
|
meaning, like mathematics. For example, some variations of ``n`` are:
|
|
|
|
|
|
|
|
|
|
* ``n`` (LATIN SMALL LETTER N)
|
|
|
|
|
* ``𝐧`` (MATHEMATICAL BOLD SMALL N)
|
|
|
|
|
* ``𝘯`` (MATHEMATICAL SANS-SERIF ITALIC SMALL N)
|
|
|
|
|
* ``n`` (FULLWIDTH LATIN SMALL LETTER N)
|
|
|
|
|
* ``ⁿ`` (SUPERSCRIPT LATIN SMALL LETTER N)
|
|
|
|
|
|
|
|
|
|
Unicode includes algorithms to *normalize* variants like these to a single
|
|
|
|
|
form, and Python identifiers are normalized.
|
|
|
|
|
(There are several normal forms; Python uses ``NFKC``.)
|
|
|
|
|
|
|
|
|
|
For example, ``xn`` and ``xⁿ`` are the same identifier in Python::
|
|
|
|
|
|
|
|
|
|
>>> xⁿ = 8
|
|
|
|
|
>>> xn
|
|
|
|
|
8
|
|
|
|
|
|
|
|
|
|
… as is ``fi`` and ``fi``, and as are the different ways to encode ``Å``.
|
|
|
|
|
|
|
|
|
|
This normalization applies *only* to identifiers, however.
|
|
|
|
|
Functions that treat strings as identifiers, such as ``getattr``,
|
|
|
|
|
do not perform normalization::
|
|
|
|
|
|
|
|
|
|
>>> class Test:
|
|
|
|
|
... def finalize(self):
|
|
|
|
|
... print('OK')
|
|
|
|
|
...
|
|
|
|
|
>>> Test().finalize()
|
|
|
|
|
OK
|
|
|
|
|
>>> Test().finalize()
|
|
|
|
|
OK
|
|
|
|
|
>>> getattr(Test(), 'finalize')
|
|
|
|
|
Traceback (most recent call last):
|
|
|
|
|
...
|
|
|
|
|
AttributeError: 'Test' object has no attribute 'finalize'
|
|
|
|
|
|
|
|
|
|
This also applies when importing:
|
|
|
|
|
|
|
|
|
|
* ``import finalization`` performs normalization, and looks for a file
|
|
|
|
|
named ``finalization.py`` (and other ``finalization.*`` files).
|
|
|
|
|
* ``importlib.import_module("finalization")`` does not normalize,
|
|
|
|
|
so it looks for a file named ``finalization.py``.
|
|
|
|
|
|
|
|
|
|
Some filesystems independently apply normalization and/or case folding.
|
|
|
|
|
On some systems, ``finalization.py``, ``finalization.py`` and
|
|
|
|
|
``FINALIZATION.py`` are three distinct filenames; on others, some or all
|
|
|
|
|
of these name the same file.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Source Encoding
|
|
|
|
|
---------------
|
|
|
|
|
|
|
|
|
|
The encoding of Python source files is given by a specific regex on the first
|
|
|
|
|
two lines of a file, as per `Encoding declarations`_.
|
|
|
|
|
This mechanism is very liberal in what it accepts, and thus easy to obfuscate.
|
|
|
|
|
|
|
|
|
|
This can be misused in combination with Python-specific special-purpose
|
|
|
|
|
encodings (see `Text Encodings`_).
|
|
|
|
|
For example, with ``encoding: unicode_escape``, characters like
|
|
|
|
|
quotes or braces can be hidden in an (f-)string, with many tools (syntax
|
|
|
|
|
highlighters, linters, etc.) considering them part of the string.
|
|
|
|
|
For example::
|
|
|
|
|
|
|
|
|
|
# For writing Japanese, you don't need an editor that supports
|
|
|
|
|
# UTF-8 source encoding: unicode_escape sequences work just as well.
|
|
|
|
|
|
|
|
|
|
import os
|
|
|
|
|
|
|
|
|
|
message = '''
|
|
|
|
|
This is "Hello World" in Japanese:
|
|
|
|
|
\u3053\u3093\u306b\u3061\u306f\u7f8e\u3057\u3044\u4e16\u754c
|
|
|
|
|
|
|
|
|
|
This runs `echo WHOA` in your shell:
|
|
|
|
|
\u0027\u0027\u0027\u002c\u0028\u006f\u0073\u002e
|
|
|
|
|
\u0073\u0079\u0073\u0074\u0065\u006d\u0028
|
|
|
|
|
\u0027\u0065\u0063\u0068\u006f\u0020\u0057\u0048\u004f\u0041\u0027
|
|
|
|
|
\u0029\u0029\u002c\u0027\u0027\u0027
|
|
|
|
|
'''
|
|
|
|
|
|
|
|
|
|
Here, ``encoding: unicode_escape`` in the initial comment is an encoding
|
|
|
|
|
declaration. The ``unicode_escape`` encoding instructs Python to treat
|
|
|
|
|
``\u0027`` as a single quote (which can start/end a string), ``\u002c`` as
|
|
|
|
|
a comma (punctuator), etc.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Open Issues
|
|
|
|
|
===========
|
|
|
|
|
|
|
|
|
|
We should probably write and publish:
|
|
|
|
|
|
|
|
|
|
* Recommendations for Text Editors and Code Tools
|
|
|
|
|
* Recommendations for Programmers and Teams
|
|
|
|
|
* Possible Improvements in Python
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
References
|
|
|
|
|
==========
|
|
|
|
|
|
|
|
|
|
.. _Unicode: https://home.unicode.org/
|
|
|
|
|
.. _`Unicode Bidirectional Algorithm`:
|
|
|
|
|
http://www.unicode.org/reports/tr9/
|
|
|
|
|
.. _`Newline Guidelines`:
|
|
|
|
|
http://www.unicode.org/versions/Unicode14.0.0/ch05.pdf#G10213
|
|
|
|
|
.. [tr36] Unicode Technical Report #36: Unicode Security Considerations
|
|
|
|
|
http://www.unicode.org/reports/tr36/
|
|
|
|
|
.. [tr39] Unicode® Technical Standard #39: Unicode Security Mechanisms
|
|
|
|
|
http://www.unicode.org/reports/tr39/
|
|
|
|
|
.. _CVE-2021-42574:
|
|
|
|
|
https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-42574
|
|
|
|
|
.. _`Encoding declarations`: https://docs.python.org/3/reference/lexical_analysis.html#encoding-declarations
|
|
|
|
|
.. _`Identifiers and keywords`: https://docs.python.org/3/reference/lexical_analysis.html#identifiers
|
|
|
|
|
.. _`Text Encodings`: https://docs.python.org/3/library/codecs.html#text-encodings
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Copyright
|
|
|
|
|
=========
|
|
|
|
|
|
|
|
|
|
This document is placed in the public domain or under the
|
|
|
|
|
CC0-1.0-Universal license, whichever is more permissive.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
..
|
|
|
|
|
Local Variables:
|
|
|
|
|
mode: indented-text
|
|
|
|
|
indent-tabs-mode: nil
|
|
|
|
|
sentence-end-double-space: t
|
|
|
|
|
fill-column: 70
|
|
|
|
|
coding: utf-8
|
|
|
|
|
End:
|
|
|
|
|
|