PEP 672: Unicode-related Security Considerations for Python (#2129)

2021-11-02 17:05:06 +01:00 · 2021-11-02 17:05:06 +01:00 · 943bf873ed
parent 1acde901b8
commit 943bf873ed
2 changed files with 407 additions and 0 deletions
--- a/.github/CODEOWNERS
+++ b/.github/CODEOWNERS
@ -525,6 +525,8 @@ pep-0665.rst  @brettcannon
 pep-0667.rst  @markshannon
 pep-0668.rst  @dstufft
 pep-0670.rst  @vstinner @erlend-aasland
+# pep-0671.rst
+pep-0672.rst  @encukou
 # ...
 # pep-0754.txt
 # ...
--- a/pep-0672.rst
+++ b/pep-0672.rst
@ -0,0 +1,405 @@
+PEP: 672
+Title: Unicode-related Security Considerations for Python
+Author: Petr Viktorin <encukou@gmail.com>
+Status: Active
+Type: Informational
+Content-Type: text/x-rst
+Created: 01-Nov-2021
+Post-History: 01-Nov-2021
+
+Abstract
+========
+
+This document explains possible ways to misuse Unicode to write Python
+programs that appear to do something else than they actually do.
+
+This document does not give any recommendations and solutions.
+
+
+Introduction
+============
+
+`Unicode`_ is a system for handling all kinds of written language.
+It aims to allow any character from any human language to be
+used. Python code may consist of almost all valid Unicode characters.
+While this allows programmers from all around the world to express themselves,
+it also allows writing code that is potentially confusing to readers.
+
+It is possible to misuse Python's Unicode-related features to write code that
+*appears* to do something else than what it does.
+Evildoers could take advantage of this to trick code reviewers into
+accepting malicious code.
+
+The possible issues generally can't be solved in Python itself without
+excessive restrictions of the language.
+They should be solved in code edirors and review tools
+(such as *diff* displays), by enforcing project-specific policies,
+and by raising awareness of individual programmers.
+
+This document purposefully does not give any solutions
+or recommendations: it is rather a list of things to keep in mind.
+
+This document is specific to Python.
+For general security considerations in Unicode text, see [tr36]_ and [tr39]_.
+
+
+Acknowledgement
+===============
+
+Investigation for this document was prompted by `CVE-2021-42574`_,
+*Trojan Source Attacks*, reported by Nicholas Boucher and Ross Anderson,
+which focuses on Bidirectional override characters and homoglyphs in a variety
+of programming languages.
+
+
+Confusing Features
+==================
+
+This section lists some Unicode-related features that can be surprising
+or misusable.
+
+
+ASCII-only Considerations
+-------------------------
+
+ASCII is a subset of Unicode, consisting of the most common symbols, numbers,
+Latin letters and control characters.
+
+While issues with the ASCII character set are generally well understood,
+the're presented here to help better understanding of the non-ASCII cases.
+
+Confusables and Typos
+'''''''''''''''''''''
+
+Some characters look alike.
+Before the age of computers, many mechanical typewriters lacked the keys for
+the digits ``0`` and ``1``: users typed ``O`` (capital o) and ``l``
+(lowercase L) instead. Human readers could tell them apart by context only.
+In programming languages, however, distinction between digits and letters is
+critical -- and most fonts designed for programmers make it easy to tell them
+apart.
+
+Similarly, in fonts designed for human languages, the uppercase “I” and
+lowercase “l” can look similar. Or the letters “rn” may be virtually
+indistinguishable from the single letter “m”.
+Again, programmers' fonts make these pairs of *confusables*
+noticeably different.
+
+However, what is “noticeably” different always depends on the context.
+Humans tend to ignore details in longer identifiers: the variable name
+``accessibi1ity_options`` can still look indistinguishable from
+``accessibility_options``, while they are distinct for the compiler.
+The same can be said for plain typos: most humans will not notice the typo in
+``responsbility_chain_delegate``.
+
+Control Characters
+''''''''''''''''''
+
+Python generally considers all ``CR`` (``\r``), ``LF`` (``\n``), and ``CR-LF``
+pairs (``\r\n``) as an end of line characters.
+Most code editors do as well, but there are editors that display “non-native”
+line endings as unknown characters (or nothing at all), rather than ending
+the line, displaying this example::
+
+    # Don't call this function:
+    fire_the_missiles()
+
+as a harmless comment like::
+
+    # Don't call this function:⬛fire_the_missiles()
+
+CPython may treat the control character NUL (``\0``) as end of input,
+but many editors simply skip it, possibly showing code that Python will not
+run as a regular part of a file.
+
+Some characters can be used to hide/overwrite other characters when source is
+listed in common terminals. For example:
+
+* BS (``\b``, Backspace) moves the cursor back, so the character after it
+  will overwrite the character before.
+* CR (``\r``, carriage return) moves the cursor to the start of line,
+  subsequent characters overwrite the start of the line.
+* SUB (``\x1A``, Ctrl+Z) means “End of text” on Windows. Some programs
+  (such as ``type``) ignore the rest of the file after it.
+* ESC (``\x1B``) commonly initiates escape codes which allow arbitrary
+  control of the terminal.
+
+
+Confusable Characters in Identifiers
+------------------------------------
+
+Python is not limited to ASCII.
+It allows characters of all scripts – Latin letters to ancient Egyptian
+hieroglyphs – in identifiers (such as variable names).
+See :pep:`3131` for details and rationale.
+Only “letters and numbers” are allowed, so while ``γάτα`` is a valid Python
+identifier, ``🐱`` is not.  (See `Identifiers and keywords`_ for details.)
+
+Non-printing control characters are also not allowed in identifiers.
+
+However, within the allowed set there is a large number of “confusables”.
+For example, the uppercase versions of the Latin ``b``, Greek ``β`` (Beta), and
+Cyrillic ``в`` (Ve) often look identical: ``B``, ``Β`` and ``В``, respectively.
+
+This allows identifiers that look the same to humans, but not to Python.
+For example, all of the following are distinct identifiers:
+
+* ``scope`` (Latin, ASCII-only)
+* ``scоpe`` (wih a Cyrillic ``о``)
+* ``scοpe`` (with a Greek ``ο``)
+* ``ѕсоре`` (all Cyrillic letters)
+
+Additionally, some letters can look like non-letters:
+
+* The letter for the Hawaiian *ʻokina* looks like an apostrophe;
+  ``ʻHelloʻ`` is a Python identifier, not a string.
+* The East Asian word for *ten* looks like a plus sign,
+  so ``十= 10`` is a complete Python statement. (The “十” is a word: “ten”
+  rather than “10”.)
+
+.. note::
+
+   The converse also applies – some symbols look like letters – but since
+   Python does not allow arbitrary symbols in identifiers, this is not an
+   issue.
+
+
+Confusable  Digits
+------------------
+
+Numeric literals in Python only use the ASCII digits 0-9 (and non-digits such
+as ``.`` or ``e``).
+
+However, when numbers are converted from strings, such as in the ``int`` and
+``float`` constructors or by the ``str.format`` method, any decimal digit
+can be used. For example ``߅`` (``NKO DIGIT FIVE``) or ``௫``
+(``TAMIL DIGIT FIVE``) work as the digit ``5``.
+
+Some scripts include digits that look similar to ASCII ones, but have a
+different value. For example::
+
+    >>> int('৪୨')
+    42
+    >>> '{٥}'.format('zero', 'one', 'two', 'three', 'four', 'five')
+    five
+
+
+Bidirectional Text
+------------------
+
+Some scripts, such as Hebrew or Arabic, are written right-to-left.
+Phrases in such scripts interact with nearby text in ways that can be
+surprising to people who aren't familiar with these writing systems and their
+computer representation.
+
+The exact process is complicated, and explained in Unicode Standard Annex #9,
+`Unicode Bidirectional Algorithm`_.
+
+Consider the following code, which assigns a 100-character string to
+the variable ``s``::
+
+  s = "X" * 100 #    "X" is assigned
+
+When the ``X`` is replaced by the Hebrew letter ``א``, the line becomes::
+
+  s = "א" * 100 #    "א" is assigned
+
+This command still assigns a 100-character string to ``s``, but
+when displayed as general text following the Bidirectional Algorithm
+(e.g. in a browser), it appears as ``s = "א"`` followed by a comment.
+
+Other surprising examples include:
+
+* In the statement ``ערך = 23``, the variable ``ערך`` is set to the integer 23.
+
+* In the statement ``قيمة = ערך``, the variable ``قيمة`` is set
+  to the value of ``ערך``.
+
+* In the statement ``قيمة - (ערך ** 2)``, the value of ``ערך`` is squared and
+  then subtracted from ``قيمة``.
+  The *opening* parenthesis is displayed as ``)``.
+
+
+
+Bidirectional Marks, Embeddings, Overrides and Isolates
+-------------------------------------------------------
+
+Default reordering rules do not always yield the intended direction of text, so
+Unicode provides several ways to alter it.
+
+The most basic are **directional marks**, which are invisible but affect text
+as a left-to-right (or right-to-left) character would.
+Continuing with the ``s = "X"`` example above, in the next example the ``X`` is
+replaced by the Latin ``x`` followed or preceded by a
+right-to-left mark (``U+200F``). This assigns a 200-character string to ``s``
+(100 copies of ``x`` interspersed with 100 invisible marks),
+but under Unicode rules for general text, it is rendered as ``s = "x"``
+followed by an ASCII-only comment::
+
+    s = "x‏" * 100 #    "‏x" is assigned
+
+The directional **embedding**, **override** and **isolate** characters
+are also invisible, but affect the ordering of all text after them until either
+ended by a dedicated character, or until the end of line.
+(Unicode specifies the effect to last until the end of a “paragraph” (see
+`Unicode Bidirectional Algorithm`_),
+but allows tools to interpret newline characters as paragraph ends
+(see Unicode `Newline Guidelines`_). Most code editors and terminals do so.)
+
+These characters essentially allow arbitrary reordering of the text that
+follows them. Python only allows them in strings and comments, which does limit
+their potential (especially in combination with the fact that Python's comments
+always extend to the end of a line), but it doesn't render them harmless.
+
+
+Normalizing identifiers
+-----------------------
+
+Python strings are collections of *Unicode codepoints*, not “characters”.
+
+For reasons like compatibility with earlier encodings, Unicode often has
+several ways to encode what is essentially a single “character”.
+For example, all are these different ways of writing ``Å`` as a Python string,
+each of which is unequal to the others.
+
+* ``"\N{LATIN CAPITAL LETTER A WITH RING ABOVE}"`` (1 codepoint)
+* ``"\N{LATIN CAPITAL LETTER A}\N{COMBINING RING ABOVE}"`` (2 codepoints)
+* ``"\N{ANGSTROM SIGN}"`` (1 codepoint, but different)
+
+For another example, the ligature ``ﬁ`` has a dedicated Unicode codepoint,
+even though it has the same meaning as the two letters ``fi``.
+
+Also, common letters frequently have several distinct variations.
+Unicode provides them for contexts where the difference has some semantic
+meaning, like mathematics. For example, some variations of ``n`` are:
+
+* ``n`` (LATIN SMALL LETTER N)
+* ``𝐧`` (MATHEMATICAL BOLD SMALL N)
+* ``𝘯`` (MATHEMATICAL SANS-SERIF ITALIC SMALL N)
+* ``ｎ`` (FULLWIDTH LATIN SMALL LETTER N)
+* ``ⁿ`` (SUPERSCRIPT LATIN SMALL LETTER N)
+
+Unicode includes algorithms to *normalize* variants like these to a single
+form, and Python identifiers are normalized.
+(There are several normal forms; Python uses ``NFKC``.)
+
+For example, ``xn`` and ``xⁿ`` are the same identifier in Python::
+
+    >>> xⁿ = 8
+    >>> xn
+    8
+
+… as is ``ﬁ`` and ``fi``, and as are the different ways to encode ``Å``.
+
+This normalization applies *only* to identifiers, however.
+Functions that treat strings as identifiers, such as ``getattr``,
+do not perform normalization::
+
+   >>> class Test:
+   ...     def ﬁnalize(self):
+   ...         print('OK')
+   ...
+   >>> Test().finalize()
+   OK
+   >>> Test().ﬁnalize()
+   OK
+   >>> getattr(Test(), 'ﬁnalize')
+   Traceback (most recent call last):
+     ...
+   AttributeError: 'Test' object has no attribute 'ﬁnalize'
+
+This also applies when importing:
+
+* ``import ﬁnalization`` performs normalization, and looks for a file
+  named ``finalization.py`` (and other ``finalization.*`` files).
+* ``importlib.import_module("ﬁnalization")`` does not normalize,
+  so it looks for a file named ``ﬁnalization.py``.
+
+Some filesystems independently apply normalization and/or case folding.
+On some systems, ``ﬁnalization.py``, ``finalization.py`` and
+``FINALIZATION.py`` are three distinct filenames; on others, some or all
+of these name the same file.
+
+
+Source Encoding
+---------------
+
+The encoding of Python source files is given by a specific regex on the first
+two lines of a file, as per `Encoding declarations`_.
+This mechanism is very liberal in what it accepts, and thus easy to obfuscate.
+
+This can be misused in combination with Python-specific special-purpose
+encodings (see `Text Encodings`_).
+For example, with ``encoding: unicode_escape``, characters like
+quotes or braces can be hidden in an (f-)string, with many tools (syntax
+highlighters, linters, etc.) considering them part of the string.
+For example::
+
+    # For writing Japanese, you don't need an editor that supports
+    # UTF-8 source encoding: unicode_escape sequences work just as well.
+
+    import os
+
+    message = '''
+    This is "Hello World" in Japanese:
+    \u3053\u3093\u306b\u3061\u306f\u7f8e\u3057\u3044\u4e16\u754c
+
+    This runs `echo WHOA` in your shell:
+    \u0027\u0027\u0027\u002c\u0028\u006f\u0073\u002e
+    \u0073\u0079\u0073\u0074\u0065\u006d\u0028
+    \u0027\u0065\u0063\u0068\u006f\u0020\u0057\u0048\u004f\u0041\u0027
+    \u0029\u0029\u002c\u0027\u0027\u0027
+    '''
+
+Here, ``encoding: unicode_escape`` in the initial comment is an encoding
+declaration. The ``unicode_escape`` encoding instructs Python to treat
+``\u0027`` as a single quote (which can start/end a string), ``\u002c`` as
+a comma (punctuator), etc.
+
+
+Open Issues
+===========
+
+We should probably write and publish:
+
+* Recommendations for Text Editors and Code Tools
+* Recommendations for Programmers and Teams
+* Possible Improvements in Python
+
+
+References
+==========
+
+.. _Unicode: https://home.unicode.org/
+.. _`Unicode Bidirectional Algorithm`:
+   http://www.unicode.org/reports/tr9/
+.. _`Newline Guidelines`:
+   http://www.unicode.org/versions/Unicode14.0.0/ch05.pdf#G10213
+.. [tr36]  Unicode Technical Report #36: Unicode Security Considerations
+   http://www.unicode.org/reports/tr36/
+.. [tr39]   Unicode® Technical Standard #39: Unicode Security Mechanisms
+   http://www.unicode.org/reports/tr39/
+.. _CVE-2021-42574:
+   https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-42574
+.. _`Encoding declarations`: https://docs.python.org/3/reference/lexical_analysis.html#encoding-declarations
+.. _`Identifiers and keywords`: https://docs.python.org/3/reference/lexical_analysis.html#identifiers
+.. _`Text Encodings`: https://docs.python.org/3/library/codecs.html#text-encodings
+
+
+Copyright
+=========
+
+This document is placed in the public domain or under the
+CC0-1.0-Universal license, whichever is more permissive.
+
+
+
+..
+    Local Variables:
+    mode: indented-text
+    indent-tabs-mode: nil
+    sentence-end-double-space: t
+    fill-column: 70
+    coding: utf-8
+    End:
+