Major redesign of PEP 501 interpolation

2015-08-22 19:57:17 +10:00 · 2015-08-22 19:57:17 +10:00 · 1bd21bb257
parent 673f1ce88d
commit 1bd21bb257
1 changed files with 243 additions and 186 deletions
--- a/pep-0501.txt
+++ b/pep-0501.txt
@@ -1,5 +1,5 @@
 PEP: 501
-Title: Translation ready string interpolation
+Title: General purpose string interpolation
 Version: $Revision$
 Last-Modified: $Date$
 Author: Nick Coghlan <ncoghlan@gmail.com>
@@ -18,53 +18,83 @@ transparent to the compiler, allow name references from the interpolation
 operation full access to containing namespaces (as with any other expression),
 rather than being limited to explicitly named references.

-This PEP agrees with the basic motivation of PEP 498, but proposes to focus
-both the syntax and the implementation on the il8n use case, drawing on the
-previous proposals in PEP 292 (which added string.Template) and its predecessor
-PEP 215 (which proposed syntactic support, rather than a runtime string
-manipulation based approach). The text of this PEP currently assumes that the
-reader is familiar with these three previous related proposals.
+However, it only offers this capability for string formatting, making it likely
+we will see code like the following::

-The interpolation syntax proposed for this PEP is that of PEP 292, but expanded
-to allow arbitrary expressions and format specifiers when using the ``${ref}``
-interpolation syntax. The suggested new string prefix is "i" rather than "f",
-with the intended mnemonics being either "interpolated string" or
-"il8n string"::
+    os.system(f"echo {user_message}")
+
+This kind of code is superficially elegant, but poses a significant problem
+if the interpolated value ``user_message`` is in fact provided by a user: it's
+an opening for a form of code injection attack, where the supplied user data
+has not been properly escaped before being passed to the ``os.system`` call.
+
+To address that problem (and a number of other concerns), this PEP proposes an
+alternative approach to compiler supported interpolation, based on a new
+``__interpolate__`` magic method, and using a substitution syntax inspired by
+that used in ``string.Template`` and ES6 JavaScript, rather than adding a 4th
+substitution variable syntax to Python.
+
+Proposal
+========
+
+This PEP proposes that the new syntax::
+
+    value = !interpolator "Substitute $names and ${expressions} at runtime"
+
+be interpreted as::
+
+    _raw_template = "Substitute $names and ${expressions} at runtime"
+    _parsed_fields = (
+        ("Substitute ", 0, "names", "", ""),
+        (" and ", 1, "expressions", "", ""),
+        (" at runtime", None, None, None, None),
+    )
+    _field_values = (names, expressions)
+    value = interpolator.__interpolate__(_raw_template,
+                                         _parsed_fields,
+                                         _field_values)
+
+Whitespace would be permitted between the interpolator name and the opening
+quote, but not required in most cases.
+
+The ``str`` builtin type would gain an ``__interpolate__`` implementation that
+supported the following ``str.format`` based semantics::

  >>> import datetime
  >>> name = 'Jane'
  >>> age = 50
  >>> anniversary = datetime.date(1991, 10, 12)
-  >>> i'My name is $name, my age next year is ${age+1}, my anniversary is ${anniversary:%A, %B %d, %Y}.'
+  >>> !str'My name is $name, my age next year is ${age+1}, my anniversary is ${anniversary:%A, %B %d, %Y}.'
  'My name is Jane, my age next year is 51, my anniversary is Saturday, October 12, 1991.'
-  >>> i'She said her name is ${name!r}.'
+  >>> !str'She said her name is ${name!r}.'
  "She said her name is 'Jane'."

-This PEP also proposes the introduction of three new builtin functions,
-``__interpolate__``, ``__interpolateb__`` and  ``__interpolateu__``, which
-implement key aspects of the interpolation process, and may be overridden in
-accordance with the usual mechanisms for shadowing builtin functions.
+The interpolation prefix could be used with single-quoted, double-quoted and
+triple-quoted strings. It may also be used with raw strings, but in that case
+whitespace would be required between the interpolator name and the trailing
+string.

 This PEP does not propose to remove or deprecate any of the existing
 string formatting mechanisms, as those will remain valuable when formatting
 strings that are not present directly in the source code of the application.

-The key aim of this PEP that isn't inherited from PEP 498 is to help ensure
-that future Python applications are written in a "translation ready" way, where
-many interface strings that may need to be translated to allow an application
-to be used in multiple languages are flagged as a natural consequence of the
-development process, even though they won't be translated by default.
-

 Rationale
 =========

 PEP 498 makes interpolating values into strings with full access to Python's
 lexical namespace semantics simpler, but it does so at the cost of introducing
-yet another string interpolation syntax.
+yet another string interpolation syntax, and also creates a situation where
+interpolating values into sensitive targets like SQL queries, shell commands
+and HTML templates will enjoy a much cleaner syntax when handled without
+regard for code injection attacks than when they are handled correctly.
+
+This PEP proposes to handle the latter issue by always specifying an explicit
+interpolator for interpolation operations, and the former by adopting the
+``string.Template`` substitution syntax defined in PEP 292.

 The interpolation syntax devised for PEP 292 is deliberately simple so that the
-template strings can be extracted into an il8n message catalog, and passed to
+template strings can be extracted into an i18n message catalog, and passed to
 translators who may not themselves be developers. For these use cases, it is
 important that the interpolation syntax be as simple as possible, as the
 translators are responsible for preserving the substitution markers, even as
@@ -77,31 +107,35 @@ introduced for general purpose string formatting in PEP 3101, so this PEP adds
 that flexibility to the ``${ref}`` construct in PEP 292, and allows translation
 tools the option of rejecting usage of that more advanced syntax at runtime,
 rather than categorically rejecting it at compile time. The proposed permitted
-expressions inside ``${ref}`` are exactly as defined in PEP 498.
+expressions, conversion specifiers, and format specifiers inside ``${ref}`` are
+exactly as defined in PEP 498.
+
+The specific proposal in this PEP is also deliberately close in both syntax
+and semantics to the general purpose interpolation syntax introduced to
+JavaScript in ES6, as we can reasonably expect a great many Python developers
+to be regularly switching back and forth between user interface code written
+in JavaScript and core application code written in Python.


 Specification
 =============

-In source code, i-strings are string literals that are prefixed by the
-letter 'i'. The string will be parsed into its components at compile time,
-which will then be passed to the new ``__interpolate__`` builtin at runtime.
+In source code, interpolation expressions are introduced by the new character
+``!``. This is a new kind of expression, consisting of::

-The 'i' prefix may be combined with 'b', where the 'i' must appear first, in
-which case  ``__interpolateb__`` will be called rather than ``__interpolate__``.
-Similarly, 'i' may also be combined with 'u' to call ``__interpolateu__``
-rather than ``__interpolate__``.
+    !DOTTED_NAME TEMPLATE_STRING

-The 'i' prefix may also be combined with 'r', with or without 'b' or 'u', to
-produce raw i-strings. This disables backslash escape sequences in the string
-literal as usual, but has no effect on the runtime interpolation behaviour.
+Similar to ``yield`` expressions, this construct can be used without
+parentheses as a standalone expression statement, as the sole expression on the
+right hand side of an assignment or return statement, and as the sole argument
+to a function. In other situations, it requires containing parentheses to avoid
+ambiguity.

-In all cases, the only permitted location for the 'i' prefix is before all other
-prefix characters - it indicates a runtime operation, which is largely
-independent of the compile time prefixes (aside from calling different
-interpolation functions when combined with 'b' or 'u').
+The template string must be a Unicode string (byte strings are not permitted),
+and string literal concatenation operates as normal within the template string
+component of the expression.

-i-strings are parsed into literals and expressions. Expressions
+The template string is parsed into literals and expressions. Expressions
 appear as either identifiers prefixed with a single "$" character, or
 surrounded by a leading '${' and a trailing '}'. The parts of the format string
 that are not expressions are separated out as string literals.
@@ -110,63 +144,68 @@ While parsing the string, any doubled ``$$`` is replaced with a single ``$``
 and is considered part of the literal text, rather than as introducing an
 expression.

-These components are then organised into 3 parallel tuples:
+These components are then organised into a tuple of tuples, and passed to the
+``__interpolate__`` method of the interpolator identified by the given
+name::

-* parsed format string fields
-* expression text
-* expression values
+    DOTTED_NAME.__interpolate__(TEMPLATE_STRING,
+                                <parsed_fields>,
+                                <field_values>)

-And then passed to the ``__interpolate__`` builtin at runtime::
+The template string field tuple is inspired by the interface of
+``string.Formatter.parse``, and consists of a series of 5-tuples each
+containing:

-    __interpolate__(fields, expressions, values)
+* a leading string literal (may be the empty string)
+* the substitution field position (zero-based enumeration)
+* the substitution expression text
+* the substitution conversion specifier (as defined by str.format)
+* the substitution format specifier (as defined by str.format)

-The format string field tuple is inspired by the interface of
-``string.Formatter.parse``, and consists of a series of 4-tuples each containing
-a leading literal, together with a trailing field number, format specifier,
-and conversion specifier. If a given substition field has no leading literal
-section, format specifier or conversion specifier, then the corresponding
-elements in the tuple are the empty string. If the final part of the string
-has no trailing substitution field, then the field number, format specifier
+If a given substitution field has no leading literal section, format specifier
+or conversion specifier, then the corresponding elements in the tuple are the
+empty string. If the final part of the string has no trailing substitution
+field, then the field position, expression text, format specifier
 and conversion specifier will all be ``None``.

 The expression text is simply the text of each interpolated expression, as it
 appeared in the original string, but without the leading and/or surrounding
 expression markers.

-The expression values are the result of evaluating the interpolated expressions
-in the exact runtime context where the i-string appears in the source code.
+The substitution field values tuple is created by evaluating the interpolated
+expressions in the exact runtime context where the interpolation expression
+appears in the source code.

-For the following example i-string::
+For the following example interpolation expression::

-    i'abc${expr1:spec1}${expr2!r:spec2}def${expr3:!s}ghi $ident $$jkl'``,
+    !str 'abc${expr1:spec1}${expr2!r:spec2}def${expr3!s}ghi $ident $$jkl'

-the fields tuple would be::
+the parsed fields tuple would be::

    (
-      ('abc', 0, 'spec1', ''),
-      ('', 1, 'spec2' 'r'),
-      (def', 2, '', 's'),
-      ('ghi', 3, '', ''),
-      ('$jkl', None, None, None)
+      ('abc', 0, 'expr1', '', 'spec1'),
+      ('', 1, 'expr2', 'r', 'spec2'),
+      ('def', 2, 'expr3', 's', ''),
+      ('ghi ', 3, 'ident', '', ''),
+      (' $jkl', None, None, None, None)
    )

-For the same example, the expression text and value tuples would be::
+While the field values tuple would be::

-    ('expr1', 'expr2', 'expr3', 'ident') # Expression text
-    (expr1, expr2, expr2, ident)       # Expression values
+    (expr1, expr2, expr3, ident)

-The fields and expression text tuples can be constant folded at compile time,
-while the expression values tuple will always need to be constructed at runtime.
+The parsed fields tuple can be constant folded at compile time, while the
+field values tuple will always need to be constructed at runtime.
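
The parsing rules described above can be approximated at runtime with a small regular expression. This is only a sketch under stated assumptions, not the compile-time parser the PEP proposes: the names ``parse_template`` and ``_split_braced`` are hypothetical, and the splitter assumes that expressions contain no braces or colons (so slices, lambdas and dict displays are out of scope).

```python
import re

# Matches "$$", "$identifier" or "${...}" (no nested braces in this sketch).
_FIELD = re.compile(r"""
    \$(?:
        (?P<escaped>\$)                      # doubled dollar sign
      | (?P<ident>[A-Za-z_][A-Za-z0-9_]*)    # bare identifier
      | \{(?P<braced>[^{}]*)\}               # braced expression
    )
""", re.VERBOSE)


def _split_braced(text):
    """Split 'expr!c:spec' into (expr, conversion, format_spec).

    Simplification: the first ':' always starts the format specifier.
    """
    expr, _, format_spec = text.partition(":")
    conversion = ""
    if len(expr) >= 2 and expr[-2] == "!":
        expr, conversion = expr[:-2], expr[-1]
    return expr, conversion, format_spec


def parse_template(template):
    """Return the tuple of 5-tuples described in the specification."""
    fields = []
    literal = []
    pos = 0
    field_num = 0
    for match in _FIELD.finditer(template):
        literal.append(template[pos:match.start()])
        pos = match.end()
        if match.group("escaped"):
            literal.append("$")  # "$$" contributes a literal "$"
            continue
        if match.group("ident"):
            expr, conversion, format_spec = match.group("ident"), "", ""
        else:
            expr, conversion, format_spec = _split_braced(match.group("braced"))
        fields.append(("".join(literal), field_num, expr, conversion, format_spec))
        literal = []
        field_num += 1
    literal.append(template[pos:])
    trailing = "".join(literal)
    if trailing:
        # Trailing literal text with no substitution field.
        fields.append((trailing, None, None, None, None))
    return tuple(fields)


parsed = parse_template("abc${expr1:spec1}${expr2!r:spec2}def${expr3!s}ghi $ident $$jkl")
```

Note that the leading literals keep their whitespace exactly as written, so the fourth entry starts with ``'ghi '`` and the trailing entry is ``' $jkl'``.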

-The default ``__interpolate__`` implementation would have the following
+The ``str.__interpolate__`` implementation would have the following
 semantics, with field processing being defined in terms of the ``format``
 builtin and ``str.format`` conversion specifiers::

    _converter = string.Formatter().convert_field

-    def __interpolate__(fields, expressions, values):
+    def __interpolate__(raw_template, fields, values):
        template_parts = []
-        for leading_text, field_num, format_spec, conversion in fields:
+        for leading_text, field_num, expr, conversion, format_spec in fields:
            template_parts.append(leading_text)
            if field_num is not None:
                value = values[field_num]
@@ -176,167 +215,162 @@ builtin and ``str.format`` conversion specifiers::
                template_parts.append(field_str)
        return "".join(template_parts)
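
Since the ``!str`` syntax does not exist in any released Python, the semantics above can only be exercised by building the tuples by hand. The following is a minimal sketch under that assumption: ``str_interpolate`` is a hypothetical stand-in for ``str.__interpolate__``, and the conversion and formatting steps in the loop body are this sketch's reading of the ``str.format`` based semantics, not text taken from the PEP.

```python
import datetime
import string

_converter = string.Formatter().convert_field


def str_interpolate(raw_template, fields, values):
    """Hypothetical stand-in for the proposed str.__interpolate__."""
    parts = []
    for leading_text, field_num, expr, conversion, format_spec in fields:
        parts.append(leading_text)
        if field_num is not None:
            value = values[field_num]
            if conversion:
                # 'r', 's' or 'a', as in str.format
                value = _converter(value, conversion)
            parts.append(format(value, format_spec))
    return "".join(parts)


name = "Jane"
age = 50
anniversary = datetime.date(1991, 10, 12)

# What the compiler would produce for the first example in the Proposal.
raw = ("My name is $name, my age next year is ${age+1}, "
       "my anniversary is ${anniversary:%A, %B %d, %Y}.")
fields = (
    ("My name is ", 0, "name", "", ""),
    (", my age next year is ", 1, "age+1", "", ""),
    (", my anniversary is ", 2, "anniversary", "", "%A, %B %d, %Y"),
    (".", None, None, None, None),
)
values = (name, age + 1, anniversary)

result = str_interpolate(raw, fields, values)
quoted = str_interpolate(
    "She said her name is ${name!r}.",
    (("She said her name is ", 0, "name", "r", ""),
     (".", None, None, None, None)),
    (name,),
)
```

The raw template is accepted but unused here, which matches the default string formatting case. It matters for interpolators that need the original text, such as a message catalog lookup.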

-The default ``__interpolateu__`` implementation would be the
-``__interpolate__`` builtin.
+Writing custom interpolators
+----------------------------

-The default ``__interpolateb__`` implementation would be defined in terms of
-the binary mod-formatting reintroduced in PEP 461::
+To simplify the process of writing custom interpolators, it is proposed to add
+a new builtin decorator, ``interpolator``, which would be defined as::

-    def __interpolateb__(fields, expressions, values):
-        template_parts = []
-        for leading_data, field_num, format_spec, conversion in fields:
-            template_parts.append(leading_data)
-            if field_num is not None:
-                if conversion:
-                    raise ValueError("Conversion specifiers not supported "
-                                     "in default binary interpolation")
-                value = values[field_num]
-                field_data = ("%" + format_spec) % (value,)
-                template_parts.append(field_data)
-        return b"".join(template_parts)
+    def interpolator(f):
+        f.__interpolate__ = f.__call__
+        return f

-This definition permits examples like the following::
+This allows new interpolators to be written as::

-    >>> data = 10
-    >>> ib'$data'
-    b'10'
-    >>> b'${data:%4x}'
-    b'   a'
-    >>> b'${data:#4x}'
-    b' 0xa'
-    >>> b'${data:04X}'
-    b'000A'
+    @interpolator
+    def my_custom_interpolator(raw_template, parsed_fields, field_values):
+        ...


 Expression evaluation
 ---------------------

-The expressions that are extracted from the string are evaluated in
-the context where the i-string appeared. This means the expression has
-full access to local, nonlocal and global variables. Any valid Python
-expression can be used inside ``${}``, including function and method calls.
-References without the surrounding braces are limited to looking up single
-identifiers.
+The subexpressions that are extracted from the interpolation expression are
+evaluated in the context where the interpolation expression appears. This means
+the expression has full access to local, nonlocal and global variables. Any
+valid Python expression can be used inside ``${}``, including function and
+method calls. References without the surrounding braces are limited to looking
+up single identifiers.

-Because the i-strings are evaluated where the string appears in the
-source code, there is no additional expressiveness available with
-i-strings. There are also no additional security concerns: you could
-have also just written the same expression, not inside of an
-i-string::
+Because the substitution expressions are evaluated where the string appears in
+the source code, there are no additional security concerns related to the
+contents of the expression itself, as you could have also just written the
+same expression and used runtime field parsing::

  >>> bar=10
  >>> def foo(data):
  ...   return data + 20
  ...
-  >>> i'input=$bar, output=${foo(bar)}'
+  >>> !str 'input=$bar, output=${foo(bar)}'
  'input=10, output=30'

-Is equivalent to::
+Is essentially equivalent to::

  >>> 'input={}, output={}'.format(bar, foo(bar))
  'input=10, output=30'

-Format specifiers
-----------------
+Handling code injection attacks
+-------------------------------

-Format specifiers are not interpreted by the i-string parser - that is
-handling at runtime by the called interpolation function.
+The proposed interpolation expressions make it potentially attractive to write
+code like the following::

-Concatenating strings
---------------------
+    myquery = !str "SELECT $column FROM $table;"
+    mycommand = !str "cat $filename"
+    mypage = !str "<html><body>$content</body></html>"

-As i-strings are shorthand for a runtime builtin function call, implicit
-concatenation is a syntax error (similar to attempting implicit concatenation
-between bytes and str literals)::
+These all represent potential vectors for code injection attacks, if any of the
+variables being interpolated happen to come from an untrusted source. The
+specific proposal in this PEP is designed to make it straightforward to write
+use case specific interpolators that take care of quoting interpolated values
+appropriately for the relevant security context::

-    >>> i"interpolated" "not interpolated"
-      File "<stdin>", line 1
-    SyntaxError: cannot mix interpolation call with plain literal
+    myquery = !sql "SELECT $column FROM $table;"
+    mycommand = !sh "cat $filename"
+    mypage = !html "<html><body>$content</body></html>"
+
+This PEP does not cover adding such interpolators to the standard library,
+but instead ensures they can be readily provided by third party libraries.
+
+(Although it's tempting to propose adding ``__interpolate__`` implementations
+to ``subprocess.call``, ``subprocess.check_call`` and
+``subprocess.check_output``.)
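
As an illustration of a use case specific interpolator, here is a rough sketch of what a third party ``sh`` interpolator might look like. It assumes the ``interpolator`` decorator proposed above (reproduced locally, since it is not a real builtin) and uses the standard library's ``shlex.quote`` for escaping. The ``sh`` name and its policy of rejecting conversion and format specifiers are assumptions of this sketch, and the ``__interpolate__`` call is written out by hand because the ``!sh`` syntax is only a proposal.

```python
import shlex


def interpolator(f):
    # Local copy of the decorator proposed in this PEP (not a real builtin).
    f.__interpolate__ = f.__call__
    return f


@interpolator
def sh(raw_template, parsed_fields, field_values):
    """Hypothetical shell interpolator: quote every substituted value."""
    parts = []
    for leading_text, field_num, expr, conversion, format_spec in parsed_fields:
        parts.append(leading_text)
        if field_num is not None:
            if conversion or format_spec:
                raise ValueError("specifiers are not supported by this sketch")
            parts.append(shlex.quote(str(field_values[field_num])))
    return "".join(parts)


# Hand-written equivalent of:  mycommand = !sh "cat $filename"
filename = "notes.txt; rm -rf ~"
mycommand = sh.__interpolate__(
    "cat $filename",
    (("cat ", 0, "filename", "", ""),),
    (filename,),
)
```

With this input the hostile filename ends up as a single quoted shell word rather than a second command.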
+
+Format and conversion specifiers
+--------------------------------
+
+Aside from separating them out from the substitution expression, format and
+conversion specifiers are otherwise treated as opaque strings by the
+interpolation template parser - assigning semantics to those (or, alternatively,
+prohibiting their use) is handled at runtime by the specified interpolator.

 Error handling
 --------------

-Either compile time or run time errors can occur when processing
-i-strings. Compile time errors are limited to those errors that can be
-detected when parsing an i-string into its component tuples. These errors all
-raise SyntaxError.
+Either compile time or run time errors can occur when processing interpolation
+expressions. Compile time errors are limited to those errors that can be
+detected when parsing a template string into its component tuples. These
+errors all raise SyntaxError.

 Unmatched braces::

-  >>> i'x=${x'
+  >>> !str 'x=${x'
    File "<stdin>", line 1
  SyntaxError: missing '}' in interpolation expression

 Invalid expressions::

-  >>> i'x=${!x}'
+  >>> !str 'x=${!x}'
    File "<fstring>", line 1
      !x
      ^
  SyntaxError: invalid syntax

 Run time errors occur when evaluating the expressions inside an
-i-string. See PEP 498 for some examples.
+interpolation template string. See PEP 498 for some examples.

-Different interpolation functions may also impose additional runtime
+Different interpolators may also impose additional runtime
 constraints on acceptable interpolated expressions and other formatting
 details, which will be reported as runtime exceptions.

-Leading whitespace in expressions is not skipped
------------------------------------------------
-
-Unlike PEP 498, leading whitespace in expressions doesn't need to be skipped -
-'$' is not a legal character in Python's syntax, so it can't appear inside
-a ``${}`` field except as part of another string, whether interpolated or not.
-

 Internationalising interpolated strings
 =======================================

-So far, this PEP has said nothing practical about internationalisation - only
-formatting text using either str.format or bytes.__mod__ semantics depending
-on whether or not a str or bytes object is being interpolated.
+Since this PEP derives its interpolation syntax from the internationalisation
+focused PEP 292, it's worth considering the potential implications this PEP
+may have for the internationalisation use case.

-Internationalisation enters the picture by overriding the ``__interpolate__``
-builtin on a module-by-module basis. For example, the following implementation
-would delegate interpolation calls to string.Template::
+Internationalisation enters the picture by writing a custom interpolator that
+performs internationalisation. For example, the following implementation
+would delegate interpolation calls to ``string.Template``::

-    def _interpolation_fields_to_template(fields, expressions):
-        if not all(expr.isidentifier() for expr in expressions):
-            raise ValueError("Only variable substitions permitted for il8n")
-        template_parts = []
-        for literal_text, field_num, format_spec, conversion in fields:
-            if format_spec:
-                raise ValueError("Format specifiers not permitted for il8n")
-            if conversion:
-                raise ValueError("Conversion specifiers not permitted for il8n")
-            template_parts.append(literal_text)
-            if field_num is not None:
-                template_parts.append("${" + expressions[field_num] + "}")
-        return "".join(template_parts)
-
-    def __interpolate__(fields, expressions, values):
-        catalog_str = _interpolation_fields_to_template(fields, expressions)
-        translated = _(catalog_str)
-        values = {k:v for k, v in zip(expressions, values)}
+    @interpolator
+    def i18n(template, fields, values):
+        translated = gettext.gettext(template)
+        values = _build_interpolation_map(fields, values)
        return string.Template(translated).safe_substitute(values)

-If a module were to import that definition of __interpolate__ into the
-module namespace, then:
+    def _build_interpolation_map(fields, values):
+        field_values = {}
+        for literal_text, field_num, expr, conversion, format_spec in fields:
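+            if field_num is not None:
+                # Trailing literal-only entries carry None in these slots, so
+                # only validate entries that describe a substitution field.
+                assert expr.isidentifier() and not conversion and not format_spec
+                field_values[expr] = values[field_num]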
+        return field_values

-* Any i"translated & interpolated" strings would be translated
-* Any iu"untranslated & interpolated" strings would not be translated
-* Any ib"untranslated & interpolated" strings would not be translated
-* Any other string and bytes literals would not be translated unless explicitly
-  passed to the relevant translation machinery at runtime
+And would then be invoked as::

-This shifts the behaviour from the status quo, where translation support needs
-to be added explicitly to each string requiring translation to one where
-opting *in* to translation is done on a module by module basis, and
-individual interpolated strings can then be opted *out* of translation by
-adding the "u" prefix to the string literal in order to call
-``__interpolateu__`` instead of ``__interpolate__``.
+    print(!i18n "This is a $translated $message")

+Any actual implementation would need to address other issues (most notably
+message catalog extraction), but this gives the general idea of what might be
+possible.
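
To make the role of the raw template concrete, the sketch below replaces ``gettext`` with a hypothetical in-memory ``CATALOG`` dictionary so that the result does not depend on installed message catalogs. The function name ``i18n_interpolate``, the catalog contents and the hand-built tuples are all assumptions made for illustration.

```python
import string

# Hypothetical catalog keyed by the raw template, standing in for gettext.
CATALOG = {
    "Hello $name, you have $count new messages":
        "Bonjour $name, vous avez $count nouveaux messages",
}


def i18n_interpolate(raw_template, parsed_fields, field_values):
    """Look up the raw template, then substitute values by name."""
    translated = CATALOG.get(raw_template, raw_template)
    mapping = {}
    for _literal, field_num, expr, conversion, format_spec in parsed_fields:
        if field_num is None:
            continue  # trailing literal text, nothing to substitute
        if not expr.isidentifier() or conversion or format_spec:
            raise ValueError("only plain $name substitutions are supported")
        mapping[expr] = field_values[field_num]
    return string.Template(translated).safe_substitute(mapping)


# Hand-written equivalent of:
#     !i18n "Hello $name, you have $count new messages"
raw = "Hello $name, you have $count new messages"
fields = (
    ("Hello ", 0, "name", "", ""),
    (", you have ", 1, "count", "", ""),
    (" new messages", None, None, None, None),
)
message = i18n_interpolate(raw, fields, ("Jane", 3))
```

Because the lookup key is the unmodified template, translators see exactly the string that appears in the source code, and the translated template is free to reorder the named fields.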
+
+It's also worth noting that one of the benefits of the ``$`` based substitution
+syntax in this PEP is its compatibility with Mozilla's
+`l20n syntax <http://l20n.org/>`__, which uses ``{{ name }}`` for global
+substitution, and ``{{ $user }}`` for local context substitution.
+
+With the syntax in this PEP, an l20n interpolator could be written as::
+
+    translated = !l20n "{{ $user }} is running {{ appname }}"
+
+With the syntax proposed in PEP 498 (and neglecting the difficulty of doing
+catalog lookups using PEP 498's semantics), the necessary brace escaping would
+make the string look like this in order to interpolate the user variable
+while preserving all of the expected braces::
+
+    interpolated = "{{{{ ${user} }}}} is running {{{{ appname }}}}"

 Discussion
 ==========
@@ -344,19 +378,42 @@ Discussion
 Refer to PEP 498 for additional discussion, as several of the points there
 also apply to this PEP.

-Preserving the unmodified format string
---------------------------------------
+Compatibility with IPython magic strings
+----------------------------------------

-A lot of the complexity in the il8n example is actually in recreating the
-original format string from its component parts. It may make sense to preserve
-and pass that entire string to the interpolation function, in addition to
-the broken down field definitions.
+IPython uses "!" to introduce custom interactive constructs. These are only
+used at statement level, and could continue to be special cased in the
+IPython runtime.

-This approach would also allow translators to more consistently benefit from
-the simplicity of the PEP 292 approach to string formatting (in the example
-above, surrounding braces are added to the catalog strings even for cases that
-don't need them)
+This existing usage *did* help inspire the syntax proposed in this PEP.

+Preserving the raw template string
+----------------------------------
+
+Earlier versions of this PEP failed to make the raw template string available
+to interpolators. This greatly complicated the i18n example, as it needed to
+reconstruct the original template to pass to the message catalog lookup.
+
+Using a magic method rather than a global name lookup
+-----------------------------------------------------
+
+Earlier versions of this PEP used an ``__interpolate__`` builtin, rather than
+a magic method on an explicitly named interpolator. Naming the interpolator
+eliminated a lot of the complexity otherwise associated with shadowing the
+builtin function in order to modify the semantics of interpolation.
+
+Relative order of conversion and format specifier in parsed fields
+------------------------------------------------------------------
+
+The relative order of the conversion specifier and the format specifier in the
+substitution field 5-tuple is defined to match the order they appear in the
+format string, which is unfortunately the inverse of the way they appear in the
+``string.Formatter.parse`` 4-tuple.
+
+I consider this a design defect in ``string.Formatter.parse``, so I think it's
+worth fixing it for the custom interpolator API, since the tuple already
+has other differences (like including both the field position number *and* the
+text of the expression).

 References
 ==========