PEP: 292
Title: Simpler String Substitutions
Version: $Revision$
Last-Modified: $Date$
Author: barry@zope.com (Barry A. Warsaw)
Status: Draft
Type: Standards Track
Created: 18-Jun-2002
Python-Version: 2.3
Post-History: 18-Jun-2002


Abstract

    This PEP describes a simpler string substitution feature, also
    known as string interpolation.  This PEP is "simpler" in two
    respects:

    1. Python's current string substitution feature (commonly known as
       %-substitutions) is complicated and error prone.  This PEP is
       simpler at the cost of less expressiveness.

    2. PEP 215 proposed an alternative string interpolation feature,
       introducing a new `$' string prefix.  PEP 292 is simpler than
       this because it involves no syntax changes and has much simpler
       rules for what substitutions can occur in the string.
       

Rationale

    Python currently supports a string substitution (a.k.a. string
    interpolation) syntax based on C's printf() % formatting
    character[1].  While quite rich, %-formatting codes are also quite
    error prone, even for experienced Python programmers.  A common
    mistake is to leave off the trailing format character, e.g. the
    `s' in "%(name)s".

    In addition, the rules for what can follow a % sign are fairly
    complex, while the usual application rarely needs such
    complexity.  Also error prone is the right-hand side of the %
    operator: e.g. singleton tuples.

    Most scripts need to do some string interpolation, but most of
    those use simple `stringification' formats, i.e. %s or %(name)s.
    This form should be made simpler and less error prone.
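
    The two pitfalls named above can be reproduced with the existing
    % operator.  The following is a small illustration only; the
    variable names are invented for this sketch.

```python
values = {'name': 'Guido'}

# Correct: the trailing 's' conversion character is present.
good = '%(name)s was here' % values

# Pitfall 1: the trailing 's' is missing, so the characters that
# follow ')' are parsed as a flag and a conversion character.
try:
    '%(name) was here' % values
    missing_s_failed = False
except ValueError:
    missing_s_failed = True

# Pitfall 2: a tuple on the right-hand side is unpacked as the
# argument list, so a tuple *value* must be wrapped in a singleton
# tuple before it can be formatted with a single %s.
point = (1, 2)
try:
    '%s' % point
    bare_tuple_failed = False
except TypeError:
    bare_tuple_failed = True

wrapped = '%s' % (point,)
```

    Here both failing cases raise at run time, and only the version
    with the explicit singleton tuple formats the pair as a whole.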


A Simpler Proposal

    Here we propose the addition of a new string method, called .sub()
    which performs substitution of mapping values into a string with
    special substitution placeholders.  These placeholders are
    introduced with the $ character.  The following rules for
    $-placeholders apply:

    1. $$ is an escape; it is replaced with a single $

    2. $identifier names a substitution placeholder matching a mapping
       key of "identifier".  "identifier" must be a Python identifier
       as defined in [2].  The first non-identifier character after
       the $ character terminates this placeholder specification.

    3. ${identifier} is equivalent to $identifier.  It is required
       when valid identifier characters follow the placeholder but are
       not part of the placeholder, e.g. "${noun}ification".

    No other characters have special meaning.
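
    The three rules can be captured in a single regular expression.
    The sketch below is illustrative only: PLACEHOLDER and
    placeholder_names() are names invented here, and identifiers are
    assumed to be ASCII-only.

```python
import re

# Group 1: the $$ escape.
# Group 2: a bare $identifier.
# Group 3: a braced ${identifier}.
PLACEHOLDER = re.compile(
    r'\$(?:(\$)'
    r'|([_a-zA-Z][_a-zA-Z0-9]*)'
    r'|\{([_a-zA-Z][_a-zA-Z0-9]*)\})')

def placeholder_names(template):
    """Return the placeholder names in template, in order of
    appearance.  The $$ escape is recognized but names nothing."""
    names = []
    for m in PLACEHOLDER.finditer(template):
        if m.group(1) is None:
            names.append(m.group(2) or m.group(3))
    return names

names = placeholder_names('$$5 for ${noun}ification by $who.')
```

    Under these assumptions names is ['noun', 'who']: the $$ is
    consumed as an escape, the braces separate "noun" from the
    identifier characters that follow it, and the period terminates
    "who".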

    The .sub() method takes an optional mapping (e.g. dictionary)
    where the keys match placeholders in the string, and the values
    are substituted for the placeholders.  For example:

        '${name} was born in ${country}'.sub({'name': 'Guido',
                                              'country': 'the Netherlands'})

    returns

        'Guido was born in the Netherlands'

    The mapping argument is optional; if it is omitted then the
    mapping is taken from the locals and globals of the context in
    which the .sub() method is executed.  For example:

        def birth(self, name):
            country = self.countryOfOrigin[name]
            return '${name} was born in ${country}'.sub()

        obj.birth('Guido')    # obj is an instance of the enclosing class

    returns

        'Guido was born in the Netherlands'


Why `$' and Braces?

    The BDFL said it best: The $ means "substitution" in so many
    languages besides Perl that I wonder where you've been. [...] 
    We're copying this from the shell.


Security Issues

    Never use no-arg .sub() on strings that come from untrusted
    sources.  It could be used to gain unauthorized information about
    variables in your local or global scope.
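
    The risk can be demonstrated with a stand-in for the proposed
    method.  In this sketch, sub() is a hypothetical module-level
    function (not the proposed string method), and greet() and its
    local variables are invented for illustration.

```python
import re
import sys

_PLACEHOLDER = re.compile(r'\$(?:([_a-zA-Z]\w*)|\{([_a-zA-Z]\w*)\})')

def sub(template, mapping=None):
    # With no mapping, fall back to the caller's globals and locals,
    # as the no-argument form of .sub() is described as doing.
    if mapping is None:
        frame = sys._getframe(1)
        mapping = dict(frame.f_globals)
        mapping.update(dict(frame.f_locals))
    def repl(m):
        return str(mapping[m.group(m.lastindex)])
    return _PLACEHOLDER.sub(repl, template)

def greet(untrusted_template):
    user = 'guest'
    password = 'hunter2'    # never meant to be shown
    return sub(untrusted_template)

def greet_safely(untrusted_template):
    user = 'guest'
    password = 'hunter2'
    # Only the names listed here are reachable from the template.
    return sub(untrusted_template, {'user': user})

leaked = greet('Hello ${user}, your password is ${password}')
```

    The template author, not the programmer, chose to read
    "password".  With an explicit mapping, as in greet_safely(), the
    same template raises KeyError instead of disclosing the value.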


Reference Implementation

    Here's a Python 2.2-based reference implementation.  Of course the
    real implementation would be in C, would not require a string
    subclass, and would not be modeled on the existing %-interpolation
    feature.

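        import sys
        import re

        class dstr(str):
            # Matches $$ (escape), $identifier, or ${identifier}
            pattern = re.compile(
                r'\$(?:(\$)|([_a-zA-Z]\w*)|\{([_a-zA-Z]\w*)\})')

            def sub(self, mapping=None):
                # Default mapping is locals/globals of caller
                if mapping is None:
                    frame = sys._getframe(1)
                    mapping = frame.f_globals.copy()
                    mapping.update(frame.f_locals)
                def repl(m):
                    if m.group(1):
                        return '$'      # rule 1: $$ becomes $
                    # Stringify so non-string values can be substituted
                    return str(mapping[m.group(m.lastindex)])
                return self.pattern.sub(repl, self)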
    
    And here are some examples:

	s = dstr('${name} was born in ${country}')
	print s.sub({'name': 'Guido',
		     'country': 'the Netherlands'})

	name = 'Barry'
	country = 'the USA'
	print s.sub()

    This will print "Guido was born in the Netherlands" followed by
    "Barry was born in the USA".


Handling Missing Keys

    What should happen when one of the substitution keys is missing
    from the mapping (or the locals/globals namespace if no argument
    is given)?  There are two possibilities:

    - We can simply allow the exception.

    - We can return the original substitution placeholder unchanged.

    Under the first option, the statement

        print dstr('${name} was born in ${country}').sub({'name': 'Bob'})

    would raise:

	Traceback (most recent call last):
	  File "sub.py", line 66, in ?
	    print s.sub({'name': 'Bob'})
	  File "sub.py", line 26, in sub
	    return EMPTYSTRING.join(filter(None, parts)) % mapping
	KeyError: country

    Under the second option, the same statement

        print dstr('${name} was born in ${country}').sub({'name': 'Bob'})

    would print:

	Bob was born in ${country}

    We could almost ignore the issue, since the latter example could
    be accomplished by passing in a "safe-dictionary" instead of a
    normal dictionary, like so:

        class safedict(dict):
            def __getitem__(self, key):
                try:
                    return dict.__getitem__(self, key)
                except KeyError:
                    return '${%s}' % key

    so that

	d = safedict({'name': 'Bob'})
	print dstr('${name} was born in ${country}').sub(d)

    would print:

	Bob was born in ${country}

    The one place where this won't work is when no arguments are given
    to the .sub() method.  .sub() wouldn't know whether to wrap
    locals/globals in a safedict or not.

    This ambiguity can be solved in several ways:

    - we could have a parallel method called .safesub() which always
      wrapped its argument in a safedict()

    - .sub() could take an optional keyword argument flag which
      indicates whether to wrap the argument in a safedict or not.

    - .sub() could take an optional keyword argument which is a
      callable that would get called with the original mapping and
      return the mapping to be used for the substitution.  By default,
      this callable would be the identity function, but you could
      easily pass in the safedict constructor instead.
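
    The third option might look like the following sketch.  It is an
    assumption-laden illustration: sub() is written as a plain
    function rather than a string method, the keyword name "wrapper"
    is invented here, and the no-argument (locals/globals) case is
    left out for brevity.

```python
import re

_PLACEHOLDER = re.compile(r'\$(?:([_a-zA-Z]\w*)|\{([_a-zA-Z]\w*)\})')

class safedict(dict):
    # Missing keys yield the original placeholder text.
    def __getitem__(self, key):
        try:
            return dict.__getitem__(self, key)
        except KeyError:
            return '${%s}' % key

def identity(mapping):
    return mapping

def sub(template, mapping, wrapper=identity):
    # The wrapper is called with the original mapping and returns
    # the mapping actually used for the lookups.
    mapping = wrapper(mapping)
    def repl(m):
        return str(mapping[m.group(m.lastindex)])
    return _PLACEHOLDER.sub(repl, template)

template = '${name} was born in ${country}'
soft = sub(template, {'name': 'Bob'}, wrapper=safedict)
```

    With the default identity wrapper the same call raises KeyError,
    so the caller chooses between hard and soft failure per call.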

    BDFL proto-pronouncement: Strongly in favor of raising the
    exception, with KeyError when a dict is used and NameError when
    locals/globals are used.  There may not be sufficient use case for
    soft failures in the no-argument version.
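
    One possible reading of this pronouncement is sketched below.
    It is not the reference implementation: sub() is a hypothetical
    module-level function, and the wording of the NameError message
    is an assumption.

```python
import re
import sys

_PLACEHOLDER = re.compile(r'\$(?:([_a-zA-Z]\w*)|\{([_a-zA-Z]\w*)\})')

def sub(template, mapping=None):
    # Remember whether the caller supplied a mapping, since that
    # decides which exception a missing name should raise.
    implicit = mapping is None
    if implicit:
        frame = sys._getframe(1)
        mapping = dict(frame.f_globals)
        mapping.update(dict(frame.f_locals))
    def repl(m):
        name = m.group(m.lastindex)
        try:
            return str(mapping[name])
        except KeyError:
            if implicit:
                # Lookup was in locals/globals: report it the way
                # an undefined variable would be reported.
                raise NameError("name %r is not defined" % name)
            raise
    return _PLACEHOLDER.sub(repl, template)

def explicit_failure():
    try:
        sub('$no_such_name_xyz', {})
    except KeyError:
        return 'KeyError'

def implicit_failure():
    try:
        sub('$no_such_name_xyz')
    except NameError:
        return 'NameError'
```

    Calling explicit_failure() reports a KeyError, while
    implicit_failure() reports a NameError for the same template.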


Open Issues, Comments, and Suggestions

    - Ka-Ping Yee makes the suggestion that .sub() should take keyword
      arguments instead of a dictionary, and that if a dictionary was
      to be passed in it should be done with **dict.  For example:

      s = '${name} was born in ${country}'
      print s.sub(name='Guido', country='the Netherlands')

      or

      print s.sub(**{'name': 'Guido', 'country': 'the Netherlands'})

    - Paul Prescod wonders whether having a method use sys._getframe()
      doesn't set a bad precedent.

    - Oren Tirosh suggests that .sub() take an optional argument which
      would be used as a default for missing keys.  If the optional
      argument were not given, an exception would be raised.  This may
      not play well with Ka-Ping's suggestion.

    - Other suggestions have been made as an alternative to a string
      method including: a builtin function, a function in a module, an
      operator (similar to "string % dict", e.g. "string / dict").
      One strong argument for making it a built-in is given by Paul
      Prescod:

      "I really hate putting things in modules that will be needed in
       a Python programmer's second program (the one after "Hello
       world").  If this is to be the *simpler* way of doing
       introspection then getting at it should be simpler than getting
       at "%".  $ is taught in hour 2, import is taught on day 2.
       Some people may never make it to the metaphorical day 2 if they
       are doing simple text processing in some kind of
       embedded-Python environment."

    - Should we take a cue from the `make' program and allow $(name)
      as an alternative to (or instead of) ${name}?

    - Should we require a dictionary argument to the .sub() method?
      Some people feel that it could be a security risk allowing
      implicit access to globals/locals, even with the proper
      admonitions in the documentation.  In that case, a new built-in
      would be necessary (because none of globals(), locals(), or
      vars() does the right thing w.r.t. nested scopes, etc.).
      Christian Tismer has suggested allvars().  Perhaps allvars()
      should (also?) be a method on a frame object?

    - It has been suggested that using $ at all violates TOOWTDI.
      Some other suggestions include using the % sign in the
      following way: %{name}
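
    Ka-Ping Yee's keyword-argument suggestion from the list above
    can be sketched as follows.  This is only an illustration: sub()
    is written as a hypothetical plain function whose first
    parameter stands in for the string itself.

```python
import re

_PLACEHOLDER = re.compile(r'\$(?:([_a-zA-Z]\w*)|\{([_a-zA-Z]\w*)\})')

def sub(template, **kws):
    # Placeholder values arrive as keyword arguments; an existing
    # dictionary is passed with the ** call syntax.
    def repl(m):
        return str(kws[m.group(m.lastindex)])
    return _PLACEHOLDER.sub(repl, template)

s = '${name} was born in ${country}'
by_keywords = sub(s, name='Guido', country='the Netherlands')
by_dict = sub(s, **{'name': 'Guido', 'country': 'the Netherlands'})
```

    Both calls give the same result.  One consequence of this design
    is that any other named parameter of the function (here,
    "template") can no longer be used as a placeholder name, which
    is one way an extra "default" argument could collide with it.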


Comparison to PEP 215

    PEP 215 describes an alternate proposal for string interpolation.
    Unlike that PEP, this one does not propose any new syntax for
    Python.  All the proposed new features are embodied in a new
    string method.  PEP 215 proposes a new string prefix
    representation such as $"" which signals to Python that a new type
    of string is present.  $-strings would have to interact with the
    existing r-prefixes and u-prefixes, essentially doubling the
    number of string prefix combinations.

    PEP 215 also allows for arbitrary Python expressions inside the
    $-strings, so that you could do things like:

	import sys
	print $"sys = $sys, sys = $sys.modules['sys']"

    which would print

	sys = <module 'sys' (built-in)>, sys = <module 'sys' (built-in)>
 
    It's generally accepted that the rules in PEP 215 are safe in the
    sense that they introduce no new security issues (see PEP 215,
    "Security Issues" for details).  However, the rules are still
    quite complex, and make it more difficult to see what exactly is
    the substitution placeholder in the original $-string.

    By design, this PEP does not provide as much interpolation power
    as PEP 215.  However, it is expected that the no-argument version
    of .sub() allows at least as much power with no loss of
    readability.
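
    One way to read this claim is that an expression which PEP 215
    would evaluate inside the string is first bound to a name, and
    the name is then substituted.  The sketch below makes that
    assumption; sub() is a hypothetical stand-in function, and an
    explicit mapping is used instead of the no-argument form so that
    the example is self-contained.

```python
import re
import sys

_PLACEHOLDER = re.compile(r'\$(?:([_a-zA-Z]\w*)|\{([_a-zA-Z]\w*)\})')

def sub(template, mapping):
    def repl(m):
        return str(mapping[m.group(m.lastindex)])
    return _PLACEHOLDER.sub(repl, template)

# PEP 215 form:  $"sys = $sys, sys = $sys.modules['sys']"
# Here the expression is evaluated in ordinary code first ...
mod = sys.modules['sys']
# ... and only simple names appear in the template.
result = sub('sys = $sys, sys = $mod', {'sys': sys, 'mod': mod})
```

    The template stays free of expressions, at the cost of one extra
    assignment in the surrounding code.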


BDFL Weathervane

    Guido lays out[3] what he feels are the real issues that need to
    be fleshed out in this PEP:

    - Compile-time vs. run-time parsing.  I've become convinced that
      the compiler should do the parsing: this is the only way to make
      access to variables in nested scopes work, avoids security
      issues, and makes it easier to diagnose errors (e.g. in
      PyChecker).

    - How to support translation.  Here the template must be replaced
      at run-time, but it is still desirable that the collection of
      available names is known at compile time (to avoid the security
      issues).

    - Optional formatting specifiers.  I agree with Lalo that these
      should not be part of the interpolation syntax but need to be
      dealt with at a different level.  I think these are only
      relevant for numeric data.  Funny, there's still a
      (now-deprecated) module fpformat.py that supports arbitrary
      floating point formatting, and string.zfill() supports a bit of
      integer formatting.


References

    [1] String Formatting Operations
        http://www.python.org/doc/current/lib/typesseq-strings.html

    [2] Identifiers and Keywords
        http://www.python.org/doc/current/ref/identifiers.html

    [3] Guido's python-dev posting from 21-Jul-2002
        http://mail.python.org/pipermail/python-dev/2002-July/026397.html


Copyright

    This document has been placed in the public domain.


Local Variables:
mode: indented-text
indent-tabs-mode: nil
sentence-end-double-space: t
fill-column: 70
End: