Updated PEP 3101 to incorporate latest feedback, and simplify even further. Also added additional explanation of custom formatting classes.

2007-07-24 23:36:34 +00:00 · 2007-07-24 23:36:34 +00:00 · 00d28204ef
parent 935f64f730
commit 00d28204ef
1 changed files with 171 additions and 135 deletions
--- a/pep-3101.txt
+++ b/pep-3101.txt
@ -155,25 +155,20 @@ Simple and Compound Field Names
    in a field expression.  The dot operator allows an attribute of
    an input value to be specified as the field value.
-    The types of expressions that can be used in a compound name
+    Unlike some other programming languages, you cannot embed arbitrary
-    have been deliberately limited in order to prevent potential
+    expressions in format strings.  This is by design - the types of
-    security exploits resulting from the ability to place arbitrary
+    expressions that you can use are deliberately limited.  Only two operators
-    Python expressions inside of strings. Only two operators are
+    are supported: the '.' (getattr) operator, and the '[]' (getitem)
-    supported, the '.' (getattr) operator, and the '[]' (getitem)
+    operator.  The reason for allowing these operators is that they don't
-    operator.
+    normally have side effects in non-pathological code.
    Another limitation that is defined to limit potential security
    issues is that field names or attribute names beginning with an
    underscore are disallowed. This enforces the common convention
    that names beginning with an underscore are 'private'.
    An example of the 'getitem' syntax:
        "My name is {0[name]}".format(dict(name='Fred'))
-    It should be noted that the use of 'getitem' within a string is
+    It should be noted that the use of 'getitem' within a format string
-    much more limited than its normal use. In the above example, the
+    is much more limited than its conventional usage.  In the above example,
-    string 'name' really is the literal string 'name', not a variable
+    the string 'name' really is the literal string 'name', not a variable
    named 'name'.  The rules for parsing an item key are very simple.
    If it starts with a digit, then it's treated as a number, otherwise
    it is used as a string.
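
A minimal sketch of the item-key rules described above, using the built-in str.format method. The variable names (person, pets, name) are illustrative only and are not part of the PEP text.

```python
person = {"name": "Fred"}
pets = ["cat", "dog"]
name = "age"  # deliberately unrelated: the format string never reads this variable

# A non-numeric key is used as the literal string 'name'
greeting = "My name is {0[name]}".format(person)

# A numeric key is converted to an integer index
second = "Second pet: {0[1]}".format(pets)

print(greeting)
print(second)
```

The variable called name has no influence on the first lookup, which is the point the paragraph above makes about 'getitem' being more limited inside a format string.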
@ -187,9 +182,7 @@ Simple and Compound Field Names
    of the underlying object to throw an exception if the identifier
    is not legal.  The format function will have a minimalist parser
    which only attempts to figure out when it is "done" with an
-    identifier (by finding a '.' or a ']', or '}', etc.)  The only
+    identifier (by finding a '.' or a ']', or '}', etc.).
    exception to this laissez-faire approach is that, by default,
    strings are not allowed to have leading underscores.
 Conversion Specifiers
@ -269,7 +262,7 @@ Standard Conversion Specifiers
        '+'  - indicates that a sign should be used for both
               positive as well as negative numbers
        '-'  - indicates that a sign should be used only for negative
-               numbers (this is the default behaviour)
+               numbers (this is the default behavior)
        ' '  - indicates that a leading space should be used on
               positive numbers
        '()' - indicates that negative numbers should be surrounded
@ -381,9 +374,8 @@ User-Defined Formatting
    lives in the 'string' module.  This class takes additional options
    which are not accessible via the normal str.format method.
-    An application can create their own Formatter instance which has
+    An application can subclass the Formatter class to create their
-    customized behavior, either by setting the properties of the
+    own customized formatting behavior.
    Formatter instance, or by subclassing the Formatter class.
    The PEP does not attempt to exactly specify all methods and
    properties defined by the Formatter class; instead, those will be
@ -391,46 +383,25 @@ User-Defined Formatting
    PEP will specify the general requirements for the Formatter class,
    which are listed below.
-
+    Although string.format() does not directly use the Formatter class
-Formatter Creation and Initialization
+    to do formatting, both use the same underlying implementation.  The
-
+    reason that string.format() does not use the Formatter class directly
-    The Formatter class takes a single initialization argument, 'flags':
+    is that "string" is a built-in type, which means that all of its
-
+    methods must be implemented in C, whereas Formatter is a Python
-        Formatter(flags=0)
+    class.  Formatter provides an extensible wrapper around the same
-
+    C functions as are used by string.format().
    The 'flags' argument is used to control certain subtle behavioral
    differences in formatting that would be cumbersome to change via
    subclassing. The flags values are defined as static variables
    in the "Formatter" class:
        Formatter.ALLOW_LEADING_UNDERSCORES
            By default, leading underscores are not allowed in identifier
            lookups (getattr or getitem).  Setting this flag will allow
            this.
        Formatter.CHECK_UNUSED_POSITIONAL
            If this flag is set, then any positional arguments which are
            supplied to the 'format' method but which are not used by
            the format string will cause an error.
        Formatter.CHECK_UNUSED_NAME
            If this flag is set, then any named arguments which are
            supplied to the 'format' method but which are not used by
            the format string will cause an error.
 Formatter Methods
-    The methods of class Formatter are as follows:
+    The Formatter class takes no initialization arguments:
        fmt = Formatter()
    The public API methods of class Formatter are as follows:
        -- format(format_string, *args, **kwargs)
        -- vformat(format_string, args, kwargs)
        -- get_positional(args, index)
        -- get_named(kwds, name)
        -- format_field(value, conversion)
    'format' is the primary API method.  It takes a format template,
    and an arbitrary set of positional and keyword arguments.  'format'
@ -442,23 +413,38 @@ Formatter Methods
    repacking the dictionary as individual arguments using the '*args' and
    '**kwds' syntax.  'vformat' does the work of breaking up the format
    template string into character data and replacement fields.  It calls
-    the 'get_positional' and 'get_index' methods as appropriate.
+    the 'get_positional' and 'get_named' methods as appropriate (described
    below.)
-    Note that the checking of unused arguments, and the restriction on
+    Formatter defines the following overridable methods:
-    leading underscores in attribute names are also done in this function.
+        
        -- get_positional(args, index)
        -- get_named(kwds, name)
        -- check_unused_args(used_args, args, kwargs)
        -- format_field(value, conversion)
    'get_positional' and 'get_named' are used to retrieve a given field
    value.  For compound field names, these functions are only called for
    the first component of the field name; subsequent components are
-    handled through normal attribute and indexing operations. So for
+    handled through normal attribute and indexing operations.
-    example, the field expression '0.name' would cause 'get_positional' to
+    
-    be called with the list of positional arguments and a numeric index of
+    So for example, the field expression '0.name' would cause
-    0, and then the standard 'getattr' function would be called to get the
+    'get_positional' to be called with the parameter 'args' set to the
-    'name' attribute of the result.
+    list of positional arguments to vformat, and 'index' set to zero;
    the returned value would then be passed to the standard 'getattr'
    function to get the 'name' attribute.
    If the index or keyword refers to an item that does not exist, then an
    IndexError/KeyError will be raised.
    'check_unused_args' is used to implement checking for unused arguments
    if desired.  The arguments to this function are the set of all argument
    keys that were actually referred to in the format string (integers for
    positional arguments, and strings for named arguments), and a reference
    to the args and kwargs that were passed to vformat.  The set of unused
    args is the difference between the supplied argument keys and the keys
    that were referred to.  'check_unused_args' is assumed to throw an
    exception if the check fails.
    'format_field' actually generates the text for a replacement field.
    The 'value' argument corresponds to the value being formatted, which
    was retrieved from the arguments using the field name.  The
@ -466,11 +452,46 @@ Formatter Methods
    will be either a string or unicode object, depending on the type of
    the original format string.
-    Note: The final implementation of the Formatter class may define
+    To get a better understanding of how these functions relate to each
-    additional overridable methods and hooks. In particular, it may be
+    other, here is pseudocode that explains the general operation of
-    that 'vformat' is itself a composition of several additional,
+    vformat:
-    overridable methods. (Depending on whether it is convenient to the
+    
-    implementor of Formatter.)
+        def vformat(self, format_string, args, kwargs):
          # Output buffer and set of used args
          buffer = StringIO.StringIO()
          used_args = set()
          # Tokens are either format fields or literal strings
          for token in self.parse(format_string):
            if is_format_field(token):
              # Split the token into the field name and conversion spec
              field_spec, _, conversion_spec = token.partition(":")
              # 'first_part' is the part before the first '.' or '['
              first_part = get_first_part(field_spec)
              used_args.add(first_part)
              if is_positional(first_part):
                value = self.get_positional(args, first_part)
              else:
                value = self.get_named(kwargs, first_part)
              # Handle [subfield] or .subfield
              for comp in components(field_spec):
                value = resolve_subfield(value, comp)
              # Write out the converted value
              buffer.write(self.format_field(value, conversion_spec))
            else:
              buffer.write(token)
          self.check_unused_args(used_args, args, kwargs)
          return buffer.getvalue()
    Note that the actual algorithm of the Formatter class may not be the
    one presented here.  In particular, the final implementation of
    the Formatter class may define additional overridable methods and
    hooks.  Also, the final implementation will be written in C.
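
As a hedged illustration of the 'check_unused_args' hook, here is a small sketch written against the string.Formatter class as it exists in Python's standard library today. That class differs from the draft in this commit (for instance, it exposes a single get_value hook rather than get_positional and get_named), so treat the example as an assumption about how the idea carries over. The class name StrictFormatter is hypothetical.

```python
import string


class StrictFormatter(string.Formatter):
    """Hypothetical subclass that rejects arguments the format string never uses."""

    def check_unused_args(self, used_args, args, kwargs):
        # used_args holds the keys that were actually referenced:
        # integers for positional arguments, strings for named ones.
        supplied = set(range(len(args))) | set(kwargs)
        unused = supplied - set(used_args)
        if unused:
            raise ValueError("unused arguments: %r" % sorted(unused, key=str))


fmt = StrictFormatter()
print(fmt.format("{0} is {age}", "Fred", age=42))
```

The unused set is computed as a set difference between the supplied keys and the referenced keys, so an extra positional or keyword argument makes format raise ValueError.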
 Customizing Formatters
@ -527,8 +548,7 @@ Error handling
    There are two classes of exceptions which can occur during formatting:
    exceptions generated by the formatter code itself, and exceptions
-    generated by user code (such as a field object's getattr function, or
+    generated by user code (such as a field object's 'getattr' function).
    the field_hook function).
    In general, exceptions generated by the formatter code itself are
    of the "ValueError" variety -- there is an error in the actual "value"
@ -615,6 +635,38 @@ Alternate Syntax
    what .Net uses.
 Alternate Feature Proposals
    Restricting attribute access: An earlier version of the PEP
    restricted the ability to access attributes beginning with a
    leading underscore, for example "{0._private}".  However, this
    is a useful ability to have when debugging, so the feature
    was dropped.
    Some developers suggested that the ability to do 'getattr' and
    'getitem' access should be dropped entirely.  However, this
    is in conflict with the needs of another set of developers who
    strongly lobbied for the ability to pass in a large dict as a
    single argument (without flattening it into individual keyword
    arguments using the **kwargs syntax) and then have the format
    string refer to dict entries individually.
    There have also been suggestions to expand the set of expressions
    that are allowed in a format string.  However, this was seen
    to go against the spirit of TOOWTDI, since the same effect can
    be achieved in most cases by executing the same expression on
    the parameter before it's passed in to the formatting function.
    For cases where the format string is being used to do arbitrary
    formatting in a data-rich environment, it's recommended to use
    a templating engine specialized for this purpose, such as
    Genshi [5] or Cheetah [6].
    Many other features were considered and rejected because they
    could easily be achieved by subclassing Formatter instead of
    building the feature into the base implementation.  This includes
    alternate syntax, comments in format strings, and many others.
 Security Considerations
    Historically, string formatting has been a common source of
@ -622,43 +674,21 @@ Security Considerations
    string templating system allows arbitrary expressions to be
    embedded in format strings.
-    The typical scenario is one where the string data being processed
+    The best way to use string formatting in a way that does not
-    is coming from outside the application, perhaps from HTTP headers
+    create potential security holes is to never use format strings
-    or fields within a web form. An attacker could substitute their
+    that come from an untrusted source.
    own strings designed to cause havoc.
-    The string formatting system outlined in this PEP is by no means
+    Barring that, the next best approach is to ensure that string
-    'secure', in the sense that no Python library module can, on its
+    formatting has no side effects.  Because of the open nature of
-    own, guarantee security, especially given the open nature of
+    Python, it is impossible to guarantee that any non-trivial
-    the Python language. Building a secure application requires a
+    operation has this property.  What this PEP does is limit the
-    secure approach to design.
+    types of expressions in format strings to those in which visible
-
+    side effects are both rare and strongly discouraged by the
-    What this PEP does attempt to do is make the job of designing a
+    culture of Python developers.  So for example, attribute access
-    secure application easier, by making it easier for a programmer
+    is allowed because it would be considered pathological to write
-    to reason about the possible consequences of a string formatting
+    code where the mere access of an attribute has visible side
-    operation. It does this by limiting those consequences to a smaller
+    effects (whether the code has *invisible* side effects - such
-    and more easier understood subset.
+    as creating a cache entry for faster lookup - is irrelevant.)
    For example, because it is possible in Python to override the
    'getattr' operation of a type, the interpretation of a compound
    replacement field such as "0.name" could potentially run
    arbitrary code.
    However, it is *extremely* rare for the mere retrieval of an
    attribute to have side effects. Other operations which are more
    likely to have side effects - such as method calls - are disallowed.
    Thus, a programmer can be reasonably assured that no string
    formatting operation will cause a state change in the program.
    This assurance is not only useful in securing an application, but
    in debugging it as well.
    Similarly, the restriction on field names beginning with
    underscores is intended to provide similar assurances about the
    visibility of private data.
    Of course, programmers would be well-advised to avoid using
    any external data as format strings, and to use that data
    as the format arguments instead.
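
The advice above can be sketched as follows. The Account class, its secret attribute, and the untrusted string are all hypothetical and stand in for data that arrives from outside the application.

```python
class Account(object):
    """Hypothetical object holding data that should not be exposed."""
    secret = "hunter2"


# Pretend this text arrived from a web form or an HTTP header.
untrusted = "{0.secret}"

# Risky: the external text is used as the format string, so it can
# reach attributes of whatever objects are passed to format().
leaked = untrusted.format(Account())

# Safer: the format string is fixed by the application and the
# external text is only ever a format argument.
safe = "You wrote: {0}".format(untrusted)

print(leaked)
print(safe)
```

Braces inside a format argument are not interpreted again, so in the second call the external text is copied into the result unchanged.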
 Sample Implementation
@ -693,6 +723,12 @@ References
    [4] Composite Formatting - [.Net Framework Developer's Guide]
        http://msdn.microsoft.com/library/en-us/cpguide/html/cpconcompositeformatting.asp?frame=true
    [5] Genshi templating engine.
        http://genshi.edgewall.org/
    [6] Cheetah - The Python-Powered Template Engine.
        http://www.cheetahtemplate.org/
 Copyright