Updated PEP 3101 to incorporate latest feedback, and simplify even further. Also added additional explanation of custom formatting classes.

2007-07-24 23:36:34 +00:00 · 2007-07-24 23:36:34 +00:00 · 00d28204ef
parent 935f64f730
commit 00d28204ef
1 changed files with 171 additions and 135 deletions
--- a/pep-3101.txt
+++ b/pep-3101.txt
@@ -155,25 +155,20 @@ Simple and Compound Field Names
    in a field expression.  The dot operator allows an attribute of
    an input value to be specified as the field value.

-    The types of expressions that can be used in a compound name
-    have been deliberately limited in order to prevent potential
-    security exploits resulting from the ability to place arbitrary
-    Python expressions inside of strings. Only two operators are
-    supported, the '.' (getattr) operator, and the '[]' (getitem)
-    operator.
-
-    Another limitation that is defined to limit potential security
-    issues is that field names or attribute names beginning with an
-    underscore are disallowed. This enforces the common convention
-    that names beginning with an underscore are 'private'.
+    Unlike some other programming languages, you cannot embed arbitrary
+    expressions in format strings.  This is by design - the types of
+    expressions that you can use are deliberately limited.  Only two operators
+    are supported: the '.' (getattr) operator, and the '[]' (getitem)
+    operator.  The reason for allowing these operators is that they don't
+    normally have side effects in non-pathological code.

    An example of the 'getitem' syntax:

        "My name is {0[name]}".format(dict(name='Fred'))

-    It should be noted that the use of 'getitem' within a string is
-    much more limited than its normal use. In the above example, the
-    string 'name' really is the literal string 'name', not a variable
+    It should be noted that the use of 'getitem' within a format string
+    is much more limited than its conventional usage.  In the above example,
+    the string 'name' really is the literal string 'name', not a variable
    named 'name'.  The rules for parsing an item key are very simple.
    If it starts with a digit, then it's treated as a number, otherwise
    it is used as a string.
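
A minimal sketch of the two supported operators, run against the str.format that eventually shipped in CPython rather than the draft described in this hunk. The Person class and the variable names are hypothetical, introduced only for illustration.

```python
class Person:
    """Hypothetical class used only to demonstrate attribute lookup."""
    def __init__(self, name):
        self.name = name

# '[]' (getitem): the key is the literal string 'name', not a variable.
by_key = "My name is {0[name]}".format(dict(name="Fred"))

# '.' (getattr): looks up the 'name' attribute on the first argument.
by_attr = "My name is {0.name}".format(Person("Fred"))

# A key made of digits is treated as a number, so this selects the
# integer key 1 and never the string key "1".
mixed = {1: "integer key", "1": "string key"}
by_digit = "{0[1]}".format(mixed)
```

Because the key is never evaluated as an expression, a string key that consists only of digits cannot be reached from a format string.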
@@ -187,9 +182,7 @@ Simple and Compound Field Names
    of the underlying object to throw an exception if the identifier
    is not legal.  The format function will have a minimalist parser
    which only attempts to figure out when it is "done" with an
-    identifier (by finding a '.' or a ']', or '}', etc.)  The only
-    exception to this laissez-faire approach is that, by default,
-    strings are not allowed to have leading underscores.
+    identifier (by finding a '.' or a ']', or '}', etc.).


 Conversion Specifiers
@@ -269,7 +262,7 @@ Standard Conversion Specifiers
        '+'  - indicates that a sign should be used for both
               positive as well as negative numbers
        '-'  - indicates that a sign should be used only for negative
-               numbers (this is the default behaviour)
+               numbers (this is the default behavior)
        ' '  - indicates that a leading space should be used on
               positive numbers
        '()' - indicates that negative numbers should be surrounded
@@ -381,9 +374,8 @@ User-Defined Formatting
    lives in the 'string' module.  This class takes additional options
    which are not accessible via the normal str.format method.
    
-    An application can create their own Formatter instance which has
-    customized behavior, either by setting the properties of the
-    Formatter instance, or by subclassing the Formatter class.
+    An application can subclass the Formatter class to create its
+    own customized formatting behavior.

    The PEP does not attempt to exactly specify all methods and
    properties defined by the Formatter class; Instead, those will be
@@ -391,46 +383,25 @@ User-Defined Formatting
    PEP will specify the general requirements for the Formatter class,
    which are listed below.

-
-Formatter Creation and Initialization
-
-    The Formatter class takes a single initialization argument, 'flags':
-
-        Formatter(flags=0)
-
-    The 'flags' argument is used to control certain subtle behavioral
-    differences in formatting that would be cumbersome to change via
-    subclassing. The flags values are defined as static variables
-    in the "Formatter" class:
-
-        Formatter.ALLOW_LEADING_UNDERSCORES
-
-            By default, leading underscores are not allowed in identifier
-            lookups (getattr or getitem).  Setting this flag will allow
-            this.
-
-        Formatter.CHECK_UNUSED_POSITIONAL
-
-            If this flag is set, the any positional arguments which are
-            supplied to the 'format' method but which are not used by
-            the format string will cause an error.
-
-        Formatter.CHECK_UNUSED_NAME
-
-            If this flag is set, the any named arguments which are
-            supplied to the 'format' method but which are not used by
-            the format string will cause an error.
+    Although string.format() does not directly use the Formatter class
+    to do formatting, both use the same underlying implementation.  The
+    reason that string.format() does not use the Formatter class directly
+    is that "string" is a built-in type, which means that all of its
+    methods must be implemented in C, whereas Formatter is a Python
+    class.  Formatter provides an extensible wrapper around the same
+    C functions as are used by string.format().


 Formatter Methods

-    The methods of class Formatter are as follows:
+    The Formatter class takes no initialization arguments:
+    
+        fmt = Formatter()
+
+    The public API methods of class Formatter are as follows:

        -- format(format_string, *args, **kwargs)
        -- vformat(format_string, args, kwargs)
-        -- get_positional(args, index)
-        -- get_named(kwds, name)
-        -- format_field(value, conversion)
        
    'format' is the primary API method.  It takes a format template,
    and an arbitrary set of positional and keyword arguments.  'format'
@@ -442,23 +413,38 @@ Formatter Methods
    repacking the dictionary as individual arguments using the '*args' and
    '**kwds' syntax.  'vformat' does the work of breaking up the format
    template string into character data and replacement fields.  It calls
-    the 'get_positional' and 'get_index' methods as appropriate.
+    the 'get_positional' and 'get_named' methods as appropriate (described
+    below).

-    Note that the checking of unused arguments, and the restriction on
-    leading underscores in attribute names are also done in this function.
+    Formatter defines the following overridable methods:
+        
+        -- get_positional(args, index)
+        -- get_named(kwds, name)
+        -- check_unused_args(used_args, args, kwargs)
+        -- format_field(value, conversion)

    'get_positional' and 'get_named' are used to retrieve a given field
    value.  For compound field names, these functions are only called for
    the first component of the field name; Subsequent components are
-    handled through normal attribute and indexing operations. So for
-    example, the field expression '0.name' would cause 'get_positional' to
-    be called with the list of positional arguments and a numeric index of
-    0, and then the standard 'getattr' function would be called to get the
-    'name' attribute of the result.
+    handled through normal attribute and indexing operations.
+    
+    So for example, the field expression '0.name' would cause
+    'get_positional' to be called with the parameter 'args' set to the
+    list of positional arguments to vformat, and 'index' set to zero;
+    the returned value would then be passed to the standard 'getattr'
+    function to get the 'name' attribute.

    If the index or keyword refers to an item that does not exist, then an
    IndexError/KeyError will be raised.
    
+    'check_unused_args' is used to implement checking for unused arguments
+    if desired.  The arguments to this function are the set of all argument
+    keys that were actually referred to in the format string (integers for
+    positional arguments, and strings for named arguments), and a reference
+    to the args and kwargs that were passed to vformat.  The set of unused
+    args can be calculated from these parameters: it is every key in args
+    and kwargs that is not in the set of used keys.  'check_unused_args'
+    is assumed to throw an exception if the check fails.
+
    'format_field' actually generates the text for a replacement field.
    The 'value' argument corresponds to the value being formatted, which
    was retrieved from the arguments using the field name.  The
@@ -466,11 +452,46 @@ Formatter Methods
    will be either a string or unicode object, depending on the type of
    the original format string.
    
-    Note: The final implementation of the Formatter class may define
-    additional overridable methods and hooks. In particular, it may be
-    that 'vformat' is itself a composition of several additional,
-    overridable methods. (Depending on whether it is convenient to the
-    implementor of Formatter.)
+    To get a better understanding of how these functions relate to each
+    other, here is pseudocode that explains the general operation of
+    vformat:
+    
+        def vformat(self, format_string, args, kwargs):
+        
+          # Output buffer and set of used args
+          buffer = StringIO.StringIO()
+          used_args = set()
+          
+          # Tokens are either format fields or literal strings
+          for token in self.parse(format_string):
+            if is_format_field(token):
+              # Split at the first ':' only; the conversion may contain ':'
+              field_spec, _, conversion_spec = token.partition(":")
+              
+              # 'first_part' is the part before the first '.' or '['
+              first_part = get_first_part(field_spec)
+              used_args.add(first_part)
+              if is_positional(first_part):
+                value = self.get_positional(args, first_part)
+              else:
+                value = self.get_named(kwargs, first_part)
+                
+              # Handle [subfield] or .subfield
+              for comp in components(field_spec):
+                value = resolve_subfield(value, comp)
+
+              # Write out the converted value
+              buffer.write(self.format_field(value, conversion_spec))
+              
+            else:
+              buffer.write(token)
+              
+          self.check_unused_args(used_args, args, kwargs)
+          return buffer.getvalue()
+          
+    Note that the actual algorithm of the Formatter class may not be the
+    one presented here.  In particular, the final implementation of
+    the Formatter class may define additional overridable methods and
+    hooks.  Also, the final implementation will be written in C.
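
The 'check_unused_args' hook described in this hunk can be tried against the string.Formatter class that later shipped in the standard library. This is only a sketch under that assumption: the released class exposes a single 'get_value' method in place of the draft's 'get_positional' and 'get_named', and the StrictFormatter name is hypothetical.

```python
import string

class StrictFormatter(string.Formatter):
    """Hypothetical subclass that rejects arguments the template ignores."""

    def check_unused_args(self, used_args, args, kwargs):
        # used_args holds integers for positional fields and strings
        # for named fields that the format string referred to.
        unused = [i for i in range(len(args)) if i not in used_args]
        unused += [k for k in kwargs if k not in used_args]
        if unused:
            raise ValueError("unused arguments: %r" % (unused,))

fmt = StrictFormatter()
result = fmt.format("{0} and {name}", "a", name="b")
```

Calling fmt.format("{0}", "a", "extra") with this subclass raises ValueError, whereas the base class accepts the extra argument silently.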


 Customizing Formatters
@@ -527,8 +548,7 @@ Error handling

    There are two classes of exceptions which can occur during formatting:
    exceptions generated by the formatter code itself, and exceptions
-    generated by user code (such as a field object's getattr function, or
-    the field_hook function).
+    generated by user code (such as a field object's 'getattr' function).

    In general, exceptions generated by the formatter code itself are
    of the "ValueError" variety -- there is an error in the actual "value"
@@ -615,6 +635,38 @@ Alternate Syntax
    what .Net uses.


+Alternate Feature Proposals
+
+    Restricting attribute access: An earlier version of the PEP
+    restricted the ability to access attributes beginning with a
+    leading underscore, for example "{0._private}".  However, this
+    is a useful ability to have when debugging, so the feature
+    was dropped.
+    
+    Some developers suggested that the ability to do 'getattr' and
+    'getitem' access should be dropped entirely.  However, this
+    is in conflict with the needs of another set of developers who
+    strongly lobbied for the ability to pass in a large dict as a
+    single argument (without flattening it into individual keyword
+    arguments using the **kwargs syntax) and then have the format
+    string refer to dict entries individually.
+    
+    There have also been suggestions to expand the set of expressions
+    that are allowed in a format string.  However, this was seen
+    to go against the spirit of TOOWTDI, since the same effect can
+    be achieved in most cases by executing the same expression on
+    the parameter before it's passed in to the formatting function.
+    For cases where the format string is being used to do arbitrary
+    formatting in a data-rich environment, it's recommended to use
+    a templating engine specialized for this purpose, such as
+    Genshi [5] or Cheetah [6].
+    
+    Many other features were considered and rejected because they
+    could easily be achieved by subclassing Formatter instead of
+    building the feature into the base implementation.  This includes
+    alternate syntax, comments in format strings, and many others.
+    
+
 Security Considerations

    Historically, string formatting has been a common source of
@@ -622,43 +674,21 @@ Security Considerations
    string templating system allows arbitrary expressions to be
    embedded in format strings.

-    The typical scenario is one where the string data being processed
-    is coming from outside the application, perhaps from HTTP headers
-    or fields within a web form. An attacker could substitute their
-    own strings designed to cause havok.
+    The best way to use string formatting in a way that does not
+    create potential security holes is to never use format strings
+    that come from an untrusted source.
    
-    The string formatting system outlined in this PEP is by no means
-    'secure', in the sense that no Python library module can, on its
-    own, guarantee security, especially given the open nature of
-    the Python language. Building a secure application requires a
-    secure approach to design.
-
-    What this PEP does attempt to do is make the job of designing a
-    secure application easier, by making it easier for a programmer
-    to reason about the possible consequences of a string formatting
-    operation. It does this by limiting those consequences to a smaller
-    and more easier understood subset.
-
-    For example, because it is possible in Python to override the
-    'getattr' operation of a type, the interpretation of a compound
-    replacement field such as "0.name" could potentially run
-    arbitrary code.
-
-    However, it is *extremely* rare for the mere retrieval of an
-    attribute to have side effects. Other operations which are more
-    likely to have side effects - such as method calls - are disallowed.
-    Thus, a programmer can be reasonably assured that no string
-    formatting operation will cause a state change in the program.
-    This assurance is not only useful in securing an application, but
-    in debugging it as well.
-
-    Similarly, the restriction on field names beginning with
-    underscores is intended to provide similar assurances about the
-    visibility of private data.
-
-    Of course, programmers would be well-advised to avoid using
-    any external data as format strings, and instead use that data
-    as the format arguments instead.
+    Barring that, the next best approach is to ensure that string
+    formatting has no side effects.  Because of the open nature of
+    Python, it is impossible to guarantee that any non-trivial
+    operation has this property.  What this PEP does is limit the
+    types of expressions in format strings to those in which visible
+    side effects are both rare and strongly discouraged by the
+    culture of Python developers.  So for example, attribute access
+    is allowed because it would be considered pathological to write
+    code where the mere access of an attribute has visible side
+    effects (whether the code has *invisible* side effects - such
+    as creating a cache entry for faster lookup - is irrelevant.)
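
A small sketch of why the first recommendation matters, using the str.format that shipped in CPython. The Config class and its 'secret' attribute are hypothetical. Attribute access has no side effects here, yet an untrusted format string can still read any attribute reachable from its arguments.

```python
class Config:
    """Hypothetical object holding both public and sensitive data."""
    secret = "hunter2"

    def __init__(self, user):
        self.user = user

cfg = Config("alice")
untrusted = "{0.secret}"  # imagine this text arrived from a web form

# Risky: the untrusted text is used as the format string itself,
# so it can navigate to attributes the application never meant to show.
leaked = untrusted.format(cfg)

# Safer: the untrusted text is passed as an argument and is never parsed.
safe = "Hello, {0}".format(untrusted)
```

In the second call the braces in the untrusted text are copied through verbatim, because only the format string itself is scanned for replacement fields.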


 Sample Implementation
@@ -693,6 +723,12 @@ References
    [4] Composite Formatting - [.Net Framework Developer's Guide]
        http://msdn.microsoft.com/library/en-us/cpguide/html/cpconcompositeformatting.asp?frame=true
        
+    [5] Genshi templating engine.
+        http://genshi.edgewall.org/
+
+    [6] Cheetah - The Python-Powered Template Engine.
+        http://www.cheetahtemplate.org/
+

 Copyright