Updated PEP 3101 to incorporate latest feedback, and simplify even further. Also added additional explanation of custom formatting classes.

Talin 2007-07-24 23:36:34 +00:00
parent 935f64f730
commit 00d28204ef
1 changed files with 171 additions and 135 deletions


@ -155,25 +155,20 @@ Simple and Compound Field Names
in a field expression. The dot operator allows an attribute of in a field expression. The dot operator allows an attribute of
an input value to be specified as the field value. an input value to be specified as the field value.
The types of expressions that can be used in a compound name Unlike some other programming languages, you cannot embed arbitrary
have been deliberately limited in order to prevent potential expressions in format strings. This is by design - the types of
security exploits resulting from the ability to place arbitrary expressions that you can use are deliberately limited. Only two operators
Python expressions inside of strings. Only two operators are are supported: the '.' (getattr) operator, and the '[]' (getitem)
supported, the '.' (getattr) operator, and the '[]' (getitem) operator. The reason for allowing these operators is that they don't
operator. normally have side effects in non-pathological code.
Another limitation that is defined to limit potential security
issues is that field names or attribute names beginning with an
underscore are disallowed. This enforces the common convention
that names beginning with an underscore are 'private'.
An example of the 'getitem' syntax: An example of the 'getitem' syntax:
"My name is {0[name]}".format(dict(name='Fred')) "My name is {0[name]}".format(dict(name='Fred'))
It should be noted that the use of 'getitem' within a string is It should be noted that the use of 'getitem' within a format string
much more limited than its normal use. In the above example, the is much more limited than its conventional usage. In the above example,
string 'name' really is the literal string 'name', not a variable the string 'name' really is the literal string 'name', not a variable
named 'name'. The rules for parsing an item key are very simple. named 'name'. The rules for parsing an item key are very simple.
If it starts with a digit, then it is treated as a number, otherwise If it starts with a digit, then it is treated as a number, otherwise
it is used as a string. it is used as a string.
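
A small sketch of the key-parsing rule above. It assumes the behavior of str.format as it later shipped in Python, which may differ in detail from this draft; the variable names are illustrative only.

```python
# Keys inside [] are not Python expressions: an all-digit key becomes an
# integer index, anything else is used as a literal string.
person = dict(name='Fred')
by_string_key = "My name is {0[name]}".format(person)

letters = ['a', 'b', 'c']
by_int_index = "{0[1]}".format(letters)

# With both 1 and '1' present, the all-digit key selects the integer one.
ambiguous = {1: 'int key', '1': 'str key'}
picked = "{0[1]}".format(ambiguous)

# The '.' operator reads an attribute of the argument.
real_part = "{0.real}".format(3 + 4j)

print(by_string_key, by_int_index, picked, real_part)
```

One consequence is that a dict entry stored under the string key '1' cannot be reached from a format string.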
@ -187,9 +182,7 @@ Simple and Compound Field Names
of the underlying object to throw an exception if the identifier of the underlying object to throw an exception if the identifier
is not legal. The format function will have a minimalist parser is not legal. The format function will have a minimalist parser
which only attempts to figure out when it is "done" with an which only attempts to figure out when it is "done" with an
identifier (by finding a '.' or a ']', or '}', etc.) The only identifier (by finding a '.' or a ']', or '}', etc.).
exception to this laissez-faire approach is that, by default,
strings are not allowed to have leading underscores.
Conversion Specifiers Conversion Specifiers
@ -269,7 +262,7 @@ Standard Conversion Specifiers
'+' - indicates that a sign should be used for both '+' - indicates that a sign should be used for both
positive as well as negative numbers positive as well as negative numbers
'-' - indicates that a sign should be used only for negative '-' - indicates that a sign should be used only for negative
numbers (this is the default behaviour) numbers (this is the default behavior)
' ' - indicates that a leading space should be used on ' ' - indicates that a leading space should be used on
positive numbers positive numbers
'()' - indicates that negative numbers should be surrounded '()' - indicates that negative numbers should be surrounded
@ -381,9 +374,8 @@ User-Defined Formatting
lives in the 'string' module. This class takes additional options lives in the 'string' module. This class takes additional options
which are not accessible via the normal str.format method. which are not accessible via the normal str.format method.
An application can create their own Formatter instance which has An application can subclass the Formatter class to create their
customized behavior, either by setting the properties of the own customized formatting behavior.
Formatter instance, or by subclassing the Formatter class.
The PEP does not attempt to exactly specify all methods and The PEP does not attempt to exactly specify all methods and
properties defined by the Formatter class; instead, those will be properties defined by the Formatter class; instead, those will be
@ -391,46 +383,25 @@ User-Defined Formatting
PEP will specify the general requirements for the Formatter class, PEP will specify the general requirements for the Formatter class,
which are listed below. which are listed below.
Although string.format() does not directly use the Formatter class
Formatter Creation and Initialization to do formatting, both use the same underlying implementation. The
reason that string.format() does not use the Formatter class directly
The Formatter class takes a single initialization argument, 'flags': is because "string" is a built-in type, which means that all of its
methods must be implemented in C, whereas Formatter is a Python
Formatter(flags=0) class. Formatter provides an extensible wrapper around the same
C functions as are used by string.format().
The 'flags' argument is used to control certain subtle behavioral
differences in formatting that would be cumbersome to change via
subclassing. The flags values are defined as static variables
in the "Formatter" class:
Formatter.ALLOW_LEADING_UNDERSCORES
By default, leading underscores are not allowed in identifier
lookups (getattr or getitem). Setting this flag will allow
this.
Formatter.CHECK_UNUSED_POSITIONAL
If this flag is set, then any positional arguments which are
supplied to the 'format' method but which are not used by
the format string will cause an error.
Formatter.CHECK_UNUSED_NAME
If this flag is set, then any named arguments which are
supplied to the 'format' method but which are not used by
the format string will cause an error.
Formatter Methods Formatter Methods
The methods of class Formatter are as follows: The Formatter class takes no initialization arguments:
fmt = Formatter()
The public API methods of class Formatter are as follows:
-- format(format_string, *args, **kwargs) -- format(format_string, *args, **kwargs)
-- vformat(format_string, args, kwargs) -- vformat(format_string, args, kwargs)
-- get_positional(args, index)
-- get_named(kwds, name)
-- format_field(value, conversion)
'format' is the primary API method. It takes a format template, 'format' is the primary API method. It takes a format template,
and an arbitrary set of positional and keyword arguments. 'format' and an arbitrary set of positional and keyword arguments. 'format'
@ -442,23 +413,38 @@ Formatter Methods
repacking the dictionary as individual arguments using the '*args' and repacking the dictionary as individual arguments using the '*args' and
'**kwds' syntax. 'vformat' does the work of breaking up the format '**kwds' syntax. 'vformat' does the work of breaking up the format
template string into character data and replacement fields. It calls template string into character data and replacement fields. It calls
the 'get_positional' and 'get_named' methods as appropriate. the 'get_positional' and 'get_named' methods as appropriate (described
below.)
Note that the checking of unused arguments, and the restriction on Formatter defines the following overridable methods:
leading underscores in attribute names are also done in this function.
-- get_positional(args, index)
-- get_named(kwds, name)
-- check_unused_args(used_args, args, kwargs)
-- format_field(value, conversion)
'get_positional' and 'get_named' are used to retrieve a given field 'get_positional' and 'get_named' are used to retrieve a given field
value. For compound field names, these functions are only called for value. For compound field names, these functions are only called for
the first component of the field name; subsequent components are the first component of the field name; subsequent components are
handled through normal attribute and indexing operations. So for handled through normal attribute and indexing operations.
example, the field expression '0.name' would cause 'get_positional' to
be called with the list of positional arguments and a numeric index of So for example, the field expression '0.name' would cause
0, and then the standard 'getattr' function would be called to get the 'get_positional' to be called with the parameter 'args' set to the
'name' attribute of the result. list of positional arguments to vformat, and 'index' set to zero;
the returned value would then be passed to the standard 'getattr'
function to get the 'name' attribute.
If the index or keyword refers to an item that does not exist, then an If the index or keyword refers to an item that does not exist, then an
IndexError/KeyError will be raised. IndexError/KeyError will be raised.
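
For illustration only, the lookup hooks described above can be approximated with the Formatter class that ships in today's 'string' module. Note the assumption: that class exposes a single get_value(key, args, kwargs) method rather than the separate 'get_positional' and 'get_named' methods named in this draft. DefaultingFormatter and its 'default' parameter are hypothetical names.

```python
import string


class DefaultingFormatter(string.Formatter):
    """Hypothetical subclass: substitute a placeholder for missing fields."""

    def __init__(self, default="<missing>"):
        super().__init__()
        self.default = default

    def get_value(self, key, args, kwargs):
        # 'key' is an int for positional fields and a str for named ones.
        # It is only the first component of a compound field name.
        try:
            return super().get_value(key, args, kwargs)
        except (IndexError, KeyError):
            return self.default


fmt = DefaultingFormatter()
filled = fmt.format("{0} {name} {1}", "a", name="n")
print(filled)
```

Without the override, the base class lets the IndexError or KeyError propagate, as the text above describes.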
'check_unused_args' is used to implement checking for unused arguments
if desired. The arguments to this function are the set of all argument
keys that were actually referred to in the format string (integers for
positional arguments, and strings for named arguments), and a reference
to the args and kwargs that were passed to vformat. The set of unused
args can be calculated from these parameters: it consists of the
positional indices and keyword names that do not appear in the used
set. 'check_unused_args' is assumed to throw an exception if the check fails.
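
A minimal sketch of such a check, written against the Formatter class in today's 'string' module, whose check_unused_args hook has the same signature as the one listed here. StrictFormatter is a hypothetical name, and raising ValueError is a choice made for this example rather than something the PEP mandates.

```python
import string


class StrictFormatter(string.Formatter):
    """Hypothetical subclass: reject arguments the format string ignores."""

    def check_unused_args(self, used_args, args, kwargs):
        # used_args holds ints for positional fields and strs for named ones.
        unused_positional = [i for i in range(len(args)) if i not in used_args]
        unused_named = [name for name in kwargs if name not in used_args]
        if unused_positional or unused_named:
            raise ValueError(
                "unused arguments: %r" % (unused_positional + unused_named,)
            )


strict = StrictFormatter()
ok = strict.format("{0} {name}", "a", name="b")

try:
    strict.format("{0}", "a", "extra")
    rejected = False
except ValueError:
    rejected = True

print(ok, rejected)
```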
'format_field' actually generates the text for a replacement field. 'format_field' actually generates the text for a replacement field.
The 'value' argument corresponds to the value being formatted, which The 'value' argument corresponds to the value being formatted, which
was retrieved from the arguments using the field name. The was retrieved from the arguments using the field name. The
@ -466,11 +452,46 @@ Formatter Methods
will be either a string or unicode object, depending on the type of will be either a string or unicode object, depending on the type of
the original format string. the original format string.
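
As a hedged illustration, the sketch below overrides format_field on the Formatter class found in today's 'string' module. In that class the second parameter is named 'format_spec' rather than 'conversion' as in this draft. ShoutFormatter and the "upper" specifier are hypothetical and exist only for this example.

```python
import string


class ShoutFormatter(string.Formatter):
    """Hypothetical subclass: add an 'upper' specifier to the standard ones."""

    def format_field(self, value, format_spec):
        if format_spec == "upper":
            return str(value).upper()
        # Everything else falls through to the standard behavior.
        return super().format_field(value, format_spec)


shout = ShoutFormatter()
text = shout.format("{0:upper} {1:>5}", "hi", "x")
print(repr(text))
```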
Note: The final implementation of the Formatter class may define To get a better understanding of how these functions relate to each
additional overridable methods and hooks. In particular, it may be other, here is pseudocode that explains the general operation of
that 'vformat' is itself a composition of several additional, vformat:
overridable methods. (Depending on whether it is convenient to the
implementor of Formatter.) def vformat(self, format_string, args, kwargs):
    # Output buffer and set of used args
    buffer = StringIO.StringIO()
    used_args = set()
    # Tokens are either format fields or literal strings
    for token in self.parse(format_string):
        if is_format_field(token):
            # Split off the conversion spec at the last ':'
            field_spec, conversion_spec = token.rsplit(":", 1)
            # 'first_part' is the part before the first '.' or '['
            first_part = get_first_part(field_spec)
            used_args.add(first_part)
            if is_positional(first_part):
                value = self.get_positional(args, first_part)
            else:
                value = self.get_named(kwargs, first_part)
            # Handle [subfield] or .subfield
            for comp in components(field_spec):
                value = resolve_subfield(value, comp)
            # Write out the converted value
            buffer.write(self.format_field(value, conversion_spec))
        else:
            buffer.write(token)
    self.check_unused_args(used_args, args, kwargs)
    return buffer.getvalue()
Note that the actual algorithm of the Formatter class may not be the
one presented here. In particular, the final implementation of
the Formatter class may define additional overridable methods and
hooks. Also, the final implementation will be written in C.
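
The same flow can be tried out as runnable code. This is a sketch only: it leans on methods of the Formatter class in today's 'string' module (parse, get_field, convert_field, format_field and check_unused_args), which are not all named in this draft. The function name sketch_vformat is hypothetical. Automatic field numbering ("{}") and nested format specifiers are deliberately left out.

```python
import io
import string


def sketch_vformat(format_string, args, kwargs):
    fmt = string.Formatter()
    buffer = io.StringIO()
    used_args = set()

    # parse() yields (literal_text, field_name, format_spec, conversion);
    # field_name is None when only literal text remains.
    for literal, field_name, format_spec, conversion in fmt.parse(format_string):
        buffer.write(literal)
        if field_name is None:
            continue
        # get_field() looks up the first component, then applies any
        # '.attr' or '[key]' parts; it also returns the first component.
        value, first_part = fmt.get_field(field_name, args, kwargs)
        used_args.add(first_part)
        value = fmt.convert_field(value, conversion)
        buffer.write(fmt.format_field(value, format_spec))

    fmt.check_unused_args(used_args, args, kwargs)
    return buffer.getvalue()


result = sketch_vformat("{0[name]} is {age:>3}", ({"name": "Fred"},), {"age": 7})
print(repr(result))
```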
Customizing Formatters Customizing Formatters
@ -527,8 +548,7 @@ Error handling
There are two classes of exceptions which can occur during formatting: There are two classes of exceptions which can occur during formatting:
exceptions generated by the formatter code itself, and exceptions exceptions generated by the formatter code itself, and exceptions
generated by user code (such as a field object's getattr function, or generated by user code (such as a field object's 'getattr' function).
the field_hook function).
In general, exceptions generated by the formatter code itself are In general, exceptions generated by the formatter code itself are
of the "ValueError" variety -- there is an error in the actual "value" of the "ValueError" variety -- there is an error in the actual "value"
@ -615,6 +635,38 @@ Alternate Syntax
what .Net uses. what .Net uses.
Alternate Feature Proposals
Restricting attribute access: An earlier version of the PEP
restricted the ability to access attributes beginning with a
leading underscore, for example "{0._private}". However, this
is a useful ability to have when debugging, so the feature
was dropped.
Some developers suggested that the ability to do 'getattr' and
'getitem' access should be dropped entirely. However, this
is in conflict with the needs of another set of developers who
strongly lobbied for the ability to pass in a large dict as a
single argument (without flattening it into individual keyword
arguments using the **kwargs syntax) and then have the format
string refer to dict entries individually.
There have also been suggestions to expand the set of expressions
that are allowed in a format string. However, this was seen
to go against the spirit of TOOWTDI, since the same effect can
be achieved in most cases by executing the same expression on
the parameter before it's passed in to the formatting function.
For cases where the format string is being used to do arbitrary
formatting in a data-rich environment, it's recommended to use
a templating engine specialized for this purpose, such as
Genshi [5] or Cheetah [6].
Many other features were considered and rejected because they
could easily be achieved by subclassing Formatter instead of
building the feature into the base implementation. This includes
alternate syntax, comments in format strings, and many others.
Security Considerations Security Considerations
Historically, string formatting has been a common source of Historically, string formatting has been a common source of
@ -622,43 +674,21 @@ Security Considerations
string templating system allows arbitrary expressions to be string templating system allows arbitrary expressions to be
embedded in format strings. embedded in format strings.
The typical scenario is one where the string data being processed The best way to use string formatting in a way that does not
is coming from outside the application, perhaps from HTTP headers create potential security holes is to never use format strings
or fields within a web form. An attacker could substitute their that come from an untrusted source.
own strings designed to cause havoc.
The string formatting system outlined in this PEP is by no means Barring that, the next best approach is to ensure that string
'secure', in the sense that no Python library module can, on its formatting has no side effects. Because of the open nature of
own, guarantee security, especially given the open nature of Python, it is impossible to guarantee that any non-trivial
the Python language. Building a secure application requires a operation has this property. What this PEP does is limit the
secure approach to design. types of expressions in format strings to those in which visible
side effects are both rare and strongly discouraged by the
What this PEP does attempt to do is make the job of designing a culture of Python developers. So for example, attribute access
secure application easier, by making it easier for a programmer is allowed because it would be considered pathological to write
to reason about the possible consequences of a string formatting code where the mere access of an attribute has visible side
operation. It does this by limiting those consequences to a smaller effects (whether the code has *invisible* side effects - such
and more easily understood subset. as creating a cache entry for faster lookup - is irrelevant.)
For example, because it is possible in Python to override the
'getattr' operation of a type, the interpretation of a compound
replacement field such as "0.name" could potentially run
arbitrary code.
However, it is *extremely* rare for the mere retrieval of an
attribute to have side effects. Other operations which are more
likely to have side effects - such as method calls - are disallowed.
Thus, a programmer can be reasonably assured that no string
formatting operation will cause a state change in the program.
This assurance is not only useful in securing an application, but
in debugging it as well.
Similarly, the restriction on field names beginning with
underscores is intended to provide similar assurances about the
visibility of private data.
Of course, programmers would be well-advised to avoid using
any external data as format strings, and instead use that data
as the format arguments instead.
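
To make the advice about untrusted format strings concrete, here is a small sketch using str.format as it later shipped. The Config class, its 'secret' attribute and the strings involved are all invented for this example.

```python
class Config:
    """Hypothetical object that happens to carry sensitive data."""

    secret = "hunter2"

    def __init__(self, user):
        self.user = user


config = Config("bob")

# Imagine this string arrived from a web form or an HTTP header.
untrusted = "{0.secret}"

# Risky: external data used as the format string can reach any attribute
# of the arguments through the '.' operator.
leaked = untrusted.format(config)

# Safer: external data passed as an argument is inserted verbatim and is
# never interpreted as a replacement field.
safe = "Hello, {0}".format(untrusted)

print(leaked)
print(safe)
```

Keeping the format string a constant that the application controls limits the reachable attributes to those the programmer wrote into it.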
Sample Implementation Sample Implementation
@ -693,6 +723,12 @@ References
[4] Composite Formatting - [.Net Framework Developer's Guide] [4] Composite Formatting - [.Net Framework Developer's Guide]
http://msdn.microsoft.com/library/en-us/cpguide/html/cpconcompositeformatting.asp?frame=true http://msdn.microsoft.com/library/en-us/cpguide/html/cpconcompositeformatting.asp?frame=true
[5] Genshi templating engine.
http://genshi.edgewall.org/
[6] Cheetah - The Python-Powered Template Engine.
http://www.cheetahtemplate.org/
Copyright Copyright