Updated PEP 3101 to incorporate latest feedback, and simplify even further. Also added additional explanation of custom formatting classes.
parent 935f64f730
commit 00d28204ef
306
pep-3101.txt
@@ -141,7 +141,7 @@ Format Strings

 Simple and Compound Field Names

-Simple field names are either names or numbers. If numbers, they
+Simple field names are either names or numbers. If numbers, they
 must be valid base-10 integers; if names, they must be valid
 Python identifiers. A number is used to identify a positional
 argument, while a name is used to identify a keyword argument.
@@ -152,44 +152,37 @@ Simple and Compound Field Names
     "My name is {0.name}".format(file('out.txt'))

 This example shows the use of the 'getattr' or 'dot' operator
-in a field expression. The dot operator allows an attribute of
+in a field expression. The dot operator allows an attribute of
 an input value to be specified as the field value.

-The types of expressions that can be used in a compound name
-have been deliberately limited in order to prevent potential
-security exploits resulting from the ability to place arbitrary
-Python expressions inside of strings. Only two operators are
-supported, the '.' (getattr) operator, and the '[]' (getitem)
-operator.
-
-Another limitation that is defined to limit potential security
-issues is that field names or attribute names beginning with an
-underscore are disallowed. This enforces the common convention
-that names beginning with an underscore are 'private'.
+Unlike some other programming languages, you cannot embed arbitrary
+expressions in format strings. This is by design - the types of
+expressions that you can use are deliberately limited. Only two operators
+are supported: the '.' (getattr) operator, and the '[]' (getitem)
+operator. The reason for allowing these operators is that they don't
+normally have side effects in non-pathological code.

 An example of the 'getitem' syntax:

     "My name is {0[name]}".format(dict(name='Fred'))

-It should be noted that the use of 'getitem' within a string is
-much more limited than its normal use. In the above example, the
-string 'name' really is the literal string 'name', not a variable
-named 'name'. The rules for parsing an item key are very simple.
+It should be noted that the use of 'getitem' within a format string
+is much more limited than its conventional usage. In the above example,
+the string 'name' really is the literal string 'name', not a variable
+named 'name'. The rules for parsing an item key are very simple.
 If it starts with a digit, then it's treated as a number, otherwise
 it is used as a string.

 It is not possible to specify arbitrary dictionary keys from
 within a format string.

-Implementation note: The implementation of this proposal is
+Implementation note: The implementation of this proposal is
 not required to enforce the rule about a name being a valid
 Python identifier. Instead, it will rely on the getattr function
 of the underlying object to throw an exception if the identifier
 is not legal. The format function will have a minimalist parser
 which only attempts to figure out when it is "done" with an
-identifier (by finding a '.' or a ']', or '}', etc.) The only
-exception to this laissez-faire approach is that, by default,
-strings are not allowed to have leading underscores.
+identifier (by finding a '.' or a ']', or '}', etc.).
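As an illustration of the compound field names discussed in this hunk, here is a minimal sketch using the built-in str.format of current Python, which descends from this PEP. The `Person` class and the sample values are hypothetical and exist only for the example.

```python
# Illustration only: `Person` is a hypothetical class, not part of the PEP.
class Person:
    def __init__(self, name):
        self.name = name

# '.' (getattr): take the 'name' attribute of positional argument 0.
by_attr = "My name is {0.name}".format(Person("Fred"))

# '[]' (getitem): 'name' is the literal string key, not a variable.
by_key = "My name is {0[name]}".format(dict(name="Fred"))

# An item key made of digits is treated as a number (here, a list index).
by_index = "Second item: {0[1]}".format(["a", "b", "c"])

print(by_attr)
print(by_key)
print(by_index)
```

Because the key inside the brackets is never evaluated as an expression, there is no way to index with a variable or a computed value from within the format string.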


 Conversion Specifiers
@@ -215,11 +208,11 @@ Conversion Specifiers
 Note that the doubled '}' at the end, which would normally be
 escaped, is not escaped in this case. The reason is that
 the '{{' and '}}' syntax for escapes is only applied when used
-*outside* of a format field. Within a format field, the brace
+*outside* of a format field. Within a format field, the brace
 characters always have their normal meaning.

 The syntax for conversion specifiers is open-ended, since a class
-can override the standard conversion specifiers. In such cases,
+can override the standard conversion specifiers. In such cases,
 the format() method merely passes all of the characters between
 the first colon and the matching brace to the relevant underlying
 formatting method.
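The brace rule can be seen in a short sketch with current Python's str.format. It assumes the "doubled '}'" refers to a nested field such as "{0:{1}}"; that example lies outside this hunk, so the format strings below are illustrative rather than quoted from the PEP.

```python
# Outside a field, doubled braces are escapes for literal braces.
literal = "{{0}} stays literal, {0} is replaced".format("this")

# Inside a field, braces keep their normal meaning: the inner {1} is a
# nested replacement field that supplies the conversion spec, and the
# final '}}' closes both fields rather than acting as an escape.
nested = "{0:{1}}".format("ab", ">5")

print(literal)
print(repr(nested))
```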
@@ -248,7 +241,7 @@ Standard Conversion Specifiers
 '>' - Forces the field to be right-aligned within the
       available space.
 '=' - Forces the padding to be placed after the sign (if any)
-      but before the digits. This is used for printing fields
+      but before the digits. This is used for printing fields
       in the form '+000000120'.
 '^' - Forces the field to be centered within the available
       space.
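The alignment flags above can be tried directly with current Python's str.format. This is a small sketch; the widths and values are arbitrary choices for the example.

```python
# '>' right-aligns within the given width (8 columns here).
right = "{0:>8}".format("abc")

# '^' centers within the given width (7 columns here).
centered = "{0:^7}".format("abc")

# '=' with a '0' fill puts the padding after the sign, before the digits.
padded = "{0:0=+10}".format(120)

print(repr(right))
print(repr(centered))
print(padded)
```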
@@ -261,7 +254,7 @@ Standard Conversion Specifiers
 pad the field to the minimum width. The alignment flag must be
 supplied if the character is a number other than 0 (otherwise the
 character would be interpreted as part of the field width
-specifier). A zero fill character without an alignment flag
+specifier). A zero fill character without an alignment flag
 implies an alignment type of '='.

 The 'sign' element can be one of the following:
@@ -269,20 +262,20 @@ Standard Conversion Specifiers
 '+' - indicates that a sign should be used for both
       positive as well as negative numbers
 '-' - indicates that a sign should be used only for negative
-      numbers (this is the default behaviour)
+      numbers (this is the default behavior)
 ' ' - indicates that a leading space should be used on
       positive numbers
 '()' - indicates that negative numbers should be surrounded
        by parentheses

-'width' is a decimal integer defining the minimum field width. If
+'width' is a decimal integer defining the minimum field width. If
 not specified, then the field width will be determined by the
 content.

 The 'precision' is a decimal number indicating how many digits
 should be displayed after the decimal point in a floating point
-conversion. In a string conversion the field indicates how many
-characters will be used from the field content. The precision is
+conversion. In a string conversion the field indicates how many
+characters will be used from the field content. The precision is
 ignored for integer conversions.

 Finally, the 'type' determines how the data should be presented.
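A brief sketch of the sign, width and precision elements, again using current Python's str.format. The '()' option listed above is left out because current Python's format specification does not accept it; the values are arbitrary.

```python
# Sign handling.
plus_pos = "{0:+d}".format(5)    # sign shown for positive numbers too
plus_neg = "{0:+d}".format(-5)
minus = "{0:-d}".format(5)       # sign only for negatives (the default)
space = "{0: d}".format(5)       # leading space on positive numbers

# Minimum field width: numbers are right-aligned by default.
wide = "{0:6d}".format(42)

# Precision: digits after the point for floats, max characters for strings.
prec_float = "{0:.2f}".format(3.14159)
prec_str = "{0:.3}".format("abcdef")

for item in (plus_pos, plus_neg, minus, space, wide, prec_float, prec_str):
    print(repr(item))
```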
@@ -292,11 +285,11 @@ Standard Conversion Specifiers

 The available string conversion types are:

-'s' - String format. Invokes str() on the object.
+'s' - String format. Invokes str() on the object.
       This is the default conversion specifier type.
-'r' - Repr format. Invokes repr() on the object.
+'r' - Repr format. Invokes repr() on the object.

-There are several integer conversion types. All invoke int() on
+There are several integer conversion types. All invoke int() on
 the object before attempting to format it.

 The available integer conversion types are:
@@ -311,7 +304,7 @@ Standard Conversion Specifiers
 'X' - Hex format. Outputs the number in base 16, using upper-
       case letters for the digits above 9.

-There are several floating point conversion types. All invoke
+There are several floating point conversion types. All invoke
 float() on the object before attempting to format it.

 The available floating point conversion types are:
@@ -380,97 +373,125 @@ User-Defined Formatting
 format engine can be obtained through the 'Formatter' class that
 lives in the 'string' module. This class takes additional options
 which are not accessible via the normal str.format method.

-An application can create its own Formatter instance which has
-customized behavior, either by setting the properties of the
-Formatter instance, or by subclassing the Formatter class.
-
+An application can subclass the Formatter class to create its
+own customized formatting behavior.
+
 The PEP does not attempt to exactly specify all methods and
 properties defined by the Formatter class; instead, those will be
-defined and documented in the initial implementation. However, this
+defined and documented in the initial implementation. However, this
 PEP will specify the general requirements for the Formatter class,
 which are listed below.

-
-Formatter Creation and Initialization
-
-The Formatter class takes a single initialization argument, 'flags':
-
-    Formatter(flags=0)
-
-The 'flags' argument is used to control certain subtle behavioral
-differences in formatting that would be cumbersome to change via
-subclassing. The flags values are defined as static variables
-in the "Formatter" class:
-
-    Formatter.ALLOW_LEADING_UNDERSCORES
-
-        By default, leading underscores are not allowed in identifier
-        lookups (getattr or getitem). Setting this flag will allow
-        this.
-
-    Formatter.CHECK_UNUSED_POSITIONAL
-
-        If this flag is set, then any positional arguments which are
-        supplied to the 'format' method but which are not used by
-        the format string will cause an error.
-
-    Formatter.CHECK_UNUSED_NAME
-
-        If this flag is set, then any named arguments which are
-        supplied to the 'format' method but which are not used by
-        the format string will cause an error.
+Although string.format() does not directly use the Formatter class
+to do formatting, both use the same underlying implementation. The
+reason that string.format() does not use the Formatter class directly
+is that "string" is a built-in type, which means that all of its
+methods must be implemented in C, whereas Formatter is a Python
+class. Formatter provides an extensible wrapper around the same
+C functions as are used by string.format().


 Formatter Methods

-The methods of class Formatter are as follows:
+The Formatter class takes no initialization arguments:
+
+    fmt = Formatter()
+
+The public API methods of class Formatter are as follows:

     -- format(format_string, *args, **kwargs)
     -- vformat(format_string, args, kwargs)
-    -- get_positional(args, index)
-    -- get_named(kwds, name)
-    -- format_field(value, conversion)
-
-'format' is the primary API method. It takes a format template,
-and an arbitrary set of positional and keyword arguments. 'format'
+
+'format' is the primary API method. It takes a format template,
+and an arbitrary set of positional and keyword arguments. 'format'
 is just a wrapper that calls 'vformat'.

-'vformat' is the function that does the actual work of formatting. It
+'vformat' is the function that does the actual work of formatting. It
 is exposed as a separate function for cases where you want to pass in
 a predefined dictionary of arguments, rather than unpacking and
 repacking the dictionary as individual arguments using the '*args' and
-'**kwds' syntax. 'vformat' does the work of breaking up the format
-template string into character data and replacement fields. It calls
-the 'get_positional' and 'get_named' methods as appropriate.
+'**kwds' syntax. 'vformat' does the work of breaking up the format
+template string into character data and replacement fields. It calls
+the 'get_positional' and 'get_named' methods as appropriate (described
+below).

-Note that the checking of unused arguments, and the restriction on
-leading underscores in attribute names are also done in this function.
+Formatter defines the following overridable methods:
+
+    -- get_positional(args, index)
+    -- get_named(kwds, name)
+    -- check_unused_args(used_args, args, kwargs)
+    -- format_field(value, conversion)

 'get_positional' and 'get_named' are used to retrieve a given field
-value. For compound field names, these functions are only called for
+value. For compound field names, these functions are only called for
 the first component of the field name; subsequent components are
-handled through normal attribute and indexing operations. So for
-example, the field expression '0.name' would cause 'get_positional' to
-be called with the list of positional arguments and a numeric index of
-0, and then the standard 'getattr' function would be called to get the
-'name' attribute of the result.
+handled through normal attribute and indexing operations.
+
+So for example, the field expression '0.name' would cause
+'get_positional' to be called with the parameter 'args' set to the
+list of positional arguments to vformat, and 'index' set to zero;
+the returned value would then be passed to the standard 'getattr'
+function to get the 'name' attribute.

 If the index or keyword refers to an item that does not exist, then an
 IndexError/KeyError will be raised.

+'check_unused_args' is used to implement checking for unused arguments
+if desired. The arguments to this function are the set of all argument
+keys that were actually referred to in the format string (integers for
+positional arguments, and strings for named arguments), and a reference
+to the args and kwargs that were passed to vformat. The set of unused
+args is the difference between the supplied keys and 'used_args'.
+'check_unused_args' is assumed to throw an exception if the check fails.
+
 'format_field' actually generates the text for a replacement field.
 The 'value' argument corresponds to the value being formatted, which
-was retrieved from the arguments using the field name. The
+was retrieved from the arguments using the field name. The
 'conversion' argument is the conversion spec part of the field, which
 will be either a string or unicode object, depending on the type of
 the original format string.
+
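To make the overridable hooks concrete, here is a hedged sketch against the string.Formatter class that ships with current Python. One difference from this draft is assumed throughout: the shipped class exposes a single get_value(key, args, kwargs) hook instead of separate 'get_positional' and 'get_named' methods. The `StrictFormatter` name and its "<missing>" default are hypothetical.

```python
import string

class StrictFormatter(string.Formatter):
    """Hypothetical subclass: default for missing names, error on unused args."""

    def get_value(self, key, args, kwargs):
        # Integer keys are positional; string keys are keyword arguments.
        if isinstance(key, str):
            return kwargs.get(key, "<missing>")
        return args[key]

    def check_unused_args(self, used_args, args, kwargs):
        # Unused = keys that were supplied minus keys the template referenced.
        supplied = set(range(len(args))) | set(kwargs)
        unused = supplied - set(used_args)
        if unused:
            raise ValueError("unused arguments: %r" % sorted(unused, key=str))

fmt = StrictFormatter()
ok = fmt.format("{0} and {name}", "first", name="second")
missing = fmt.format("{0} and {other}", "first")
print(ok)
print(missing)
```

Calling `fmt.format("{0}", "a", "b")` raises ValueError under this sketch, because positional argument 1 is supplied but never referenced.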
+To get a better understanding of how these functions relate to each
+other, here is pseudocode that explains the general operation of
+vformat:
+
+    def vformat(self, format_string, args, kwargs):
+
+        # Output buffer and set of used args
+        buffer = StringIO.StringIO()
+        used_args = set()
+
+        # Tokens are either format fields or literal strings
+        for token in self.parse(format_string):
+            if is_format_field(token):
+                field_spec, _, conversion_spec = token.partition(":")
+
+                # 'first_part' is the part before the first '.' or '['
+                first_part = get_first_part(field_spec)
+                used_args.add(first_part)
+                if is_positional(first_part):
+                    value = self.get_positional(args, first_part)
+                else:
+                    value = self.get_named(kwargs, first_part)
+
+                # Handle [subfield] or .subfield
+                for comp in components(field_spec):
+                    value = resolve_subfield(value, comp)

-Note: The final implementation of the Formatter class may define
-additional overridable methods and hooks. In particular, it may be
-that 'vformat' is itself a composition of several additional,
-overridable methods. (Depending on whether it is convenient to the
-implementor of Formatter.)
+                # Write out the converted value
+                buffer.write(self.format_field(value, conversion_spec))
+
+            else:
+                buffer.write(token)
+
+        self.check_unused_args(used_args, args, kwargs)
+        return buffer.getvalue()
+
+Note that the actual algorithm of the Formatter class may not be the
+one presented here. In particular, the final implementation of
+the Formatter class may define additional overridable methods and
+hooks. Also, the final implementation will be written in C.

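The relationship between 'format' and 'vformat' can be shown with the string.Formatter class in current Python. This is a minimal sketch; the `record` dictionary is hypothetical sample data.

```python
import string

fmt = string.Formatter()
record = {"name": "Fred", "age": 42}  # hypothetical sample data

# vformat takes the positional tuple and the keyword dict as they are...
direct = fmt.vformat("{name} is {age}", (), record)

# ...while format unpacks the dict into keyword arguments first.
wrapped = fmt.format("{name} is {age}", **record)

print(direct)
print(wrapped)
```

Both calls produce the same text; vformat simply avoids unpacking and repacking a dictionary that already exists.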

 Customizing Formatters
@@ -511,15 +532,15 @@ Customizing Formatters

 It would also be possible to create a 'smart' namespace formatter
 that could automatically access both locals and globals through
-snooping of the calling stack. Due to the need for compatibility
+snooping of the calling stack. Due to the need for compatibility
 with the different versions of Python, such a capability will not be
 included in the standard library; however, it is anticipated that
 someone will create and publish a recipe for doing this.

 Another type of customization is to change the way that built-in
-types are formatted by overriding the 'format_field' method. (For
+types are formatted by overriding the 'format_field' method. (For
 non-built-in types, you can simply define a __format__ special
-method on that type.) So for example, you could override the
+method on that type.) So for example, you could override the
 formatting of numbers to output scientific notation when needed.


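As one possible reading of the scientific-notation idea, the sketch below overrides 'format_field' on current Python's string.Formatter. The `SciFormatter` name and the rule it applies (floats with no explicit conversion spec get three-digit scientific notation) are assumptions made for the example, not behavior the PEP prescribes.

```python
import string

class SciFormatter(string.Formatter):
    """Hypothetical formatter: bare floats are shown in scientific notation."""

    def format_field(self, value, format_spec):
        # Only intervene when the template gave no spec of its own.
        if isinstance(value, float) and not format_spec:
            return format(value, ".3e")
        return super().format_field(value, format_spec)

fmt = SciFormatter()
sci = fmt.format("{0}", 12345.678)
explicit = fmt.format("{0:.1f}", 12345.678)
text = fmt.format("{0}", "abc")
print(sci, explicit, text)
```

An explicit spec in the template still wins, so existing format strings keep their meaning under this subclass.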
@@ -527,8 +548,7 @@ Error handling

 There are two classes of exceptions which can occur during formatting:
 exceptions generated by the formatter code itself, and exceptions
-generated by user code (such as a field object's getattr function, or
-the field_hook function).
+generated by user code (such as a field object's 'getattr' function).

 In general, exceptions generated by the formatter code itself are
 of the "ValueError" variety -- there is an error in the actual "value"
@@ -605,7 +625,7 @@ Alternate Syntax
 this PEP used backslash rather than doubling to escape a bracket.
 This worked because backslashes in Python string literals that
 don't conform to a standard backslash sequence such as '\n'
-are left unmodified. However, this caused a certain amount
+are left unmodified. However, this caused a certain amount
 of confusion, and led to potential situations of multiple
 recursive escapes, i.e. '\\\\{' to place a literal backslash
 in front of a bracket.
@@ -615,6 +635,38 @@ Alternate Syntax
 what .Net uses.


+Alternate Feature Proposals
+
+Restricting attribute access: An earlier version of the PEP
+restricted the ability to access attributes beginning with a
+leading underscore, for example "{0._private}". However, this
+is a useful ability to have when debugging, so the feature
+was dropped.
+
+Some developers suggested that the ability to do 'getattr' and
+'getitem' access should be dropped entirely. However, this
+is in conflict with the needs of another set of developers who
+strongly lobbied for the ability to pass in a large dict as a
+single argument (without flattening it into individual keyword
+arguments using the **kwargs syntax) and then have the format
+string refer to dict entries individually.
+
+There have also been suggestions to expand the set of expressions
+that are allowed in a format string. However, this was seen
+to go against the spirit of TOOWTDI, since the same effect can
+be achieved in most cases by executing the same expression on
+the parameter before it's passed in to the formatting function.
+For cases where the format string is being used to do arbitrary
+formatting in a data-rich environment, it's recommended to use
+a templating engine specialized for this purpose, such as
+Genshi [5] or Cheetah [6].
+
+Many other features were considered and rejected because they
+could easily be achieved by subclassing Formatter instead of
+building the feature into the base implementation. This includes
+alternate syntax, comments in format strings, and many others.
+
+
 Security Considerations

 Historically, string formatting has been a common source of
@@ -622,43 +674,21 @@ Security Considerations
 string templating system allows arbitrary expressions to be
 embedded in format strings.

-The typical scenario is one where the string data being processed
-is coming from outside the application, perhaps from HTTP headers
-or fields within a web form. An attacker could substitute their
-own strings designed to cause havoc.
-
-The string formatting system outlined in this PEP is by no means
-'secure', in the sense that no Python library module can, on its
-own, guarantee security, especially given the open nature of
-the Python language. Building a secure application requires a
-secure approach to design.
-
-What this PEP does attempt to do is make the job of designing a
-secure application easier, by making it easier for a programmer
-to reason about the possible consequences of a string formatting
-operation. It does this by limiting those consequences to a smaller
-and more easily understood subset.
-
-For example, because it is possible in Python to override the
-'getattr' operation of a type, the interpretation of a compound
-replacement field such as "0.name" could potentially run
-arbitrary code.
-
-However, it is *extremely* rare for the mere retrieval of an
-attribute to have side effects. Other operations which are more
-likely to have side effects - such as method calls - are disallowed.
-Thus, a programmer can be reasonably assured that no string
-formatting operation will cause a state change in the program.
-This assurance is not only useful in securing an application, but
-in debugging it as well.
-
-Similarly, the restriction on field names beginning with
-underscores is intended to provide similar assurances about the
-visibility of private data.
-
-Of course, programmers would be well-advised to avoid using
-any external data as format strings, and instead use that data
-as the format arguments.
+The best way to use string formatting in a way that does not
+create potential security holes is to never use format strings
+that come from an untrusted source.
+
+Barring that, the next best approach is to ensure that string
+formatting has no side effects. Because of the open nature of
+Python, it is impossible to guarantee that any non-trivial
+operation has this property. What this PEP does is limit the
+types of expressions in format strings to those in which visible
+side effects are both rare and strongly discouraged by the
+culture of Python developers. So for example, attribute access
+is allowed because it would be considered pathological to write
+code where the mere access of an attribute has visible side
+effects (whether the code has *invisible* side effects - such
+as creating a cache entry for faster lookup - is irrelevant).
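The advice about untrusted format strings can be illustrated with a small sketch using current Python's str.format. The `Account` class, its fields and the sample strings are all hypothetical.

```python
# Illustration only: `Account` and its data are hypothetical.
class Account:
    def __init__(self, owner, secret):
        self.owner = owner
        self.secret = secret

acct = Account("Fred", "hunter2")
untrusted = "{0.secret}"  # imagine this text arrived from a web form

# Risky: the untrusted text is used as the format string, so it can
# reach any attribute of the arguments passed to format().
leaked = untrusted.format(acct)

# Safer: the untrusted text is only ever a format argument, and is
# inserted verbatim without being interpreted.
safe = "User said: {0}".format(untrusted)

print(leaked)
print(safe)
```

Even without side effects, attribute access lets a hostile format string read data the author never meant to expose, which is why trusted templates with untrusted arguments are the preferred arrangement.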


 Sample Implementation
@@ -692,6 +722,12 @@ References

 [4] Composite Formatting - [.Net Framework Developer's Guide]
     http://msdn.microsoft.com/library/en-us/cpguide/html/cpconcompositeformatting.asp?frame=true

+[5] Genshi templating engine.
+    http://genshi.edgewall.org/
+
+[6] Cheetah - The Python-Powered Template Engine.
+    http://www.cheetahtemplate.org/
+

 Copyright