A substantial rewrite of PEP3101.
parent
9e84963d32
commit
fa5ea5f886
361
pep-3101.txt
|
@ -26,7 +26,7 @@ Rationale
|
|||
|
||||
- The string.Template module. [2]
|
||||
|
||||
The scope of this PEP will be restricted to proposals for built-in
|
||||
The primary scope of this PEP concerns proposals for built-in
|
||||
string formatting operations (in other words, methods of the
|
||||
built-in string type).
|
||||
|
||||
|
@ -42,8 +42,14 @@ Rationale
|
|||
|
||||
While there is some overlap between this proposal and
|
||||
string.Template, it is felt that each serves a distinct need,
|
||||
and that one does not obviate the other. In any case,
|
||||
string.Template will not be discussed here.
|
||||
and that one does not obviate the other. This proposal is for
|
||||
a mechanism which, like '%', is efficient for small strings
|
||||
which are only used once, so, for example, compilation of a
|
||||
string into a template is not contemplated in this proposal,
|
||||
although the proposal does take care to define format strings
|
||||
and the API in such a way that an efficient template package
|
||||
could reuse the syntax and even some of the underlying
|
||||
formatting code.
|
||||
|
||||
|
||||
Specification
|
||||
|
@ -53,39 +59,43 @@ Specification
|
|||
- Specification of a new formatting method to be added to the
|
||||
built-in string class.
|
||||
|
||||
- Specification of functions and flag values to be added to
|
||||
the string module, so that the underlying formatting engine
|
||||
can be used with additional options.
|
||||
|
||||
- Specification of a new syntax for format strings.
|
||||
|
||||
- Specification of a new set of class methods to control the
|
||||
- Specification of a new set of special methods to control the
|
||||
formatting and conversion of objects.
|
||||
|
||||
- Specification of an API for user-defined formatting classes.
|
||||
|
||||
- Specification of how formatting errors are handled.
|
||||
|
||||
Note on string encodings: Since this PEP is being targeted
|
||||
at Python 3.0, it is assumed that all strings are unicode strings,
|
||||
Note on string encodings: When discussing this PEP in the context
|
||||
of Python 3.0, it is assumed that all strings are unicode strings,
|
||||
and that the use of the word 'string' in the context of this
|
||||
document will generally refer to a Python 3.0 string, which is
|
||||
the same as a Python 2.x unicode object.
|
||||
|
||||
If it should happen that this functionality is backported to
|
||||
the 2.x series, then it will be necessary to handle both regular
|
||||
string as well as unicode objects. All of the function call
|
||||
interfaces described in this PEP can be used for both strings
|
||||
and unicode objects, and in all cases there is sufficient
|
||||
information to be able to properly deduce the output string
|
||||
type (in other words, there is no need for two separate APIs).
|
||||
In all cases, the type of the template string dominates - that
|
||||
In the context of Python 2.x, the use of the word 'string' in this
|
||||
document refers to an object which may either be a regular string
|
||||
or a unicode object. All of the function call interfaces
|
||||
described in this PEP can be used for both strings and unicode
|
||||
objects, and in all cases there is sufficient information
|
||||
to be able to properly deduce the output string type (in
|
||||
other words, there is no need for two separate APIs).
|
||||
In all cases, the type of the format string dominates - that
|
||||
is, the conversion will always result in an object
|
||||
that contains the same representation of characters as the
|
||||
input template string.
|
||||
input format string.
|
||||
|
||||
|
||||
String Methods
|
||||
|
||||
The build-in string class will gain a new method, 'format',
|
||||
which takes takes an arbitrary number of positional and keyword
|
||||
arguments:
|
||||
The built-in string class (and also the unicode class in 2.6) will
|
||||
gain a new method, 'format', which takes an arbitrary number of
|
||||
positional and keyword arguments:
|
||||
|
||||
"The story of {0}, {1}, and {c}".format(a, b, c=d)
|
||||
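As a quick illustration of the call above, the following sketch fills the fields with placeholder values. The contents of a, b and d are assumptions made for the example, and the behavior shown is that of str.format as it exists in current Python.

```python
# Placeholder values; the PEP's example leaves a, b and d undefined.
a, b, d = "Alice", "Bob", "Carol"

# {0} and {1} select positional arguments; {c} selects the keyword argument c.
story = "The story of {0}, {1}, and {c}".format(a, b, c=d)
print(story)
```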
|
||||
|
@ -98,6 +108,15 @@ String Methods
|
|||
|
||||
Format Strings
|
||||
|
||||
Format strings consist of intermingled character data and markup.
|
||||
|
||||
Character data is data which is transferred unchanged from the
|
||||
format string to the output string; markup is not transferred from
|
||||
the format string directly to the output, but instead is used to
|
||||
define 'replacement fields' that describe to the format engine
|
||||
what should be placed in the output string in the place of the
|
||||
markup.
|
||||
|
||||
Brace characters ('curly braces') are used to indicate a
|
||||
replacement field within the string:
|
||||
|
||||
|
@ -143,6 +162,11 @@ Simple and Compound Field Names
|
|||
supported, the '.' (getattr) operator, and the '[]' (getitem)
|
||||
operator.
|
||||
|
||||
Another limitation that is defined to limit potential security
|
||||
issues is that field names or attribute names beginning with an
|
||||
underscore are disallowed. This enforces the common convention
|
||||
that names beginning with an underscore are 'private'.
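The underscore rule can be approximated with the string.Formatter class found in current Python, which (as far as we know) does not enforce it by default. The sketch below is only one possible reading of the rule: the class name NoPrivateFormatter and the choice of ValueError are assumptions, not part of this proposal.

```python
import re
from string import Formatter


class NoPrivateFormatter(Formatter):
    """Hypothetical formatter that rejects names with a leading underscore."""

    def get_field(self, field_name, args, kwargs):
        # Split a compound name such as "0.attr[key]" into its components.
        for part in re.split(r"[.\[\]]", field_name):
            if part.startswith("_"):
                raise ValueError("private name in field: {0!r}".format(field_name))
        return super().get_field(field_name, args, kwargs)


fmt = NoPrivateFormatter()
allowed = fmt.format("{0.real}", 3)      # ordinary attribute lookup

try:
    fmt.format("{0.__class__}", 3)       # leading underscore
    rejected = False
except ValueError:
    rejected = True
```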
|
||||
|
||||
An example of the 'getitem' syntax:
|
||||
|
||||
"My name is {0[name]}".format(dict(name='Fred'))
|
||||
|
@ -150,14 +174,23 @@ Simple and Compound Field Names
|
|||
It should be noted that the use of 'getitem' within a string is
|
||||
much more limited than its normal use. In the above example, the
|
||||
string 'name' really is the literal string 'name', not a variable
|
||||
named 'name'. The rules for parsing an item key are the same as
|
||||
for parsing a simple name - in other words, if it looks like a
|
||||
number, then its treated as a number, if it looks like an
|
||||
identifier, then it is used as a string.
|
||||
named 'name'. The rules for parsing an item key are very simple.
|
||||
If it starts with a digit, then it is treated as a number; otherwise
|
||||
it is used as a string.
|
||||
|
||||
It is not possible to specify arbitrary dictionary keys from
|
||||
within a format string.
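The key-parsing rule can be observed with str.format in current Python. This is a sketch under the assumption that the shipped behavior matches the rule closely enough for illustration; in current CPython a key consisting only of digits is converted to an integer, and anything else is used as a string. The variable names are made up for the example.

```python
person = {"name": "Fred", "2": "string key"}
scores = [10, 20, 30]

# 'name' is used as the literal string key 'name'.
by_name = "My name is {0[name]}".format(person)

# '1' looks like a number, so it is used as the integer index 1.
by_index = "Second score: {0[1]}".format(scores)

# The string key "2" cannot be reached: the lookup uses the integer 2.
try:
    "{0[2]}".format(person)
    missed = None
except KeyError as exc:
    missed = exc.args[0]
```

The last case shows why arbitrary dictionary keys cannot be specified from within a format string.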
|
||||
|
||||
Implementation note: The implementation of this proposal is
|
||||
not required to enforce the rule about a name being a valid
|
||||
Python identifier. Instead, it will rely on the getattr function
|
||||
of the underlying object to throw an exception if the identifier
|
||||
is not legal. The format function will have a minimalist parser
|
||||
which only attempts to figure out when it is "done" with an
|
||||
identifier (by finding a '.', a ']', a '}', etc.). The only
|
||||
exception to this laissez-faire approach is that, by default,
|
||||
strings are not allowed to have leading underscores.
|
||||
|
||||
|
||||
Conversion Specifiers
|
||||
|
||||
|
@ -217,6 +250,8 @@ Standard Conversion Specifiers
|
|||
'=' - Forces the padding to be placed after the sign (if any)
|
||||
but before the digits. This is used for printing fields
|
||||
in the form '+000000120'.
|
||||
'^' - Forces the field to be centered within the available
|
||||
space.
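The two alignment flags above can be tried with the format specification syntax of current Python, which is assumed here to match this draft for these cases. The fill characters and widths are arbitrary choices for the example.

```python
# '=' puts the padding between the sign and the digits (fill '0', width 10).
padded = "{0:0=+10}".format(120)

# '^' centers the value in the field (fill '*', width 9).
centered = "{0:*^9}".format("mid")

# Without a minimum width, the alignment flag has nothing to pad.
bare = "{0:^}".format("mid")
```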
|
||||
|
||||
Note that unless a minimum field width is defined, the field
|
||||
width will always be the same size as the data to fill it, so
|
||||
|
@ -307,7 +342,7 @@ Standard Conversion Specifiers
|
|||
"Today is: {0:a b d H:M:S Y}".format(datetime.now())
|
||||
|
||||
|
||||
Controlling Formatting
|
||||
Controlling Formatting on a Per-Type Basis
|
||||
|
||||
A class that wishes to implement a custom interpretation of its
|
||||
conversion specifiers can implement a __format__ method:
|
||||
|
@ -334,107 +369,187 @@ Controlling Formatting
|
|||
3) Otherwise, call str() or unicode() as appropriate.
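A minimal sketch of a class with its own __format__ method follows. It assumes the one-argument signature used by current Python, where the method receives the specifier text after the colon. The Temperature class and its trailing 'F' convention are hypothetical and exist only for this example.

```python
class Temperature:
    """Hypothetical value type with a custom specifier convention."""

    def __init__(self, celsius):
        self.celsius = celsius

    def __format__(self, spec):
        # Assumed convention: a trailing 'F' requests Fahrenheit output;
        # the rest of the specifier is passed on to float formatting.
        if spec.endswith("F"):
            return format(self.celsius * 9 / 5 + 32, spec[:-1]) + " F"
        return format(self.celsius, spec) + " C"


t = Temperature(21.5)
in_celsius = "{0:.1f}".format(t)
in_fahrenheit = "{0:.1fF}".format(t)
```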
|
||||
|
||||
|
||||
User-Defined Formatting Classes
|
||||
User-Defined Formatting
|
||||
|
||||
There will be times when customizing the formatting of fields
|
||||
on a per-type basis is not enough. An example might be an
|
||||
accounting application, which displays negative numbers in
|
||||
parentheses rather than using a negative sign.
|
||||
on a per-type basis is not enough. An example might be a
|
||||
spreadsheet application, which displays hash marks '#' when a value
|
||||
is too large to fit in the available space.
|
||||
|
||||
The string formatting system facilitates this kind of application-
|
||||
specific formatting by allowing user code to directly invoke
|
||||
the code that interprets format strings and fields. User-written
|
||||
code can intercept the normal formatting operations on a per-field
|
||||
basis, substituting their own formatting methods.
|
||||
For more powerful and flexible formatting, access to the underlying
|
||||
format engine can be obtained through the 'Formatter' class that
|
||||
lives in the 'string' module. This class takes additional options
|
||||
which are not accessible via the normal str.format method.
|
||||
|
||||
For example, in the aforementioned accounting application, there
|
||||
could be an application-specific number formatter, which reuses
|
||||
the string.format templating code to do most of the work. The
|
||||
API for such an application-specific formatter is up to the
|
||||
application; here are several possible examples:
|
||||
An application can create its own Formatter instance which has
|
||||
customized behavior, either by setting the properties of the
|
||||
Formatter instance, or by subclassing the Formatter class.
|
||||
|
||||
cell_format("The total is: {0}", total)
|
||||
The PEP does not attempt to exactly specify all methods and
|
||||
properties defined by the Formatter class; instead, those will be
|
||||
defined and documented in the initial implementation. However, this
|
||||
PEP will specify the general requirements for the Formatter class,
|
||||
which are listed below.
|
||||
|
||||
TemplateString("The total is: {0}").format(total)
|
||||
|
||||
Creating an application-specific formatter is relatively straight-
|
||||
forward. The string and unicode classes will have a class method
|
||||
called 'cformat' that does all the actual work of formatting; The
|
||||
built-in format() method is just a wrapper that calls cformat.
|
||||
Formatter Creation and Initialization
|
||||
|
||||
The type signature for the cFormat function is as follows:
|
||||
The Formatter class takes a single initialization argument, 'flags':
|
||||
|
||||
cformat(template, format_hook, args, kwargs)
|
||||
Formatter(flags=0)
|
||||
|
||||
The parameters to the cformat function are:
|
||||
The 'flags' argument is used to control certain subtle behavioral
|
||||
differences in formatting that would be cumbersome to change via
|
||||
subclassing. The flag values are defined as class variables
|
||||
in the "Formatter" class:
|
||||
|
||||
-- The format template string.
|
||||
-- A callable 'format hook', which is called once per field
|
||||
-- A tuple containing the positional arguments
|
||||
-- A dict containing the keyword arguments
|
||||
Formatter.ALLOW_LEADING_UNDERSCORES
|
||||
|
||||
The cformat function will parse all of the fields in the format
|
||||
string, and return a new string (or unicode) with all of the
|
||||
fields replaced with their formatted values.
|
||||
By default, leading underscores are not allowed in identifier
|
||||
lookups (getattr or getitem). Setting this flag will allow
|
||||
this.
|
||||
|
||||
The format hook is a callable object supplied by the user, which
|
||||
is invoked once per field, and which can override the normal
|
||||
formatting for that field. For each field, the cformat function
|
||||
will attempt to call the field format hook with the following
|
||||
arguments:
|
||||
Formatter.CHECK_UNUSED_POSITIONAL
|
||||
|
||||
format_hook(value, conversion)
|
||||
If this flag is set, then any positional arguments which are
|
||||
supplied to the 'format' method but which are not used by
|
||||
the format string will cause an error.
|
||||
|
||||
The 'value' field corresponds to the value being formatted, which
|
||||
was retrieved from the arguments using the field name.
|
||||
Formatter.CHECK_UNUSED_NAME
|
||||
|
||||
The 'conversion' argument is the conversion spec part of the
|
||||
field, which will be either a string or unicode object, depending
|
||||
on the type of the original format string.
|
||||
If this flag is set, then any named arguments which are
|
||||
supplied to the 'format' method but which are not used by
|
||||
the format string will cause an error.
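The effect of the two CHECK_UNUSED flags can be sketched with the string.Formatter class in current Python. That class has no 'flags' argument; it exposes a check_unused_args() hook instead, so the subclass below is an approximation of the behavior described here, not this draft's API. The name StrictFormatter and the use of ValueError are assumptions.

```python
from string import Formatter


class StrictFormatter(Formatter):
    """Hypothetical formatter that rejects any unused argument."""

    def check_unused_args(self, used_args, args, kwargs):
        # used_args holds every index and name the format string referred to.
        unused = [i for i in range(len(args)) if i not in used_args]
        unused += [name for name in kwargs if name not in used_args]
        if unused:
            raise ValueError("unused arguments: {0!r}".format(unused))


strict = StrictFormatter()
all_used = strict.format("{0} {name}", "a", name="b")

try:
    strict.format("{0}", "a", "extra")
    complained = False
except ValueError:
    complained = True
```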
|
||||
|
||||
The field_hook will be called once per field. The field_hook may
|
||||
take one of two actions:
|
||||
|
||||
1) Return a string or unicode object that is the result
|
||||
of the formatting operation.
|
||||
Formatter Methods
|
||||
|
||||
2) Return None, indicating that the field_hook will not
|
||||
process this field and the default formatting should be
|
||||
used. This decision should be based on the type of the
|
||||
value object, and the contents of the conversion string.
|
||||
The methods of class Formatter are as follows:
|
||||
|
||||
-- format(format_string, *args, **kwargs)
|
||||
-- vformat(format_string, args, kwargs)
|
||||
-- get_positional(args, index)
|
||||
-- get_named(kwds, name)
|
||||
-- format_field(value, conversion)
|
||||
|
||||
'format' is the primary API method. It takes a format template,
|
||||
and an arbitrary set of positional and keyword arguments. 'format'
|
||||
is just a wrapper that calls 'vformat'.
|
||||
|
||||
'vformat' is the function that does the actual work of formatting. It
|
||||
is exposed as a separate function for cases where you want to pass in
|
||||
a predefined dictionary of arguments, rather than unpacking and
|
||||
repacking the dictionary as individual arguments using the '*args' and
|
||||
'**kwds' syntax. 'vformat' does the work of breaking up the format
|
||||
template string into character data and replacement fields. It calls
|
||||
the 'get_positional' and 'get_named' methods as appropriate.
|
||||
|
||||
Note that the checking of unused arguments, and the restriction on
|
||||
leading underscores in attribute names are also done in this function.
|
||||
|
||||
'get_positional' and 'get_named' are used to retrieve a given field
|
||||
value. For compound field names, these functions are only called for
|
||||
the first component of the field name; subsequent components are
|
||||
handled through normal attribute and indexing operations. So for
|
||||
example, the field expression '0.name' would cause 'get_positional' to
|
||||
be called with the list of positional arguments and a numeric index of
|
||||
0, and then the standard 'getattr' function would be called to get the
|
||||
'name' attribute of the result.
|
||||
|
||||
If the index or keyword refers to an item that does not exist, then an
|
||||
IndexError/KeyError will be raised.
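The lookup behavior described above can be observed with string.Formatter in current Python. There a single get_value(key, args, kwargs) method plays the role of both 'get_positional' and 'get_named', so the following is a sketch under that assumption. The class name TracingFormatter is hypothetical.

```python
from string import Formatter
from types import SimpleNamespace


class TracingFormatter(Formatter):
    """Hypothetical formatter that records each first-component lookup."""

    def __init__(self):
        super().__init__()
        self.lookups = []

    def get_value(self, key, args, kwargs):
        # Called only for the first component of a compound field name.
        self.lookups.append(key)
        return super().get_value(key, args, kwargs)


fmt = TracingFormatter()
user = SimpleNamespace(name="Fred")
text = fmt.format("{0.name} / {who.name}", user, who=user)

# A missing index or keyword surfaces as IndexError or KeyError.
errors = []
for template in ("{1}", "{missing}"):
    try:
        fmt.format(template, user)
    except (IndexError, KeyError) as exc:
        errors.append(type(exc).__name__)
```

Only 0 and 'who' reach the lookup hook; the '.name' part is resolved afterwards with ordinary attribute access.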
|
||||
|
||||
'format_field' actually generates the text for a replacement field.
|
||||
The 'value' argument corresponds to the value being formatted, which
|
||||
was retrieved from the arguments using the field name. The
|
||||
'conversion' argument is the conversion spec part of the field, which
|
||||
will be either a string or unicode object, depending on the type of
|
||||
the original format string.
|
||||
|
||||
Note: The final implementation of the Formatter class may define
|
||||
additional overridable methods and hooks. In particular, it may be
|
||||
that 'vformat' is itself a composition of several additional,
|
||||
overridable methods. (Depending on whether it is convenient to the
|
||||
implementor of Formatter.)
|
||||
|
||||
|
||||
Customizing Formatters
|
||||
|
||||
This section describes some typical ways that Formatter objects
|
||||
can be customized.
|
||||
|
||||
To support alternative format-string syntax, the 'vformat' method
|
||||
can be overridden to alter the way format strings are parsed.
|
||||
|
||||
One common desire is to support a 'default' namespace, so that
|
||||
you don't need to pass in keyword arguments to the format()
|
||||
method, but can instead use values in a pre-existing namespace.
|
||||
This can easily be done by overriding get_named() as follows:
|
||||
|
||||
class NamespaceFormatter(Formatter):
    def __init__(self, namespace={}, flags=0):
        Formatter.__init__(self, flags)
        self.namespace = namespace

    def get_named(self, kwds, name):
        try:
            # Check explicitly passed arguments first
            return kwds[name]
        except KeyError:
            return self.namespace[name]
|
||||
|
||||
One can use this to easily create a formatting function that allows
|
||||
access to global variables, for example:
|
||||
|
||||
fmt = NamespaceFormatter(globals())
|
||||
|
||||
greeting = "hello"
|
||||
print(fmt.format("{greeting}, world!"))
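For readers who want to run the idea, here is a sketch against string.Formatter in current Python. It assumes the shipped get_value(key, args, kwargs) hook in place of this draft's get_named(), drops the 'flags' argument (which the shipped class does not take), and uses an explicit dictionary rather than globals() so that the example is self-contained.

```python
from string import Formatter


class NamespaceFormatter(Formatter):
    """Sketch: fall back to a default namespace for named fields."""

    def __init__(self, namespace=None):
        super().__init__()
        self.namespace = {} if namespace is None else namespace

    def get_value(self, key, args, kwargs):
        if isinstance(key, str):
            try:
                # Check explicitly passed arguments first
                return kwargs[key]
            except KeyError:
                return self.namespace[key]
        # Positional fields use the normal lookup.
        return super().get_value(key, args, kwargs)


fmt = NamespaceFormatter({"greeting": "hello"})
from_namespace = fmt.format("{greeting}, world!")
overridden = fmt.format("{greeting}, world!", greeting="goodbye")
```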
|
||||
|
||||
A similar technique can be used with the locals() dictionary to
gain access to local variables.
|
||||
|
||||
It would also be possible to create a 'smart' namespace formatter
|
||||
that could automatically access both locals and globals through
|
||||
snooping of the calling stack. Due to the need for compatibility
|
||||
with the different versions of Python, such a capability will not be
|
||||
included in the standard library; however, it is anticipated that
|
||||
someone will create and publish a recipe for doing this.
|
||||
|
||||
Another type of customization is to change the way that built-in
|
||||
types are formatted by overriding the 'format_field' method. (For
|
||||
non-built-in types, you can simply define a __format__ special
|
||||
method on that type.) So for example, you could override the
|
||||
formatting of numbers to output scientific notation when needed.
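A possible shape for such an override, using string.Formatter from current Python, is sketched below. The threshold of one million, the '.3e' specifier and the class name SciFormatter are arbitrary assumptions chosen for the example.

```python
from string import Formatter


class SciFormatter(Formatter):
    """Hypothetical formatter that switches large floats to scientific notation."""

    def format_field(self, value, format_spec):
        # Only step in when the field gives no specifier of its own.
        if isinstance(value, float) and not format_spec and abs(value) >= 1e6:
            return format(value, ".3e")
        return super().format_field(value, format_spec)


fmt = SciFormatter()
small = fmt.format("{0}", 12.5)
large = fmt.format("{0}", 123456789.0)
explicit = fmt.format("{0:.1f}", 123456789.0)   # an explicit specifier wins
```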
|
||||
|
||||
|
||||
Error handling
|
||||
|
||||
The string formatting system has two error handling modes, which
|
||||
are controlled by the value of a class variable:
|
||||
There are two classes of exceptions which can occur during formatting:
|
||||
exceptions generated by the formatter code itself, and exceptions
|
||||
generated by user code (such as a field object's getattr function, or
|
||||
the field_hook function).
|
||||
|
||||
string.strict_format_errors = True
|
||||
In general, exceptions generated by the formatter code itself are
|
||||
of the "ValueError" variety -- there is an error in the actual "value"
|
||||
of the format string. (This is not always true; for example, the
|
||||
string.format() function might be passed a non-string as its first
|
||||
parameter, which would result in a TypeError.)
|
||||
|
||||
The 'strict_format_errors' flag defaults to False, or 'lenient'
|
||||
mode. Setting it to True enables 'strict' mode. The current mode
|
||||
determines how errors are handled, depending on the type of the
|
||||
error.
|
||||
The text associated with these internally generated ValueError
|
||||
exceptions will indicate the location of the exception inside
|
||||
the format string, as well as the nature of the exception.
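The ValueError case can be demonstrated with str.format in current Python. The helper try_format below is hypothetical; it reports only the exception type, because the exact message text is left to the implementation and may not include position information.

```python
def try_format(template, *args):
    """Return the formatted text, or the name of the exception raised."""
    try:
        return template.format(*args)
    except (ValueError, IndexError, KeyError) as exc:
        return type(exc).__name__


ok = try_format("value: {0}", 1)
unclosed = try_format("value: {0", 1)    # replacement field never closed
stray = try_format("value: 0}", 1)       # closing brace without an opener
```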
|
||||
|
||||
The types of errors that can occur are:
|
||||
For exceptions generated by user code, a trace record and
|
||||
dummy frame will be added to the traceback stack to help
|
||||
in determining the location in the string where the exception
|
||||
occurred. The inserted traceback will indicate that the
|
||||
error occurred at:
|
||||
|
||||
1) Reference to a missing or invalid argument from within a
|
||||
field specifier. In strict mode, this will raise an exception.
|
||||
In lenient mode, this will cause the value of the field to be
|
||||
replaced with the string '?name?', where 'name' will be the
|
||||
type of error (KeyError, IndexError, or AttributeError).
|
||||
File "<format_string>", line XX, in column_YY
|
||||
|
||||
So for example:
|
||||
|
||||
>>> string.strict_format_errors = False
|
||||
>>> print 'Item 2 of argument 0 is: {0[2]}'.format( [0,1] )
|
||||
"Item 2 of argument 0 is: ?IndexError?"
|
||||
|
||||
2) Unused argument. In strict mode, this will raise an exception.
|
||||
In lenient mode, this will be ignored.
|
||||
|
||||
3) Exception raised by underlying formatter. These exceptions
|
||||
are always passed through, regardless of the current mode.
|
||||
where XX and YY represent the line and character position
|
||||
information in the string, respectively.
|
||||
|
||||
|
||||
Alternate Syntax
|
||||
|
@ -500,11 +615,59 @@ Alternate Syntax
|
|||
what .Net uses.
|
||||
|
||||
|
||||
Security Considerations
|
||||
|
||||
Historically, string formatting has been a common source of
|
||||
security holes in web-based applications, particularly if the
|
||||
string templating system allows arbitrary expressions to be
|
||||
embedded in format strings.
|
||||
|
||||
The typical scenario is one where the string data being processed
|
||||
is coming from outside the application, perhaps from HTTP headers
|
||||
or fields within a web form. An attacker could substitute their
|
||||
own strings designed to cause havoc.
|
||||
|
||||
The string formatting system outlined in this PEP is by no means
|
||||
'secure', in the sense that no Python library module can, on its
|
||||
own, guarantee security, especially given the open nature of
|
||||
the Python language. Building a secure application requires a
|
||||
secure approach to design.
|
||||
|
||||
What this PEP does attempt to do is make the job of designing a
|
||||
secure application easier, by making it easier for a programmer
|
||||
to reason about the possible consequences of a string formatting
|
||||
operation. It does this by limiting those consequences to a smaller
|
||||
and more easily understood subset.
|
||||
|
||||
For example, because it is possible in Python to override the
|
||||
'getattr' operation of a type, the interpretation of a compound
|
||||
replacement field such as "0.name" could potentially run
|
||||
arbitrary code.
|
||||
|
||||
However, it is *extremely* rare for the mere retrieval of an
|
||||
attribute to have side effects. Other operations which are more
|
||||
likely to have side effects - such as method calls - are disallowed.
|
||||
Thus, a programmer can be reasonably assured that no string
|
||||
formatting operation will cause a state change in the program.
|
||||
This assurance is not only useful in securing an application, but
|
||||
in debugging it as well.
|
||||
|
||||
Similarly, the restriction on field names beginning with
|
||||
underscores is intended to provide similar assurances about the
|
||||
visibility of private data.
|
||||
|
||||
Of course, programmers would be well-advised to avoid using
|
||||
any external data as format strings, and instead use that data
|
||||
as the format arguments.
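The difference between the two uses of external data can be sketched as follows, using str.format from current Python. The Config class, its 'secret' attribute and the contents of the untrusted string are invented for the illustration.

```python
class Config:
    """Hypothetical application object holding sensitive data."""

    secret = "hunter2"

    def __init__(self, name):
        self.name = name


cfg = Config("demo")
untrusted = "{0.secret}"        # imagine this arrived in a web form

# Risky: the external data is used as the format string, so it can
# walk the attributes of whatever arguments are passed in.
leaked = untrusted.format(cfg)

# Safer: the format string is fixed, and the external data is only
# an argument, so its braces are never interpreted.
safe = "User said: {0}".format(untrusted)
```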
|
||||
|
||||
|
||||
Sample Implementation
|
||||
|
||||
A rough prototype of the underlying 'cformat' function has been
|
||||
coded in Python, however it needs much refinement before being
|
||||
submitted.
|
||||
An implementation of an earlier version of this PEP was created by
|
||||
Patrick Maupin and Eric V. Smith, and can be found in the pep3101
|
||||
sandbox at:
|
||||
|
||||
http://svn.python.org/view/sandbox/trunk/pep3101/
|
||||
|
||||
|
||||
Backwards Compatibility
|
||||
|
|