PEP: 3101
Title: Advanced String Formatting
Version: $Revision$
Last-Modified: $Date$
Author: Talin
Status: Draft
Type: Standards Track
Content-Type: text/plain
Created: 16-Apr-2006
Python-Version: 3.0
Post-History: 28-Apr-2006, 6-May-2006, 10-Jun-2006


Abstract

    This PEP proposes a new system for built-in string formatting
    operations, intended as a replacement for the existing '%'
    string formatting operator.


Rationale

    Python currently provides two methods of string interpolation:

    - The '%' operator for strings. [1]
    - The string.Template module. [2]

    The primary scope of this PEP concerns proposals for built-in
    string formatting operations (in other words, methods of the
    built-in string type).

    The '%' operator is primarily limited by the fact that it is a
    binary operator, and therefore can take at most two arguments.
    One of those arguments is already dedicated to the format
    string, leaving all other variables to be squeezed into the
    remaining argument. The current practice is to use either a
    dictionary or a tuple as the second argument, but as many people
    have commented [3], this lacks flexibility. The "all or nothing"
    approach (meaning that one must choose between only positional
    arguments, or only named arguments) is felt to be overly
    constraining.

    While there is some overlap between this proposal and
    string.Template, it is felt that each serves a distinct need,
    and that one does not obviate the other. This proposal is for a
    mechanism which, like '%', is efficient for small strings which
    are only used once. Compilation of a string into a template is
    therefore not contemplated in this proposal, although the
    proposal does take care to define format strings and the API in
    such a way that an efficient template package could reuse the
    syntax and even some of the underlying formatting code.


Specification

    The specification will consist of the following parts:

    - Specification of a new formatting method to be added to the
      built-in string class.
    - Specification of functions and flag values to be added to
      the string module, so that the underlying formatting engine
      can be used with additional options.
    - Specification of a new syntax for format strings.
    - Specification of a new set of special methods to control the
      formatting and conversion of objects.
    - Specification of an API for user-defined formatting classes.
    - Specification of how formatting errors are handled.

    Note on string encodings: When discussing this PEP in the
    context of Python 3.0, it is assumed that all strings are
    unicode strings, and that the use of the word 'string' in the
    context of this document will generally refer to a Python 3.0
    string, which is the same as the Python 2.x unicode object.

    In the context of Python 2.x, the use of the word 'string' in
    this document refers to an object which may either be a regular
    string or a unicode object. All of the function call interfaces
    described in this PEP can be used for both strings and unicode
    objects, and in all cases there is sufficient information to be
    able to properly deduce the output string type (in other words,
    there is no need for two separate APIs). In all cases, the type
    of the format string dominates - that is, the conversion will
    always result in an object that contains the same
    representation of characters as the input format string.


String Methods

    The built-in string class (and also the unicode class in 2.6)
    will gain a new method, 'format', which takes an arbitrary
    number of positional and keyword arguments:

        "The story of {0}, {1}, and {c}".format(a, b, c=d)

    Within a format string, each positional argument is identified
    with a number, starting from zero, so in the above example, 'a'
    is argument 0 and 'b' is argument 1. Each keyword argument is
    identified by its keyword name, so in the above example, 'c' is
    used to refer to the third argument.


Format Strings

    Format strings consist of intermingled character data and markup.
    Character data is data which is transferred unchanged from the
    format string to the output string; markup is not transferred
    from the format string directly to the output, but instead is
    used to define 'replacement fields' that describe to the format
    engine what should be placed in the output string in place of
    the markup.

    Brace characters ('curly braces') are used to indicate a
    replacement field within the string:

        "My name is {0}".format('Fred')

    The result of this is the string:

        "My name is Fred"

    Braces can be escaped by doubling:

        "My name is {0} :-{{}}".format('Fred')

    Which would produce:

        "My name is Fred :-{}"

    The element within the braces is called a 'field'. Fields
    consist of a 'field name', which can either be simple or
    compound, and an optional 'conversion specifier'.


Simple and Compound Field Names

    Simple field names are either names or numbers. If numbers, they
    must be valid base-10 integers; if names, they must be valid
    Python identifiers. A number is used to identify a positional
    argument, while a name is used to identify a keyword argument.

    A compound field name is a combination of multiple simple field
    names in an expression:

        "My name is {0.name}".format(open('out.txt', 'w'))

    This example shows the use of the 'getattr' or 'dot' operator
    in a field expression. The dot operator allows an attribute of
    an input value to be specified as the field value.

    The types of expressions that can be used in a compound name
    have been deliberately limited in order to prevent potential
    security exploits resulting from the ability to place arbitrary
    Python expressions inside of strings. Only two operators are
    supported, the '.' (getattr) operator, and the '[]' (getitem)
    operator.

    Another limitation that is defined to limit potential security
    issues is that field names or attribute names beginning with an
    underscore are disallowed. This enforces the common convention
    that names beginning with an underscore are 'private'.
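
    As a small illustration of simple versus compound field names,
    the sketch below runs against the str.format method as it ships
    in Python 3. The Person record type is hypothetical and exists
    only for this example.

```python
from collections import namedtuple

# Hypothetical record type used only for this illustration.
Person = namedtuple("Person", ["name", "age"])

fred = Person(name="Fred", age=42)

# Simple field names: a number selects a positional argument,
# an identifier selects a keyword argument.
simple = "{0} is {age}".format(fred.name, age=fred.age)

# Compound field name: '.' performs attribute lookup on argument 0.
compound = "My name is {0.name}".format(fred)

print(simple)    # Fred is 42
print(compound)  # My name is Fred
```

    Note that only the first component ('0') selects an argument; the
    '.name' part is resolved by ordinary attribute access on it.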

    An example of the 'getitem' syntax:

        "My name is {0[name]}".format(dict(name='Fred'))

    It should be noted that the use of 'getitem' within a string is
    much more limited than its normal use. In the above example, the
    string 'name' really is the literal string 'name', not a
    variable named 'name'. The rules for parsing an item key are
    very simple. If it starts with a digit, then it is treated as a
    number, otherwise it is used as a string.

    It is not possible to specify arbitrary dictionary keys from
    within a format string.

    Implementation note: The implementation of this proposal is not
    required to enforce the rule about a name being a valid Python
    identifier. Instead, it will rely on the getattr function of
    the underlying object to throw an exception if the identifier
    is not legal. The format function will have a minimalist parser
    which only attempts to figure out when it is "done" with an
    identifier (by finding a '.' or a ']', or '}', etc.). The only
    exception to this laissez-faire approach is that, by default,
    strings are not allowed to have leading underscores.


Conversion Specifiers

    Each field can also specify an optional set of 'conversion
    specifiers' which can be used to adjust the format of that
    field. Conversion specifiers follow the field name, with a
    colon (':') character separating the two:

        "My name is {0:8}".format('Fred')

    The meaning and syntax of the conversion specifiers depend on
    the type of object that is being formatted; however, there is a
    standard set of conversion specifiers used for any object that
    does not override them.

    Conversion specifiers can themselves contain replacement fields.
    For example, a field whose field width is itself a parameter
    could be specified via:

        "{0:{1}}".format(a, b, c)

    Note that the doubled '}' at the end, which would normally be
    escaped, is not escaped in this case. This is because the '{{'
    and '}}' syntax for escapes is only applied when used *outside*
    of a format field.
    Within a format field, the brace characters always have their
    normal meaning.

    The syntax for conversion specifiers is open-ended, since a
    class can override the standard conversion specifiers. In such
    cases, the format() method merely passes all of the characters
    between the first colon and the matching brace to the relevant
    underlying formatting method.


Standard Conversion Specifiers

    If an object does not define its own conversion specifiers, a
    standard set of conversion specifiers is used. These are similar
    in concept to the conversion specifiers used by the existing
    '%' operator; however, there are also a number of significant
    differences. The standard conversion specifiers fall into three
    major categories: string conversions, integer conversions and
    floating point conversions.

    The general form of a standard conversion specifier is:

        [[fill]align][sign][width][.precision][type]

    The brackets ([]) indicate an optional element.

    The optional align flag can be one of the following:

        '<' - Forces the field to be left-aligned within the
              available space (This is the default.)
        '>' - Forces the field to be right-aligned within the
              available space.
        '=' - Forces the padding to be placed after the sign (if
              any) but before the digits. This is used for printing
              fields in the form '+000000120'.
        '^' - Forces the field to be centered within the available
              space.

    Note that unless a minimum field width is defined, the field
    width will always be the same size as the data to fill it, so
    that the alignment option has no meaning in this case.

    The optional 'fill' character defines the character to be used
    to pad the field to the minimum width. The alignment flag must
    be supplied if the character is a number other than 0
    (otherwise the character would be interpreted as part of the
    field width specifier). A zero fill character without an
    alignment flag implies an alignment type of '='.
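
    The fill and alignment rules above can be tried out with the
    str.format method as it ships in Python 3, which may differ in
    some defaults from this draft. For that reason the sketch below
    always spells the alignment flag out, except in the last line,
    which shows the implied '=' for a zero fill. The variable names
    are illustrative only.

```python
# Explicit alignment flags with a minimum width of 8.
left = "{0:<8}".format("Fred")      # 'Fred    '
right = "{0:>8}".format("Fred")     # '    Fred'
center = "{0:^8}".format("Fred")    # '  Fred  '

# A fill character precedes the alignment flag.
starred = "{0:*>8}".format("Fred")  # '****Fred'

# '=' places the padding between the sign and the digits.
padded = "{0:0=+10}".format(120)    # '+000000120'

# A zero fill without an alignment flag behaves like '=' for numbers.
implied = "{0:+010}".format(120)    # '+000000120'
```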

    The 'sign' element can be one of the following:

        '+'  - indicates that a sign should be used for both
               positive as well as negative numbers
        '-'  - indicates that a sign should be used only for
               negative numbers (this is the default behaviour)
        ' '  - indicates that a leading space should be used on
               positive numbers
        '()' - indicates that negative numbers should be surrounded
               by parentheses

    'width' is a decimal integer defining the minimum field width.
    If not specified, then the field width will be determined by
    the content.

    The 'precision' is a decimal number indicating how many digits
    should be displayed after the decimal point in a floating point
    conversion. In a string conversion the precision indicates how
    many characters will be used from the field content. The
    precision is ignored for integer conversions.

    Finally, the 'type' determines how the data should be
    presented. If the type field is absent, an appropriate type
    will be assigned based on the value to be formatted ('d' for
    integers and longs, 'g' for floats, and 's' for everything
    else).

    The available string conversion types are:

        's' - String format. Invokes str() on the object.
              This is the default conversion specifier type.
        'r' - Repr format. Invokes repr() on the object.

    There are several integer conversion types. All invoke int() on
    the object before attempting to format it.

    The available integer conversion types are:

        'b' - Binary. Outputs the number in base 2.
        'c' - Character. Converts the integer to the corresponding
              unicode character before printing.
        'd' - Decimal Integer. Outputs the number in base 10.
        'o' - Octal format. Outputs the number in base 8.
        'x' - Hex format. Outputs the number in base 16, using
              lower-case letters for the digits above 9.
        'X' - Hex format. Outputs the number in base 16, using
              upper-case letters for the digits above 9.

    There are several floating point conversion types. All invoke
    float() on the object before attempting to format it.

    The available floating point conversion types are:

        'e' - Exponent notation. Prints the number in scientific
              notation using the letter 'e' to indicate the
              exponent.
        'E' - Exponent notation. Same as 'e' except it uses an
              upper case 'E' as the separator character.
        'f' - Fixed point. Displays the number as a fixed-point
              number.
        'F' - Fixed point. Same as 'f'.
        'g' - General format. This prints the number as a
              fixed-point number, unless the number is too large,
              in which case it switches to 'e' exponent notation.
        'G' - General format. Same as 'g' except switches to 'E'
              if the number gets too large.
        'n' - Number. This is the same as 'g', except that it uses
              the current locale setting to insert the appropriate
              number separator characters.
        '%' - Percentage. Multiplies the number by 100 and displays
              in fixed ('f') format, followed by a percent sign.

    Objects are able to define their own conversion specifiers to
    replace the standard ones. An example is the 'datetime' class,
    whose conversion specifiers might look something like the
    arguments to the strftime() function:

        "Today is: {0:a b d H:M:S Y}".format(datetime.now())


Controlling Formatting on a Per-Type Basis

    A class that wishes to implement a custom interpretation of its
    conversion specifiers can implement a __format__ method:

        class AST:
            def __format__(self, specifiers):
                ...

    The 'specifiers' argument will be either a string object or a
    unicode object, depending on the type of the original format
    string. The __format__ method should test the type of the
    specifiers parameter to determine whether to return a string or
    unicode object. It is the responsibility of the __format__
    method to return an object of the proper type.

    string.format() will format each field using the following
    steps:

     1) See if the value to be formatted has a __format__ method.
        If it does, then call it.
     2) Otherwise, check the internal formatter within
        string.format that contains knowledge of certain builtin
        types.
     3) Otherwise, call str() or unicode() as appropriate.


User-Defined Formatting

    There will be times when customizing the formatting of fields
    on a per-type basis is not enough. An example might be a
    spreadsheet application, which displays hash marks '#' when a
    value is too large to fit in the available space.

    For more powerful and flexible formatting, access to the
    underlying format engine can be obtained through the
    'Formatter' class that lives in the 'string' module. This class
    takes additional options which are not accessible via the
    normal str.format method.

    An application can create its own Formatter instance which has
    customized behavior, either by setting the properties of the
    Formatter instance, or by subclassing the Formatter class.

    The PEP does not attempt to exactly specify all methods and
    properties defined by the Formatter class; instead, those will
    be defined and documented in the initial implementation.
    However, this PEP will specify the general requirements for the
    Formatter class, which are listed below.


Formatter Creation and Initialization

    The Formatter class takes a single initialization argument,
    'flags':

        Formatter(flags=0)

    The 'flags' argument is used to control certain subtle
    behavioral differences in formatting that would be cumbersome
    to change via subclassing. The flags values are defined as
    static variables in the "Formatter" class:

        Formatter.ALLOW_LEADING_UNDERSCORES

            By default, leading underscores are not allowed in
            identifier lookups (getattr or getitem). Setting this
            flag will allow this.

        Formatter.CHECK_UNUSED_POSITIONAL

            If this flag is set, then any positional arguments
            which are supplied to the 'format' method but which are
            not used by the format string will cause an error.

        Formatter.CHECK_UNUSED_NAME

            If this flag is set, then any named arguments which are
            supplied to the 'format' method but which are not used
            by the format string will cause an error.
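
    The unused-argument checks described above can be approximated
    today without the draft's flags. The sketch below is not the
    API specified in this PEP: it assumes the string.Formatter
    class as it ships in the Python 3 standard library, which
    exposes a check_unused_args() hook instead of flag values. The
    StrictFormatter name is hypothetical.

```python
import string


class StrictFormatter(string.Formatter):
    """Hypothetical formatter that rejects arguments the format string ignores."""

    def check_unused_args(self, used_args, args, kwargs):
        # used_args holds the integer indexes and keyword names that
        # the format string actually referenced.
        unused_positional = [i for i in range(len(args)) if i not in used_args]
        unused_named = [name for name in kwargs if name not in used_args]
        if unused_positional or unused_named:
            raise ValueError(
                "unused arguments: positions %r, names %r"
                % (unused_positional, sorted(unused_named))
            )


strict = StrictFormatter()

# Every argument is referenced, so formatting succeeds.
ok = strict.format("{0} and {who}", "a", who="b")

# The second positional argument is never referenced.
try:
    strict.format("{0}", "a", "extra")
    rejected = False
except ValueError:
    rejected = True
```

    A single hook covers both the positional and the named case, so
    the two CHECK_UNUSED_* behaviors collapse into one override in
    this sketch.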


Formatter Methods

    The methods of class Formatter are as follows:

        -- format(format_string, *args, **kwargs)
        -- vformat(format_string, args, kwargs)
        -- get_positional(args, index)
        -- get_named(kwds, name)
        -- format_field(value, conversion)

    'format' is the primary API method. It takes a format template,
    and an arbitrary set of positional and keyword arguments.
    'format' is just a wrapper that calls 'vformat'.

    'vformat' is the function that does the actual work of
    formatting. It is exposed as a separate function for cases
    where you want to pass in a predefined dictionary of arguments,
    rather than unpacking and repacking the dictionary as
    individual arguments using the '*args' and '**kwds' syntax.
    'vformat' does the work of breaking up the format template
    string into character data and replacement fields. It calls the
    'get_positional' and 'get_named' methods as appropriate.

    Note that the checking of unused arguments, and the restriction
    on leading underscores in attribute names are also done in this
    function.

    'get_positional' and 'get_named' are used to retrieve a given
    field value. For compound field names, these functions are only
    called for the first component of the field name; subsequent
    components are handled through normal attribute and indexing
    operations.

    So for example, the field expression '0.name' would cause
    'get_positional' to be called with the list of positional
    arguments and a numeric index of 0, and then the standard
    'getattr' function would be called to get the 'name' attribute
    of the result.

    If the index or keyword refers to an item that does not exist,
    then an IndexError/KeyError will be raised.

    'format_field' actually generates the text for a replacement
    field. The 'value' argument corresponds to the value being
    formatted, which was retrieved from the arguments using the
    field name. The 'conversion' argument is the conversion spec
    part of the field, which will be either a string or unicode
    object, depending on the type of the original format string.
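
    To make the role of 'format_field' more concrete, here is a
    minimal sketch that assumes the string.Formatter class from the
    Python 3 standard library, whose format_field(value,
    format_spec) method plays the role described above. It mimics a
    spreadsheet that shows hash marks when a value does not fit.
    The HashOverflowFormatter class and its max_width parameter are
    hypothetical names introduced only for this example.

```python
import string


class HashOverflowFormatter(string.Formatter):
    """Hypothetical formatter that prints '#' marks for fields that are too wide."""

    def __init__(self, max_width):
        super().__init__()
        self.max_width = max_width

    def format_field(self, value, format_spec):
        # Let the default machinery produce the text first.
        text = super().format_field(value, format_spec)
        # Replace overly wide output with hash marks, spreadsheet-style.
        if len(text) > self.max_width:
            return "#" * self.max_width
        return text


fmt = HashOverflowFormatter(max_width=6)
row = fmt.format("{0:>6}|{1:>6}", 123, 123456789)
print(row)  # '   123|######'
```

    Because only format_field is overridden, parsing of the format
    string and argument lookup are left to the base class.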

    Note: The final implementation of the Formatter class may
    define additional overridable methods and hooks. In particular,
    it may be that 'vformat' is itself a composition of several
    additional, overridable methods. (Depending on whether it is
    convenient to the implementor of Formatter.)


Customizing Formatters

    This section describes some typical ways that Formatter objects
    can be customized.

    To support alternative format-string syntax, the 'vformat'
    method can be overridden to alter the way format strings are
    parsed.

    One common desire is to support a 'default' namespace, so that
    you don't need to pass in keyword arguments to the format()
    method, but can instead use values in a pre-existing namespace.
    This can easily be done by overriding get_named() as follows:

        class NamespaceFormatter(Formatter):
            def __init__(self, namespace={}, flags=0):
                Formatter.__init__(self, flags)
                self.namespace = namespace

            def get_named(self, kwds, name):
                try:
                    # Check explicitly passed arguments first
                    return kwds[name]
                except KeyError:
                    # Fall back to the default namespace
                    return self.namespace[name]

    One can use this to easily create a formatting function that
    allows access to global variables, for example:

        fmt = NamespaceFormatter(globals())

        greeting = "hello"
        print(fmt.format("{greeting}, world!"))

    A similar technique can be used with the locals() dictionary to
    gain access to local variables.

    It would also be possible to create a 'smart' namespace
    formatter that could automatically access both locals and
    globals through snooping of the calling stack. Due to the need
    for compatibility with the different versions of Python, such a
    capability will not be included in the standard library;
    however, it is anticipated that someone will create and publish
    a recipe for doing this.

    Another type of customization is to change the way that
    built-in types are formatted by overriding the 'format_field'
    method. (For non-built-in types, you can simply define a
    __format__ special method on that type.)
    So for example, you could override the formatting of numbers to
    output scientific notation when needed.


Error handling

    There are two classes of exceptions which can occur during
    formatting: exceptions generated by the formatter code itself,
    and exceptions generated by user code (such as a field object's
    getattr function, or the field_hook function).

    In general, exceptions generated by the formatter code itself
    are of the "ValueError" variety -- there is an error in the
    actual "value" of the format string. (This is not always true;
    for example, the string.format() function might be passed a
    non-string as its first parameter, which would result in a
    TypeError.)

    The text associated with these internally generated ValueError
    exceptions will indicate the location of the exception inside
    the format string, as well as the nature of the exception.

    For exceptions generated by user code, a trace record and dummy
    frame will be added to the traceback stack to help in
    determining the location in the string where the exception
    occurred. The inserted traceback will indicate that the error
    occurred at:

        File ";", line XX, in column_YY

    where XX and YY represent the line and character position
    information in the string, respectively.


Alternate Syntax

    Naturally, one of the most contentious issues is the syntax of
    the format strings, and in particular the markup conventions
    used to indicate fields.

    Rather than attempting to exhaustively list all of the various
    proposals, I will cover the ones that are most widely used
    already.

    - Shell variable syntax: $name and $(name) (or in some
      variants, ${name}). This is probably the oldest convention
      out there, and is used by Perl and many others. When used
      without the braces, the length of the variable is determined
      by lexically scanning until an invalid character is found.
      This scheme is generally used in cases where interpolation is
      implicit - that is, in environments where any string can
      contain interpolation variables, and no special substitution
      function need be invoked. In such cases, it is important to
      prevent the interpolation behavior from occurring
      accidentally, so the '$' (which is otherwise a relatively
      uncommonly-used character) is used to signal when the
      behavior should occur.

      It is the author's opinion, however, that in cases where the
      formatting is explicitly invoked, less care needs to be taken
      to prevent accidental interpolation, in which case a lighter
      and less unwieldy syntax can be used.

    - Printf and its cousins ('%'), including variations that add a
      field index, so that fields can be interpolated out of order.

    - Other bracket-only variations. Various MUDs (Multi-User
      Dungeons) such as MUSH have used brackets (e.g. [name]) to do
      string interpolation. The Microsoft .Net libraries use braces
      ({}), and a syntax which is very similar to the one in this
      proposal, although the syntax for conversion specifiers is
      quite different. [4]

    - Backquoting. This method has the benefit of minimal
      syntactical clutter; however, it lacks many of the benefits
      of a function call syntax (such as complex expression
      arguments, custom formatters, etc.).

    - Other variations include Ruby's #{}, PHP's {$name}, and so
      on.

    Some specific aspects of the syntax warrant additional comments:

    1) Backslash character for escapes. The original version of
       this PEP used backslash rather than doubling to escape a
       bracket. This worked because backslashes in Python string
       literals that don't conform to a standard backslash sequence
       such as '\n' are left unmodified. However, this caused a
       certain amount of confusion, and led to potential situations
       of multiple recursive escapes, i.e. '\\\\{' to place a
       literal backslash in front of a bracket.

    2) The use of the colon character (':') as a separator for
       conversion specifiers.
       This was chosen simply because that's what .Net uses.


Security Considerations

    Historically, string formatting has been a common source of
    security holes in web-based applications, particularly if the
    string templating system allows arbitrary expressions to be
    embedded in format strings.

    The typical scenario is one where the string data being
    processed is coming from outside the application, perhaps from
    HTTP headers or fields within a web form. An attacker could
    substitute their own strings designed to cause havoc.

    The string formatting system outlined in this PEP is by no
    means 'secure', in the sense that no Python library module can,
    on its own, guarantee security, especially given the open
    nature of the Python language. Building a secure application
    requires a secure approach to design.

    What this PEP does attempt to do is make the job of designing a
    secure application easier, by making it easier for a programmer
    to reason about the possible consequences of a string
    formatting operation. It does this by limiting those
    consequences to a smaller and more easily understood subset.

    For example, because it is possible in Python to override the
    'getattr' operation of a type, the interpretation of a compound
    replacement field such as "0.name" could potentially run
    arbitrary code.

    However, it is *extremely* rare for the mere retrieval of an
    attribute to have side effects. Other operations which are more
    likely to have side effects - such as method calls - are
    disallowed. Thus, a programmer can be reasonably assured that
    no string formatting operation will cause a state change in the
    program. This assurance is not only useful in securing an
    application, but in debugging it as well.

    Similarly, the restriction on field names beginning with
    underscores is intended to provide similar assurances about the
    visibility of private data.

    Of course, programmers would be well-advised to avoid using any
    external data as format strings, and to use that data as the
    format arguments instead.


Sample Implementation

    An implementation of an earlier version of this PEP was created
    by Patrick Maupin and Eric V. Smith, and can be found in the
    pep3101 sandbox at:

        http://svn.python.org/view/sandbox/trunk/pep3101/


Backwards Compatibility

    Backwards compatibility can be maintained by leaving the
    existing mechanisms in place. The new system does not collide
    with any of the method names of the existing string formatting
    techniques, so both systems can co-exist until it comes time to
    deprecate the older system.


References

    [1] Python Library Reference - String formatting operations
        http://docs.python.org/lib/typesseq-strings.html

    [2] Python Library Reference - Template strings
        http://docs.python.org/lib/node109.html

    [3] [Python-3000] String formating operations in python 3k
        http://mail.python.org/pipermail/python-3000/2006-April/000285.html

    [4] Composite Formatting - [.Net Framework Developer's Guide]
        http://msdn.microsoft.com/library/en-us/cpguide/html/cpconcompositeformatting.asp?frame=true


Copyright

    This document has been placed in the public domain.


Local Variables:
mode: indented-text
indent-tabs-mode: nil
sentence-end-double-space: t
fill-column: 70
coding: utf-8
End: