From 0ee269ce09d969938dca97797c69ac5d3df51e50 Mon Sep 17 00:00:00 2001 From: Talin Date: Sun, 11 Jun 2006 00:59:06 +0000 Subject: [PATCH] Lots of changes - added specification for conversions, error handling, complex field specs and general cleanup. --- pep-3101.txt | 306 ++++++++++++++++++++++++++++++++++++++------------- 1 file changed, 228 insertions(+), 78 deletions(-) diff --git a/pep-3101.txt b/pep-3101.txt index 3d9156663..c623bbfb9 100644 --- a/pep-3101.txt +++ b/pep-3101.txt @@ -8,7 +8,7 @@ Type: Standards Content-Type: text/plain Created: 16-Apr-2006 Python-Version: 3.0 -Post-History: 28-Apr-2006 +Post-History: 28-Apr-2006, 6-May-2006, 10-Jun-2006 Abstract @@ -48,7 +48,7 @@ Rationale Specification - The specification will consist of 4 parts: + The specification will consist of the following parts: - Specification of a new formatting method to be added to the built-in string class. @@ -60,6 +60,26 @@ Specification - Specification of an API for user-defined formatting classes. + - Specification of how formatting errors are handled. + + Note on string encodings: Since this PEP is being targeted + at Python 3.0, it is assumed that all strings are unicode strings, + and that the use of the word 'string' in the context of this + document will generally refer to a Python 3.0 string, which is + the same as Python 2.x unicode object. + + If it should happen that this functionality is backported to + the 2.x series, then it will be necessary to handle both regular + string as well as unicode objects. All of the function call + interfaces described in this PEP can be used for both strings + and unicode objects, and in all cases there is sufficient + information to be able to properly deduce the output string + type (in other words, there is no need for two separate APIs). 
+ In all cases, the type of the template string dominates - that + is, the result of the conversion will always result in an object + that contains the same representation of characters as the + input template string. + String Methods @@ -75,9 +95,6 @@ String Methods identified by its keyword name, so in the above example, 'c' is used to refer to the third argument. - The result of the format call is an object of the same type - (string or unicode) as the format string. - Format Strings @@ -90,32 +107,59 @@ Format Strings "My name is Fred" - Braces can be escaped using a backslash: + Braces can be escaped by doubling: - "My name is {0} :-\{\}".format('Fred') + "My name is {0} :-{{}}".format('Fred') Which would produce: "My name is Fred :-{}" - + The element within the braces is called a 'field'. Fields consist of a 'field name', which can either be simple or compound, and an optional 'conversion specifier'. + + +Simple and Compound Field Names Simple field names are either names or numbers. If numbers, they must be valid base-10 integers; if names, they must be valid Python identifiers. A number is used to identify a positional argument, while a name is used to identify a keyword argument. + + A compound field name is a combination of multiple simple field + names in an expression: - Compound names are a sequence of simple names seperated by - periods: + "My name is {0.name}".format(file('out.txt')) + + This example shows the use of the 'getattr' or 'dot' operator + in a field expression. The dot operator allows an attribute of + an input value to be specified as the field value. - "My name is {0.name} :-\{\}".format(dict(name='Fred')) + The types of expressions that can be used in a compound name + have been deliberately limited in order to prevent potential + security exploits resulting from the ability to place arbitrary + Python expressions inside of strings. Only two operators are + supported, the '.' (getattr) operator, and the '[]' (getitem) + operator. 
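The two operators named above can be exercised with a minimal sketch. It assumes the str.format() method as it later shipped in Python, which may differ in detail from this draft; the Person class and its 'name' attribute are hypothetical names used only for illustration.

```python
class Person:
    """Hypothetical value type used only for this illustration."""

    def __init__(self, name):
        self.name = name


fred = Person("Fred")

# '.' (getattr): reads the 'name' attribute of positional argument 0.
by_attr = "My name is {0.name}".format(fred)

# '[]' (getitem): 'name' is the literal string key, not a variable.
by_item = "My name is {0[name]}".format({"name": "Fred"})

# A key that looks like a number is treated as an integer index.
by_index = "First item: {0[0]}".format(["spam", "eggs"])

print(by_attr)
print(by_item)
print(by_index)
```

No other expression syntax is accepted inside the braces, which is the restriction the paragraph above motivates on security grounds.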
+ + An example of the 'getitem' syntax: + + "My name is {0[name]}".format(dict(name='Fred')) + + It should be noted that the use of 'getitem' within a string is + much more limited than its normal use. In the above example, the + string 'name' really is the literal string 'name', not a variable + named 'name'. The rules for parsing an item key are the same as + for parsing a simple name - in other words, if it looks like a + number, then it is treated as a number; if it looks like an + identifier, then it is used as a string. + + It is not possible to specify arbitrary dictionary keys from + within a format string. - Compound names can be used to access specific dictionary entries, - array elements, or object attributes. In the above example, the - '{0.name}' field refers to the dictionary entry 'name' within - positional argument 0. + +Conversion Specifiers Each field can also specify an optional set of 'conversion specifiers' which can be used to adjust the format of that field. @@ -129,53 +173,135 @@ Format Strings built-in types will recognize a standard set of conversion specifiers. - The conversion specifier consists of a sequence of zero or more - characters, each of which can consist of any printable character - except for a non-escaped '}'. + Conversion specifiers can themselves contain replacement fields. + For example, a field whose field width is itself a parameter + could be specified via: - Conversion specifiers can themselves contain replacement fields; - this will be described in a later section. Except for this - replacement, the format() method does not attempt to intepret the - conversion specifiers in any way; it merely passes all of the - characters between the first colon ':' and the matching right - brace ('}') to the various underlying formatters (described - later.) + "{0:{1}}".format(a, b, c) + + Note that the doubled '}' at the end, which would normally be + escaped, is not escaped in this case.
This is because + the '{{' and '}}' syntax for escapes is only applied when used + *outside* of a format field. Within a format field, the brace + characters always have their normal meaning. + + The syntax for conversion specifiers is open-ended, since, other + than doing field replacements, the format() method does not + attempt to interpret them in any way; it merely passes all of the + characters between the first colon and the matching brace to + the various underlying formatter methods. Standard Conversion Specifiers - For most built-in types, the conversion specifiers will be the - same or similar to the existing conversion specifiers used with - the '%' operator. Thus, instead of '%02.2x", you will say - '{0:02.2x}'. + If an object does not define its own conversion specifiers, a + standard set of conversion specifiers is used. These are similar + in concept to the conversion specifiers used by the existing '%' + operator; however, there are also a number of significant + differences. The standard conversion specifiers fall into three + major categories: string conversions, integer conversions and + floating point conversions. + + The general form of a standard conversion specifier is: - There are a few differences however: + [[fill]align][sign][width][.precision][type] - - The trailing letter is optional - you don't need to say '2.2d', - you can instead just say '2.2'. If the letter is omitted, a - default will be assumed based on the type of the argument. - The defaults will be as follows: - - string or unicode object: 's' - integer: 'd' - floating-point number: 'f' - all other types: 's' + The brackets ([]) indicate an optional field. + + The optional 'align' flag can be one of the following: - - Variable field width specifiers use a nested version of the {} - syntax, allowing the width specifier to be either a positional - or keyword argument: + '<' - Forces the field to be left-aligned within the available + space (This is the default.)
+ '>' - Forces the field to be right-aligned within the + available space. + '=' - Forces the padding to be placed immediately + after the sign, if any. This is used for printing fields + in the form '+000000120'. + + Note that unless a minimum field width is defined, the field + width will always be the same size as the data to fill it, so + that the alignment option has no meaning in this case. + + The optional 'fill' character defines the character to be used to + pad the field to the minimum width. The alignment flag must be + supplied if the character is a number other than 0 (otherwise the + character would be interpreted as part of the field width + specifier). A zero fill character without an alignment flag + implies an alignment type of '='. + + The 'sign' field can be one of the following: - "{0:{1}.{2}d}".format(a, b, c) + '+' - indicates that a sign should be used for both + positive as well as negative numbers + '-' - indicates that a sign should be used only for negative + numbers (this is the default behaviour) + ' ' - indicates that a leading space should be used on + positive numbers + '()' - indicates that negative numbers should be surrounded + by parentheses - - The support for length modifiers (which are ignored by Python - anyway) is dropped. + 'width' is a decimal integer defining the minimum field width. If + not specified, then the field width will be determined by the + content. - For non-built-in types, the conversion specifiers will be specific - to that type. An example is the 'datetime' class, whose - conversion specifiers are identical to the arguments to the - strftime() function: + The 'precision' field is a decimal number indicating how many + digits should be displayed after the decimal point. - "Today is: {0:%a %b %d %H:%M:%S %Y}".format(datetime.now()) + Finally, the 'type' determines how the data should be presented.
+ If the type field is absent, an appropriate type will be assigned + based on the value to be formatted ('d' for integers and longs, + 'g' for floats, and 's' for everything else.) + + The available string conversion types are: + + 's' - String format. Invokes str() on the object. + This is the default conversion specifier type. + 'r' - Repr format. Invokes repr() on the object. + + There are several integer conversion types. All invoke int() on + the object before attempting to format it. + + The available integer conversion types are: + + 'b' - Binary. Outputs the number in base 2. + 'c' - Character. Converts the integer to the corresponding + unicode character before printing. + 'd' - Decimal Integer. Outputs the number in base 10. + 'o' - Octal format. Outputs the number in base 8. + 'x' - Hex format. Outputs the number in base 16, using lower- + case letters for the digits above 9. + 'X' - Hex format. Outputs the number in base 16, using upper- + case letters for the digits above 9. + + There are several floating point conversion types. All invoke + float() on the object before attempting to format it. + + The available floating point conversion types are: + + 'e' - Exponent notation. Prints the number in scientific + notation using the letter 'e' to indicate the exponent. + 'E' - Exponent notation. Same as 'e' except it uses an upper + case 'E' as the separator character. + 'f' - Fixed point. Displays the number as a fixed-point + number. + 'F' - Fixed point. Same as 'f'. + 'g' - General format. This prints the number as a fixed-point + number, unless the number is too large, in which case + it switches to 'e' exponent notation. + 'G' - General format. Same as 'g' except switches to 'E' + if the number gets too large. + 'n' - Number. This is the same as 'g', except that it uses the + current locale setting to insert the appropriate + number separator characters. + '%' - Percentage.
Multiplies the number by 100 and displays + in fixed ('f') format, followed by a percent sign. + + Objects are able to define their own conversion specifiers to + replace the standard ones. An example is the 'datetime' class, + whose conversion specifiers might look something like the + arguments to the strftime() function: + + "Today is: {0:a b d H:M:S Y}".format(datetime.now()) Controlling Formatting @@ -224,19 +350,22 @@ User-Defined Formatting Classes API for such an application-specific formatter is up to the application; here are several possible examples: - cell_format( "The total is: {0}", total ) + cell_format("The total is: {0}", total) - TemplateString( "The total is: {0}" ).format( total ) + TemplateString("The total is: {0}").format(total) Creating an application-specific formatter is relatively straight- forward. The string and unicode classes will have a class method called 'cformat' that does all the actual work of formatting; the built-in format() method is just a wrapper that calls cformat. + + The type signature for the cformat function is as follows: + + cformat(template, format_hook, args, kwargs) The parameters to the cformat function are: - -- The format string (or unicode; the same function handles - both.) + -- The format template string. -- A callable 'format hook', which is called once per field -- A tuple containing the positional arguments -- A dict containing the keyword arguments @@ -251,7 +380,7 @@ User-Defined Formatting Classes will attempt to call the field format hook with the following arguments: - format_hook(value, conversion, buffer) + format_hook(value, conversion) The 'value' field corresponds to the value being formatted, which was retrieved from the arguments using the field name. @@ -260,20 +389,49 @@ User-Defined Formatting Classes field, which will be either a string or unicode object, depending on the type of the original format string.
- The 'buffer' argument is a Python array object, either a byte - array or unicode character array. The buffer object will contain - the partially constructed string; the field hook is free to modify - the contents of this buffer if needed. - The field_hook will be called once per field. The field_hook may take one of two actions: + + 1) Return a string or unicode object that is the result + of the formatting operation. - 1) Return False, indicating that the field_hook will not + 2) Return None, indicating that the field_hook will not process this field and the default formatting should be used. This decision should be based on the type of the value object, and the contents of the conversion string. - 2) Append the formatted field to the buffer, and return True. + +Error handling + + The string formatting system has two error handling modes, which + are controlled by the value of a class variable: + + string.strict_format_errors = True + + The 'strict_format_errors' flag defaults to False, or 'lenient' + mode. Setting it to True enables 'strict' mode. The current mode + determines how errors are handled, depending on the type of the + error. + + The types of errors that can occur are: + + 1) Reference to a missing or invalid argument from within a + field specifier. In strict mode, this will raise an exception. + In lenient mode, this will cause the value of the field to be + replaced with the string '?name?', where 'name' will be the + type of error (KeyError, IndexError, or AttributeError). + + So for example: + + >>> string.strict_format_errors = False + >>> print 'Item 2 of argument 0 is: {0[2]}'.format( [0,1] ) + "Item 2 of argument 0 is: ?IndexError?" + + 2) Unused argument. In strict mode, this will raise an exception. + In lenient mode, this will be ignored. + + 3) Exception raised by underlying formatter. These exceptions + are always passed through, regardless of the current mode. 
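The lenient-mode substitution described in case 1 can be approximated with a minimal sketch. It assumes the string.Formatter class from the standard library of later Python releases rather than the 'strict_format_errors' flag proposed in this patch, and LenientFormatter is a hypothetical name introduced only for illustration.

```python
import string


class LenientFormatter(string.Formatter):
    """Hypothetical helper that emulates the 'lenient' mode of case 1."""

    def get_field(self, field_name, args, kwargs):
        try:
            return super().get_field(field_name, args, kwargs)
        except (KeyError, IndexError, AttributeError) as exc:
            # Substitute '?ErrorName?' for the value that could not be found.
            return "?%s?" % type(exc).__name__, field_name


result = LenientFormatter().format("Item 2 of argument 0 is: {0[2]}", [0, 1])
print(result)
```

Exceptions raised while formatting a value that was found are not caught here, which matches case 3: errors from the underlying formatter pass through in either mode.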
Alternate Syntax @@ -325,22 +483,14 @@ Alternate Syntax Some specific aspects of the syntax warrant additional comments: - 1) The use of the backslash character for escapes. A few people - suggested doubling the brace characters to indicate a literal - brace rather than using backslash as an escape character. This is - also the convention used in the .Net libraries. Here's how the - previously-given example would look with this convention: - - "My name is {0} :-{{}}".format('Fred') - - One problem with this syntax is that it conflicts with the use of - nested braces to allow parameterization of the conversion - specifiers: - - "{0:{1}.{2}}".format(a, b, c) - - (There are alternative solutions, but they are too long to go - into here.) + 1) Backslash character for escapes. The original version of + this PEP used backslash rather than doubling to escape a bracket. + This worked because backslashes in Python string literals that + don't conform to a standard backslash sequence such as '\n' + are left unmodified. However, this caused a certain amount + of confusion, and led to potential situations of multiple + recursive escapes, i.e. '\\\\{' to place a literal backslash + in front of a bracket. 2) The use of the colon character (':') as a separator for conversion specifiers. This was chosen simply because that's