PEP: 3127
Title: Integer Literal Support and Syntax
Version: $Revision$
Last-Modified: $Date$
Author: Patrick Maupin
Discussions-To: Python-3000@python.org
Status: Final
Type: Standards Track
Content-Type: text/x-rst
Created: 14-Mar-2007
Python-Version: 3.0
Post-History: 18-Mar-2007


Abstract
========

This PEP proposes changes to the Python core to rationalize the
treatment of string literal representations of integers in different
radices (bases). These changes are targeted at Python 3.0, but the
backward-compatible parts of the changes should be added to Python
2.6, so that all valid 3.0 integer literals will also be valid in 2.6.

The proposal is that:

a) octal literals must now be specified
   with a leading "0o" or "0O" instead of "0";

b) binary literals are now supported via a
   leading "0b" or "0B"; and

c) provision will be made for binary numbers in
   string formatting.


Motivation
==========

This PEP was motivated by two different issues:

- The default octal representation of integers is silently confusing
  to people unfamiliar with C-like languages. It is extremely easy
  to inadvertently create an integer object with the wrong value,
  because '013' means 'decimal 11', not 'decimal 13', to the Python
  language itself, which is not the meaning that most humans would
  assign to this literal.

- Some Python users have a strong desire for binary support in
  the language.


Specification
=============

Grammar specification
---------------------

The grammar will be changed. For Python 2.6, the changed and
new token definitions will be::

    integer        ::=     decimalinteger | octinteger | hexinteger |
                           bininteger | oldoctinteger
    octinteger     ::=     "0" ("o" | "O") octdigit+
    bininteger     ::=     "0" ("b" | "B") bindigit+
    oldoctinteger  ::=     "0" octdigit+
    bindigit       ::=     "0" | "1"

For Python 3.0, "oldoctinteger" will not be supported, and
an exception will be raised if a literal has a leading "0" and
a second character which is a digit.
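
As a rough illustration only, the changed and new token definitions
can be approximated with regular expressions. The names ``OCT``,
``BIN``, ``OLD_OCT`` and ``classify`` are hypothetical, and this sketch
is not how the CPython tokenizer is implemented.

```python
import re

# Hypothetical regexes mirroring the token definitions listed above.
OCT = re.compile(r"0[oO][0-7]+")   # octinteger
BIN = re.compile(r"0[bB][01]+")    # bininteger
OLD_OCT = re.compile(r"0[0-7]+")   # oldoctinteger (Python 2.6 only)


def classify(literal, allow_old_octal=False):
    """Return the token kind for ``literal``, or None if nothing matches.

    ``allow_old_octal=True`` models the 2.6 grammar; the default models
    3.0, where "oldoctinteger" is not supported.
    """
    if OCT.fullmatch(literal):
        return "octinteger"
    if BIN.fullmatch(literal):
        return "bininteger"
    if allow_old_octal and OLD_OCT.fullmatch(literal):
        return "oldoctinteger"
    return None
```

With the default arguments, ``classify("017")`` returns ``None``, while
``classify("017", allow_old_octal=True)`` reports an old-style octal
token.
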
For both versions, this will require changes to PyLong_FromString
as well as the grammar.

The documentation will have to be changed as well: grammar.txt,
as well as the integer literal section of the reference manual.

PEP 306 should be checked for other issues, and that PEP should
be updated if the procedure described therein is insufficient.

int() specification
--------------------

int(s, 0) will also match the new grammar definition.

This should happen automatically with the changes to
PyLong_FromString required for the grammar change.

Also the documentation for int() should be changed to explain
that int(s) operates identically to int(s, 10), and the word
"guess" should be removed from the description of int(s, 0).

long() specification
--------------------

For Python 2.6, the long() implementation and documentation
should be changed to reflect the new grammar.

Tokenizer exception handling
----------------------------

If an invalid token contains a leading "0", the exception
error message should be more informative than the current
"SyntaxError: invalid token". It should explain that decimal
numbers may not have a leading zero, and that octal numbers
require an "o" after the leading zero.

int() exception handling
------------------------

The ValueError raised for any call to int() with a string
should at least explicitly contain the base in the error
message, e.g.::

    ValueError: invalid literal for base 8 int(): 09

oct() function
---------------

oct() should be updated to output '0o' in front of
the octal digits (for 3.0, and 2.6 compatibility mode).

Output formatting
-----------------

The string (and unicode in 2.6) % operator will have
a 'b' format specifier added for binary in both 2.6 and 3.0.
In 3.0, the alternate syntax of the 'o' option will need to
be updated to add '0o' in front, instead of '0'. In 2.6,
alternate octal formatting will continue to add only '0'.
PEP 3101 already supports 'b' for binary output.

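
The octal prefix and binary output described in the last two sections
can be checked in a Python 3 interpreter. The sketch below covers the
3.x side only, and uses just the built-in ``oct()``, the alternate form
of the ``%`` operator's 'o' option, and ``format()`` (the PEP 3101
route to binary output). The variable names are illustrative.

```python
# Python 3 behaviour only; 2.6 keeps the plain '0' prefix by default.
value = 8

octal_repr = oct(value)                # '0o10'
alt_octal = "%#o" % value              # '0o10'
binary_digits = format(value, "b")     # '1000'
binary_prefixed = format(value, "#b")  # '0b1000'
```
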
Transition from 2.6 to 3.0
---------------------------

The 2to3 translator will have to insert 'o' into any
octal string literal.

The Py3K compatible option to Python 2.6 should cause
attempts to use oldoctinteger literals to raise an
exception.


Rationale
=========

Most of the discussion on these issues occurred on the Python-3000
mailing list starting 14-Mar-2007, prompted by an observation that
the average human being would be completely mystified upon finding
that prepending a "0" to a string of digits changes the meaning of
that digit string entirely.

It was pointed out during this discussion that a similar, but shorter,
discussion on the subject occurred in January of 2006, prompted by a
discovery of the same issue.

Background
----------

For historical reasons, Python's string representation of integers
in different bases (radices), for string formatting and token
literals, borrows heavily from C. [1]_ [2]_ Usage has shown that
the historical method of specifying an octal number is confusing,
and also that it would be nice to have additional support for binary
literals.

Throughout this document, unless otherwise noted, discussions about
the string representation of integers relate to these features:

- Literal integer tokens, as used by normal module compilation,
  by eval(), and by int(token, 0). (int(token) and int(token, 2-36)
  are not modified by this proposal.)

  * Under 2.6, long() is treated the same as int()

- Formatting of integers into strings, either via the % string
  operator or the new PEP 3101 advanced string formatting method.

It is presumed that:

- All of these features should have an identical set
  of supported radices, for consistency.

- Python source code syntax and int(mystring, 0) should
  continue to share identical behavior.

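
The relationship between literal syntax and ``int(token, 0)`` can be
illustrated with a short sketch, assuming a Python 3 interpreter. The
helper names ``parse_like_source`` and ``rejects`` are hypothetical and
exist only for this example.

```python
def parse_like_source(text):
    """Parse ``text`` under the same rules as an integer literal (base 0)."""
    return int(text, 0)


def rejects(text):
    """Return True if base-0 parsing refuses ``text``."""
    try:
        parse_like_source(text)
    except ValueError:
        return True
    return False


# Prefixed literals are accepted; an old-style octal string is refused.
octal_value = parse_like_source("0o17")
binary_value = parse_like_source("0b101")
hex_value = parse_like_source("0x1f")
old_octal_refused = rejects("017")

# int(token) and int(token, 8) are not modified by the proposal.
plain_decimal = int("017")
explicit_octal = int("017", 8)
```
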
Removal of old octal syntax
----------------------------

This PEP proposes that the ability to specify an octal number by
using a leading zero will be removed from the language in Python 3.0
(and the Python 3.0 preview mode of 2.6), and that a SyntaxError will
be raised whenever a leading "0" is immediately followed by another
digit.

During the present discussion, it was almost universally agreed that::

    eval('010') == 8

should no longer be true, because that is confusing to new users.
It was also proposed that::

    eval('0010') == 10

should become true, but that is much more contentious, because it is so
inconsistent with usage in other computer languages that mistakes are
likely to be made.

Almost all currently popular computer languages, including C/C++,
Java, Perl, and JavaScript, treat a sequence of digits with a
leading zero as an octal number. Proponents of treating these
numbers as decimal instead have a very valid point -- as discussed
in `Supported radices`_, below, the entire non-computer world uses
decimal numbers almost exclusively. There is ample anecdotal
evidence that many people are dismayed and confused if they
are confronted with non-decimal radices.

However, in most situations, most people do not write gratuitous
zeros in front of their decimal numbers. The primary exception is
when an attempt is being made to line up columns of numbers. But
since PEP 8 specifically discourages the use of spaces to try to
align Python code, one would suspect the same argument should apply
to the use of leading zeros for the same purpose.

Finally, although the email discussion often focused on whether anybody
actually *uses* octal any more, and whether we should cater to those
old-timers in any case, that is almost entirely beside the point.

Assume the rare complete newcomer to computing who *does*, either
occasionally or as a matter of habit, use leading zeros for decimal
numbers.
Python could either:

a) silently do the wrong thing with his numbers, as it does now;

b) immediately disabuse him of the notion that this is viable syntax
   (and yes, the SyntaxError message should be more gentle than it
   currently is, but that is a subject for a different PEP); or

c) let him continue to think that computers are happy with
   multi-digit decimal integers which start with "0".

Some people passionately believe that (c) is the correct answer,
and they would be absolutely right if we could be sure that new
users will never blossom and grow and start writing AJAX applications.

So while a new Python user may (currently) be mystified at the
delayed discovery that his numbers don't work properly, we can
fix it by explaining to him immediately that Python doesn't like
leading zeros (hopefully with a reasonable message!), or we can
delegate this teaching experience to the JavaScript interpreter
in the Internet Explorer browser, and let him try to debug his
issue there.

Supported radices
-----------------

This PEP proposes that the supported radices for the Python
language will be 2, 8, 10, and 16.

Once it is agreed that the old syntax for octal (radix 8) representation
of integers must be removed from the language, the next obvious
question is "Do we actually need a way to specify (and display)
numbers in octal?"

This question is quickly followed by "What radices does the language
need to support?" Because computers are so adept at doing what you
tell them to, a tempting answer in the discussion was "all of them."
This answer has obviously been given before -- the int() constructor
will accept an explicit radix with a value between 2 and 36, inclusive,
with the latter number bearing a suspicious arithmetic similarity to
the sum of the number of numeric digits and the number of same-case
letters in the ASCII alphabet.
But the best argument for inclusion will have a use-case to back
it up, so the idea of supporting all radices was quickly rejected,
and the only radices left with any real support were decimal,
hexadecimal, octal, and binary.

Just because a particular radix has a vocal supporter on the
mailing list does not mean that it really should be in the
language, so the rest of this section is a treatise on the
utility of these particular radices, vs. other possible choices.

Humans use other numeric bases constantly. If I tell you that
it is 12:30 PM, I have communicated quantitative information
arguably composed of *three* separate bases (12, 60, and 2),
only one of which is in the "agreed" list above. But the
*communication* of that information used two decimal digits
each for the base 12 and base 60 information, and, perversely,
two letters for information which could have fit in a single
decimal digit.

So, in general, humans communicate "normal" (non-computer)
numerical information either via names (AM, PM, January, ...)
or via use of decimal notation. Obviously, names are
seldom used for large sets of items, so decimal is used for
everything else. There are studies which attempt to explain
why this is so, typically reaching the expected conclusion
that the Arabic numeral system is well-suited to human
cognition. [3]_

There is even support in the history of the design of
computers to indicate that decimal notation is the correct
way for computers to communicate with humans. One of
the first modern computers, ENIAC [4]_ computed in decimal,
even though there were already existing computers which
operated in binary.

Decimal computer operation was important enough
that many computers, including the ubiquitous PC, have
instructions designed to operate on "binary coded decimal"
(BCD) [5]_, a representation which devotes 4 bits to each
decimal digit.
These instructions date from a time when the
most strenuous calculations ever performed on many numbers
were the calculations actually required to perform textual
I/O with them. It is possible to display BCD without having
to perform a divide/remainder operation on every displayed
digit, and this was a huge computational win when most
hardware didn't have fast divide capability. Another factor
contributing to the use of BCD is that, with BCD calculations,
rounding will happen exactly the same way that a human would
do it, so BCD is still sometimes used in fields like finance,
despite the computational and storage superiority of binary.

So, if it weren't for the fact that computers themselves
normally use binary for efficient computation and data
storage, string representations of integers would probably
always be in decimal.

Unfortunately, computer hardware doesn't think like humans,
so programmers and hardware engineers must often resort to
thinking like the computer, which means that it is important
for Python to have the ability to communicate binary data
in a form that is understandable to humans.

The requirement that the binary data notation must be cognitively
easy for humans to process means that it should contain an integral
number of binary digits (bits) per symbol, while otherwise
conforming quite closely to the standard tried-and-true decimal
notation (position indicates power, larger magnitude on the left,
not too many symbols in the alphabet, etc.).

The obvious "sweet spot" for this binary data notation is
thus octal, which packs the largest integral number of bits
possible into a single symbol chosen from the Arabic numeral
alphabet.

In fact, some computer architectures, such as the PDP8 and the
8080/Z80, were defined in terms of octal, in the sense of arranging
the bitfields of instructions in groups of three, and using
octal representations to describe the instruction set.
Even today, octal is important because of bit-packed structures
which consist of 3 bits per field, such as Unix file permission
masks.

But octal has a drawback when used for larger numbers. The
number of bits per symbol, while integral, is not itself
a power of two. This limitation (given that the word size
of most computers these days is a power of two) has resulted
in hexadecimal, which is more popular than octal despite the
fact that it requires a 60% larger alphabet than decimal,
because each symbol contains 4 bits.

Some numbers, such as Unix file permission masks, are easily
decoded by humans when represented in octal, but difficult to
decode in hexadecimal, while other numbers are much easier for
humans to handle in hexadecimal.

Unfortunately, there are also binary numbers used in computers
which are not very well communicated in either hexadecimal or
octal. Thankfully, fewer people have to deal with these on a
regular basis, but on the other hand, this means that several
people on the discussion list questioned the wisdom of adding
a straight binary representation to Python.

One example of where these numbers are very useful is in
reading and writing hardware registers. Sometimes hardware
designers will eschew human readability and opt for address
space efficiency, by packing multiple bit fields into a single
hardware register at unaligned bit locations, and it is tedious
and error-prone for a human to reconstruct a 5 bit field which
consists of the upper 3 bits of one hex digit, and the lower 2
bits of the next hex digit.

Even if the ability of Python to communicate binary information
to humans is only useful for a small technical subset of the
population, it is exactly that population subset which contains
most, if not all, members of the Python core team, so even straight
binary, the least useful of these notations, has several enthusiastic
supporters and few, if any, staunch opponents, among the Python community.

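
The two use cases named in this section can be sketched in Python 3
syntax. The permission value and the register layout below are made
up for illustration; in particular, the 5-bit field at bits 1 through 5
is a hypothetical layout chosen to straddle two hex digits.

```python
# Unix-style permission mask: one octal digit per 3-bit field.
mode = 0o754                 # rwx r-x r--
owner = (mode >> 6) & 0o7    # 7
group = (mode >> 3) & 0o7    # 5
other = mode & 0o7           # 4

# Hypothetical 8-bit hardware register (0x5B in hexadecimal).
# The field occupies bits 1..5: the upper 3 bits of the low hex
# digit and the lower 2 bits of the high hex digit.
register = 0b01011011
FIELD_MASK = 0b00111110
FIELD_SHIFT = 1
field = (register & FIELD_MASK) >> FIELD_SHIFT   # 0b01101

# Fixed-width binary rendering, for reading the bits back.
field_bits = format(field, "05b")
```

Written as ``0x5B`` and ``0x3E``, the same register and mask give no
visual hint of where the field starts or ends; the binary literals make
the alignment visible at a glance.
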
Syntax for supported radices
-----------------------------

This proposal is to use a "0o" prefix with either uppercase
or lowercase "o" for octal, and a "0b" prefix with either
uppercase or lowercase "b" for binary.

There was strong support for not supporting uppercase, but
this is a separate subject for a different PEP, as 'j' for
complex numbers, 'e' for exponent, and 'r' for raw string
(to name a few) already support uppercase.

The syntax for delimiting the different radices received a lot of
attention in the discussion on Python-3000. There are several
(sometimes conflicting) requirements and "nice-to-haves" for
this syntax:

- It should be as compatible with other languages and
  previous versions of Python as is reasonable, both
  for the input syntax and for the output (e.g. string
  % operator) syntax.

- It should be as obvious to the casual observer as
  possible.

- It should be easy to visually distinguish integers
  formatted in the different bases.

Proposed syntaxes included things like arbitrary radix prefixes,
such as 16r100 (256 in hexadecimal), and radix suffixes, similar
to the 100h assembler-style suffix. The debate on whether the
letter "O" could be used for octal was intense -- an uppercase
"O" looks suspiciously similar to a zero in some fonts. Suggestions
were made to use a "c" (the second letter of "oCtal"), or even
to use a "t" for "ocTal" and an "n" for "biNary" to go along
with the "x" for "heXadecimal".

For the string % operator, "o" was already being used to denote
octal, and "b" was not used for anything, so this works out
much better than, for example, using "c" (which means "character"
for the % operator).

At the end of the day, since uppercase "O" can look like a zero
and uppercase "B" can look like an 8, it was decided that these
prefixes should be lowercase only, but, like 'r' for raw string,
that can be a preference or style-guide issue.

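
A short Python 3 check of the prefix handling discussed in this
section: both cases of each prefix parse to the same value, while the
built-in conversion functions emit lowercase prefixes. The variable
names are illustrative only.

```python
# Both prefix cases parse to the same value under base 0.
octal_lower = int("0o17", 0)
octal_upper = int("0O17", 0)
binary_lower = int("0b101", 0)
binary_upper = int("0B101", 0)

# The built-in conversion functions always use lowercase prefixes.
rendered = (oct(15), bin(5), hex(255))
```
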
Open Issues
===========

It was suggested in the discussion that lowercase should be used
for all numeric and string special modifiers, such as 'x' for
hexadecimal, 'r' for raw strings, 'e' for exponentiation, and
'j' for complex numbers. This is an issue for a separate PEP.

This PEP takes no position on uppercase or lowercase for input,
just noting that, for consistency, if uppercase is not to be
removed from input parsing for other letters, it should be
added for octal and binary, and documenting the changes under
this assumption, as there is not yet a PEP about the case issue.

Output formatting may be a different story -- there is already
ample precedent for case sensitivity in the output format string,
and there would need to be a consensus that there is a valid
use-case for the "alternate form" of the string % operator
to support uppercase 'B' or 'O' characters for binary or
octal output. Currently, PEP 3101 does not even support this
alternate capability, and the hex() function does not allow
the programmer to specify the case of the 'x' character.

There are still some strong feelings that '0123' should be
allowed as a literal decimal in Python 3.0. If this is the
right thing to do, this can easily be covered in an additional
PEP. This proposal only takes the first step of making '0123'
not be a valid octal number, for reasons covered in the rationale.

Is there (or should there be) an option for the 2to3 translator
which only makes the 2.6 compatible changes? Should this be
run on 2.6 library code before the 2.6 release?

Should a bin() function which matches hex() and oct() be added?

Is hex() really that useful once we have advanced string formatting?


References
==========

.. [1] GNU libc manual printf integer format conversions
   (http://www.gnu.org/software/libc/manual/html_node/Integer-Conversions.html)

.. [2] Python string formatting operations
   (http://docs.python.org/lib/typesseq-strings.html)

.. [3] The Representation of Numbers, Jiajie Zhang and Donald A. Norman
   (http://acad88.sahs.uth.tmc.edu/research/publications/Number-Representation.pdf)

.. [4] ENIAC page at Wikipedia
   (http://en.wikipedia.org/wiki/ENIAC)

.. [5] BCD page at Wikipedia
   (http://en.wikipedia.org/wiki/Binary-coded_decimal)


Copyright
=========

This document has been placed in the public domain.


..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End: