520 lines
20 KiB
Plaintext
520 lines
20 KiB
Plaintext
PEP: 3127
|
||
Title: Integer Literal Support and Syntax
|
||
Version: $Revision$
|
||
Last-Modified: $Date$
|
||
Author: Patrick Maupin <pmaupin@gmail.com>
|
||
Discussions-To: Python-3000@python.org
|
||
Status: Final
|
||
Type: Standards Track
|
||
Content-Type: text/x-rst
|
||
Created: 14-Mar-2007
|
||
Python-Version: 3.0
|
||
Post-History: 18-Mar-2007
|
||
|
||
|
||
Abstract
|
||
========
|
||
|
||
This PEP proposes changes to the Python core to rationalize
|
||
the treatment of string literal representations of integers
|
||
in different radices (bases). These changes are targeted at
|
||
Python 3.0, but the backward-compatible parts of the changes
|
||
should be added to Python 2.6, so that all valid 3.0 integer
|
||
literals will also be valid in 2.6.
|
||
|
||
The proposal is that:
|
||
|
||
a) octal literals must now be specified
|
||
with a leading "0o" or "0O" instead of "0";
|
||
|
||
b) binary literals are now supported via a
|
||
leading "0b" or "0B"; and
|
||
|
||
c) provision will be made for binary numbers in
|
||
string formatting.
|
||
|
||
|
||
Motivation
|
||
==========
|
||
|
||
This PEP was motivated by two different issues:
|
||
|
||
- The default octal representation of integers is silently confusing
|
||
to people unfamiliar with C-like languages. It is extremely easy
|
||
to inadvertently create an integer object with the wrong value,
|
||
because '013' means 'decimal 11', not 'decimal 13', to the Python
|
||
language itself, which is not the meaning that most humans would
|
||
assign to this literal.
|
||
|
||
- Some Python users have a strong desire for binary support in
|
||
the language.
|
||
|
||
|
||
Specification
|
||
=============
|
||
|
||
Grammar specification
|
||
---------------------
|
||
|
||
The grammar will be changed. For Python 2.6, the changed and
|
||
new token definitions will be::
|
||
|
||
integer ::= decimalinteger | octinteger | hexinteger |
|
||
bininteger | oldoctinteger
|
||
|
||
octinteger ::= "0" ("o" | "O") octdigit+
|
||
|
||
bininteger ::= "0" ("b" | "B") bindigit+
|
||
|
||
oldoctinteger ::= "0" octdigit+
|
||
|
||
bindigit ::= "0" | "1"
|
||
|
||
For Python 3.0, "oldoctinteger" will not be supported, and
|
||
an exception will be raised if a literal has a leading "0" and
|
||
a second character which is a digit.
|
||
|
||
For both versions, this will require changes to PyLong_FromString
|
||
as well as the grammar.
|
||
|
||
The documentation will have to be changed as well: grammar.txt,
|
||
as well as the integer literal section of the reference manual.
|
||
|
||
PEP 306 should be checked for other issues, and that PEP should
|
||
be updated if the procedure described therein is insufficient.
|
||
|
||
int() specification
|
||
--------------------
|
||
|
||
int(s, 0) will also match the new grammar definition.
|
||
|
||
This should happen automatically with the changes to
|
||
PyLong_FromString required for the grammar change.
|
||
|
||
Also the documentation for int() should be changed to explain
|
||
that int(s) operates identically to int(s, 10), and the word
|
||
"guess" should be removed from the description of int(s, 0).
|
||
|
||
long() specification
|
||
--------------------
|
||
|
||
For Python 2.6, the long() implementation and documentation
|
||
should be changed to reflect the new grammar.
|
||
|
||
Tokenizer exception handling
|
||
----------------------------
|
||
|
||
If an invalid token contains a leading "0", the exception
|
||
error message should be more informative than the current
|
||
"SyntaxError: invalid token". It should explain that decimal
|
||
numbers may not have a leading zero, and that octal numbers
|
||
require an "o" after the leading zero.
|
||
|
||
int() exception handling
|
||
------------------------
|
||
|
||
The ValueError raised for any call to int() with a string
|
||
should at least explicitly contain the base in the error
|
||
message, e.g.::
|
||
|
||
ValueError: invalid literal for base 8 int(): 09
|
||
|
||
oct() function
|
||
---------------
|
||
|
||
oct() should be updated to output '0o' in front of
|
||
the octal digits (for 3.0, and 2.6 compatibility mode).
|
||
|
||
Output formatting
|
||
-----------------
|
||
|
||
The string (and unicode in 2.6) % operator will have
|
||
'b' format specifier added for binary in both 2.6 and 3.0.
|
||
In 3.0, the alternate syntax of the 'o' option will need to
|
||
be updated to add '0o' in front, instead of '0'. In 2.6,
|
||
alternate octal formatting will continue to add only '0'.
|
||
|
||
PEP 3101 already supports 'b' for binary output.
|
||
|
||
|
||
Transition from 2.6 to 3.0
|
||
---------------------------
|
||
|
||
The 2to3 translator will have to insert 'o' into any
|
||
octal string literal.
|
||
|
||
The Py3K compatible option to Python 2.6 should cause
|
||
attempts to use oldoctinteger literals to raise an
|
||
exception.
|
||
|
||
|
||
Rationale
|
||
=========
|
||
|
||
Most of the discussion on these issues occurred on the Python-3000
|
||
mailing list starting 14-Mar-2007, prompted by an observation that
|
||
the average human being would be completely mystified upon finding
|
||
that prepending a "0" to a string of digits changes the meaning of
|
||
that digit string entirely.
|
||
|
||
It was pointed out during this discussion that a similar, but shorter,
|
||
discussion on the subject occurred in January of 2006, prompted by a
|
||
discovery of the same issue.
|
||
|
||
Background
|
||
----------
|
||
|
||
For historical reasons, Python's string representation of integers
|
||
in different bases (radices), for string formatting and token
|
||
literals, borrows heavily from C. [1]_ [2]_ Usage has shown that
|
||
the historical method of specifying an octal number is confusing,
|
||
and also that it would be nice to have additional support for binary
|
||
literals.
|
||
|
||
Throughout this document, unless otherwise noted, discussions about
|
||
the string representation of integers relate to these features:
|
||
|
||
- Literal integer tokens, as used by normal module compilation,
|
||
by eval(), and by int(token, 0). (int(token) and int(token, 2-36)
|
||
are not modified by this proposal.)
|
||
|
||
* Under 2.6, long() is treated the same as int()
|
||
|
||
- Formatting of integers into strings, either via the % string
|
||
operator or the new PEP 3101 advanced string formatting method.
|
||
|
||
It is presumed that:
|
||
|
||
- All of these features should have an identical set
|
||
of supported radices, for consistency.
|
||
|
||
- Python source code syntax and int(mystring, 0) should
|
||
continue to share identical behavior.
|
||
|
||
|
||
Removal of old octal syntax
|
||
----------------------------
|
||
|
||
This PEP proposes that the ability to specify an octal number by
|
||
using a leading zero will be removed from the language in Python 3.0
|
||
(and the Python 3.0 preview mode of 2.6), and that a SyntaxError will
|
||
be raised whenever a leading "0" is immediately followed by another
|
||
digit.
|
||
|
||
During the present discussion, it was almost universally agreed that::
|
||
|
||
eval('010') == 8
|
||
|
||
should no longer be true, because that is confusing to new users.
|
||
It was also proposed that::
|
||
|
||
eval('0010') == 10
|
||
|
||
should become true, but that is much more contentious, because it is so
|
||
inconsistent with usage in other computer languages that mistakes are
|
||
likely to be made.
|
||
|
||
Almost all currently popular computer languages, including C/C++,
|
||
Java, Perl, and JavaScript, treat a sequence of digits with a
|
||
leading zero as an octal number. Proponents of treating these
|
||
numbers as decimal instead have a very valid point -- as discussed
|
||
in `Supported radices`_, below, the entire non-computer world uses
|
||
decimal numbers almost exclusively. There is ample anecdotal
|
||
evidence that many people are dismayed and confused if they
|
||
are confronted with non-decimal radices.
|
||
|
||
However, in most situations, most people do not write gratuitous
|
||
zeros in front of their decimal numbers. The primary exception is
|
||
when an attempt is being made to line up columns of numbers. But
|
||
since PEP 8 specifically discourages the use of spaces to try to
|
||
align Python code, one would suspect the same argument should apply
|
||
to the use of leading zeros for the same purpose.
|
||
|
||
Finally, although the email discussion often focused on whether anybody
|
||
actually *uses* octal any more, and whether we should cater to those
|
||
old-timers in any case, that is almost entirely besides the point.
|
||
|
||
Assume the rare complete newcomer to computing who *does*, either
|
||
occasionally or as a matter of habit, use leading zeros for decimal
|
||
numbers. Python could either:
|
||
|
||
a) silently do the wrong thing with his numbers, as it does now;
|
||
|
||
b) immediately disabuse him of the notion that this is viable syntax
|
||
(and yes, the SyntaxWarning should be more gentle than it
|
||
currently is, but that is a subject for a different PEP); or
|
||
|
||
c) let him continue to think that computers are happy with
|
||
multi-digit decimal integers which start with "0".
|
||
|
||
Some people passionately believe that (c) is the correct answer,
|
||
and they would be absolutely right if we could be sure that new
|
||
users will never blossom and grow and start writing AJAX applications.
|
||
|
||
So while a new Python user may (currently) be mystified at the
|
||
delayed discovery that his numbers don't work properly, we can
|
||
fix it by explaining to him immediately that Python doesn't like
|
||
leading zeros (hopefully with a reasonable message!), or we can
|
||
delegate this teaching experience to the JavaScript interpreter
|
||
in the Internet Explorer browser, and let him try to debug his
|
||
issue there.
|
||
|
||
Supported radices
|
||
-----------------
|
||
|
||
This PEP proposes that the supported radices for the Python
|
||
language will be 2, 8, 10, and 16.
|
||
|
||
Once it is agreed that the old syntax for octal (radix 8) representation
|
||
of integers must be removed from the language, the next obvious
|
||
question is "Do we actually need a way to specify (and display)
|
||
numbers in octal?"
|
||
|
||
This question is quickly followed by "What radices does the language
|
||
need to support?" Because computers are so adept at doing what you
|
||
tell them to, a tempting answer in the discussion was "all of them."
|
||
This answer has obviously been given before -- the int() constructor
|
||
will accept an explicit radix with a value between 2 and 36, inclusive,
|
||
with the latter number bearing a suspicious arithmetic similarity to
|
||
the sum of the number of numeric digits and the number of same-case
|
||
letters in the ASCII alphabet.
|
||
|
||
But the best argument for inclusion will have a use-case to back
|
||
it up, so the idea of supporting all radices was quickly rejected,
|
||
and the only radices left with any real support were decimal,
|
||
hexadecimal, octal, and binary.
|
||
|
||
Just because a particular radix has a vocal supporter on the
|
||
mailing list does not mean that it really should be in the
|
||
language, so the rest of this section is a treatise on the
|
||
utility of these particular radices, vs. other possible choices.
|
||
|
||
Humans use other numeric bases constantly. If I tell you that
|
||
it is 12:30 PM, I have communicated quantitative information
|
||
arguably composed of *three* separate bases (12, 60, and 2),
|
||
only one of which is in the "agreed" list above. But the
|
||
*communication* of that information used two decimal digits
|
||
each for the base 12 and base 60 information, and, perversely,
|
||
two letters for information which could have fit in a single
|
||
decimal digit.
|
||
|
||
So, in general, humans communicate "normal" (non-computer)
|
||
numerical information either via names (AM, PM, January, ...)
|
||
or via use of decimal notation. Obviously, names are
|
||
seldom used for large sets of items, so decimal is used for
|
||
everything else. There are studies which attempt to explain
|
||
why this is so, typically reaching the expected conclusion
|
||
that the Arabic numeral system is well-suited to human
|
||
cognition. [3]_
|
||
|
||
There is even support in the history of the design of
|
||
computers to indicate that decimal notation is the correct
|
||
way for computers to communicate with humans. One of
|
||
the first modern computers, ENIAC [4]_ computed in decimal,
|
||
even though there were already existing computers which
|
||
operated in binary.
|
||
|
||
Decimal computer operation was important enough
|
||
that many computers, including the ubiquitous PC, have
|
||
instructions designed to operate on "binary coded decimal"
|
||
(BCD) [5]_ , a representation which devotes 4 bits to each
|
||
decimal digit. These instructions date from a time when the
|
||
most strenuous calculations ever performed on many numbers
|
||
were the calculations actually required to perform textual
|
||
I/O with them. It is possible to display BCD without having
|
||
to perform a divide/remainder operation on every displayed
|
||
digit, and this was a huge computational win when most
|
||
hardware didn't have fast divide capability. Another factor
|
||
contributing to the use of BCD is that, with BCD calculations,
|
||
rounding will happen exactly the same way that a human would
|
||
do it, so BCD is still sometimes used in fields like finance,
|
||
despite the computational and storage superiority of binary.
|
||
|
||
So, if it weren't for the fact that computers themselves
|
||
normally use binary for efficient computation and data
|
||
storage, string representations of integers would probably
|
||
always be in decimal.
|
||
|
||
Unfortunately, computer hardware doesn't think like humans,
|
||
so programmers and hardware engineers must often resort to
|
||
thinking like the computer, which means that it is important
|
||
for Python to have the ability to communicate binary data
|
||
in a form that is understandable to humans.
|
||
|
||
The requirement that the binary data notation must be cognitively
|
||
easy for humans to process means that it should contain an integral
|
||
number of binary digits (bits) per symbol, while otherwise
|
||
conforming quite closely to the standard tried-and-true decimal
|
||
notation (position indicates power, larger magnitude on the left,
|
||
not too many symbols in the alphabet, etc.).
|
||
|
||
The obvious "sweet spot" for this binary data notation is
|
||
thus octal, which packs the largest integral number of bits
|
||
possible into a single symbol chosen from the Arabic numeral
|
||
alphabet.
|
||
|
||
In fact, some computer architectures, such as the PDP8 and the
|
||
8080/Z80, were defined in terms of octal, in the sense of arranging
|
||
the bitfields of instructions in groups of three, and using
|
||
octal representations to describe the instruction set.
|
||
|
||
Even today, octal is important because of bit-packed structures
|
||
which consist of 3 bits per field, such as Unix file permission
|
||
masks.
|
||
|
||
But octal has a drawback when used for larger numbers. The
|
||
number of bits per symbol, while integral, is not itself
|
||
a power of two. This limitation (given that the word size
|
||
of most computers these days is a power of two) has resulted
|
||
in hexadecimal, which is more popular than octal despite the
|
||
fact that it requires a 60% larger alphabet than decimal,
|
||
because each symbol contains 4 bits.
|
||
|
||
Some numbers, such as Unix file permission masks, are easily
|
||
decoded by humans when represented in octal, but difficult to
|
||
decode in hexadecimal, while other numbers are much easier for
|
||
humans to handle in hexadecimal.
|
||
|
||
Unfortunately, there are also binary numbers used in computers
|
||
which are not very well communicated in either hexadecimal or
|
||
octal. Thankfully, fewer people have to deal with these on a
|
||
regular basis, but on the other hand, this means that several
|
||
people on the discussion list questioned the wisdom of adding
|
||
a straight binary representation to Python.
|
||
|
||
One example of where these numbers is very useful is in
|
||
reading and writing hardware registers. Sometimes hardware
|
||
designers will eschew human readability and opt for address
|
||
space efficiency, by packing multiple bit fields into a single
|
||
hardware register at unaligned bit locations, and it is tedious
|
||
and error-prone for a human to reconstruct a 5 bit field which
|
||
consists of the upper 3 bits of one hex digit, and the lower 2
|
||
bits of the next hex digit.
|
||
|
||
Even if the ability of Python to communicate binary information
|
||
to humans is only useful for a small technical subset of the
|
||
population, it is exactly that population subset which contains
|
||
most, if not all, members of the Python core team, so even straight
|
||
binary, the least useful of these notations, has several enthusiastic
|
||
supporters and few, if any, staunch opponents, among the Python community.
|
||
|
||
Syntax for supported radices
|
||
-----------------------------
|
||
|
||
This proposal is to to use a "0o" prefix with either uppercase
|
||
or lowercase "o" for octal, and a "0b" prefix with either
|
||
uppercase or lowercase "b" for binary.
|
||
|
||
There was strong support for not supporting uppercase, but
|
||
this is a separate subject for a different PEP, as 'j' for
|
||
complex numbers, 'e' for exponent, and 'r' for raw string
|
||
(to name a few) already support uppercase.
|
||
|
||
The syntax for delimiting the different radices received a lot of
|
||
attention in the discussion on Python-3000. There are several
|
||
(sometimes conflicting) requirements and "nice-to-haves" for
|
||
this syntax:
|
||
|
||
- It should be as compatible with other languages and
|
||
previous versions of Python as is reasonable, both
|
||
for the input syntax and for the output (e.g. string
|
||
% operator) syntax.
|
||
|
||
- It should be as obvious to the casual observer as
|
||
possible.
|
||
|
||
- It should be easy to visually distinguish integers
|
||
formatted in the different bases.
|
||
|
||
|
||
Proposed syntaxes included things like arbitrary radix prefixes,
|
||
such as 16r100 (256 in hexadecimal), and radix suffixes, similar
|
||
to the 100h assembler-style suffix. The debate on whether the
|
||
letter "O" could be used for octal was intense -- an uppercase
|
||
"O" looks suspiciously similar to a zero in some fonts. Suggestions
|
||
were made to use a "c" (the second letter of "oCtal"), or even
|
||
to use a "t" for "ocTal" and an "n" for "biNary" to go along
|
||
with the "x" for "heXadecimal".
|
||
|
||
For the string % operator, "o" was already being used to denote
|
||
octal, and "b" was not used for anything, so this works out
|
||
much better than, for example, using "c" (which means "character"
|
||
for the % operator).
|
||
|
||
At the end of the day, since uppercase "O" can look like a zero
|
||
and uppercase "B" can look like an 8, it was decided that these
|
||
prefixes should be lowercase only, but, like 'r' for raw string,
|
||
that can be a preference or style-guide issue.
|
||
|
||
Open Issues
|
||
===========
|
||
|
||
It was suggested in the discussion that lowercase should be used
|
||
for all numeric and string special modifiers, such as 'x' for
|
||
hexadecimal, 'r' for raw strings, 'e' for exponentiation, and
|
||
'j' for complex numbers. This is an issue for a separate PEP.
|
||
|
||
This PEP takes no position on uppercase or lowercase for input,
|
||
just noting that, for consistency, if uppercase is not to be
|
||
removed from input parsing for other letters, it should be
|
||
added for octal and binary, and documenting the changes under
|
||
this assumption, as there is not yet a PEP about the case issue.
|
||
|
||
Output formatting may be a different story -- there is already
|
||
ample precedence for case sensitivity in the output format string,
|
||
and there would need to be a consensus that there is a valid
|
||
use-case for the "alternate form" of the string % operator
|
||
to support uppercase 'B' or 'O' characters for binary or
|
||
octal output. Currently, PEP3101 does not even support this
|
||
alternate capability, and the hex() function does not allow
|
||
the programmer to specify the case of the 'x' character.
|
||
|
||
There are still some strong feelings that '0123' should be
|
||
allowed as a literal decimal in Python 3.0. If this is the
|
||
right thing to do, this can easily be covered in an additional
|
||
PEP. This proposal only takes the first step of making '0123'
|
||
not be a valid octal number, for reasons covered in the rationale.
|
||
|
||
Is there (or should there be) an option for the 2to3 translator
|
||
which only makes the 2.6 compatible changes? Should this be
|
||
run on 2.6 library code before the 2.6 release?
|
||
|
||
Should a bin() function which matches hex() and oct() be added?
|
||
|
||
Is hex() really that useful once we have advanced string formatting?
|
||
|
||
|
||
References
|
||
==========
|
||
|
||
.. [1] GNU libc manual printf integer format conversions
|
||
(http://www.gnu.org/software/libc/manual/html_node/Integer-Conversions.html)
|
||
|
||
.. [2] Python string formatting operations
|
||
(http://docs.python.org/lib/typesseq-strings.html)
|
||
|
||
.. [3] The Representation of Numbers, Jiajie Zhang and Donald A. Norman
|
||
(http://acad88.sahs.uth.tmc.edu/research/publications/Number-Representation.pdf)
|
||
|
||
.. [4] ENIAC page at wikipedia
|
||
(http://en.wikipedia.org/wiki/ENIAC)
|
||
|
||
.. [5] BCD page at wikipedia
|
||
(http://en.wikipedia.org/wiki/Binary-coded_decimal)
|
||
|
||
Copyright
|
||
=========
|
||
|
||
This document has been placed in the public domain.
|
||
|
||
|
||
|
||
..
|
||
Local Variables:
|
||
mode: indented-text
|
||
indent-tabs-mode: nil
|
||
sentence-end-double-space: t
|
||
fill-column: 70
|
||
coding: utf-8
|
||
End:
|