PEP 3127: Integer Literal Support and Syntax Maupin
PEP 3128: BList: A Faster List-like Type Stutzbach
This commit is contained in:
parent
b08fe48dbf
commit
0129f4bde8
|
@ -126,6 +126,7 @@ Index by Category
|
|||
S 3125 Remove Backslash Continuation Jewett
|
||||
S 3126 Remove Implicit String Concatenation Jewett
|
||||
S 3127 Integer Literal Support and Syntax Maupin
|
||||
S 3128 BList: A Faster List-like Type Stutzbach
|
||||
S 3141 A Type Hierarchy for Numbers Yasskin
|
||||
|
||||
Finished PEPs (done, implemented in Subversion)
|
||||
|
@ -494,6 +495,7 @@ Numerical Index
|
|||
S 3125 Remove Backslash Continuation Jewett
|
||||
S 3126 Remove Implicit String Concatenation Jewett
|
||||
S 3127 Integer Literal Support and Syntax Maupin
|
||||
S 3128 BList: A Faster List-like Type Stutzbach
|
||||
S 3141 A Type Hierarchy for Numbers Yasskin
|
||||
|
||||
|
||||
|
|
|
@ -0,0 +1,518 @@
|
|||
PEP: 3127
|
||||
Title: Integer Literal Support and Syntax
|
||||
Version: $Revision$
|
||||
Last-Modified: $Date$
|
||||
Author: Patrick Maupin <pmaupin@gmail.com>
|
||||
Discussions-To: Python-3000@python.org
|
||||
Status: Draft
|
||||
Type: Standards Track
|
||||
Python-Version: 3.0
|
||||
Content-Type: text/x-rst
|
||||
Created: 14-Mar-2007
|
||||
Post-History: 18-Mar-2007
|
||||
|
||||
|
||||
Abstract
|
||||
========
|
||||
|
||||
This PEP proposes changes to the Python core to rationalize
|
||||
the treatment of string literal representations of integers
|
||||
in different radices (bases). These changes are targeted at
|
||||
Python 3.0, but the backward-compatible parts of the changes
|
||||
should be added to Python 2.6, so that all valid 3.0 integer
|
||||
literals will also be valid in 2.6.
|
||||
|
||||
The proposal is that:
|
||||
|
||||
a) octal literals must now be specified
|
||||
with a leading "0o" or "0O" instead of "0";
|
||||
|
||||
b) binary literals are now supported via a
|
||||
leading "0b" or "0B"; and
|
||||
|
||||
c) provision will be made for binary numbers in
|
||||
string formatting.
|
||||
|
||||
|
||||
Motivation
|
||||
==========
|
||||
|
||||
This PEP was motivated by two different issues:
|
||||
|
||||
- The default octal representation of integers is silently confusing
|
||||
to people unfamiliar with C-like languages. It is extremely easy
|
||||
to inadvertently create an integer object with the wrong value,
|
||||
because '013' means 'decimal 11', not 'decimal 13', to the Python
|
||||
language itself, which is not the meaning that most humans would
|
||||
assign to this literal.
|
||||
|
||||
- Some Python users have a strong desire for binary support in
|
||||
the language.
|
||||
|
||||
|
||||
Specification
|
||||
=============
|
||||
|
||||
Grammar specification
|
||||
---------------------
|
||||
|
||||
The grammar will be changed. For Python 2.6, the changed and
|
||||
new token definitions will be::
|
||||
|
||||
integer ::= decimalinteger | octinteger | hexinteger |
|
||||
bininteger | oldoctinteger
|
||||
|
||||
octinteger ::= "0" ("o" | "O") octdigit+
|
||||
|
||||
bininteger ::= "0" ("b" | "B") bindigit+
|
||||
|
||||
oldoctinteger ::= "0" octdigit+
|
||||
|
||||
bindigit ::= "0" | "1"
|
||||
|
||||
For Python 3.0, "oldoctinteger" will not be supported, and
|
||||
an exception will be raised if a literal has a leading "0" and
|
||||
a second character which is a digit.
|
||||
|
||||
For both versions, this will require changes to PyLong_FromString
|
||||
as well as the grammar.
|
||||
|
||||
The documentation will have to be changed as well: grammar.txt,
|
||||
as well as the integer literal section of the reference manual.
|
||||
|
||||
PEP 306 should be checked for other issues, and that PEP should
|
||||
be updated if the procedure described therein is insufficient.
|
||||
|
||||
int() specification
|
||||
--------------------
|
||||
|
||||
int(s, 0) will also match the new grammar definition.
|
||||
|
||||
This should happen automatically with the changes to
|
||||
PyLong_FromString required for the grammar change.
|
||||
|
||||
Also the documentation for int() should be changed to explain
|
||||
that int(s) operates identically to int(s, 10), and the word
|
||||
"guess" should be removed from the description of int(s, 0).
|
||||
|
||||
long() specification
|
||||
--------------------
|
||||
|
||||
For Python 2.6, the long() implementation and documentation
|
||||
should be changed to reflect the new grammar.
|
||||
|
||||
Tokenizer exception handling
|
||||
----------------------------
|
||||
|
||||
If an invalid token contains a leading "0", the exception
|
||||
error message should be more informative than the current
|
||||
"SyntaxError: invalid token". It should explain that decimal
|
||||
numbers may not have a leading zero, and that octal numbers
|
||||
require an "o" after the leading zero.
|
||||
|
||||
int() exception handling
|
||||
------------------------
|
||||
|
||||
The ValueError raised for any call to int() with a string
|
||||
should at least explicitly contain the base in the error
|
||||
message, e.g.::
|
||||
|
||||
ValueError: invalid literal for base 8 int(): 09
|
||||
|
||||
oct() function
|
||||
---------------
|
||||
|
||||
oct() should be updated to output '0o' in front of
|
||||
the octal digits (for 3.0, and 2.6 compatibility mode).
|
||||
|
||||
Output formatting
|
||||
-----------------
|
||||
|
||||
The string (and unicode in 2.6) % operator will have
|
||||
'b' format specifier added for binary, and the alternate
|
||||
syntax of the 'o' option will need to be updated to
|
||||
add '0o' in front, instead of '0'.
|
||||
|
||||
PEP 3101 already supports 'b' for binary output.
|
||||
|
||||
|
||||
Transition from 2.6 to 3.0
|
||||
---------------------------
|
||||
|
||||
The 2to3 translator will have to insert 'o' into any
|
||||
octal string literal.
|
||||
|
||||
The Py3K compatible option to Python 2.6 should cause
|
||||
attempts to use oldoctinteger literals to raise an
|
||||
exception.
|
||||
|
||||
|
||||
Rationale
|
||||
=========
|
||||
|
||||
Most of the discussion on these issues occurred on the Python-3000
|
||||
mailing list starting 14-Mar-2007, prompted by an observation that
|
||||
the average human being would be completely mystified upon finding
|
||||
that prepending a "0" to a string of digits changes the meaning of
|
||||
that digit string entirely.
|
||||
|
||||
It was pointed out during this discussion that a similar, but shorter,
|
||||
discussion on the subject occurred in January of 2006, prompted by a
|
||||
discovery of the same issue.
|
||||
|
||||
Background
|
||||
----------
|
||||
|
||||
For historical reasons, Python's string representation of integers
|
||||
in different bases (radices), for string formatting and token
|
||||
literals, borrows heavily from C. [1]_ [2]_ Usage has shown that
|
||||
the historical method of specifying an octal number is confusing,
|
||||
and also that it would be nice to have additional support for binary
|
||||
literals.
|
||||
|
||||
Throughout this document, unless otherwise noted, discussions about
|
||||
the string representation of integers relate to these features:
|
||||
|
||||
- Literal integer tokens, as used by normal module compilation,
|
||||
by eval(), and by int(token, 0). (int(token) and int(token, 2-36)
|
||||
are not modified by this proposal.)
|
||||
|
||||
* Under 2.6, long() is treated the same as int()
|
||||
|
||||
- Formatting of integers into strings, either via the % string
|
||||
operator or the new PEP 3101 advanced string formatting method.
|
||||
|
||||
It is presumed that:
|
||||
|
||||
- All of these features should have an identical set
|
||||
of supported radices, for consistency.
|
||||
|
||||
- Python source code syntax and int(mystring, 0) should
|
||||
continue to share identical behavior.
|
||||
|
||||
|
||||
Removal of old octal syntax
|
||||
----------------------------
|
||||
|
||||
This PEP proposes that the ability to specify an octal number by
|
||||
using a leading zero will be removed from the language in Python 3.0
|
||||
(and the Python 3.0 preview mode of 2.6), and that a SyntaxError will
|
||||
be raised whenever a leading "0" is immediately followed by another
|
||||
digit.
|
||||
|
||||
During the present discussion, it was almost universally agreed that::
|
||||
|
||||
eval('010') == 8
|
||||
|
||||
should no longer be true, because that is confusing to new users.
|
||||
It was also proposed that::
|
||||
|
||||
eval('0010') == 10
|
||||
|
||||
should become true, but that is much more contentious, because it is so
|
||||
inconsistent with usage in other computer languages that mistakes are
|
||||
likely to be made.
|
||||
|
||||
Almost all currently popular computer languages, including C/C++,
|
||||
Java, Perl, and JavaScript, treat a sequence of digits with a
|
||||
leading zero as an octal number. Proponents of treating these
|
||||
numbers as decimal instead have a very valid point -- as discussed
|
||||
in `Supported radices`_, below, the entire non-computer world uses
|
||||
decimal numbers almost exclusively. There is ample anecdotal
|
||||
evidence that many people are dismayed and confused if they
|
||||
are confronted with non-decimal radices.
|
||||
|
||||
However, in most situations, most people do not write gratuitous
|
||||
zeros in front of their decimal numbers. The primary exception is
|
||||
when an attempt is being made to line up columns of numbers. But
|
||||
since PEP 8 specifically discourages the use of spaces to try to
|
||||
align Python code, one would suspect the same argument should apply
|
||||
to the use of leading zeros for the same purpose.
|
||||
|
||||
Finally, although the email discussion often focused on whether anybody
|
||||
actually *uses* octal any more, and whether we should cater to those
|
||||
old-timers in any case, that is almost entirely besides the point.
|
||||
|
||||
Assume the rare complete newcomer to computing who *does*, either
|
||||
occasionally or as a matter of habit, use leading zeros for decimal
|
||||
numbers. Python could either:
|
||||
|
||||
a) silently do the wrong thing with his numbers, as it does now;
|
||||
|
||||
b) immediately disabuse him of the notion that this is viable syntax
|
||||
(and yes, the SyntaxWarning should be more gentle than it
|
||||
currently is, but that is a subject for a different PEP); or
|
||||
|
||||
c) let him continue to think that computers are happy with
|
||||
multi-digit decimal integers which start with "0".
|
||||
|
||||
Some people passionately believe that (c) is the correct answer,
|
||||
and they would be absolutely right if we could be sure that new
|
||||
users will never blossom and grow and start writing AJAX applications.
|
||||
|
||||
So while a new Python user may (currently) be mystified at the
|
||||
delayed discovery that his numbers don't work properly, we can
|
||||
fix it by explaining to him immediately that Python doesn't like
|
||||
leading zeros (hopefully with a reasonable message!), or we can
|
||||
delegate this teaching experience to the JavaScript interpreter
|
||||
in the Internet Explorer browser, and let him try to debug his
|
||||
issue there.
|
||||
|
||||
Supported radices
|
||||
-----------------
|
||||
|
||||
This PEP proposes that the supported radices for the Python
|
||||
language will be 2, 8, 10, and 16.
|
||||
|
||||
Once it is agreed that the old syntax for octal (radix 8) representation
|
||||
of integers must be removed from the language, the next obvious
|
||||
question is "Do we actually need a way to specify (and display)
|
||||
numbers in octal?"
|
||||
|
||||
This question is quickly followed by "What radices does the language
|
||||
need to support?" Because computers are so adept at doing what you
|
||||
tell them to, a tempting answer in the discussion was "all of them."
|
||||
This answer has obviously been given before -- the int() constructor
|
||||
will accept an explicit radix with a value between 2 and 36, inclusive,
|
||||
with the latter number bearing a suspicious arithmetic similarity to
|
||||
the sum of the number of numeric digits and the number of same-case
|
||||
letters in the ASCII alphabet.
|
||||
|
||||
But the best argument for inclusion will have a use-case to back
|
||||
it up, so the idea of supporting all radices was quickly rejected,
|
||||
and the only radices left with any real support were decimal,
|
||||
hexadecimal, octal, and binary.
|
||||
|
||||
Just because a particular radix has a vocal supporter on the
|
||||
mailing list does not mean that it really should be in the
|
||||
language, so the rest of this section is a treatise on the
|
||||
utility of these particular radices, vs. other possible choices.
|
||||
|
||||
Humans use other numeric bases constantly. If I tell you that
|
||||
it is 12:30 PM, I have communicated quantitative information
|
||||
arguably composed of *three* separate bases (12, 60, and 2),
|
||||
only one of which is in the "agreed" list above. But the
|
||||
*communication* of that information used two decimal digits
|
||||
each for the base 12 and base 60 information, and, perversely,
|
||||
two letters for information which could have fit in a single
|
||||
decimal digit.
|
||||
|
||||
So, in general, humans communicate "normal" (non-computer)
|
||||
numerical information either via names (AM, PM, January, ...)
|
||||
or via use of decimal notation. Obviously, names are
|
||||
seldom used for large sets of items, so decimal is used for
|
||||
everything else. There are studies which attempt to explain
|
||||
why this is so, typically reaching the expected conclusion
|
||||
that the Arabic numeral system is well-suited to human
|
||||
cognition. [3]_
|
||||
|
||||
There is even support in the history of the design of
|
||||
computers to indicate that decimal notation is the correct
|
||||
way for computers to communicate with humans. One of
|
||||
the first modern computers, ENIAC [4]_ computed in decimal,
|
||||
even though there were already existing computers which
|
||||
operated in binary.
|
||||
|
||||
Decimal computer operation was important enough
|
||||
that many computers, including the ubiquitous PC, have
|
||||
instructions designed to operate on "binary coded decimal"
|
||||
(BCD) [5]_ , a representation which devotes 4 bits to each
|
||||
decimal digit. These instructions date from a time when the
|
||||
most strenuous calculations ever performed on many numbers
|
||||
were the calculations actually required to perform textual
|
||||
I/O with them. It is possible to display BCD without having
|
||||
to perform a divide/remainder operation on every displayed
|
||||
digit, and this was a huge computational win when most
|
||||
hardware didn't have fast divide capability. Another factor
|
||||
contributing to the use of BCD is that, with BCD calculations,
|
||||
rounding will happen exactly the same way that a human would
|
||||
do it, so BCD is still sometimes used in fields like finance,
|
||||
despite the computational and storage superiority of binary.
|
||||
|
||||
So, if it weren't for the fact that computers themselves
|
||||
normally use binary for efficient computation and data
|
||||
storage, string representations of integers would probably
|
||||
always be in decimal.
|
||||
|
||||
Unfortunately, computer hardware doesn't think like humans,
|
||||
so programmers and hardware engineers must often resort to
|
||||
thinking like the computer, which means that it is important
|
||||
for Python to have the ability to communicate binary data
|
||||
in a form that is understandable to humans.
|
||||
|
||||
The requirement that the binary data notation must be cognitively
|
||||
easy for humans to process means that it should contain an integral
|
||||
number of binary digits (bits) per symbol, while otherwise
|
||||
conforming quite closely to the standard tried-and-true decimal
|
||||
notation (position indicates power, larger magnitude on the left,
|
||||
not too many symbols in the alphabet, etc.).
|
||||
|
||||
The obvious "sweet spot" for this binary data notation is
|
||||
thus octal, which packs the largest integral number of bits
|
||||
possible into a single symbol chosen from the Arabic numeral
|
||||
alphabet.
|
||||
|
||||
In fact, some computer architectures, such as the PDP8 and the
|
||||
8080/Z80, were defined in terms of octal, in the sense of arranging
|
||||
the bitfields of instructions in groups of three, and using
|
||||
octal representations to describe the instruction set.
|
||||
|
||||
Even today, octal is important because of bit-packed structures
|
||||
which consist of 3 bits per field, such as Unix file permission
|
||||
masks.
|
||||
|
||||
But octal has a drawback when used for larger numbers. The
|
||||
number of bits per symbol, while integral, is not itself
|
||||
a power of two. This limitation (given that the word size
|
||||
of most computers these days is a power of two) has resulted
|
||||
in hexadecimal, which is more popular than octal despite the
|
||||
fact that it requires a 60% larger alphabet than decimal,
|
||||
because each symbol contains 4 bits.
|
||||
|
||||
Some numbers, such as Unix file permission masks, are easily
|
||||
decoded by humans when represented in octal, but difficult to
|
||||
decode in hexadecimal, while other numbers are much easier for
|
||||
humans to handle in hexadecimal.
|
||||
|
||||
Unfortunately, there are also binary numbers used in computers
|
||||
which are not very well communicated in either hexadecimal or
|
||||
octal. Thankfully, fewer people have to deal with these on a
|
||||
regular basis, but on the other hand, this means that several
|
||||
people on the discussion list questioned the wisdom of adding
|
||||
a straight binary representation to Python.
|
||||
|
||||
One example of where these numbers is very useful is in
|
||||
reading and writing hardware registers. Sometimes hardware
|
||||
designers will eschew human readability and opt for address
|
||||
space efficiency, by packing multiple bit fields into a single
|
||||
hardware register at unaligned bit locations, and it is tedious
|
||||
and error-prone for a human to reconstruct a 5 bit field which
|
||||
consists of the upper 3 bits of one hex digit, and the lower 2
|
||||
bits of the next hex digit.
|
||||
|
||||
Even if the ability of Python to communicate binary information
|
||||
to humans is only useful for a small technical subset of the
|
||||
population, it is exactly that population subset which contains
|
||||
most, if not all, members of the Python core team, so even straight
|
||||
binary, the least useful of these notations, has several enthusiastic
|
||||
supporters and few, if any, staunch opponents, among the Python community.
|
||||
|
||||
Syntax for supported radices
|
||||
-----------------------------
|
||||
|
||||
This proposal is to to use a "0o" prefix with either uppercase
|
||||
or lowercase "o" for octal, and a "0b" prefix with either
|
||||
uppercase or lowercase "b" for binary.
|
||||
|
||||
There was strong support for not supporting uppercase, but
|
||||
this is a separate subject for a different PEP, as 'j' for
|
||||
complex numbers, 'e' for exponent, and 'r' for raw string
|
||||
(to name a few) already support uppercase.
|
||||
|
||||
The syntax for delimiting the different radices received a lot of
|
||||
attention in the discussion on Python-3000. There are several
|
||||
(sometimes conflicting) requirements and "nice-to-haves" for
|
||||
this syntax:
|
||||
|
||||
- It should be as compatible with other languages and
|
||||
previous versions of Python as is reasonable, both
|
||||
for the input syntax and for the output (e.g. string
|
||||
% operator) syntax.
|
||||
|
||||
- It should be as obvious to the casual observer as
|
||||
possible.
|
||||
|
||||
- It should be easy to visually distinguish integers
|
||||
formatted in the different bases.
|
||||
|
||||
|
||||
Proposed syntaxes included things like arbitrary radix prefixes,
|
||||
such as 16r100 (256 in hexadecimal), and radix suffixes, similar
|
||||
to the 100h assembler-style suffix. The debate on whether the
|
||||
letter "O" could be used for octal was intense -- an uppercase
|
||||
"O" looks suspiciously similar to a zero in some fonts. Suggestions
|
||||
were made to use a "c" (the second letter of "oCtal"), or even
|
||||
to use a "t" for "ocTal" and an "n" for "biNary" to go along
|
||||
with the "x" for "heXadecimal".
|
||||
|
||||
For the string % operator, "o" was already being used to denote
|
||||
octal, and "b" was not used for anything, so this works out
|
||||
much better than, for example, using "c" (which means "character"
|
||||
for the % operator).
|
||||
|
||||
At the end of the day, since uppercase "O" can look like a zero
|
||||
and uppercase "B" can look like an 8, it was decided that these
|
||||
prefixes should be lowercase only, but, like 'r' for raw string,
|
||||
that can be a preference or style-guide issue.
|
||||
|
||||
Open Issues
|
||||
===========
|
||||
|
||||
It was suggested in the discussion that lowercase should be used
|
||||
for all numeric and string special modifiers, such as 'x' for
|
||||
hexadecimal, 'r' for raw strings, 'e' for exponentiation, and
|
||||
'j' for complex numbers. This is an issue for a separate PEP.
|
||||
|
||||
This PEP takes no position on uppercase or lowercase for input,
|
||||
just noting that, for consistency, if uppercase is not to be
|
||||
removed from input parsing for other letters, it should be
|
||||
added for octal and binary, and documenting the changes under
|
||||
this assumption, as there is not yet a PEP about the case issue.
|
||||
|
||||
Output formatting may be a different story -- there is already
|
||||
ample precedence for case sensitivity in the output format string,
|
||||
and there would need to be a consensus that there is a valid
|
||||
use-case for the "alternate form" of the string % operator
|
||||
to support uppercase 'B' or 'O' characters for binary or
|
||||
octal output. Currently, PEP3101 does not even support this
|
||||
alternate capability, and the hex() function does not allow
|
||||
the programmer to specify the case of the 'x' character.
|
||||
|
||||
There are still some strong feelings that '0123' should be
|
||||
allowed as a literal decimal in Python 3.0. If this is the
|
||||
right thing to do, this can easily be covered in an additional
|
||||
PEP. This proposal only takes the first step of making '0123'
|
||||
not be a valid octal number, for reasons covered in the rationale.
|
||||
|
||||
Is there (or should there be) an option for the 2to3 translator
|
||||
which only makes the 2.6 compatible changes? Should this be
|
||||
run on 2.6 library code before the 2.6 release?
|
||||
|
||||
Should a bin() function which matches hex() and oct() be added?
|
||||
|
||||
Is hex() really that useful once we have advanced string formatting?
|
||||
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
.. [1] GNU libc manual printf integer format conversions
|
||||
(http://www.gnu.org/software/libc/manual/html_node/Integer-Conversions.html)
|
||||
|
||||
.. [2] Python string formatting operations
|
||||
(http://docs.python.org/lib/typesseq-strings.html)
|
||||
|
||||
.. [3] The Representation of Numbers, Jiajie Zhang and Donald A. Norman
|
||||
(http://acad88.sahs.uth.tmc.edu/research/publications/Number-Representation.pdf)
|
||||
|
||||
.. [4] ENIAC page at wikipedia
|
||||
(http://en.wikipedia.org/wiki/ENIAC)
|
||||
|
||||
.. [5] BCD page at wikipedia
|
||||
(http://en.wikipedia.org/wiki/Binary-coded_decimal)
|
||||
|
||||
Copyright
|
||||
=========
|
||||
|
||||
This document has been placed in the public domain.
|
||||
|
||||
|
||||
|
||||
..
|
||||
Local Variables:
|
||||
mode: indented-text
|
||||
indent-tabs-mode: nil
|
||||
sentence-end-double-space: t
|
||||
fill-column: 70
|
||||
coding: utf-8
|
||||
End:
|
|
@ -0,0 +1,356 @@
|
|||
PEP: 3128
|
||||
Title: BList: A Faster List-like Type
|
||||
Version: $Revision$
|
||||
Last-Modified: $Date$
|
||||
Author: Daniel Stutzbach <daniel@stutzbachenterprises.com>
|
||||
Discussions-To: Python 3000 List <python-3000@python.org>
|
||||
Status: Draft
|
||||
Type: Standards Track
|
||||
Content-Type: text/x-rst
|
||||
Created: 30-Apr-2007
|
||||
Python-Version: 2.6 and/or 3.0
|
||||
Post-History: 30-Apr-2007
|
||||
|
||||
|
||||
Abstract
|
||||
========
|
||||
|
||||
The common case for list operations is on small lists. The current
|
||||
array-based list implementation excels at small lists due to the
|
||||
strong locality of reference and infrequency of memory allocation
|
||||
operations. However, an array takes O(n) time to insert and delete
|
||||
elements, which can become problematic as the list gets large.
|
||||
|
||||
This PEP introduces a new data type, the BList, that has array-like
|
||||
and tree-like aspects. It enjoys the same good performance on small
|
||||
lists as the existing array-based implementation, but offers superior
|
||||
asymptotic performance for most operations. This PEP proposes
|
||||
replacing the makes two mutually exclusive proposals for including the
|
||||
BList type in Python:
|
||||
|
||||
1. Add it to the collections module, or
|
||||
2. Replace the existing list type
|
||||
|
||||
|
||||
Motivation
|
||||
==========
|
||||
|
||||
The BList grew out of the frustration of needing to rewrite intuitive
|
||||
algorithms that worked fine for small inputs but took O(n**2) time for
|
||||
large inputs due to the underlying O(n) behavior of array-based lists.
|
||||
The deque type, introduced in Python 2.4, solved the most common
|
||||
problem of needing a fast FIFO queue. However, the deque type doesn't
|
||||
help if we need to repeatedly insert or delete elements from the
|
||||
middle of a long list.
|
||||
|
||||
A wide variety of data structure provide good asymptotic performance
|
||||
for insertions and deletions, but they either have O(n) performance
|
||||
for other operations (e.g., linked lists) or have inferior performance
|
||||
for small lists (e.g., binary trees and skip lists).
|
||||
|
||||
The BList type proposed in this PEP is based on the principles of
|
||||
B+Trees, which have array-like and tree-like aspects. The BList
|
||||
offers array-like performance on small lists, while offering O(log n)
|
||||
asymptotic performance for all insert and delete operations.
|
||||
Additionally, the BList implements copy-on-write under-the-hood, so
|
||||
even operations like getslice take O(log n) time. The table below
|
||||
compares the asymptotic performance of the current array-based list
|
||||
implementation with the asymptotic performance of the BList.
|
||||
|
||||
========= ================ ====================
|
||||
Operation Array-based list BList
|
||||
========= ================ ====================
|
||||
Copy O(n) **O(1)**
|
||||
Append **O(1)** O(log n)
|
||||
Insert O(n) **O(log n)**
|
||||
Get Item **O(1)** O(log n)
|
||||
Set Item **O(1)** **O(log n)**
|
||||
Del Item O(n) **O(log n)**
|
||||
Iteration O(n) O(n)
|
||||
Get Slice O(k) **O(log n)**
|
||||
Del Slice O(n) **O(log n)**
|
||||
Set Slice O(n+k) **O(log k + log n)**
|
||||
Extend O(k) **O(log k + log n)**
|
||||
Sort O(n log n) O(n log n)
|
||||
Multiply O(nk) **O(log k)**
|
||||
========= ================ ====================
|
||||
|
||||
An extensive empirical comparison of Python's array-based list and the
|
||||
BList are available at [2]_.
|
||||
|
||||
Use Case Trade-offs
|
||||
===================
|
||||
|
||||
The BList offers superior performance for many, but not all,
|
||||
operations. Choosing the correct data type for a particular use case
|
||||
depends on which operations are used. Choosing the correct data type
|
||||
as a built-in depends on balancing the importance of different use
|
||||
cases and the magnitude of the performance differences.
|
||||
|
||||
For the common uses cases of small lists, the array-based list and the
|
||||
BList have similar performance characteristics.
|
||||
|
||||
For the slightly less common case of large lists, there are two common
|
||||
uses cases where the existing array-based list outperforms the
|
||||
existing BList reference implementation. These are:
|
||||
|
||||
1. A large LIFO stack, where there are many .append() and .pop(-1)
|
||||
operations. Each operation is O(1) for an array-based list, but
|
||||
O(log n) for the BList.
|
||||
|
||||
2. A large list that does not change size. The getitem and setitem
|
||||
calls are O(1) for an array-based list, but O(log n) for the BList.
|
||||
|
||||
In performance tests on a 10,000 element list, BLists exhibited a 50%
|
||||
and 5% increase in execution time for these two uses cases,
|
||||
respectively.
|
||||
|
||||
The performance for the LIFO use case could be improved to O(n) time,
|
||||
by caching a pointer to the right-most leaf within the root node. For
|
||||
lists that do not change size, the common case of sequential access
|
||||
could also be improved to O(n) time via caching in the root node.
|
||||
However, the performance of these approaches has not been empirically
|
||||
tested.
|
||||
|
||||
Many operations exhibit a tremendous speed-up (O(n) to O(log n)) when
|
||||
switching from the array-based list to BLists. In performance tests
|
||||
on a 10,000 element list, operations such as getslice, setslice, and
|
||||
FIFO-style insert and deletes on a BList take only 1% of the time
|
||||
needed on array-based lists.
|
||||
|
||||
In light of the large performance speed-ups for many operations, the
|
||||
small performance costs for some operations will be worthwhile for
|
||||
many (but not all) applications.
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
The BList is based on the B+Tree data structure. The BList is a wide,
|
||||
bushy tree where each node contains an array of up to 128 pointers to
|
||||
its children. If the node is a leaf, its children are the
|
||||
user-visible objects that the user has placed in the list. If node is
|
||||
not a leaf, its children are other BList nodes that are not
|
||||
user-visible. If the list contains only a few elements, they will all
|
||||
be a children of single node that is both the root and a leaf. Since
|
||||
a node is little more than array of pointers, small lists operate in
|
||||
effectively the same way as an array-based data type and share the
|
||||
same good performance characteristics.
|
||||
|
||||
The BList maintains a few invariants to ensure good (O(log n))
|
||||
asymptotic performance regardless of the sequence of insert and delete
|
||||
operations. The principle invariants are as follows:
|
||||
|
||||
1. Each node has at most 128 children.
|
||||
2. Each non-root node has at least 64 children.
|
||||
3. The root node has at least 2 children, unless the list contains
|
||||
fewer than 2 elements.
|
||||
4. The tree is of uniform depth.
|
||||
|
||||
If an insert would cause a node to exceed 128 children, the node
|
||||
spawns a sibling and transfers half of its children to the sibling.
|
||||
The sibling is inserted into the node's parent. If the node is the
|
||||
root node (and thus has no parent), a new parent is created and the
|
||||
depth of the tree increases by one.
|
||||
|
||||
If a deletion would cause a node to have fewer than 64 children, the
|
||||
node moves elements from one of its siblings if possible. If both of
|
||||
its siblings also only have 64 children, then two of the nodes merge
|
||||
and the empty one is removed from its parent. If the root node is
|
||||
reduced to only one child, its single child becomes the new root
|
||||
(i.e., the depth of the tree is reduced by one).
|
||||
|
||||
In addition to tree-like asymptotic performance and array-like
|
||||
performance on small-lists, BLists support transparent
|
||||
**copy-on-write**. If a non-root node needs to be copied (as part of
|
||||
a getslice, copy, setslice, etc.), the node is shared between multiple
|
||||
parents instead of being copied. If it needs to be modified later, it
|
||||
will be copied at that time. This is completely behind-the-scenes;
|
||||
from the user's point of view, the BList works just like a regular
|
||||
Python list.
|
||||
|
||||
Memory Usage
|
||||
============
|
||||
|
||||
In the worst case, the leaf nodes of a BList have only 64 children
|
||||
each, rather than a full 128, meaning that memory usage is around
|
||||
twice that of a best-case array implementation. Non-leaf nodes use up
|
||||
a negligible amount of additional memory, since there are at least 63
|
||||
times as many leaf nodes as non-leaf nodes.
|
||||
|
||||
The existing array-based list implementation must grow and shrink as
|
||||
items are added and removed. To be efficient, it grows and shrinks
|
||||
only when the list has grow or shrunk exponentially. In the worst
|
||||
case, it, too, uses twice as much memory as the best case.
|
||||
|
||||
In summary, the BList's memory footprint is not significantly
|
||||
different from the existing array-based implementation.
|
||||
|
||||
Backwards Compatibility
|
||||
=======================
|
||||
|
||||
If the BList is added to the collections module, backwards
|
||||
compatibility is not an issue. This section focuses on the option of
|
||||
replacing the existing array-based list with the BList. For users of
|
||||
the Python interpreter, a BList has an identical interface to the
|
||||
current list-implementation. For virtually all operations, the
|
||||
behavior is identical, aside from execution speed.
|
||||
|
||||
For the C API, BList has a different interface than the existing
|
||||
list-implementation. Due to its more complex structure, the BList
|
||||
does not lend itself well to poking and prodding by external sources.
|
||||
Thankfully, the existing list-implementation defines an API of
|
||||
functions and macros for accessing data from list objects. Google
|
||||
Code Search suggests that the majority of third-party modules uses the
|
||||
well-defined API rather than relying on the list's structure
|
||||
directly. The table below summarizes the search queries and results:
|
||||
|
||||
======================== =================
|
||||
Search String Number of Results
|
||||
======================== =================
|
||||
PyList_GetItem 2,000
|
||||
PySequence_GetItem 800
|
||||
PySequence_Fast_GET_ITEM 100
|
||||
PyList_GET_ITEM 400
|
||||
\[^a\-zA\-Z\_\]ob_item 100
|
||||
======================== =================
|
||||
|
||||
|
||||
This can be achieved in one of two ways:
|
||||
|
||||
1. Redefine the various accessor functions and macros in listobject.h
|
||||
to access a BList instead. The interface would be unchanged. The
|
||||
functions can easily be redefined. The macros need a bit more care
|
||||
and would have to resort to function calls for large lists.
|
||||
|
||||
The macros would need to evaluate their arguments more than once,
|
||||
which could be a problem if the arguments have side effects. A
|
||||
Google Code Search for "PyList_GET_ITEM\(\[^)\]+\(" found only a
|
||||
handful of cases where this occurs, so the impact appears to be
|
||||
low.
|
||||
|
||||
The few extension modules that use list's undocumented structure
|
||||
directly, instead of using the API, would break. The core code
|
||||
itself uses the accessor macros fairly consistently and should be
|
||||
easy to port.
|
||||
|
||||
2. Deprecate the existing list type, but continue to include it.
|
||||
Extension modules wishing to use the new BList type must do so
|
||||
explicitly. The BList C interface can be changed to match the
|
||||
existing PyList interface so that a simple search-replace will be
|
||||
sufficient for 99% of module writers.
|
||||
|
||||
Existing modules would continue to compile and work without change,
|
||||
but they would need to make a deliberate (but small) effort to
|
||||
migrate to the BList.
|
||||
|
||||
The downside of this approach is that mixing modules that use
|
||||
BLists and array-based lists might lead to slow down if conversions
|
||||
are frequently necessary.
|
||||
|
||||
Reference Implementation
|
||||
========================
|
||||
|
||||
A reference implementations of the BList is available for CPython at [1]_.
|
||||
|
||||
The source package also includes a pure Python implementation,
|
||||
originally developed as a prototype for the CPython version.
|
||||
Naturally, the pure Python version is rather slow and the asymptotic
|
||||
improvements don't win out until the list is quite large.
|
||||
|
||||
When compiled with Py_DEBUG, the C implementation checks the
|
||||
BList invariants when entering and exiting most functions.
|
||||
|
||||
An extensive set of test cases is also included in the source package.
|
||||
The test cases include the existing Python sequence and list test
|
||||
cases as a subset. When the interpreter is built with Py_DEBUG, the
|
||||
test cases also check for reference leaks.
|
||||
|
||||
Porting to Other Python Variants
|
||||
--------------------------------
|
||||
|
||||
If the BList is added to the collections module, other Python variants
|
||||
can support it in one of three ways:
|
||||
|
||||
1. Make blist an alias for list. The asymptotic performance won't be
|
||||
as good, but it'll work.
|
||||
2. Use the pure Python reference implementation. The performance for
|
||||
small lists won't be as good, but it'll work.
|
||||
3. Port the reference implementation.
|
||||
|
||||
Discussion
|
||||
==========
|
||||
|
||||
This proposal has been discussed briefly on the Python-3000 mailing
|
||||
list [3]_. Although a number of people favored the proposal, there
|
||||
were also some objections. Below summarizes the pros and cons as
|
||||
observed by posters to the thread.
|
||||
|
||||
General comments:
|
||||
|
||||
- Pro: Will outperform the array-based list in most cases
|
||||
- Pro: "I've implemented variants of this ... a few different times"
|
||||
- Con: Desirability and performance in actual applications is unproven
|
||||
|
||||
Comments on adding BList to the collections module:
|
||||
|
||||
- Pro: Matching the list-API reduces the learning curve to near-zero
|
||||
- Pro: Useful for intermediate-level users; won't get in the way of beginners
|
||||
- Con: Proliferation of data types makes the choices for developers harder.
|
||||
|
||||
Comments on replacing the array-based list with the BList:
|
||||
|
||||
- Con: Impact on extension modules (addressed in `Backwards
|
||||
Compatibility`_)
|
||||
- Con: The use cases where BLists are slower are important
|
||||
(see `Use Case Trade-Offs`_ for how these might be addressed).
|
||||
- Con: The array-based list code is simple and easy to maintain
|
||||
|
||||
To assess the desirability and performance in actual applications,
|
||||
Raymond Hettinger suggested releasing the BList as an extension module
|
||||
(now available at [1]_). If it proves useful, he felt it would be a
|
||||
strong candidate for inclusion in 2.6 as part of the collections
|
||||
module. If widely popular, then it could be considered for replacing
|
||||
the array-based list, but not otherwise.
|
||||
|
||||
Guido van Rossum commented that he opposed the proliferation of data
|
||||
types, but favored replacing the array-based list if backwards
|
||||
compatibility could be addressed and the BList's performance was
|
||||
uniformly better.
|
||||
|
||||
On-going Tasks
|
||||
==============
|
||||
|
||||
- Reduce the memory footprint of small lists
|
||||
- Implement TimSort for BLists, so that best-case sorting is O(n)
|
||||
instead of O(log n).
|
||||
- Implement __reversed__
|
||||
- Cache a pointer in the root to the rightmost leaf, to make LIFO
|
||||
operation O(n) time.
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
.. [1] Reference Implementations for C and Python:
|
||||
http://www.python.org/pypi/blist/
|
||||
|
||||
.. [2] Empirical performance comparison between Python's array-based
|
||||
list and the blist: http://stutzbachenterprises.com/blist/
|
||||
|
||||
.. [3] Discussion on python-3000 starting at post:
|
||||
http://mail.python.org/pipermail/python-3000/2007-April/006757.html
|
||||
|
||||
Copyright
|
||||
=========
|
||||
|
||||
This document has been placed in the public domain.
|
||||
|
||||
|
||||
|
||||
..
|
||||
Local Variables:
|
||||
mode: indented-text
|
||||
indent-tabs-mode: nil
|
||||
sentence-end-double-space: t
|
||||
fill-column: 70
|
||||
coding: utf-8
|
||||
End:
|
Loading…
Reference in New Issue