PEP 3127: Integer Literal Support and Syntax Maupin

PEP 3128: BList: A Faster List-like Type               Stutzbach
Guido van Rossum 2007-05-01 16:57:09 +00:00
parent b08fe48dbf
commit 0129f4bde8
3 changed files with 876 additions and 0 deletions


@ -126,6 +126,7 @@ Index by Category
S 3125 Remove Backslash Continuation Jewett
S 3126 Remove Implicit String Concatenation Jewett
S 3127 Integer Literal Support and Syntax Maupin
S 3128 BList: A Faster List-like Type Stutzbach
S 3141 A Type Hierarchy for Numbers Yasskin
Finished PEPs (done, implemented in Subversion)
@ -494,6 +495,7 @@ Numerical Index
S 3125 Remove Backslash Continuation Jewett
S 3126 Remove Implicit String Concatenation Jewett
S 3127 Integer Literal Support and Syntax Maupin
S 3128 BList: A Faster List-like Type Stutzbach
S 3141 A Type Hierarchy for Numbers Yasskin

518
pep-3127.txt Normal file

@ -0,0 +1,518 @@
PEP: 3127
Title: Integer Literal Support and Syntax
Version: $Revision$
Last-Modified: $Date$
Author: Patrick Maupin <pmaupin@gmail.com>
Discussions-To: Python-3000@python.org
Status: Draft
Type: Standards Track
Python-Version: 3.0
Content-Type: text/x-rst
Created: 14-Mar-2007
Post-History: 18-Mar-2007
Abstract
========
This PEP proposes changes to the Python core to rationalize
the treatment of string literal representations of integers
in different radices (bases). These changes are targeted at
Python 3.0, but the backward-compatible parts of the changes
should be added to Python 2.6, so that all valid 3.0 integer
literals will also be valid in 2.6.
The proposal is that:
a) octal literals must now be specified
with a leading "0o" or "0O" instead of "0";
b) binary literals are now supported via a
leading "0b" or "0B"; and
c) provision will be made for binary numbers in
string formatting.
Motivation
==========
This PEP was motivated by two different issues:
- The default octal representation of integers is silently confusing
to people unfamiliar with C-like languages. It is extremely easy
to inadvertently create an integer object with the wrong value,
because '013' means 'decimal 11', not 'decimal 13', to the Python
language itself, which is not the meaning that most humans would
assign to this literal.
- Some Python users have a strong desire for binary support in
the language.
Specification
=============
Grammar specification
---------------------
The grammar will be changed. For Python 2.6, the changed and
new token definitions will be::
    integer ::= decimalinteger | octinteger | hexinteger |
                bininteger | oldoctinteger
    octinteger ::= "0" ("o" | "O") octdigit+
    bininteger ::= "0" ("b" | "B") bindigit+
    oldoctinteger ::= "0" octdigit+
    bindigit ::= "0" | "1"
For Python 3.0, "oldoctinteger" will not be supported, and
an exception will be raised if a literal has a leading "0" and
a second character which is a digit.
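As a minimal sketch of the grammar above, assuming a Python 3 interpreter
(where "oldoctinteger" is absent), the following checks the new prefixes
and probes whether a leading-zero literal still compiles. The helper name
``parses`` is hypothetical and exists only for this illustration.

```python
# New-style literals: the prefix names the radix explicitly.
assert 0o17 == 15        # octal, "0o" prefix
assert 0b101 == 5        # binary, "0b" prefix
assert 0x1F == 31        # hexadecimal, unchanged by the proposal


def parses(source):
    """Return True if `source` compiles as a Python expression (hypothetical helper)."""
    try:
        compile(source, "<literal>", "eval")
        return True
    except SyntaxError:
        return False


old_style_ok = parses("013")    # leading "0" followed by a digit
new_style_ok = parses("0o13")   # explicit octal prefix
```

Under these assumptions the old-style literal is rejected at compile time
rather than being silently read as octal.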
For both versions, this will require changes to PyLong_FromString
as well as the grammar.
The documentation will have to be changed as well: grammar.txt,
as well as the integer literal section of the reference manual.
PEP 306 should be checked for other issues, and that PEP should
be updated if the procedure described therein is insufficient.
int() specification
--------------------
int(s, 0) will also match the new grammar definition.
This should happen automatically with the changes to
PyLong_FromString required for the grammar change.
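The relationship between int(s, 0), int(s), and int(s, 10) can be
sketched as follows, assuming a Python 3 interpreter. The helper
``parse_or_none`` is a hypothetical name used only to keep the example
compact.

```python
def parse_or_none(text, base):
    """Return int(text, base), or None if the string is rejected (hypothetical helper)."""
    try:
        return int(text, base)
    except ValueError:
        return None


# Base 0 follows the literal grammar, so the prefix selects the radix.
prefixed = [parse_or_none(s, 0) for s in ("0o17", "0b101", "0x1f", "17")]

# Without a base, the string is read as decimal and a leading zero is harmless.
plain_decimal = int("010")
explicit_decimal = int("010", 10)

# Under base 0, a leading zero followed by a digit is not a valid literal.
rejected = parse_or_none("010", 0)
```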
Also the documentation for int() should be changed to explain
that int(s) operates identically to int(s, 10), and the word
"guess" should be removed from the description of int(s, 0).
long() specification
--------------------
For Python 2.6, the long() implementation and documentation
should be changed to reflect the new grammar.
Tokenizer exception handling
----------------------------
If an invalid token contains a leading "0", the exception
error message should be more informative than the current
"SyntaxError: invalid token". It should explain that decimal
numbers may not have a leading zero, and that octal numbers
require an "o" after the leading zero.
int() exception handling
------------------------
The ValueError raised for any call to int() with a string
should at least explicitly contain the base in the error
message, e.g.::
ValueError: invalid literal for base 8 int(): 09
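A small sketch, assuming a Python 3 interpreter, of how one might confirm
that the base appears in the error text. The exact wording of the message
is an implementation detail, so the example only inspects it for the base
and the offending string; ``error_message`` is a hypothetical helper.

```python
def error_message(text, base):
    """Return the ValueError text raised by int(text, base), or None on success."""
    try:
        int(text, base)
    except ValueError as exc:
        return str(exc)
    return None


# "9" is not an octal digit, so this conversion must fail.
message = error_message("09", 8)
```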
oct() function
---------------
oct() should be updated to output '0o' in front of
the octal digits (for 3.0, and 2.6 compatibility mode).
Output formatting
-----------------
The string (and unicode in 2.6) % operator will have
a 'b' format specifier added for binary, and the alternate
syntax of the 'o' option will need to be updated to
add '0o' in front, instead of '0'.
PEP 3101 already supports 'b' for binary output.
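The output side can be sketched as follows, assuming a Python 3
interpreter. Binary output is shown through format(), the PEP 3101 route
mentioned above; the variable names are illustrative only.

```python
n = 10

octal_plain = "%o" % n          # octal digits only
octal_alt = "%#o" % n           # alternate form, with the "0o" prefix
binary_plain = format(n, "b")   # binary digits only
binary_alt = format(n, "#b")    # alternate form, with the "0b" prefix

# The built-in conversion functions emit prefixed strings.
via_functions = (oct(n), bin(n), hex(n))

# Prefixed output can be read back with int(s, 0).
round_trip = [int(s, 0) for s in via_functions]
```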
Transition from 2.6 to 3.0
---------------------------
The 2to3 translator will have to insert 'o' into any
old-style octal integer literal.
The Py3K compatible option to Python 2.6 should cause
attempts to use oldoctinteger literals to raise an
exception.
Rationale
=========
Most of the discussion on these issues occurred on the Python-3000
mailing list starting 14-Mar-2007, prompted by an observation that
the average human being would be completely mystified upon finding
that prepending a "0" to a string of digits changes the meaning of
that digit string entirely.
It was pointed out during this discussion that a similar, but shorter,
discussion on the subject occurred in January of 2006, prompted by a
discovery of the same issue.
Background
----------
For historical reasons, Python's string representation of integers
in different bases (radices), for string formatting and token
literals, borrows heavily from C. [1]_ [2]_ Usage has shown that
the historical method of specifying an octal number is confusing,
and also that it would be nice to have additional support for binary
literals.
Throughout this document, unless otherwise noted, discussions about
the string representation of integers relate to these features:
- Literal integer tokens, as used by normal module compilation,
by eval(), and by int(token, 0). (int(token) and int(token, 2-36)
are not modified by this proposal.)
* Under 2.6, long() is treated the same as int()
- Formatting of integers into strings, either via the % string
operator or the new PEP 3101 advanced string formatting method.
It is presumed that:
- All of these features should have an identical set
of supported radices, for consistency.
- Python source code syntax and int(mystring, 0) should
continue to share identical behavior.
Removal of old octal syntax
----------------------------
This PEP proposes that the ability to specify an octal number by
using a leading zero will be removed from the language in Python 3.0
(and the Python 3.0 preview mode of 2.6), and that a SyntaxError will
be raised whenever a leading "0" is immediately followed by another
digit.
During the present discussion, it was almost universally agreed that::
eval('010') == 8
should no longer be true, because that is confusing to new users.
It was also proposed that::
eval('0010') == 10
should become true, but that is much more contentious, because it is so
inconsistent with usage in other computer languages that mistakes are
likely to be made.
Almost all currently popular computer languages, including C/C++,
Java, Perl, and JavaScript, treat a sequence of digits with a
leading zero as an octal number. Proponents of treating these
numbers as decimal instead have a very valid point -- as discussed
in `Supported radices`_, below, the entire non-computer world uses
decimal numbers almost exclusively. There is ample anecdotal
evidence that many people are dismayed and confused if they
are confronted with non-decimal radices.
However, in most situations, most people do not write gratuitous
zeros in front of their decimal numbers. The primary exception is
when an attempt is being made to line up columns of numbers. But
since PEP 8 specifically discourages the use of spaces to try to
align Python code, one would suspect the same argument should apply
to the use of leading zeros for the same purpose.
Finally, although the email discussion often focused on whether anybody
actually *uses* octal any more, and whether we should cater to those
old-timers in any case, that is almost entirely beside the point.
Assume the rare complete newcomer to computing who *does*, either
occasionally or as a matter of habit, use leading zeros for decimal
numbers. Python could either:
a) silently do the wrong thing with his numbers, as it does now;
b) immediately disabuse him of the notion that this is viable syntax
(and yes, the SyntaxWarning should be more gentle than it
currently is, but that is a subject for a different PEP); or
c) let him continue to think that computers are happy with
multi-digit decimal integers which start with "0".
Some people passionately believe that (c) is the correct answer,
and they would be absolutely right if we could be sure that new
users will never blossom and grow and start writing AJAX applications.
So while a new Python user may (currently) be mystified at the
delayed discovery that his numbers don't work properly, we can
fix it by explaining to him immediately that Python doesn't like
leading zeros (hopefully with a reasonable message!), or we can
delegate this teaching experience to the JavaScript interpreter
in the Internet Explorer browser, and let him try to debug his
issue there.
Supported radices
-----------------
This PEP proposes that the supported radices for the Python
language will be 2, 8, 10, and 16.
Once it is agreed that the old syntax for octal (radix 8) representation
of integers must be removed from the language, the next obvious
question is "Do we actually need a way to specify (and display)
numbers in octal?"
This question is quickly followed by "What radices does the language
need to support?" Because computers are so adept at doing what you
tell them to, a tempting answer in the discussion was "all of them."
This answer has obviously been given before -- the int() constructor
will accept an explicit radix with a value between 2 and 36, inclusive,
with the latter number bearing a suspicious arithmetic similarity to
the sum of the number of numeric digits and the number of same-case
letters in the ASCII alphabet.
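The arithmetic behind the limit of 36 can be sketched as below. The
function ``to_base`` is a hypothetical inverse of int(s, base), written
only for this illustration; it is not part of the proposal.

```python
import string

# Ten digits plus twenty-six same-case letters give 36 distinct symbols.
alphabet = string.digits + string.ascii_lowercase


def to_base(n, base):
    """Render a non-negative integer in the given base (2..36)."""
    if not 2 <= base <= len(alphabet):
        raise ValueError("base must be between 2 and 36")
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return "0"
    digits = []
    while n:
        n, remainder = divmod(n, base)
        digits.append(alphabet[remainder])
    return "".join(reversed(digits))
```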
But the best argument for inclusion will have a use-case to back
it up, so the idea of supporting all radices was quickly rejected,
and the only radices left with any real support were decimal,
hexadecimal, octal, and binary.
Just because a particular radix has a vocal supporter on the
mailing list does not mean that it really should be in the
language, so the rest of this section is a treatise on the
utility of these particular radices, vs. other possible choices.
Humans use other numeric bases constantly. If I tell you that
it is 12:30 PM, I have communicated quantitative information
arguably composed of *three* separate bases (12, 60, and 2),
only one of which is in the "agreed" list above. But the
*communication* of that information used two decimal digits
each for the base 12 and base 60 information, and, perversely,
two letters for information which could have fit in a single
decimal digit.
So, in general, humans communicate "normal" (non-computer)
numerical information either via names (AM, PM, January, ...)
or via use of decimal notation. Obviously, names are
seldom used for large sets of items, so decimal is used for
everything else. There are studies which attempt to explain
why this is so, typically reaching the expected conclusion
that the Arabic numeral system is well-suited to human
cognition. [3]_
There is even support in the history of the design of
computers to indicate that decimal notation is the correct
way for computers to communicate with humans. One of
the first modern computers, ENIAC [4]_, computed in decimal,
even though there were already existing computers which
operated in binary.
Decimal computer operation was important enough
that many computers, including the ubiquitous PC, have
instructions designed to operate on "binary coded decimal"
(BCD) [5]_, a representation which devotes 4 bits to each
decimal digit. These instructions date from a time when the
most strenuous calculations ever performed on many numbers
were the calculations actually required to perform textual
I/O with them. It is possible to display BCD without having
to perform a divide/remainder operation on every displayed
digit, and this was a huge computational win when most
hardware didn't have fast divide capability. Another factor
contributing to the use of BCD is that, with BCD calculations,
rounding will happen exactly the same way that a human would
do it, so BCD is still sometimes used in fields like finance,
despite the computational and storage superiority of binary.
So, if it weren't for the fact that computers themselves
normally use binary for efficient computation and data
storage, string representations of integers would probably
always be in decimal.
Unfortunately, computer hardware doesn't think like humans,
so programmers and hardware engineers must often resort to
thinking like the computer, which means that it is important
for Python to have the ability to communicate binary data
in a form that is understandable to humans.
The requirement that the binary data notation must be cognitively
easy for humans to process means that it should contain an integral
number of binary digits (bits) per symbol, while otherwise
conforming quite closely to the standard tried-and-true decimal
notation (position indicates power, larger magnitude on the left,
not too many symbols in the alphabet, etc.).
The obvious "sweet spot" for this binary data notation is
thus octal, which packs the largest integral number of bits
possible into a single symbol chosen from the Arabic numeral
alphabet.
In fact, some computer architectures, such as the PDP-8 and the
8080/Z80, were defined in terms of octal, in the sense of arranging
the bitfields of instructions in groups of three, and using
octal representations to describe the instruction set.
Even today, octal is important because of bit-packed structures
which consist of 3 bits per field, such as Unix file permission
masks.
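To illustrate why 3-bit fields read naturally in octal, here is a minimal
sketch that decodes a permission mask one octal digit at a time. The
function ``describe`` is hypothetical, and it covers only the low nine
permission bits.

```python
def describe(mode):
    """Render the low nine permission bits as an 'rwxrwxrwx' style string."""
    parts = []
    for shift in (6, 3, 0):              # owner, group, other
        digit = (mode >> shift) & 0o7    # exactly one octal digit
        parts.append(
            ("r" if digit & 0o4 else "-")
            + ("w" if digit & 0o2 else "-")
            + ("x" if digit & 0o1 else "-")
        )
    return "".join(parts)


mode = 0o755
as_octal = format(mode, "o")   # one digit per permission triple
as_hex = format(mode, "x")     # same value, but triples straddle digit boundaries
```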
But octal has a drawback when used for larger numbers. The
number of bits per symbol, while integral, is not itself
a power of two. This limitation (given that the word size
of most computers these days is a power of two) has resulted
in hexadecimal, which is more popular than octal despite the
fact that it requires a 60% larger alphabet than decimal,
because each symbol contains 4 bits.
Some numbers, such as Unix file permission masks, are easily
decoded by humans when represented in octal, but difficult to
decode in hexadecimal, while other numbers are much easier for
humans to handle in hexadecimal.
Unfortunately, there are also binary numbers used in computers
which are not very well communicated in either hexadecimal or
octal. Thankfully, fewer people have to deal with these on a
regular basis, but on the other hand, this means that several
people on the discussion list questioned the wisdom of adding
a straight binary representation to Python.
One example of where these numbers are very useful is in
reading and writing hardware registers. Sometimes hardware
designers will eschew human readability and opt for address
space efficiency, by packing multiple bit fields into a single
hardware register at unaligned bit locations, and it is tedious
and error-prone for a human to reconstruct a 5 bit field which
consists of the upper 3 bits of one hex digit, and the lower 2
bits of the next hex digit.
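The register case can be sketched with a hypothetical layout: a 5-bit
field occupying bits 1 through 5, that is, the upper 3 bits of the low
hex digit and the lower 2 bits of the next one. The register value and
the names ``FIELD_SHIFT``, ``FIELD_MASK``, ``read_field``, and
``write_field`` are assumptions made for this example.

```python
FIELD_SHIFT = 1        # hypothetical: the field starts at bit 1
FIELD_MASK = 0b11111   # five bits wide, written as a binary literal


def read_field(register):
    """Extract the 5-bit field from the register value."""
    return (register >> FIELD_SHIFT) & FIELD_MASK


def write_field(register, value):
    """Return the register value with the 5-bit field replaced by `value`."""
    if not 0 <= value <= FIELD_MASK:
        raise ValueError("value does not fit in five bits")
    cleared = register & ~(FIELD_MASK << FIELD_SHIFT)
    return cleared | (value << FIELD_SHIFT)


register = 0x5A
field = read_field(register)
as_bits = format(register, "08b")   # the field is easy to spot in this form
```

In the binary rendering the field is simply the five characters before
the last one, which is harder to see in the hexadecimal form.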
Even if the ability of Python to communicate binary information
to humans is only useful for a small technical subset of the
population, it is exactly that population subset which contains
most, if not all, members of the Python core team, so even straight
binary, the least useful of these notations, has several enthusiastic
supporters and few, if any, staunch opponents, among the Python community.
Syntax for supported radices
-----------------------------
This proposal is to use a "0o" prefix with either uppercase
or lowercase "o" for octal, and a "0b" prefix with either
uppercase or lowercase "b" for binary.
There was strong support for not supporting uppercase, but
this is a separate subject for a different PEP, as 'j' for
complex numbers, 'e' for exponent, and 'r' for raw string
(to name a few) already support uppercase.
The syntax for delimiting the different radices received a lot of
attention in the discussion on Python-3000. There are several
(sometimes conflicting) requirements and "nice-to-haves" for
this syntax:
- It should be as compatible with other languages and
previous versions of Python as is reasonable, both
for the input syntax and for the output (e.g. string
% operator) syntax.
- It should be as obvious to the casual observer as
possible.
- It should be easy to visually distinguish integers
formatted in the different bases.
Proposed syntaxes included things like arbitrary radix prefixes,
such as 16r100 (decimal 256, written in hexadecimal), and radix suffixes, similar
to the 100h assembler-style suffix. The debate on whether the
letter "O" could be used for octal was intense -- an uppercase
"O" looks suspiciously similar to a zero in some fonts. Suggestions
were made to use a "c" (the second letter of "oCtal"), or even
to use a "t" for "ocTal" and an "n" for "biNary" to go along
with the "x" for "heXadecimal".
For the string % operator, "o" was already being used to denote
octal, and "b" was not used for anything, so this works out
much better than, for example, using "c" (which means "character"
for the % operator).
At the end of the day, since uppercase "O" can look like a zero
and uppercase "B" can look like an 8, the recommendation is to write
these prefixes in lowercase only, but, like 'r' for raw strings,
that remains a preference or style-guide issue rather than a grammar restriction.
Open Issues
===========
It was suggested in the discussion that lowercase should be used
for all numeric and string special modifiers, such as 'x' for
hexadecimal, 'r' for raw strings, 'e' for exponentiation, and
'j' for complex numbers. This is an issue for a separate PEP.
This PEP takes no position on uppercase or lowercase for input,
just noting that, for consistency, if uppercase is not to be
removed from input parsing for other letters, it should be
added for octal and binary, and documenting the changes under
this assumption, as there is not yet a PEP about the case issue.
Output formatting may be a different story -- there is already
ample precedent for case sensitivity in the output format string,
and there would need to be a consensus that there is a valid
use-case for the "alternate form" of the string % operator
to support uppercase 'B' or 'O' characters for binary or
octal output. Currently, PEP 3101 does not even support this
alternate capability, and the hex() function does not allow
the programmer to specify the case of the 'x' character.
There are still some strong feelings that '0123' should be
allowed as a literal decimal in Python 3.0. If this is the
right thing to do, this can easily be covered in an additional
PEP. This proposal only takes the first step of making '0123'
not be a valid octal number, for reasons covered in the rationale.
Is there (or should there be) an option for the 2to3 translator
which only makes the 2.6 compatible changes? Should this be
run on 2.6 library code before the 2.6 release?
Should a bin() function which matches hex() and oct() be added?
Is hex() really that useful once we have advanced string formatting?
References
==========
.. [1] GNU libc manual printf integer format conversions
(http://www.gnu.org/software/libc/manual/html_node/Integer-Conversions.html)
.. [2] Python string formatting operations
(http://docs.python.org/lib/typesseq-strings.html)
.. [3] The Representation of Numbers, Jiajie Zhang and Donald A. Norman
(http://acad88.sahs.uth.tmc.edu/research/publications/Number-Representation.pdf)
.. [4] ENIAC page at wikipedia
(http://en.wikipedia.org/wiki/ENIAC)
.. [5] BCD page at wikipedia
(http://en.wikipedia.org/wiki/Binary-coded_decimal)
Copyright
=========
This document has been placed in the public domain.
..
Local Variables:
mode: indented-text
indent-tabs-mode: nil
sentence-end-double-space: t
fill-column: 70
coding: utf-8
End:

356
pep-3128.txt Normal file

@ -0,0 +1,356 @@
PEP: 3128
Title: BList: A Faster List-like Type
Version: $Revision$
Last-Modified: $Date$
Author: Daniel Stutzbach <daniel@stutzbachenterprises.com>
Discussions-To: Python 3000 List <python-3000@python.org>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 30-Apr-2007
Python-Version: 2.6 and/or 3.0
Post-History: 30-Apr-2007
Abstract
========
The common case for list operations is on small lists. The current
array-based list implementation excels at small lists due to the
strong locality of reference and infrequency of memory allocation
operations. However, an array takes O(n) time to insert and delete
elements, which can become problematic as the list gets large.
This PEP introduces a new data type, the BList, that has array-like
and tree-like aspects. It enjoys the same good performance on small
lists as the existing array-based implementation, but offers superior
asymptotic performance for most operations. This PEP makes two
mutually exclusive proposals for including the BList type in Python:
1. Add it to the collections module, or
2. Replace the existing list type
Motivation
==========
The BList grew out of the frustration of needing to rewrite intuitive
algorithms that worked fine for small inputs but took O(n**2) time for
large inputs due to the underlying O(n) behavior of array-based lists.
The deque type, introduced in Python 2.4, solved the most common
problem of needing a fast FIFO queue. However, the deque type doesn't
help if we need to repeatedly insert or delete elements from the
middle of a long list.
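The quadratic pattern described above can be sketched as follows. Both
functions are hypothetical illustrations: the first builds a sequence by
repeated insertion at the front of an array-based list, the second does
the same work with a deque.

```python
from collections import deque


def build_with_list(n):
    """Insert n items at the front of a list; each insert shifts all existing items."""
    items = []
    for i in range(n):
        items.insert(0, i)     # O(len(items)) per call, O(n**2) overall
    return items


def build_with_deque(n):
    """Insert n items at the front of a deque; each insert is O(1)."""
    items = deque()
    for i in range(n):
        items.appendleft(i)
    return list(items)
```

The deque removes the cost only at the two ends, which is why it does not
help when the insertions land in the middle of the sequence.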
A wide variety of data structures provide good asymptotic performance
for insertions and deletions, but they either have O(n) performance
for other operations (e.g., linked lists) or have inferior performance
for small lists (e.g., binary trees and skip lists).
The BList type proposed in this PEP is based on the principles of
B+Trees, which have array-like and tree-like aspects. The BList
offers array-like performance on small lists, while offering O(log n)
asymptotic performance for all insert and delete operations.
Additionally, the BList implements copy-on-write under-the-hood, so
even operations like getslice take O(log n) time. The table below
compares the asymptotic performance of the current array-based list
implementation with the asymptotic performance of the BList.
========= ================ ====================
Operation Array-based list BList
========= ================ ====================
Copy      O(n)             **O(1)**
Append    **O(1)**         O(log n)
Insert    O(n)             **O(log n)**
Get Item  **O(1)**         O(log n)
Set Item  **O(1)**         O(log n)
Del Item  O(n)             **O(log n)**
Iteration O(n)             O(n)
Get Slice O(k)             **O(log n)**
Del Slice O(n)             **O(log n)**
Set Slice O(n+k)           **O(log k + log n)**
Extend    O(k)             **O(log k + log n)**
Sort      O(n log n)       O(n log n)
Multiply  O(nk)            **O(log k)**
========= ================ ====================
An extensive empirical comparison of Python's array-based list and the
BList is available at [2]_.
Use Case Trade-offs
===================
The BList offers superior performance for many, but not all,
operations. Choosing the correct data type for a particular use case
depends on which operations are used. Choosing the correct data type
as a built-in depends on balancing the importance of different use
cases and the magnitude of the performance differences.
For the common use cases of small lists, the array-based list and the
BList have similar performance characteristics.
For the slightly less common case of large lists, there are two common
use cases where the existing array-based list outperforms the
existing BList reference implementation. These are:
1. A large LIFO stack, where there are many .append() and .pop(-1)
operations. Each operation is O(1) for an array-based list, but
O(log n) for the BList.
2. A large list that does not change size. The getitem and setitem
calls are O(1) for an array-based list, but O(log n) for the BList.
In performance tests on a 10,000 element list, BLists exhibited a 50%
and 5% increase in execution time for these two use cases,
respectively.
The performance for the LIFO use case could be improved to O(1) time
per operation (O(n) for a sequence of n operations), by caching a
pointer to the right-most leaf within the root node. For lists that do
not change size, the common case of sequential access could likewise be
improved to O(n) total time for n accesses via caching in the root node.
However, the performance of these approaches has not been empirically
tested.
Many operations exhibit a tremendous speed-up (O(n) to O(log n)) when
switching from the array-based list to BLists. In performance tests
on a 10,000 element list, operations such as getslice, setslice, and
FIFO-style insert and deletes on a BList take only 1% of the time
needed on array-based lists.
In light of the large performance speed-ups for many operations, the
small performance costs for some operations will be worthwhile for
many (but not all) applications.
Implementation
==============
The BList is based on the B+Tree data structure. The BList is a wide,
bushy tree where each node contains an array of up to 128 pointers to
its children. If the node is a leaf, its children are the
user-visible objects that the user has placed in the list. If the node is
not a leaf, its children are other BList nodes that are not
user-visible. If the list contains only a few elements, they will all
be children of a single node that is both the root and a leaf. Since
a node is little more than an array of pointers, small lists operate in
effectively the same way as an array-based data type and share the
same good performance characteristics.
The BList maintains a few invariants to ensure good (O(log n))
asymptotic performance regardless of the sequence of insert and delete
operations. The principal invariants are as follows:
1. Each node has at most 128 children.
2. Each non-root node has at least 64 children.
3. The root node has at least 2 children, unless the list contains
fewer than 2 elements.
4. The tree is of uniform depth.
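The four invariants can be expressed as a small checker. This is a sketch
over a hypothetical ``Node`` class, not the data layout of the reference
implementation; only the limits 128 and 64 come from the text.

```python
LIMIT = 128         # maximum children per node (invariant 1)
HALF = LIMIT // 2   # minimum children for a non-root node (invariant 2)


class Node:
    """Hypothetical node: `children` holds items (leaf) or Nodes (non-leaf)."""

    def __init__(self, children, leaf):
        self.children = children
        self.leaf = leaf


def check(node, is_root=True):
    """Return the depth of the subtree; raise ValueError on a broken invariant."""
    count = len(node.children)
    if count > LIMIT:
        raise ValueError("node has more than %d children" % LIMIT)
    if not is_root and count < HALF:
        raise ValueError("non-root node has fewer than %d children" % HALF)
    if is_root and not node.leaf and count < 2:
        raise ValueError("non-leaf root has fewer than 2 children")
    if node.leaf:
        return 1
    depths = {check(child, is_root=False) for child in node.children}
    if len(depths) != 1:
        raise ValueError("tree is not of uniform depth")
    return 1 + depths.pop()
```

A root that is also a leaf may hold any number of items up to the limit,
which is the small-list case described earlier.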
If an insert would cause a node to exceed 128 children, the node
spawns a sibling and transfers half of its children to the sibling.
The sibling is inserted into the node's parent. If the node is the
root node (and thus has no parent), a new parent is created and the
depth of the tree increases by one.
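The overflow rule can be sketched for a single node whose children are
held in a plain Python list. The function ``insert_and_split`` is
hypothetical; linking the returned sibling into the parent, and growing a
new root, are left out of this sketch.

```python
LIMIT = 128         # maximum children per node, as stated in the text
HALF = LIMIT // 2   # minimum children for a non-root node


def insert_and_split(children, index, item):
    """Insert `item`; if the node overflows, move the upper part to a new sibling.

    Returns the sibling's children, or None when no split was needed.
    """
    children.insert(index, item)
    if len(children) <= LIMIT:
        return None
    sibling = children[HALF:]   # the sibling takes everything past the first 64
    del children[HALF:]         # the original node keeps the first 64
    return sibling
```

After a split both nodes hold at least 64 children, so the minimum-fill
invariant still holds for each of them.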
If a deletion would cause a node to have fewer than 64 children, the
node moves elements from one of its siblings if possible. If both of
its siblings also only have 64 children, then two of the nodes merge
and the empty one is removed from its parent. If the root node is
reduced to only one child, its single child becomes the new root
(i.e., the depth of the tree is reduced by one).
In addition to tree-like asymptotic performance and array-like
performance on small lists, BLists support transparent
**copy-on-write**. If a non-root node needs to be copied (as part of
a getslice, copy, setslice, etc.), the node is shared between multiple
parents instead of being copied. If it needs to be modified later, it
will be copied at that time. This is completely behind-the-scenes;
from the user's point of view, the BList works just like a regular
Python list.
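The sharing scheme can be sketched with an explicit count of parents per
node. The classes ``Leaf`` and ``Root`` are hypothetical and far simpler
than a real BList: there is a single level of leaves, and copying here
visits every leaf, so this sketch does not achieve the O(1) copy listed in
the table above. It only shows the copy-before-modify step.

```python
class Leaf:
    """Hypothetical leaf node holding user-visible items."""

    def __init__(self, items):
        self.items = list(items)
        self.refs = 1            # number of parents sharing this leaf


class Root:
    """Hypothetical root node holding references to leaves."""

    def __init__(self, leaves):
        self.leaves = leaves

    def copy(self):
        """Share every leaf with the new root instead of copying its items."""
        for leaf in self.leaves:
            leaf.refs += 1
        return Root(list(self.leaves))

    def set(self, leaf_index, item_index, value):
        """Modify one item, copying the leaf first if another parent shares it."""
        leaf = self.leaves[leaf_index]
        if leaf.refs > 1:
            leaf.refs -= 1
            leaf = Leaf(leaf.items)          # private copy for this parent
            self.leaves[leaf_index] = leaf
        leaf.items[item_index] = value
```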
Memory Usage
============
In the worst case, the leaf nodes of a BList have only 64 children
each, rather than a full 128, meaning that memory usage is around
twice that of a best-case array implementation. Non-leaf nodes use up
a negligible amount of additional memory, since there are at least 63
times as many leaf nodes as non-leaf nodes.
The existing array-based list implementation must grow and shrink as
items are added and removed. To be efficient, it grows and shrinks
only when the list has grown or shrunk exponentially. In the worst
case, it, too, uses twice as much memory as the best case.
In summary, the BList's memory footprint is not significantly
different from the existing array-based implementation.
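The factor of two for the BList can be checked with a short calculation.
The sketch assumes that every leaf reserves all 128 pointer slots
regardless of how many are in use, which is an assumption of this example
rather than a statement about the reference implementation; non-leaf
nodes are ignored, as the text treats them as negligible.

```python
import math

NODE_CAPACITY = 128   # slots per node, from the text
MIN_FILL = 64         # worst-case items per leaf, from the text


def leaf_slots(n_items, items_per_leaf):
    """Pointer slots reserved across all leaves when each holds `items_per_leaf` items."""
    leaves = math.ceil(n_items / items_per_leaf)
    return leaves * NODE_CAPACITY


n = 12800
best = leaf_slots(n, NODE_CAPACITY)   # every leaf completely full
worst = leaf_slots(n, MIN_FILL)       # every leaf at the minimum fill
ratio = worst / best
```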
Backwards Compatibility
=======================
If the BList is added to the collections module, backwards
compatibility is not an issue. This section focuses on the option of
replacing the existing array-based list with the BList. For users of
the Python interpreter, a BList has an identical interface to the
current list-implementation. For virtually all operations, the
behavior is identical, aside from execution speed.
For the C API, BList has a different interface than the existing
list-implementation. Due to its more complex structure, the BList
does not lend itself well to poking and prodding by external sources.
Thankfully, the existing list-implementation defines an API of
functions and macros for accessing data from list objects. Google
Code Search suggests that the majority of third-party modules use the
well-defined API rather than relying on the list's structure
directly. The table below summarizes the search queries and results:
======================== =================
Search String            Number of Results
======================== =================
PyList_GetItem           2,000
PySequence_GetItem       800
PySequence_Fast_GET_ITEM 100
PyList_GET_ITEM          400
\[^a\-zA\-Z\_\]ob_item   100
======================== =================
The transition to the BList can be achieved in one of two ways:
1. Redefine the various accessor functions and macros in listobject.h
to access a BList instead. The interface would be unchanged. The
functions can easily be redefined. The macros need a bit more care
and would have to resort to function calls for large lists.
The macros would need to evaluate their arguments more than once,
which could be a problem if the arguments have side effects. A
Google Code Search for "PyList_GET_ITEM\(\[^)\]+\(" found only a
handful of cases where this occurs, so the impact appears to be
low.
The few extension modules that use list's undocumented structure
directly, instead of using the API, would break. The core code
itself uses the accessor macros fairly consistently and should be
easy to port.
2. Deprecate the existing list type, but continue to include it.
Extension modules wishing to use the new BList type must do so
explicitly. The BList C interface can be changed to match the
existing PyList interface so that a simple search-replace will be
sufficient for 99% of module writers.
Existing modules would continue to compile and work without change,
but they would need to make a deliberate (but small) effort to
migrate to the BList.
The downside of this approach is that mixing modules that use
BLists and array-based lists might lead to slow down if conversions
are frequently necessary.
Reference Implementation
========================
A reference implementation of the BList is available for CPython at [1]_.
The source package also includes a pure Python implementation,
originally developed as a prototype for the CPython version.
Naturally, the pure Python version is rather slow and the asymptotic
improvements don't win out until the list is quite large.
When compiled with Py_DEBUG, the C implementation checks the
BList invariants when entering and exiting most functions.
An extensive set of test cases is also included in the source package.
The test cases include the existing Python sequence and list test
cases as a subset. When the interpreter is built with Py_DEBUG, the
test cases also check for reference leaks.
Porting to Other Python Variants
--------------------------------
If the BList is added to the collections module, other Python variants
can support it in one of three ways:
1. Make blist an alias for list. The asymptotic performance won't be
as good, but it'll work.
2. Use the pure Python reference implementation. The performance for
small lists won't be as good, but it'll work.
3. Port the reference implementation.
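The first option can also be sketched from the point of view of portable
user code. The import path ``from blist import blist`` is an assumption
about how the extension module in [1]_ exposes the type; when the module
is unavailable, the name falls back to the built-in list.

```python
try:
    from blist import blist   # assumed import path for the extension module
except ImportError:
    blist = list              # alias: same interface, array-based performance

# The calling code is identical under either definition.
items = blist(range(5))
items.insert(2, "x")
del items[0]
result = list(items)
```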
Discussion
==========
This proposal has been discussed briefly on the Python-3000 mailing
list [3]_. Although a number of people favored the proposal, there
were also some objections. The lists below summarize the pros and cons as
observed by posters to the thread.
General comments:
- Pro: Will outperform the array-based list in most cases
- Pro: "I've implemented variants of this ... a few different times"
- Con: Desirability and performance in actual applications is unproven
Comments on adding BList to the collections module:
- Pro: Matching the list-API reduces the learning curve to near-zero
- Pro: Useful for intermediate-level users; won't get in the way of beginners
- Con: Proliferation of data types makes the choices for developers harder.
Comments on replacing the array-based list with the BList:
- Con: Impact on extension modules (addressed in `Backwards
Compatibility`_)
- Con: The use cases where BLists are slower are important
(see `Use Case Trade-Offs`_ for how these might be addressed).
- Con: The array-based list code is simple and easy to maintain
To assess the desirability and performance in actual applications,
Raymond Hettinger suggested releasing the BList as an extension module
(now available at [1]_). If it proves useful, he felt it would be a
strong candidate for inclusion in 2.6 as part of the collections
module. If widely popular, then it could be considered for replacing
the array-based list, but not otherwise.
Guido van Rossum commented that he opposed the proliferation of data
types, but favored replacing the array-based list if backwards
compatibility could be addressed and the BList's performance was
uniformly better.
On-going Tasks
==============
- Reduce the memory footprint of small lists
- Implement TimSort for BLists, so that best-case sorting is O(n)
instead of O(n log n).
- Implement __reversed__
- Cache a pointer in the root to the rightmost leaf, to make each LIFO
operation O(1) time.
References
==========
.. [1] Reference Implementations for C and Python:
http://www.python.org/pypi/blist/
.. [2] Empirical performance comparison between Python's array-based
list and the blist: http://stutzbachenterprises.com/blist/
.. [3] Discussion on python-3000 starting at post:
http://mail.python.org/pipermail/python-3000/2007-April/006757.html
Copyright
=========
This document has been placed in the public domain.
..
Local Variables:
mode: indented-text
indent-tabs-mode: nil
sentence-end-double-space: t
fill-column: 70
coding: utf-8
End: