From 0129f4bde861c267cd5b0e6d59812b382a4e6816 Mon Sep 17 00:00:00 2001 From: Guido van Rossum Date: Tue, 1 May 2007 16:57:09 +0000 Subject: [PATCH] PEP 3127: Integer Literal Support and Syntax Maupin PEP 3128: BList: A Faster List-like Type Stutzbach --- pep-0000.txt | 2 + pep-3127.txt | 518 +++++++++++++++++++++++++++++++++++++++++++++++++++ pep-3128.txt | 356 +++++++++++++++++++++++++++++++++++ 3 files changed, 876 insertions(+) create mode 100644 pep-3127.txt create mode 100644 pep-3128.txt diff --git a/pep-0000.txt b/pep-0000.txt index 9b3de2e67..3c19e04ca 100644 --- a/pep-0000.txt +++ b/pep-0000.txt @@ -126,6 +126,7 @@ Index by Category S 3125 Remove Backslash Continuation Jewett S 3126 Remove Implicit String Concatenation Jewett S 3127 Integer Literal Support and Syntax Maupin + S 3128 BList: A Faster List-like Type Stutzbach S 3141 A Type Hierarchy for Numbers Yasskin Finished PEPs (done, implemented in Subversion) @@ -494,6 +495,7 @@ Numerical Index S 3125 Remove Backslash Continuation Jewett S 3126 Remove Implicit String Concatenation Jewett S 3127 Integer Literal Support and Syntax Maupin + S 3128 BList: A Faster List-like Type Stutzbach S 3141 A Type Hierarchy for Numbers Yasskin diff --git a/pep-3127.txt b/pep-3127.txt new file mode 100644 index 000000000..03aff9be7 --- /dev/null +++ b/pep-3127.txt @@ -0,0 +1,518 @@ +PEP: 3127 +Title: Integer Literal Support and Syntax +Version: $Revision$ +Last-Modified: $Date$ +Author: Patrick Maupin +Discussions-To: Python-3000@python.org +Status: Draft +Type: Standards Track +Python-Version: 3.0 +Content-Type: text/x-rst +Created: 14-Mar-2007 +Post-History: 18-Mar-2007 + + +Abstract +======== + +This PEP proposes changes to the Python core to rationalize +the treatment of string literal representations of integers +in different radices (bases). These changes are targeted at +Python 3.0, but the backward-compatible parts of the changes +should be added to Python 2.6, so that all valid 3.0 integer +literals will also be valid in 2.6. + +The proposal is that: + + a) octal literals must now be specified + with a leading "0o" or "0O" instead of "0"; + + b) binary literals are now supported via a + leading "0b" or "0B"; and + + c) provision will be made for binary numbers in + string formatting. + + +Motivation +========== + +This PEP was motivated by two different issues: + + - The default octal representation of integers is silently confusing + to people unfamiliar with C-like languages. It is extremely easy + to inadvertently create an integer object with the wrong value, + because '013' means 'decimal 11', not 'decimal 13', to the Python + language itself, which is not the meaning that most humans would + assign to this literal. + + - Some Python users have a strong desire for binary support in + the language. + + +Specification +============= + +Grammar specification +--------------------- + +The grammar will be changed. For Python 2.6, the changed and +new token definitions will be:: + + integer ::= decimalinteger | octinteger | hexinteger | + bininteger | oldoctinteger + + octinteger ::= "0" ("o" | "O") octdigit+ + + bininteger ::= "0" ("b" | "B") bindigit+ + + oldoctinteger ::= "0" octdigit+ + + bindigit ::= "0" | "1" + +For Python 3.0, "oldoctinteger" will not be supported, and +an exception will be raised if a literal has a leading "0" and +a second character which is a digit. + +For both versions, this will require changes to PyLong_FromString +as well as the grammar. + +The documentation will have to be changed as well: grammar.txt, +as well as the integer literal section of the reference manual. + +PEP 306 should be checked for other issues, and that PEP should +be updated if the procedure described therein is insufficient. + +int() specification +-------------------- + +int(s, 0) will also match the new grammar definition. + +This should happen automatically with the changes to +PyLong_FromString required for the grammar change. + +Also the documentation for int() should be changed to explain +that int(s) operates identically to int(s, 10), and the word +"guess" should be removed from the description of int(s, 0). + +long() specification +-------------------- + +For Python 2.6, the long() implementation and documentation +should be changed to reflect the new grammar. + +Tokenizer exception handling +---------------------------- + +If an invalid token contains a leading "0", the exception +error message should be more informative than the current +"SyntaxError: invalid token". It should explain that decimal +numbers may not have a leading zero, and that octal numbers +require an "o" after the leading zero. + +int() exception handling +------------------------ + +The ValueError raised for any call to int() with a string +should at least explicitly contain the base in the error +message, e.g.:: + + ValueError: invalid literal for base 8 int(): 09 + +oct() function +--------------- + +oct() should be updated to output '0o' in front of +the octal digits (for 3.0, and 2.6 compatibility mode). + +Output formatting +----------------- + +The string (and unicode in 2.6) % operator will have +'b' format specifier added for binary, and the alternate +syntax of the 'o' option will need to be updated to +add '0o' in front, instead of '0'. + +PEP 3101 already supports 'b' for binary output. + + +Transition from 2.6 to 3.0 +--------------------------- + +The 2to3 translator will have to insert 'o' into any +octal string literal. + +The Py3K compatible option to Python 2.6 should cause +attempts to use oldoctinteger literals to raise an +exception. + + +Rationale +========= + +Most of the discussion on these issues occurred on the Python-3000 +mailing list starting 14-Mar-2007, prompted by an observation that +the average human being would be completely mystified upon finding +that prepending a "0" to a string of digits changes the meaning of +that digit string entirely. + +It was pointed out during this discussion that a similar, but shorter, +discussion on the subject occurred in January of 2006, prompted by a +discovery of the same issue. + +Background +---------- + +For historical reasons, Python's string representation of integers +in different bases (radices), for string formatting and token +literals, borrows heavily from C. [1]_ [2]_ Usage has shown that +the historical method of specifying an octal number is confusing, +and also that it would be nice to have additional support for binary +literals. + +Throughout this document, unless otherwise noted, discussions about +the string representation of integers relate to these features: + + - Literal integer tokens, as used by normal module compilation, + by eval(), and by int(token, 0). (int(token) and int(token, 2-36) + are not modified by this proposal.) + + * Under 2.6, long() is treated the same as int() + + - Formatting of integers into strings, either via the % string + operator or the new PEP 3101 advanced string formatting method. + +It is presumed that: + + - All of these features should have an identical set + of supported radices, for consistency. + + - Python source code syntax and int(mystring, 0) should + continue to share identical behavior. + + +Removal of old octal syntax +---------------------------- + +This PEP proposes that the ability to specify an octal number by +using a leading zero will be removed from the language in Python 3.0 +(and the Python 3.0 preview mode of 2.6), and that a SyntaxError will +be raised whenever a leading "0" is immediately followed by another +digit. + +During the present discussion, it was almost universally agreed that:: + + eval('010') == 8 + +should no longer be true, because that is confusing to new users. +It was also proposed that:: + + eval('0010') == 10 + +should become true, but that is much more contentious, because it is so +inconsistent with usage in other computer languages that mistakes are +likely to be made. + +Almost all currently popular computer languages, including C/C++, +Java, Perl, and JavaScript, treat a sequence of digits with a +leading zero as an octal number. Proponents of treating these +numbers as decimal instead have a very valid point -- as discussed +in `Supported radices`_, below, the entire non-computer world uses +decimal numbers almost exclusively. There is ample anecdotal +evidence that many people are dismayed and confused if they +are confronted with non-decimal radices. + +However, in most situations, most people do not write gratuitous +zeros in front of their decimal numbers. The primary exception is +when an attempt is being made to line up columns of numbers. But +since PEP 8 specifically discourages the use of spaces to try to +align Python code, one would suspect the same argument should apply +to the use of leading zeros for the same purpose. + +Finally, although the email discussion often focused on whether anybody +actually *uses* octal any more, and whether we should cater to those +old-timers in any case, that is almost entirely besides the point. + +Assume the rare complete newcomer to computing who *does*, either +occasionally or as a matter of habit, use leading zeros for decimal +numbers. Python could either: + + a) silently do the wrong thing with his numbers, as it does now; + + b) immediately disabuse him of the notion that this is viable syntax + (and yes, the SyntaxWarning should be more gentle than it + currently is, but that is a subject for a different PEP); or + + c) let him continue to think that computers are happy with + multi-digit decimal integers which start with "0". + +Some people passionately believe that (c) is the correct answer, +and they would be absolutely right if we could be sure that new +users will never blossom and grow and start writing AJAX applications. + +So while a new Python user may (currently) be mystified at the +delayed discovery that his numbers don't work properly, we can +fix it by explaining to him immediately that Python doesn't like +leading zeros (hopefully with a reasonable message!), or we can +delegate this teaching experience to the JavaScript interpreter +in the Internet Explorer browser, and let him try to debug his +issue there. + +Supported radices +----------------- + +This PEP proposes that the supported radices for the Python +language will be 2, 8, 10, and 16. + +Once it is agreed that the old syntax for octal (radix 8) representation +of integers must be removed from the language, the next obvious +question is "Do we actually need a way to specify (and display) +numbers in octal?" + +This question is quickly followed by "What radices does the language +need to support?" Because computers are so adept at doing what you +tell them to, a tempting answer in the discussion was "all of them." +This answer has obviously been given before -- the int() constructor +will accept an explicit radix with a value between 2 and 36, inclusive, +with the latter number bearing a suspicious arithmetic similarity to +the sum of the number of numeric digits and the number of same-case +letters in the ASCII alphabet. + +But the best argument for inclusion will have a use-case to back +it up, so the idea of supporting all radices was quickly rejected, +and the only radices left with any real support were decimal, +hexadecimal, octal, and binary. + +Just because a particular radix has a vocal supporter on the +mailing list does not mean that it really should be in the +language, so the rest of this section is a treatise on the +utility of these particular radices, vs. other possible choices. + +Humans use other numeric bases constantly. If I tell you that +it is 12:30 PM, I have communicated quantitative information +arguably composed of *three* separate bases (12, 60, and 2), +only one of which is in the "agreed" list above. But the +*communication* of that information used two decimal digits +each for the base 12 and base 60 information, and, perversely, +two letters for information which could have fit in a single +decimal digit. + +So, in general, humans communicate "normal" (non-computer) +numerical information either via names (AM, PM, January, ...) +or via use of decimal notation. Obviously, names are +seldom used for large sets of items, so decimal is used for +everything else. There are studies which attempt to explain +why this is so, typically reaching the expected conclusion +that the Arabic numeral system is well-suited to human +cognition. [3]_ + +There is even support in the history of the design of +computers to indicate that decimal notation is the correct +way for computers to communicate with humans. One of +the first modern computers, ENIAC [4]_ computed in decimal, +even though there were already existing computers which +operated in binary. + +Decimal computer operation was important enough +that many computers, including the ubiquitous PC, have +instructions designed to operate on "binary coded decimal" +(BCD) [5]_ , a representation which devotes 4 bits to each +decimal digit. These instructions date from a time when the +most strenuous calculations ever performed on many numbers +were the calculations actually required to perform textual +I/O with them. It is possible to display BCD without having +to perform a divide/remainder operation on every displayed +digit, and this was a huge computational win when most +hardware didn't have fast divide capability. Another factor +contributing to the use of BCD is that, with BCD calculations, +rounding will happen exactly the same way that a human would +do it, so BCD is still sometimes used in fields like finance, +despite the computational and storage superiority of binary. + +So, if it weren't for the fact that computers themselves +normally use binary for efficient computation and data +storage, string representations of integers would probably +always be in decimal. + +Unfortunately, computer hardware doesn't think like humans, +so programmers and hardware engineers must often resort to +thinking like the computer, which means that it is important +for Python to have the ability to communicate binary data +in a form that is understandable to humans. + +The requirement that the binary data notation must be cognitively +easy for humans to process means that it should contain an integral +number of binary digits (bits) per symbol, while otherwise +conforming quite closely to the standard tried-and-true decimal +notation (position indicates power, larger magnitude on the left, +not too many symbols in the alphabet, etc.). + +The obvious "sweet spot" for this binary data notation is +thus octal, which packs the largest integral number of bits +possible into a single symbol chosen from the Arabic numeral +alphabet. + +In fact, some computer architectures, such as the PDP8 and the +8080/Z80, were defined in terms of octal, in the sense of arranging +the bitfields of instructions in groups of three, and using +octal representations to describe the instruction set. + +Even today, octal is important because of bit-packed structures +which consist of 3 bits per field, such as Unix file permission +masks. + +But octal has a drawback when used for larger numbers. The +number of bits per symbol, while integral, is not itself +a power of two. This limitation (given that the word size +of most computers these days is a power of two) has resulted +in hexadecimal, which is more popular than octal despite the +fact that it requires a 60% larger alphabet than decimal, +because each symbol contains 4 bits. + +Some numbers, such as Unix file permission masks, are easily +decoded by humans when represented in octal, but difficult to +decode in hexadecimal, while other numbers are much easier for +humans to handle in hexadecimal. + +Unfortunately, there are also binary numbers used in computers +which are not very well communicated in either hexadecimal or +octal. Thankfully, fewer people have to deal with these on a +regular basis, but on the other hand, this means that several +people on the discussion list questioned the wisdom of adding +a straight binary representation to Python. + +One example of where these numbers is very useful is in +reading and writing hardware registers. Sometimes hardware +designers will eschew human readability and opt for address +space efficiency, by packing multiple bit fields into a single +hardware register at unaligned bit locations, and it is tedious +and error-prone for a human to reconstruct a 5 bit field which +consists of the upper 3 bits of one hex digit, and the lower 2 +bits of the next hex digit. + +Even if the ability of Python to communicate binary information +to humans is only useful for a small technical subset of the +population, it is exactly that population subset which contains +most, if not all, members of the Python core team, so even straight +binary, the least useful of these notations, has several enthusiastic +supporters and few, if any, staunch opponents, among the Python community. + +Syntax for supported radices +----------------------------- + +This proposal is to to use a "0o" prefix with either uppercase +or lowercase "o" for octal, and a "0b" prefix with either +uppercase or lowercase "b" for binary. + +There was strong support for not supporting uppercase, but +this is a separate subject for a different PEP, as 'j' for +complex numbers, 'e' for exponent, and 'r' for raw string +(to name a few) already support uppercase. + +The syntax for delimiting the different radices received a lot of +attention in the discussion on Python-3000. There are several +(sometimes conflicting) requirements and "nice-to-haves" for +this syntax: + + - It should be as compatible with other languages and + previous versions of Python as is reasonable, both + for the input syntax and for the output (e.g. string + % operator) syntax. + + - It should be as obvious to the casual observer as + possible. + + - It should be easy to visually distinguish integers + formatted in the different bases. + + +Proposed syntaxes included things like arbitrary radix prefixes, +such as 16r100 (256 in hexadecimal), and radix suffixes, similar +to the 100h assembler-style suffix. The debate on whether the +letter "O" could be used for octal was intense -- an uppercase +"O" looks suspiciously similar to a zero in some fonts. Suggestions +were made to use a "c" (the second letter of "oCtal"), or even +to use a "t" for "ocTal" and an "n" for "biNary" to go along +with the "x" for "heXadecimal". + +For the string % operator, "o" was already being used to denote +octal, and "b" was not used for anything, so this works out +much better than, for example, using "c" (which means "character" +for the % operator). + +At the end of the day, since uppercase "O" can look like a zero +and uppercase "B" can look like an 8, it was decided that these +prefixes should be lowercase only, but, like 'r' for raw string, +that can be a preference or style-guide issue. + +Open Issues +=========== + +It was suggested in the discussion that lowercase should be used +for all numeric and string special modifiers, such as 'x' for +hexadecimal, 'r' for raw strings, 'e' for exponentiation, and +'j' for complex numbers. This is an issue for a separate PEP. + +This PEP takes no position on uppercase or lowercase for input, +just noting that, for consistency, if uppercase is not to be +removed from input parsing for other letters, it should be +added for octal and binary, and documenting the changes under +this assumption, as there is not yet a PEP about the case issue. + +Output formatting may be a different story -- there is already +ample precedence for case sensitivity in the output format string, +and there would need to be a consensus that there is a valid +use-case for the "alternate form" of the string % operator +to support uppercase 'B' or 'O' characters for binary or +octal output. Currently, PEP3101 does not even support this +alternate capability, and the hex() function does not allow +the programmer to specify the case of the 'x' character. + +There are still some strong feelings that '0123' should be +allowed as a literal decimal in Python 3.0. If this is the +right thing to do, this can easily be covered in an additional +PEP. This proposal only takes the first step of making '0123' +not be a valid octal number, for reasons covered in the rationale. + +Is there (or should there be) an option for the 2to3 translator +which only makes the 2.6 compatible changes? Should this be +run on 2.6 library code before the 2.6 release? + +Should a bin() function which matches hex() and oct() be added? + +Is hex() really that useful once we have advanced string formatting? + + +References +========== + +.. [1] GNU libc manual printf integer format conversions + (http://www.gnu.org/software/libc/manual/html_node/Integer-Conversions.html) + +.. [2] Python string formatting operations + (http://docs.python.org/lib/typesseq-strings.html) + +.. [3] The Representation of Numbers, Jiajie Zhang and Donald A. Norman + (http://acad88.sahs.uth.tmc.edu/research/publications/Number-Representation.pdf) + +.. [4] ENIAC page at wikipedia + (http://en.wikipedia.org/wiki/ENIAC) + +.. [5] BCD page at wikipedia + (http://en.wikipedia.org/wiki/Binary-coded_decimal) + +Copyright +========= + +This document has been placed in the public domain. + + + +.. + Local Variables: + mode: indented-text + indent-tabs-mode: nil + sentence-end-double-space: t + fill-column: 70 + coding: utf-8 + End: diff --git a/pep-3128.txt b/pep-3128.txt new file mode 100644 index 000000000..14a0042f3 --- /dev/null +++ b/pep-3128.txt @@ -0,0 +1,356 @@ +PEP: 3128 +Title: BList: A Faster List-like Type +Version: $Revision$ +Last-Modified: $Date$ +Author: Daniel Stutzbach +Discussions-To: Python 3000 List +Status: Draft +Type: Standards Track +Content-Type: text/x-rst +Created: 30-Apr-2007 +Python-Version: 2.6 and/or 3.0 +Post-History: 30-Apr-2007 + + +Abstract +======== + +The common case for list operations is on small lists. The current +array-based list implementation excels at small lists due to the +strong locality of reference and infrequency of memory allocation +operations. However, an array takes O(n) time to insert and delete +elements, which can become problematic as the list gets large. + +This PEP introduces a new data type, the BList, that has array-like +and tree-like aspects. It enjoys the same good performance on small +lists as the existing array-based implementation, but offers superior +asymptotic performance for most operations. This PEP proposes +replacing the makes two mutually exclusive proposals for including the +BList type in Python: + +1. Add it to the collections module, or +2. Replace the existing list type + + +Motivation +========== + +The BList grew out of the frustration of needing to rewrite intuitive +algorithms that worked fine for small inputs but took O(n**2) time for +large inputs due to the underlying O(n) behavior of array-based lists. +The deque type, introduced in Python 2.4, solved the most common +problem of needing a fast FIFO queue. However, the deque type doesn't +help if we need to repeatedly insert or delete elements from the +middle of a long list. + +A wide variety of data structure provide good asymptotic performance +for insertions and deletions, but they either have O(n) performance +for other operations (e.g., linked lists) or have inferior performance +for small lists (e.g., binary trees and skip lists). + +The BList type proposed in this PEP is based on the principles of +B+Trees, which have array-like and tree-like aspects. The BList +offers array-like performance on small lists, while offering O(log n) +asymptotic performance for all insert and delete operations. +Additionally, the BList implements copy-on-write under-the-hood, so +even operations like getslice take O(log n) time. The table below +compares the asymptotic performance of the current array-based list +implementation with the asymptotic performance of the BList. + +========= ================ ==================== +Operation Array-based list BList +========= ================ ==================== +Copy O(n) **O(1)** +Append **O(1)** O(log n) +Insert O(n) **O(log n)** +Get Item **O(1)** O(log n) +Set Item **O(1)** **O(log n)** +Del Item O(n) **O(log n)** +Iteration O(n) O(n) +Get Slice O(k) **O(log n)** +Del Slice O(n) **O(log n)** +Set Slice O(n+k) **O(log k + log n)** +Extend O(k) **O(log k + log n)** +Sort O(n log n) O(n log n) +Multiply O(nk) **O(log k)** +========= ================ ==================== + +An extensive empirical comparison of Python's array-based list and the +BList are available at [2]_. + +Use Case Trade-offs +=================== + +The BList offers superior performance for many, but not all, +operations. Choosing the correct data type for a particular use case +depends on which operations are used. Choosing the correct data type +as a built-in depends on balancing the importance of different use +cases and the magnitude of the performance differences. + +For the common uses cases of small lists, the array-based list and the +BList have similar performance characteristics. + +For the slightly less common case of large lists, there are two common +uses cases where the existing array-based list outperforms the +existing BList reference implementation. These are: + +1. A large LIFO stack, where there are many .append() and .pop(-1) + operations. Each operation is O(1) for an array-based list, but + O(log n) for the BList. + +2. A large list that does not change size. The getitem and setitem + calls are O(1) for an array-based list, but O(log n) for the BList. + +In performance tests on a 10,000 element list, BLists exhibited a 50% +and 5% increase in execution time for these two uses cases, +respectively. + +The performance for the LIFO use case could be improved to O(n) time, +by caching a pointer to the right-most leaf within the root node. For +lists that do not change size, the common case of sequential access +could also be improved to O(n) time via caching in the root node. +However, the performance of these approaches has not been empirically +tested. + +Many operations exhibit a tremendous speed-up (O(n) to O(log n)) when +switching from the array-based list to BLists. In performance tests +on a 10,000 element list, operations such as getslice, setslice, and +FIFO-style insert and deletes on a BList take only 1% of the time +needed on array-based lists. + +In light of the large performance speed-ups for many operations, the +small performance costs for some operations will be worthwhile for +many (but not all) applications. + +Implementation +============== + +The BList is based on the B+Tree data structure. The BList is a wide, +bushy tree where each node contains an array of up to 128 pointers to +its children. If the node is a leaf, its children are the +user-visible objects that the user has placed in the list. If node is +not a leaf, its children are other BList nodes that are not +user-visible. If the list contains only a few elements, they will all +be a children of single node that is both the root and a leaf. Since +a node is little more than array of pointers, small lists operate in +effectively the same way as an array-based data type and share the +same good performance characteristics. + +The BList maintains a few invariants to ensure good (O(log n)) +asymptotic performance regardless of the sequence of insert and delete +operations. The principle invariants are as follows: + +1. Each node has at most 128 children. +2. Each non-root node has at least 64 children. +3. The root node has at least 2 children, unless the list contains + fewer than 2 elements. +4. The tree is of uniform depth. + +If an insert would cause a node to exceed 128 children, the node +spawns a sibling and transfers half of its children to the sibling. +The sibling is inserted into the node's parent. If the node is the +root node (and thus has no parent), a new parent is created and the +depth of the tree increases by one. + +If a deletion would cause a node to have fewer than 64 children, the +node moves elements from one of its siblings if possible. If both of +its siblings also only have 64 children, then two of the nodes merge +and the empty one is removed from its parent. If the root node is +reduced to only one child, its single child becomes the new root +(i.e., the depth of the tree is reduced by one). + +In addition to tree-like asymptotic performance and array-like +performance on small-lists, BLists support transparent +**copy-on-write**. If a non-root node needs to be copied (as part of +a getslice, copy, setslice, etc.), the node is shared between multiple +parents instead of being copied. If it needs to be modified later, it +will be copied at that time. This is completely behind-the-scenes; +from the user's point of view, the BList works just like a regular +Python list. + +Memory Usage +============ + +In the worst case, the leaf nodes of a BList have only 64 children +each, rather than a full 128, meaning that memory usage is around +twice that of a best-case array implementation. Non-leaf nodes use up +a negligible amount of additional memory, since there are at least 63 +times as many leaf nodes as non-leaf nodes. + +The existing array-based list implementation must grow and shrink as +items are added and removed. To be efficient, it grows and shrinks +only when the list has grow or shrunk exponentially. In the worst +case, it, too, uses twice as much memory as the best case. + +In summary, the BList's memory footprint is not significantly +different from the existing array-based implementation. + +Backwards Compatibility +======================= + +If the BList is added to the collections module, backwards +compatibility is not an issue. This section focuses on the option of +replacing the existing array-based list with the BList. For users of +the Python interpreter, a BList has an identical interface to the +current list-implementation. For virtually all operations, the +behavior is identical, aside from execution speed. + +For the C API, BList has a different interface than the existing +list-implementation. Due to its more complex structure, the BList +does not lend itself well to poking and prodding by external sources. +Thankfully, the existing list-implementation defines an API of +functions and macros for accessing data from list objects. Google +Code Search suggests that the majority of third-party modules uses the +well-defined API rather than relying on the list's structure +directly. The table below summarizes the search queries and results: + +======================== ================= +Search String Number of Results +======================== ================= +PyList_GetItem 2,000 +PySequence_GetItem 800 +PySequence_Fast_GET_ITEM 100 +PyList_GET_ITEM 400 +\[^a\-zA\-Z\_\]ob_item 100 +======================== ================= + + +This can be achieved in one of two ways: + +1. Redefine the various accessor functions and macros in listobject.h + to access a BList instead. The interface would be unchanged. The + functions can easily be redefined. The macros need a bit more care + and would have to resort to function calls for large lists. + + The macros would need to evaluate their arguments more than once, + which could be a problem if the arguments have side effects. A + Google Code Search for "PyList_GET_ITEM\(\[^)\]+\(" found only a + handful of cases where this occurs, so the impact appears to be + low. + + The few extension modules that use list's undocumented structure + directly, instead of using the API, would break. The core code + itself uses the accessor macros fairly consistently and should be + easy to port. + +2. Deprecate the existing list type, but continue to include it. + Extension modules wishing to use the new BList type must do so + explicitly. The BList C interface can be changed to match the + existing PyList interface so that a simple search-replace will be + sufficient for 99% of module writers. + + Existing modules would continue to compile and work without change, + but they would need to make a deliberate (but small) effort to + migrate to the BList. + + The downside of this approach is that mixing modules that use + BLists and array-based lists might lead to slow down if conversions + are frequently necessary. + +Reference Implementation +======================== + +A reference implementations of the BList is available for CPython at [1]_. + +The source package also includes a pure Python implementation, +originally developed as a prototype for the CPython version. +Naturally, the pure Python version is rather slow and the asymptotic +improvements don't win out until the list is quite large. + +When compiled with Py_DEBUG, the C implementation checks the +BList invariants when entering and exiting most functions. + +An extensive set of test cases is also included in the source package. +The test cases include the existing Python sequence and list test +cases as a subset. When the interpreter is built with Py_DEBUG, the +test cases also check for reference leaks. + +Porting to Other Python Variants +-------------------------------- + +If the BList is added to the collections module, other Python variants +can support it in one of three ways: + +1. Make blist an alias for list. The asymptotic performance won't be + as good, but it'll work. +2. Use the pure Python reference implementation. The performance for + small lists won't be as good, but it'll work. +3. Port the reference implementation. + +Discussion +========== + +This proposal has been discussed briefly on the Python-3000 mailing +list [3]_. Although a number of people favored the proposal, there +were also some objections. Below summarizes the pros and cons as +observed by posters to the thread. + +General comments: + +- Pro: Will outperform the array-based list in most cases +- Pro: "I've implemented variants of this ... a few different times" +- Con: Desirability and performance in actual applications is unproven + +Comments on adding BList to the collections module: + +- Pro: Matching the list-API reduces the learning curve to near-zero +- Pro: Useful for intermediate-level users; won't get in the way of beginners +- Con: Proliferation of data types makes the choices for developers harder. + +Comments on replacing the array-based list with the BList: + +- Con: Impact on extension modules (addressed in `Backwards + Compatibility`_) +- Con: The use cases where BLists are slower are important + (see `Use Case Trade-Offs`_ for how these might be addressed). +- Con: The array-based list code is simple and easy to maintain + +To assess the desirability and performance in actual applications, +Raymond Hettinger suggested releasing the BList as an extension module +(now available at [1]_). If it proves useful, he felt it would be a +strong candidate for inclusion in 2.6 as part of the collections +module. If widely popular, then it could be considered for replacing +the array-based list, but not otherwise. + +Guido van Rossum commented that he opposed the proliferation of data +types, but favored replacing the array-based list if backwards +compatibility could be addressed and the BList's performance was +uniformly better. + +On-going Tasks +============== + +- Reduce the memory footprint of small lists +- Implement TimSort for BLists, so that best-case sorting is O(n) + instead of O(log n). +- Implement __reversed__ +- Cache a pointer in the root to the rightmost leaf, to make LIFO + operation O(n) time. + +References +========== + +.. [1] Reference Implementations for C and Python: + http://www.python.org/pypi/blist/ + +.. [2] Empirical performance comparison between Python's array-based + list and the blist: http://stutzbachenterprises.com/blist/ + +.. [3] Discussion on python-3000 starting at post: + http://mail.python.org/pipermail/python-3000/2007-April/006757.html + +Copyright +========= + +This document has been placed in the public domain. + + + +.. + Local Variables: + mode: indented-text + indent-tabs-mode: nil + sentence-end-double-space: t + fill-column: 70 + coding: utf-8 + End: