python-peps/pep-0372.txt

325 lines
12 KiB
Plaintext

PEP: 372
Title: Adding an ordered dictionary to collections
Version: $Revision$
Last-Modified: $Date$
Author: Armin Ronacher <armin.ronacher@active-4.com>
Raymond Hettinger <python@rcn.com>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 15-Jun-2008
Python-Version: 2.7, 3.1
Post-History:
Abstract
========
This PEP proposes an ordered dictionary as a new data structure for
the ``collections`` module, called "odict" in this PEP for short. The
proposed API incorporates the experiences gained from working with
similar implementations that exist in various real-world applications
and other programming languages.
Rationale
=========
In current Python versions, the widely used built-in dict type does
not specify an order for the key/value pairs stored. This makes it
hard to use dictionaries as data storage for some specific use cases.
Some dynamic programming languages like PHP and Ruby 1.9 guarantee a
certain order on iteration. In those languages, and existing Python
ordered-dict implementations, the ordering of items is defined by the
time of insertion of the key. New keys are appended at the end, but
keys that are overwritten are not moved to the end.
The following example shows the behavior for simple assignments:
>>> d = odict()
>>> d['parrot'] = 'dead'
>>> d['penguin'] = 'exploded'
>>> d.items()
[('parrot', 'dead'), ('penguin', 'exploded')]
That the ordering is preserved makes an odict useful for a couple of
situations:
- XML/HTML processing libraries currently drop the ordering of
attributes, use a list instead of a dict which makes filtering
cumbersome, or implement their own ordered dictionary. This affects
ElementTree, html5lib, Genshi and many more libraries.
- There are many ordered dict implementations in various libraries
and applications, most of them subtly incompatible with each other.
Furthermore, subclassing dict is a non-trivial task and many
implementations don't override all the methods properly which can
lead to unexpected results.
Additionally, many ordered dicts are implemented in an inefficient
way, making many operations more complex then they have to be.
- PEP 3115 allows metaclasses to change the mapping object used for
the class body. An ordered dict could be used to create ordered
member declarations similar to C structs. This could be useful, for
example, for future ``ctypes`` releases as well as ORMs that define
database tables as classes, like the one the Django framework ships.
Django currently uses an ugly hack to restore the ordering of
members in database models.
- The RawConfigParser class accepts a ``dict_type`` argument that
allows an application to set the type of dictionary used internally.
The motivation for this addition was expressly to allow users to
provide an ordered dictionary. [1]_
- Code ported from other programming languages such as PHP often
depends on an ordered dict. Having an implementation of an
ordering-preserving dictionary in the standard library could ease
the transition and improve the compatibility of different libraries.
Ordered Dict API
================
The ordered dict API would be mostly compatible with dict and existing
ordered dicts. Note: this PEP refers to the 2.7 and 3.0 dictionary
API as described in collections.Mapping abstract base class.
The constructor and ``update()`` both accept iterables of tuples as
well as mappings like a dict does. Unlike a regular dictionary,
the insertion order is preserved.
>>> d = odict([('a', 'b'), ('c', 'd')])
>>> d.update({'foo': 'bar'})
>>> d
collections.odict([('a', 'b'), ('c', 'd'), ('foo', 'bar')])
If ordered dicts are updated from regular dicts, the ordering of new
keys is of course undefined.
All iteration methods as well as ``keys()``, ``values()`` and
``items()`` return the values ordered by the time the key was
first inserted:
>>> d['spam'] = 'eggs'
>>> d.keys()
['a', 'c', 'foo', 'spam']
>>> d.values()
['b', 'd', 'bar', 'eggs']
>>> d.items()
[('a', 'b'), ('c', 'd'), ('foo', 'bar'), ('spam', 'eggs')]
New methods not available on dict:
``odict.__reversed__()``
Supports reverse iteration by key.
Questions and Answers
=====================
What happens if an existing key is reassigned?
The key is not moved but assigned a new value in place. This is
consistent with existing implementations and allows subclasses to
change the behavior easily::
class moving_odict(collections.odict):
def __setitem__(self, key, value):
self.pop(key, None)
collections.odict.__setitem__(self, key, value)
What happens if keys appear multiple times in the list passed to the
constructor?
The same as for regular dicts: The latter item overrides the
former. This has the side-effect that the position of the first
key is used because only the value is actually overwritten:
>>> odict([('a', 1), ('b', 2), ('a', 3)])
collections.odict([('a', 3), ('b', 2)])
This behavior is consistent with existing implementations in
Python, the PHP array and the hashmap in Ruby 1.9.
Is the ordered dict a dict subclass? Why?
Yes. Like ``defaultdict``, ``odict`` subclasses ``dict``.
Being a dict subclass confers speed upon methods that aren't overridden
like ``__getitem__`` and ``__len__``. Also, being a dict gives the
most utility with tools that were expecting regular dicts (like the
json module).
Do any limitations arise from subclassing dict?
Yes. Since the API for dicts is different in Py2.x and Py3.x, the
odict API must also be different (i.e. Py2.6 needs to override
iterkeys, itervalues, and iteritems).
Does ``odict.popitem()`` return a particular key/value pair?
Yes. It pops-off the most recently inserted new key and its
corresponding value. This corresponds to the usual LIFO behavior
exhibited by traditional push/pop pairs. It is semantically
equivalent to ``k=list(od)[-1]; v=od[k]; del od[k]; return (k,v)``.
The actual implementation is more efficient and pops directly
off of a sorted list of keys.
Does odict support indexing, slicing, and whatnot?
As a matter of fact, ``odict`` does not implement the ``Sequence``
interface. Rather, it is a ``MutableMapping`` that remembers
the order of key insertion. The only sequence-like addition is
automatic support for ``reversed``.
Does odict support alternate sort orders such as alphabetical?
No. Those wanting different sort orders really need to be using another
technique. The odict is all about recording insertion order. If any
other order is of interest, then another structure (like an in-memory
dbm) is likely a better fit. It would be a mistake to try to be all
things to all users.
How well does odict work with the json module, PyYAML, and ConfigParser?
For json, the good news is that json's encoder respects odict's iteration order:
>>> items = [('one', 1), ('two', 2), ('three',3), ('four',4), ('five',5)]
>>> json.dumps(OrderedDict(items))
'{"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}'
In Py2.6, the object_hook for json decoders passes-in an already built
dictionary so order is lost before the object hook sees it. This
problem is being fixed for Python 2.7/3.1 by adding an new hook that
preserves order (see http://bugs.python.org/issue5381 ).
With the new hook, order can be preserved:
>>> jtext = '{"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}'
>>> json.loads(jtext, object_pairs_hook=OrderedDict)
OrderedDict({'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5})
For PyYAML, a full round-trip is problem free:
>>> ytext = yaml.dump(OrderedDict(items))
>>> print ytext
!!python/object/apply:collections.OrderedDict
- - [one, 1]
- [two, 2]
- [three, 3]
- [four, 4]
- [five, 5]
>>> yaml.load(ytext)
OrderedDict({'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5})
For the ConfigParser module, round-tripping is problem free. Custom
dicts were added in Py2.6 specifically to support ordered dictionaries:
>>> config = ConfigParser(dict_type=OrderedDict)
>>> config.read('myconfig.ini')
>>> config.remove_option('Log', 'error')
>>> config.write(open('myconfig.ini', 'w'))
How does odict handle equality testing?
Being a dict, one might expect equality tests to not care about order. For
an odict to dict comparison, this would be a necessity and it's probably
not wise to silently switch comparison modes based on the input types.
Also, some third-party tools that expect dict inputs may also expect the
comparison to not care about order. Accordingly, we decided to punt and
let the usual dict equality testing run without reference to internal
ordering. This should be documented clearly since different people will
have different expectations. If a use case does arise, it's not hard to
explicitly craft an order based comparison:
``list(od1.items())==list(od2.items())``.
What are the trade-offs of the possible underlying data structures?
* Keeping a sorted list of keys is very fast for all operations except
__delitem__() which becomes an O(n) exercise. This structure leads to
very simple code and little wasted space.
* Keeping a separate dictionary to record insertion sequence numbers makes
the code a little bit more complex. All of the basic operations are O(1)
but the constant factor is increased for __setitem__() and __delitem__()
meaning that every use case will have to pay for this speedup (since all
buildup go through __setitem__). Also, the first traveral incurs a
one-time ``O(n log n)`` sorting cost. The storage costs are double that
for the sorted-list-of-keys approach.
* A version written in C could use a linked list. The code would be more
complex than the other two approaches but it would conserve space and
would keep the same big-oh performance as regular dictionaries. It is
the fastest and most space efficient.
Reference Implementation
========================
A proposed version is at:
`OrderedDict recipe <http://code.activestate.com/recipes/576669/>`_
The proposed version has several merits:
* Strict compliance with the MutableMapping API and no new methods
so that the learning curve is near zero. It is simply a dictionary
that remembers insertion order.
* Generally good performance. The big-oh times are the same as regular
dictionaries except that key deletion is O(n).
* The code runs without modification on Py2.6, Py2.7, Py3.0, and Py3.1.
Other implementations of ordered dicts in various Python projects or
standalone libraries, that inspired the API proposed here, are:
- `odict in Python`_
- `odict in Babel`_
- `OrderedDict in Django`_
- `The odict module`_
- `ordereddict`_ (a C implementation of the odict module)
- `StableDict`_
- `Armin Rigo's OrderedDict`_
.. _odict in Python: http://dev.pocoo.org/hg/sandbox/raw-file/tip/odict.py
.. _odict in Babel: http://babel.edgewall.org/browser/trunk/babel/util.py?rev=374#L178
.. _OrderedDict in Django:
http://code.djangoproject.com/browser/django/trunk/django/utils/datastructures.py?rev=7140#L53
.. _The odict module: http://www.voidspace.org.uk/python/odict.html
.. _ordereddict: http://www.xs4all.nl/~anthon/Python/ordereddict/
.. _StableDict: http://pypi.python.org/pypi/StableDict/0.2
.. _Armin Rigo's OrderedDict: http://codespeak.net/svn/user/arigo/hack/pyfuse/OrderedDict.py
Future Directions
=================
With the availability of an ordered dict in the standard library,
other libraries may take advantage of that. For example, ElementTree
could return odicts in the future that retain the attribute ordering
of the source file.
References
==========
.. [1] http://bugs.python.org/issue1371075
Copyright
=========
This document has been placed in the public domain.
..
Local Variables:
mode: indented-text
indent-tabs-mode: nil
sentence-end-double-space: t
fill-column: 70
coding: utf-8
End: