PEP: 205 Title: Weak References Version: $Revision$ Author: Fred L. Drake, Jr. Python-Version: 2.1 Status: Incomplete Type: Standards Track Post-History: Motivation There are two basic applications for weak references which have been noted by Python programmers: object caches and reduction of pain from circular references. Caches (weak dictionaries) There is a need to allow objects to be maintained that represent external state, mapping a single instance to the external reality, where allowing multiple instances to be mapped to the same external resource would create unnecessary difficulty maintaining synchronization among instances. In these cases, a common idiom is to support a cache of instances; a factory function is used to return either a new or existing instance. The difficulty in this approach is that one of two things must be tolerated: either the cache grows without bound, or there needs to be explicit management of the cache elsewhere in the application. The later can be very tedious and leads to more code than is really necessary to solve the problem at hand, and the former can be unacceptable for long-running processes or even relatively short processes with substantial memory requirements. - External objects that need to be represented by a single instance, no matter how many internal users there are. This can be useful for representing files that need to be written back to disk in whole rather than locked & modified for every use. - Objects that are expensive to create, but may be needed by multiple internal consumers. Similar to the first case, but not necessarily bound to external resources, and possibly not an issue for shared state. Weak references are only useful in this case if there is some flavor of "soft" references or if there is a high likelihood that users of individual objects will overlap in lifespan. Circular references - DOMs require a huge amount of circular (to parent & document nodes), but these could be eliminated using a weak dictionary mapping from each node to it's parent. This might be especially useful in the context of something like xml.dom.pulldom, allowing the .unlink() operation to become a no-op. This proposal is divided into the following sections: - Proposed Solution - Implementation Strategy - Possible Applications - Previous Weak Reference Work in Python - Weak References in Java The full text of one early proposal is included as an appendix since it does not appear to be available on the net. Proposed Solution Weak references should be able to point to any Python object that may have substantial memory size (directly or indirectly), or hold references to external resources (database connections, open files, etc.). A new module, weakref, will contain two new functions used to create weak references. weakref.new() will create a "weak reference object" and optionally attach a callback which will be called when the object is about to be finalized. weakref.mapping() will create a "weak dictionary". A weak reference object will allow access to the referenced object if it hasn't been collected, allows application code to either "clear" the weak reference and the associated callback, and to determine if the object still exists in memory. These operations are accessed via the .get(), .clear(), and .islive() methods. If the object has been collected, .get() will return None and islive() will return false; calling .clear() will produce the same result, and is allowed to occur repeatedly. Weak references are not hashable. A weak dictionary maps arbitrary keys to values, but does not own a reference to the values. When the values are finalized, the (key, value) pairs for which it is a value are removed from all the mappings containing such pairs. These mapping objects provide an additional method, .setitem(), which accepts a key, value, and optional callback which will be called when the object is removed from the mapping. If the object is removed from the mapping by deleting the entry by key or calling the mapping's .clear() method, the callback is not called (since the object is not about to be finalized), and will be unregistered from the object. Like dictionaries, weak dictionaries are not hashable. The callbacks registered with weak reference objects or in weak dictionaries must accept a single parameter, which will be the weak-ly referenced object itself. The object can be resurrected by creating some other reference to the object in the callback, in which case the weak reference generating the callback will still be cleared but no weak references remaining to the object will be cleared. Implementation Strategy The implementation of weak references will include a list of reference containers that must be cleared for each weakly- referencable object. If the reference is from a weak dictionary, the dictionary entry is cleared first. Then, any associated callback is called with the object passed as a parameter. If the callback generates additional references, no further callbacks are called and the object is resurrected. Once all callbacks have been called and the object has not been resurrected, it is finalized and deallocated. Many built-in types will participate in the weak-reference management, and any extension type can elect to do so. The type structure will contain an additional field which provides an offset into the instance structure which contains a list of weak reference structures. If the value of the field is <= 0, the object does not participate. In this case, weakref.new(), .setitem(), .setdefault(), and item assignment will raise TypeError. If the value of the field is > 0, a new weak reference can be generated and added to the list. This approach is taken to allow arbitrary extension types to participate, without taking a memory hit for numbers or other small types. XXX -- Need to determine the exact list of standard types which should support weak references. This should include instances and dictionaries. Whether it should include strings, buffers, lists, files, or sockets is debatable. Possible Applications PyGTK+ bindings? Tkinter -- could avoid circular references by using weak references from widgets to their parents. Objects won't be discarded any sooner in the typical case, but there won't be so much dependence on the programmer calling .destroy() before releasing a reference. This would mostly benefit long-running applications. DOM trees. Previous Weak Reference Work in Python Dianne Hackborn has proposed something called "virtual references". 'vref' objects are very similar to java.lang.ref.WeakReference objects, except there is no equivalent to the invalidation queues. Implementing a "weak dictionary" would be just as difficult as using only weak references (without the invalidation queue) in Java. Information on this has disappeared from the Web, but is included below as an Appendix. Marc-André Lemburg's mx.Proxy package: http://www.lemburg.com/files/python/mxProxy.html The weakdict module by Dieter Maurer is implemented in C and Python. It appears that the Web pages have not been updated since Python 1.5.2a, so I'm not yet sure if the implementation is compatible with Python 2.0. http://www.handshake.de/~dieter/weakdict.html PyWeakReference by Alex Shindich: http://sourceforge.net/projects/pyweakreference/ Eric Tiedemann has a weak dictionary implementation: http://www.hyperreal.org/~est/python/weak/ Weak References in Java http://java.sun.com/j2se/1.3/docs/api/java/lang/ref/package-summary.html Java provides three forms of weak references, and one interesting helper class. The three forms are called "weak", "soft", and "phantom" references. The relevant classes are defined in the java.lang.ref package. For each of the reference types, there is an option to add the reference to a queue when it is invalidated by the memory allocator. The primary purpose of this facility seems to be that it allows larger structures to be composed to incorporate weak-reference semantics without having to impose substantial additional locking requirements. For instance, it would not be difficult to use this facility to create a "weak" hash table which removes keys and referents when a reference is no longer used elsewhere. Using weak references for the objects without some sort of notification queue for invalidations leads to much more tedious implementation of the various operations required on hash tables. This can be a performance bottleneck if deallocations of the stored objects are infrequent. Java's "weak" references are most like Dianne Hackborn's old vref proposal: a reference object refers to a single Python object, but does not own a reference to that object. When that object is deallocated, the reference object is invalidated. Users of the reference object can easily determine that the reference has been invalidated, or a NullObjectDereferenceError can be raised when an attempt is made to use the referred-to object. The "soft" references are similar, but are not invalidated as soon as all other references to the referred-to object have been released. The "soft" reference does own a reference, but allows the memory allocator to free the referent if the memory is needed elsewhere. It is not clear whether this means soft references are released before the malloc() implementation calls sbrk() or its equivalent, or if soft references are only cleared when malloc() returns NULL. XXX -- Need to figure out what phantom references are all about. Unlike the other two reference types, "phantom" references must be associated with an invalidation queue. Appendix -- Dianne Hackborn's vref proposal (1995) [This has been indented and paragraphs reflowed, but there have be no content changes. --Fred] Proposal: Virtual References In an attempt to partly address the recurring discussion concerning reference counting vs. garbage collection, I would like to propose an extension to Python which should help in the creation of "well structured" cyclic graphs. In particular, it should allow at least trees with parent back-pointers and doubly-linked lists to be created without worry about cycles. The basic mechanism I'd like to propose is that of a "virtual reference," or a "vref" from here on out. A vref is essentially a handle on an object that does not increment the object's reference count. This means that holding a vref on an object will not keep the object from being destroyed. This would allow the Python programmer, for example, to create the aforementioned tree structure tree structure, which is automatically destroyed when it is no longer in use -- by making all of the parent back-references into vrefs, they no longer create reference cycles which keep the tree from being destroyed. In order to implement this mechanism, the Python core must ensure that no -real- pointers are ever left referencing objects that no longer exist. The implementation I would like to propose involves two basic additions to the current Python system: 1. A new "vref" type, through which the Python programmer creates and manipulates virtual references. Internally, it is basically a C-level Python object with a pointer to the Python object it is a reference to. Unlike all other Python code, however, it does not change the reference count of this object. In addition, it includes two pointers to implement a doubly-linked list, which is used below. 2. The addition of a new field to the basic Python object [PyObject_Head in object.h], which is either NULL, or points to the head of a list of all vref objects that reference it. When a vref object attaches itself to another object, it adds itself to this linked list. Then, if an object with any vrefs on it is deallocated, it may walk this list and ensure that all of the vrefs on it point to some safe value, e.g. Nothing. This implementation should hopefully have a minimal impact on the current Python core -- when no vrefs exist, it should only add one pointer to all objects, and a check for a NULL pointer every time an object is deallocated. Back at the Python language level, I have considered two possible semantics for the vref object -- ==> Pointer semantics: In this model, a vref behaves essentially like a Python-level pointer; the Python program must explicitly dereference the vref to manipulate the actual object it references. An example vref module using this model could include the function "new"; When used as 'MyVref = vref.new(MyObject)', it returns a new vref object such that that MyVref.object == MyObject. MyVref.object would then change to Nothing if MyObject is ever deallocated. For a concrete example, we may introduce some new C-style syntax: & -- unary operator, creates a vref on an object, same as vref.new(). * -- unary operator, dereference a vref, same as VrefObject.object. We can then define: 1. type(&MyObject) == vref.VrefType 2. *(&MyObject) == MyObject 3. (*(&MyObject)).attr == MyObject.attr 4. &&MyObject == Nothing 5. *MyObject -> exception Rule #4 is subtle, but comes about because we have made a vref to (a vref with no real references). Thus the outer vref is cleared to Nothing when the inner one inevitably disappears. ==> Proxy semantics: In this model, the Python programmer manipulates vref objects just as if she were manipulating the object it is a reference of. This is accomplished by implementing the vref so that all operations on it are redirected to its referenced object. With this model, the dereference operator (*) no longer makes sense; instead, we have only the reference operator (&), and define: 1. type(&MyObject) == type(MyObject) 2. &MyObject == MyObject 3. (&MyObject).attr == MyObject.attr 4. &&MyObject == MyObject Again, rule #4 is important -- here, the outer vref is in fact a reference to the original object, and -not- the inner vref. This is because all operations applied to a vref actually apply to its object, so that creating a vref of a vref actually results in creating a vref of the latter's object. The first, pointer semantics, has the advantage that it would be very easy to implement; the vref type is extremely simple, requiring at minimum a single attribute, object, and a function to create a reference. However, I really like the proxy semantics. Not only does it put less of a burden on the Python programmer, but it allows you to do nice things like use a vref anywhere you would use the actual object. Unfortunately, it would probably an extreme pain, if not practically impossible, to implement in the current Python implementation. I do have some thoughts, though, on how to do this, if it seems interesting; one possibility is to introduce new type-checking functions which handle the vref. This would hopefully older C modules which don't expect vrefs to simply return a type error, until they can be fixed. Finally, there are some other additional capabilities that this system could provide. One that seems particularily interesting to me involves allowing the Python programmer to add "destructor" function to a vref -- this Python function would be called immediately prior to the referenced object being deallocated, allowing a Python program to invisibly attach itself to another object and watch for it to disappear. This seems neat, though I haven't actually come up with any practical uses for it, yet... :) -- Dianne Copyright This document has been placed in the public domain. Local Variables: mode: indented-text indent-tabs-mode: nil End: