python-peps/pep-0253.txt

669 lines
31 KiB
Plaintext
Raw Normal View History

2001-05-14 09:43:23 -04:00
PEP: 253
Title: Subtyping Built-in Types
Version: $Revision$
Author: guido@python.org (Guido van Rossum)
Status: Draft
Type: Standards Track
Python-Version: 2.2
Created: 14-May-2001
Post-History:
Abstract
This PEP proposes additions to the type object API that will allow
the creation of subtypes of built-in types, in C and in Python.
Introduction
2001-05-14 09:43:23 -04:00
Traditionally, types in Python have been created statically, by
declaring a global variable of type PyTypeObject and initializing
it with a static initializer. The fields in the type object
describe all aspects of a Python type that are relevant to the
Python interpreter. A few fields contain dimensional information
(e.g. the basic allocation size of instances), others contain
various flags, but most fields are pointers to functions to
implement various kinds of behaviors. A NULL pointer means that
the type does not implement the specific behavior; in that case
the system may provide a default behavior in that case or raise an
exception when the behavior is invoked. Some collections of
functions pointers that are usually defined together are obtained
indirectly via a pointer to an additional structure containing
more function pointers.
2001-05-14 09:43:23 -04:00
While the details of initializing a PyTypeObject structure haven't
been documented as such, they are easily gleaned from the examples
in the source code, and I am assuming that the reader is
sufficiently familiar with the traditional way of creating new
Python types in C.
This PEP will introduce the following features:
- a type can be a factory function for its instances
- types can be subtyped in C
- types can be subtyped in Python with the class statement
- multiple inheritance from types is supported (insofar as
practical)
- the standard coercions functions (int, tuple, str etc.) will be
redefined to be the corresponding type objects, which serve as
their own factory functions
- there will be a standard type hierarchy
- a class statement can contain a metaclass declaration,
specifying the metaclass to be used to create the new class
- a class statement can contain a slots declaration, specifying
the specific names of the instance variables supported
This PEP builds on pep-0252, which adds standard introspection to
types; e.g., when the type object defines the tp_hash slot, the
type object has a __hash__ method. pep-0252 also adds a
dictionary to type objects which contains all methods. At the
Python level, this dictionary is read-only for built-in types; at
the C level, it is accessible directly (but it should not be
modified except as part of initialization).
For binary compatibility, a flag bit in the tp_flags slot
indicates the existence of the various new slots in the type
object introduced below. Types that don't have the
Py_TPFLAGS_HAVE_CLASS bit set in their tp_flags field are assumed
to have NULL values for all the subtyping slots. (Warning: the
current implementation prototype is not yet consistent in its
checking of this flag bit. This should be fixed before the final
release.)
In current Python, a distinction is made between types and
classes. This PEP together with pep-0254 will remove that
distinction. However, for backwards compatibility there will
probably remain a bit of a distinction for years to come, and
without pep-0254, the distinction is still large: types ultimately
have a built-in type as a base class, while classes ultimately
derive from a user-defined class. Therefore, in the rest of this
PEP, I will use the word type whenever I can -- including base
type or supertype, derived type or subtype, and metatype.
However, sometimes the terminology necessarily blends, e.g. an
object's type is given by its __class__ attribute, and subtyping
in Python is spelled with a class statement. If further
distinction is necessary, user-defined classes can be referred to
as "classic" classes.
About metatypes
Inevitably the following discussion will come to mention metatypes
(or metaclasses). Metatypes are nothing new in Python: Python has
always been able to talk about the type of a type:
>>> a = 0
>>> type(a)
<type 'int'>
>>> type(type(a))
<type 'type'>
>>> type(type(type(a)))
<type 'type'>
>>>
In this example, type(a) is a "regular" type, and type(type(a)) is
a metatype. While as distributed all types have the same metatype
(PyType_Type, which is also its own metatype), this is not a
requirement, and in fact a useful and relevant 3rd party extension
(ExtensionClasses by Jim Fulton) creates an additional metatype.
A related feature is the "Don Beaudry hook", which says that if a
metatype is callable, its instances (which are regular types) can
be subclassed (really subtyped) using a Python class statement.
I will use this rule to support subtyping of built-in types, and
in fact it greatly simplifies the logic of class creation to
always simply call the metatype. When no base class is specified,
a default metatype is called -- the default metatype is the
"ClassType" object, so the class statement will behave as before
in the normal case.
Python uses the concept of metatypes or metaclasses in a different
way than Smalltalk. In Smalltalk-80, there is a hierarchy of
metaclasses that mirrors the hierarchy of regular classes,
metaclasses map 1-1 to classes (except for some funny business at
the root of the hierarchy), and each class statement creates both
a regular class and its metaclass, putting class methods in the
metaclass and instance methods in the regular class.
Nice though this may be in the context of Smalltalk, it's not
compatible with the traditional use of metatypes in Python, and I
prefer to continue in the Python way. This means that Python
metatypes are typically written in C, and may be shared between
many regular types. (It will be possible to subtype metatypes in
Python, so it won't be absolutely necessary to write C in order to
use metatypes; but the power of Python metatypes will be limited,
e.g. Python code will never be allowed to allocate raw memory and
initialize it at will.)
Metatypes determine various *policies* for types, e.g. what
happens when a type is called, how dynamic types are (whether a
type's __dict__ can be modified after it is created), what the
method resolution order is, how instance attributes are looked
up, and so on.
I'll argue that left-to-right depth-first is not the best
solution when you want to get the most use from multiple
inheritance.
I'll argue that with multiple inheritance, the metatype of the
subtype must be a descendant of the metatypes of all base types.
I'll come back to metatypes later.
Making a type a factory for its instances
Traditionally, for each type there is at least one C factory
function that creates instances of the type (PyTuple_New(),
PyInt_FromLong() and so on). These factory functions take care of
both allocating memory for the object and initializing that
memory. As of Python 2.0, they also have to interface with the
garbage collection subsystem, if the type chooses to participate
in garbage collection (which is optional, but strongly recommended
for so-called "container" types: types that may contain arbitrary
references to other objects, and hence may participate in
reference cycles).
In this proposal, type objects can be factory functions for their
instances, making the types directly callable from Python. This
mimics the way classes are instantiated. Of course, the C APIs
for creating instances of various built-in types will remain valid
and probably the most common; and not all types will become their
own factory functions.
The type object has a new slot, tp_new, which can act as a factory
for instances of the type. Types are made callable by providing a
tp_call slot in PyType_Type (the metatype); the slot
implementation function looks for the tp_new slot of the type that
is being called.
If the type's tp_new slot is NULL, an exception is raised.
Otherwise, the tp_new slot is called. The signature for the
tp_new slot is
PyObject *tp_new(PyTypeObject *type,
PyObject *args,
PyObject *kwds)
where 'type' is the type whose tp_new slot is called, and 'args'
and 'kwds' are the sequential and keyword arguments to the call,
passed unchanged from tp_call. (The 'type' argument is used in
combination with inheritance, see below.)
There are no constraints on the object type that is returned,
although by convention it should be an instance of the given
type. It is not necessary that a new object is returned; a
reference to an existing object is fine too. The return value
should always be a new reference, owned by the caller.
Requirements for a type to allow subtyping
The simplest form of subtyping is subtyping in C. It is the
simplest form because we can require the C code to be aware of the
various problems, and it's acceptable for C code that doesn't
follow the rules to dump core. For added simplicity, it is
limited to single inheritance.
The idea behind subtyping is very similar to that of single
inheritance in C++. A base type is described by a structure
declaration plus a type object. A derived type can extend the
structure (but must leave the names, order and type of the fields
of the base structure unchanged) and can override certain slots in
the type object, leaving others the same.
Most issues have to do with construction and destruction of
instances of derived types.
Creation of a new object is separated into allocation and
initialization: allocation allocates the memory, and
initialization fill it with appropriate initial values. The
separation is needed for the convenience of subtypes.
Instantiation of a subtype goes as follows:
1. allocate memory for the whole (subtype) instance
2. initialize the base type
3. initialize the subtype's instance variables
If allocation and initialization were done by the same function,
you would need a way to tell the base type's constructor to
allocate additional memory for the subtype's instance variables,
and there would be no way to change the allocation method for a
subtype (without giving up on calling the base type to initialize
its part of the instance structure).
A similar reasoning applies to destruction: if a subtype changes
the instance allocator (e.g. to use a different heap), it must
also change the instance deallocator; but it must still call on
the base type's destructor to DECREF the base type's instance
variables.
In this proposal, I assign stricter meanings to two existing
slots for deallocation and deinitialization, and I add two new
slots for allocation and initialization.
The tp_clear slot gets the new task of deinitializing an object so
that all that remains to be done is free its memory. Originally,
all it had to do was clear object references. The difference is
subtle: the list and dictionary objects contain references to an
additional heap-allocated piece of memory that isn't freed by
tp_clear in Python 2.1, but which must be freed by tp_clear under
this proposal. It should be safe to call tp_clear repeatedly on
the same object. If an object contains no references to other
objects or heap-allocated memory, the tp_clear slot may be NULL.
The only additional requirement for the tp_dealloc slot is that it
should do the right thing whether or not tp_clear has been called.
The new slots are tp_alloc for allocation and tp_init for
initialization. Their signatures:
PyObject *tp_alloc(PyTypeObject *type,
PyObject *args,
PyObject *kwds)
int tp_init(PyObject *self,
PyObject *args,
PyObject *kwds)
[XXX We'll have to rename tp_alloc to something else, because in
debug mode there's already a tp_alloc field.]
The arguments for tp_alloc are the same as for tp_new, described
above. The arguments for tp_init are the same except that the
first argument is replaced with the instance to be initialized.
Its return value is 0 for success or -1 for failure.
It is possible that tp_init is called more than once or not at
all. The implementation should allow this usage. The object may
be non-functional until tp_init is called, and a second call to
tp_init may raise an exception, but it should not be possible to
cause a core dump or memory leakage this way.
Because tp_init is in a sense optional, tp_alloc is required to do
*some* initialization of the object. It must initialize ob_refcnt
to 1 and ob_type to its type argument. It should zero out the
rest of the object.
The constructor arguments are passed to tp_alloc so that for
variable-size objects (like tuples and strings) it knows to
allocate the right amount of memory.
For immutable types, tp_alloc may have to do the full
initialization; otherwise, different calls to tp_init might cause
an immutable object to be modified, which is considered a grave
offense in Python (unlike in Fortran :-).
Not every type can serve as a base type. The assumption is made
that if a type has a non-NULL value in its tp_init slot, it is
ready to be subclassed; otherwise, it is not, and using it as a
base class will raise an exception.
In order to be usefully subtyped in C, a type must also export the
structure declaration for its instances through a header file, as
it is needed in order to derive a subtype. The type object for
the base type must also be exported.
If the base type has a type-checking macro (e.g. PyDict_Check()),
this macro probably should be changed to recognize subtypes. This
can be done by using the new PyObject_TypeCheck(object, type)
macro, which calls a function that follows the base class links.
(An argument against changing the type-checking macro could be
that the type check is used frequently and a function call would
slow things down too much, but I find this hard to believe. One
could also fear that a subtype might break an invariant assumed by
the support functions of the base type. Usually it is best to
change the base type to remove this reliance, at least to the
point of raising an exception rather than dumping core when the
invariant is broken.)
Here are the inteactions between, tp_alloc, tp_clear, tp_dealloc
and subtypes; all assuming that the base type defines tp_init
(otherwise it cannot be subtyped anyway):
- If the base type's allocation scheme doesn't use the standard
heap, it should not define tp_alloc. This is a signal for the
subclass to provide its own tp_alloc *and* tp_dealloc
implementation (probably using the standard heap).
- If the base type's tp_dealloc does anything besides calling
PyObject_DEL() (typically, calling Py_XDECREF() on contained
objects or freeing dependent memory blocks), it should define a
tp_clear that does the same without calling PyObject_DEL(), and
which checks for zero pointers before and zeros the pointers
afterwards, so that calling tp_clear more than once or calling
tp_dealloc after tp_clear will not attempt to DECREF or free the
same object/memory twice. (It should also be allowed to
continue using the object after tp_clear -- tp_clear should
simply reset the object to its pristine state.)
- If the derived type overrides tp_alloc, it should also override
tp_dealloc, and tp_dealloc should call the derived type's
tp_clear if non-NULL (or its own tp_clear).
- If the derived type overrides tp_clear, it should call the base
type's tp_clear if non-NULL.
- If the base type defines tp_init as well as tp_new, its tp_new
should be inheritable: it should call the tp_alloc and the
tp_init of the type passed in as its first argument.
- If the base type defines tp_init as well as tp_alloc, its
tp_alloc should be inheritable: it should look in the
tp_basicsize slot of the type passed in for the amount of memory
to allocate, and it should initialize all allocated bytes to
zero.
- For types whose tp_itemsize is nonzero, the allocation size used
in tp_alloc should be tp_basicsize + n*tp_itemsize, rounded up
to the next integral multiple of sizeof(PyObject *), where n is
the number of items determined by the arguments to tp_alloc.
- Things are further complicated by the garbage collection API.
This affects tp_basicsize, and the actions to be taken by
tp_alloc. tp_alloc should look at the Py_TPFLAGS_GC flag bit in
the tp_flags field of the type passed in, and not assume that
this is the same as the corresponding bit in the base type. (In
part, the GC API is at fault; Neil Schemenauer has a patch that
fixes the API, but it is currently backwards incompatible.)
Note: the rules here are very complicated -- probably too
complicated. It may be better to give up on subtyping immutable
types, types with custom allocators, and types with variable size
allocation (such as int, string and tuple) -- then the rules can
be much simplified because you can assume allocation on the
standard heap, no requirement beyond zeroing memory in tp_alloc,
and no variable length allocation.
Creating a subtype of a built-in type in C
Let's assume we're deriving from a mutable base type whose
tp_itemsize is zero. The subtype code is not GC-aware, although
it may inherit GC-awareness from the base type (this is
automatic). The base type's allocation uses the standard heap.
The derived type begins by declaring a type structure which
contains the base type's structure. For example, here's the type
structure for a subtype of the built-in list type:
typedef struct {
PyListObject list;
int state;
} spamlistobject;
Note that the base type structure field (here PyListObject) must
be the first field in the structure; any following fields are
extension fields. Also note that the base type is not referenced
via a pointer; the actual contents of its structure must be
included! (The goal is for the memory lay out of the beginning of
the subtype instance to be the same as that of the base type
instance.)
Next, the derived type must declare a type object and initialize
it. Most of the slots in the type object may be initialized to
zero, which is a signal that the base type slot must be copied
into it. Some fields that must be initialized properly:
- The object header must be filled in as usual; the type should be
&PyType_Type.
- The tp_basicsize field must be set to the size of the subtype
instance struct (in the above example: sizeof(spamlistobject)).
- The tp_base field must be set to the address of the base type's
type object.
- If the derived slot defines any pointer fields, the tp_dealloc
slot function requires special attention, see below; otherwise,
it can be set to zero, to inherit the base type's deallocation
function.
- The tp_flags field must be set to the usual Py_TPFLAGS_DEFAULT
value.
- The tp_name field must be set; it is recommended to set tp_doc
as well (these are not inherited).
Exception: if the subtype defines no additional fields in its
structure (i.e., it only defines new behavior, no new data), the
tp_basicsize and the tp_dealloc fields may be set to zero.
In order to complete the initialization of the type,
PyType_InitDict() must be called. This replaces zero slots in the
subtype with the value of the corresponding base type slots. (It
also fills in tp_dict, the type's dictionary, and does various
other initializations necessary for type objects.)
The subtype's tp_dealloc slot deserves special attention. If the
derived type defines no additional pointers that need to be
DECREF'ed or freed when the object is deallocated, it can be set
to zero. Otherwise, the subtype's deallocation function must call
Py_XDECREF() for any PyObject * fields and the correct memory
freeing function for any other pointers it owns, and then call the
base class's tp_dealloc slot. Because deallocation functions
typically are not exported, this call has to be made via the base
type's type structure, e.g., when deriving from the standard list
type:
PyList_Type.tp_dealloc(self);
(If the subtype uses a different allocation heap than the base
type, the subtype must call the base type's tp_clear() slot
instead, followed by a call to free the object's memory from the
appropriate heap, e.g. PyObject_DEL(self) if the subtype uses the
standard heap. But in this case subtyping is not recommended.)
A subtype is not usable until PyType_InitDict() is called for it;
this is best done during module initialization, assuming the
subtype belongs to a module. An alternative for subtypes added to
the Python core (which don't live in a particular module) would be
to initialize the subtype in their constructor function. It is
allowed to call PyType_InitDict() more than once, the second and
further calls have no effect. In order to avoid unnecessary
calls, a test for tp_dict==NULL can be made.
To create a subtype instance, the base type's tp_alloc slot must
be called with the subtype as its first argument. Then, if the
base type has a tp_init slot, that must be called to initialize
the base portion of the instance; finally the subtype's own fields
must be initialized. After allocation, the initialization can
also be done by calling the subtype's tp_init slot, assuming this
correctly calls its base type's tp_init slot.
Subtyping in Python
The next step is to allow subtyping of selected built-in types
through a class statement in Python. Limiting ourselves to single
inheritance for now, here is what happens for a simple class
statement:
class C(B):
var1 = 1
def method1(self): pass
# etc.
The body of the class statement is executes in a fresh environment
(basically, a new dictionary used as local namespace), and then C
is created. The following explains how C is created.
Assume B is a type object. Since type objects are objects, and
every object has a type, B has a type. B's type is accessible via
type(B) or B.__class__ (the latter notation is new for types; it
is introduced in pep-0252). Let's say B's type is M (for
Metatype). The class statement will create a new type, C. Since
C will be a type object just like B, we view the creation of C as
an instantiation of the metatype, M. The information that needs
to be provided for the creation of C is: its name (in this example
the string "C"); the list of base classes (a singleton tuple
containing B); and the results of executing the class body, in the
form of a dictionary (e.g. {"var1": 1, "method1": <function
method1 at ...>, ...}).
I propose to rig the class statement to make the following call:
C = M("C", (B,), dict)
(where dict is the dictionary resulting from execution of the
class body). In other words, the metatype (M) is called.
Note that even though we currently require there to be exactly one
base class, we still pass in a (singleton) sequence of base
classes; this makes it possible to support multiple inheritance
later (or for types with a different metaclass!) without changing
this interface.
In current Python, this is called the "Don Beaudry hook" after its
inventor; it is an exceptional case that is only invoked when a
base class is not a regular class. For a regular base class (or
when no base class is specified), current Python calls
PyClass_New(), the C level factory function for classes, directly.
I propose to change this so that Python *always* determines a
metaclass and calls it as given above. When one or more bases are
given, the type of the first base is used as the metatype;
when no base class is given, a default metaclass is chosen. By
setting the default metaclass to PyClass_Type, the metatype of
"classic" classes, the classic behavior of the class statement is
retained.
There are two further refinements here. First, a useful feature
is to be able to specify a metatype directly. If the class
statement defines a variable __metaclass__, that is the metatype
to call.
Second, with multiple bases, not all bases need to have the same
metatype. This is called a metaclass conflict [1]. Some
metaclass conflicts can be resolved by searching through the set
of bases for a metatype that derives from all other given
metatypes. If such a metatype cannot be found, an exception is
raised and the class statement fails.
This conflict resultion can be implemented in the metatypes
itself: the class statement just calls the metatype of the first
base, and this metatype's constructor looks for the most derived
metatype. If that is itself, it proceeds; otherwise, it calls
that metatype's constructor. (Ultimate flexibility: another
metatype might choose to require that all bases have the same
metatype, or that there's only one base class, or whatever.)
(Theoretically, it might be possible to automatically derive a new
metatype that is a subtype of all given metatypes; but since it is
questionable how conflicting method definitions of the various
metatypes should be merged, I don't think this is useful or
feasible. Should the need arise, the user can derive such a
metatype and specify it using the __metaclass__ variable. It is
also possible to have a new metatype that does this.)
HIRO
Note that calling M requires that M itself has a type: the
meta-metatype. In the current implementation, I have introduced a
new type object for this purpose, named turtle because of my
fondness of the phrase "turtles all the way down". However I now
believe that it would be better if M were its own metatype, just
like before. This can be accomplished by making M's tp_call slot
slightly more flexible.
In any case, the work for creating C is done by M's tp_construct
slot. It allocates space for an "extended" type structure, which
contains space for: the type object; the auxiliary structures
(as_sequence etc.); the string object containing the type name (to
ensure that this object isn't deallocated while the type object is
still referencing it); and some more auxiliary storage (to be
described later). It initializes this storage to zeros except for
a few crucial slots (e.g. tp_name is set to point to the type
name) and then sets the tp_base slot to point to B. Then
PyType_InitDict() is called to inherit B's slots. Finally, C's
tp_dict slot is updated with the contents of the namespace
dictionary (the third argument to the call to M).
2001-05-14 09:43:23 -04:00
Implementation
A prototype implementation of this PEP is available from CVS as a
branch named "descr-branch". To experiment with this
implementation, proceed to check out Python from CVS according to
the instructions at http://sourceforge.net/cvs/?group_id=5470 but
add the arguments "-r descr-branch" to the cvs checkout command.
(You can also start with an existing checkout and do "cvs update
-r descr-branch".) For some examples of the features described
here, see the file Lib/test/test_descr.py and the extension module
Modules/spam.c.
Note: the code in this branch is for pep-0252, pep-0253, and
pep-254.
References
[1] "Putting Metaclasses to Work", by Ira R. Forman and Scott
H. Danforth, Addison-Wesley 1999.
(http://www.aw.com/product/0,2627,0201433052,00.html)
2001-05-14 09:43:23 -04:00
Copyright
This document has been placed in the public domain.
Junk text (to be reused somewhere above)
The deallocation mechanism chosen should match the allocation
mechanism: an allocation policy should prescribe both the
allocation and deallocation mechanism. And again, planning ahead
for subtyping would be nice. But the available mechanisms are
different. The deallocation function has always been part of the
type structure, as tp_dealloc, which combines the
"uninitialization" with deallocation. This was good enough for
the traditional situation, where it matched the combined
allocation and initialization of the creation function. But now
imagine a type whose creation function uses a special free list
for allocation. It's deallocation function puts the object's
memory back on the same free list. But when allocation and
creation are separate, the object may have been allocated from the
regular heap, and it would be wrong (in some cases disastrous) if
it were placed on the free list by the deallocation function.
A solution would be for the tp_construct function to somehow mark
whether the object was allocated from the special free list, so
that the tp_dealloc function can choose the right deallocation
method (assuming that the only two alternatives are a special free
list or the regular heap). A variant that doesn't require space
for an allocation flag bit would be to have two type objects,
identical in the contents of all their slots except for their
deallocation slot. But this requires that all type-checking code
(e.g. the PyDict_Check()) recognizes both types. We'll come back
to this solution in the context of subtyping. Another alternative
is to require the metatype's tp_call to leave the allocation to
the tp_construct method, by passing in a NULL pointer. But this
doesn't work once we allow subtyping.
Eventually, when we add any form of subtyping, we'll have to
separate deallocation from uninitialization. The way to do this
is to add a separate slot to the type object that does the
uninitialization without the deallocation. Fortunately, there is
already such a slot: tp_clear, currently used by the garbage
collection subsystem. A simple rule makes this slot reusable as
an uninitialization: for types that support separate allocation
and initialization, tp_clear must be defined (even if the object
doesn't support garbage collection) and it must DECREF all
contained objects and FREE all other memory areas the object owns.
It must also be reentrant: it must be possible to clear an already
cleared object. The easiest way to do this is to replace all
pointers DECREFed or FREEd with NULL pointers.
2001-05-14 09:43:23 -04:00
Local Variables:
mode: indented-text
indent-tabs-mode: nil
End: