diff --git a/pep-0209.txt b/pep-0209.txt index 54289593e..03d9200ca 100644 --- a/pep-0209.txt +++ b/pep-0209.txt @@ -1,12 +1,680 @@ PEP: 209 -Title: Adding Multidimensional Arrays +Title: Multi-dimensional Arrays Version: $Revision$ -Author: davida@activestate.com (David Ascher) +Author: barrett@stsci.edu (Paul Barrett), oliphant@ee.byu.edu (Travis Oliphant) +Python-Version: 2.2 Status: Draft Type: Standards Track -Python-Version: 2.1 -Created: 15-Jul-2000 -Post-History: +Created: 03-Jan-2001 +Post-History: + + +Abstract + + This PEP proposes a redesign and re-implementation of the multi- + dimensional array module, Numeric, to make it easier to add new + features and functionality to the module. Aspects of Numeric 2 + that will receive special attention are efficient access to arrays + exceeding a gigabyte in size and composed of inhomogeneous data + structures or records. The proposed design uses four Python + classes: ArrayType, UFunc, Array, and ArrayView; and a low-level + C-extension module, _ufunc, to handle the array operations + efficiently. In addition, each array type has its own C-extension + module which defines the coercion rules, operations, and methods + for that type. This design enables new types, features, and + functionality to be added in a modular fashion. The new version + will introduce some incompatibilities with the current Numeric. + + +Motivation + + Multi-dimensional arrays are commonly used to store and manipulate + data in science, engineering, and computing. Python currently has + an extension module, named Numeric (henceforth called Numeric 1), + which provides a satisfactory set of functionality for users + manipulating homogeneous arrays of data of moderate size (of order + 10 MB). For access to larger arrays (of order 100 MB or more) of + possibly inhomogeneous data, the implementation of Numeric 1 is + inefficient and cumbersome. In the future, requests by the + Numerical Python community for additional functionality is also + likely as PEPs 211: Adding New Linear Operators to Python, and + 225: Elementwise/Objectwise Operators illustrate. + + +Proposal + + This proposal recommends a re-design and re-implementation of + Numeric 1, henceforth called Numeric 2, which will enable new + types, features, and functionality to be added in an easy and + modular manner. The initial design of Numeric 2 should focus on + providing a generic framework for manipulating arrays of various + types and should enable a straightforward mechanism for adding new + array types and UFuncs. Functional methods that are more specific + to various disciplines can then be layered on top of this core. + This new module will still be called Numeric and most of the + behavior found in Numeric 1 will be preserved. + + The proposed design uses four Python classes: ArrayType, UFunc, + Array, and ArrayView; and a low-level C-extension module to handle + the array operations efficiently. In addition, each array type + has its own C-extension module which defines the coercion rules, + operations, and methods for that type. At a later date, when core + functionality is stable, some Python classes can be converted to + C-extension types. + + Some planned features are: + + 1. Improved memory usage + + This feature is particularly important when handling large arrays + and can produce significant improvements in performance as well as + memory usage. We have identified several areas where memory usage + can be improved: + + a. Use a local coercion model + + Instead of using Python's global coercion model which creates + temporary arrays, Numeric 2, like Numeric 1, will implement a + local coercion model as described in PEP 208 which defers the + responsibility of coercion to the operator. By using internal + buffers, a coercion operation can be done for each array + (including output arrays), if necessary, at the time of the + operation. Benchmarks [1] have shown that performance is at + most degraded only slightly and is improved in cases where the + internal buffers are less than the L2 cache size and the + processor is under load. To avoid array coercion altogether, + C functions having arguments of mixed type are allowed in + Numeric 2. + + b. Avoid creation of temporary arrays + + In complex array expressions (i.e. having more than one + operation), each operation will create a temporary array which + will be used and then deleted by the succeeding operation. A + better approach would be to identify these temporary arrays + and reuse their data buffers when possible, namely when the + array shape and type are the same as the temporary array being + created. This can be done by checking the temporary array's + reference count. If it is 1, then it will be deleted once the + operation is done and is a candidate for reuse. + + c. Optional use of memory-mapped files + + Numeric users sometimes need to access data from very large + files or to handle data that is greater than the available + memory. Memory-mapped arrays provide a mechanism to do this + by storing the data on disk while making it appear to be in + memory. Memory- mapped arrays should improve access to all + files by eliminating one of two copy steps during a file + access. Numeric should be able to access in-memory and + memory-mapped arrays transparently. + + d. Record access + + In some fields of science, data is stored in files as binary + records. For example in astronomy, photon data is stored as a + 1 dimensional list of photons in order of arrival time. These + records or C-like structures contain information about the + detected photon, such as its arrival time, its position on the + detector, and its energy. Each field may be of a different + type, such as char, int, or float. Such arrays introduce new + issues that must be dealt with, in particular byte alignment + or byte swapping may need to be performed for the numeric + values to be properly accessed (though byte swapping is also + an issue for memory mapped data). Numeric 2 is designed to + automatically handle alignment and representational issues + when data is accessed or operated on. There are two + approaches to implementing records; as either a derived array + class or a special array type, depending on your point-of- + view. We defer this discussion to the Open Issues section. + + + 2. Additional array types + + Numeric 1 has 11 defined types: char, ubyte, sbyte, short, int, + long, float, double, cfloat, cdouble, and object. There are no + ushort, uint, or ulong types, nor are there more complex types + such as a bit type which is of use to some fields of science and + possibly for implementing masked-arrays. The design of Numeric 1 + makes the addition of these and other types a difficult and + error-prone process. To enable the easy addition (and deletion) + of new array types such as a bit type described below, a re-design + of Numeric is necessary. + + a. Bit type + + The result of a rich comparison between arrays is an array of + boolean values. The result can be stored in an array of type + char, but this is an unnecessary waste of memory. A better + implementation would use a bit or boolean type, compressing + the array size by a factor of eight. This is currently being + implemented for Numeric 1 (by Travis Oliphant) and should be + included in Numeric 2. + + 3. Enhanced array indexing syntax + + The extended slicing syntax was added to Python to provide greater + flexibility when manipulating Numeric arrays by allowing + step-sizes greater than 1. This syntax works well as a shorthand + for a list of regularly spaced indices. For those situations + where a list of irregularly spaced indices are needed, an enhanced + array indexing syntax would allow 1-D arrays to be arguments. + + 4. Rich comparisons + + The implementation of PEP 207: Rich Comparisons in Python 2.1 + provides additional flexibility when manipulating arrays. We + intend to implement this feature in Numeric 2. + + 5. Array broadcasting rules + + When an operation between a scalar and an array is done, the + implied behavior is to create a new array having the same shape as + the array operand containing the scalar value. This is called + array broadcasting. It also works with arrays of lesser rank, + such as vectors. This implicit behavior is implemented in Numeric + 1 and will also be implemented in Numeric 2. + + +Design and Implementation + + The design of Numeric 2 has four primary classes: + + 1. ArrayType: + + This is a simple class that describes the fundamental properties + of an array-type, e.g. its name, its size in bytes, its coercion + relations with respect to other types, etc., e.g. + + > Int32 = ArrayType('Int32', 4, 'doc-string') + + Its relation to the other types is defined when the C-extension + module for that type is imported. The corresponding Python code + is: + + > Int32.astype[Real64] = Real64 + + This says that the Real64 array-type has higher priority than the + Int32 array-type. + + The following attributes and methods are proposed for the core + implementation. Additional attributes can be added on an + individual basis, e.g. .bitsize or .bitstrides for the bit type. + + Attributes: + .name: e.g. "Int32", "Float64", etc. + .typecode: e.g. 'i', 'f', etc. + (for backward compatibility) + .size (in bytes): e.g. 4, 8, etc. + .array_rules (mapping): rules between array types + .pyobj_rules (mapping): rules between array and python types + .doc: documentation string + Methods: + __init__(): initialization + __del__(): destruction + __repr__(): representation + + C-API: + This still needs to be fleshed-out. + + + 2. UFunc: + + This class is the heart of Numeric 2. Its design is similar to + that of ArrayType in that the UFunc creates a singleton callable + object whose attributes are name, total and input number of + arguments, a document string, and an empty CFunc dictionary; e.g. + + > add = UFunc('add', 3, 2, 'doc-string') + + When defined the add instance has no C functions associated with + it and therefore can do no work. The CFunc dictionary is + populated or registered later when the C-extension module for an + array-type is imported. The arguments of the register method are: + function name, function descriptor, and the CUFunc object. The + corresponding Python code is + + > add.register('add', (Int32, Int32, Int32), cfunc-add) + + In the initialization function of an array type module, e.g. + Int32, there are two C API functions: one to initialize the + coercion rules and the other to register the CFunc objects. + + When an operation is applied to some arrays, the __call__ method + is invoked. It gets the type of each array (if the output array + is not given, it is created from the coercion rules) and checks + the CFunc dictionary for a key that matches the argument types. + If it exists the operation is performed immediately, otherwise the + coercion rules are used to search for a related operation and set + of conversion functions. The __call__ method then invokes a + compute method written in C to iterate over slices of each array, + namely: + + > _ufunc.compute(slice, data, func, swap, conv) + + The 'func' argument is a CFuncObject, while the 'swap' and 'conv' + arguments are lists of CFuncObjects for those arrays needing pre- + or post-processing, otherwise None is used. The data argument is + a list of buffer objects, and the slice argument gives the number + of iterations for each dimension along with the buffer offset and + step size for each array and each dimension. + + We have predefined several UFuncs for use by the __call__ method: + cast, swap, getobj, and setobj. The cast and swap functions do + coercion and byte-swapping, respectively and the getobj and setobj + functions do coercion between Numeric arrays and Python sequences. + + The following attributes and methods are proposed for the core + implementation. + + Attributes: + .name: e.g. "add", "subtract", etc. + .nargs: number of total arguments + .iargs: number of input arguments + .cfuncs (mapping): the set C functions + .doc: documentation string + Methods: + __init__(): initialization + __del__(): destruction + __repr__(): representation + __call__(): look-up and dispatch method + initrule(): initialize coercion rule + uninitrule(): uninitialize coercion rule + register(): register a CUFunc + unregister(): unregister a CUFunc + + C-API: + This still needs to be fleshed-out. + + 3. Array: + + This class contains information about the array, such as shape, + type, endian-ness of the data, etc.. Its operators, '+', '-', + etc. just invoke the corresponding UFunc function, e.g. + + > def __add__(self, other): + > return ufunc.add(self, other) + + The following attributes, methods, and functions are proposed for + the core implementation. + + Attributes: + .shape: shape of the array + .format: type of the array + .real (only complex): real part of a complex array + .imag (only complex): imaginary part of a complex array + Methods: + __init__(): initialization + __del__(): destruction + __repr_(): representation + __str__(): pretty representation + __cmp__(): rich comparison + __len__(): + __getitem__(): + __setitem__(): + __getslice__(): + __setslice__(): + numeric methods: + copy(): copy of array + aslist(): create list from array + asstring(): create string from array + + Functions: + fromlist(): create array from sequence + fromstring(): create array from string + array(): create array with shape and value + concat(): concatenate two arrays + resize(): resize array + + C-API: + This still needs to be fleshed-out. + + 4. ArrayView + + This class is similar to the Array class except that the reshape + and flat methods will raise exceptions, since non-contiguous + arrays cannot be reshaped or flattened using just pointer and + step-size information. + + C-API: + This still needs to be fleshed-out. + + 5. C-extension modules: + + Numeric2 will have several C-extension modules. + + a. _ufunc: + + The primary module of this set is the _ufuncmodule.c. The + intention of this module is to do the bare minimum, + i.e. iterate over arrays using a specified C function. The + interface of these functions is the same as Numeric 1, i.e. + + int (*CFunc)(char *data, int *steps, int repeat, void *func); + + and their functionality is expected to be the same, i.e. they + iterate over the inner-most dimension. + + The following attributes and methods are proposed for the core + implementation. + + Attributes: + + Methods: + compute(): + + C-API: + This still needs to be fleshed-out. + + b. _int32, _real64, etc.: + + There will also be C-extension modules for each array type, + e.g. _int32module.c, _real64module.c, etc. As mentioned + previously, when these modules are imported by the UFunc + module, they will automatically register their functions and + coercion rules. New or improved versions of these modules can + be easily implemented and used without affecting the rest of + Numeric 2. + + +Open Issues + + 1. Does slicing syntax default to copy or view behavior? + + The default behavior of Python is to return a copy of a sub-list + or tuple when slicing syntax is used, whereas Numeric 1 returns a + view into the array. The choice made for Numeric 1 is apparently + for reasons of performance: the developers wish to avoid the + penalty of allocating and copying the data buffer during each + array operation and feel that the need for a deep copy of an array + to be rare. Yet, some have argued that Numeric's slice notation + should also have copy behavior to be consistent with Python lists. + In this case the performance penalty associated with copy behavior + can be minimized by implementing copy-on-write. This scheme has + both arrays sharing one data buffer (as in view behavior) until + either array is assigned new data at which point a copy of the + data buffer is made. View behavior would then be implemented by + an ArrayView class, whose behavior be similar to Numeric 1 arrays, + i.e. .shape is not settable for non-contiguous arrays. The use of + an ArrayView class also makes explicit what type of data the array + contains. + + 2. Does item syntax default to copy or view behavior? + + A similar question arises with the item syntax. For example, if a + = [[0,1,2], [3,4,5]] and b = a[0], then changing b[0] also changes + a[0][0], because a[0] is a reference or view of the first row of + a. Therefore, if c is a 2-d array, it would appear that c[i] + should return a 1-d array which is a view into, instead of a copy + of, c for consistency. Yet, c[i] can be considered just a + shorthand for c[i,:] which would imply copy behavior assuming + slicing syntax returns a copy. Should Numeric 2 behave the same + way as lists and return a view or should it return a copy. + + 3. How is scalar coercion implemented? + + Python has fewer numeric types than Numeric which can cause + coercion problems. For example when multiplying a Python scalar + of type float and a Numeric array of type float, the Numeric array + is converted to a double, since the Python float type is actually + a double. This is often not the desired behavior, since the + Numeric array will be doubled in size which is likely to be + annoying, particularly for very large arrays. We prefer that the + array type trumps the python type for the same type class, namely + integer, float, and complex. Therefore an operation between a + Python integer and an Int16 (short) array will return an Int16 + array. Whereas an operation between a Python float and an Int16 + array would return a Float64 (double) array. Operations between + two arrays use normal coercion rules. + + 4. How is integer division handled? + + In a future version of Python, the behavior of integer division + will change. The operands will be converted to floats, so the + result will be a float. If we implement the proposed scalar + coercion rules where arrays have precedence over Python scalars, + then dividing an array by an integer will return an integer array + and will not be consistent with a future version of Python which + would return an array of type double. Scientific programmers are + familiar with the distinction between integer and float-point + division, so should Numeric 2 continue with this behavior? + + 5. How should records be implemented? + + There are two approaches to implementing records depending on your + point-of-view. The first is two divide arrays into separate + classes depending on the behavior of their types. For example + numeric arrays are one class, strings a second, and records a + third, because the range and type of operations of each class + differ. As such, a record array is not a new type, but a + mechanism for a more flexible form of array. To easily access and + manipulate such complex data, the class is comprised of numeric + arrays having different byte offsets into the data buffer. For + example, one might have a table consisting of an array of Int16, + Real32 values. Two numeric arrays, one with an offset of 0 bytes + and a stride of 6 bytes to be interpreted as Int16, and one with an + offset of 2 bytes and a stride of 6 bytes to be interpreted as + Real32 would represent the record array. Both numeric arrays + would refer to the same data buffer, but have different offset and + stride attributes, and a different numeric type. + + The second approach is to consider a record as one of many array + types, albeit with fewer, and possibly different, array operations + than for numeric arrays. This approach considers an array type to + be a mapping of a fixed-length string. The mapping can either be + simple, like integer and floating-point numbers, or complex, like + a complex number, a byte string, and a C-structure. The record + type effectively merges the struct and Numeric modules into a + multi-dimensional struct array. This approach implies certain + changes to the array interface. For example, the 'typecode' + keyword argument should probably be changed to the more + descriptive 'format' keyword. + + a. How are record semantics defined and implemented? + + Which ever implementation approach is taken for records, the + syntax and semantics of how they are to be accessed and + manipulated must be decided, if one wishes to have access to + sub-fields of records. In this case, the record type can + essentially be considered an inhomogeneous list, like a tuple + returned by the unpack method of the struct module; and a 1-d + array of records may be interpreted as a 2-d array with the + second dimension being the index into the list of fields. + This enhanced array semantics makes access to an array of one + or more of the fields easy and straightforward. It also + allows a user to do array operations on a field in a natural + and intuitive way. If we assume that records are implemented + as an array type, then last dimension defaults to 0 and can + therefore be neglected for arrays comprised of simple types, + like numeric. + + 6. How are masked-arrays implemented? + + Masked-arrays in Numeric 1 are implemented as a separate array + class. With the ability to add new array types to Numeric 2, it + is possible that masked-arrays in Numeric 2 could be implemented + as a new array type instead of an array class. + + 7. How are numerical errors handled (IEEE floating-point errors in + particular)? + + It is not clear to the proposers (Paul Barrett and Travis + Oliphant) what is the best or preferred way of handling errors. + Since most of the C functions that do the operation, iterate over + the inner-most (last) dimension of the array. This dimension + could contain a thousand or more items having one or more errors + of differing type, such as divide-by-zero, underflow, and + overflow. Additionally, keeping track of these errors may come at + the expense of performance. Therefore, we suggest several + options: + + a. Print a message of the most severe error, leaving it to + the user to locate the errors. + + b. Print a message of all errors that occurred and the number + of occurrences, leaving it to the user to locate the errors. + + c. Print a message of all errors that occurred and a list of + where they occurred. + + d. Or use a hybrid approach, printing only the most severe + error, yet keeping track of what and where the errors + occurred. This would allow the user to locate the errors + while keeping the error message brief. + + 8. What features are needed to ease the integration of FORTRAN + libraries and code? + + It would be a good idea at this stage to consider how to ease the + integration of FORTRAN libraries and user code in Numeric 2. + + +Implementation Steps + + 1. Implement basic UFunc capability + + a. Minimal Array class: + + Necessary class attributes and methods, e.g. .shape, .data, + .type, etc. + + b. Minimal ArrayType class: + + Int32, Real64, Complex64, Char, Object + + c. Minimal UFunc class: + + UFunc instantiation, CFunction registration, UFunc call for + 1-D arrays including the rules for doing alignment, + byte-swapping, and coercion. + + d. Minimal C-extension module: + + _UFunc, which does the innermost array loop in C. + + This step implements whatever is needed to do: 'c = add(a, b)' + where a, b, and c are 1-D arrays. It teaches us how to add + new UFuncs, to coerce the arrays, to pass the necessary + information to a C iterator method and to do the actually + computation. + + 2. Continue enhancing the UFunc iterator and Array class + + a. Implement some access methods for the Array class: + print, repr, getitem, setitem, etc. + + b. Implement multidimensional arrays + + c. Implement some of basic Array methods using UFuncs: + +, -, *, /, etc. + + d. Enable UFuncs to use Python sequences. + + 3. Complete the standard UFunc and Array class behavior + + a. Implement getslice and setslice behavior + + b. Work on Array broadcasting rules + + c. Implement Record type + + 4. Add additional functionality + + a. Add more UFuncs + + b. Implement buffer or mmap access + + +Incompatibilities + + The following is a list of incompatibilities in behavior between + Numeric 1 and Numeric 2. + + 1. Scalar coercion rules + + Numeric 1 has single set of coercion rules for array and Python + numeric types. This can cause unexpected and annoying problems + during the calculation of an array expression. Numeric 2 intends + to overcome these problems by having two sets of coercion rules: + one for arrays and Python numeric types, and another just for + arrays. + + 2. No savespace attribute + + The savespace attribute in Numeric 1 makes arrays with this + attribute set take precedence over those that do not have it set. + Numeric 2 will not have such an attribute and therefore normal + array coercion rules will be in effect. + + 3. Slicing syntax returns a copy + + The slicing syntax in Numeric 1 returns a view into the original + array. The slicing behavior for Numeric 2 will be a copy. You + should use the ArrayView class to get a view into an array. + + 4. Boolean comparisons return a boolean array + + A comparison between arrays in Numeric 1 results in a Boolean + scalar, because of current limitations in Python. The advent of + Rich Comparisons in Python 2.1 will allow an array of Booleans to + be returned. + + 5. Type characters are deprecated + + Numeric 2 will have an ArrayType class composed of Type instances, + for example Int8, Int16, Int32, and Int for signed integers. The + typecode scheme in Numeric 1 will be available for backward + compatibility, but will be deprecated. + + +Appendices + + A. Implicit sub-arrays iteration + + A computer animation is composed of a number of 2-D images or + frames of identical shape. By stacking these images into a single + block of memory, a 3-D array is created. Yet the operations to be + performed are not meant for the entire 3-D array, but on the set + of 2-D sub-arrays. In most array languages, each frame has to be + extracted, operated on, and then reinserted into the output array + using a for-like loop. The J language allows the programmer to + perform such operations implicitly by having a rank for the frame + and array. By default these ranks will be the same during the + creation of the array. It was the intention of the Numeric 1 + developers to implement this feature, since it is based on the + language J. The Numeric 1 code has the required variables for + implementing this behavior, but was never implemented. We intend + to implement implicit sub-array iteration in Numeric 2, if the + array broadcasting rules found in Numeric 1 do not fully support + this behavior. + + +Copyright + + This document is placed in the public domain. + + +Related PEPs + + PEP 207: Rich Comparisons + by Guido van Rossum and David Ascher + + PEP 208: Reworking the Coercion Model + by Neil Schemenauer and Marc-Andre' Lemburg + + PEP 211: Adding New Linear Algebra Operators to Python + by Greg Wilson + + PEP 225: Elementwise/Objectwise Operators + by Huaiyu Zhu + + PEP 228: Reworking Python's Numeric Model + by Moshe Zadka + + +References + + [1] P. Greenfield 2000. private communication.