SciPy

NEP 22 — Duck typing for NumPy arrays – high level overview

Author:Stephan Hoyer <shoyer@google.com>, Nathaniel J. Smith <njs@pobox.com>
Status:Final
Type:Informational
Created:2018-03-22
Resolution:https://mail.python.org/pipermail/numpy-discussion/2018-September/078752.html

Abstract

We outline a high-level vision for how NumPy will approach handling “duck arrays”. This is an Informational-class NEP; it doesn’t prescribe full details for any particular implementation. In brief, we propose developing a number of new protocols for defining implementations of multi-dimensional arrays with high-level APIs matching NumPy.

Detailed description

Traditionally, NumPy’s ndarray objects have provided two things: a high level API for expression operations on homogenously-typed, arbitrary-dimensional, array-structured data, and a concrete implementation of the API based on strided in-RAM storage. The API is powerful, fairly general, and used ubiquitously across the scientific Python stack. The concrete implementation, on the other hand, is suitable for a wide range of uses, but has limitations: as data sets grow and NumPy becomes used in a variety of new environments, there are increasingly cases where the strided in-RAM storage strategy is inappropriate, and users find they need sparse arrays, lazily evaluated arrays (as in dask), compressed arrays (as in blosc), arrays stored in GPU memory, arrays stored in alternative formats such as Arrow, and so forth – yet users still want to work with these arrays using the familiar NumPy APIs, and re-use existing code with minimal (ideally zero) porting overhead. As a working shorthand, we call these “duck arrays”, by analogy with Python’s “duck typing”: a “duck array” is a Python object which “quacks like” a numpy array in the sense that it has the same or similar Python API, but doesn’t share the C-level implementation.

This NEP doesn’t propose any specific changes to NumPy or other projects; instead, it gives an overview of how we hope to extend NumPy to support a robust ecosystem of projects implementing and relying upon its high level API.

Terminology

“Duck array” works fine as a placeholder for now, but it’s pretty jargony and may confuse new users, so we may want to pick something else for the actual API functions. Unfortunately, “array-like” is already taken for the concept of “anything that can be coerced into an array” (including e.g. list objects), and “anyarray” is already taken for the concept of “something that shares ndarray’s implementation, but has different semantics”, which is the opposite of a duck array (e.g., np.matrix is an “anyarray”, but is not a “duck array”). This is a classic bike-shed so for now we’re just using “duck array”. Some possible options though include: arrayish, pseudoarray, nominalarray, ersatzarray, arraymimic, …

General approach

At a high level, duck array support requires working through each of the API functions provided by NumPy, and figuring out how it can be extended to work with duck array objects. In some cases this is easy (e.g., methods/attributes on ndarray itself); in other cases it’s more difficult. Here are some principles we’ve found useful so far:

Principle 1: Focus on “full” duck arrays, but don’t rule out “partial” duck arrays

We can distinguish between two classes:

  • “full” duck arrays, which aspire to fully implement np.ndarray’s Python-level APIs and work essentially anywhere that np.ndarray works
  • “partial” duck arrays, which intentionally implement only a subset of np.ndarray’s API.

Full duck arrays are, well, kind of boring. They have exactly the same semantics as ndarray, with differences being restricted to under-the-hood decisions about how the data is actually stored. The kind of people that are excited about making numpy more extensible are also, unsurprisingly, excited about changing or extending numpy’s semantics. So there’s been a lot of discussion of how to best support partial duck arrays. We’ve been guilty of this ourself.

At this point though, we think the best general strategy is to focus our efforts primarily on supporting full duck arrays, and only worry about partial duck arrays as much as we need to to make sure we don’t accidentally rule them out for no reason.

Why focus on full duck arrays? Several reasons:

First, there are lots of very clear use cases. Potential consumers of the full duck array interface include almost every package that uses numpy (scipy, sklearn, astropy, …), and in particular packages that provide array-wrapping-classes that handle multiple types of arrays, such as xarray and dask.array. Potential implementers of the full duck array interface include: distributed arrays, sparse arrays, masked arrays, arrays with units (unless they switch to using dtypes), labeled arrays, and so forth. Clear use cases lead to good and relevant APIs.

Second, the Anna Karenina principle applies here: full duck arrays are all alike, but every partial duck array is partial in its own way:

  • xarray.DataArray is mostly a duck array, but has incompatible broadcasting semantics.
  • xarray.Dataset wraps multiple arrays in one object; it still implements some array interfaces like __array_ufunc__, but certainly not all of them.
  • pandas.Series has methods with similar behavior to numpy, but unique null-skipping behavior.
  • scipy’s LinearOperators support matrix multiplication and nothing else
  • h5py and similar libraries for accessing array storage have objects that support numpy-like slicing and conversion into a full array, but not computation.
  • Some classes may be similar to ndarray, but without supporting the full indexing semantics.

And so forth.

Despite our best attempts, we haven’t found any clear, unique way of slicing up the ndarray API into a hierarchy of related types that captures these distinctions; in fact, it’s unlikely that any single person even understands all the distinctions. And this is important, because we have a lot of APIs that we need to add duck array support to (both in numpy and in all the projects that depend on numpy!). By definition, these already work for ndarray, so hopefully getting them to work for full duck arrays shouldn’t be so hard, since by definition full duck arrays act like ndarray. It’d be very cumbersome to have to go through each function and identify the exact subset of the ndarray API that it needs, then figure out which partial array types can/should support it. Once we have things working for full duck arrays, we can go back later and refine the APIs needed further as needed. Focusing on full duck arrays allows us to start making progress immediately.

In the future, it might be useful to identify specific use cases for duck arrays and standardize narrower interfaces targeted just at those use cases. For example, it might make sense to have a standard “array loader” interface that file access libraries like h5py, netcdf, pydap, zarr, … all implement, to make it easy to switch between these libraries. But that’s something that we can do as we go, and it doesn’t necessarily have to involve the NumPy devs at all. For an example of what this might look like, see the documentation for dask.array.from_array.

Principle 2: Take advantage of duck typing

ndarray has a very large API surface area:

In [1]: len(set(dir(np.ndarray)) - set(dir(object)))
Out[1]: 138

And this is a huge underestimate, because there are also many free-standing functions in NumPy and other libraries which currently use the NumPy C API and thus only work on ndarray objects. In type theory, a type is defined by the operations you can perform on an object; thus, the actual type of ndarray includes not just its methods and attributes, but all of these functions. For duck arrays to be successful, they’ll need to implement a large proportion of the ndarray API – but not all of it. (For example, dask.array.Array does not provide an equivalent to the ndarray.ptp method, presumably because no-one has ever noticed or cared about its absence. But this doesn’t seem to have stopped people from using dask.)

This means that realistically, we can’t hope to define the whole duck array API up front, or that anyone will be able to implement it all in one go; this will be an incremental process. It also means that even the so-called “full” duck array interface is somewhat fuzzily defined at the borders; there are parts of the np.ndarray API that duck arrays won’t have to implement, but we aren’t entirely sure what those are.

And ultimately, it isn’t really up to the NumPy developers to define what does or doesn’t qualify as a duck array. If we want scikit-learn functions to work on dask arrays (for example), then that’s going to require negotiation between those two projects to discover incompatibilities, and when an incompatibility is discovered it will be up to them to negotiate who should change and how. The NumPy project can provide technical tools and general advice to help resolve these disagreements, but we can’t force one group or another to take responsibility for any given bug.

Therefore, even though we’re focusing on “full” duck arrays, we don’t attempt to define a normative “array ABC” – maybe this will be useful someday, but right now, it’s not. And as a convenient side-effect, the lack of a normative definition leaves partial duck arrays room to experiment.

But, we do provide some more detailed advice for duck array implementers and consumers below.

Principle 3: Focus on protocols

Historically, numpy has had lots of success at interoperating with third-party objects by defining protocols, like __array__ (asks an arbitrary object to convert itself into an array), __array_interface__ (a precursor to Python’s buffer protocol), and __array_ufunc__ (allows third-party objects to support ufuncs like np.exp).

NEP 16 took a different approach: we need a duck-array equivalent of asarray, and it proposed to do this by defining a version of asarray that would let through objects which implemented a new AbstractArray ABC. As noted above, we now think that trying to define an ABC is a bad idea for other reasons. But when this NEP was discussed on the mailing list, we realized that even on its own merits, this idea is not so great. A better approach is to define a method that can be called on an arbitrary object to ask it to convert itself into a duck array, and then define a version of asarray that calls this method.

This is strictly more powerful: if an object is already a duck array, it can simply return self. It allows more correct semantics: NEP 16 assumed that asarray(obj, dtype=X) is the same as asarray(obj).astype(X), but this isn’t true. And it supports more use cases: if h5py supported sparse arrays, it might want to provide an object which is not itself a sparse array, but which can be automatically converted into a sparse array. See NEP <XX, to be written> for full details.

The protocol approach is also more consistent with core Python conventions: for example, see the __iter__ method for coercing objects to iterators, or the __index__ protocol for safe integer coercion. And finally, focusing on protocols leaves the door open for partial duck arrays, which can pick and choose which subset of the protocols they want to participate in, each of which have well-defined semantics.

Conclusion: protocols are one honking great idea – let’s do more of those.

Principle 4: Reuse existing methods when possible

It’s tempting to try to define cleaned up versions of ndarray methods with a more minimal interface to allow for easier implementation. For example, __array_reshape__ could drop some of the strange arguments accepted by reshape and __array_basic_getitem__ could drop all the strange edge cases of NumPy’s advanced indexing.

But as discussed above, we don’t really know what APIs we need for duck-typing ndarray. We would inevitably end up with a very long list of new special methods. In contrast, existing methods like reshape and __getitem__ have the advantage of already being widely used/exercised by libraries that use duck arrays, and in practice, any serious duck array type is going to have to implement them anyway.

Principle 5: Make it easy to do the right thing

Making duck arrays work well is going to be a community effort. Documentation helps, but only goes so far. We want to make it easy to implement duck arrays that do the right thing.

One way NumPy can help is by providing mixin classes for implementing large groups of related functionality at once. NDArrayOperatorsMixin is a good example: it allows for implementing arithmetic operators implicitly via the __array_ufunc__ method. It’s not complete, and we’ll want more helpers like that (e.g. for reductions).

(We initially thought that the importance of these mixins might be an argument for providing an array ABC, since that’s the standard way to do mixins in modern Python. But in discussion around NEP 16 we realized that partial duck arrays also wanted to take advantage of these mixins in some cases, so even if we did have an array ABC then the mixins would still need some sort of separate existence. So never mind that argument.)

Tentative duck array guidelines

As a general rule, libraries using duck arrays should insist upon the minimum possible requirements, and libraries implementing duck arrays should provide as complete of an API as possible. This will ensure maximum compatibility. For example, users should prefer to rely on .transpose() rather than .swapaxes() (which can be implemented in terms of transpose), but duck array authors should ideally implement both.

If you are trying to implement a duck array, then you should strive to implement everything. You certainly need .shape, .ndim and .dtype, but also your dtype attribute should actually be a numpy.dtype object, weird fancy indexing edge cases should ideally work, etc. Only details related to NumPy’s specific np.ndarray implementation (e.g., strides, data, view) are explicitly out of scope.

A (very) rough sketch of future plans

The proposals discussed so far – __array_ufunc__ and some kind of asarray protocol – are clearly necessary but not sufficient for full duck typing support. We expect the need for additional protocols to support (at least) these features:

  • Concatenating duck arrays, which would be used internally by other array combining methods like stack/vstack/hstack. The implementation of concatenate will need to be negotiated among the list of array arguments. We expect to use an __array_concatenate__ protocol like __array_ufunc__ instead of multiple dispatch.
  • Ufunc-like functions that currently aren’t ufuncs. Many NumPy functions like median, percentile, sort, where and clip could be written as generalized ufuncs but currently aren’t. Either these functions should be written as ufuncs, or we should consider adding another generic wrapper mechanism that works similarly to ufuncs but makes fewer guarantees about how the implementation is done.
  • Random number generation with duck arrays, e.g., np.random.randn(). For example, we might want to add new APIs like random_like() for generating new arrays with a matching shape and type – though we’ll need to look at some real examples of how these functions are used to figure out what would be helpful.
  • Miscellaneous other functions such as np.einsum, np.zeros_like, and np.broadcast_to that don’t fall into any of the above categories.
  • Checking mutability on duck arrays, which would imply that they support assignment with __setitem__ and the out argument to ufuncs. Many otherwise fine duck arrays are not easily mutable (for example, because they use some kinds of sparse or compressed storage, or are in read-only shared memory), and it turns out that frequently-used code like the default implementation of np.mean needs to check this (to decide whether it can re-use temporary arrays).

We intentionally do not describe exactly how to add support for these types of duck arrays here. These will be the subject of future NEPs.