TLDR: Especially as libraries begin to support alternative backends via the Array API, it might be useful to have a standard format documenting the type, dtype, and dimensionality/shape of array inputs and corresponding outputs (and a way of adding typing information to the code). I thought the summit might be a good place to discuss the topic and potentially prepare a SPEC.
In SciPy, at least, the term "array-like" (without qualification) is commonly used to document the parameter type of functions. I'll give an example from stats.chatterjeexi, which is about on par with other scipy.stats functions in terms of detail.
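Abridged, the parameter and return documentation reads roughly like this (a paraphrased sketch of the level of detail, not a verbatim copy of the docstring):

```
Parameters
----------
x, y : array-like
    ...

Returns
-------
res : ...
    An object with attributes:

    statistic : float or ndarray
        ...
    pvalue : float or ndarray
        ...
```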
"array-like", used throughout SciPy, is far too broad. I don't think it is defined anywhere in our documentation. The most "official" definition of "array_like" I can find is from the NumPy glossary:
Clearly "Any argument accepted by numpy.array is array_like." is not a useful working definition, as almost any Python object is accepted by np.array and coerced to an object array For example, the module numpy is array_like, according to this definition.
```python
import numpy as np

np.array(np)
# array(<module 'numpy' from '/Users/matthaberland/miniforge3/envs/scipy-dev/lib/python3.13/site-packages/numpy/__init__.py'>,
#       dtype=object)
```

As for the object type, what we really mean these days (for many functions in stats, for example) is that the type should be one of the following:
- A Python list, which will be converted to a NumPy array (but this may change in SciPy 2.0)
- An Array-API compatible array subject to limitations that appear in a table in the documentation
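As a concrete illustration of the second case, here is a minimal sketch of what accepting either kind of input looks like today, using the third-party array-api-compat package (normalize_input is a hypothetical helper for illustration, not SciPy API):

```python
import numpy as np
import array_api_compat  # third-party package providing array_namespace()


def normalize_input(x):
    """Hypothetical helper: accept a Python list or an Array-API-compatible array."""
    if isinstance(x, list):
        # Lists are coerced to NumPy arrays, matching SciPy's current behavior.
        x = np.asarray(x)
    # array_namespace() returns the array API namespace for the input's backend
    # and raises TypeError for objects it does not recognize as arrays.
    xp = array_api_compat.array_namespace(x)
    return xp, x


xp, x = normalize_input([1.0, 2.0, 3.0])
print(x.shape)  # (3,)
```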
As an example of the limitations mentioned above, the table given in the ttest_ind documentation looks like:
This does not mention two other important pieces of information that are often omitted: dtype and shape. For dtype, sometimes "real floating" or similar is specified, but many scipy.stats functions simply assume the reader understands that the input must be real. Many elementwise and reducing functions are fairly flexible about allowed input shapes, but even then it would be useful to have a standard way of specifying the relationship between input and output shapes.
We have similar issues with the documentation of return values. Instead of "array-like", we often see float or "scalar or ndarray", which doesn't capture the full story.
I think we need:
- a term that means "array type from a backend that complies with the Python array API standard". (The array API standard docs just use "array", which might be fine, but I can see arguments that it will be confused with an informal, array-like definition.)
- a standard format for documenting allowed dtypes.
- a standard format for documenting the most common shape requirements (see the sketch after this list for one possibility).
- a standard format for documenting the most common relationships between input and output type, dtype, and shape.
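For the first three items, a convention could look something like the following docstring sketch (hypothetical, not an existing standard):

```python
def example_reduction(x, axis=-1):
    """Hypothetical docstring sketch illustrating one possible convention.

    Parameters
    ----------
    x : array, real floating dtype, shape (..., n)
        An array from any backend compliant with the Python array API standard.
    axis : int, default: -1
        Axis along which the statistic is computed.

    Returns
    -------
    statistic : array (same backend as `x`), real floating dtype, shape (...)
        The output shape is the input shape with `axis` removed.
    """
    ...
```

Here "array" would be the agreed-upon term from the first item in the list above.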
The last part is probably the most complex. For instance, in SciPy, it is common to have:
- Elementwise functions, which typically accept Array-API-compatible arrays of any numerical dtype and shape. Usually:
  - Output type = input type, although sometimes the output may technically be a different type from the same backend that is more or less compatible with other arrays from that backend. (For example, scipy.special functions may accept a 0-d NumPy array and produce a NumPy scalar. Whether this is OK is debatable - see Is Array-In -> Scalar-Out OK? #38 - but there may be other backends which use multiple types to implement the overall array protocol.)
  - The dtype of the output matches the result_type of the input(s), with exceptions (e.g. integers may be promoted to floating point).
  - The shape of the output matches the shape of the input.
- Reducing functions, which follow similar rules as above, but reduce along the axis (or axes) specified by an axis argument; the output shape matches that of the input but eliminates these dimensions.
- Generalized ufuncs like those in scipy.linalg, which follow similar rules as above, but preserve the "batch shape" between input and output and may have complicated input -> output "core dimension" relationships, which could be specified in terms of signatures like (m,n),(n,p)->(m,p) (but often aren't!); a short example follows this list.
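To make the batch-shape / core-dimension distinction concrete, here is a small example using np.matmul, whose behavior matches the (m,n),(n,p)->(m,p) pattern mentioned above (used purely as an illustration):

```python
import numpy as np

# Leading "batch" dimensions are preserved; the trailing "core" dimensions
# follow the (m,n),(n,p)->(m,p) rule of matrix multiplication.
a = np.ones((5, 3, 4))        # batch shape (5,), core shape (3, 4)
b = np.ones((5, 4, 2))        # batch shape (5,), core shape (4, 2)
print(np.matmul(a, b).shape)  # (5, 3, 2): batch (5,) + core (3, 2)
```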
I'd suggest that we can provide some common language for cases like these, which libraries can adapt to their needs. We would also suggest a way to link to more information within a library's documentation, since it is very common for input/output rules to be complicated but at least consistent within a certain set of functions. For instance, essentially all scipy.linalg functions now have a standard note that links to a tutorial about batch operations.
I think this is a lot more compact/readable (and TBH, more useful) than spelling out all the rules in the documentation of every function. This might be a decent pattern to follow for documenting the relationship between input and output shapes and dtypes (e.g. give information for a representative case in the documentation, and link to a full set of common rules).
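On the "typing information in the code" point from the TLDR, one possible way (a hypothetical sketch, not an established API) to spell "array from an array-API-compliant backend" in annotations is a Protocol keyed on __array_namespace__, which the standard requires conforming arrays to expose:

```python
from __future__ import annotations

from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class ArrayAPIArray(Protocol):
    """Hypothetical structural type: 'array from an array-API-compliant backend'."""

    # The array API standard requires conforming arrays to provide this method.
    def __array_namespace__(self, *, api_version: str | None = None) -> Any: ...

    @property
    def shape(self) -> tuple[int | None, ...]: ...

    @property
    def dtype(self) -> Any: ...


def softplus(x: ArrayAPIArray) -> ArrayAPIArray:
    """Toy elementwise function annotated with the hypothetical array type."""
    xp = x.__array_namespace__()
    return xp.log1p(xp.exp(x))
```

Whether such a Protocol, a shared type alias, or something else is the right spelling is exactly the kind of question a SPEC could settle.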