
Documentation and typing of array types, dtypes, and dimensionality/shape #41

@mdhaber

TLDR: Especially as libraries begin to support alternative backends via the Array API, it might be useful to have a standard format for documenting the type, dtype, and dimensionality/shape of array inputs and corresponding outputs (and a way of adding typing information to the code). I thought the summit might be a good place to discuss the topic and potentially prepare a SPEC.


In SciPy, at least, the term "array-like" (without qualification) is commonly used to document the parameter type of functions. I'll give an example from stats.chatterjeexi, which is about on par with other scipy.stats functions in terms of detail.

[Screenshot: the scipy.stats.chatterjeexi docstring, where the inputs are documented simply as "array-like".]

"array-like", used throughout SciPy, is far too broad. I don't think it is defined anywhere in our documentation. The most "official" definition of "array_like" I can find is from the NumPy glossary:

[Screenshot: the NumPy glossary entry defining "array_like".]

Clearly "Any argument accepted by numpy.array is array_like." is not a useful working definition, as almost any Python object is accepted by np.array and coerced to an object array. For example, the module numpy is array_like, according to this definition.

import numpy as np
np.array(np)
# array(<module 'numpy' from '/Users/matthaberland/miniforge3/envs/scipy-dev/lib/python3.13/site-packages/numpy/__init__.py'>,
#      dtype=object)

As for the object type, what we really mean these days (for many functions in stats, for example) is that the type should be one of the following (a rough typing sketch follows the list):

  • A Python list, which will be converted to a NumPy array (but this may change in SciPy 2.0)
  • An Array API compatible array, subject to limitations that appear in a table in the documentation
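
To make this concrete, a rough typing sketch of that narrower meaning might look like the following. This is purely illustrative, not an existing SciPy or NumPy definition; SupportsArrayNamespace is a hypothetical protocol standing in for "object that exposes the Array API entry point __array_namespace__".

from typing import Protocol, TypeAlias, runtime_checkable

@runtime_checkable
class SupportsArrayNamespace(Protocol):
    # Hypothetical protocol: an object exposing the Array API entry point.
    def __array_namespace__(self, *, api_version: str | None = None) -> object: ...

# Hypothetical alias for what "array-like" often means in practice:
# a Python list, or an Array API compatible array.
ArrayLike: TypeAlias = list | SupportsArrayNamespace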

For example, a table given in the ttest_ind documentation looks like:

[Screenshot: the Array API support table from the ttest_ind documentation.]

This table does not mention two other important pieces of information, which are often omitted elsewhere as well: dtype and shape. For dtype, sometimes "real floating" or similar is specified, but many scipy.stats functions assume the reader understands that the input must be real. Many elementwise and reducing functions are pretty flexible about allowed input shape, but even in that case, it would be useful to have a standard way of specifying the relationship between input and output shapes.
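
As a concrete illustration of the kind of information that often goes unstated (shown here with NumPy, though the same questions arise for any backend):

import numpy as np

x = np.arange(6, dtype=np.int32).reshape(2, 3)
res = np.mean(x, axis=0)
print(res.dtype)  # float64: integer input is promoted to a floating-point result
print(res.shape)  # (3,): the reduced axis is removed from the output shape

Neither of these facts is obvious from a parameter documented only as "array-like".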

We have similar issues with the documentation of return values. Instead of "array-like", we often see float or "scalar or ndarray", which doesn't capture the full story.

I think we need:

  • a term that means "array type from a backend that complies with the Python array API standard". (The array API standard docs just use "array", which might be fine, but I can see the argument that it would be confused with the informal "array-like" usage.)
  • a standard format for documenting allowed dtypes.
  • a standard format for documenting the most common shape requirements.
  • a standard format for documenting the most common relationships between input and output type, dtype, and shape.

The last part is probably the most complex. For instance, in SciPy, it is common to have the following (illustrated with a small NumPy sketch after the list):

  • Elementwise functions, which typically accept Array API compatible arrays of any numerical dtype and shape. Usually:
    • Output type = input type, although sometimes the output may technically be a different type from the same backend that is more or less compatible with other arrays from that backend. (For example, scipy.special functions may accept a 0-d NumPy array and produce a NumPy scalar. Whether this is OK is debatable - see Is Array-In -> Scalar-Out OK? #38 - but there may be other backends which use multiple types to implement the overall array protocol.)
    • The dtype of the output matches the result_type of the input(s) with exceptions (e.g. integers may be promoted to floating point)
    • The shape of the output matches the shape of the input.
  • Reducing functions, which follow similar rules as above, but reduce along the axis (or axes) specified by an axis argument; the output shape matches that of the input but eliminates these dimensions
  • Generalized ufuncs like those in scipy.linalg, which follow similar rules as above: they preserve the "batch shape" between input and output but may have complicated input -> output "core dimension" relationships, which could be specified in terms of signatures like (m,n),(n,p)->(m,p) (but often aren't!)
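
Here is that small NumPy sketch of the three patterns; the same relationships are intended to hold for other Array API compatible backends:

import numpy as np

x = np.ones((3, 4))

# Elementwise: output shape equals input shape; dtype follows the result_type of the inputs.
assert np.exp(x).shape == (3, 4)

# Reducing: the dimension(s) named by `axis` are removed from the output shape.
assert np.sum(x, axis=1).shape == (3,)

# Generalized ufunc with signature (m,n),(n,p)->(m,p): batch dimensions broadcast,
# core dimensions follow the signature.
a = np.ones((5, 3, 4))  # batch shape (5,), core shape (3, 4)
b = np.ones((4, 2))     # core shape (4, 2)
assert (a @ b).shape == (5, 3, 2)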

I'd suggest that we can provide some common language for cases like these, which libraries can adapt to their needs. We would also suggest a way to link to more information within a library's documentation, since it is very common for input/output rules to be complicated but at least consistent within a certain set of functions. For instance, essentially all scipy.linalg functions now have a standard note that links to a tutorial about batch operations.

[Screenshot: the standard note included in scipy.linalg function documentation, linking to the tutorial on batched linear algebra operations.]

I think this is a lot more compact/readable (and TBH, more useful) than spelling out all the rules in the documentation of every function. This might be a decent pattern to follow for documenting the relationship between input and output shapes and dtypes (e.g. give information for a representative case in the documentation, and link to a full set of common rules).
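
For example, a parameter and return-value entry following such a pattern might read roughly as below. This is only a sketch of a possible format, not an agreed-upon convention, and the function is hypothetical:

def example_reduction(x, axis=0):
    """Hypothetical function illustrating a possible documentation format.

    Parameters
    ----------
    x : array, real floating dtype
        Array API compatible array (or Python list, which is converted to a
        NumPy array). See the table of supported backends in the library's
        documentation for backend-specific limitations.
    axis : int
        Axis along which to reduce.

    Returns
    -------
    res : array
        Same array type as `x`; dtype is the real floating result type of
        `x`; shape is that of `x` with the dimension(s) given by `axis`
        removed. See the library's documentation for the full set of
        dtype/shape rules shared by reducing functions.
    """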
