| 1 | +--- |
| 2 | +jupytext: |
| 3 | + text_representation: |
| 4 | + extension: .md |
| 5 | + format_name: myst |
| 6 | +kernelspec: |
| 7 | + display_name: Python 3 (ipykernel) |
| 8 | + language: python |
| 9 | + name: python3 |
| 10 | +--- |
| 11 | + |
| 12 | +(getting_started)= |
| 13 | +# Getting started |
| 14 | + |
| 15 | +## Welcome to `xarray-einstats`! |
| 16 | +`xarray-einstats` is an open source Python library that is part of the |
| 17 | +{doc}`ArviZ project <arviz_org:index>`. |
| 18 | +It acts as a bridge between the [xarray](https://xarray.dev/) |
| 19 | +library for labelled arrays and libraries for raw arrays |
| 20 | +such as [NumPy](https://numpy.org/) or [SciPy](https://scipy.org/). |
| 21 | + |
| 22 | +Xarray has "Compatibility with the broader ecosystem" as |
| 23 | +one of its main {doc}`goals <xarray:getting-started-guide/why-xarray>`, |
| 24 | +which is what allows `xarray-einstats` to perform this |
| 25 | +_bridge_ role with minimal code and duplication. |
| 26 | + |
| 27 | +## Overview |
| 28 | +`xarray-einstats` provides wrappers for: |
| 29 | + |
| 30 | +* Most of the functions in {mod}`numpy.linalg` |
| 31 | +* A subset of {mod}`scipy.stats` |
| 32 | +* `rearrange` and `reduce` from [einops](http://einops.rocks/) |
| 33 | + |
| 34 | +These wrappers have the same names and functionality as the original functions. |
| 35 | +The difference in behaviour is that the wrappers do not make assumptions |
| 36 | +about the meaning of a dimension based on its position, |
| 37 | +nor do they have arguments like `axis` or `axes`. |
| 38 | +Instead, they have a `dims` argument that takes _dimension names_ rather than |
| 39 | +integers indicating the positions of the dimensions on which to act. |
| 40 | + |
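To make the positional assumption concrete, here is a minimal sketch that uses only NumPy (not `xarray-einstats`). The array shapes and variable names are illustrative assumptions. It shows how `numpy.linalg.det` treats the last two axes as the matrix axes, so that reordering the axes breaks the call until the caller restores the expected order by hand.

```python
import numpy as np

rng = np.random.default_rng(0)

# A batch of 5 matrices, each 3x3, laid out in NumPy's (..., M, M) convention.
batched = rng.normal(size=(5, 3, 3))
dets = np.linalg.det(batched)  # one determinant per matrix in the batch

# The same values with the batch axis moved to the end: shape (3, 3, 5).
moved = np.moveaxis(batched, 0, -1)

# NumPy now reads the last two axes, (3, 5), as the matrix axes and fails,
# because axis positions, not names, carry the meaning.
try:
    np.linalg.det(moved)
    positional_call_failed = False
except np.linalg.LinAlgError:
    positional_call_failed = True

# The caller has to restore the expected axis order manually.
dets_restored = np.linalg.det(np.moveaxis(moved, -1, 0))
```

With named dimensions this bookkeeping is unnecessary, since the matrix dimensions are identified by name regardless of where they sit in the array.
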
| 41 | +It also provides a handful of re-implemented functions: |
| 42 | + |
| 43 | +* {func}`xarray_einstats.numba.histogram` |
| 44 | +* {class}`xarray_einstats.stats.multivariate_normal` |
| 45 | + |
| 46 | +These are partially reimplemented because the original functions |
| 47 | +don't yet support multidimensional and/or batched computations. |
| 48 | +Each shares its name with a function in NumPy or SciPy, |
| 49 | +but implements only a subset of its features. |
| 50 | +Moreover, the goal is for these to eventually become wrappers too. |
| 51 | + |
| 52 | + |
| 53 | +## Using `xarray-einstats` |
| 54 | +### DataArray inputs |
| 55 | +Functions in `xarray-einstats` are designed to work on {class}`~xarray.DataArray` objects. |
| 56 | + |
| 57 | +Let's load some example data: |
| 58 | + |
| 59 | +```{code-cell} ipython3 |
| 60 | +from xarray_einstats import linalg, stats, tutorial |
| 61 | + |
| 62 | +da = tutorial.generate_matrices_dataarray(4) |
| 63 | +da |
| 64 | +``` |
| 65 | + |
| 66 | +and show an example: |
| 67 | + |
| 68 | +```{code-cell} ipython3 |
| 69 | +stats.skew(da, dims=["batch", "dim2"]) |
| 70 | +``` |
| 71 | + |
| 72 | +`xarray-einstats` uses `dims` as the argument name throughout the codebase, |
| 73 | +as an alternative to both `axis` and `axes`, |
| 74 | +and also as an alternative to the `(..., M, M)` convention used by NumPy. |
| 75 | + |
| 76 | +The use of `dims` follows {func}`~xarray.dot`, instead of the singular |
| 77 | +`dim` argument used for example in {meth}`~xarray.DataArray.mean`. |
| 78 | +Either a single dimension or multiple dimensions are valid inputs, |
| 79 | +and using `dims` emphasizes the fact that operations |
| 80 | +and reductions can be performed over multiple dimensions at the same time. |
| 81 | +Moreover, in linear algebra functions, `dims` is often restricted |
| 82 | +to a 2 element list as it indicates which dimensions define the matrices, |
| 83 | +interpreting all the others as batch dimensions. |
| 84 | + |
| 85 | +That means that the two calls below are equivalent, even if the dimension |
| 86 | +order of the inputs is not the same, _because their dimension names are the same_. |
| 87 | +Thus, |
| 88 | + |
| 89 | +```{code-cell} ipython3 |
| 90 | +linalg.det(da, dims=["dim", "dim2"]) |
| 91 | +``` |
| 92 | + |
| 93 | +returns the same as: |
| 94 | + |
| 95 | +```{code-cell} ipython3 |
| 96 | +linalg.det(da.transpose("dim2", "experiment", "dim", "batch"), dims=["dim", "dim2"]) |
| 97 | +``` |
| 98 | + |
| 99 | +:::{important} |
| 100 | +In `xarray_einstats` only the dimension names matter, not their order. |
| 101 | +::: |
| 102 | + |
| 103 | +### Dataset and GroupBy inputs |
| 104 | +While the `DataArray` is the base xarray object, there are also |
| 105 | +other xarray objects that are key while using the library. |
| 106 | +These other objects, such as {class}`~xarray.Dataset`, are implemented as |
| 107 | +a collection of `DataArray` objects, and all include a `.map` |
| 108 | +method in order to apply the same function to all their child `DataArray`s. |
| 109 | + |
| 110 | +```{code-cell} ipython3 |
| 111 | +ds = tutorial.generate_mcmc_like_dataset(9438) |
| 112 | +ds |
| 113 | +``` |
| 114 | + |
| 115 | +We can use {meth}`~xarray.Dataset.map` to apply the same function to |
| 116 | +all 4 child `DataArray`s in `ds`, but this will not always be possible. |
| 117 | +When using `.map`, the function provided is applied to all child `DataArray`s |
| 118 | +with the same `**kwargs`. |
| 119 | + |
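As a rough illustration of what "the same `**kwargs`" means, the sketch below builds a small toy dataset and compares `Dataset.map` with an explicit loop over the child variables. The dataset `toy`, its variable names and the helper `center` are hypothetical and only serve this example. The loop is an approximation of the behaviour, not xarray's actual implementation.

```python
import numpy as np
import xarray as xr

# Hypothetical toy dataset; variable and dimension names are illustrative only.
toy = xr.Dataset(
    {
        "a": (("chain", "draw"), np.arange(6.0).reshape(2, 3)),
        "b": (("chain", "draw"), np.ones((2, 3))),
    }
)


def center(da, dims):
    """Subtract the mean taken over the named dimensions (hypothetical helper)."""
    return da - da.mean(dim=dims)


# Dataset.map forwards the same keyword arguments to every child DataArray...
mapped = toy.map(center, dims=["chain", "draw"])

# ...which is roughly equivalent to calling the function once per variable.
manual = xr.Dataset(
    {name: center(var, dims=["chain", "draw"]) for name, var in toy.data_vars.items()}
)
```

Because every child receives identical arguments, any variable that lacks one of the requested dimensions makes the whole call fail.
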
| 120 | +If we try doing: |
| 121 | + |
| 122 | +```{code-cell} ipython3 |
| 123 | +:tags: [raises-exception, hide-output] |
| 124 | + |
| 125 | +ds.map(stats.circmean, dims=("chain", "draw")) |
| 126 | +``` |
| 127 | + |
| 128 | +we get an exception. The `chain` and `draw` dimensions are not present in all |
| 129 | +child `DataArray`s. Instead, we could apply it only to the variables |
| 130 | +that have both `chain` and `draw` dimensions. |
| 131 | + |
| 132 | + |
| 133 | +```{code-cell} ipython3 |
| 134 | +ds_samples = ds[["mu", "sigma", "score"]] |
| 135 | +ds_samples.map(stats.circmean, dims=("chain", "draw")) |
| 136 | +``` |
| 137 | + |
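The subset above was chosen by hand. A minimal sketch of selecting such variables programmatically follows, assuming a hypothetical toy dataset (`toy`, with an invented `x_plot` variable and `plot_dim` dimension) rather than the tutorial data:

```python
import numpy as np
import xarray as xr

# Hypothetical toy dataset: two variables share chain/draw, one does not.
toy = xr.Dataset(
    {
        "mu": (("chain", "draw", "team"), np.zeros((2, 3, 4))),
        "sigma": (("chain", "draw"), np.ones((2, 3))),
        "x_plot": (("plot_dim",), np.linspace(0, 1, 5)),
    }
)

required = {"chain", "draw"}

# Keep only the variables whose dimensions include every required name.
names = [name for name, var in toy.data_vars.items() if required.issubset(var.dims)]
subset = toy[names]
```

The resulting `subset` can then be passed to `.map` with `dims=("chain", "draw")` without hitting the missing-dimension error.
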
| 138 | +:::{attention} |
| 139 | +In general, you should prefer using the `.map` method over passing non-`DataArray` objects |
| 140 | +directly as input to `xarray_einstats` functions. |
| 141 | +`.map` will ensure no unexpected broadcasting between the multiple child `DataArray`s takes place. |
| 142 | +See the examples below. |
| 143 | + |
| 144 | +However, if you are using functions that reduce dimensions on non-`DataArray` inputs |
| 145 | +whose child `DataArray`s all have every dimension to reduce, you will |
| 146 | +not trigger any such broadcasting, |
| 147 | +_and we have included that behaviour in our test suite to ensure it stays this way_. |
| 148 | +::: |
| 149 | + |
| 150 | +It is also possible to do: |
| 151 | + |
| 152 | + |
| 153 | +```{code-cell} ipython3 |
| 154 | +stats.circmean(ds_samples, dims=("chain", "draw")) |
| 155 | +``` |
| 156 | + |
| 157 | +Here, all child `DataArray`s have both the `chain` and `draw` dimensions, |
| 158 | +so as expected, the result is the same. |
| 159 | +There are some cases, however, in which _not_ using `.map` triggers |
| 160 | +some broadcasting operations which will generally not be the desired |
| 161 | +output. |
| 162 | + |
| 163 | +If we use the `.map` method, the function is applied to each |
| 164 | +child `DataArray` independently from the others: |
| 165 | + |
| 166 | + |
| 167 | +```{code-cell} ipython3 |
| 168 | +ds.map(stats.rankdata) |
| 169 | +``` |
| 170 | + |
| 171 | +whereas without using the `.map` method, extra broadcasting can happen: |
| 172 | + |
| 173 | + |
| 174 | +```{code-cell} ipython3 |
| 175 | +stats.rankdata(ds) |
| 176 | +``` |
| 177 | + |
| 178 | +--- |
| 179 | + |
| 180 | +The behaviour on {class}`~xarray.core.groupby.DataArrayGroupBy`, for example, is very similar |
| 181 | +to the examples we have shown for `Dataset`s: |
| 182 | + |
| 183 | + |
| 184 | +```{code-cell} ipython3 |
| 185 | +da = ds["mu"].assign_coords(team=["a", "b", "b", "a", "c", "b"]) |
| 186 | +da |
| 187 | +``` |
| 188 | + |
| 189 | +When we apply a "group by" operation over the `team` dimension, we generate a |
| 190 | +`DataArrayGroupBy` with 3 groups, |
| 191 | + |
| 192 | +```{code-cell} ipython3 |
| 193 | +gb = da.groupby("team") |
| 194 | +gb |
| 195 | +``` |
| 196 | + |
| 197 | +on which we can use `.map` to apply a function from `xarray-einstats` over |
| 198 | +all groups independently: |
| 199 | + |
| 200 | +```{code-cell} ipython3 |
| 201 | +gb.map(stats.median_abs_deviation, dims=["draw", "team"]) |
| 202 | +``` |
| 203 | + |
| 204 | +which as expected has performed the operation group-wise, yielding a different |
| 205 | +result than either |
| 206 | + |
| 207 | +```{code-cell} ipython3 |
| 208 | +stats.median_abs_deviation(da, dims=["draw", "team"]) |
| 209 | +``` |
| 210 | + |
| 211 | +or |
| 212 | + |
| 213 | +```{code-cell} ipython3 |
| 214 | +stats.median_abs_deviation(da, dims="draw") |
| 215 | +``` |
| 216 | + |
| 217 | +:::{seealso} |
| 218 | +Check out the {ref}`xarray:groupby` page in xarray's documentation. |
| 219 | +::: |