Add zonal stats user story for #428

dcherian · dcherian · commit e3b1a53e98d9 · 2025-04-04T23:34:15.000-06:00
diff --git a/ci/docs.yml b/ci/docs.yml
@@ -15,6 +15,7 @@ dependencies:
   - matplotlib-base
   - myst-parser
   - myst-nb
+  - sparse
   - sphinx
   - sphinx-remove-toctrees
   - furo>=2024.08
diff --git a/docs/source/user-stories.md b/docs/source/user-stories.md
@@ -10,4 +10,5 @@
    user-stories/climatology-hourly-cubed.ipynb
    user-stories/custom-aggregations.ipynb
    user-stories/nD-bins.ipynb
+   user-stories/large-zonal-stats.ipynb
 ```
diff --git a/docs/source/user-stories/large-zonal-stats.ipynb b/docs/source/user-stories/large-zonal-stats.ipynb
@@ -0,0 +1,194 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "0",
+   "metadata": {},
+   "source": [
+    "# Large Raster Zonal Statistics\n",
+    "\n",
+    "\"Zonal statistics\" spans a large range of problems. \n",
+    "\n",
+    "This one is inspired by [this issue](https://github.com/xarray-contrib/flox/issues/428), where a cell areas raster is aggregated over 6 different groupers and summed. Each array involved has shape 560_000 x 1440_000 and chunk size 10_000 x 10_000. Three of the groupers `tcl_year`, `drivers`, and `tcd_thresholds` have a small number of group labels (23, 5, and 7). \n",
+    "\n",
+    "The last 3 groupers are [GADM](https://gadm.org/) level 0, 1, 2 administrative area polygons rasterized to this grid; with 248, 86, and 854 unique labels respectively (arrays `adm0`, `adm1`, and `adm2`). These correspond to country-level, state-level, and county-level administrative boundaries. "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1",
+   "metadata": {},
+   "source": [
+    "## Example dataset"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2",
+   "metadata": {},
+   "source": [
+    "Here is a representative version of the dataset (in terms of size and chunk sizes)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "3",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import dask.array\n",
+    "import numpy as np\n",
+    "import xarray as xr\n",
+    "\n",
+    "from flox.xarray import xarray_reduce\n",
+    "\n",
+    "sizes = {\"y\": 560_000, \"x\": 1440_000}\n",
+    "chunksizes = {\"y\": 2_000, \"x\": 2_000}\n",
+    "dims = (\"y\", \"x\")\n",
+    "shape = tuple(sizes[d] for d in dims)\n",
+    "chunks = tuple(chunksizes[d] for d in dims)\n",
+    "\n",
+    "ds = xr.Dataset(\n",
+    "    {\n",
+    "        \"areas\": (dims, dask.array.ones(shape, chunks=chunks, dtype=np.float32)),\n",
+    "        \"tcl_year\": (\n",
+    "            dims,\n",
+    "            1 + dask.array.zeros(shape, chunks=chunks, dtype=np.float32),\n",
+    "        ),\n",
+    "        \"drivers\": (dims, 2 + dask.array.zeros(shape, chunks=chunks, dtype=np.float32)),\n",
+    "        \"tcd_thresholds\": (\n",
+    "            dims,\n",
+    "            3 + dask.array.zeros(shape, chunks=chunks, dtype=np.float32),\n",
+    "        ),\n",
+    "        \"adm0\": (dims, 4 + dask.array.ones(shape, chunks=chunks, dtype=np.float32)),\n",
+    "        \"adm1\": (dims, 5 + dask.array.zeros(shape, chunks=chunks, dtype=np.float32)),\n",
+    "        \"adm2\": (dims, 6 + dask.array.zeros(shape, chunks=chunks, dtype=np.float32)),\n",
+    "    }\n",
+    ")\n",
+    "ds"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4",
+   "metadata": {},
+   "source": [
+    "## Zonal Statistics"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5",
+   "metadata": {},
+   "source": [
+    "Next define the grouper arrays and expected group labels"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "6",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "by = (ds.tcl_year, ds.drivers, ds.tcd_thresholds, ds.adm0, ds.adm1, ds.adm2)\n",
+    "expected_groups = (\n",
+    "    np.arange(23),\n",
+    "    np.arange(1, 6),\n",
+    "    np.arange(1, 8),\n",
+    "    np.arange(248),\n",
+    "    np.arange(86),\n",
+    "    np.arange(854),\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "result = xarray_reduce(\n",
+    "    ds.areas,\n",
+    "    *by,\n",
+    "    expected_groups=expected_groups,\n",
+    "    func=\"sum\",\n",
+    ")\n",
+    "result"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8",
+   "metadata": {},
+   "source": [
+    "Formulating the three admin levels as orthogonal dimensions is quite wasteful --- not all countries have 86 states or 854 counties per state. \n",
+    "\n",
+    "We end up with one humoungous 56GB chunk, that is mostly empty.\n",
+    "\n",
+    "## We can do better using a sparse array\n",
+    "\n",
+    "Since the results are very sparse, we can instruct flox to constructing dense arrays of intermediate results on the full 23 x 5 x 7 x 248 x 86 x 854 output grid.\n",
+    "\n",
+    "```python\n",
+    "ReindexStrategy(\n",
+    "    # do not reindex to the full output grid at the blockwise aggregation stage\n",
+    "    blockwise=False,\n",
+    "    # when combining intermediate results after blockwise aggregation, reindex to the\n",
+    "    # common grid using a sparse.COO array type\n",
+    "    array_type=ReindexArrayType.SPARSE_COO\n",
+    ")\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "9",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from flox import ReindexArrayType, ReindexStrategy\n",
+    "\n",
+    "result = xarray_reduce(\n",
+    "    ds.areas,\n",
+    "    *by,\n",
+    "    expected_groups=expected_groups,\n",
+    "    func=\"sum\",\n",
+    "    reindex=ReindexStrategy(\n",
+    "        blockwise=False,\n",
+    "        array_type=ReindexArrayType.SPARSE_COO,\n",
+    "    ),\n",
+    ")\n",
+    "result"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "10",
+   "metadata": {},
+   "source": [
+    "The output is a sparse array (see the **Data type** section)! Note that the size of this array cannot be estimated without computing it.\n",
+    "\n",
+    "The computation runs smoothly with low memory."
+   ]
+  }
+ ],
+ "metadata": {
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}