
Commit e3b1a53

Add zonal stats user story for #428

1 parent: 091e73d

3 files changed: +196 −0 lines


ci/docs.yml (+1)

```diff
@@ -15,6 +15,7 @@ dependencies:
   - matplotlib-base
   - myst-parser
   - myst-nb
+  - sparse
   - sphinx
   - sphinx-remove-toctrees
   - furo>=2024.08
```

docs/source/user-stories.md (+1)

````diff
@@ -10,4 +10,5 @@
 user-stories/climatology-hourly-cubed.ipynb
 user-stories/custom-aggregations.ipynb
 user-stories/nD-bins.ipynb
+user-stories/large-zonal-stats.ipynb
 ```
````
docs/source/user-stories/large-zonal-stats.ipynb (new file, +194 lines)

# Large Raster Zonal Statistics

"Zonal statistics" spans a large range of problems.

This one is inspired by [this issue](https://github.com/xarray-contrib/flox/issues/428), where a raster of cell areas is aggregated over six different groupers and summed. Each array involved has shape 560_000 x 1440_000 and chunk size 10_000 x 10_000. Three of the groupers (`tcl_year`, `drivers`, and `tcd_thresholds`) have a small number of group labels (23, 5, and 7 respectively).

The remaining three groupers are [GADM](https://gadm.org/) level 0, 1, and 2 administrative area polygons rasterized to this grid, with 248, 86, and 854 unique labels respectively (arrays `adm0`, `adm1`, and `adm2`). These correspond to country-level, state-level, and county-level administrative boundaries.
## Example dataset

Here is a representative version of the dataset (in terms of size and chunk sizes).
```python
import dask.array
import numpy as np
import xarray as xr

from flox.xarray import xarray_reduce

sizes = {"y": 560_000, "x": 1440_000}
chunksizes = {"y": 2_000, "x": 2_000}
dims = ("y", "x")
shape = tuple(sizes[d] for d in dims)
chunks = tuple(chunksizes[d] for d in dims)

ds = xr.Dataset(
    {
        "areas": (dims, dask.array.ones(shape, chunks=chunks, dtype=np.float32)),
        "tcl_year": (
            dims,
            1 + dask.array.zeros(shape, chunks=chunks, dtype=np.float32),
        ),
        "drivers": (dims, 2 + dask.array.zeros(shape, chunks=chunks, dtype=np.float32)),
        "tcd_thresholds": (
            dims,
            3 + dask.array.zeros(shape, chunks=chunks, dtype=np.float32),
        ),
        "adm0": (dims, 4 + dask.array.ones(shape, chunks=chunks, dtype=np.float32)),
        "adm1": (dims, 5 + dask.array.zeros(shape, chunks=chunks, dtype=np.float32)),
        "adm2": (dims, 6 + dask.array.zeros(shape, chunks=chunks, dtype=np.float32)),
    }
)
ds
```
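For a sense of scale, the number of chunks per variable at these sizes can be computed with simple arithmetic (a back-of-envelope sketch added here, not part of the original notebook):

```python
import math

# chunk grid implied by the sizes and chunksizes above
sizes = {"y": 560_000, "x": 1440_000}
chunksizes = {"y": 2_000, "x": 2_000}
nchunks = math.prod(
    math.ceil(sizes[d] / chunksizes[d]) for d in ("y", "x")
)
print(nchunks)  # 280 * 720 = 201_600 chunks per variable
```

So every one of the seven variables carries roughly 200,000 tasks before any reduction happens, which is why the choice of aggregation strategy matters here.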
## Zonal Statistics

Next, define the grouper arrays and the expected group labels.
```python
by = (ds.tcl_year, ds.drivers, ds.tcd_thresholds, ds.adm0, ds.adm1, ds.adm2)
expected_groups = (
    np.arange(23),
    np.arange(1, 6),
    np.arange(1, 8),
    np.arange(248),
    np.arange(86),
    np.arange(854),
)
```
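Conceptually, grouping by several integer arrays at once is equivalent to grouping by a single flattened code over the outer product of their label sets. A minimal numpy sketch of that equivalence (an illustration only, not flox internals):

```python
import numpy as np

# two toy groupers with labels in [0, 3)
g1 = np.array([0, 1, 1, 2])
g2 = np.array([1, 0, 2, 2])

# combine them into one flat code over the 3 x 3 output grid
flat = np.ravel_multi_index((g1, g2), (3, 3))

# summing by the flat code and reshaping matches grouping by (g1, g2) jointly
vals = np.ones(4)
dense = np.bincount(flat, weights=vals, minlength=9).reshape(3, 3)
```

The output grid is the outer product of the label sets, so its size multiplies quickly as groupers are added, which is exactly what happens with the six groupers above.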
```python
result = xarray_reduce(
    ds.areas,
    *by,
    expected_groups=expected_groups,
    func="sum",
)
result
```
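For scale, the dense output grid implied by `expected_groups` can be sized with simple arithmetic (a sketch added here for illustration):

```python
import numpy as np

# number of cells in the dense 23 x 5 x 7 x 248 x 86 x 854 output grid
ngroups = (23, 5, 7, 248, 86, 854)
nelements = int(np.prod(ngroups, dtype=np.int64))
nbytes = nelements * np.dtype(np.float32).itemsize
print(f"{nelements:_} elements, ~{nbytes / 1e9:.0f} GB as dense float32")
```

Almost all of those cells are empty, which is why materializing the dense result produces one enormous, mostly-empty chunk.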
Formulating the three admin levels as orthogonal dimensions is quite wasteful: not every country has 86 states, and not every state has 854 counties.

We end up with one humongous 56GB chunk that is mostly empty.

## We can do better using a sparse array

Since the results are very sparse, we can instruct flox to avoid constructing dense arrays of intermediate results on the full 23 x 5 x 7 x 248 x 86 x 854 output grid:

```python
ReindexStrategy(
    # do not reindex to the full output grid at the blockwise aggregation stage
    blockwise=False,
    # when combining intermediate results after blockwise aggregation, reindex to the
    # common grid using a sparse.COO array type
    array_type=ReindexArrayType.SPARSE_COO,
)
```
```python
from flox import ReindexArrayType, ReindexStrategy

result = xarray_reduce(
    ds.areas,
    *by,
    expected_groups=expected_groups,
    func="sum",
    reindex=ReindexStrategy(
        blockwise=False,
        array_type=ReindexArrayType.SPARSE_COO,
    ),
)
result
```
The output is a sparse array (see the **Data type** section)! Note that the size of this array cannot be estimated without computing it.

The computation runs smoothly with low memory.
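To see why the sparse representation is so much cheaper, compare the dense footprint with a rough COO footprint for an assumed number of stored elements. Here `nnz` is a made-up figure for illustration, and the accounting is approximate: a COO array stores one value plus one coordinate per dimension per stored element.

```python
import numpy as np

ngroups = (23, 5, 7, 248, 86, 854)
dense_bytes = int(np.prod(ngroups, dtype=np.int64)) * np.dtype(np.float32).itemsize

nnz = 1_000_000  # assumed count of non-empty group combinations
value_bytes = np.dtype(np.float32).itemsize
coord_bytes = len(ngroups) * np.dtype(np.int64).itemsize  # one int64 coord per dim
coo_bytes = nnz * (value_bytes + coord_bytes)

print(f"dense/COO size ratio: ~{dense_bytes / coo_bytes:.0f}x")
```

Even with a generous `nnz`, the sparse intermediate is several orders of magnitude smaller than the dense grid, which is what keeps memory low during the combine stage.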
