Commit 2e3447a

Add zonal stats user story for #428
1 parent 091e73d commit 2e3447a

1 file changed: +194 -0 lines changed
@@ -0,0 +1,194 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "0",
"metadata": {},
"source": [
"# Large Raster Zonal Statistics\n",
"\n",
"\"Zonal statistics\" spans a large range of problems.\n",
"\n",
"This one is inspired by [this issue](https://github.com/xarray-contrib/flox/issues/428), where a cell-areas raster is summed over 6 different groupers. Each array involved has shape 560_000 x 1440_000 and chunk size 10_000 x 10_000. Three of the groupers, `tcl_year`, `drivers`, and `tcd_thresholds`, have a small number of group labels (23, 5, and 7 respectively).\n",
"\n",
"The last three groupers are [GADM](https://gadm.org/) level 0, 1, and 2 administrative area polygons rasterized to this grid, with 248, 86, and 854 unique labels respectively (arrays `adm0`, `adm1`, and `adm2`). These correspond to country-level, state-level, and county-level administrative boundaries."
]
},
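{
"cell_type": "markdown",
"id": "0a",
"metadata": {},
"source": [
"As background, grouper rasters like `adm0` are typically produced by burning polygon labels onto the target grid. Here is a minimal, hypothetical sketch of that step (it is not part of this workflow, and assumes `geopandas`, `rasterio`, a file `gadm0.gpkg`, and an integer ID column `uid`); at this grid size one would rasterize chunk-by-chunk rather than in a single call:\n",
"\n",
"```python\n",
"import geopandas as gpd\n",
"from rasterio import features, transform\n",
"\n",
"gdf = gpd.read_file(\"gadm0.gpkg\")  # hypothetical GADM level-0 polygons\n",
"adm0 = features.rasterize(\n",
"    zip(gdf.geometry, gdf[\"uid\"]),  # (polygon, label) pairs to burn in\n",
"    out_shape=(560_000, 1_440_000),  # the target grid from above\n",
"    transform=transform.from_bounds(-180, -90, 180, 90, 1_440_000, 560_000),\n",
"    fill=0,  # background label for pixels outside every polygon\n",
")\n",
"```"
]
},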
{
"cell_type": "code",
"execution_count": null,
"id": "1",
"metadata": {},
"outputs": [],
"source": [
"import dask.array\n",
"import numpy as np\n",
"import xarray as xr\n",
"\n",
"from flox.xarray import xarray_reduce"
]
},
{
"cell_type": "markdown",
"id": "2",
"metadata": {},
"source": [
"Here is a representative version of the dataset, matching the array sizes above (for this example we use smaller 2,000 x 2,000 chunks)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3",
"metadata": {},
"outputs": [],
"source": [
"sizes = {\"y\": 560_000, \"x\": 1440_000}\n",
"chunksizes = {\"y\": 2_000, \"x\": 2_000}\n",
"dims = (\"y\", \"x\")\n",
"shape = tuple(sizes[d] for d in dims)\n",
"chunks = tuple(chunksizes[d] for d in dims)\n",
"\n",
"# constant-valued stand-ins for the real rasters, at realistic shape and chunking\n",
"ds = xr.Dataset(\n",
"    {\n",
"        \"areas\": (dims, dask.array.ones(shape, chunks=chunks, dtype=np.float32)),\n",
"        \"tcl_year\": (\n",
"            dims,\n",
"            1 + dask.array.zeros(shape, chunks=chunks, dtype=np.float32),\n",
"        ),\n",
"        \"drivers\": (dims, 2 + dask.array.zeros(shape, chunks=chunks, dtype=np.float32)),\n",
"        \"tcd_thresholds\": (\n",
"            dims,\n",
"            3 + dask.array.zeros(shape, chunks=chunks, dtype=np.float32),\n",
"        ),\n",
"        \"adm0\": (dims, 4 + dask.array.zeros(shape, chunks=chunks, dtype=np.float32)),\n",
"        \"adm1\": (dims, 5 + dask.array.zeros(shape, chunks=chunks, dtype=np.float32)),\n",
"        \"adm2\": (dims, 6 + dask.array.zeros(shape, chunks=chunks, dtype=np.float32)),\n",
"    }\n",
")\n",
"ds"
]
},
{
"cell_type": "markdown",
"id": "4",
"metadata": {},
"source": [
"Next, define the grouper arrays and the expected group labels."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5",
"metadata": {},
"outputs": [],
"source": [
"by = (ds.tcl_year, ds.drivers, ds.tcd_thresholds, ds.adm0, ds.adm1, ds.adm2)\n",
"expected_groups = (\n",
"    np.arange(23),  # tcl_year\n",
"    np.arange(1, 6),  # drivers\n",
"    np.arange(1, 8),  # tcd_thresholds\n",
"    np.arange(248),  # adm0\n",
"    np.arange(86),  # adm1\n",
"    np.arange(854),  # adm2\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6",
"metadata": {},
"outputs": [],
"source": [
"result = xarray_reduce(\n",
"    ds.areas,\n",
"    *by,\n",
"    expected_groups=expected_groups,\n",
"    func=\"sum\",\n",
")\n",
"result"
]
},
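{
"cell_type": "markdown",
"id": "6a",
"metadata": {},
"source": [
"How big is that dense output? A quick sanity check (added here as a sketch, not part of the original workflow): `result` is still lazy, so dask reports its dense size via `.nbytes` without computing anything."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6b",
"metadata": {},
"outputs": [],
"source": [
"import math\n",
"\n",
"# cells on the full output grid: 23 x 5 x 7 x 248 x 86 x 854\n",
"ncells = math.prod(len(g) for g in expected_groups)\n",
"print(f\"{ncells:,} float32 cells\")\n",
"# the dense size dask would materialize: ~55 GiB\n",
"print(f\"{result.nbytes / 2**30:.1f} GiB\")"
]
},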
{
"cell_type": "markdown",
"id": "7",
"metadata": {},
"source": [
"Formulating the three admin levels as orthogonal dimensions is quite wasteful --- not all countries have 86 states or 854 counties per state.\n",
"\n",
"We end up with one humongous ~55 GiB chunk that is mostly empty.\n",
"\n",
"We can do better. Since the results are very sparse, we can instruct flox to avoid constructing dense arrays of intermediate results on the full 23 x 5 x 7 x 248 x 86 x 854 output grid:\n",
"\n",
"```python\n",
"ReindexStrategy(\n",
"    # do not reindex to the full output grid at the blockwise aggregation stage\n",
"    blockwise=False,\n",
"    # when combining intermediate results after blockwise aggregation, reindex to the\n",
"    # common grid using a sparse.COO array type\n",
"    array_type=ReindexArrayType.SPARSE_COO,\n",
")\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8",
"metadata": {},
"outputs": [],
"source": [
"from flox import ReindexArrayType, ReindexStrategy\n",
"\n",
"result = xarray_reduce(\n",
"    ds.areas,\n",
"    *by,\n",
"    expected_groups=expected_groups,\n",
"    func=\"sum\",\n",
"    reindex=ReindexStrategy(\n",
"        blockwise=False,\n",
"        array_type=ReindexArrayType.SPARSE_COO,\n",
"    ),\n",
")\n",
"result"
]
},
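{
"cell_type": "markdown",
"id": "8a",
"metadata": {},
"source": [
"Before computing, we can peek at the dask array's `_meta` to confirm the result chunks will be `sparse.COO` rather than dense `numpy` arrays (a quick sketch relying on dask's `_meta` convention):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8b",
"metadata": {},
"outputs": [],
"source": [
"# the meta array records the container type of each chunk\n",
"type(result.data._meta)"
]
},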
{
"cell_type": "markdown",
"id": "9",
"metadata": {},
"source": [
"The output is a sparse array! Note that the number of non-zero elements, and hence the memory footprint, cannot be known without computing it.\n",
"\n",
"The computation runs smoothly with low memory."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "10",
"metadata": {},
"outputs": [],
"source": [
"result.compute()"
]
},
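{
"cell_type": "markdown",
"id": "10a",
"metadata": {},
"source": [
"Once computed, the underlying `sparse.COO` array reports how many cells were actually stored. A short sketch (it assumes the computed DataArray wraps a `sparse.COO`, which exposes `.nnz` and `.density`):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "10b",
"metadata": {},
"outputs": [],
"source": [
"computed = result.compute()\n",
"arr = computed.data  # the underlying sparse.COO\n",
"print(f\"{arr.nnz:,} stored values, density {arr.density:.2e}\")"
]
}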
],
"metadata": {
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
