generated from r4ds/bookclub-template
-
Notifications
You must be signed in to change notification settings - Fork 4
/
Copy path10_statistical-modelling-of-spatial-data.Rmd
454 lines (312 loc) · 9.85 KB
/
10_statistical-modelling-of-spatial-data.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
# Statistical modelling of spatial data
**Learning objectives:**
- quantifying relationships between variables
- predicting the outcome of observations
- applying a model-based approach
## Mapping with non-spatial regression and ML models
Let's start with loading the **North Carolina SIDS data** - `nc` spatial data.
```{r message=FALSE,warning=FALSE}
library(tidyverse)
library(sf)
```
```{r}
nc <- system.file("gpkg/nc.gpkg", package="sf") |>
read_sf()
```
It contains information about **100 counties of North Carolina**, and includes **counts of numbers of live births** (also non-white live births) and **numbers of sudden infant deaths**, for the July 1, 1974 to June 30, 1978 and July 1, 1979 to June 30, 1984 periods.
We are interested in:
- Sudden Infant Death Syndrome, 1974-78 (SIDS)
- Live births, 1974-78 (BIR74)
- Non-white births, 1974-78 (NWBIR74)
Useful Resources:
- Data description: <https://geodacenter.github.io/data-and-lab/sids/>
- Vignette: <https://r-spatial.github.io/spdep/articles/sids.html>
- More datasets: <https://geodacenter.github.io/data-and-lab/>
```{r}
nc %>% names
```
```{r}
nc %>%
dplyr::select(SID74, BIR74, NWBIR74) %>%
head()
```
This type of data can be manipulated as needed to make the model analysis.
```{r}
nc1 <- nc |> mutate(SID = SID74 / BIR74, NWB = NWBIR74 / BIR74)
nc1 %>% dplyr::select(SID, NWB) %>% head()
```
Making a model with spatial data is as the same as with dealing with dataframes.
$$\text{observed = explainer + remainder}$$
$$Y= (\beta_0+\beta_1x) +\epsilon$$
The first step is to apply a linear regression model to our spatial data.
```{r}
fit <- lm(SID ~ NWB, nc1)
broom::tidy(fit)
```
The result of the `prediction()` function releases the `fit`, `lwr` and `upr` vectors.
```{r}
pr <- fit %>%
predict(nc1, interval = "prediction")
bind_cols(nc, pr) |> names()
```
Cross-validation techniques can be applied taking consideration of the spatial data autocorrelation, which is mostly due to proximity of spatial points or locations with same structure. Due to this, classical methods of applying cross-validation on spatial data might lead to overestimating results, in order to overcome this more specialized cross-validation techinques have been provided to deal well with spatial data, such as:
- [spatialsample](https://spatialsample.tidymodels.org/) (Silge and Mahoney 2023)
- [CAST](https://cran.r-project.org/web/packages/CAST/vignettes/cast01-CAST-intro.html) (Meyer, Milà, and Ludwig 2023)
- [mlr3spatial](https://mlr3spatial.mlr-org.com/) (Becker and Schratz 2022)
- [mlr3spatiotempcv](https://mlr3spatiotempcv.mlr-org.com/articles/mlr3spatiotempcv.html) (Schratz and Becker 2022)
## Support and statistical modelling
Support of data plays a lead role in the statistical analysis of spatial data:
- a `constant` value for every point of the geometry
- a single value that is an `aggregate` over all points of the geometry
- a value that is unique to only this geometry, describing its `identity`
## Time in predictive models
Statistical analysis of spatiotemporal data proceeds either by:
- reducing time, then working on the problem spatially (time first, space later)
- reducing space, then working on the problem temporally (space first, time later)
## Design-based and model-based inference
> Statistical inference means the action of estimating parameters about a population from sample data.
Two possibilities to proceed:
- model-based (assumes a superpopulation model)
- design-based (assumes randomness in the locations - unweighted sample mean is used to estimate the population mean, and no model parameters need to be fit)
The model-based is best for:
- predictions are required for small areas to be sampled
- available data were not collected randomly
Design-based is best for:
- observations were collected using a spatial random sampling process
- needs for data aggregation
- not sensitive estimates, i.e. for regulatory or legal purposes
## Predictive models with coordinates
- models should also not be sensitive to arbitrary rotations of the land or latitude
- assume sample data to be independent
- allow for spatial and/or temporal autocorrelation of residuals (Adaptive cross-validation measures such as spatial cross-validation may help getting more relevant measures for predictive performance.)
## Exercises
## Exercise 1
Use a `random forest model` to predict SID values (e.g. using package randomForest), and plot the random forest predictions against observations, along with the line.
```{r}
library(randomForest) |> suppressPackageStartupMessages()
```
```{r}
r = randomForest(SID ~ NWB, nc1)
```
```{r}
nc1$rf = predict(r)
```
```{r}
nc2 <- nc1%>%
dplyr::select(SID,rf)
nc2%>%head
```
```{r}
nc2 %>%
ggplot(aes(SID,rf))+
geom_point(shape=21,stroke=0.5)+
geom_abline()
```
### Exercise 2
Create a new dataset by randomly sampling 1000 points from the nc dataset, and rerun the linear regression model of section 10.2 on this dataset. What has changed?
```{r}
pts = st_sample(nc, 1000)
nc3 = st_intersection(nc1, pts)
```
```{r}
fit |> summary()
```
```{r}
fit2 <- lm(SID ~ NWB, nc3)
fit2|> summary()
```
> the standard error has decreased with a factor 3 (sqrt(10))
```{r}
lm(SID ~ NWB, nc1) |>
predict(nc1, interval = "prediction") -> pr1
lm(SID ~ NWB, nc3) |>
predict(nc1, interval = "prediction") -> pr2
mean(pr1[,"upr"] - pr1[,"lwr"])
# [1] 0.005161177
mean(pr2[,"upr"] - pr2[,"lwr"])
# [1] 0.004992217
```
```{r}
lm(SID ~ NWB, nc1) |>
predict(nc1, interval = "confidence") -> pr1
lm(SID ~ NWB, nc3) |>
predict(nc1, interval = "confidence") -> pr2
mean(pr1[,"upr"] - pr1[,"lwr"])
# [1] 0.0007025904
mean(pr2[,"upr"] - pr2[,"lwr"])
# [1] 0.000221525
```
### Exercise 3
Do the `water-land classification` using `class::knn`.
```{r}
tif <- system.file("tif/L7_ETMs.tif", package = "stars")
```
```{r}
library(stars)
# r <- read_stars(tif)
(r <- read_stars(tif))
```
```{r}
st_bbox(r)
```
```{r}
st_bbox(r) |>
st_as_sfc()
```
```{r}
set.seed(115517)
pts2 <- st_bbox(r) |>
st_as_sfc() |>
st_sample(20)
```
```{r}
(e <- st_extract(r, pts2))
```
```{r}
r%>%class
```
```{r}
r%>%dim
```
```{r}
st_as_sf(e)
```
```{r}
e_df <- st_as_sf(e) |>
st_coordinates() %>%
as_tibble() %>%
mutate(label=c(1:20),
group=ifelse(label%in%c(8, 14, 15, 18, 19),"land","water"))
e_df%>%head
```
```{r}
ggplot()+
stars::geom_stars(data=r)+
geom_text(data=e_df,
mapping=aes(x=X,y=Y,
label=label,
group=group,
color=group),
check_overlap = T,
inherit.aes = F)+
scale_color_manual(values=c("red","yellow"))+
coord_equal()
```
```{r}
rs <- split(r)
rs
```
```{r}
trn <- rs %>%
st_extract(pts2)
```
```{r}
trn$cls <- rep("land", 20)
trn$cls[c(8, 14, 15, 18, 19)] <- "water"
```
```{r}
library(class)
tr <- as.data.frame(trn) |> dplyr::select(X1, X2, X3, X4, X5, X6)
test <- as.data.frame(rs) |> dplyr::select(X1, X2, X3, X4, X5, X6)
```
```{r}
rs$cls = knn(tr, test, trn$cl, k = 5)
plot(rs["cls"])
```
### Exercise 4
For the `linear model` using `nc` and for the `knn example` of the previous exercise, add a first and a second order linear model in the spatial coordinates and compare the results (use st_centroid to obtain polygon centroids, and st_coordinates to extract the x and y coordinates in matrix form).
```{r}
nc1
```
```{r}
cc <- st_centroid(nc1) |>
st_coordinates()
cc
```
```{r}
# geometries
nc4 <- bind_cols(nc1, cc) |>
transmute(X=X, Y=Y, SID=SID, NWB=NWB)
```
```{r}
(lm0 <- lm(SID ~ NWB, nc1)) |> summary()
```
```{r}
(lm1 <- lm(SID ~ NWB + X + Y, nc4)) |> summary()
```
```{r}
(lm2 <- lm(SID ~ NWB+X+Y+ I(X^2) + I(Y^2)+ X*Y, nc4)) |> summary()
```
```{r}
nc1$prediction_0 <- lm0 |> predict(nc4)
nc1$prediction_1 <- lm1 |> predict(nc4)
nc1$prediction_2 <- lm2 |> predict(nc4)
```
```{r}
nc5 <- nc1 %>%
dplyr::select(prediction_0,prediction_1,prediction_2) |>
#pivot_longer(cols = c("pr0","pr1","pr2"))%>%
st_as_stars() |>
merge()
nc5
```
```{r}
ggplot()+
geom_stars(data=nc5)+
facet_wrap(~attributes,ncol = 1) +
scale_fill_viridis_c()+
labs(fill="")+
theme_void()+
theme(legend.position = "bottom",
legend.text = element_text(size=3))
```
```{r}
tr1 <- cbind(as.data.frame(trn),
st_coordinates(trn)) |>
dplyr::select(X, Y, X1, X2, X3, X4, X5, X6)
```
```{r}
test1 <- as.data.frame(rs) |>
transmute(X=x, Y=y, X1, X2, X3, X4, X5, X6)
```
```{r}
rs$cls1 = knn(tr1, test1, trn$cl, k = 5)
```
```{r}
rs
```
```{r}
tr2 <- cbind(as.data.frame(trn), st_coordinates(trn)) |>
transmute(X, Y, X2=X^2, Y2=Y^2, XY=X*Y, X1, X2, X3, X4, X5, X6)
```
```{r}
test2 <- as.data.frame(rs) |>
transmute(X=x, Y=y, X2=X^2, Y2=Y^2, XY=X*Y, X1, X2, X3, X4, X5, X6)
```
```{r}
rs$cls2 = knn(tr2, test2, trn$cl, k = 5)
```
```{r}
rs[c("cls", "cls1", "cls2")] |> merge() |> plot()
```
Soultions: <https://edzer.github.io/sdsr_exercises/>
## Meeting Videos {-}
### Cohort 1 {-}
`r knitr::include_url("https://www.youtube.com/embed/iVMgUV_9-Vs")`
<details>
<summary> Meeting chat log </summary>
```
00:06:39 Keuntae’s iPad: start
00:10:16 Federica Gazzelloni: https://geodacenter.github.io/data-and-lab/sids/
00:10:31 Federica Gazzelloni: https://r-spatial.github.io/spdep/articles/sids.html
00:10:40 Federica Gazzelloni: https://geodacenter.github.io/data-and-lab/
00:21:54 Keuntae’s iPad: I cannot hear your voice clearly.
00:25:13 Federica Gazzelloni: https://spatialsample.tidymodels.org/
00:25:18 Federica Gazzelloni: https://cran.r-project.org/web/packages/CAST/vignettes/cast01-CAST-intro.html
00:25:23 Federica Gazzelloni: https://mlr3spatial.mlr-org.com/
00:25:28 Federica Gazzelloni: https://mlr3spatiotempcv.mlr-org.com/articles/mlr3spatiotempcv.html
01:04:58 Federica Gazzelloni: end
01:05:03 Derek Sollberger (he/his): thank you!
01:05:35 Federica Gazzelloni: https://edzer.github.io/sdsr_exercises/
```
</details>