---
title: Spatial Data Processing with R
author:
  - name: Jérôme Guélat
    degrees: PhD
    orcid: 0000-0003-1438-4378
    affiliation:
      - name: Swiss Ornithological Institute
        url: https://www.vogelwarte.ch/en
date: today
license: "CC BY-NC-ND 4.0"
copyright:
  holder: Jérôme Guélat
  year: 2023
citation:
  publisher: Swiss Ornithological Institute
  issued: 2023
  type: report
  url: https://www.github.com/jguelat/R-GIS
keywords:
- GIS
- R Spatial
- Spatial analyses
- Geoprocessing
- Cartography
- Static maps
- Dynamic maps
bibliography: references.bib
csl: american-medical-association.csl
title-block-banner: true
format: html
page-layout: full
number-sections: true
toc: true
toc-depth: 4
toc-location: left
toc-title: ""
code-fold: false
#theme: sketchy
---
```{r}
#| echo: false
#| output: false
set.seed(555)
Sys.setenv(PROJ_LIB = "")
```
## Preface
The first version of this tutorial was created for a course given at the [Swiss Ornithological Institute](https://www.vogelwarte.ch/en){target="_blank"} in 2023. Researchers and students in ecology were the original audience but the whole material can be used by anybody who wants to learn more about how to perform GIS analyses and design maps with the R statistical programming language. The tutorial was heavily updated in 2025.
My goal was to combine a very general and basic introduction to GIS with some sort of cookbook showing how to perform common GIS analyses and create maps with R. The introductory sections should provide enough information for readers with little or no GIS experience to understand the rest of the material. You're welcome to skip them, especially if you already understand GIS data models and how GIS data is stored in R.
The cookbook part of the tutorial is a collection of analyses and mapping techniques that I regularly use in my job. Most of them are standard GIS procedures, but you'll also find more advanced topics. There are two ways to run the code snippets in this tutorial. You can either run everything, starting with the first section and continuing with the others. Or, if you prefer the cookbook approach, I also included a collapsible code snippet at the start of each section that will allow you to get all the required data and load packages for this section only.
You will notice that the amount of content (examples, explanations, exercises, etc.) varies from section to section. I'm sorry about this; it's purely due to a lack of time on my side... I promise to do my best to add more content soon (hopefully).
None of this could have been written without the incredible work done by the R-spatial community. I'd especially like to thank all the software developers who created and are maintaining the R software, the R packages and the open-source libraries used in these packages!
## Introduction
### R and GIS
You've probably already used a more traditional GIS software such as QGIS or ArcGIS Pro, and you've maybe even heard that many GIS specialists use the Python programming language. So why would you start using R, a statistical software, to perform your GIS analyses and make maps? I will only outline a few advantages and disadvantages in this introduction.
A lot of the ecological data you're going to analyse has a spatial component, and it's convenient to perform everything in the same software. You will not need to transfer files and everything will be in the right format for subsequent statistical analyses. Moreover, most of you already know R and it's definitely easier to extend your knowledge a bit to include GIS analyses than to learn how to use a new software or how to code in Python. Fortunately, there's a really active community of R-users doing GIS, so you will not be alone and you'll easily find a lot of documentation online. There's also an incredibly large number of packages on CRAN for spatial data processing or analysis. You'll find an overview on the [CRAN Task View: Analysis of Spatial Data](https://cran.r-project.org/web/views/Spatial.html){target="_blank"}.
Doing your GIS analyses with code is also a nice opportunity to make your research more reproducible. The whole data processing is documented and you and others can easily check and re-run everything. The same applies to maps: you can re-create them in a few seconds if the data changes. This is much harder if you only use a "point and click" GIS software.
However, there are some GIS tasks where R doesn't shine. I would for example never digitize GIS data or georeference images (such as old maps) using R. As we will see later, the cartographic capabilities of some R packages are really impressive. But it will still be easier to use a traditional GIS software if you need more specialized techniques such as complex labeling, advanced symbology or complex map layouts. The same applies if you need to use 3D vector data (with the exception of point clouds).
### GIS data models
When we work with geographic data we need to decide how to model real world objects or concepts (buildings, forests, elevation, etc.) before we can use them in a computer with a GIS (Geographic Information System) software. GIS people mainly use two data models: vector and raster. Other models exist, such as TINs, point clouds or meshes, but we won't cover them here.
Vector data is normally used for high precision data sets and can be represented as points, lines or polygons. Properties of the represented features are stored as attributes. The vector types you will use depend of course on your own data and on the required analyses. For example: points could be appropriate for bird nests and sightings, lines for moving animals and linear structures (paths, rivers), and polygons for territories and land cover categories. Of course a river can also be modeled as a polygon if you're interested in its width (or you can also store its width as an attribute of the line).
::: callout-note
High precision doesn't necessarily mean high accuracy! For example the coordinates of some points could be stored in meters with 5 decimals even though the measurement error was 2 meters.
Most vector data formats include some possibility to store information about measurement errors but this is actually very rarely used.
:::
The best-known format for storing vector data is the shapefile, an old and inefficient format developed by ESRI. Even though the shapefile format is still widely used, it has a lot of limitations and problems (listed on the following website: <http://switchfromshapefile.org>{target="_blank"}). Nowadays GIS specialists advise replacing it with better alternatives such as the **GeoPackage** format. Every modern GIS software can read and write GeoPackages and the format is completely open-source. It is also published as a standard by the Open Geospatial Consortium (OGC), which makes it a future-proof alternative.
::: callout-important
I strongly advise against using GeoPackages (or any other file database format) on cloud-storage platforms such as Dropbox, OneDrive or Google Drive, especially if you need to edit them! Most of the time everything will work fine, but the risk of corruption and/or data loss due to the synchronization mechanism is not negligible!
:::
Raster data is basically an image divided into cells (pixels) of constant size, and each cell has an associated value. Satellite imagery, topographic maps and digital elevation models (DEM) are typical examples where the raster data model is appropriate. A raster data set can have several layers called bands; for example, most aerial images have at least three bands: red, green and blue. In the raster data model, specific geographic features are aggregated to a given resolution to create a consistent data set, which comes with a loss of precision. The resolution used for this aggregation can have a large influence on some analyses and must be chosen carefully.
There exist thousands of different raster data formats. As of today I recommend using the **GeoTIFF** format. It is widely used in the GIS world and every GIS software can read and write raster data in this format. Note that it is also possible to use the GeoPackage format to save raster data sets; however, I would advise against it since some GIS software won't be able to read these rasters.
::: callout-tip
Vector data: use the GeoPackage format
Raster data: use the GeoTIFF format
:::
### Getting ready
In order to run the code in this tutorial, you'll need the packages listed below; you can install them with the following code. All the required dependencies will be installed automatically. The most important ones are `sf`[@pebesma_simple_2018], `terra`[@hijmans_terra_2023], `tmap`[@tennekes_tmap_2023] and `mapview`[@appelhans_mapview_2023].
::: callout-important
Please also update to the latest version of R, otherwise you may get packages that are not fully up-to-date.
:::
``` r
install.packages("sf")
install.packages("terra")
install.packages("lwgeom")
install.packages("spatstat.random")
install.packages("spdep")
install.packages("httr2")
install.packages("tmap")
install.packages("mapview")
install.packages("tmaptools")
install.packages("classInt")
install.packages("leaflet.extras2")
install.packages("leafsync")
install.packages("qgisprocess")
```
The required data is available on the GitHub repository (<https://github.com/jguelat/R-GIS>{target="_blank"}). You'll also find all the code needed to re-create this document.
::: callout-important
A new major version of tmap (v4) was released last week. Unfortunately it breaks some of the examples shown in this tutorial and I haven't had enough time to make all the necessary changes. Therefore please install an older version (v3) before running the tmap examples. You can download it from my OneDrive: [Download tmap v3](https://vogelwarte-my.sharepoint.com/:u:/g/personal/jerome_guelat_vogelwarte_ch/Edt2uVOziSlKi_MkwczypUUBTbhuWe0LyaCwWFqCxQgvwQ?download=1). I will update the code as soon as possible.
:::
## Vector data
### Vector data model
The main vector types are points, lines and polygons (or a combination thereof) and the point is the base of all these types. For example a simple line consists of 2 connected points, similarly an ordered sequence of connected points will represent a more complex line (often called a polyline). A simple polygon will be modeled as an external ring, which is a special type of polyline where the first and last points are identical. In the case of lines and polygons we often speak of vertices to describe these points. Things can be a bit more complex, for example a polygon could have a hole which is modeled as an internal ring.
The **Simple Feature** standard ([full documentation](https://portal.ogc.org/files/?artifact_id=25355){target="_blank"}) was developed to be sure that we all speak the same language when describing vector elements. The specification describes 18 geometry types, but don't worry only 7 of them will be useful for us. The following figure shows these 7 types (source: Lovelace *et al.*, 2019[@lovelace_geocomputation_2019]):

A feature represents a geographic entity modeled by one of these types. For example a building would be a single feature of type POLYGON, while the whole Hawaii archipelago would be a single feature of type MULTIPOLYGON (but you could of course also model each island separately as type POLYGON). A single feature using the MULTI\* types can have multiple elements but this is not mandatory. Most of the time we will use the default 2D version of these types. However it is possible to include additional numeric values such as the height of each point (Z values) or some kind of measurement error (M values). Note that many GIS software will ignore Z and M values for the vast majority of spatial analyses.
::: callout-important
The feature type is usually defined for the whole vector data set, not per feature (actually `sf` lets you mix types, but this will bring you all sorts of trouble). For example, if you know that your data set will contain POLYGON and MULTIPOLYGON features, then you will have to use the MULTIPOLYGON type for all of them.
:::
In most GIS software (including R), simple features are internally encoded using the well-known binary (WKB) or well-known text (WKT) standards. As the name suggests, WKB is a binary format and hence not easily readable by normal humans. The WKT format encodes exactly the same information as WKB, but in a more human-friendly way. Here are some examples of WKT-encoded features, with a short parsing sketch after the list (check the [Wikipedia page](https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry){target="_blank"} if you need more):
- a point: `POINT (10 5)`
- a linestring made of 3 points: `LINESTRING (1 1, 2 4, 5 10)`
- a polygon (without a hole): `POLYGON ((10 5, 10 9, 5 8, 4 2, 10 5))`
- a multilinestring: `MULTILINESTRING ((1 1, 2 4, 5 10), (2 2, 5 2))`
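As announced above, here is a minimal sketch (using the `sf` package introduced in the next section) showing that such WKT strings can be parsed directly into geometries:

``` r
library(sf)

# Parse WKT strings into simple feature geometries
geoms <- st_as_sfc(c("POINT (10 5)",
                     "LINESTRING (1 1, 2 4, 5 10)",
                     "POLYGON ((10 5, 10 9, 5 8, 4 2, 10 5))"))
geoms
```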
The geometry is of course essential to have spatial information, but the vector data model also allows storing non-spatial attributes (often called the *attribute table*) for each feature. As we will see, these tables are stored as data frames in R and each column stores some property of the related feature (identification number, name, etc.). Each row relates to a single spatial feature (which can consist of several geometries if its type is MULTI\*). The following figure shows some examples (source: Tennekes & Nowosad, 2021[@tennekes_elegant_2021]):
{width="600"}
### A first look at vector data in R
Let's have a look at how R stores a vector data set. The main classes and methods needed to work with spatial vector data are defined in the `sf` package. We will also load the `tmap` package to have access to some spatial data sets.
```{r}
library(tmap)
library(sf)
```
When you first load the `sf` package, it will provide you with version information about some important open-source GIS libraries it uses. In a few rare cases, some functions will only be available if you use recent versions of these libraries. If you use `sf` on Windows or Mac and install it from CRAN, they will be included inside the `sf` package and there's no easy way to update them. These libraries are used in almost all open-source GIS software and even in some commercial ones. GDAL takes care of reading and writing your GIS files and can read 99.9% of all the existing GIS formats (the vector part of the GDAL library is often called OGR); GEOS is a Euclidean planar geometry engine and is used for all the common GIS analyses (intersection, union, buffer, etc.); PROJ is responsible for all the coordinate reference system operations. The s2 library is a spherical geometry engine which is active by default for computations when using unprojected data.
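If you want to check which versions of these libraries your `sf` installation is linked against (for example to verify that a recently added function is available), you can query them directly. A minimal sketch:

``` r
# Versions of the external libraries used by sf (GEOS, GDAL, PROJ, ...)
sf_extSoftVersion()

# Is the s2 spherical geometry engine currently active?
sf_use_s2()
```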
::: {.callout-important title="A bit of history"}
The availability of the `sf` package was a massive change in the "R/GIS" ecosystem (often called R-Spatial). In the old days we used a combination of several packages to process GIS vector data in R. The spatial classes (and some functions) were defined in the `sp` package, the import/export of data was managed by the `rgdal` package, and the geometric operations were available in the `rgeos` package. You'll find a lot of code using these packages on the internet. Please refrain from using them since they are not maintained anymore. The packages `rgdal` and `rgeos` were removed from CRAN; `sp` is still available. Moreover, the `sf` package is definitely more powerful and much faster.
:::
We will now load the `World` vector data set inside the `tmap` package and have a look at its structure.
```{r}
data(World)
class(World)
names(World)
World
```
We see the `World` object is stored as a data frame with an additional geometry column (note that the name of the geometry column doesn't need to be 'geometry'). The content of the geometry column is displayed using the WKT format. A programmer would say these objects are instances of the `sf` class, and I will thus call them `sf` objects. R is also giving us more information, like the coordinate reference system used (more on that later) and the number of dimensions (i.e. XY, XYZ or XYZM).
::: {.callout-note collapse="true" icon="false" title="Question: why is the MULTIPOLYGON type appropriate?"}
Don't forget that each feature (i.e. each row of the data frame) represents a country, and some countries are made up of several distinct pieces of land (e.g., islands, exclaves). That's why we need the MULTIPOLYGON type. And since the type applies to the whole data set, even countries with a single geometry (like Switzerland) will need to be MULTIPOLYGONS.
:::
It is also easy to plot the data using the usual command.
```{r}
#| warning: false
plot(World)
```
By default R will take the first 9 attributes of the `sf` object and plot them using the available geometries. Since these objects inherit from the data frame class, you can use all the typical data frame functions such as `summary`, `head`, `merge`, `rbind`, etc. Subsetting is also possible using the standard `[]` operators. Therefore you can use the following code if you only want to plot the well-being index, either for the whole world, only for countries with a high index, or just for Australia.
```{r}
plot(World[,"well_being"])
plot(World[World$well_being > 6,"well_being"])
plot(World[World$name == "Australia","well_being"])
```
Note that the color scale was adapted depending on the available values in the filtered data set. If you only need the geometries without any attributes, then you can use the `st_geometry()` function.
```{r}
plot(st_geometry(World))
```
::: callout-note
We haven't done it here, but, as we will see later, it is better to first project everything using an appropriate projection when you want to plot global data (like the previous world maps).
:::
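As a small preview of the CRS section below, here is a minimal sketch of such a reprojection, using the Equal Earth projection (EPSG:8857), a reasonable choice for world maps:

``` r
# Reproject the World data set before plotting it
world_eqearth <- st_transform(World, "EPSG:8857")
plot(st_geometry(world_eqearth))
```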
::: {.callout-note icon="false" title="Exercise (5 minutes)"}
Play a little bit with the `World` data set, try different functions that you would normally use with a data frame. Import the `redlist` data from the file `red_list_index.csv` (source: <https://stats.oecd.org>{target="_blank"}) and join it to the `World` data frame to add new attributes. Plot a world map using one of the new attributes.
```{r}
#| eval: false
#| code-fold: true
redlist <- read.csv("data/red_list_index.csv")
world2 <- merge(World, redlist, by.x = "iso_a3", by.y = "code")
plot(world2[,"index_2020"])
```
:::
### Structure of `sf` objects
::: {.callout-tip appearance="minimal" collapse="true" title="If you start from here..."}
Run the following code to load and create everything you'll need to run the examples in this section.
```{r}
#| eval: false
library(sf)
```
:::
Most of the time you won't need to create your own `sf` objects from scratch since you'll import some existing GIS data. But if you need to, there are special functions to help you. This is also a good way to get a better understanding of the structure of `sf` objects. The standard process is shown in the following figure (source: Lovelace *et al.*, 2019[@lovelace_geocomputation_2019]):
{width="800"}
You first need to create each feature geometry using some constructor functions. Each of these features will be of class `sfg` (simple feature geometry). Then you collect all these geometries in a list using the `st_sfc()` function. You get a new object of class `sfc` (simple feature list-column). After that you combine the newly created simple feature list-column with the attributes (stored as a data frame, or a tibble) using the `st_sf()` function in order to get an `sf` object.
Since this is rather abstract, let's look at a simple example. Imagine we want to create a point data set containing three bird observations, and each observation will have the following attributes: species and sex. We start by creating our point geometries using x and y coordinates:
```{r}
pt1 <- st_point(c(2657000, 1219000))
pt2 <- st_point(c(2658000, 1218000))
pt3 <- st_point(c(2659000, 1217000))
```
Let's have a look at what we've just created:
```{r}
pt1
class(pt1)
typeof(pt1)
str(pt1)
```
Our first object is a 2D point (otherwise we would see XYZ or XYZM) of class `sfg`. If we look a bit more into the details of the structure, we see that it is actually stored as a vector of type `double` (with length 2).
Now we need to collect our points inside an `sfc` object. This is simply a list of `sfg` objects with an associated coordinate reference system (CRS). Since we collected our data in Switzerland, we will use the standard Swiss coordinate reference system. As we will see later, most coordinate reference systems are identified by a specific number.
```{r}
pts <- st_sfc(pt1, pt2, pt3, crs = "EPSG:2056")
```
Let's have a look at our new creation:
```{r}
pts
class(pts)
typeof(pts)
str(pts)
```
This confirms that our `sfc` object is actually a list, and this object will be the geometry column of the soon to be created `sf` object. Since our object is a list, it is easy to extract individual elements if needed:
```{r}
# Extract the second item of the list
pts[[2]]
class(pts[[2]])
```
The feature geometries (stored in an `sfc` object) are only half of what we need to create an `sf` object. We also need to define the attributes of each feature. We store them in a data frame using the same order as the geometries.
```{r}
pts_data <- data.frame(species = c("wallcreeper", "alpine chough", "kingfisher"),
sex = c("male", "female", "female"))
pts_data
```
And as a last step we combine the feature geometries with the related attributes using the `st_sf()` function. We now have a typical GIS data set stored as an `sf` object.
```{r}
pts_sf <- st_sf(pts_data, geometry = pts)
pts_sf
```
Since everything is stored as lists, it is again easy to access individual elements of the `sf` object:
```{r}
# Extract the 3rd geometry
pts_sf$geometry[[3]]
```
There's some sort of tradition to call geometry columns *geom* or *geometry*, but you're entirely free to choose another name. However, you need to be a bit careful since the `sf` package must always know the correct name. For example, using the standard `names()` function will not work for geometry columns since `sf` won't be informed of the change. To modify the name of the geometry column, always use the `st_geometry()` function.
```{r}
#| error: true
names(pts_sf)[3] <- "my_beautiful_points"
pts_sf
st_geometry(pts_sf) <- "my_beautiful_points"
pts_sf
```
::: callout-tip
You can also create `sf` objects directly from a data frame containing a column of type `sfc` using the `st_as_sf()` function.
```{r}
pts_data$geometry <- pts
pts_sf <- st_as_sf(pts_data)
```
:::
::: callout-note
You've now probably noticed that most functions in the `sf` package have an `st_` prefix. This is a reference (and probably homage) to PostGIS, an extension allowing to store and query GIS data in the PostgreSQL database management system. All PostGIS functions start with the `ST_` prefix, which stands for "Spatial Type".
:::
We proceed similarly to create other geometry types from scratch; the only difference is that we now need matrices to store the vertices of the lines and polygons instead of a simple vector, and for multilinestrings, (multi-)polygons and geometry collections, we need more lists to encapsulate everything. If you're not sure how to create geometries, the `sf` documentation provides examples for all the geometry types. Look for the following functions: `st_point()`, `st_linestring()`, `st_polygon()`, `st_multipoint()`, `st_multilinestring()`, `st_multipolygon()`, `st_geometrycollection()`. Here's a more complex example showing how to create a multipolygon (including one geometry with a hole) inside an `sfg` object. The next steps (collecting geometries in an `sfc` object, adding attributes and storing everything as an `sf` object) are exactly the same as before.
```{r}
# rbind creates matrices and makes the coding easier
pol1_border <- rbind(c(1, 5), c(2, 2), c(4, 1), c(4, 4), c(1, 5))
pol1_hole <- rbind(c(2, 4), c(3, 4), c(3, 3), c(2, 3), c(2, 4))
pol1 <- list(pol1_border, pol1_hole)
pol2 <- list(rbind(c(0, 2), c(1, 2), c(1, 3), c(0, 3), c(0, 2)))
multipolygon_list <- list(pol1, pol2)
multipol <- st_multipolygon(multipolygon_list)
multipol
plot(multipol, col = "navy")
```
::: callout-tip
You can also create `sfc` and `sf` objects from scratch using the WKT format and the `st_as_sfc()` and `st_as_sf()` functions. The following example creates an `sfc` object using a character vector, without needing to create an `sfg` object first.
```{r}
pts <- st_as_sfc(c("POINT(2657000 1219000)", "POINT(2658000 1218000)", "POINT(2659000 1217000)"), crs = "EPSG:2056")
```
And you can use a similar approach to create an `sf` object. In this case we add a new column (as a character vector) to the data frame containing the attributes. Note the use of the `wkt` argument inside the `st_as_sf()` function.
```{r}
pts_data$geometry <- c("POINT(2657000 1219000)", "POINT(2658000 1218000)", "POINT(2659000 1217000)")
pts_sf <- st_as_sf(pts_data, wkt = "geometry", crs = "EPSG:2056")
```
:::
::: {.callout-note icon="false" title="Exercise (5 minutes)"}
Try to build your own `sfc` and `sf` objects using either `st_sfc()` and `st_sf()` or `st_as_sfc()` and `st_as_sf()`.
:::
## Raster data
As we saw above, a raster data set is basically an image, which is the same as a grid of pixels. These pixels are often called cells. Most raster data sets you will encounter will have square cells with a constant size (also called resolution), and we will only focus on these in this tutorial. However, don't forget that other kinds of grids, for example sheared or curvilinear ones, also exist. This is sometimes needed depending on the coordinate reference system used to store the data, or can be caused by some reprojections.
Rasters are perfect for storing continuous values contained in a large area (called the extent of the raster). Digital elevation models are a typical example of such data, each cell is used to store an elevation value. You will also find rasters containing discrete values, these are often used to store landcover or landuse data sets. Note that, unlike vector data, it is impossible to store overlapping features in the same data set.
We saw that vector data sets can store multiple attributes for a single feature. We can use a similar technique for raster data sets with the help of raster bands. You can think of raster bands as different layers of the same grid, each layer containing different information. This is mainly used for spectral data, for example the red, green and blue intensity values in an aerial picture; satellite imagery will have even more bands depending on the number of sensors. Multiband rasters are also often used to store environmental variables that change through time (e.g. a temperature raster, with one band per day). Such rasters are often called datacubes.

Performing computations on raster data sets is usually very efficient and faster than using vector data. This is due to the fact that rasters are stored in some kind of matrix format with some extra information such as the coordinate reference system and the origin of the raster. It is thus possible to use highly efficient linear algebra libraries. The mathematical operations performed on raster cells are called map algebra.
We will use the `terra` package to work with raster data. This package has everything needed to import, analyse, visualize and export raster data sets. Like the `sf` package, it is also using the GDAL library for the import/export operations, which means it can open almost every raster data format. Unlike the `sf` package, `terra` will not import the full data sets in memory but only create a pointer to the data and then read smaller blocks of data successively. This allows working with very large rasters with a relatively small memory footprint. The amount of functions available in the `terra` package is similar to typical GIS software.
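To make this a bit more concrete, here is a minimal sketch that creates a small (entirely made-up) raster from scratch and applies some simple map algebra:

``` r
library(terra)

# Create a 10 x 10 raster with 100 m cells, using the Swiss CRS
r <- rast(nrows = 10, ncols = 10,
          xmin = 2657000, xmax = 2658000,
          ymin = 1218000, ymax = 1219000,
          crs = "EPSG:2056")
values(r) <- runif(ncell(r))

# Map algebra: cell-wise operations are applied to the whole raster at once
r2 <- r * 100 + 10
plot(r2)
```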
::: {.callout-important title="Rasters and R workspaces"}
Since `terra` only stores a pointer to the raster data set, this means the actual data set won't be included if you save your session in an R workspace (.Rdata files). If you really want to include it in your workspace, you can use the `wrap()` function. Note that this is also needed if you want to pass raster data over a connection that serializes, e.g. to a computer cluster.
:::
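Here is a minimal sketch of how this could look, using the small raster `r` created above:

``` r
# Pack the raster so the cell values travel with the R object
# (e.g. when saving a workspace or sending data to a cluster)
r_packed <- wrap(r)

# Restore a fully functional SpatRaster
r_restored <- unwrap(r_packed)
```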
There is another famous R package to process raster data, the `stars` package. It is especially useful if you need to work with "irregular" rasters (sheared, curvilinear, etc.) or with complex datacubes. It is also tidyverse-friendly and the syntax is close to the one used in `sf`. However, the number of available functions is (still) much lower than in `terra`. If you need to use both packages, it is fortunately easy to convert raster objects from `terra` to `stars` (using the function `st_as_stars()`), and the other way round (using the function `rast()`).
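A minimal conversion sketch (assuming the `stars` package is installed; it is not part of the package list above), again using the small raster `r` from before:

``` r
library(stars)

# Convert a terra SpatRaster to a stars object and back
r_stars <- st_as_stars(r)
r_terra <- rast(r_stars)
```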
::: {.callout-note title="A bit of history"}
A revolution happened in 2010 in the small world of R-spatial when the `raster` package was released. People were able to perform analyses using raster data sets in R instead of using a standard GIS software. The package was maintained for many years, and many functions were added. However, its developer decided to create a new package from scratch in order to improve speed and memory efficiency, and the `terra` package was born. You will often find a lot of R code using the `raster` package on the web. Fortunately it is quite easy to adapt code written for the `raster` package to `terra`. The functions have similar names (sometimes even identical) and everything that was available in `raster` should also be available in `terra`. Actually the recent versions of `raster` even use `terra` in the background instead of the original `raster` code.
:::
## Coordinate reference systems
The majority of normal people will get scared if there's some problem to solve involving coordinate reference systems or projections. That's why I will keep this part really short and only show you the stuff you will need to perform standard GIS analyses with R. If you want to read more about this (extremely interesting) topic, I invite you to read the following book chapters: <https://r.geocompx.org/spatial-class.html#crs-intro>{target="_blank"} and <https://r-tmap.github.io/tmap-book/geodata.html#crs>{target="_blank"}.
There is a famous expression saying "spatial is special"... One of the main reasons is that such data will have an associated location and you thus need a reference frame to describe this location. This reference frame is called a coordinate reference system (CRS) in the GIS world. CRSs can be either geographic or projected.
::: callout-important
When you're working with GIS data, you should always know the CRS you're using. Otherwise coordinates are just numbers without a reference frame. When you share GIS data, make sure the CRS is always defined in the data set or documented in some other way. The CRS of a vector data set can be queried with the `st_crs()` function, for a terra object you should use the `crs()` function.
:::
A geographic CRS will identify locations on a spheroid or an ellipsoid using 2 values: latitude and longitude. The shape of the Earth is actually a geoid, but it is too complex to perform computations and thus one has to use approximations. The spheroid is making the assumption that the Earth is a sphere, while the ellipsoid is a better approximation accounting for the fact that our planet is a bit compressed (flatter at the North and South Poles). Geographic coordinate systems are not using a projection! All the computations (distances, buffers, etc.) have to happen on the spheroid/ellipsoid, which makes things more complex. It is easy to make mistakes when working with geographic CRSs, and even smart people fell in this trap (e.g. <https://georeferenced.wordpress.com/2014/05/22/worldmapblunders>{target="_blank"}).
Projected CRSs are based on geographic CRSs but include an additional step: a projection onto a flat surface. When using a projected CRS, locations are described using Cartesian coordinates called easting and northing (x and y). Projecting a spherical or ellipsoidal surface onto a plane will cause deformations. These will affect four properties of the surface: areas, distances, shapes and directions. A projected CRS can preserve only one or two of these properties. There are tons of different projections and all of them make different compromises; some are even totally useless (check this beautiful xkcd comic: <https://xkcd.com/977>{target="_blank"}). Choosing a projection can be challenging, especially if your data covers a very large area. The following websites allow you to visualize the main projection types: <https://www.geo-projections.com>{target="_blank"} and <https://map-projections.net/singleview.php>{target="_blank"}. The second website also provides a nice tool to visualize distortions called a Tissot indicatrix. Fortunately, if your data is within a "smallish" area, it is relatively easy to find a good projected CRS that has only minimal distortions. Almost every country has its own recommended projected CRS (or CRSs), and if your data covers several countries, you can use a UTM (Universal Transverse Mercator) coordinate system.
::: callout-tip
It is almost always easier to work with a projected CRS, except if your data is really global (or covering a really large area, like a continent). Moreover, most GIS software will (still) assume that your data is on a flat plane, even if you're working with a geographic CRS. The `sf` package is kind of an exception since it will actually perform calculations on a spheroid if you use a geographic CRS, thanks to the s2 library.
:::
::: callout-important
The CRS used by almost all mapping websites (OpenStreetMap, Google Maps, etc.) should never be used for any analysis. It is a slightly modified version of the Mercator projection called Web Mercator or Pseudo-Mercator. It has some advantages allowing good visualization speed, but the distortions are massive. Check the following website: <https://www.thetruesize.com>{target="_blank"}.
:::
With so many CRSs available, we need a way to classify them. That's what the EPSG (European Petroleum Survey Group) started doing a few years ago. They collected and documented most available CRSs in a data set which is now called the EPSG Geodetic Parameter Dataset (<https://epsg.org/home.html>{target="_blank"}). In this data set, every CRS has a unique identification number that can be used in a GIS software instead of writing the full definition of the CRS. The best available transformations between CRSs are also defined. Sadly this data set is still missing a few interesting CRSs and was thus completed by other companies such as ESRI. This is the reason why you'll sometimes see ESRI codes instead of EPSG for some CRSs. To avoid confusion, CRSs are usually referenced by an SRID (Spatial Reference System Identifier), which is made of two components, an authority (such as EPSG or ESRI) and an identifier. If no authority is mentioned you can usually assume it's coming from the EPSG data set (especially in the open-source GIS world). For clarity, I recommend always specifying the full SRID when working with CRSs. With `sf` and `terra` (and most other GIS packages), the SRID has to be written in the form "authority:identifier".
The following CRSs are especially interesting for us:
| SRID | Name | Description |
|-----------------|-----------------|--------------------------------------|
| EPSG:2056 | CH1903+/LV95 | Projected CRS currently used in Switzerland |
| EPSG:21781 | CH1903/LV03 | Former projected CRS used in Switzerland, you will still find data sets using this one |
| EPSG:4326 | WGS84 | Geographic CRS used for most global data sets, and by GPS devices |
| EPSG:3857 | Pseudo-Mercator | Projected CRS used by online maps |
| EPSG:8857 | Equal Earth Greenwich | Nice equal-area projection for world maps |
| ESRI:54030 | Robinson | Aesthetically pleasing projection for world maps |
::: {.callout-important title="Proj4strings"}
When looking for examples on the web, you will often find code snippets using what is called a proj4string to define a CRS or to reproject data. For example the proj4string for the current Swiss CRS looks like this: `+proj=somerc +lat_0=46.9524055555556 +lon_0=7.43958333333333 +k_0=1 +x_0=2600000 +y_0=1200000 +ellps=bessel +towgs84=674.374,15.056,405.346,0,0,0,0 +units=m +no_defs +type=crs`. This was the standard way of describing CRSs until a few years ago. You should NOT use these strings; instead, always use the EPSG (or another authority) code to be on the safe side. Otherwise you may get small to medium position errors when reprojecting your data.
Similarly, you will sometimes see some CRS definitions using the `+init=` syntax (e.g., `+init=EPSG:2056`). This should also be avoided for similar reasons, moreover this can also cause problems with other GIS software not recognizing the CRS properly.
:::
::: callout-important
If you search for the EPSG database on your favorite search engine, you may find the website <https://epsg.io>{target="_blank"}. Please do not use it! It is not the official EPSG website, it doesn't use the latest version of the EPSG database, and therefore some definitions of CRSs are outdated.
:::
You can easily explore how CRSs are stored in modern GIS software. For example if you want to inspect the current Swiss CRS:
```{r}
st_crs("EPSG:2056")
```
We get the full description of the CRS with all its parameters. The format used is unfortunately also named WKT, but this has nothing to do with the WKT format used to define geometries. If you use EPSG codes, you can also simply enter the code as an integer (please don't do this to avoid confusion).
::: {.callout-note icon="false" title="Exercise (5 minutes)"}
Try to understand the output of the `st_crs()` function. Try it with a geographic CRS.
:::
## Tips and tricks for vectors
### Reading vector data
::: {.callout-tip appearance="minimal" collapse="true" title="If you start from here..."}
Run the following code to load and create everything you'll need to run the examples in this section.
```{r}
#| eval: false
library(sf)
```
:::
We had a look at the gory details of the internal structure of `sf` objects. However, most of the time you will not create such objects on your own but rather rely on the `sf` package to create the right structure when you import existing GIS data. The `sf` package uses the GDAL (Geospatial Data Abstraction Library) library to read GIS files, which means you will be able to import almost all existing GIS vector formats. If you want to check all the available formats, you can use the `st_drivers()` function. Sometimes you will not get a standard GIS data set but a simple CSV (or Excel) file containing coordinates and related attributes. We will now see how to import these different data types in order to use them with the `sf` package.
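For example, here is a minimal sketch listing the vector formats your GDAL installation can handle:

``` r
# List the available vector drivers (one row per GDAL format)
drv <- st_drivers(what = "vector")
head(drv)

# Check whether the GeoPackage driver supports writing
drv[drv$name == "GPKG", ]
```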
#### Import a GeoPackage
The GeoPackage format is the best available open format to store vector data. It is based on the SQLite database format which is the most used file-based database nowadays. You can think of it as a special folder containing one or several GIS data sets. Since we normally don't know in advance if a GeoPackage contains one or more data sets, we first have to inspect it.
```{r}
st_layers("data/geodata.gpkg")
```
You should not always trust the reported number of features. Some GIS formats, such as the GeoPackage, report this number; some don't. If the GeoPackage was produced by a software that doesn't properly implement the standard, the reported number of features could be wrong (but this shouldn't have any other bad consequence). If you want to be sure to get the correct number, you can use the `do_count = TRUE` argument of the `st_layers()` function, but this will be slower.
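A minimal sketch:

``` r
# Force counting the features in each layer (slower, but always correct)
st_layers("data/geodata.gpkg", do_count = TRUE)
```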
To read the data, you use the `st_read()` function, the first argument is the path of the GeoPackage, and the second argument is the layer you want to import. The function will return an `sf` object. By default you'll get some information about the data being imported. If you don't need them, you can use the argument `quiet = TRUE`.
```{r}
muni <- st_read("data/geodata.gpkg", "municipalities")
streets <- st_read("data/geodata.gpkg", "streets", quiet = TRUE)
```
Let's check the object we've just created. To inspect `sf` objects, you can either call them directly or use the `print()` function. The `head()` and `tail()` functions will also work since `sf` objects are based on data frames. By default, only the first 10 rows will be displayed. If you want to see more (or less) rows, use the `n` argument of the `print()` function.
```{r}
muni
print(muni, n = 2)
```
The output contains basic information about the data set, and the first features are shown with all the attributes. This municipalities data set is an extract of the [swissBOUNDARIES3D](https://www.swisstopo.admin.ch/en/landscape-model-swissboundaries3d){target="_blank"} data set provided by Swisstopo, the streets data set is an extract of the [swissTLM3D](https://www.swisstopo.admin.ch/en/landscape-model-swisstlm3d){target="_blank"} data set, also provided by Swisstopo.
#### Import a Shapefile
If you really need to import a Shapefile, you should also use the `st_read()` function. Since Shapefiles cannot contain more than one data set, we only need to provide the first argument of the function. A Shapefile consists of several files with different extensions (.shp, .shx, etc.); we point to the .shp file when importing.
```{r}
muni2 <- st_read("data/municipalities.shp", quiet = TRUE)
muni2
```
This is actually the same data set as the one in the GeoPackage; however, note that `sf` now uses POLYGON instead of MULTIPOLYGON. This is because the Shapefile format does not properly distinguish between the two types. The GDAL library uses the POLYGON type in this case, but you can still end up with a combination of polygons and multipolygons in the same data set.
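If you prefer a consistent geometry type after importing a Shapefile, one option is to promote all features to MULTIPOLYGON. A minimal sketch:

``` r
# Promote every feature to MULTIPOLYGON so the geometry type is homogeneous
muni2 <- st_cast(muni2, "MULTIPOLYGON")
st_geometry_type(muni2, by_geometry = FALSE)
```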
#### Import a CSV file with coordinates
If you have a table containing coordinates of point data (e.g. sites or bird sightings), you should use the `st_as_sf()` function. The first argument should be the data frame containing the data, and you also need to specify the names (or the numbers) of the columns containing the geographic coordinates, and the CRS used. The following data set was extracted from the bird sightings database of the Swiss Ornithological Institute.
```{r}
obs <- read.csv("data/observations.csv")
head(obs)
obs <- st_as_sf(obs, coords = c("x", "y"), crs = "EPSG:2056")
obs
```
If you have data in WGS84, your geometry columns will probably be named longitude and latitude. In this case remember that longitude corresponds to the *x* coordinate and latitude to the *y* coordinate. This can sometimes be confusing because most geographic CRSs have a "reversed" axis order (which means latitude is stored before longitude). To be honest the situation is even more complicated than that, since in geodesy, the convention is to let the x axis point to the North and the y axis to the East (more information: <https://wiki.osgeo.org/wiki/Axis_Order_Confusion>{target="_blank"}).
Fortunately, thanks to the great job done by the programmers of the PROJ and GDAL libraries, you don't really have to think about all this chaos. Just remember the rule mentioned above and you should be safe 99.99% of the time. You should thus use (note the different value for the `crs` argument):
```{r}
#| eval: false
obs_wgs <- read.csv("data/observations_wgs.csv")
obs_wgs <- st_as_sf(obs_wgs, coords = c("lon", "lat"), crs = "EPSG:4326")
```
#### Import from a PostGIS database
PostGIS is a famous open-source extension for the PostgreSQL database management system. It allows storing all kind of GIS data inside a database and perform hundreds of typical GIS analyses. The Swiss Ornithological Institute is using PostGIS to store almost all its bird data and a lot of other GIS data sets. If you have a laptop provided by the institute and you already accessed our database via QGIS, you should be able to run the following code. The process will be the same for all PostGIS databases.
First we need to load the `RPostgres` package, which provides functions to access PostgreSQL databases (and hence PostGIS, too). There is another package providing similar functionality called `RPostgreSQL`, but in my opinion `RPostgres` is better maintained and I have experienced fewer problems with it.
After storing all the connection details in some variables, we can finally create a connection to the database using the `dbConnect()` function.
```{r}
#| eval: false
library(RPostgres)
# Login data
user <- "replace_with_your_login"
password <- "replace_with_your_password"
host <- "dbspatialdata1"
database <- "research"
# Connection to the database
dbcon <- dbConnect(Postgres(), dbname = database, host = host, user = user, password = password)
```
After that we need to import the data with a query, again using the `st_read()` function. Note that the first argument of the function must be the database connection object. The first possibility consists of importing the whole layer (called a table in database lingo) with all its attributes. This is what we do for the `cantons1` data set. We need to use the `Id()` function to specify the location of the table inside the database. In a PostgreSQL database, a schema is a bit like a folder where we store tables; this allows us to implement some structure inside the database. In our case the table `cantonal_boundaries_ch` is stored inside the schema `perimeter`. Note that we don't need the `Id()` function if the table is stored in the `public` schema.
We can also specify an SQL query to import the data, as for the `cantons2` data set. Using this kind of query, we are fully flexible: we can, for example, specify the attributes we want to import or filter the data, and we can even join different tables together (by attributes or even spatially). Once again we have to specify the schema, but this is done a bit differently.
Once we have our `sf` objects, we still need to disconnect from the database. The `cantonal_boundaries_ch` data set contains all the cantonal boundaries in Switzerland. The data is provided by Swisstopo ([swissBOUNDARIES3D](https://www.swisstopo.admin.ch/en/landscape-model-swissboundaries3d){target="_blank"}).
```{r}
#| eval: false
# Load cantonal boundaries
cantons1 <- st_read(dbcon, layer = Id(schema = "perimeter", table = "cantonal_boundaries_ch"))
cantons2 <- st_read(dbcon, query = "SELECT id, name, geom
FROM perimeter.cantonal_boundaries_ch
WHERE name = 'Fribourg'")
# Disconnect database
dbDisconnect(dbcon)
# Show sf objects
cantons1
cantons2
```
#### Import from WKB
You probably won't need to import geometries in the WKB format very often. GIS data should not be shared directly in this format. However, since it's the default format used to store geometries in a PostGIS database, you may one day get a table with attributes and a single WKB column. The following table is a direct extract from a PostGIS database.
```{r}
obs_wkb <- read.csv("data/observations_wkb.csv")
head(obs_wkb)
```
Unfortunately it is not possible to use the `st_read()` function to import such data, and the `st_as_sf()` function won't work either. In this case we first need to convert the WKB geometries into an `sfc` object. To do that we add an extra attribute using the `structure()` function to inform `sf` that we're using WKB geometries. After that we can use the `st_as_sfc()` function. The `EWKB = TRUE` argument means that we are using a WKB dialect called EWKB (Extended WKB) which also includes the SRID of the geometries. Once we have our `sfc` object, we remove the now useless column containing the WKB geometries and use the `st_sf()` function to combine the data frame with the geometries.
```{r}
geom <- st_as_sfc(structure(as.list(obs_wkb$wkb), class = "WKB"), EWKB = TRUE)
obs_wkb <- subset(obs_wkb, select = -wkb)
obs_wkb <- st_sf(obs_wkb, geom)
obs_wkb
```
### Writing vector data
::: {.callout-tip appearance="minimal" collapse="true" title="If you start from here..."}
Run the following code to load and create everything you'll need to run the examples in this section.
```{r}
#| eval: false
library(sf)
muni <- st_read("data/geodata.gpkg", "municipalities", quiet = TRUE)
obs <- read.csv("data/observations.csv")
obs <- st_as_sf(obs, coords = c("x", "y"), crs = "EPSG:2056")
```
:::
When you create or modify a GIS data set with `sf` you'll need to export it to some standard GIS format if you want to share it with colleagues, open it in another GIS software, or simply archive it. I do not recommend using R workspaces (.Rdata files) to share or store GIS data. For exporting vector data, we are going to use the `st_write()` function. Like the `st_read()` function, it uses the GDAL library, so you'll be able to export in many different formats. You can specify the format explicitly, otherwise `sf` will try to guess it based on the file extension. It is also possible to export to a PostGIS database using an approach similar to the one we used for importing.
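For example, the same observations could be exported to a GeoJSON file simply by changing the file extension. A minimal sketch (GeoJSON expects WGS84 coordinates, so we reproject first; the driver could also be forced with the `driver` argument):

``` r
# The GeoJSON driver is guessed from the .geojson extension
st_write(st_transform(obs, "EPSG:4326"), "export/birds.geojson")
```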
#### Export to GeoPackage
For a GeoPackage, you need to specify the name of the GeoPackage first (it will be automatically created if it doesn't exist) and the name of the data set that will be stored inside the GeoPackage. If you specify a GeoPackage that already exists, the data set will be added to it as a new table.
```{r}
#| echo: false
#| output: false
if(file.exists("export/birds.gpkg")) {file.remove("export/birds.gpkg")}
```
```{r}
st_write(obs, "export/birds.gpkg", "observations")
obs2 <- obs[1:10,]
st_write(obs2, "export/birds.gpkg", "observations2", quiet = TRUE)
st_layers("export/birds.gpkg")
```
If you want to delete a data set, you can use the `st_delete()` function. Think twice before doing it, there will be no warning!
```{r}
st_delete("export/birds.gpkg", "observations2")
```
#### Export to CSV
It is usually a better option to export a GeoPackage, but sometimes you'll still need to export your data to CSV. When you share such a file, always add metadata about the CRS you used. Since CSV is not a GIS format, we need a way to store the geometries. This is easy for point data since we can always add columns with the coordinates. For line and polygon data we need to find another solution, for example store the WKT in a new column.
```{r}
#| echo: false
#| output: false
if(file.exists("export/birds.csv")) {file.remove("export/birds.csv")}
if(file.exists("export/municipalities.csv")) {file.remove("export/municipalities.csv")}
```
We can use the `layer_options` argument of the `st_write()` function to export point data with the additional columns for the *x* and *y* coordinates. These additional options are sent directly to the GDAL library which does the export.
```{r}
st_write(obs, "export/birds.csv", layer_options = "GEOMETRY=AS_XY")
```
For polygon and line data, we can do something similar to get a new column with the WKT geometry. Since commas are used in the WKT format, it might be a good idea to use another separator for the CSV file. Note that you'll probably get into trouble if your geometries have a lot of vertices.
```{r}
st_write(muni, "export/municipalities.csv", layer_options = c("GEOMETRY=AS_WKT", "SEPARATOR=SEMICOLON"), quiet = TRUE)
```
#### Export to Shapefile
Really??? [Please don't](figures/simpsons.png){target="_blank"}!
### Basic geometric computations
::: {.callout-tip appearance="minimal" collapse="true" title="If you start from here..."}
Run the following code to load and create everything you'll need to run the examples in this section.
```{r}
#| eval: false
library(sf)
muni <- st_read("data/geodata.gpkg", "municipalities", quiet = TRUE)
streets <- st_read("data/geodata.gpkg", "streets", quiet = TRUE)
```
:::
In this section we'll see how to perform some basic geometric computations on spatial data, such as computing area, perimeter, length and centroids. We'll also learn how to display the coordinates of `sf` geometries.
#### Areas and lengths
As a first example, let's see how we can compute the area and perimeter of polygons, or the length of lines.
```{r}
st_area(muni)
st_perimeter(muni)
head(st_length(streets))
```
Note that the results always have a unit of measurement. This is a feature provided by `sf` and will occur with all functions returning some sort of measurement. It is compatible with the `units` package, which allows easy conversions between different unit types. However, this can sometimes be a problem if you need a "raw" value. In this case you can use the `as.numeric()` function to remove the units.
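A minimal sketch showing a unit conversion with the `units` package (installed automatically with `sf`) and how to strip the units:

``` r
library(units)

# Convert the municipality areas from square metres to square kilometres
areas_km2 <- set_units(st_area(muni), km^2)
head(areas_km2)

# Drop the units if a plain numeric vector is needed
head(as.numeric(areas_km2))
```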
::: callout-note
The `st_perimeter()` function was not available in older versions of the `sf` package. The `lwgeom` package was needed to perform this computation.
:::
If you use unprojected data (i.e., with a geographic CRS), `sf` will automatically use the `s2` library to compute areas, perimeters and lengths.
```{r}
st_area(World[1:5,])
st_perimeter(World[1:5,])
```
The `s2` library performs its computations on a sphere. For areas and lengths, it is possible to obtain a better approximation by using an ellipsoid. You can do this by turning off the `s2` library with the `sf_use_s2(FALSE)` function. In this case, `sf` will automatically use equivalent functions provided by the `lwgeom`[@pebesma_lwgeom_2023] package. These functions use algorithms from the GeographicLib library, which are, to my knowledge, the most precise approximations you can currently get. Note that the `st_perimeter()` function won't be available if you do that; you'll need to transform the polygons to lines first (see @sec-typecasting for details), and then use the `st_length()` function. Don't forget to reactivate `s2` (using the `sf_use_s2(TRUE)` function) when you're done.
```{r}
sf_use_s2(FALSE)
st_area(World[1:5,])
st_length(st_cast(World[1:5,], "MULTILINESTRING"))
sf_use_s2(TRUE)
```
::: callout-important
You should normally not turn `s2` off. Computing areas and lengths (and distances, as we will see later) are probably the only valid cases where turning `s2` off is a good idea. For all other computations based on geographic CRSs you should NOT deactivate `s2`, otherwise you'll get results that will most probably be wrong.
:::
#### Centroids
Computing the centroid of polygons is another useful operation that is easily computed using the `st_centroid()` function.
```{r}
muni_centroid <- st_centroid(muni)
plot(st_geometry(muni))
plot(st_geometry(muni_centroid), add = TRUE)
```
For this example, you can safely ignore the warning about attributes assumed to be constant. The output of the `st_centroid()` function will be a point data set with the same number of features and the same attributes as the data set used to compute the centroids. The function simply warns you that if an attribute is not spatially constant within a polygon, attaching its value to the centroid may not make much sense.
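If you only need the centroid points themselves, a minimal way to avoid the warning is to compute them on the geometry column only (assuming you don't need the attributes):
```{r}
#| eval: false
# Centroids of the geometries only: no attributes, hence no warning
st_centroid(st_geometry(muni))
```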
::: callout-tip
Centroids can be used to place labels inside polygons, but don't forget that polygons with strange shapes may not contain their own centroid. If you need to be sure that the point will be inside the polygon, use the `st_point_on_surface()` function, as sketched below.
:::
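Here is a minimal sketch of the `st_point_on_surface()` alternative mentioned in the tip above:
```{r}
#| eval: false
# Points guaranteed to lie inside their polygon, e.g. for labelling
muni_label_pts <- st_point_on_surface(muni)
plot(st_geometry(muni))
plot(st_geometry(muni_label_pts), add = TRUE, pch = 16)
```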
#### Extract coordinates
Sometimes we also need to know the coordinates of the `sf` objects we're using. For points we of course get the coordinates of the points; for lines and polygons we get the coordinates of the vertices, with additional columns showing how to reconstruct the features (check the `st_coordinates()` help page to understand the meaning of L1, L2 and L3).
```{r}
st_coordinates(muni_centroid)
head(st_coordinates(muni))
```
#### Common pitfalls
Unfortunately geometric computations are not always that easy... Let's have a look at another example. We load another polygon layer and try to compute the area of each polygon.
```{r}
bug <- st_read("data/geodata.gpkg", "wtf", quiet = TRUE)
plot(bug, col = 1:nrow(bug))
st_area(bug)
```
Oops, these polygons look big enough but one of them seems to have an area of 0... Why is this happening? To understand the problem, we first need to talk a bit about geometric validity...
### Geometric validity
::: {.callout-tip appearance="minimal" collapse="true" title="If you start from here..."}
Run the following code to load and create everything you'll need to run the examples in this section.
```{r}
#| eval: false
library(sf)
bug <- st_read("data/geodata.gpkg", "wtf", quiet = TRUE)
```
:::
When we had a look at the vector data model, we discovered the Simple Feature standard but we skipped an important part: geometric validity. Validity is defined a bit differently depending on the geometry engine used for the computations. First the good news: points are always valid! Lines are always valid for the GEOS engine used by `sf`, but they are considered invalid by QGIS if they have self-intersections (such lines are called non-simple). Polygons are definitely invalid if they have self-intersections (like our example). The other invalid cases are shown on this website: <https://postgis.net/docs/using_postgis_dbmanagement.html#Valid_Geometry>{target="_blank"}. Using invalid geometries can be problematic for some analyses, such as computing areas.
Normally we should expect official data sets to be valid but this is often not the case. You can check the validity of each feature using the `st_is_valid()` function. If you want a short description of the problems, you can add the argument `reason = TRUE`.
```{r}
bug
st_is_valid(bug)
st_is_valid(bug, reason = TRUE)
```
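To see what such a problem looks like in isolation, here is a minimal sketch building a deliberately invalid "bowtie" polygon from WKT and checking it:
```{r}
#| eval: false
bowtie <- st_as_sfc("POLYGON((0 0, 1 1, 1 0, 0 1, 0 0))")
st_is_valid(bowtie)                 # FALSE: the ring crosses itself
st_is_valid(bowtie, reason = TRUE)  # reports the self-intersection
```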
If there are only a few invalid features, we can correct them manually in QGIS. But sometimes this is not feasible and we need an automatic way of correcting them. This is where the `st_make_valid()` function shines. Even though it's fully automatic, it will perform the appropriate changes 99% of the time. The function can use two different algorithms to correct geometries; you can choose which one will be used with the `geos_method` argument. The default algorithm ("valid_structure") is more recent and should produce better results in most cases. Try the older one ("valid_linework") if you're not happy with the results. Check the following webpage to see the differences between the two algorithms (it is written for PostGIS but `sf` uses the same geometry engine): <https://www.crunchydata.com/blog/waiting-for-postgis-3.2-st_makevalid>{target="_blank"}
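For example, a minimal sketch of selecting the algorithm explicitly (assuming your `sf`/GEOS installation is recent enough to expose the `geos_method` argument):
```{r}
#| eval: false
st_make_valid(bug, geos_method = "valid_structure")  # newer algorithm (default)
st_make_valid(bug, geos_method = "valid_linework")   # older algorithm
```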
```{r}
bug_valid <- st_make_valid(bug)
st_is_valid(bug_valid)
bug_valid
st_geometry_type(bug_valid)
st_geometry_type(bug_valid, by_geometry = FALSE)
```
When we look at the corrected data set, we see that the invalid polygon was converted to a multipolygon. We can also check it using the `st_geometry_type()` function. This is however a problem since we normally don't want a data set with mixed geometry types. When we use the `by_geometry = FALSE` argument, we see that `sf` is now using a generic GEOMETRY type for the data set. The solution would be to convert all the other polygons to multipolygons. To do that we need to understand type casting.
::: callout-important
You've maybe heard of the buffer trick. It consists of computing a 0 meter buffer around each geometry to make it valid. This does work and makes everything valid, but you may lose some parts of the geometries: for example, if you have a polygon with a self-intersection, the smaller part will not be retained (see the sketch below).
:::
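For completeness, this is what the buffer trick looks like (shown only so you can recognize it, not as a recommendation):
```{r}
#| eval: false
bug_buffer <- st_buffer(bug, dist = 0)  # zero-width buffer
st_is_valid(bug_buffer)                 # now valid, but parts may have been dropped
```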
### Vector type casting {#sec-typecasting}
::: {.callout-tip appearance="minimal" collapse="true" title="If you start from here..."}
Run the following code to load and create everything you'll need to run the examples in this section.
```{r}
#| eval: false
library(sf)
muni <- st_read("data/geodata.gpkg", "municipalities", quiet = TRUE)
bug <- st_read("data/geodata.gpkg", "wtf", quiet = TRUE)
bug_valid <- st_make_valid(bug)
```
:::
Changing the type of vector features is done with the `st_cast()` function. Using this function you can not only disaggregate geometries with MULTI\* types into several unique features (e.g., multipolygon to polygon) or extract simpler types (e.g., extract polygon borders or vertices), but also construct geometries using "simpler" geometry types (e.g., build a line from points).
#### Polygons to Multipolygons
We can now solve our previous problem and convert everything to multipolygons (the existing multipolygon will be left untouched). Using the `st_as_text()` function we can see the WKT representation of each feature's geometry, and confirm that we're now using the same vector type for all features. The `st_geometry_type()` function also tells us that the data set type is now multipolygon.
```{r}
bug_multipoly <- st_cast(bug_valid, to = "MULTIPOLYGON")
st_as_text(st_geometry(bug_valid))
st_as_text(st_geometry(bug_multipoly))
st_geometry_type(bug_multipoly, by_geometry = FALSE)
```
If you want to be absolutely sure that you have only one feature type in your data set, you can combine the `unique()` function with the `st_geometry_type()` function.
```{r}
unique(st_geometry_type(bug_valid))
unique(st_geometry_type(bug_multipoly))
```
#### Polygons to other types
We can go a bit further and convert our multipolygons to other types. Converting polygons to lines will extract the rings, and converting to points will extract the vertices (similarly, casting a linestring to points will extract its vertices). Note: for some reason, it is not possible to convert a multipolygon directly to a linestring; you'll need to convert it to a multilinestring object first.
```{r}
bug_poly <- st_cast(bug_multipoly, to = "POLYGON")
bug_multiline <- st_cast(bug_multipoly, to = "MULTILINESTRING")
bug_line <- st_cast(bug_multiline, to = "LINESTRING")
bug_multipts <- st_cast(bug_multipoly, to = "MULTIPOINT")
bug_pts <- st_cast(bug_multipoly, to = "POINT")
bug_poly
bug_multiline
bug_line
bug_multipts
bug_pts
```
In the following figure, each feature has a unique color. It is thus easy to visualize the difference between the MULTI\* types and the other ones.
```{r}
par(mfrow = c(2, 3))
plot(st_geometry(bug_multipoly), col = rainbow(nrow(bug_multipoly)), main = "Multipolygons")
plot(st_geometry(bug_multiline), col = rainbow(nrow(bug_multiline)), main = "Multilines")
plot(st_geometry(bug_multipts), col = rainbow(nrow(bug_multipts)), pch = 16, main = "Multipoints")
plot(st_geometry(bug_poly), col = rainbow(nrow(bug_poly)), main = "Polygons")
plot(st_geometry(bug_line), col = rainbow(nrow(bug_line)), main = "Lines")
plot(st_geometry(bug_pts), col = rainbow(nrow(bug_pts)), pch = 16, main = "Points")
```
#### Points to lines
It is of course not possible to convert points directly to polygons, but if you have an `sf` object with points in the right order, you can easily build a linestring. As an example, let's extract the first 10 vertices of the Sempach multipolygon. Once you have an `sf` object with ordered points, you need to group them into a single multipoint geometry using the `st_combine()` function, and then call the `st_cast()` function on this new object.
```{r}
#| warning: false
sempach_pts <- st_cast(muni[muni$name == "Sempach",], to = "POINT")[1:10,]
sempach_multipts <- st_combine(sempach_pts)
sempach_line <- st_cast(sempach_multipts, "LINESTRING")
par(mfrow = c(1, 2))
plot(st_geometry(sempach_pts), pch = 16, main = "Points")
text(sempach_pts, 1:nrow(sempach_pts), pos = 4, cex = 0.8)
plot(sempach_line, lwd = 2, col = "navy", main = "Linestring")
```
#### Lines to polygons
If you have a linestring or multilinestring geometry forming a closed ring, you can easily convert it to a polygon. As an example, let's use the outer ring of the Sempach multipolygon.
```{r}
sempach_multiline <- st_cast(muni[muni$name == "Sempach",], to = "MULTILINESTRING")
sempach_poly <- st_cast(sempach_multiline, to = "POLYGON")
par(mfrow = c(1, 2))
plot(st_geometry(sempach_multiline), col = "navy", main = "Multilinestring")
plot(st_geometry(sempach_poly), col = "navy", main = "Polygon")
```
::: {.callout-note icon="false" title="Exercise (5 minutes)"}
Create a new object containing only the polygon for Sursee, check its validity, and compute its perimeter using the `st_length()` function; compare with the result of the `st_perimeter()` function.
```{r}
#| eval: false
#| code-fold: true
sursee_poly <- muni[muni$name == "Sursee",]
st_is_valid(sursee_poly)
sursee_multiline <- st_cast(sursee_poly, to = "MULTILINESTRING")
st_length(sursee_multiline)
st_perimeter(sursee_poly)
```
:::
### Spatial predicates
::: {.callout-tip appearance="minimal" collapse="true" title="If you start from here..."}
Run the following code to load and create everything you'll need to run the examples in this section.
```{r}
#| eval: false
library(sf)
muni <- st_read("data/geodata.gpkg", "municipalities", quiet = TRUE)
streets <- st_read("data/geodata.gpkg", "streets", quiet = TRUE)
obs <- read.csv("data/observations.csv")
obs <- st_as_sf(obs, coords = c("x", "y"), crs = "EPSG:2056")
```
:::
Topology describes the spatial relationships between vector objects. For example, two features can intersect, or one feature can contain another one. The existence of such relationships between features is tested by functions called spatial (binary) predicates. Many are available in the `sf` package, use `?geos_binary_pred` if you want to see the full list.
::: callout-important
When using spatial predicates you must be sure that both objects use the same CRS.
:::
We can for example easily test whether bird sightings are located in Sempach, or somewhere else.
```{r}
sempach <- muni[muni$name == "Sempach",]
obs_in_sempach <- st_intersects(obs, sempach)
obs_in_sempach
summary(lengths(obs_in_sempach) > 0)
```
The output is stored in a memory-efficient sparse matrix format which is not always easy for humans to read. We can use the `sparse = FALSE` argument to get a non-sparse matrix and perform standard operations (e.g., computing the number of sightings in Sempach).
```{r}
obs_in_sempach <- st_intersects(obs, sempach, sparse = FALSE)
tail(obs_in_sempach)
sum(obs_in_sempach)
```
Using another predicate (`st_disjoint`), we can get a list of all sightings that are in other municipalities. Of course, this computation is superfluous in this case since the output of the `st_disjoint()` function is the complement of the set provided by the `st_intersects()` function.
```{r}
obs_not_in_sempach <- st_disjoint(obs, sempach, sparse = FALSE)
sum(obs_not_in_sempach)
```
We can easily find all the sightings that are located within 1 km of the Swiss Ornithological Institute.
```{r}
soi <- st_as_sfc("POINT(2657271 1219754)", crs = "EPSG:2056")
st_is_within_distance(soi, obs, dist = 1000)
```
Things get a bit more complex when the two elements used inside the predicate contain multiple features. In this example we test for intersections between two municipalities and all the highway segments stored in the streets data set.
```{r}
muni_extract <- muni[6:7,]
highways <- streets[streets$type == 2,]
st_intersects(muni_extract, highways)
```
Don't hesitate to try other predicates (e.g. `st_within()`, `st_contains()`). The difference between some of them is sometimes quite subtle (e.g., the influence of the feature border). If you need even more flexibility you should use the `st_relate()` function. This flexibility comes with a price, though. The `st_relate()` function is much slower since it doesn't use spatial indices. If you want an in-depth explanation of all the possibilities, you should check the following website: <https://en.wikipedia.org/wiki/DE-9IM>{target="_blank"}.
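As a minimal sketch, `st_relate()` can either return the full DE-9IM pattern for each pair of features, or test a specific pattern (the pattern used below, meaning "the interiors intersect", is just one example):
```{r}
#| eval: false
st_relate(muni_extract, highways)                         # DE-9IM strings
st_relate(muni_extract, highways, pattern = "T********")  # interiors intersect
```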
### Spatial subsetting
::: {.callout-tip appearance="minimal" collapse="true" title="If you start from here..."}
Run the following code to load and create everything you'll need to run the examples in this section.
```{r}
#| eval: false
library(sf)
muni <- st_read("data/geodata.gpkg", "municipalities", quiet = TRUE)
obs <- read.csv("data/observations.csv")
obs <- st_as_sf(obs, coords = c("x", "y"), crs = "EPSG:2056")
sempach <- muni[muni$name == "Sempach",]
soi <- st_as_sfc("POINT(2657271 1219754)", crs = "EPSG:2056")
```
:::
Now that we know how to test different topological properties, we can use them to subset data spatially. The `sf` package allows doing that using the usual `[]` notation. The `st_intersects` predicate is used by default if you don't specify anything. This is how we create a new `sf` object containing only the sightings in Sempach.
```{r}
obs_in_sempach <- obs[sempach,]
# Equivalent to
obs_in_sempach <- obs[sempach, , op = st_intersects]
```
The empty (second) argument can be used to select the desired attribute columns.
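For example (the column name "species" is hypothetical; replace it with an attribute that actually exists in your data):
```{r}
#| eval: false
# Keep only the "species" attribute of the sightings located in Sempach
obs_in_sempach <- obs[sempach, "species"]
```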
::: {.callout-note icon="false" title="Exercise (5 minutes)"}
Within a 2 km radius around the Swiss Ornithological Institute, how many bird sightings are in Neuenkirch? Try to make a map of the municipalities with the filtered sightings.
```{r}
#| eval: false
#| code-fold: true
neuenkirch <- muni[which(muni$name == "Neuenkirch"),]
obs_within_2000m <- obs[soi, , op = st_is_within_distance, dist = 2000]
obs_filtered <- obs_within_2000m[neuenkirch,]
nrow(obs_filtered)
plot(st_geometry(muni))
plot(st_geometry(obs_filtered), add = TRUE)
```
:::
### Spatial joins
::: {.callout-tip appearance="minimal" collapse="true" title="If you start from here..."}
Run the following code to load and create everything you'll need to run the examples in this section.
```{r}
#| eval: false
library(sf)
muni <- st_read("data/geodata.gpkg", "municipalities", quiet = TRUE)
obs <- read.csv("data/observations.csv")
obs <- st_as_sf(obs, coords = c("x", "y"), crs = "EPSG:2056")
```
:::
We've already seen how to join a spatial object to another table using attributes. Now we'll do something similar but instead of using attributes, we'll perform a join between spatial objects based on their topological relationships. As a first example we will join the bird sightings data set with the municipalities data set. As output we will get the bird sightings with additional attributes corresponding to their respective municipality. We'll do this using the `st_join()` function.
```{r}
obs_muni <- st_join(obs, muni, join = st_intersects)
obs_muni
```
In this example both data sets have an attribute called "name". When we join them together, R automatically renames these columns to "name.x" and "name.y"; the "x" and "y" correspond to the order of the data sets when calling the `st_join()` function. We can now easily compute the number of sightings per municipality.
```{r}
table(obs_muni$name.y)
```
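As an aside, if you prefer more explicit suffixes than the default ".x" and ".y" used above, the `st_join()` function accepts a `suffix` argument (a minimal sketch, assuming a reasonably recent version of `sf`):
```{r}
#| eval: false
obs_muni <- st_join(obs, muni, join = st_intersects, suffix = c("_obs", "_muni"))
```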
Let's try another spatial join. This time we will join the sightings with a landcover data set, which is an extract of the [swissTLM3D](https://www.swisstopo.admin.ch/en/landscape-model-swisstlm3d){target="_blank"} data set provided by Swisstopo. The goal of the analysis is to add a new attribute to the bird sightings data set corresponding to the landcover value.