---
title: "Darwin Core mapping of ESAS data"
author:
  - Peter Desmet
date: "`r Sys.Date()`"
output:
  html_document:
    df_print: paged
    number_sections: yes
    toc: yes
    toc_depth: 3
    toc_float: yes
---
This document describes how (and contains the code to) transform European Seabirds at Sea (ESAS) data to an [OBIS-ENV Darwin Core Archive](https://obis.org/manual/dataformat/#obis-env-data) that can be uploaded to an IPT.
# Setup
```{r setup, include = FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = TRUE)
```
Install required libraries (if not yet installed):
```{r}
installed <- rownames(installed.packages())
required <- c("magrittr", "here", "glue", "dplyr", "jsonlite", "assertthat", "readr", "tidyr", "DBI", "RSQLite", "icesVocab")
if (!all(required %in% installed)) {
  install.packages(required[!required %in% installed])
}
```
Load libraries:
```{r message = FALSE}
library(magrittr) # To use pipes
library(here) # To find files
library(glue) # To insert variables in strings
library(dplyr) # To transform data
library(readr) # To write files
library(tidyr) # To tidy data
library(jsonlite) # To handle json
library(assertthat) # To return errors
library(DBI) # To create and query databases
library(icesVocab) # To get code lists
```
# Read source data
Public data can be read automatically from the [ESAS web services](https://esas.ices.dk/webservices), but this is currently incredibly slow and likely to result in an error:
```{r get_remote_data, eval=FALSE}
# THIS STEP IS SKIPPED
# Load custom function
source(here::here("R", "get_data.R"))
# Query data
campaigns <- get_data("campaigns", "DataAccess=Public")
samples <- get_data("samples")
positions <- get_data("positions")
observations <- get_data("observations")
# Set all columns to character to avoid date interpretation issues in SQLite
# see https://stackoverflow.com/a/13462536/2463806
campaigns <- mutate(campaigns, across(everything(), as.character))
samples <- mutate(samples, across(everything(), as.character))
positions <- mutate(positions, across(everything(), as.character))
observations <- mutate(observations, across(everything(), as.character))
# Note that fields returned by the API are lowerCamelCase, while the official
# names are UpperCamelCase and returned as such by CSV downloads.
# The following steps (especially "Process data") assume UpperCamelCase
# The SQLite queries are case insensitive
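# A possible extra step (a sketch, not part of the official workflow): rename
# the lowerCamelCase columns returned by the API to the official
# UpperCamelCase names, so that the steps below work for API data too.
# to_upper_camel() is a hypothetical helper defined here.
to_upper_camel <- function(df) {
  names(df) <- paste0(toupper(substr(names(df), 1, 1)), substring(names(df), 2))
  df
}
campaigns <- to_upper_camel(campaigns)
samples <- to_upper_camel(samples)
positions <- to_upper_camel(positions)
observations <- to_upper_camel(observations)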
```
We therefore recommend the following **manual steps** (also listed in the README):
1. Go to <https://esas.ices.dk/inventory>.
2. Filter on `Access: Public`.
3. Click `Download Data` and accept the disclaimer.
4. Download the data from the download request page.
5. Unzip the download and move the files to a `data/raw` directory in the repository (ignored by git).
6. Read the data:
```{r get_local_data}
# Set all columns to character to avoid date interpretation issues in SQLite
# see https://stackoverflow.com/a/13462536/2463806
as_character <- cols(.default = "c")
campaigns <- read_csv(here::here("data", "raw", "Campaigns.csv"), col_types = as_character)
samples <- read_csv(here::here("data", "raw", "Samples.csv"), col_types = as_character)
positions <- read_csv(here::here("data", "raw", "Positions.csv"), col_types = as_character)
observations <- read_csv(here::here("data", "raw", "Observations.csv"), col_types = as_character)
```
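Optionally, verify that all four files were read and contain data (a minimal sanity check using the already loaded assertthat package):

```{r}
# All four source files should contain at least one row
assertthat::assert_that(nrow(campaigns) > 0)
assertthat::assert_that(nrow(samples) > 0)
assertthat::assert_that(nrow(positions) > 0)
assertthat::assert_that(nrow(observations) > 0)
```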
## Process data
Separate the columns `DataRightsHolder`, `Association` and `Behaviour`, which can contain `~`-delimited values, into a maximum of 3 columns each:
```{r}
campaigns <-
  campaigns %>%
  tidyr::separate(
    DataRightsHolder,
    into = c("DataRightsHolder_1", "DataRightsHolder_2", "DataRightsHolder_3"),
    sep = "~",
    remove = FALSE, convert = FALSE, extra = "merge", fill = "right"
  )
# country also contains ~ values, but is not used in the DwC mapping
```
```{r}
observations <-
  observations %>%
  tidyr::separate(
    Association,
    into = c("Association_1", "Association_2", "Association_3"),
    sep = "~",
    remove = FALSE, convert = FALSE, extra = "merge", fill = "right"
  ) %>%
  tidyr::separate(
    Behaviour,
    into = c("Behaviour_1", "Behaviour_2", "Behaviour_3"),
    sep = "~",
    remove = FALSE, convert = FALSE, extra = "merge", fill = "right"
  )
# country also contains ~ values, but is not used in the DwC mapping
```
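To illustrate the effect of this step (with a made-up value, not an actual `Behaviour` code): a value `FLY~SIT` is split over the first two columns and the third is filled with `NA`:

```{r}
# Hypothetical example value, to show how separate() handles ~ delimiters
tidyr::separate(
  data.frame(Behaviour = "FLY~SIT"),
  Behaviour,
  into = c("Behaviour_1", "Behaviour_2", "Behaviour_3"),
  sep = "~",
  remove = FALSE, convert = FALSE, extra = "merge", fill = "right"
)
```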
## Read code lists
Get code lists from the ICES vocabularies API:
```{r get_vocabularies}
edmo <- getCodeList("EDMO")
shipc <- getCodeList("SHIPC")
platformclass <- getCodeList(URLencode("Platform Class"))
platformside <- getCodeList("PlatformSide")
bdcountmethod <- getCodeList("BD_CountMethod")
targettaxa <- getCodeList("TargetTaxa")
useofbinoculars <- getCodeList("UseOfBinoculars")
beaufort <- getCodeList("Beaufort")
visibility <- getCodeList("Visibility")
glare <- getCodeList("Glare")
cloudcover <- getCodeList("CloudCover")
precipitation <- getCodeList("Precipitation")
sightability <- getCodeList("Sightability")
observationdistance <- getCodeList("ObservationDistance")
lifestage <- getCodeList("LifeStage")
moult <- getCodeList("Moult")
plumage <- getCodeList("Plumage")
sex <- getCodeList("SEXCO")
traveldirection <- getCodeList("TravelDirection")
preytype <- getCodeList("PreyType")
association <- getCodeList("Association")
behaviour <- getCodeList("Behaviour")
```
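Each code list is returned as a data frame. Inspecting one of them (the exact columns depend on the icesVocab version) can help when writing the SQL mapping queries:

```{r}
# Inspect the structure and first rows of one code list
str(beaufort)
head(beaufort)
```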
## Create database
Create an in-memory SQLite database with the source data and code lists, so they can be queried with SQL in the next steps:
```{r}
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
# Import data
DBI::dbWriteTable(con, "campaigns", campaigns)
DBI::dbWriteTable(con, "samples", samples)
DBI::dbWriteTable(con, "positions", positions)
DBI::dbWriteTable(con, "observations", observations)
# Import code lists
DBI::dbWriteTable(con, "edmo", edmo)
DBI::dbWriteTable(con, "shipc", shipc)
DBI::dbWriteTable(con, "platformclass", platformclass)
DBI::dbWriteTable(con, "platformside", platformside)
DBI::dbWriteTable(con, "bdcountmethod", bdcountmethod)
DBI::dbWriteTable(con, "targettaxa", targettaxa)
DBI::dbWriteTable(con, "useofbinoculars", useofbinoculars)
DBI::dbWriteTable(con, "beaufort", beaufort)
DBI::dbWriteTable(con, "visibility", visibility)
DBI::dbWriteTable(con, "glare", glare)
DBI::dbWriteTable(con, "cloudcover", cloudcover)
DBI::dbWriteTable(con, "precipitation", precipitation)
DBI::dbWriteTable(con, "sightability", sightability)
DBI::dbWriteTable(con, "observationdistance", observationdistance)
DBI::dbWriteTable(con, "lifestage", lifestage)
DBI::dbWriteTable(con, "moult", moult)
DBI::dbWriteTable(con, "plumage", plumage)
DBI::dbWriteTable(con, "sex", sex)
DBI::dbWriteTable(con, "traveldirection", traveldirection)
DBI::dbWriteTable(con, "preytype", preytype)
DBI::dbWriteTable(con, "association", association)
DBI::dbWriteTable(con, "behaviour", behaviour)
```
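As an optional sanity check, list the tables that are now available for querying:

```{r}
# All source tables and code lists should be listed
DBI::dbListTables(con)
```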
# Darwin Core mapping
The Darwin Core mapping follows the recommendations of the [OBIS Darwin Core manual](https://obis.org/manual/darwincore/) and is structured as [OBIS-ENV-DATA](https://obis.org/manual/dataformat/#obis-env-data).
Create [Event](https://rs.gbif.org/core/dwc_event_2022-02-02.xml) core:
```{r dwc_event}
dwc_event_sql <- glue::glue_sql(readr::read_file(here::here("sql", "dwc_event.sql")), .con = con)
dwc_event <- DBI::dbGetQuery(con, dwc_event_sql)
```
Create [Occurrence](https://rs.gbif.org/core/dwc_occurrence_2022-02-02.xml) extension:
```{r dwc_occurrence}
dwc_occurrence_sql <- glue::glue_sql(readr::read_file(here::here("sql", "dwc_occurrence.sql")), .con = con)
dwc_occurrence <- DBI::dbGetQuery(con, dwc_occurrence_sql)
```
Create [Extended Measurement Or Facts](https://rs.gbif.org/extension/obis/extended_measurement_or_fact.xml) extension:
```{r dwc_emof}
dwc_emof_sql <- glue::glue_sql(readr::read_file(here::here("sql", "dwc_emof.sql")), .con = con)
dwc_emof <- DBI::dbGetQuery(con, dwc_emof_sql)
```
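Optionally, check referential integrity before writing files (a sketch using the `eventID` column, which the sample filtering below also relies on): every `eventID` in the extensions should exist in the Event core:

```{r}
# Extensions should only reference events present in the Event core
assertthat::assert_that(all(dwc_occurrence$eventID %in% dwc_event$eventID))
assertthat::assert_that(all(dwc_emof$eventID %in% dwc_event$eventID))
```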
## Save data to file
Files are saved as tab-delimited text files to avoid [parsing issues](https://github.com/gbif/ipt/issues/1930) with commas in fields.
Save all data, to be used for IPT upload:
```{r}
directory <- here::here("data", "processed")
if (!dir.exists(directory)) { dir.create(directory) }
write_tsv(dwc_event, file.path(directory, "event.tsv.gz"), na = "")
write_tsv(dwc_occurrence, file.path(directory, "occurrence.tsv.gz"), na = "")
write_tsv(dwc_emof, file.path(directory, "emof.tsv.gz"), na = "")
```
Save a sample of the data, committed to git so that mapping issues can be spotted in diffs:
```{r}
# Filter data on a single campaign
campaign_id <- "110000153"
campaign_id_regex <- paste0("^", campaign_id) # Match start of string
dwc_event_sample <- dwc_event %>% filter(grepl(campaign_id_regex, eventID))
dwc_occurrence_sample <- dwc_occurrence %>% filter(grepl(campaign_id_regex, eventID))
dwc_emof_sample <- dwc_emof %>% filter(grepl(campaign_id_regex, eventID))
# Write data
directory <- here::here("data", "processed_sample")
if (!dir.exists(directory)) { dir.create(directory) }
write_tsv(dwc_event_sample, file.path(directory, "event.tsv"), na = "")
write_tsv(dwc_occurrence_sample, file.path(directory, "occurrence.tsv"), na = "")
write_tsv(dwc_emof_sample, file.path(directory, "emof.tsv"), na = "")
```
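Finally, close the database connection (this also discards the in-memory database):

```{r}
DBI::dbDisconnect(con)
```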