---
title: "Darwin Core mapping of ESAS data"
author:
  - Peter Desmet
date: "`r Sys.Date()`"
output:
  html_document:
    df_print: paged
    number_sections: yes
    toc: yes
    toc_depth: 3
    toc_float: yes
---
This document describes how (and contains the code to) transform European Seabirds at Sea (ESAS) data to an [OBIS-ENV Darwin Core Archive](https://obis.org/manual/dataformat/#obis-env-data) that can be uploaded to an IPT.
# Setup
```{r setup, include = FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = TRUE)
```
Install required libraries (if not yet installed):
```{r}
installed <- rownames(installed.packages())
required <- c("magrittr", "here", "glue", "dplyr", "jsonlite", "assertthat", "readr", "tidyr", "DBI", "RSQLite", "icesVocab")
if (!all(required %in% installed)) {
  install.packages(required[!required %in% installed])
}
```
Load libraries:
```{r message = FALSE}
library(magrittr) # To use pipes
library(here) # To find files
library(glue) # To insert variables in strings
library(dplyr) # To transform data
library(readr) # To write files
library(tidyr) # To tidy data
library(jsonlite) # To handle json
library(assertthat) # To return errors
library(DBI) # To create and query databases
library(icesVocab) # To get code lists
```
# Read source data
Public data can be read automatically from the [ESAS web services](https://esas.ices.dk/webservices), but this is currently incredibly slow and likely to result in an error:
```{r get_remote_data, eval=FALSE}
# THIS STEP IS SKIPPED
# Load custom function
source(here::here("R", "get_data.R"))
# Query data
campaigns <- get_data("campaigns", "DataAccess=Public")
samples <- get_data("samples")
positions <- get_data("positions")
observations <- get_data("observations")
# Set all columns to character to avoid date interpretation issues in SQLite
# see https://stackoverflow.com/a/13462536/2463806
campaigns <- mutate(campaigns, across(everything(), as.character))
samples <- mutate(samples, across(everything(), as.character))
positions <- mutate(positions, across(everything(), as.character))
observations <- mutate(observations, across(everything(), as.character))
# Note that fields returned by the API are lowerCamelCase, while the official
# names are UpperCamelCase and returned as such by CSV downloads.
# The following steps (especially "Process data") assume UpperCamelCase
# The SQLite queries are case insensitive
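# A possible extra step (a sketch, not part of the official workflow): rename
# the lowerCamelCase columns returned by the API to the official
# UpperCamelCase names, so that the steps below work for API data too.
# to_upper_camel() is a hypothetical helper defined here.
to_upper_camel <- function(df) {
  names(df) <- paste0(toupper(substr(names(df), 1, 1)), substring(names(df), 2))
  df
}
campaigns <- to_upper_camel(campaigns)
samples <- to_upper_camel(samples)
positions <- to_upper_camel(positions)
observations <- to_upper_camel(observations)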
```
We therefore recommend the following **manual steps** (also listed in the README):
1. Go to <https://esas.ices.dk/inventory>.
2. Filter on `Access: Public`.
3. Click `Download Data` and accept the disclaimer.
4. Download the data from the download request page.
5. Unzip the download and move the files to a `data/raw` directory in the repository (ignored by git).
6. Read the data:
```{r get_local_data}
# Set all columns to character to avoid date interpretation issues in SQLite
# see https://stackoverflow.com/a/13462536/2463806
as_character <- cols(.default = "c")
campaigns <- read_csv(here::here("data", "raw", "Campaigns.csv"), col_types = as_character)
samples <- read_csv(here::here("data", "raw", "Samples.csv"), col_types = as_character)
positions <- read_csv(here::here("data", "raw", "Positions.csv"), col_types = as_character)
observations <- read_csv(here::here("data", "raw", "Observations.csv"), col_types = as_character)
```
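Optionally, verify that all four files were read and contain data (a minimal sanity check using the already loaded assertthat package):

```{r}
# All four source files should contain at least one row
assertthat::assert_that(nrow(campaigns) > 0)
assertthat::assert_that(nrow(samples) > 0)
assertthat::assert_that(nrow(positions) > 0)
assertthat::assert_that(nrow(observations) > 0)
```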
## Process data
Separate the columns `DataRightsHolder`, `Association` and `Behaviour`, which can contain `~`-delimited values, into a maximum of 3 columns each:
```{r}
campaigns <-
  campaigns %>%
  tidyr::separate(
    DataRightsHolder,
    into = c("DataRightsHolder_1", "DataRightsHolder_2", "DataRightsHolder_3"),
    sep = "~",
    remove = FALSE, convert = FALSE, extra = "merge", fill = "right"
  )
# country also contains ~ values, but is not used in the DwC mapping
```
```{r}
observations <-
  observations %>%
  tidyr::separate(
    Association,
    into = c("Association_1", "Association_2", "Association_3"),
    sep = "~",
    remove = FALSE, convert = FALSE, extra = "merge", fill = "right"
  ) %>%
  tidyr::separate(
    Behaviour,
    into = c("Behaviour_1", "Behaviour_2", "Behaviour_3"),
    sep = "~",
    remove = FALSE, convert = FALSE, extra = "merge", fill = "right"
  )
# country also contains ~ values, but is not used in the DwC mapping
```
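To illustrate the effect of this step (with a made-up value, not an actual `Behaviour` code): a value `FLY~SIT` is split over the first two columns and the third is filled with `NA`:

```{r}
# Hypothetical example value, to show how separate() handles ~ delimiters
tidyr::separate(
  data.frame(Behaviour = "FLY~SIT"),
  Behaviour,
  into = c("Behaviour_1", "Behaviour_2", "Behaviour_3"),
  sep = "~",
  remove = FALSE, convert = FALSE, extra = "merge", fill = "right"
)
```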
## Read code lists
Get code lists from the ICES vocabularies API:
```{r get_vocabularies}
edmo <- getCodeList("EDMO")
shipc <- getCodeList("SHIPC")
platformclass <- getCodeList(URLencode("Platform Class"))
platformside <- getCodeList("PlatformSide")
bdcountmethod <- getCodeList("BD_CountMethod")
targettaxa <- getCodeList("TargetTaxa")
useofbinoculars <- getCodeList("UseOfBinoculars")
beaufort <- getCodeList("Beaufort")
visibility <- getCodeList("Visibility")
glare <- getCodeList("Glare")
cloudcover <- getCodeList("CloudCover")
precipitation <- getCodeList("Precipitation")
sightability <- getCodeList("Sightability")
observationdistance <- getCodeList("ObservationDistance")
lifestage <- getCodeList("LifeStage")
moult <- getCodeList("Moult")
plumage <- getCodeList("Plumage")
sex <- getCodeList("SEXCO")
traveldirection <- getCodeList("TravelDirection")
preytype <- getCodeList("PreyType")
association <- getCodeList("Association")
behaviour <- getCodeList("Behaviour")
```
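Each code list is returned as a data frame. Inspecting one of them (the exact columns depend on the icesVocab version) can help when writing the SQL mapping queries:

```{r}
# Inspect the structure and first rows of one code list
str(beaufort)
head(beaufort)
```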
## Create database
Create an in-memory SQLite database with the source data and code lists, so they can be queried with SQL in the next steps:
```{r}
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
# Import data
DBI::dbWriteTable(con, "campaigns", campaigns)
DBI::dbWriteTable(con, "samples", samples)
DBI::dbWriteTable(con, "positions", positions)
DBI::dbWriteTable(con, "observations", observations)
# Import code lists
DBI::dbWriteTable(con, "edmo", edmo)
DBI::dbWriteTable(con, "shipc", shipc)
DBI::dbWriteTable(con, "platformclass", platformclass)
DBI::dbWriteTable(con, "platformside", platformside)
DBI::dbWriteTable(con, "bdcountmethod", bdcountmethod)
DBI::dbWriteTable(con, "targettaxa", targettaxa)
DBI::dbWriteTable(con, "useofbinoculars", useofbinoculars)
DBI::dbWriteTable(con, "beaufort", beaufort)
DBI::dbWriteTable(con, "visibility", visibility)
DBI::dbWriteTable(con, "glare", glare)
DBI::dbWriteTable(con, "cloudcover", cloudcover)
DBI::dbWriteTable(con, "precipitation", precipitation)
DBI::dbWriteTable(con, "sightability", sightability)
DBI::dbWriteTable(con, "observationdistance", observationdistance)
DBI::dbWriteTable(con, "lifestage", lifestage)
DBI::dbWriteTable(con, "moult", moult)
DBI::dbWriteTable(con, "plumage", plumage)
DBI::dbWriteTable(con, "sex", sex)
DBI::dbWriteTable(con, "traveldirection", traveldirection)
DBI::dbWriteTable(con, "preytype", preytype)
DBI::dbWriteTable(con, "association", association)
DBI::dbWriteTable(con, "behaviour", behaviour)
```
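As an optional sanity check, list the tables that are now available for querying:

```{r}
# All source tables and code lists should be listed
DBI::dbListTables(con)
```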
# Darwin Core mapping
The Darwin Core mapping follows the recommendations of the [OBIS Darwin Core manual](https://obis.org/manual/darwincore/) and is structured as [OBIS-ENV-DATA](https://obis.org/manual/dataformat/#obis-env-data).
Create [Event](https://rs.gbif.org/core/dwc_event_2022-02-02.xml) core:
```{r dwc_event}
dwc_event_sql <- glue::glue_sql(readr::read_file(here::here("sql", "dwc_event.sql")), .con = con)
dwc_event <- DBI::dbGetQuery(con, dwc_event_sql)
```
Create [Occurrence](https://rs.gbif.org/core/dwc_occurrence_2022-02-02.xml) extension:
```{r dwc_occurrence}
dwc_occurrence_sql <- glue::glue_sql(readr::read_file(here::here("sql", "dwc_occurrence.sql")), .con = con)
dwc_occurrence <- DBI::dbGetQuery(con, dwc_occurrence_sql)
```
Create [Extended Measurement Or Facts](https://rs.gbif.org/extension/obis/extended_measurement_or_fact.xml) extension:
```{r dwc_emof}
dwc_emof_sql <- glue::glue_sql(readr::read_file(here::here("sql", "dwc_emof.sql")), .con = con)
dwc_emof <- DBI::dbGetQuery(con, dwc_emof_sql)
```
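Optionally, check referential integrity before writing files (a sketch using the `eventID` column, which the sample filtering below also relies on): every `eventID` in the extensions should exist in the Event core:

```{r}
# Extensions should only reference events present in the Event core
assertthat::assert_that(all(dwc_occurrence$eventID %in% dwc_event$eventID))
assertthat::assert_that(all(dwc_emof$eventID %in% dwc_event$eventID))
```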
## Save data to file
Files are saved as tab-delimited text files to avoid [parsing issues](https://github.com/gbif/ipt/issues/1930) with commas in fields.
Save all data, to be used for IPT upload:
```{r}
directory <- here::here("data", "processed")
if (!dir.exists(directory)) { dir.create(directory) }
write_tsv(dwc_event, file.path(directory, "event.tsv.gz"), na = "")
write_tsv(dwc_occurrence, file.path(directory, "occurrence.tsv.gz"), na = "")
write_tsv(dwc_emof, file.path(directory, "emof.tsv.gz"), na = "")
```
Save a sample of the data, committed to git so that mapping issues can be spotted in diffs:
```{r}
# Filter data on a single campaign
campaign_id <- "110000153"
campaign_id_regex <- paste0("^", campaign_id) # Match start of string
dwc_event_sample <- dwc_event %>% filter(grepl(campaign_id_regex, eventID))
dwc_occurrence_sample <- dwc_occurrence %>% filter(grepl(campaign_id_regex, eventID))
dwc_emof_sample <- dwc_emof %>% filter(grepl(campaign_id_regex, eventID))
# Write data
directory <- here::here("data", "processed_sample")
if (!dir.exists(directory)) { dir.create(directory) }
write_tsv(dwc_event_sample, file.path(directory, "event.tsv"), na = "")
write_tsv(dwc_occurrence_sample, file.path(directory, "occurrence.tsv"), na = "")
write_tsv(dwc_emof_sample, file.path(directory, "emof.tsv"), na = "")
```
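Finally, close the database connection (this also discards the in-memory database):

```{r}
DBI::dbDisconnect(con)
```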