Tidyverse functions

Tidyverse functions are part of the tidyverse collection of R packages. These functions work well together, keep the code readable and are well suited to exploring and transforming data, which is why we try to use only these utensils in our mapping script. For mapping data to Darwin Core, we noticed you can mostly get by with just three functions: mutate(), recode() and case_when(), which are discussed below. To learn more about the other tidyverse functions used in the mapping script, check the tidyverse documentation or type ?function_name in your RStudio console.

A quick word about piping

Piping means chaining functions with the pipe operator %>% (or "pipe"): the result of one step is passed on as the input of the next. It is easy to use and greatly improves the readability of your code:

# Take the dataframe "taxon", group the values of the column "input_kingdom" and show a count for each unique value
taxon %>%
  group_by(input_kingdom) %>%
  count()

This is much more readable than the classic approach of nesting functions:

# Take the dataframe "taxon", group the values of the column "input_kingdom" and show a count for each unique value
count(group_by(taxon, input_kingdom))
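To see that both forms give the same result, here is a minimal sketch that can be run as is. The taxon data frame and its values are made up for illustration only:

```r
library(dplyr)

# Hypothetical input data, for illustration only
taxon <- tibble(
  input_kingdom = c("Animalia", "Plantae", "Animalia")
)

# Piped version: read from top to bottom
piped <- taxon %>%
  group_by(input_kingdom) %>%
  count()

# Nested version: read from the inside out
nested <- count(group_by(taxon, input_kingdom))

# Both contain one row per kingdom, with the number of records in column n
identical(piped$n, nested$n)
```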

mutate()

mutate() adds a column to your data frame or updates an existing one. We use it to add a new Darwin Core term to our data frame and populate it with one or more values. To allow comparison between the source data and the Darwin Core terms, we do not update existing columns.

The basic code for mutate() looks like this:

input_data %<>% mutate(new_column_name = ...)

With:

input_data: a data frame with your input data, i.e. the source checklist data
%<>%: the assignment pipe from the magrittr package, a shorter way of writing input_data <- input_data %>% ...
mutate(): a function to add or update a column
new_column_name: the name of the column you want to add to the data frame, i.e. the Darwin Core term. If this were an existing column name, it would update that column, which we want to avoid. That is also why we prefix all source column names with input_, so we don't accidentally update one of these if we add a Darwin Core term of the same name.
…: the value(s) to populate this new column with, whether these are static, unaltered or altered
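As a small runnable sketch of the pattern above (the data and the long_form copy are hypothetical and only serve as illustration), the following compares %<>% with the longer form and shows that the source column is kept next to the new one:

```r
library(dplyr)
library(magrittr) # provides the %<>% operator

# Hypothetical input data, for illustration only
input_data <- tibble(
  input_scientific_name = c("Vespa velutina", "Aix galericulata")
)

# Long form: pipe the data frame into mutate() and assign the result back
long_form <- input_data
long_form <- long_form %>% mutate(kingdom = "Animalia")

# Short form: %<>% pipes and assigns in one step
input_data %<>% mutate(kingdom = "Animalia")

# The source column is untouched, the new column is added after it
names(input_data)
```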

Mapping static values

Some Darwin Core terms have the same static value for every record in the data, i.e. their content is constant for the whole dataset. This is mostly the case for record-level terms (metadata) in the taxon core, but other terms can be static as well.

To map to a static value, write that value in "double quotes":

taxon %<>% mutate(license = "http://creativecommons.org/publicdomain/zero/1.0/")

taxon %<>% mutate(kingdom = "Animalia")

Mapping unaltered values

To copy the unaltered value of a source column to a Darwin Core term, use the name of that column as your value:

taxon %<>% mutate(scientificName = input_scientific_name)

Mapping altered values

If you want to standardize, correct or combine the source data before mapping it to a Darwin Core term, you will have to write an expression in your mutate() function to do that. A simple example is concatenating the values from two columns together:

taxon %<>% mutate(scientificName = paste(input_genus, input_species))

The range of possibilities (and potential bugs: the example above, for instance, will create odd values if one of the input columns is empty) is too big to cover here, but for standardizing or correcting values there are two functions we would like to introduce: recode() and case_when(). Both are used in conjunction with mutate().
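The odd values mentioned above can be reproduced with a minimal sketch. It assumes that "empty" means NA in the source data. The data and the column name scientificName_checked are hypothetical, and the guard with if_else() is just one possible way to handle this, not necessarily the one used in the mapping script:

```r
library(dplyr)
library(magrittr)

# Hypothetical input data: the second record has no species epithet
taxon <- tibble(
  input_genus   = c("Vespa", "Aix"),
  input_species = c("velutina", NA)
)

# paste() converts NA to the text "NA", so the second value becomes "Aix NA"
taxon %<>% mutate(scientificName = paste(input_genus, input_species))

# One possible guard: only concatenate when the species is present
taxon %<>% mutate(scientificName_checked = if_else(
  is.na(input_species),
  input_genus,
  paste(input_genus, input_species)
))

taxon
```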

recode()

recode() replaces specific input values with new, altered values in a one-to-one mapping. It is useful for correcting specific typos or mapping values to controlled vocabularies. The basic code is:

input_data %<>% mutate(darwin_core_term = recode(input_column,
  "input_value_1" = "dwc_value_1",
  "input_value_2" = "dwc_value_2",
  .default = "", # Option to handle other input values, drop this to leave them as is
  .missing = "" # Option to handle NA values
))
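To illustrate what .default and .missing do, here is a minimal runnable sketch. The input values are made up and include one value that is not listed in the mapping and one NA:

```r
library(dplyr)
library(magrittr)

# Hypothetical input data, including an unlisted value and a missing value
taxon <- tibble(input_rankmarker = c("sp.", "var.", "forma", NA))

taxon %<>% mutate(taxonRank = recode(input_rankmarker,
  "sp."    = "species",
  "var."   = "variety",
  .default = "", # "forma" is not listed above, so it is mapped to ""
  .missing = ""  # NA is mapped to "" as well
))

taxon$taxonRank
```

Without .default, the unlisted value would be copied as is. Without .missing, the NA would stay NA.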

Correcting specific typos

input_data %<>% mutate(input_scientific_name = recode(input_scientific_name,
  "AseroÙ rubra" = "Asero rubra"
))

In the above example we correct the typo AseroÙ rubra to Asero rubra. All the other input_scientific_names are left untouched (we did not use the .default parameter). Here we deliberately overwrite the column input_scientific_name with the recoded values, as we will use that column as the basis for our Taxon IDs.

Add comments to explain why you recoded some values:

taxon %<>% mutate(phylum = recode(input_phylum, 
  "Crustacea" = "Arthropoda" # Crustacea is not a phylum
))

Controlled vocabularies

taxon %<>% mutate(taxonRank = recode(input_rankmarker,
  "infrasp."  = "infraspecificname",
  "sp."       = "species",
  "var."      = "variety",
  .default    = ""
))

In the above example we map our input_rankmarker to the GBIF vocabulary for taxon ranks. Any input value we haven't defined will be mapped to an empty value (.default = "").

case_when()

case_when() lets you assign values based on conditions, rather than on the specific values used by recode(). It is useful when the mapping of a term depends on multiple input values. The basic code is:

input_data %<>% mutate(darwin_core_term = case_when(
  conditional_statement_1 ~ "dwc_value_1",
  conditional_statement_2 ~ "dwc_value_2",
  TRUE ~ "dwc_value_3" # Option to handle all other conditions
))

You can read this as: if conditional_statement_1 is true then map to dwc_value_1, if conditional_statement_2 is true then map to dwc_value_2, else map to dwc_value_3. The conditions are evaluated in order and the first one that is true determines the value.

Use multiple input values

distribution %<>% mutate(locality = case_when(
  !is.na(input_locality) ~ input_locality,
  input_country_code == "BE" ~ "Belgium",
  input_country_code == "GB" ~ "United Kingdom",
  input_country_code == "MK" ~ "Macedonia",
  input_country_code == "NL" ~ "The Netherlands",
  TRUE ~ ""
))

In the above example we populate the Darwin Core term locality with the value of input_locality if that is not empty (!is.na). Otherwise, we use specific input_country_codes to map to a country name. In all other cases (e.g. another input_country_code) we leave locality empty (TRUE ~ ""). Note how we used two input columns (input_locality and input_country_code) for this mapping.
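The behaviour described above can be checked with a minimal runnable sketch. The distribution data below is made up for illustration, and the list of country codes is shortened:

```r
library(dplyr)
library(magrittr)

# Hypothetical input data, for illustration only
distribution <- tibble(
  input_locality     = c("Flanders", NA, NA),
  input_country_code = c("BE", "NL", "FR")
)

distribution %<>% mutate(locality = case_when(
  !is.na(input_locality) ~ input_locality, # checked first, so it wins over "BE"
  input_country_code == "BE" ~ "Belgium",
  input_country_code == "NL" ~ "The Netherlands",
  TRUE ~ "" # "FR" matches none of the conditions above
))

distribution$locality
```

The first record keeps its input_locality even though its country code is BE, because the conditions are checked from top to bottom.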
