Tidyverse functions
Tidyverse functions are part of the tidyverse collection of R packages. These functions work well together, keep code readable, and are well suited to exploring and transforming data, which is why we try to use only these utensils in our mapping script. For mapping data to Darwin Core, we noticed you can mostly get by with just three functions: mutate(), recode() and case_when(), which are discussed below. To learn more about the other Tidyverse functions used in the mapping script, check the tidyverse documentation or type ?function_name in your RStudio console.
Piping means chaining functions with the pipe operator %>% (or "pipe"), which passes the result of one step on to the next. It is easy to use and greatly improves the readability of your code:
# Take the dataframe "taxon", group the values of the column "input_kingdom" and show a count for each unique value
taxon %>%
group_by(input_kingdom) %>%
count()
This is much more readable than the classic approach of nesting functions:
# Take the dataframe "taxon", group the values of the column "input_kingdom" and show a count for each unique value
count(group_by(taxon, input_kingdom))
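To check that the piped and nested forms really are equivalent, here is a minimal sketch on a made-up data frame. The three example rows are hypothetical and only stand in for real checklist data.

```r
library(dplyr)

# Hypothetical toy data standing in for the "taxon" data frame
taxon <- tibble(
  input_scientific_name = c("Aix galericulata", "Acer negundo", "Fallopia japonica"),
  input_kingdom = c("Animalia", "Plantae", "Plantae")
)

# Piped version: read top to bottom
piped <- taxon %>%
  group_by(input_kingdom) %>%
  count()

# Nested version: read inside out
nested <- count(group_by(taxon, input_kingdom))

# Compare the two results as plain data frames
identical(as.data.frame(piped), as.data.frame(nested))
```

Both versions call the same functions with the same arguments; the pipe only changes the order in which you read them.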
mutate() adds a new column to your data frame or updates an existing one. We use it to add a new Darwin Core term to our data frame and populate it with one or more values. To allow comparison between the source data and the Darwin Core terms, we do not update existing columns.
The basic code for mutate() looks like this:
input_data %<>% mutate(new_column_name = ...)
With:
- input_data: a data frame with your input data, i.e. the source checklist data
- %<>%: a shorter way of writing input_data <- input_data %>% ...
- mutate(): a function to add or update a column
- new_column_name: the name of the column you want to add to the data frame, i.e. the Darwin Core term. If this were an existing column name, it would update that column, which we want to avoid. That is also why we prefix all source column names with input_, so we don't accidentally update one of these if we add a Darwin Core term of the same name.
- ...: the value(s) to populate this new column with, whether these are static, unaltered or altered
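The following sketch shows the long and short forms side by side on a made-up data frame (the two example names are hypothetical). One detail worth knowing: %<>% comes from the magrittr package, which is not attached by library(dplyr) or library(tidyverse), so it has to be loaded explicitly.

```r
library(dplyr)
library(magrittr) # provides %<>%; dplyr alone only makes %>% available

# Hypothetical source data
input_data <- tibble(input_scientific_name = c("Aix galericulata", "Acer negundo"))

# Long form: pipe into mutate() and assign the result to a new object
long_form <- input_data %>% mutate(scientificName = input_scientific_name)

# Short form: %<>% pipes input_data into mutate() and overwrites input_data
input_data %<>% mutate(scientificName = input_scientific_name)

# The source column is kept next to the new Darwin Core term
names(input_data)
```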
Some Darwin Core terms have the same static value for every record in the data, i.e. their content is constant for the whole dataset. This is mostly the case for record-level terms (metadata) in the taxon core, but other terms can be static as well.
To map to a static value, write that value in "double quotes":
taxon %<>% mutate(license = "http://creativecommons.org/publicdomain/zero/1.0/")
taxon %<>% mutate(kingdom = "Animalia")
To copy the unaltered value of a source column to a Darwin Core term, use the name of that column as your value:
taxon %<>% mutate(scientificName = input_scientific_name)
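As a small illustration, static and unaltered mappings can also be written in a single mutate() call, since mutate() accepts several name = value pairs. This is only a sketch on hypothetical data, and separate calls per term work just as well.

```r
library(dplyr)
library(magrittr)

# Hypothetical source data
taxon <- tibble(input_scientific_name = c("Aix galericulata", "Alopochen aegyptiaca"))

taxon %<>% mutate(
  license = "http://creativecommons.org/publicdomain/zero/1.0/", # static value
  kingdom = "Animalia",                                           # static value
  scientificName = input_scientific_name                          # unaltered copy
)

# A static value is repeated for every record
taxon$kingdom
```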
If you want to standardize, correct or combine the source data before mapping it to a Darwin Core term, you will have to write an expression in your mutate() function to do that. A simple example is concatenating the values from two columns:
taxon %<>% mutate(scientificName = paste(input_genus, input_species))
The range of possibilities and pitfalls is too big to cover here (e.g. the example above will create odd values if one of the input columns is empty), but for standardizing or correcting values there are two functions we would like to introduce: recode() and case_when(). Both are used in conjunction with mutate().
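To make the pitfall with paste() concrete, here is a sketch that assumes "empty" means NA in the source data. The data and the column name naive_name are hypothetical. Using if_else() is just one possible guard, not necessarily what the mapping script does.

```r
library(dplyr)
library(magrittr)

# Hypothetical source data with one missing species epithet
taxon <- tibble(
  input_genus = c("Aix", "Acer"),
  input_species = c("galericulata", NA)
)

# paste() converts NA to the literal text "NA", so the second row becomes "Acer NA"
taxon %<>% mutate(naive_name = paste(input_genus, input_species))

# One possible guard: only append the species when it is present
taxon %<>% mutate(scientificName = if_else(
  is.na(input_species),
  input_genus,
  paste(input_genus, input_species)
))

taxon$scientificName
```

If empty values are stored as "" rather than NA in your data, the condition would need to test for that instead.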
recode() replaces specific input values with new, altered values in a one-to-one mapping. It is useful for correcting specific typos or mapping values to controlled vocabularies. The basic code is:
input_data %<>% mutate(darwin_core_term = recode(input_column,
"input_value_1" = "dwc_value_1",
"input_value_2" = "dwc_value_2",
.default = "", # Option to handle other input values, drop this to leave them as is
.missing = "" # Option to handle NA values
))
input_data %<>% mutate(input_scientific_name = recode(input_scientific_name,
"AseroÙ rubra" = "Asero rubra"
))
In the above example we correct the typo AseroÙ rubra to Asero rubra. All other input_scientific_name values are left untouched (we did not use the .default parameter). Here we also overwrite the column input_scientific_name with the recoded values, as we will use that column as the basis for our Taxon IDs.
Add comments to explain why you recoded some values:
taxon %<>% mutate(phylum = recode(input_phylum,
"Crustacea" = "Arthropoda" # Crustacea is not a phylum
))
taxon %<>% mutate(taxonRank = recode(input_rankmarker,
"infrasp." = "infraspecificname",
"sp." = "species",
"var." = "variety",
.default = ""
))
In the above example we map our input_rankmarker to the GBIF vocabulary for taxon ranks. Any input value we haven't defined will be set to an empty string (.default = "").
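The effect of .default and .missing is easiest to see on a small vector. The sketch below uses hypothetical rank markers. Note that .default alone does not touch NA values, which is what .missing is for.

```r
library(dplyr)

# Hypothetical rank markers, including an undefined value and a missing one
input_rank <- c("sp.", "var.", "forma", NA)

# No .default: the undefined value "forma" is left as is, NA stays NA
as_is <- recode(input_rank, "sp." = "species", "var." = "variety")

# .default only: "forma" becomes "", NA still stays NA
with_default <- recode(input_rank, "sp." = "species", "var." = "variety",
                       .default = "")

# .default and .missing: both "forma" and NA become ""
with_both <- recode(input_rank, "sp." = "species", "var." = "variety",
                    .default = "", .missing = "")

as_is
with_default
with_both
```

In recent dplyr releases recode() is marked as superseded in favour of case_match(). It still works as described here.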
case_when() lets you assign values based on conditions, rather than on the specific values used by recode(). It is useful when the mapping of a term depends on multiple input values. The basic code is:
input_data %<>% mutate(darwin_core_term = case_when(
conditional_statement_1 ~ "dwc_value_1",
conditional_statement_2 ~ "dwc_value_2",
TRUE ~ "dwc_value_3" # Option to handle all other conditions
))
You can read this as: if conditional_statement_1 is true then map to dwc_value_1, if conditional_statement_2 is true then map to dwc_value_2, else map to dwc_value_3.
distribution %<>% mutate(locality = case_when(
!is.na(input_locality) ~ input_locality,
input_country_code == "BE" ~ "Belgium",
input_country_code == "GB" ~ "United Kingdom",
input_country_code == "MK" ~ "Macedonia",
input_country_code == "NL" ~ "The Netherlands",
TRUE ~ ""
))
In the above example we populate the Darwin Core term locality with the value of input_locality if that is not empty (!is.na). Otherwise, we use specific input_country_code values to map to a country name. In all other cases (e.g. another input_country_code) we leave locality empty (TRUE ~ ""). Note how we used two input columns (input_locality and input_country_code) for this mapping.
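Conditions in case_when() are evaluated in order and the first one that is true wins, so the order of the lines matters. The sketch below runs a shortened version of the mapping above on three hypothetical records to show this.

```r
library(dplyr)
library(magrittr)

# Hypothetical distribution records
distribution <- tibble(
  input_locality = c("Flanders", NA, NA),
  input_country_code = c("BE", "NL", "FR")
)

distribution %<>% mutate(locality = case_when(
  !is.na(input_locality) ~ input_locality,          # checked first
  input_country_code == "BE" ~ "Belgium",
  input_country_code == "NL" ~ "The Netherlands",
  TRUE ~ ""                                         # everything else
))

# Row 1 keeps "Flanders" even though its country code is "BE",
# row 2 falls through to "The Netherlands", row 3 ends up empty
distribution$locality
```

Moving the !is.na(input_locality) line below the country code lines would map the first record to "Belgium" instead.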