tidyverse_intro.R

#' ---
#' title: "Using the tidyverse!"
#' author: "Beau Larkin"
#' date: "2021-04-05"
#' output:
#'   github_document:
#'     toc: true
#'     toc_depth: 2
#'   pdf_document:
#'     df_print: tibble
#'     fig_caption: yes
#'     highlight: tango
#'     toc: true
#'     toc_depth: 2
#' ---
#'
#' # Description
#' 
#' * For University of Montana Course BIOB-595 - Art Woods, professor
#' * 2021-04-05 - Beau Larkin (guest instructor)
#' 
#' **Goals:**
#' 
#' * Show some features of the tidyverse
#' * Convince you that it's at least worth your time to learn more
#' * Give you some tools for further exploration
#' * != flexing huge skills
#' 
#' This session borrows heavily from [*R for Data Science*](https://r4ds.had.co.nz/index.html) (R4DS) 
#' by Wickham and Grolemund, 2017, O'Reilly, and from other sources as annotated in-text.
#' 
#' BTW What is this #' thing? It's useful. I'll show you at the end.
#' 
#' Let's get started!

# Package and library installation
packages_needed = c("tidyverse", "knitr", "nycflights13", "magrittr")
packages_installed = packages_needed %in% rownames(installed.packages())

if (any(! packages_installed))
    install.packages(packages_needed[! packages_installed])
for (i in 1:length(packages_needed)) {
    library(packages_needed[i], character.only = T)
}

#' # Engagement
#' What do you know about the tidyverse? Chat three words...
#' Do you use tidyverse packages and functions? Which packages?
#' 
#' ## Here's what I think:
#' 
#' * It's more than a collection of functions and fads (con sarn it, these new-fangled pipes!)
#' * A paradigm for data science, and it's here to stay
#' * Why tidyverse? It allows us to use intuitive, 
#' linear processes to get answers and results from data quickly
#'
#' ## A *paradigm*???
#' What do you need before creating graphics, applying models, and testing hypotheses? 
#' 
#' * ...
#' * The 80/20 rule...
#' 
#' # Tidy data 
#' 
#' * Chapter 12 in R4DS
#' * *Tidy Data* in the [Journal of Statistical Software, Wickham 2014](https://www.jstatsoft.org/article/view/v059i10). 
#' * You cannot pass into the tidyverse without tidy data! (whoa, that's biblical, but seriously
#' things don't work well in the tidyverse with messy data)
#' 
#' "Tidy data are tidy in the same way. Each messy dataset is messy in it's own way." - Wickham
#' 
#' ## Examples: messy data

#+ ex_1
ex_1 <- data.frame(
    plot_replicate = paste(rep(LETTERS[1:5], each = 4), rep(1:4, 5), sep = "_"), 
    date = rep("2016-06-12", 20),
    measurement_1 = rbinom(20, 50, 0.2)
)
ex_1[20, 2] <- "2016-06-12, measured after lunch"
kable(ex_1)
# View(ex_1)
# What is messy? Comment in chat but wait to submit...
#'
#' How would we produce a mean of measurement_1 in plots???
#+ aggregate_ex_1
aggregate(measurement_1 ~ plot_replicate, FUN = mean, data = ex_1) 
#' Guess it's "back to excel"

#+ ex_2
ex_2 <- data.frame(
    plot = rep(LETTERS[1:5], each = 8),
    replicate = rep(1:4, each = 2),
    parameter = rep(c("aphids_n", "height_cm"), 10),
    value = rep(c(700, 80), 10) + rbinom(40, 500, 0.4)
)
kable(ex_2)
# View(ex_2)
# What is messy? Comment in chat but wait to submit...

#' ## Tidy data rules:
#' There are three interrelated rules which make a dataset tidy:
#' 
#' 1. Each variable must have its own column.
#' 1. Each observation must have its own row.
#' 1. Each value must have its own cell.
#' 
#' So, remember the rules or just develop good habits that lead to intuition?
#' 
#' * Tidy or not? Not always binary...unfortunately. But like art, you know messy when you see it.
#' * Quickly recognize problems in data...
#' * Think ahead when planning to collect data...
#' 
#' ### At this point, *tibbles* often come up (see Chapter 10 in R4DS)
#' 
#' * Unavoidable in tidyverse, and mostly this is a good thing
#' * Occasional incompatibility issues, so be aware

#+ iris_tibble
# data frame
iris 
head(iris)
# tibble
(iris_t <- as_tibble(iris)) 
# sometimes you must convert
as.data.frame(iris_t)

#' # Pipes!! 
#' 
#' * What do you know already? Start with any vector `x`, and use a pipe to log transform `x` 
#' * Chapter 18 in R4DS
#' * [Pipes in R Tutorial For Beginners](https://www.datacamp.com/community/tutorials/pipe-r-tutorial)
#' 
#' How do we typically order data and functions, thinking of data as nouns and functions as verbs...

#+ pipes_intro
x <- c(0.109, 0.359, 0.63, 0.996, 0.515, 0.142, 0.017, 0.829, 0.907)
log(x) # this is verb, noun

#' The forward pipe aligns code with the order in which we think and speak
x %>% log() # this is noun, verb

#' An annoying but useful example:
round(exp(mean(log(x))), 1) # "nested"
#' Closing parenthesis woes...
#' 
#' How would you say this in plain English? 
#' * "Round to 1 digit the exponentiated mean of the logarithm of x" <- AWFUL
#' 
#' ## With pipes:
x %>% log() %>% mean() %>% exp() %>% round(1)
#' Take x, log-transform it, average it, exponentiate it, and round it to 1 digit (**cheeky**)
#' 
#' Note: () not necessary but make the code easier to read by highlighting functions
#' 
#' Yeah, it's worth it to just spend 15 minutes dinking around with this kind of thing until you get it...
#' 
#' ## First position rule and the placeholder "."
#' Most functions take multiple arguments...
#+ first_position_rule
log(x, base = 2)
x %>% log(base = 2) %>% mean(na.rm = TRUE) %>% exp() %>% round(1)
x %>% log(., base = 2)
x %>% log(., base = 2) %>% mean(., na.rm = TRUE) %>% exp(.) %>% round(., 1) # Annoying, dot is implied

#' When noun isn't used in first position of a function, the "." is necessary
#+ aggregate_example, error=TRUE
# the trad way
aggregate(Sepal.Length ~ Species, FUN = mean, data = iris) 
# the naive but wrong pipe way
iris %>% aggregate(Sepal.Length ~ Species, FUN = mean) 
# the right pipe way
iris %>% aggregate(Sepal.Length ~ Species, FUN = mean, data = .) 

#' See package `magrittr` for other pipes that do fancy things, but the forward pipe is your workhorse
#' 
#' Whew! Let's pause for questions.
#' 
#' Chat a question if you have one. Chances are, someone else has the same question. 
#' 
#' # Verbs
#' The core of tidyverse usage, from package `dplyr`, verbs are what we do with our nouns (data), 
#' which are typically data frames (tibbles)
#' 
#' ## All verbs in the tidyverse:
#' 
#' 1. take a data frame as their first argument,
#' 1. identify variable names in subsequent arguments, without quotes, 
#' 1. and create a new data frame.
#' 
#' * Chapter 5 in R4DS
#' * Using `nycflights13::flights`. (Sorry not biology). This data frame contains all 
#' 336,776 flights that 
#' departed from New York City in 2013. Each row contains data from a single flight. 
#' The data comes from the US Bureau of Transportation 
#' Statistics, and is documented in `?flights`. Let's have a look:
#+ flights_data
flights
# The tibble view is handy, but still we cannot see all the variables

#' ## Verb: `glimpse()`
#' It's indispensable, as you'll see when we get to a complicated wrangle

#+ verb_glimpse
flights %>% glimpse()

#' ## Verb: `filter()`
#' 
#' * Acts on rows
#' 
#' ### Find United Airlines flights
#+ filter_intro
filter(flights, carrier == "UA")
# with forward pipe
flights %>% filter(., carrier == "UA")
flights %>% filter(carrier == "UA") # dot placeholder is implied

#' ### Find United Airlines and American Airlines flights
flights %>% filter(carrier == "UA" & carrier == "AA")
#' Wrong, trying to combine levels within one variable, use OR
flights %>% filter(carrier == "UA" | carrier == "AA")
#' Better: use value matching function `%in%`
flights %>% filter(carrier %in% c("UA", "AA"))

#' ### Find United Airlines or American Airlines flights longer than 1200 miles in July
#' Use AND to combine across variables
#+ filter_1
flights %>% filter(carrier %in% c("UA", "AA") & distance > 1200 & month == 7)
#+ filter_2, results=FALSE
flights %>% filter(carrier %in% c("UA", "AA"), distance > 1200, month == 7) # same-same & ,

#' ### On your own: how many flights on February 10 departed on time or earlier? 
#' Paste this into your script to get started:
packages_needed = c("tidyverse", "nycflights13")
packages_installed = packages_needed %in% rownames(installed.packages())

if (any(! packages_installed))
    install.packages(packages_needed[! packages_installed])
for (i in 1:length(packages_needed)) {
    library(packages_needed[i], character.only = T)
}

flights %>% glimpse()

flights %>% filter(month == 2, day == 10, dep_delay <= 0)

#' Chat your answers to Art as quickly as you can...
# answer...

#' ## Verb: `select()`
#' 
#' * Acts on columns
#' 
flights %>% select(distance, air_time)
flights %>% select(starts_with("sched"), everything())

#' ## Verb: `mutate()`
#' 
#' * Acts on columns
#' * Creates new variables or replaces transformed, existing variables
#' 
flights %>% 
    select(distance, air_time) %>% 
    glimpse() %>% # in-line view with `glimpse()`
    mutate(speed_mph = distance / (air_time / 60))

#' ## Verbs: `group_by()` and `summarize()`
#' 
#' * Frequently used together
#' * Roughly like `tapply()` or `aggregate()` but much more flexible
#' * Act on rows grouped by variables, usually character or factor levels
#' * Prepares data frame for further transformation, produces results
#' 
#' ### Do different carriers fly at different speeds?
#+ group_by_summarize_intro
flights %>% 
    mutate(speed_mph = distance / (air_time / 60)) %>% 
    group_by(carrier) %>% 
    summarize(carrier_avg = mean(speed_mph))

#' What??? Oh yeah, NAs are contagious. For now, let's just nuke them:
flights %>% 
    glimpse() %>% 
    select(carrier, distance, air_time) %>% 
    glimpse() %>% 
    mutate(speed_mph = distance / (air_time / 60)) %>% 
    filter(!is.na(speed_mph)) %>% 
    glimpse() %>% # annoying, but makes the point
    group_by(carrier) %>% 
    summarize(carrier_avg = mean(speed_mph)) %>% 
    ungroup() %>% # you want this habit
    arrange(-carrier_avg) # yes another verb...

# how would you do this in base R? Obnoxious example:
#+ obnoxious, error=TRUE
flights_2 <- flights[, carrier]
# Ugh! Um...
flights_2 <- flights[, flights$carrier]
# stu##& R and %&^#ing #@%&&!
head(as.data.frame(flights))
flights_2 <- flights[which(complete.cases(flights)), c(10, 16, 15)]
flights_2$speed_mph <- flights_2$distance / (flights_2$air_time / 60)
flights_speed <- aggregate(speed_mph ~ carrier, FUN = mean, data = flights_2) # what???
flights_speed[order(desc(flights_speed$speed_mph)), ]

#' Of course it's doable, but it's harder to see what happened, and the intermediate objects always cause problems

#' ### Ok, what do the carrier codes stand for anyway?
#' From `?flights` we see that `airlines` is a data frame with the carrier names:
airlines

#' # Combine verbs for EDA (exploratory data analysis)
#' To motivate this part, let's make a guess and a follow-up explanation:
#' 
#' * We assume that departure delays are worse before winter holidays than in late January
#' * But do faster flying carriers make up lost time better? 
#'    * Let's limit this to departure delays under an hour. More than that and
#'    there is less to gain from hurrying. 

#+ flights_EDA_boxplot
flights %>% 
    filter(month %in% c(1, 12) & day == 20 & !is.na(dep_delay) & dep_delay <= 60) %>% 
    mutate(date = paste(year, month, day, sep = "-")) %>% 
    ggplot() + # You can wrangle straight into ggplot!!! Where is the placeholder? 
    geom_boxplot(aes(x = carrier, y = dep_delay)) +
    facet_wrap(vars(date))

#' Median departure delays are mostly below zero on 1-20 and above zero on 12-20

#+ flights_EDA_scatterplot_facets
flights %>%
    mutate( # sometimes breaking it out is more readable
        air_delay = arr_delay - dep_delay,
        speed_mph = distance / (air_time / 60),
        date      = paste(year, month, day, sep = "-")
    ) %>%
    filter(month %in% c(1, 12),
           day == 20, !is.na(air_delay),
           dep_delay <= 60) %>%
    filter(!(carrier %in% c("AS", "F9", "FL", "HA", "VX", "YV"))) %>%
    ggplot(aes(x = speed_mph, y = air_delay, group = date)) +
    geom_smooth(aes(color = date),
                size   = 0,
                method = "lm",
                se     = TRUE) +
    geom_point(aes(color = date), size = 0.4) +
    geom_smooth(aes(color = date), method = "lm", se = FALSE) +
    facet_wrap(vars(carrier))

#' Are pilots pushing it before Christmas?
#'
#' # Wrapping up
#' *R for Data Science*. Is this data mining or using data from planned experiments?
#' Thoughts and discussion. 
#' 
#' Questions and comments from you?
#' 
#' * If you are on GitHub, please comment, correct, or shred this script. [Link](https://github.com/bglarkin/tidyverse_intro) to repo.
#' * Or [download](https://mpgcloud.egnyte.com/dl/cguemdnod0) the R script
#' 
#' 
#' # Homework
#' 
#' * You will need to load the `tidyverse` library (after installing the package if necessary)
#' * With this script loaded in RStudio, choose File>Compile Report to render the document (if you want)
#' 
#' ## Transform a messy data frame into a tidy tibble
#' 
#' * Produce hw_1 using the code below. Use pipes and tidyverse functions to transform 
#' it into a tidy tibble. Hint = `?separate`
#+ hw_1
hw_1 <- data.frame(
    plot_replicate = paste(rep(LETTERS[1:5], each = 4), rep(1:4, 5), sep = "_"), 
    date = rep("2016-06-12", 20),
    measurement_1 = rbinom(20, 50, 0.2)
)

#+ hw_1_solved
hw_1 %>% 
    separate(plot_replicate, into = c("plot", "replicate")) %>% 
    as_tibble()

#' ## Are aphids more numerous on taller plants?
#' 
#' * Produce hw_2 using the code below. 
#' * Transform hw_2 into a tidy tibble using pipes and tidyverse functions. Hint = `?pivot_wider`
#' * Wrangle the data using tidyverse tools and pipe into ggplot to create scatterplots with 
#' linear trendlines superimposed.
#' * Put height_cm on the x axis, aphids_n on the y_axis, and facet on plot
#' * Hints: `?geom_point`, `?geom_smooth`, `?facet_wrap`
#' 
#+ hw_2
hw_2 <- data.frame(
    plot = rep(LETTERS[1:5], each = 8),
    replicate = rep(1:4, each = 2),
    parameter = rep(c("aphids_n", "height_cm"), 20),
    measurement = rep(c(700, 80), 20) + c(sort(rbinom(30, 500, 0.4)), rbinom(10, 500, 0.4))
)

#+ hw_2_solved
hw_2 %>% 
    pivot_wider(names_from = parameter, values_from = measurement) %>%  # `pivot_wider()` produces a tibble as output
    ggplot(aes(x = height_cm, y = aphids_n)) +
    geom_point() +
    geom_smooth(method = "lm") +
    facet_wrap(vars(plot), scales = "free_x") # note use of scales

#' ## Tidy vs. messy
#' 
#' * Explain why hw_3 (see below) is messy as it is
#' * Explain why hw_3 is tidy as it is
#' 
#' Hint: can you compare ants and aphids with `?geom_boxplot` and with a scatterplot (`?geom_point`) 
#' using the same configuration of hw_3, or is a transformation necessary to produce both of 
#' these plots? What does this suggest about independent observations and tidy data?
#' 
#' * Using `geom_point()` but still a boxplot format was a common misunderstanding
#+ hw_3
hw_3 <- data.frame(
    plot = rep(LETTERS[1:5], each = 8),
    replicate = rep(1:4, each = 2),
    insects = rep(c("aphids", "ants"), 20),
    count = rep(c(70, 20), 20) + rnorm(40, 0, 5) %>% round(., 0)
)

#+ hw_3_solved_1
hw_3 %>% 
    as_tibble() %>% # unnecessary but follows lesson strictly
    ggplot() +
    geom_boxplot(aes(x = insects, y = count)) +
    facet_wrap(vars(plot)) # not explicit in question, but appropriate to look for plot effect
#' Data are suitable for analysis as-is if the goal is to compare aphid and ant counts in plots.
#' You could argue that each row is already an independent observation if you aren't interested
#' in a relationship between ants and aphids.

#+ hw_3_solved_2
hw_3 %>% 
    pivot_wider(names_from = insects, values_from = count) %>% 
    as_tibble() %>% # unnecessary but follows lesson strictly
    ggplot(aes(x = aphids, y = ants)) +
    geom_point() +
    geom_smooth(method = "lm") +
    facet_wrap(vars(plot), scales = "free_x") # not explicit in question, but appropriate to look for plot effect

#' Correlations require pivoting, where each row is an independent observation.
#'
#' ### Student example
#' Using `geom_point()` but still really a boxplot format, was a common misunderstanding
hw_3 %>% 
    ggplot() +
    geom_boxplot(aes(insects, count)) + 
    facet_wrap(vars(plot)) # Even though the data is not fully tidy, we can easily make boxplots in ggplot
hw_3 %>% 
    ggplot() +
    geom_point(aes(insects, count)) + 
    facet_wrap(vars(plot)) # Same 

#' ## Bonus challenge
#' Use a [join](https://dplyr.tidyverse.org/reference/mutate-joins.html) function from `dplyr` to find out
#' how many different models of airplanes were flown out of New York by each airline
#'  in 2013. **Which carrier flew the largest number of airplane models?** 
#'  
#' * Hint: install/load `nycflights13` and then look at the `?planes` and `?flights` data 
#' frames.

#+ bonus_solved_1
flights %>% glimpse()
planes %>% glimpse()

#' In the planes data frame, the variable `tailnum` is associated with the airplane `model`. The variable `tailnum`
#' can join the flights and planes data frames. `Group_by` and `summarize()` would typically be used.  
#' Calling `distinct` avoids two separate rounds of `group_by()` and `summarize()`. Using `n_distinct()` makes things
#' very compact. To reach the answer quickly, calling `arrange()` can put the top carrier at the top of the tibble. 
#' Alternatively, a second call to `summarize`, which executes `max()` on the summary variable would work.
#' 
#' * Use of other than `left_join()` led some astray (others can work but may get complicated)
#' * Some attempted to filter the airports or years; ?flights...
#' * Must join to airplanes df; tailnumbers go with individual planes, not models

#+ bonus_solved_2
flights %>% 
    left_join(planes, by = "tailnum") %>% # best practice to specify `by = ` to maintain control over join keys
    distinct(carrier, model) %>% 
    group_by(carrier) %>% 
    summarize(n = n()) %>% ungroup() %>% 
    arrange(-n)

#+ bonus_solved_3
flights %>% 
    left_join(planes, by = "tailnum") %>% 
    group_by(carrier, model) %>% 
    summarize(n = n()) %>% ungroup() %>% # this summary is meaningless, which is why I prefer using `distinct()` or maybe `slice()`
    group_by(carrier) %>% 
    summarize(n = n()) %>%
    slice_max(n) %>% 
    left_join(airlines, by = "carrier") # pretty display

#' Many paths exist to the answer, but they all lead to American Airlines flying the most airplane models out of NYC in 2013.
#' 
#' ### Student example
#' Two of you made this very compact
#+ erba_solution
# Grace Erba: very compact, combines tidyverse and traditional tools
combined.plane.data <- full_join(flights, planes, by = "tailnum")
combined.plane.data %>% 
    group_by(carrier) %>%
    summarise(models = length(unique(model))) # `n_distinct()` = `length(unique())` also works here