Skip to content

Latest commit

 

History

History
1692 lines (1460 loc) · 75.8 KB

tidyverse_intro.md

File metadata and controls

1692 lines (1460 loc) · 75.8 KB

Using the tidyverse!

Beau Larkin 2021-04-05

Description

  • For University of Montana Course BIOB-595 - Art Woods, professor
  • 2021-04-05 - Beau Larkin (guest instructor)

Goals:

  • Show some features of the tidyverse
  • Convince you that it’s at least worth your time to learn more
  • Give you some tools for further exploration
  • != flexing huge skills

This session borrows heavily from R for Data Science (R4DS) by Wickham and Grolemund, 2017, O’Reilly, and from other sources as annotated in-text.

BTW What is this #’ thing? It’s useful. I’ll show you at the end.

Let’s get started!

# Package and library installation
packages_needed = c("tidyverse", "knitr", "nycflights13", "magrittr")
packages_installed = packages_needed %in% rownames(installed.packages())

if (any(! packages_installed))
    install.packages(packages_needed[! packages_installed])
for (i in 1:length(packages_needed)) {
    library(packages_needed[i], character.only = T)
}
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.1.0     ✓ dplyr   1.0.5
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.0

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

## 
## Attaching package: 'magrittr'

## The following object is masked from 'package:purrr':
## 
##     set_names

## The following object is masked from 'package:tidyr':
## 
##     extract

Engagement

What do you know about the tidyverse? Chat three words… Do you use tidyverse packages and functions? Which packages?

Here’s what I think:

  • It’s more than a collection of functions and fads (con sarn it, these new-fangled pipes!)
  • A paradigm for data science, and it’s here to stay
  • Why tidyverse? It allows us to use intuitive, linear processes to get answers and results from data quickly

A paradigm???

What do you need before creating graphics, applying models, and testing hypotheses?

  • The 80/20 rule…

Tidy data

  • Chapter 12 in R4DS
  • Tidy Data in the Journal of Statistical Software, Wickham 2014.
  • You cannot pass into the tidyverse without tidy data! (whoa, that’s biblical, but seriously things don’t work well in the tidyverse with messy data)

“Tidy data are tidy in the same way. Each messy dataset is messy in it’s own way.” - Wickham

Examples: messy data

ex_1 <- data.frame(
    plot_replicate = paste(rep(LETTERS[1:5], each = 4), rep(1:4, 5), sep = "_"), 
    date = rep("2016-06-12", 20),
    measurement_1 = rbinom(20, 50, 0.2)
)
ex_1[20, 2] <- "2016-06-12, measured after lunch"
kable(ex_1)
plot_replicate date measurement_1
A_1 2016-06-12 14
A_2 2016-06-12 16
A_3 2016-06-12 10
A_4 2016-06-12 11
B_1 2016-06-12 8
B_2 2016-06-12 16
B_3 2016-06-12 4
B_4 2016-06-12 11
C_1 2016-06-12 8
C_2 2016-06-12 11
C_3 2016-06-12 11
C_4 2016-06-12 10
D_1 2016-06-12 9
D_2 2016-06-12 9
D_3 2016-06-12 15
D_4 2016-06-12 9
E_1 2016-06-12 11
E_2 2016-06-12 12
E_3 2016-06-12 7
E_4 2016-06-12, measured after lunch 7
# View(ex_1)
# What is messy? Comment in chat but wait to submit...

How would we produce a mean of measurement_1 in plots???

aggregate(measurement_1 ~ plot_replicate, FUN = mean, data = ex_1) 
##    plot_replicate measurement_1
## 1             A_1            14
## 2             A_2            16
## 3             A_3            10
## 4             A_4            11
## 5             B_1             8
## 6             B_2            16
## 7             B_3             4
## 8             B_4            11
## 9             C_1             8
## 10            C_2            11
## 11            C_3            11
## 12            C_4            10
## 13            D_1             9
## 14            D_2             9
## 15            D_3            15
## 16            D_4             9
## 17            E_1            11
## 18            E_2            12
## 19            E_3             7
## 20            E_4             7

Guess it’s “back to excel”

ex_2 <- data.frame(
    plot = rep(LETTERS[1:5], each = 8),
    replicate = rep(1:4, each = 2),
    parameter = rep(c("aphids_n", "height_cm"), 10),
    value = rep(c(700, 80), 10) + rbinom(40, 500, 0.4)
)
kable(ex_2)
plot replicate parameter value
A 1 aphids_n 898
A 1 height_cm 283
A 2 aphids_n 911
A 2 height_cm 275
A 3 aphids_n 902
A 3 height_cm 268
A 4 aphids_n 920
A 4 height_cm 272
B 1 aphids_n 906
B 1 height_cm 281
B 2 aphids_n 904
B 2 height_cm 281
B 3 aphids_n 907
B 3 height_cm 288
B 4 aphids_n 918
B 4 height_cm 289
C 1 aphids_n 892
C 1 height_cm 275
C 2 aphids_n 897
C 2 height_cm 268
C 3 aphids_n 879
C 3 height_cm 277
C 4 aphids_n 908
C 4 height_cm 286
D 1 aphids_n 901
D 1 height_cm 281
D 2 aphids_n 903
D 2 height_cm 266
D 3 aphids_n 913
D 3 height_cm 290
D 4 aphids_n 892
D 4 height_cm 278
E 1 aphids_n 892
E 1 height_cm 282
E 2 aphids_n 892
E 2 height_cm 281
E 3 aphids_n 894
E 3 height_cm 264
E 4 aphids_n 915
E 4 height_cm 278
# View(ex_2)
# What is messy? Comment in chat but wait to submit...

Tidy data rules:

There are three interrelated rules which make a dataset tidy:

  1. Each variable must have its own column.
  2. Each observation must have its own row.
  3. Each value must have its own cell.

So, remember the rules or just develop good habits that lead to intuition?

  • Tidy or not? Not always binary…unfortunately. But like art, you know messy when you see it.
  • Quickly recognize problems in data…
  • Think ahead when planning to collect data…

At this point, tibbles often come up (see Chapter 10 in R4DS)

  • Unavoidable in tidyverse, and mostly this is a good thing
  • Occasional incompatibility issues, so be aware
# data frame
iris 
##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1            5.1         3.5          1.4         0.2     setosa
## 2            4.9         3.0          1.4         0.2     setosa
## 3            4.7         3.2          1.3         0.2     setosa
## 4            4.6         3.1          1.5         0.2     setosa
## 5            5.0         3.6          1.4         0.2     setosa
## 6            5.4         3.9          1.7         0.4     setosa
## 7            4.6         3.4          1.4         0.3     setosa
## 8            5.0         3.4          1.5         0.2     setosa
## 9            4.4         2.9          1.4         0.2     setosa
## 10           4.9         3.1          1.5         0.1     setosa
## 11           5.4         3.7          1.5         0.2     setosa
## 12           4.8         3.4          1.6         0.2     setosa
## 13           4.8         3.0          1.4         0.1     setosa
## 14           4.3         3.0          1.1         0.1     setosa
## 15           5.8         4.0          1.2         0.2     setosa
## 16           5.7         4.4          1.5         0.4     setosa
## 17           5.4         3.9          1.3         0.4     setosa
## 18           5.1         3.5          1.4         0.3     setosa
## 19           5.7         3.8          1.7         0.3     setosa
## 20           5.1         3.8          1.5         0.3     setosa
## 21           5.4         3.4          1.7         0.2     setosa
## 22           5.1         3.7          1.5         0.4     setosa
## 23           4.6         3.6          1.0         0.2     setosa
## 24           5.1         3.3          1.7         0.5     setosa
## 25           4.8         3.4          1.9         0.2     setosa
## 26           5.0         3.0          1.6         0.2     setosa
## 27           5.0         3.4          1.6         0.4     setosa
## 28           5.2         3.5          1.5         0.2     setosa
## 29           5.2         3.4          1.4         0.2     setosa
## 30           4.7         3.2          1.6         0.2     setosa
## 31           4.8         3.1          1.6         0.2     setosa
## 32           5.4         3.4          1.5         0.4     setosa
## 33           5.2         4.1          1.5         0.1     setosa
## 34           5.5         4.2          1.4         0.2     setosa
## 35           4.9         3.1          1.5         0.2     setosa
## 36           5.0         3.2          1.2         0.2     setosa
## 37           5.5         3.5          1.3         0.2     setosa
## 38           4.9         3.6          1.4         0.1     setosa
## 39           4.4         3.0          1.3         0.2     setosa
## 40           5.1         3.4          1.5         0.2     setosa
## 41           5.0         3.5          1.3         0.3     setosa
## 42           4.5         2.3          1.3         0.3     setosa
## 43           4.4         3.2          1.3         0.2     setosa
## 44           5.0         3.5          1.6         0.6     setosa
## 45           5.1         3.8          1.9         0.4     setosa
## 46           4.8         3.0          1.4         0.3     setosa
## 47           5.1         3.8          1.6         0.2     setosa
## 48           4.6         3.2          1.4         0.2     setosa
## 49           5.3         3.7          1.5         0.2     setosa
## 50           5.0         3.3          1.4         0.2     setosa
## 51           7.0         3.2          4.7         1.4 versicolor
## 52           6.4         3.2          4.5         1.5 versicolor
## 53           6.9         3.1          4.9         1.5 versicolor
## 54           5.5         2.3          4.0         1.3 versicolor
## 55           6.5         2.8          4.6         1.5 versicolor
## 56           5.7         2.8          4.5         1.3 versicolor
## 57           6.3         3.3          4.7         1.6 versicolor
## 58           4.9         2.4          3.3         1.0 versicolor
## 59           6.6         2.9          4.6         1.3 versicolor
## 60           5.2         2.7          3.9         1.4 versicolor
## 61           5.0         2.0          3.5         1.0 versicolor
## 62           5.9         3.0          4.2         1.5 versicolor
## 63           6.0         2.2          4.0         1.0 versicolor
## 64           6.1         2.9          4.7         1.4 versicolor
## 65           5.6         2.9          3.6         1.3 versicolor
## 66           6.7         3.1          4.4         1.4 versicolor
## 67           5.6         3.0          4.5         1.5 versicolor
## 68           5.8         2.7          4.1         1.0 versicolor
## 69           6.2         2.2          4.5         1.5 versicolor
## 70           5.6         2.5          3.9         1.1 versicolor
## 71           5.9         3.2          4.8         1.8 versicolor
## 72           6.1         2.8          4.0         1.3 versicolor
## 73           6.3         2.5          4.9         1.5 versicolor
## 74           6.1         2.8          4.7         1.2 versicolor
## 75           6.4         2.9          4.3         1.3 versicolor
## 76           6.6         3.0          4.4         1.4 versicolor
## 77           6.8         2.8          4.8         1.4 versicolor
## 78           6.7         3.0          5.0         1.7 versicolor
## 79           6.0         2.9          4.5         1.5 versicolor
## 80           5.7         2.6          3.5         1.0 versicolor
## 81           5.5         2.4          3.8         1.1 versicolor
## 82           5.5         2.4          3.7         1.0 versicolor
## 83           5.8         2.7          3.9         1.2 versicolor
## 84           6.0         2.7          5.1         1.6 versicolor
## 85           5.4         3.0          4.5         1.5 versicolor
## 86           6.0         3.4          4.5         1.6 versicolor
## 87           6.7         3.1          4.7         1.5 versicolor
## 88           6.3         2.3          4.4         1.3 versicolor
## 89           5.6         3.0          4.1         1.3 versicolor
## 90           5.5         2.5          4.0         1.3 versicolor
## 91           5.5         2.6          4.4         1.2 versicolor
## 92           6.1         3.0          4.6         1.4 versicolor
## 93           5.8         2.6          4.0         1.2 versicolor
## 94           5.0         2.3          3.3         1.0 versicolor
## 95           5.6         2.7          4.2         1.3 versicolor
## 96           5.7         3.0          4.2         1.2 versicolor
## 97           5.7         2.9          4.2         1.3 versicolor
## 98           6.2         2.9          4.3         1.3 versicolor
## 99           5.1         2.5          3.0         1.1 versicolor
## 100          5.7         2.8          4.1         1.3 versicolor
## 101          6.3         3.3          6.0         2.5  virginica
## 102          5.8         2.7          5.1         1.9  virginica
## 103          7.1         3.0          5.9         2.1  virginica
## 104          6.3         2.9          5.6         1.8  virginica
## 105          6.5         3.0          5.8         2.2  virginica
## 106          7.6         3.0          6.6         2.1  virginica
## 107          4.9         2.5          4.5         1.7  virginica
## 108          7.3         2.9          6.3         1.8  virginica
## 109          6.7         2.5          5.8         1.8  virginica
## 110          7.2         3.6          6.1         2.5  virginica
## 111          6.5         3.2          5.1         2.0  virginica
## 112          6.4         2.7          5.3         1.9  virginica
## 113          6.8         3.0          5.5         2.1  virginica
## 114          5.7         2.5          5.0         2.0  virginica
## 115          5.8         2.8          5.1         2.4  virginica
## 116          6.4         3.2          5.3         2.3  virginica
## 117          6.5         3.0          5.5         1.8  virginica
## 118          7.7         3.8          6.7         2.2  virginica
## 119          7.7         2.6          6.9         2.3  virginica
## 120          6.0         2.2          5.0         1.5  virginica
## 121          6.9         3.2          5.7         2.3  virginica
## 122          5.6         2.8          4.9         2.0  virginica
## 123          7.7         2.8          6.7         2.0  virginica
## 124          6.3         2.7          4.9         1.8  virginica
## 125          6.7         3.3          5.7         2.1  virginica
## 126          7.2         3.2          6.0         1.8  virginica
## 127          6.2         2.8          4.8         1.8  virginica
## 128          6.1         3.0          4.9         1.8  virginica
## 129          6.4         2.8          5.6         2.1  virginica
## 130          7.2         3.0          5.8         1.6  virginica
## 131          7.4         2.8          6.1         1.9  virginica
## 132          7.9         3.8          6.4         2.0  virginica
## 133          6.4         2.8          5.6         2.2  virginica
## 134          6.3         2.8          5.1         1.5  virginica
## 135          6.1         2.6          5.6         1.4  virginica
## 136          7.7         3.0          6.1         2.3  virginica
## 137          6.3         3.4          5.6         2.4  virginica
## 138          6.4         3.1          5.5         1.8  virginica
## 139          6.0         3.0          4.8         1.8  virginica
## 140          6.9         3.1          5.4         2.1  virginica
## 141          6.7         3.1          5.6         2.4  virginica
## 142          6.9         3.1          5.1         2.3  virginica
## 143          5.8         2.7          5.1         1.9  virginica
## 144          6.8         3.2          5.9         2.3  virginica
## 145          6.7         3.3          5.7         2.5  virginica
## 146          6.7         3.0          5.2         2.3  virginica
## 147          6.3         2.5          5.0         1.9  virginica
## 148          6.5         3.0          5.2         2.0  virginica
## 149          6.2         3.4          5.4         2.3  virginica
## 150          5.9         3.0          5.1         1.8  virginica
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
# tibble
(iris_t <- as_tibble(iris)) 
## # A tibble: 150 x 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
##  1          5.1         3.5          1.4         0.2 setosa 
##  2          4.9         3            1.4         0.2 setosa 
##  3          4.7         3.2          1.3         0.2 setosa 
##  4          4.6         3.1          1.5         0.2 setosa 
##  5          5           3.6          1.4         0.2 setosa 
##  6          5.4         3.9          1.7         0.4 setosa 
##  7          4.6         3.4          1.4         0.3 setosa 
##  8          5           3.4          1.5         0.2 setosa 
##  9          4.4         2.9          1.4         0.2 setosa 
## 10          4.9         3.1          1.5         0.1 setosa 
## # … with 140 more rows
# sometimes you must convert
as.data.frame(iris_t)
##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1            5.1         3.5          1.4         0.2     setosa
## 2            4.9         3.0          1.4         0.2     setosa
## 3            4.7         3.2          1.3         0.2     setosa
## 4            4.6         3.1          1.5         0.2     setosa
## 5            5.0         3.6          1.4         0.2     setosa
## 6            5.4         3.9          1.7         0.4     setosa
## 7            4.6         3.4          1.4         0.3     setosa
## 8            5.0         3.4          1.5         0.2     setosa
## 9            4.4         2.9          1.4         0.2     setosa
## 10           4.9         3.1          1.5         0.1     setosa
## 11           5.4         3.7          1.5         0.2     setosa
## 12           4.8         3.4          1.6         0.2     setosa
## 13           4.8         3.0          1.4         0.1     setosa
## 14           4.3         3.0          1.1         0.1     setosa
## 15           5.8         4.0          1.2         0.2     setosa
## 16           5.7         4.4          1.5         0.4     setosa
## 17           5.4         3.9          1.3         0.4     setosa
## 18           5.1         3.5          1.4         0.3     setosa
## 19           5.7         3.8          1.7         0.3     setosa
## 20           5.1         3.8          1.5         0.3     setosa
## 21           5.4         3.4          1.7         0.2     setosa
## 22           5.1         3.7          1.5         0.4     setosa
## 23           4.6         3.6          1.0         0.2     setosa
## 24           5.1         3.3          1.7         0.5     setosa
## 25           4.8         3.4          1.9         0.2     setosa
## 26           5.0         3.0          1.6         0.2     setosa
## 27           5.0         3.4          1.6         0.4     setosa
## 28           5.2         3.5          1.5         0.2     setosa
## 29           5.2         3.4          1.4         0.2     setosa
## 30           4.7         3.2          1.6         0.2     setosa
## 31           4.8         3.1          1.6         0.2     setosa
## 32           5.4         3.4          1.5         0.4     setosa
## 33           5.2         4.1          1.5         0.1     setosa
## 34           5.5         4.2          1.4         0.2     setosa
## 35           4.9         3.1          1.5         0.2     setosa
## 36           5.0         3.2          1.2         0.2     setosa
## 37           5.5         3.5          1.3         0.2     setosa
## 38           4.9         3.6          1.4         0.1     setosa
## 39           4.4         3.0          1.3         0.2     setosa
## 40           5.1         3.4          1.5         0.2     setosa
## 41           5.0         3.5          1.3         0.3     setosa
## 42           4.5         2.3          1.3         0.3     setosa
## 43           4.4         3.2          1.3         0.2     setosa
## 44           5.0         3.5          1.6         0.6     setosa
## 45           5.1         3.8          1.9         0.4     setosa
## 46           4.8         3.0          1.4         0.3     setosa
## 47           5.1         3.8          1.6         0.2     setosa
## 48           4.6         3.2          1.4         0.2     setosa
## 49           5.3         3.7          1.5         0.2     setosa
## 50           5.0         3.3          1.4         0.2     setosa
## 51           7.0         3.2          4.7         1.4 versicolor
## 52           6.4         3.2          4.5         1.5 versicolor
## 53           6.9         3.1          4.9         1.5 versicolor
## 54           5.5         2.3          4.0         1.3 versicolor
## 55           6.5         2.8          4.6         1.5 versicolor
## 56           5.7         2.8          4.5         1.3 versicolor
## 57           6.3         3.3          4.7         1.6 versicolor
## 58           4.9         2.4          3.3         1.0 versicolor
## 59           6.6         2.9          4.6         1.3 versicolor
## 60           5.2         2.7          3.9         1.4 versicolor
## 61           5.0         2.0          3.5         1.0 versicolor
## 62           5.9         3.0          4.2         1.5 versicolor
## 63           6.0         2.2          4.0         1.0 versicolor
## 64           6.1         2.9          4.7         1.4 versicolor
## 65           5.6         2.9          3.6         1.3 versicolor
## 66           6.7         3.1          4.4         1.4 versicolor
## 67           5.6         3.0          4.5         1.5 versicolor
## 68           5.8         2.7          4.1         1.0 versicolor
## 69           6.2         2.2          4.5         1.5 versicolor
## 70           5.6         2.5          3.9         1.1 versicolor
## 71           5.9         3.2          4.8         1.8 versicolor
## 72           6.1         2.8          4.0         1.3 versicolor
## 73           6.3         2.5          4.9         1.5 versicolor
## 74           6.1         2.8          4.7         1.2 versicolor
## 75           6.4         2.9          4.3         1.3 versicolor
## 76           6.6         3.0          4.4         1.4 versicolor
## 77           6.8         2.8          4.8         1.4 versicolor
## 78           6.7         3.0          5.0         1.7 versicolor
## 79           6.0         2.9          4.5         1.5 versicolor
## 80           5.7         2.6          3.5         1.0 versicolor
## 81           5.5         2.4          3.8         1.1 versicolor
## 82           5.5         2.4          3.7         1.0 versicolor
## 83           5.8         2.7          3.9         1.2 versicolor
## 84           6.0         2.7          5.1         1.6 versicolor
## 85           5.4         3.0          4.5         1.5 versicolor
## 86           6.0         3.4          4.5         1.6 versicolor
## 87           6.7         3.1          4.7         1.5 versicolor
## 88           6.3         2.3          4.4         1.3 versicolor
## 89           5.6         3.0          4.1         1.3 versicolor
## 90           5.5         2.5          4.0         1.3 versicolor
## 91           5.5         2.6          4.4         1.2 versicolor
## 92           6.1         3.0          4.6         1.4 versicolor
## 93           5.8         2.6          4.0         1.2 versicolor
## 94           5.0         2.3          3.3         1.0 versicolor
## 95           5.6         2.7          4.2         1.3 versicolor
## 96           5.7         3.0          4.2         1.2 versicolor
## 97           5.7         2.9          4.2         1.3 versicolor
## 98           6.2         2.9          4.3         1.3 versicolor
## 99           5.1         2.5          3.0         1.1 versicolor
## 100          5.7         2.8          4.1         1.3 versicolor
## 101          6.3         3.3          6.0         2.5  virginica
## 102          5.8         2.7          5.1         1.9  virginica
## 103          7.1         3.0          5.9         2.1  virginica
## 104          6.3         2.9          5.6         1.8  virginica
## 105          6.5         3.0          5.8         2.2  virginica
## 106          7.6         3.0          6.6         2.1  virginica
## 107          4.9         2.5          4.5         1.7  virginica
## 108          7.3         2.9          6.3         1.8  virginica
## 109          6.7         2.5          5.8         1.8  virginica
## 110          7.2         3.6          6.1         2.5  virginica
## 111          6.5         3.2          5.1         2.0  virginica
## 112          6.4         2.7          5.3         1.9  virginica
## 113          6.8         3.0          5.5         2.1  virginica
## 114          5.7         2.5          5.0         2.0  virginica
## 115          5.8         2.8          5.1         2.4  virginica
## 116          6.4         3.2          5.3         2.3  virginica
## 117          6.5         3.0          5.5         1.8  virginica
## 118          7.7         3.8          6.7         2.2  virginica
## 119          7.7         2.6          6.9         2.3  virginica
## 120          6.0         2.2          5.0         1.5  virginica
## 121          6.9         3.2          5.7         2.3  virginica
## 122          5.6         2.8          4.9         2.0  virginica
## 123          7.7         2.8          6.7         2.0  virginica
## 124          6.3         2.7          4.9         1.8  virginica
## 125          6.7         3.3          5.7         2.1  virginica
## 126          7.2         3.2          6.0         1.8  virginica
## 127          6.2         2.8          4.8         1.8  virginica
## 128          6.1         3.0          4.9         1.8  virginica
## 129          6.4         2.8          5.6         2.1  virginica
## 130          7.2         3.0          5.8         1.6  virginica
## 131          7.4         2.8          6.1         1.9  virginica
## 132          7.9         3.8          6.4         2.0  virginica
## 133          6.4         2.8          5.6         2.2  virginica
## 134          6.3         2.8          5.1         1.5  virginica
## 135          6.1         2.6          5.6         1.4  virginica
## 136          7.7         3.0          6.1         2.3  virginica
## 137          6.3         3.4          5.6         2.4  virginica
## 138          6.4         3.1          5.5         1.8  virginica
## 139          6.0         3.0          4.8         1.8  virginica
## 140          6.9         3.1          5.4         2.1  virginica
## 141          6.7         3.1          5.6         2.4  virginica
## 142          6.9         3.1          5.1         2.3  virginica
## 143          5.8         2.7          5.1         1.9  virginica
## 144          6.8         3.2          5.9         2.3  virginica
## 145          6.7         3.3          5.7         2.5  virginica
## 146          6.7         3.0          5.2         2.3  virginica
## 147          6.3         2.5          5.0         1.9  virginica
## 148          6.5         3.0          5.2         2.0  virginica
## 149          6.2         3.4          5.4         2.3  virginica
## 150          5.9         3.0          5.1         1.8  virginica

Pipes!!

How do we typically order data and functions, thinking of data as nouns and functions as verbs…

x <- c(0.109, 0.359, 0.63, 0.996, 0.515, 0.142, 0.017, 0.829, 0.907)
log(x) # this is verb, noun
## [1] -2.216407397 -1.024432890 -0.462035460 -0.004008021 -0.663588378
## [6] -1.951928221 -4.074541935 -0.187535124 -0.097612829

The forward pipe aligns code with the order in which we think and speak

x %>% log() # this is noun, verb
## [1] -2.216407397 -1.024432890 -0.462035460 -0.004008021 -0.663588378
## [6] -1.951928221 -4.074541935 -0.187535124 -0.097612829

An annoying but useful example:

round(exp(mean(log(x))), 1) # "nested"
## [1] 0.3

Closing parenthesis woes…

How would you say this in plain English? * “Round to 1 digit the exponentiated mean of the logarithm of x” <- AWFUL

With pipes:

x %>% log() %>% mean() %>% exp() %>% round(1)
## [1] 0.3

Take x, log-transform it, average it, exponentiate it, and round it to 1 digit (cheeky)

Note: () not necessary but make the code easier to read by highlighting functions

Yeah, it’s worth it to just spend 15 minutes dinking around with this kind of thing until you get it…

First position rule and the placeholder “.”

Most functions take multiple arguments…

log(x, base = 2)
## [1] -3.197599960 -1.477944251 -0.666576266 -0.005782353 -0.957355663
## [6] -2.816037165 -5.878321443 -0.270555993 -0.140825544
x %>% log(base = 2) %>% mean(na.rm = TRUE) %>% exp() %>% round(1)
## [1] 0.2
x %>% log(., base = 2)
## [1] -3.197599960 -1.477944251 -0.666576266 -0.005782353 -0.957355663
## [6] -2.816037165 -5.878321443 -0.270555993 -0.140825544
x %>% log(., base = 2) %>% mean(., na.rm = TRUE) %>% exp(.) %>% round(., 1) # Annoying, dot is implied
## [1] 0.2

When noun isn’t used in first position of a function, the “.” is necessary

# the trad way
aggregate(Sepal.Length ~ Species, FUN = mean, data = iris) 
##      Species Sepal.Length
## 1     setosa        5.006
## 2 versicolor        5.936
## 3  virginica        6.588
# the naive but wrong pipe way
iris %>% aggregate(Sepal.Length ~ Species, FUN = mean) 
## Error in aggregate.data.frame(., Sepal.Length ~ Species, FUN = mean): 'by' must be a list
# the right pipe way
iris %>% aggregate(Sepal.Length ~ Species, FUN = mean, data = .) 
##      Species Sepal.Length
## 1     setosa        5.006
## 2 versicolor        5.936
## 3  virginica        6.588

See package magrittr for other pipes that do fancy things, but the forward pipe is your workhorse

Whew! Let’s pause for questions.

Chat a question if you have one. Chances are, someone else has the same question.

Verbs

The core of tidyverse usage, from package dplyr, verbs are what we do with our nouns (data), which are typically data frames (tibbles)

All verbs in the tidyverse:

  1. take a data frame as their first argument,
  2. identify variable names in subsequent arguments, without quotes,
  3. and create a new data frame.
  • Chapter 5 in R4DS
  • Using nycflights13::flights. (Sorry not biology). This data frame contains all 336,776 flights that departed from New York City in 2013. Each row contains data from a single flight. The data comes from the US Bureau of Transportation Statistics, and is documented in ?flights. Let’s have a look:
flights
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
# The tibble view is handy, but still we cannot see all the variables

Verb: glimpse()

It’s indispensable, as you’ll see when we get to a complicated wrangle

flights %>% glimpse()
## Rows: 336,776
## Columns: 19
## $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
## $ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
## $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
## $ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
## $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
## $ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
## $ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
## $ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
## $ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
## $ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…

Verb: filter()

  • Acts on rows

Find United Airlines flights

filter(flights, carrier == "UA")
## # A tibble: 58,665 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      554            558        -4      740            728
##  4  2013     1     1      558            600        -2      924            917
##  5  2013     1     1      558            600        -2      923            937
##  6  2013     1     1      559            600        -1      854            902
##  7  2013     1     1      607            607         0      858            915
##  8  2013     1     1      611            600        11      945            931
##  9  2013     1     1      623            627        -4      933            932
## 10  2013     1     1      628            630        -2     1016            947
## # … with 58,655 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
# with forward pipe
flights %>% filter(., carrier == "UA")
## # A tibble: 58,665 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      554            558        -4      740            728
##  4  2013     1     1      558            600        -2      924            917
##  5  2013     1     1      558            600        -2      923            937
##  6  2013     1     1      559            600        -1      854            902
##  7  2013     1     1      607            607         0      858            915
##  8  2013     1     1      611            600        11      945            931
##  9  2013     1     1      623            627        -4      933            932
## 10  2013     1     1      628            630        -2     1016            947
## # … with 58,655 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
flights %>% filter(carrier == "UA") # dot placeholder is implied
## # A tibble: 58,665 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      554            558        -4      740            728
##  4  2013     1     1      558            600        -2      924            917
##  5  2013     1     1      558            600        -2      923            937
##  6  2013     1     1      559            600        -1      854            902
##  7  2013     1     1      607            607         0      858            915
##  8  2013     1     1      611            600        11      945            931
##  9  2013     1     1      623            627        -4      933            932
## 10  2013     1     1      628            630        -2     1016            947
## # … with 58,655 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Find United Airlines and American Airlines flights

flights %>% filter(carrier == "UA" & carrier == "AA")
## # A tibble: 0 x 19
## # … with 19 variables: year <int>, month <int>, day <int>, dep_time <int>,
## #   sched_dep_time <int>, dep_delay <dbl>, arr_time <int>,
## #   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

Wrong, trying to combine levels within one variable, use OR

flights %>% filter(carrier == "UA" | carrier == "AA")
## # A tibble: 91,394 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      554            558        -4      740            728
##  5  2013     1     1      558            600        -2      753            745
##  6  2013     1     1      558            600        -2      924            917
##  7  2013     1     1      558            600        -2      923            937
##  8  2013     1     1      559            600        -1      941            910
##  9  2013     1     1      559            600        -1      854            902
## 10  2013     1     1      606            610        -4      858            910
## # … with 91,384 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Better: use value matching function %in%

flights %>% filter(carrier %in% c("UA", "AA"))
## # A tibble: 91,394 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      554            558        -4      740            728
##  5  2013     1     1      558            600        -2      753            745
##  6  2013     1     1      558            600        -2      924            917
##  7  2013     1     1      558            600        -2      923            937
##  8  2013     1     1      559            600        -1      941            910
##  9  2013     1     1      559            600        -1      854            902
## 10  2013     1     1      606            610        -4      858            910
## # … with 91,384 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Find United Airlines or American Airlines flights longer than 1200 miles in July

Use AND to combine across variables

flights %>% filter(carrier %in% c("UA", "AA") & distance > 1200 & month == 7)
## # A tibble: 4,465 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     7     1       44           2150       174      300            100
##  2  2013     7     1      538            540        -2      800            810
##  3  2013     7     1      547            548        -1      903            851
##  4  2013     7     1      548            545         3      815            819
##  5  2013     7     1      557            600        -3      828            907
##  6  2013     7     1      600            600         0      836            830
##  7  2013     7     1      606            610        -4      847            924
##  8  2013     7     1      625            630        -5      902            905
##  9  2013     7     1      634            630         4      919            930
## 10  2013     7     1      635            640        -5     1034           1040
## # … with 4,455 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
flights %>% filter(carrier %in% c("UA", "AA"), distance > 1200, month == 7) # same-same & ,

On your own: how many flights on February 10 departed on time or earlier?

Paste this into your script to get started:

packages_needed = c("tidyverse", "nycflights13")
packages_installed = packages_needed %in% rownames(installed.packages())

if (any(! packages_installed))
    install.packages(packages_needed[! packages_installed])
for (i in 1:length(packages_needed)) {
    library(packages_needed[i], character.only = T)
}

flights %>% glimpse()
## Rows: 336,776
## Columns: 19
## $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
## $ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
## $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
## $ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
## $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
## $ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
## $ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
## $ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
## $ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
## $ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…
flights %>% filter(month == 2, day == 10, dep_delay <= 0)
## # A tibble: 429 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     2    10      455            500        -5      625            650
##  2  2013     2    10      553            600        -7      826            858
##  3  2013     2    10      556            600        -4      708            716
##  4  2013     2    10      556            600        -4      923            910
##  5  2013     2    10      557            600        -3      854            911
##  6  2013     2    10      557            559        -2      915            930
##  7  2013     2    10      558            600        -2      810            815
##  8  2013     2    10      600            600         0      849            905
##  9  2013     2    10      603            610        -7      850            915
## 10  2013     2    10      607            610        -3      811            817
## # … with 419 more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Chat your answers to Art as quickly as you can…

# answer...

Verb: select()

  • Acts on columns
flights %>% select(distance, air_time)
## # A tibble: 336,776 x 2
##    distance air_time
##       <dbl>    <dbl>
##  1     1400      227
##  2     1416      227
##  3     1089      160
##  4     1576      183
##  5      762      116
##  6      719      150
##  7     1065      158
##  8      229       53
##  9      944      140
## 10      733      138
## # … with 336,766 more rows
flights %>% select(starts_with("sched"), everything())
## # A tibble: 336,776 x 19
##    sched_dep_time sched_arr_time  year month   day dep_time dep_delay arr_time
##             <int>          <int> <int> <int> <int>    <int>     <dbl>    <int>
##  1            515            819  2013     1     1      517         2      830
##  2            529            830  2013     1     1      533         4      850
##  3            540            850  2013     1     1      542         2      923
##  4            545           1022  2013     1     1      544        -1     1004
##  5            600            837  2013     1     1      554        -6      812
##  6            558            728  2013     1     1      554        -4      740
##  7            600            854  2013     1     1      555        -5      913
##  8            600            723  2013     1     1      557        -3      709
##  9            600            846  2013     1     1      557        -3      838
## 10            600            745  2013     1     1      558        -2      753
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Verb: mutate()

  • Acts on columns
  • Creates new variables or replaces transformed, existing variables
flights %>% 
    select(distance, air_time) %>% 
    glimpse() %>% # in-line view with `glimpse()`
    mutate(speed_mph = distance / (air_time / 60))
## Rows: 336,776
## Columns: 2
## $ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, 1028, …
## $ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 158, 34…

## # A tibble: 336,776 x 3
##    distance air_time speed_mph
##       <dbl>    <dbl>     <dbl>
##  1     1400      227      370.
##  2     1416      227      374.
##  3     1089      160      408.
##  4     1576      183      517.
##  5      762      116      394.
##  6      719      150      288.
##  7     1065      158      404.
##  8      229       53      259.
##  9      944      140      405.
## 10      733      138      319.
## # … with 336,766 more rows

Verbs: group_by() and summarize()

  • Frequently used together
  • Roughly like tapply() or aggregate() but much more flexible
  • Act on rows grouped by variables, usually character or factor levels
  • Prepares data frame for further transformation, produces results

Do different carriers fly at different speeds?

flights %>% 
    mutate(speed_mph = distance / (air_time / 60)) %>% 
    group_by(carrier) %>% 
    summarize(carrier_avg = mean(speed_mph))
## # A tibble: 16 x 2
##    carrier carrier_avg
##    <chr>         <dbl>
##  1 9E              NA 
##  2 AA              NA 
##  3 AS              NA 
##  4 B6              NA 
##  5 DL              NA 
##  6 EV              NA 
##  7 F9              NA 
##  8 FL              NA 
##  9 HA             480.
## 10 MQ              NA 
## 11 OO              NA 
## 12 UA              NA 
## 13 US              NA 
## 14 VX              NA 
## 15 WN              NA 
## 16 YV              NA

What??? Oh yeah, NAs are contagious. For now, let’s just nuke them:

flights %>% 
    glimpse() %>% 
    select(carrier, distance, air_time) %>% 
    glimpse() %>% 
    mutate(speed_mph = distance / (air_time / 60)) %>% 
    filter(!is.na(speed_mph)) %>% 
    glimpse() %>% # annoying, but makes the point
    group_by(carrier) %>% 
    summarize(carrier_avg = mean(speed_mph)) %>% 
    ungroup() %>% # you want this habit
    arrange(-carrier_avg) # yes another verb...
## Rows: 336,776
## Columns: 19
## $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
## $ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
## $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
## $ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
## $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
## $ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
## $ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
## $ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
## $ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
## $ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…
## Rows: 336,776
## Columns: 3
## $ carrier  <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "AA", "…
## $ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, 1028, …
## $ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 158, 34…
## Rows: 327,346
## Columns: 4
## $ carrier   <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "AA", …
## $ distance  <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, 1028,…
## $ air_time  <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 158, 3…
## $ speed_mph <dbl> 370.0441, 374.2731, 408.3750, 516.7213, 394.1379, 287.6000, …

## # A tibble: 16 x 2
##    carrier carrier_avg
##    <chr>         <dbl>
##  1 HA             480.
##  2 VX             446.
##  3 AS             444.
##  4 F9             425.
##  5 UA             421.
##  6 DL             418.
##  7 AA             417.
##  8 WN             401.
##  9 B6             400.
## 10 FL             394.
## 11 MQ             368.
## 12 OO             366.
## 13 EV             363.
## 14 9E             345.
## 15 US             342.
## 16 YV             332.
# how would you do this in base R? Obnoxious example:
flights_2 <- flights[, carrier]
## Error in `[.tbl_df`(flights, , carrier): object 'carrier' not found
# Ugh! Um...
flights_2 <- flights[, flights$carrier]
## Error: Can't subset columns that don't exist.
## x Columns `UA`, `UA`, `AA`, `B6`, `DL`, etc. don't exist.
# stu##& R and %&^#ing #@%&&!
head(as.data.frame(flights))
##   year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## 1 2013     1   1      517            515         2      830            819
## 2 2013     1   1      533            529         4      850            830
## 3 2013     1   1      542            540         2      923            850
## 4 2013     1   1      544            545        -1     1004           1022
## 5 2013     1   1      554            600        -6      812            837
## 6 2013     1   1      554            558        -4      740            728
##   arr_delay carrier flight tailnum origin dest air_time distance hour minute
## 1        11      UA   1545  N14228    EWR  IAH      227     1400    5     15
## 2        20      UA   1714  N24211    LGA  IAH      227     1416    5     29
## 3        33      AA   1141  N619AA    JFK  MIA      160     1089    5     40
## 4       -18      B6    725  N804JB    JFK  BQN      183     1576    5     45
## 5       -25      DL    461  N668DN    LGA  ATL      116      762    6      0
## 6        12      UA   1696  N39463    EWR  ORD      150      719    5     58
##             time_hour
## 1 2013-01-01 05:00:00
## 2 2013-01-01 05:00:00
## 3 2013-01-01 05:00:00
## 4 2013-01-01 05:00:00
## 5 2013-01-01 06:00:00
## 6 2013-01-01 05:00:00
flights_2 <- flights[which(complete.cases(flights)), c(10, 16, 15)]
flights_2$speed_mph <- flights_2$distance / (flights_2$air_time / 60)
flights_speed <- aggregate(speed_mph ~ carrier, FUN = mean, data = flights_2) # what???
flights_speed[order(desc(flights_speed$speed_mph)), ]
##    carrier speed_mph
## 9       HA  480.3577
## 14      VX  446.1749
## 3       AS  443.6789
## 7       F9  425.1721
## 12      UA  420.8838
## 5       DL  418.4628
## 2       AA  417.4727
## 15      WN  400.5320
## 4       B6  399.9715
## 8       FL  394.3581
## 10      MQ  368.4028
## 11      OO  366.3201
## 6       EV  362.9436
## 1       9E  345.4304
## 13      US  341.9397
## 16      YV  331.9700

Of course it’s doable, but it’s harder to see what happened, and the intermediate objects always cause problems ### Ok, what do the carrier codes stand for anyway? From ?flights we see that airlines is a data frame with the carrier names:

airlines
## # A tibble: 16 x 2
##    carrier name                       
##    <chr>   <chr>                      
##  1 9E      Endeavor Air Inc.          
##  2 AA      American Airlines Inc.     
##  3 AS      Alaska Airlines Inc.       
##  4 B6      JetBlue Airways            
##  5 DL      Delta Air Lines Inc.       
##  6 EV      ExpressJet Airlines Inc.   
##  7 F9      Frontier Airlines Inc.     
##  8 FL      AirTran Airways Corporation
##  9 HA      Hawaiian Airlines Inc.     
## 10 MQ      Envoy Air                  
## 11 OO      SkyWest Airlines Inc.      
## 12 UA      United Air Lines Inc.      
## 13 US      US Airways Inc.            
## 14 VX      Virgin America             
## 15 WN      Southwest Airlines Co.     
## 16 YV      Mesa Airlines Inc.

Combine verbs for EDA (exploratory data analysis)

To motivate this part, let’s make a guess and a follow-up explanation:

  • We assume that departure delays are worse before winter holidays than in late January
  • But do faster flying carriers make up lost time better?
    • Let’s limit this to departure delays under an hour. More than that and there is less to gain from hurrying.
flights %>% 
    filter(month %in% c(1, 12) & day == 20 & !is.na(dep_delay) & dep_delay <= 60) %>% 
    mutate(date = paste(year, month, day, sep = "-")) %>% 
    ggplot() + # You can wrangle straight into ggplot!!! Where is the placeholder? 
    geom_boxplot(aes(x = carrier, y = dep_delay)) +
    facet_wrap(vars(date))

Median departure delays are mostly below zero on 1-20 and above zero on 12-20

flights %>%
    mutate( # sometimes breaking it out is more readable
        air_delay = arr_delay - dep_delay,
        speed_mph = distance / (air_time / 60),
        date      = paste(year, month, day, sep = "-")
    ) %>%
    filter(month %in% c(1, 12),
           day == 20, !is.na(air_delay),
           dep_delay <= 60) %>%
    filter(!(carrier %in% c("AS", "F9", "FL", "HA", "VX", "YV"))) %>%
    ggplot(aes(x = speed_mph, y = air_delay, group = date)) +
    geom_smooth(aes(color = date),
                size   = 0,
                method = "lm",
                se     = TRUE) +
    geom_point(aes(color = date), size = 0.4) +
    geom_smooth(aes(color = date), method = "lm", se = FALSE) +
    facet_wrap(vars(carrier))
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

Are pilots pushing it before Christmas?

Wrapping up

R for Data Science. Is this data mining or using data from planned experiments? Thoughts and discussion.

Questions and comments from you?

  • If you are on GitHub, please comment, correct, or shred this script. Link to repo.
  • Or download the R script

Homework

  • You will need to load the tidyverse library (after installing the package if necessary)
  • With this script loaded in RStudio, choose File>Compile Report to render the document (if you want)

Transform a messy data frame into a tidy tibble

  • Produce hw_1 using the code below. Use pipes and tidyverse functions to transform it into a tidy tibble. Hint = ?separate
hw_1 <- data.frame(
    plot_replicate = paste(rep(LETTERS[1:5], each = 4), rep(1:4, 5), sep = "_"), 
    date = rep("2016-06-12", 20),
    measurement_1 = rbinom(20, 50, 0.2)
)
hw_1 %>% 
    separate(plot_replicate, into = c("plot", "replicate")) %>% 
    as_tibble()
## # A tibble: 20 x 4
##    plot  replicate date       measurement_1
##    <chr> <chr>     <chr>              <int>
##  1 A     1         2016-06-12            13
##  2 A     2         2016-06-12             7
##  3 A     3         2016-06-12             8
##  4 A     4         2016-06-12            10
##  5 B     1         2016-06-12            10
##  6 B     2         2016-06-12            10
##  7 B     3         2016-06-12            12
##  8 B     4         2016-06-12            13
##  9 C     1         2016-06-12            15
## 10 C     2         2016-06-12             7
## 11 C     3         2016-06-12            15
## 12 C     4         2016-06-12             8
## 13 D     1         2016-06-12            10
## 14 D     2         2016-06-12            11
## 15 D     3         2016-06-12            11
## 16 D     4         2016-06-12            12
## 17 E     1         2016-06-12             8
## 18 E     2         2016-06-12            14
## 19 E     3         2016-06-12            15
## 20 E     4         2016-06-12             9

Are aphids more numerous on taller plants?

  • Produce hw_2 using the code below.
  • Transform hw_2 into a tidy tibble using pipes and tidyverse functions. Hint = ?pivot_wider
  • Wrangle the data using tidyverse tools and pipe into ggplot to create scatterplots with linear trendlines superimposed.
  • Put height_cm on the x axis, aphids_n on the y_axis, and facet on plot
  • Hints: ?geom_point, ?geom_smooth, ?facet_wrap
hw_2 <- data.frame(
    plot = rep(LETTERS[1:5], each = 8),
    replicate = rep(1:4, each = 2),
    parameter = rep(c("aphids_n", "height_cm"), 20),
    measurement = rep(c(700, 80), 20) + c(sort(rbinom(30, 500, 0.4)), rbinom(10, 500, 0.4))
)
hw_2 %>% 
    pivot_wider(names_from = parameter, values_from = measurement) %>%  # `pivot_wider()` produces a tibble as output
    ggplot(aes(x = height_cm, y = aphids_n)) +
    geom_point() +
    geom_smooth(method = "lm") +
    facet_wrap(vars(plot), scales = "free_x") # note use of scales
## `geom_smooth()` using formula 'y ~ x'

Tidy vs. messy

  • Explain why hw_3 (see below) is messy as it is
  • Explain why hw_3 is tidy as it is

Hint: can you compare ants and aphids with ?geom_boxplot and with a scatterplot (?geom_point) using the same configuration of hw_3, or is a transformation necessary to produce both of these plots? What does this suggest about independent observations and tidy data?

  • Using geom_point() but still a boxplot format was a common misunderstanding
hw_3 <- data.frame(
    plot = rep(LETTERS[1:5], each = 8),
    replicate = rep(1:4, each = 2),
    insects = rep(c("aphids", "ants"), 20),
    count = rep(c(70, 20), 20) + rnorm(40, 0, 5) %>% round(., 0)
)
hw_3 %>% 
    as_tibble() %>% # unnecessary but follows lesson strictly
    ggplot() +
    geom_boxplot(aes(x = insects, y = count)) +
    facet_wrap(vars(plot)) # not explicit in question, but appropriate to look for plot effect

Data are suitable for analysis as-is if the goal is to compare aphid and ant counts in plots. You could argue that each row is already an independent observation if you aren’t interested in a relationship between ants and aphids.

hw_3 %>% 
    pivot_wider(names_from = insects, values_from = count) %>% 
    as_tibble() %>% # unnecessary but follows lesson strictly
    ggplot(aes(x = aphids, y = ants)) +
    geom_point() +
    geom_smooth(method = "lm") +
    facet_wrap(vars(plot), scales = "free_x") # not explicit in question, but appropriate to look for plot effect
## `geom_smooth()` using formula 'y ~ x'

Correlations require pivoting, where each row is an independent observation.

Student example

Using geom_point() but still really a boxplot format, was a common misunderstanding

hw_3 %>% 
    ggplot() +
    geom_boxplot(aes(insects, count)) + 
    facet_wrap(vars(plot)) # Even though the data is not fully tidy, we can easily make boxplots in ggplot

hw_3 %>% 
    ggplot() +
    geom_point(aes(insects, count)) + 
    facet_wrap(vars(plot)) # Same 

Bonus challenge

Use a join function from dplyr to find out how many different models of airplanes were flown out of New York by each airline in 2013. Which carrier flew the largest number of airplane models?

  • Hint: install/load nycflights13 and then look at the ?planes and ?flights data frames.
flights %>% glimpse()
## Rows: 336,776
## Columns: 19
## $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
## $ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
## $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
## $ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
## $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
## $ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
## $ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
## $ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
## $ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
## $ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…
planes %>% glimpse()
## Rows: 3,322
## Columns: 9
## $ tailnum      <chr> "N10156", "N102UW", "N103US", "N104UW", "N10575", "N105UW…
## $ year         <int> 2004, 1998, 1999, 1999, 2002, 1999, 1999, 1999, 1999, 199…
## $ type         <chr> "Fixed wing multi engine", "Fixed wing multi engine", "Fi…
## $ manufacturer <chr> "EMBRAER", "AIRBUS INDUSTRIE", "AIRBUS INDUSTRIE", "AIRBU…
## $ model        <chr> "EMB-145XR", "A320-214", "A320-214", "A320-214", "EMB-145…
## $ engines      <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ seats        <int> 55, 182, 182, 182, 55, 182, 182, 182, 182, 182, 55, 55, 5…
## $ speed        <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ engine       <chr> "Turbo-fan", "Turbo-fan", "Turbo-fan", "Turbo-fan", "Turb…

In the planes data frame, the variable tailnum is associated with the airplane model. The variable tailnum can join the flights and planes data frames. Group_by and summarize() would typically be used.
Calling distinct avoids two separate rounds of group_by() and summarize(). Using n_distinct() makes things very compact. To reach the answer quickly, calling arrange() can put the top carrier at the top of the tibble. Alternatively, a second call to summarize, which executes max() on the summary variable would work.

  • Use of other than left_join() led some astray (others can work but may get complicated)
  • Some attempted to filter the airports or years; ?flights…
  • Must join to airplanes df; tailnumbers go with individual planes, not models
flights %>% 
    left_join(planes, by = "tailnum") %>% # best practice to specify `by = ` to maintain control over join keys
    distinct(carrier, model) %>% 
    group_by(carrier) %>% 
    summarize(n = n()) %>% ungroup() %>% 
    arrange(-n)
## # A tibble: 16 x 2
##    carrier     n
##    <chr>   <int>
##  1 AA         32
##  2 DL         25
##  3 WN         22
##  4 UA         18
##  5 US         13
##  6 B6          7
##  7 FL          7
##  8 EV          6
##  9 MQ          5
## 10 AS          4
## 11 9E          3
## 12 F9          3
## 13 OO          3
## 14 VX          3
## 15 YV          3
## 16 HA          1
flights %>% 
    left_join(planes, by = "tailnum") %>% 
    group_by(carrier, model) %>% 
    summarize(n = n()) %>% ungroup() %>% # this summary is meaningless, which is why I prefer using `distinct()` or maybe `slice()`
    group_by(carrier) %>% 
    summarize(n = n()) %>%
    slice_max(n) %>% 
    left_join(airlines, by = "carrier") # pretty display
## `summarise()` has grouped output by 'carrier'. You can override using the `.groups` argument.

## # A tibble: 1 x 3
##   carrier     n name                  
##   <chr>   <int> <chr>                 
## 1 AA         32 American Airlines Inc.

Many paths exist to the answer, but they all lead to American Airlines flying the most airplane models out of NYC in 2013.

Student example

Two of you made this very compact

# Grace Erba: very compact, combines tidyverse and traditional tools
combined.plane.data <- full_join(flights, planes, by = "tailnum")
combined.plane.data %>% 
    group_by(carrier) %>%
    summarise(models = length(unique(model))) # `n_distinct()` = `length(unique())` also works here
## # A tibble: 16 x 2
##    carrier models
##    <chr>    <int>
##  1 9E           3
##  2 AA          32
##  3 AS           4
##  4 B6           7
##  5 DL          25
##  6 EV           6
##  7 F9           3
##  8 FL           7
##  9 HA           1
## 10 MQ           5
## 11 OO           3
## 12 UA          18
## 13 US          13
## 14 VX           3
## 15 WN          22
## 16 YV           3