-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathtidyverse_intro.R
507 lines (459 loc) · 18.4 KB
/
tidyverse_intro.R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
#' ---
#' title: "Using the tidyverse!"
#' author: "Beau Larkin"
#' date: "2021-04-05"
#' output:
#' github_document:
#' toc: true
#' toc_depth: 2
#' pdf_document:
#' df_print: tibble
#' fig_caption: yes
#' highlight: tango
#' toc: true
#' toc_depth: 2
#' ---
#'
#' # Description
#'
#' * For University of Montana Course BIOB-595 - Art Woods, professor
#' * 2021-04-05 - Beau Larkin (guest instructor)
#'
#' **Goals:**
#'
#' * Show some features of the tidyverse
#' * Convince you that it's at least worth your time to learn more
#' * Give you some tools for further exploration
#' * != flexing huge skills
#'
#' This session borrows heavily from [*R for Data Science*](https://r4ds.had.co.nz/index.html) (R4DS)
#' by Wickham and Grolemund, 2017, O'Reilly, and from other sources as annotated in-text.
#'
#' BTW What is this #' thing? It's useful. I'll show you at the end.
#'
#' Let's get started!
# Package and library installation
packages_needed = c("tidyverse", "knitr", "nycflights13", "magrittr")
packages_installed = packages_needed %in% rownames(installed.packages())
if (any(! packages_installed))
install.packages(packages_needed[! packages_installed])
for (i in 1:length(packages_needed)) {
library(packages_needed[i], character.only = T)
}
#' # Engagement
#' What do you know about the tidyverse? Chat three words...
#' Do you use tidyverse packages and functions? Which packages?
#'
#' ## Here's what I think:
#'
#' * It's more than a collection of functions and fads (con sarn it, these new-fangled pipes!)
#' * A paradigm for data science, and it's here to stay
#' * Why tidyverse? It allows us to use intuitive,
#' linear processes to get answers and results from data quickly
#'
#' ## A *paradigm*???
#' What do you need before creating graphics, applying models, and testing hypotheses?
#'
#' * ...
#' * The 80/20 rule...
#'
#' # Tidy data
#'
#' * Chapter 12 in R4DS
#' * *Tidy Data* in the [Journal of Statistical Software, Wickham 2014](https://www.jstatsoft.org/article/view/v059i10).
#' * You cannot pass into the tidyverse without tidy data! (whoa, that's biblical, but seriously
#' things don't work well in the tidyverse with messy data)
#'
#' "Tidy data are tidy in the same way. Each messy dataset is messy in it's own way." - Wickham
#'
#' ## Examples: messy data
#+ ex_1
ex_1 <- data.frame(
plot_replicate = paste(rep(LETTERS[1:5], each = 4), rep(1:4, 5), sep = "_"),
date = rep("2016-06-12", 20),
measurement_1 = rbinom(20, 50, 0.2)
)
ex_1[20, 2] <- "2016-06-12, measured after lunch"
kable(ex_1)
# View(ex_1)
# What is messy? Comment in chat but wait to submit...
#'
#' How would we produce a mean of measurement_1 in plots???
#+ aggregate_ex_1
aggregate(measurement_1 ~ plot_replicate, FUN = mean, data = ex_1)
#' Guess it's "back to excel"
#+ ex_2
ex_2 <- data.frame(
plot = rep(LETTERS[1:5], each = 8),
replicate = rep(1:4, each = 2),
parameter = rep(c("aphids_n", "height_cm"), 10),
value = rep(c(700, 80), 10) + rbinom(40, 500, 0.4)
)
kable(ex_2)
# View(ex_2)
# What is messy? Comment in chat but wait to submit...
#' ## Tidy data rules:
#' There are three interrelated rules which make a dataset tidy:
#'
#' 1. Each variable must have its own column.
#' 1. Each observation must have its own row.
#' 1. Each value must have its own cell.
#'
#' So, remember the rules or just develop good habits that lead to intuition?
#'
#' * Tidy or not? Not always binary...unfortunately. But like art, you know messy when you see it.
#' * Quickly recognize problems in data...
#' * Think ahead when planning to collect data...
#'
#' ### At this point, *tibbles* often come up (see Chapter 10 in R4DS)
#'
#' * Unavoidable in tidyverse, and mostly this is a good thing
#' * Occasional incompatibility issues, so be aware
#+ iris_tibble
# data frame
iris
head(iris)
# tibble
(iris_t <- as_tibble(iris))
# sometimes you must convert
as.data.frame(iris_t)
#' # Pipes!!
#'
#' * What do you know already? Start with any vector `x`, and use a pipe to log transform `x`
#' * Chapter 18 in R4DS
#' * [Pipes in R Tutorial For Beginners](https://www.datacamp.com/community/tutorials/pipe-r-tutorial)
#'
#' How do we typically order data and functions, thinking of data as nouns and functions as verbs...
#+ pipes_intro
x <- c(0.109, 0.359, 0.63, 0.996, 0.515, 0.142, 0.017, 0.829, 0.907)
log(x) # this is verb, noun
#' The forward pipe aligns code with the order in which we think and speak
x %>% log() # this is noun, verb
#' An annoying but useful example:
round(exp(mean(log(x))), 1) # "nested"
#' Closing parenthesis woes...
#'
#' How would you say this in plain English?
#' * "Round to 1 digit the exponentiated mean of the logarithm of x" <- AWFUL
#'
#' ## With pipes:
x %>% log() %>% mean() %>% exp() %>% round(1)
#' Take x, log-transform it, average it, exponentiate it, and round it to 1 digit (**cheeky**)
#'
#' Note: () not necessary but make the code easier to read by highlighting functions
#'
#' Yeah, it's worth it to just spend 15 minutes dinking around with this kind of thing until you get it...
#'
#' ## First position rule and the placeholder "."
#' Most functions take multiple arguments...
#+ first_position_rule
log(x, base = 2)
x %>% log(base = 2) %>% mean(na.rm = TRUE) %>% exp() %>% round(1)
x %>% log(., base = 2)
x %>% log(., base = 2) %>% mean(., na.rm = TRUE) %>% exp(.) %>% round(., 1) # Annoying, dot is implied
#' When noun isn't used in first position of a function, the "." is necessary
#+ aggregate_example, error=TRUE
# the trad way
aggregate(Sepal.Length ~ Species, FUN = mean, data = iris)
# the naive but wrong pipe way
iris %>% aggregate(Sepal.Length ~ Species, FUN = mean)
# the right pipe way
iris %>% aggregate(Sepal.Length ~ Species, FUN = mean, data = .)
#' See package `magrittr` for other pipes that do fancy things, but the forward pipe is your workhorse
#'
#' Whew! Let's pause for questions.
#'
#' Chat a question if you have one. Chances are, someone else has the same question.
#'
#' # Verbs
#' The core of tidyverse usage, from package `dplyr`, verbs are what we do with our nouns (data),
#' which are typically data frames (tibbles)
#'
#' ## All verbs in the tidyverse:
#'
#' 1. take a data frame as their first argument,
#' 1. identify variable names in subsequent arguments, without quotes,
#' 1. and create a new data frame.
#'
#' * Chapter 5 in R4DS
#' * Using `nycflights13::flights`. (Sorry not biology). This data frame contains all
#' 336,776 flights that
#' departed from New York City in 2013. Each row contains data from a single flight.
#' The data comes from the US Bureau of Transportation
#' Statistics, and is documented in `?flights`. Let's have a look:
#+ flights_data
flights
# The tibble view is handy, but still we cannot see all the variables
#' ## Verb: `glimpse()`
#' It's indispensable, as you'll see when we get to a complicated wrangle
#+ verb_glimpse
flights %>% glimpse()
#' ## Verb: `filter()`
#'
#' * Acts on rows
#'
#' ### Find United Airlines flights
#+ filter_intro
filter(flights, carrier == "UA")
# with forward pipe
flights %>% filter(., carrier == "UA")
flights %>% filter(carrier == "UA") # dot placeholder is implied
#' ### Find United Airlines and American Airlines flights
flights %>% filter(carrier == "UA" & carrier == "AA")
#' Wrong, trying to combine levels within one variable, use OR
flights %>% filter(carrier == "UA" | carrier == "AA")
#' Better: use value matching function `%in%`
flights %>% filter(carrier %in% c("UA", "AA"))
#' ### Find United Airlines or American Airlines flights longer than 1200 miles in July
#' Use AND to combine across variables
#+ filter_1
flights %>% filter(carrier %in% c("UA", "AA") & distance > 1200 & month == 7)
#+ filter_2, results=FALSE
flights %>% filter(carrier %in% c("UA", "AA"), distance > 1200, month == 7) # same-same & ,
#' ### On your own: how many flights on February 10 departed on time or earlier?
#' Paste this into your script to get started:
packages_needed = c("tidyverse", "nycflights13")
packages_installed = packages_needed %in% rownames(installed.packages())
if (any(! packages_installed))
install.packages(packages_needed[! packages_installed])
for (i in 1:length(packages_needed)) {
library(packages_needed[i], character.only = T)
}
flights %>% glimpse()
flights %>% filter(month == 2, day == 10, dep_delay <= 0)
#' Chat your answers to Art as quickly as you can...
# answer...
#' ## Verb: `select()`
#'
#' * Acts on columns
#'
flights %>% select(distance, air_time)
flights %>% select(starts_with("sched"), everything())
#' ## Verb: `mutate()`
#'
#' * Acts on columns
#' * Creates new variables or replaces transformed, existing variables
#'
flights %>%
select(distance, air_time) %>%
glimpse() %>% # in-line view with `glimpse()`
mutate(speed_mph = distance / (air_time / 60))
#' ## Verbs: `group_by()` and `summarize()`
#'
#' * Frequently used together
#' * Roughly like `tapply()` or `aggregate()` but much more flexible
#' * Act on rows grouped by variables, usually character or factor levels
#' * Prepares data frame for further transformation, produces results
#'
#' ### Do different carriers fly at different speeds?
#+ group_by_summarize_intro
flights %>%
mutate(speed_mph = distance / (air_time / 60)) %>%
group_by(carrier) %>%
summarize(carrier_avg = mean(speed_mph))
#' What??? Oh yeah, NAs are contagious. For now, let's just nuke them:
flights %>%
glimpse() %>%
select(carrier, distance, air_time) %>%
glimpse() %>%
mutate(speed_mph = distance / (air_time / 60)) %>%
filter(!is.na(speed_mph)) %>%
glimpse() %>% # annoying, but makes the point
group_by(carrier) %>%
summarize(carrier_avg = mean(speed_mph)) %>%
ungroup() %>% # you want this habit
arrange(-carrier_avg) # yes another verb...
# how would you do this in base R? Obnoxious example:
#+ obnoxious, error=TRUE
flights_2 <- flights[, carrier]
# Ugh! Um...
flights_2 <- flights[, flights$carrier]
# stu##& R and %&^#ing #@%&&!
head(as.data.frame(flights))
flights_2 <- flights[which(complete.cases(flights)), c(10, 16, 15)]
flights_2$speed_mph <- flights_2$distance / (flights_2$air_time / 60)
flights_speed <- aggregate(speed_mph ~ carrier, FUN = mean, data = flights_2) # what???
flights_speed[order(desc(flights_speed$speed_mph)), ]
#' Of course it's doable, but it's harder to see what happened, and the intermediate objects always cause problems
#' ### Ok, what do the carrier codes stand for anyway?
#' From `?flights` we see that `airlines` is a data frame with the carrier names:
airlines
#' # Combine verbs for EDA (exploratory data analysis)
#' To motivate this part, let's make a guess and a follow-up explanation:
#'
#' * We assume that departure delays are worse before winter holidays than in late January
#' * But do faster flying carriers make up lost time better?
#' * Let's limit this to departure delays under an hour. More than that and
#' there is less to gain from hurrying.
#+ flights_EDA_boxplot
flights %>%
filter(month %in% c(1, 12) & day == 20 & !is.na(dep_delay) & dep_delay <= 60) %>%
mutate(date = paste(year, month, day, sep = "-")) %>%
ggplot() + # You can wrangle straight into ggplot!!! Where is the placeholder?
geom_boxplot(aes(x = carrier, y = dep_delay)) +
facet_wrap(vars(date))
#' Median departure delays are mostly below zero on 1-20 and above zero on 12-20
#+ flights_EDA_scatterplot_facets
flights %>%
mutate( # sometimes breaking it out is more readable
air_delay = arr_delay - dep_delay,
speed_mph = distance / (air_time / 60),
date = paste(year, month, day, sep = "-")
) %>%
filter(month %in% c(1, 12),
day == 20, !is.na(air_delay),
dep_delay <= 60) %>%
filter(!(carrier %in% c("AS", "F9", "FL", "HA", "VX", "YV"))) %>%
ggplot(aes(x = speed_mph, y = air_delay, group = date)) +
geom_smooth(aes(color = date),
size = 0,
method = "lm",
se = TRUE) +
geom_point(aes(color = date), size = 0.4) +
geom_smooth(aes(color = date), method = "lm", se = FALSE) +
facet_wrap(vars(carrier))
#' Are pilots pushing it before Christmas?
#'
#' # Wrapping up
#' *R for Data Science*. Is this data mining or using data from planned experiments?
#' Thoughts and discussion.
#'
#' Questions and comments from you?
#'
#' * If you are on GitHub, please comment, correct, or shred this script. [Link](https://github.com/bglarkin/tidyverse_intro) to repo.
#' * Or [download](https://mpgcloud.egnyte.com/dl/cguemdnod0) the R script
#'
#'
#' # Homework
#'
#' * You will need to load the `tidyverse` library (after installing the package if necessary)
#' * With this script loaded in RStudio, choose File>Compile Report to render the document (if you want)
#'
#' ## Transform a messy data frame into a tidy tibble
#'
#' * Produce hw_1 using the code below. Use pipes and tidyverse functions to transform
#' it into a tidy tibble. Hint = `?separate`
#+ hw_1
hw_1 <- data.frame(
plot_replicate = paste(rep(LETTERS[1:5], each = 4), rep(1:4, 5), sep = "_"),
date = rep("2016-06-12", 20),
measurement_1 = rbinom(20, 50, 0.2)
)
#+ hw_1_solved
hw_1 %>%
separate(plot_replicate, into = c("plot", "replicate")) %>%
as_tibble()
#' ## Are aphids more numerous on taller plants?
#'
#' * Produce hw_2 using the code below.
#' * Transform hw_2 into a tidy tibble using pipes and tidyverse functions. Hint = `?pivot_wider`
#' * Wrangle the data using tidyverse tools and pipe into ggplot to create scatterplots with
#' linear trendlines superimposed.
#' * Put height_cm on the x axis, aphids_n on the y_axis, and facet on plot
#' * Hints: `?geom_point`, `?geom_smooth`, `?facet_wrap`
#'
#+ hw_2
hw_2 <- data.frame(
plot = rep(LETTERS[1:5], each = 8),
replicate = rep(1:4, each = 2),
parameter = rep(c("aphids_n", "height_cm"), 20),
measurement = rep(c(700, 80), 20) + c(sort(rbinom(30, 500, 0.4)), rbinom(10, 500, 0.4))
)
#+ hw_2_solved
hw_2 %>%
pivot_wider(names_from = parameter, values_from = measurement) %>% # `pivot_wider()` produces a tibble as output
ggplot(aes(x = height_cm, y = aphids_n)) +
geom_point() +
geom_smooth(method = "lm") +
facet_wrap(vars(plot), scales = "free_x") # note use of scales
#' ## Tidy vs. messy
#'
#' * Explain why hw_3 (see below) is messy as it is
#' * Explain why hw_3 is tidy as it is
#'
#' Hint: can you compare ants and aphids with `?geom_boxplot` and with a scatterplot (`?geom_point`)
#' using the same configuration of hw_3, or is a transformation necessary to produce both of
#' these plots? What does this suggest about independent observations and tidy data?
#'
#' * Using `geom_point()` but still a boxplot format was a common misunderstanding
#+ hw_3
hw_3 <- data.frame(
plot = rep(LETTERS[1:5], each = 8),
replicate = rep(1:4, each = 2),
insects = rep(c("aphids", "ants"), 20),
count = rep(c(70, 20), 20) + rnorm(40, 0, 5) %>% round(., 0)
)
#+ hw_3_solved_1
hw_3 %>%
as_tibble() %>% # unnecessary but follows lesson strictly
ggplot() +
geom_boxplot(aes(x = insects, y = count)) +
facet_wrap(vars(plot)) # not explicit in question, but appropriate to look for plot effect
#' Data are suitable for analysis as-is if the goal is to compare aphid and ant counts in plots.
#' You could argue that each row is already an independent observation if you aren't interested
#' in a relationship between ants and aphids.
#+ hw_3_solved_2
hw_3 %>%
pivot_wider(names_from = insects, values_from = count) %>%
as_tibble() %>% # unnecessary but follows lesson strictly
ggplot(aes(x = aphids, y = ants)) +
geom_point() +
geom_smooth(method = "lm") +
facet_wrap(vars(plot), scales = "free_x") # not explicit in question, but appropriate to look for plot effect
#' Correlations require pivoting, where each row is an independent observation.
#'
#' ### Student example
#' Using `geom_point()` but still really a boxplot format, was a common misunderstanding
hw_3 %>%
ggplot() +
geom_boxplot(aes(insects, count)) +
facet_wrap(vars(plot)) # Even though the data is not fully tidy, we can easily make boxplots in ggplot
hw_3 %>%
ggplot() +
geom_point(aes(insects, count)) +
facet_wrap(vars(plot)) # Same
#' ## Bonus challenge
#' Use a [join](https://dplyr.tidyverse.org/reference/mutate-joins.html) function from `dplyr` to find out
#' how many different models of airplanes were flown out of New York by each airline
#' in 2013. **Which carrier flew the largest number of airplane models?**
#'
#' * Hint: install/load `nycflights13` and then look at the `?planes` and `?flights` data
#' frames.
#+ bonus_solved_1
flights %>% glimpse()
planes %>% glimpse()
#' In the planes data frame, the variable `tailnum` is associated with the airplane `model`. The variable `tailnum`
#' can join the flights and planes data frames. `Group_by` and `summarize()` would typically be used.
#' Calling `distinct` avoids two separate rounds of `group_by()` and `summarize()`. Using `n_distinct()` makes things
#' very compact. To reach the answer quickly, calling `arrange()` can put the top carrier at the top of the tibble.
#' Alternatively, a second call to `summarize`, which executes `max()` on the summary variable would work.
#'
#' * Use of other than `left_join()` led some astray (others can work but may get complicated)
#' * Some attempted to filter the airports or years; ?flights...
#' * Must join to airplanes df; tailnumbers go with individual planes, not models
#+ bonus_solved_2
flights %>%
left_join(planes, by = "tailnum") %>% # best practice to specify `by = ` to maintain control over join keys
distinct(carrier, model) %>%
group_by(carrier) %>%
summarize(n = n()) %>% ungroup() %>%
arrange(-n)
#+ bonus_solved_3
flights %>%
left_join(planes, by = "tailnum") %>%
group_by(carrier, model) %>%
summarize(n = n()) %>% ungroup() %>% # this summary is meaningless, which is why I prefer using `distinct()` or maybe `slice()`
group_by(carrier) %>%
summarize(n = n()) %>%
slice_max(n) %>%
left_join(airlines, by = "carrier") # pretty display
#' Many paths exist to the answer, but they all lead to American Airlines flying the most airplane models out of NYC in 2013.
#'
#' ### Student example
#' Two of you made this very compact
#+ erba_solution
# Grace Erba: very compact, combines tidyverse and traditional tools
combined.plane.data <- full_join(flights, planes, by = "tailnum")
combined.plane.data %>%
group_by(carrier) %>%
summarise(models = length(unique(model))) # `n_distinct()` = `length(unique())` also works here