generated from r4ds/bookclub-template
-
Notifications
You must be signed in to change notification settings - Fork 18
/
Copy path05-the-package-within.Rmd
531 lines (388 loc) · 16.4 KB
/
05-the-package-within.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, eval = FALSE)
```
# The package within
**Purpose**
To walk through the development of a small toy package, with an emphasis on the __package's__ R code and how it differs from R code in a __script__.
Roadmap:
1. Data analysis script
2. Isolate and extract the reusable data and logic from the script
3. Put code into an R package
4. Use package in a newly simplified script.
This chapter purposefully makes mistakes along the way, in order to highlight special considerations for the R code inside a package.
*Section headers incorporate the NATO phonetic alphabet.*
## Alfa: a script that works
A fictional script for a dataset of people who went for a swim:
> Where did you swim and how hot was it outside?
```{r}
infile <- "swim.csv"
(dat <- read.csv(infile))
#> name where temp
#> 1 Adam beach 95
#> 2 Bess coast 91
#> 3 Cora seashore 28
#> 4 Dale beach 85
#> 5 Evan seaside 31
```
1. Observations were classified as American or British based on how they described the beach:
```{r}
dat$english[dat$where == "beach"] <- "US"
dat$english[dat$where == "coast"] <- "US"
dat$english[dat$where == "seashore"] <- "UK"
dat$english[dat$where == "seaside"] <- "UK"
```
2. Temperatures were converted to Celsius:
```{r}
dat$temp[dat$english == "US"] <- (dat$temp[dat$english == "US"] - 32) * 5/9
dat
#> name where temp english
#> 1 Adam beach 35.0 US
#> 2 Bess coast 32.8 US
#> 3 Cora seashore 28.0 UK
#> 4 Dale beach 29.4 US
#> 5 Evan seaside 31.0 UK
```
3. Data is written back out to a CSV file. A timestamp is also captured in the filename.
```{r}
now <- Sys.time()
timestamp <- format(now, "%Y-%B-%d_%H-%M-%S")
(outfile <- paste0(timestamp, "_", sub("(.*)([.]csv$)", "\\1_clean\\2", infile)))
#> [1] "2022-April-17_07-14-16_swim_clean.csv"
write.csv(dat, file = outfile, quote = FALSE, row.names = FALSE)
```
## Bravo: a better script that works
- There's a package that lurks within the original script.
- Suboptimal coding practices like repetitive code and mixing of code and data makes it hard to catch.
- A good first step is to refactor this code.
Next version of the script:
```{r}
library(tidyverse)
infile <- "swim.csv"
dat <- read_csv(infile, col_types = cols(name = "c", where = "c", temp = "d"))
lookup_table <- tribble(
~where, ~english,
"beach", "US",
"coast", "US",
"seashore", "UK",
"seaside", "UK"
)
dat <- dat %>%
left_join(lookup_table)
#> Joining, by = "where"
f_to_c <- function(x) (x - 32) * 5/9
dat <- dat %>%
mutate(temp = if_else(english == "US", f_to_c(temp), temp))
dat
#> # A tibble: 5 × 4
#> name where temp english
#> <chr> <chr> <dbl> <chr>
#> 1 Adam beach 35 US
#> 2 Bess coast 32.8 US
#> 3 Cora seashore 28 UK
#> 4 Dale beach 29.4 US
#> 5 Evan seaside 31 UK
now <- Sys.time()
timestamp <- function(time) format(time, "%Y-%B-%d_%H-%M-%S")
outfile_path <- function(infile) {
paste0(timestamp(now), "_", sub("(.*)([.]csv$)", "\\1_clean\\2", infile))
}
write_csv(dat, outfile_path(infile))
```
**Key features of this code:**
- Functions from the `tidyverse` packages, `readr` and `dplyr`.
- Different “beach” words are stored in a _lookup table_. This makes it easier to add words in the future.
- `f_to_c()`, `timestamp()`, and `outfile_path()` functions now hold the logic for converting temperatures and forming the timestamped output file name.
Easier to recognize the reusable bits of this script, i.e. the bits that have nothing to do with a specific input file, like `swim.csv`
## Charlie: external helpers
Move reusable data and logic out of the analysis script and into separate files.
Here is the content of `beach-lookup-table.csv`:
```
where,english
beach,US
coast,US
seashore,UK
seaside,UK
```
Here is the content of `cleaning-helpers.R`:
```{r}
library(tidyverse)
localize_beach <- function(dat) {
lookup_table <- read_csv(
"beach-lookup-table.csv",
col_types = cols(where = "c", english = "c")
)
left_join(dat, lookup_table)
}
f_to_c <- function(x) (x - 32) * 5/9
celsify_temp <- function(dat) {
mutate(dat, temp = if_else(english == "US", f_to_c(temp), temp))
}
now <- Sys.time()
timestamp <- function(time) format(time, "%Y-%B-%d_%H-%M-%S")
outfile_path <- function(infile) {
paste0(timestamp(now), "_", sub("(.*)([.]csv$)", "\\1_clean\\2", infile))
}
```
- High-level helper functions like `localize_beach()` and `celsify_temp()`, were added to the pre-existing helpers (`f_to_c()`, `timestamp()`, and `outfile_path()`)
```{r}
library(tidyverse)
source("cleaning-helpers.R")
infile <- "swim.csv"
dat <- read_csv(infile, col_types = cols(name = "c", where = "c", temp = "d"))
(dat <- dat %>%
localize_beach() %>%
celsify_temp())
#> Joining, by = "where"
#> # A tibble: 5 × 4
#> name where temp english
#> <chr> <chr> <dbl> <chr>
#> 1 Adam beach 35 US
#> 2 Bess coast 32.8 US
#> 3 Cora seashore 28 UK
#> 4 Dale beach 29.4 US
#> 5 Evan seaside 31 UK
write_csv(dat, outfile_path(infile))
```
Script is now much shorter (and cleaner). However, whether it's easier depends on personal preference and what "feels" easier to work with.
For more information, refer to the [Tidyverse design guide](https://design.tidyverse.org/)
## Delta: an attempt at a package
- Use `usethis::create_package()` to scaffold a new R package
- Copy `cleaning-helpers.R` into the new package, specifically, to `R/cleaning-helpers.R`
- Copy `beach-lookup-table.csv` into the top-level of the new source package
- Install package
Script we're trying to run:
```{r}
library(tidyverse)
library(delta) # instead of source("cleaning-helpers.R")
infile <- "swim.csv"
dat <- read_csv(infile, col_types = cols(name = "c", where = "c", temp = "d"))
dat <- dat %>%
localize_beach() %>%
celsify_temp()
write_csv(dat, outfile_path(infile))
```
Results when we try to run this code:
```{r}
library(tidyverse)
library(delta)
infile <- "swim.csv"
dat <- read_csv(infile, col_types = cols(name = "c", where = "c", temp = "d"))
dat <- dat %>%
localize_beach() %>%
celsify_temp()
> Error in localize_beach(.) : could not find function "localize_beach"
write_csv(dat, outfile_path(infile))
> Error in outfile_path(infile) : could not find function "outfile_path"
```
Despite calling `library(delta)`, none of the functions were actually available to use.
- Attaching a package does not put the functions in the global workspace.
- In contrast to `source()`-ing a file of helper functions, attaching a package does not dump its functions into the global workspace.
- By default, functions in a package are only for internal use
- We can export these functions properly by putting `@export` in the roxygen comment above each function (See Section 10.4 for more information on DESCRIPTION)
```{r}
#' @export
celsify_temp <- function(dat) {
mutate(dat, temp = if_else(english == "US", f_to_c(temp), temp))
}
```
Now our script works (sort of)!
```{r}
library(tidyverse)
library(delta)
infile <- "swim.csv"
dat <- read_csv(infile, col_types = cols(name = "c", where = "c", temp = "d"))
dat <- dat %>%
localize_beach() %>%
celsify_temp()
#> Error: 'beach-lookup-table.csv' does not exist in current working directory ('/Users/jenny/tmp').
write_csv(dat, outfile_path(infile))
```
Problem: You can't dump CSV files into the source of an R package and expect it to work. Despite this, you can still install and attach this package.
- This means that broken packages can still be used. To prevent this, you should run `R CMD` check or `check()` often during development.
- Doing so will alert you to the problem:
```
* installing *source* package ‘delta’ ...
** using staged installation
** R
** byte-compile and prepare package for lazy loading
Error in library(tidyverse) : there is no package called ‘tidyverse’
Error: unable to load R code in package ‘delta’
Execution halted
ERROR: lazy loading failed for package ‘delta’
* removing ‘/Users/brendan/RScripts/delta.Rcheck/delta’
```
What is the reason behind this error?
- Package was declared incorrectly
- While you can load a package using `library(tidyverse)` in a script, dependencies on other packages must be declared in the `DESCRIPTION`
## Echo: a working package
Now we'll make a package that actually works:
```{r}
lookup_table <- dplyr::tribble(
~where, ~english,
"beach", "US",
"coast", "US",
"seashore", "UK",
"seaside", "UK"
)
#' @export
localize_beach <- function(dat) {
dplyr::left_join(dat, lookup_table)
}
f_to_c <- function(x) (x - 32) * 5/9
#' @export
celsify_temp <- function(dat) {
dplyr::mutate(dat, temp = dplyr::if_else(english == "US", f_to_c(temp), temp))
}
now <- Sys.time()
timestamp <- function(time) format(time, "%Y-%B-%d_%H-%M-%S")
#' @export
outfile_path <- function(infile) {
paste0(timestamp(now), "_", sub("(.*)([.]csv$)", "\\1_clean\\2", infile))
}
```
**Note:** To fix our initial problem with loading a CSV file, we've created a dataframe `lookup_table`. However, Chapter 8 Data provides more guidance and recommendations on how to load datasets properly.
**Other Note:** When calling functions from other packages, we should specify the package that we're using (e.g., dplyr::mutate()). Moreover, we should identify the specific package being used, rather than the meta-package (e.g., do not use tidyverse::mutate())
- All of the user-facing functions have an `@export` tag in their roxygen comment, which means that devtools::document() adds them correctly to the NAMESPACE file.
This package can be installed, but we receive 1 note and 1 warning:
```
* checking R code for possible problems ... NOTE
celsify_temp: no visible binding for global variable ‘english’
celsify_temp: no visible binding for global variable ‘temp’
Undefined global functions or variables:
english temp
* checking for missing documentation entries ... WARNING
Undocumented code objects:
‘celsify_temp’ ‘localize_beach’ ‘outfile_path’
All user-level objects in a package should have documentation entries.
See chapter ‘Writing R documentation files’ in the ‘Writing R
Extensions’ manual.
```
Translation of first warnings: `no visible binding for global variable ‘english’` and `no visible binding for global variable ‘temp’`
- Using bare variable names like `english` and `temp` looks suspicious because you're using unquoted variable names from `dplyr` inside a package.
- Defining these variables globally eliminates the note:
```{r}
option 1 (then you should also put utils in Imports)
utils::globalVariables(c("english", "temp"))
option 2
english <- temp <- NULL
```
The other note we received from R:
`"Undocumented code objects: ‘celsify_temp’ ‘localize_beach’ ‘outfile_path’ All user-level objects in a package should have documentation entries."`
This is caused by not documenting exported functions. Using roxygen comments to document it should solve the problem. (Function documentation is discussed in Chapter 16).
## Foxtrot: build time vs. run time
**Another problem::** timestamps don't seem to work properly.
```{r}
Sys.time()
#> [1] "2022-02-24 20:49:59 PST"
outfile_path("INFILE.csv")
#> [1] "2020-September-03_11-06-33_INFILE_clean.csv"
```
The timestamp reflects the time that the function was initially run, rather than the current time.
Recall how we form the filepath for output files:
```{r}
now <- Sys.time()
timestamp <- function(time) format(time, "%Y-%B-%d_%H-%M-%S")
outfile_path <- function(infile) {
paste0(timestamp(now), "_", sub("(.*)([.]csv$)", "\\1_clean\\2", infile))
}
```
**Source of the problem:** The `now <- Sys.time()` function lies outside the `outfile_path` definition, so it is executed when the package is built, but never again. *Code outside your functions is only built once at build time.*
Moving `now <- Sys.time` so that it's no longer top level code:
```{r}
# always timestamp as "now"
outfile_path <- function(infile) {
now <- timestamp(Sys.time())
paste0(now, "_", sub("(.*)([.]csv$)", "\\1_clean\\2", infile))
}
# allow user to provide a time, but default to "now"
outfile_path <- function(infile, time = Sys.time()) {
now <- timestamp(time)
paste0(now, "_", sub("(.*)([.]csv$)", "\\1_clean\\2", infile))
}
```
> Need to have a different mindset when defining objects inside a package. The objects should be functions and these functions should (generally) only use data they create or that is passed via an argument.
## Golf: side effects
A new concern with the timestamp: The timestamps depend on which part of the world you're in.
```{r, results='hide', warning=FALSE, echo=FALSE}
suppressPackageStartupMessages(library(tidyverse))
```
```{r echo=FALSE}
location <- c("Rome, Italy", "Warsaw, Poland", "Sao Paulo, Brazil", "Greenwich, England")
timestamp <- c("2020-settembre-05_00-30-00", "2020-września-05_00-30-00", "2020-setembro-04_19-30-00", "2020-September-04_23-30-00")
LC_TIME <- c("it_IT.UTF-8", "pl_PL.UTF-8", "pt_BR.UTF-8", "en_GB.UTF-8")
tz <- c("Europe/Rome", "Europe/Warsaw", "America/Sao_Paulo", "Europe/London")
tibble::tibble(location, timestamp, LC_TIME, tz) %>%
knitr::kable()
```
**Problem:** The month names vary, as does the time, and even the date.
**Proposed Solution:** Create timestamps that are all in a fixed time zone.
- We can force a certain locale with `Sys.setlocale()` and force a particular time zone by adjusting the TZ environment variable.
Our attempt at implementing this:
```
timestamp <- function(time = Sys.time()) {
Sys.setlocale("LC_TIME", "C")
Sys.setenv(TZ = "UTC")
format(time, "%Y-%B-%d_%H-%M-%S")
}
```
However, a user in Brazil would see this after using `outfile_path()` from our package:
```
outfile_path("INFILE.csv")
#> [1] "2022-April-17_07-14-18_INFILE_clean.csv"
format(Sys.time(), "%Y-%B-%d_%H-%M-%S")
#> [1] "2022-April-17_07-14-18"
```
Our calls to `Sys.setlocale()` and `Sys.setenv()` inside `timestamp()` have made persistent changes to their R session. This sort of side effect is very undesirable and is extremely difficult to track down and debug, especially in more complicated settings.
**Solution:**
```
# use withr::local_*() functions to keep the changes local to timestamp()
timestamp <- function(time = Sys.time()) {
withr::local_locale(c("LC_TIME" = "C"))
withr::local_timezone("UTC")
format(time, "%Y-%B-%d_%H-%M-%S")
}
# use the tz argument to format.POSIXct()
timestamp <- function(time = Sys.time()) {
withr::local_locale(c("LC_TIME" = "C"))
format(time, "%Y-%B-%d_%H-%M-%S", tz = "UTC")
}
# put the format() call inside withr::with_*()
timestamp <- function(time = Sys.time()) {
withr::with_locale(
c("LC_TIME" = "C"),
format(time, "%Y-%B-%d_%H-%M-%S", tz = "UTC")
)
}
```
The locale in our timestamp is only temporarily modified with the `withr::with_locale` function.
- In this example, the mistake we made was changing the user's overall state.
- If you have to do this, make sure this is documented explicitly or try to make them reversible
> You need to adopt a different mindset when defining functions inside a package. Try to avoid making any changes to the user’s overall state. If such changes are unavoidable, make sure to reverse them (if possible) or to document them explicitly (if related to the function’s primary purpose).
## Meeting Videos
### Cohort 1
`r knitr::include_url("https://www.youtube.com/embed/eMWgu9OQ0m8")`
### Cohort 2
`r knitr::include_url("https://www.youtube.com/embed/yRzRiXiqPag")`
### Cohort 3
`r knitr::include_url("https://www.youtube.com/embed/bsGCgJr60as")`
<details>
<summary> Meeting chat log </summary>
```
00:45:32 collinberke: Check this out regarding side effects: https://withr.r-lib.org/articles/changing-and-restoring-state.html
00:55:58 Arun Chavan: https://rstats.wtf/
00:58:05 Isabella Velásquez: https://yihui.org/knitr/options/
00:59:31 Arun Chavan: +1 for knitr options because the tab autocompletion doesn’t work for them (at least for me)
01:00:58 Brendan Lam: I'm gonna try that
```
</details>
### Cohort 4
`r knitr::include_url("https://www.youtube.com/embed/xHYtYBrNAbg")`
<details>
<summary> Meeting chat log </summary>
```
00:25:03 Olivier Leroy: dplyr:::select()
00:26:06 Olivier Leroy: dplyr::select()
00:28:56 Olivier Leroy: https://docs.google.com/spreadsheets/d/1AlgFwMR42NjjoG2lcm0hf_pk5ZACw-n8UCwnrny91bU/edit#gid=0
00:35:45 Howard Baek: https://github.com/r-lib/conflicted
```
</details>