 You might get asked to choose a CRAN mirror -- this is asking you to
 choose a site to download the package from. The choice doesn't matter too much; I'd recommend choosing the RStudio mirror.
 
 ```{r, message = FALSE, purl = FALSE}
 library("dplyr") ## loads in dplyr package to use
+library("tidyr") ## loads in tidyr package to use
+library("ggplot2") ## loads in ggplot2 package to use
 library("readr") ## load in readr package to use
 ```
 
 You only need to install a package once per computer, but you need to load it
 every time you open a new R session and want to use that package.
 
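The install-once, load-every-session distinction can be sketched as follows. This is a minimal sketch and not part of the lesson: `missing_packages()` is a hypothetical helper name, and the `interactive()` guard is an assumption added so the snippet never starts a download from a non-interactive script.

```r
# Packages this lesson loads with library()
lesson_packages <- c("dplyr", "tidyr", "ggplot2", "readr")

# Hypothetical helper: return the packages that are not yet installed.
# requireNamespace() tries to load a package's namespace without attaching it
# and returns TRUE or FALSE.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(lesson_packages)

# Installing is needed only once per computer, so install only what is missing.
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```

Loading with `library()` is still required in every new R session, whether or not anything had to be installed.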
+> ## Tip: Installing packages
+> It may be tempting to install the `tidyverse` package, as it contains a
+> useful collection of packages for this lesson and beyond. However, when
+> teaching or following this lesson, we advise that participants install
+> `dplyr`, `readr`, `ggplot2`, and `tidyr` individually as shown above.
+> Otherwise, a substantial amount of the lesson will be spent waiting for the
+> installation to complete.
+{: .callout}
+
 ## What is dplyr?
 
 The package `dplyr` is a fairly new (2014) package that tries to provide easy
@@ -158,38 +171,50 @@ To choose rows, use `filter()`:
 filter(variants, sample_id == "SRR2584863")
 ```
 
-`filter()` will keep all the rows that match the conditions that are provided. Here are a few examples:
+`filter()` will keep all the rows that match the conditions that are provided.
+Here are a few examples:
 
 ```{r}
 # rows for which the reference genome has T or G
 filter(variants, REF %in% c("T", "G"))
-# rows with QUAL values greater than or equal to 100
-filter(variants, QUAL >= 100)
 # rows that have TRUE in the column INDEL
 filter(variants, INDEL)
 # rows that don't have missing data in the IDV column
 filter(variants, !is.na(IDV))
 ```
 
+We have a column titled "QUAL". This is a Phred-scaled confidence
+score that a polymorphism exists at this position given the sequencing
+data. Lower QUAL scores indicate low probability of a polymorphism
+existing at that site. `filter()` can be useful for selecting mutations that
+have a QUAL score above a certain threshold:
+
+```{r}
+# rows with QUAL values greater than or equal to 100
+filter(variants, QUAL >= 100)
+```
+
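As a rough illustration of what a Phred-scaled score means, the sketch below converts QUAL values into the probability that a call is wrong, using the general Phred relationship QUAL = -10 * log10(P). The `toy_variants` data frame and all of its values are made up for illustration; they are not the lesson's `variants` data.

```r
library("dplyr")

# Made-up example data (not the lesson's `variants` table)
toy_variants <- data.frame(
  sample_id = c("A", "A", "B", "B"),
  POS       = c(100, 250, 100, 900),
  QUAL      = c(4, 20, 100, 228)
)

# Phred scale: QUAL = -10 * log10(P_error), so P_error = 10^(-QUAL / 10)
toy_variants$p_error <- 10^(-toy_variants$QUAL / 10)

# Keep only the high-confidence calls
high_conf <- filter(toy_variants, QUAL >= 100)
high_conf
```

On this scale a QUAL of 20 corresponds to a 1 in 100 chance that the call is wrong, so a threshold of 100 is a very strict cutoff.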
 `filter()` allows you to combine multiple conditions. You can pass them as separate arguments to the function, separated by `,`, and they will be combined using the `&` (AND) logical operator. If you need to use the `|` (OR) logical operator, you can specify it explicitly:
 
 ```{r}
 # this is equivalent to:
 # filter(variants, sample_id == "SRR2584863" & QUAL >= 100)
 filter(variants, sample_id == "SRR2584863", QUAL >= 100)
 # using `|` logical operator
-filter(variants, sample_id == "SRR2584863", (INDEL | QUAL >= 100))
^[The figure was adapted from the Software Carpentry lesson, [R for Reproducible Scientific Analysis](https://swcarpentry.github.io/r-novice-gapminder/13-dplyr/)]
-
-Here the summary function used was `n()` to find the count for each
-group. Since this is a quite a common operation, there is a simpler method
-called `tally()`:
+We can use `group_by()` to tally the number of mutations detected in each sample
+using the function `tally()`:
 
 ```{r, purl = FALSE, message = FALSE}
 variants %>%
-  group_by(ALT) %>%
+  group_by(sample_id) %>%
   tally()
 ```
 
-To show that there are many ways to achieve the same results, there is another way to approach this, which bypasses `group_by()` using the function `count()`:
+Since counting or tallying values is a common use case for `group_by()`, the function `count()` was created as a shortcut that bypasses `group_by()`:
 
 ```{r, purl = FALSE, message = FALSE}
 variants %>%
-  count(ALT)
+  count(sample_id)
 ```
 
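To check that the two approaches agree, the sketch below runs both on a small made-up data frame. `toy_variants` and its values are hypothetical stand-ins for the lesson's `variants` table.

```r
library("dplyr")

# Made-up example data (not the lesson's `variants` table)
toy_variants <- data.frame(
  sample_id = c("A", "A", "A", "B", "B"),
  INDEL     = c(TRUE, FALSE, FALSE, FALSE, TRUE)
)

# Group first, then tally the rows in each group
by_tally <- toy_variants %>%
  group_by(sample_id) %>%
  tally()

# count() does the grouping and the tallying in one step
by_count <- toy_variants %>%
  count(sample_id)

# Both report the same per-sample counts in a column named `n`
identical(by_tally$n, by_count$n)
```

The same pattern works for any column, so `count(INDEL)` on this toy data would report how many rows are `TRUE` and how many are `FALSE`.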
 > ## Challenge
 >
-> * How many mutations are found in each sample?
+> * How many mutations are INDELs?
 >
 >> ## Solution
 >>
 >> ```{r}
 >> variants %>%
->>   count(sample_id)
+>>   count(INDEL)
 >> ```
 > {: .solution}
 {: .challenge}
 
+
+When the data is grouped, `summarize()` can be used to collapse each group into
+a single-row summary. `summarize()` does this by applying an aggregating
+or summary function to each group.
+
+It can be a bit tricky at first, but we can imagine physically splitting the data
+frame by groups and applying a certain function to summarize the data.
^[The figure was adapted from the Software Carpentry lesson, [R for Reproducible Scientific Analysis](https://swcarpentry.github.io/r-novice-gapminder/13-dplyr/)]
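The "split, then apply a function" picture can be made concrete with a small sketch. It does the split by hand with base R and compares the result with `group_by()` plus `summarize()`. The `toy_variants` data frame and the column name `n_variants` are made up for illustration.

```r
library("dplyr")

# Made-up example data (not the lesson's `variants` table)
toy_variants <- data.frame(
  sample_id = c("A", "A", "A", "B", "B"),
  QUAL      = c(10, 20, 60, 100, 200)
)

# By hand: split into one data frame per sample, then apply nrow() to each piece
pieces <- split(toy_variants, toy_variants$sample_id)
manual_counts <- vapply(pieces, nrow, integer(1))

# With dplyr: each group collapses into a single summary row
per_sample <- toy_variants %>%
  group_by(sample_id) %>%
  summarize(n_variants = n())

per_sample
```

Both routes give one value per sample; `summarize()` simply keeps the result in a data frame with one row per group.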
+
 We can also apply many other functions to individual columns to get other
 summary statistics. For example, we can use built-in functions like `mean()`,
 `median()`, `min()`, and `max()`. These are called "built-in functions" because