Commit a9ce0d5

Merge pull request #161 from ytakemon/Post_AUC_dplyr_lesson_updates
Post AUC beta run dplyr lesson updates
2 parents 9e21ec2 + c0a20d5 commit a9ce0d5

1 file changed

+60
-45
lines changed

episodes/05-dplyr.Rmd

Lines changed: 60 additions & 45 deletions
Original file line number | Diff line number | Diff line change
@@ -52,20 +52,33 @@ then load it to be able to use it.
5252

5353
```{r, eval = FALSE, purl = FALSE}
5454
install.packages("dplyr") ## installs dplyr package
55-
install.packages("readr") ## install readr pacakge
55+
install.packages("tidyr") ## installs tidyr package
56+
install.packages("ggplot2") ## installs ggplot2 package
57+
install.packages("readr") ## install readr package
5658
```
5759

5860
You might get asked to choose a CRAN mirror -- this is asking you to
5961
choose a site to download the package from. The choice doesn't matter too much; I'd recommend choosing the RStudio mirror.
6062

6163
```{r, message = FALSE, purl = FALSE}
6264
library("dplyr") ## loads in dplyr package to use
65+
library("tidyr") ## loads in tidyr package to use
66+
library("ggplot2") ## loads in ggplot2 package to use
6367
library("readr") ## load in readr package to use
6468
```
6569

6670
You only need to install a package once per computer, but you need to load it
6771
every time you open a new R session and want to use that package.
6872

73+
> ## Tip: Installing packages
74+
> It may be tempting to install the `tidyverse` package, as it contains a
75+
> useful collection of packages for this lesson and beyond. However, when
76+
> teaching or following this lesson, we advise that participants install
77+
> `dplyr`, `readr`, `ggplot2`, and `tidyr` individually as shown above.
78+
> Otherwise, a substantial amount of the lesson will be spent waiting for the
79+
> installation to complete.
80+
{: .callout}
81+
6982
## What is dplyr?
7083

7184
The package `dplyr` is a fairly new (2014) package that tries to provide easy
@@ -158,38 +171,50 @@ To choose rows, use `filter()`:
158171
filter(variants, sample_id == "SRR2584863")
159172
```
160173
161-
`filter()` will keep all the rows that match the conditions that are provided. Here are a few examples:
174+
`filter()` will keep all the rows that match the conditions that are provided.
175+
Here are a few examples:
162176

163177
```{r}
164178
# rows for which the reference genome has T or G
165179
filter(variants, REF %in% c("T", "G"))
166-
# rows with QUAL values greater than or equal to 100
167-
filter(variants, QUAL >= 100)
168180
# rows that have TRUE in the column INDEL
169181
filter(variants, INDEL)
170182
# rows that don't have missing data in the IDV column
171183
filter(variants, !is.na(IDV))
172184
```
173185

186+
We have a column titled "QUAL". This is a Phred-scaled confidence
187+
score that a polymorphism exists at this position given the sequencing
188+
data. Lower QUAL scores indicate a lower probability of a polymorphism
189+
existing at that site. `filter()` can be useful for selecting mutations that
190+
have a QUAL score above a certain threshold:
191+
192+
```{r}
193+
# rows with QUAL values greater than or equal to 100
194+
filter(variants, QUAL >= 100)
195+
```
196+
174197
`filter()` allows you to combine multiple conditions. You can pass them as separate arguments to the function, separated by `,`; they will be combined using the `&` (AND) logical operator. If you need to use the `|` (OR) logical operator, you can specify it explicitly:
175198

176199
```{r}
177200
# this is equivalent to:
178201
# filter(variants, sample_id == "SRR2584863" & QUAL >= 100)
179202
filter(variants, sample_id == "SRR2584863", QUAL >= 100)
180203
# using `|` logical operator
181-
filter(variants, sample_id == "SRR2584863", (INDEL | QUAL >= 100))
204+
filter(variants, sample_id == "SRR2584863", (MQ >= 50 | QUAL >= 100))
182205
```
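
To see how the comma, `&`, and `|` behave, here is a minimal sketch on a small made-up data frame. The `toy` table and its values are hypothetical; it only borrows column names from `variants` and is not part of the lesson data.

```r
library(dplyr)

# Hypothetical toy data; column names mirror the lesson's `variants` table
toy <- data.frame(
  sample_id = c("A", "A", "A", "B"),
  MQ        = c(60, 40, 30, 60),
  QUAL      = c(50, 150, 20, 200)
)

# Comma-separated conditions are combined with AND ...
by_comma <- filter(toy, sample_id == "A", QUAL >= 100)
# ... so this gives the same rows as writing `&` explicitly
by_and <- filter(toy, sample_id == "A" & QUAL >= 100)
identical(by_comma, by_and)

# OR inside one argument: rows from sample "A" that pass either threshold
either <- filter(toy, sample_id == "A", (MQ >= 50 | QUAL >= 100))
nrow(either)
```

The parentheses around the `|` condition are not strictly required here, but they make it clear that the OR applies only within that one argument.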
183206

184207
> ## Challenge
185208
>
186209
> Select all the mutations that occurred between the positions 1e6 (one million)
187-
> and 2e6 (included) that are not indels and have QUAL greater than 200.
210+
> and 2e6 (inclusive) that have a QUAL greater than 200, and exclude INDEL mutations.
211+
> Hint: to flip a logical value such as TRUE to FALSE, we can use the negation symbol
212+
> "!" (e.g., !TRUE == FALSE).
188213
>
189214
>> ## Solution
190215
>>
191216
>> ```{r}
192-
>> filter(variants, POS >= 1e6 & POS <= 2e6, !INDEL, QUAL > 200)
217+
>> filter(variants, POS >= 1e6 & POS <= 2e6, QUAL > 200, !INDEL)
193218
>> ```
194219
> {: .solution}
195220
{: .challenge}
@@ -213,7 +238,7 @@ variants %>%
213238
select(REF, ALT, DP)
214239
```
215240
216-
In the above code, we use the pipe to send the `variants` dataset first through
241+
In the above code, we use the pipe to send the `variants` data set first through
217242
`filter()`, to keep rows where `sample_id` matches a particular sample, and then through `select()` to
218243
keep only the `REF`, `ALT`, and `DP` columns. Since `%>%` takes
219244
the object on its left and passes it as the first argument to the function on
@@ -258,14 +283,14 @@ SRR2584863_variants %>% slice(10:25)
258283
> Starting with the `variants` data frame, use pipes to subset the data
259284
> to include only observations from the SRR2584863 sample,
260285
> where the filtered depth (DP) is at least 10.
261-
> Shwoing only 5th through 11th rows of columns `REF`, `ALT`, and `POS`.
286+
> Show only the 5th through 11th rows of columns `REF`, `ALT`, and `POS`.
262287
>
263288
>> ## Solution
264289
>> ```{r}
265290
>> variants %>%
266291
>> filter(sample_id == "SRR2584863" & DP >= 10) %>%
267292
>> slice(5:11) %>%
268-
>> select(REF, ALT, POS)
293+
>> select(sample_id, DP, REF, ALT, POS)
269294
>> ```
270295
> {: .solution}
271296
{: .challenge}
@@ -276,15 +301,12 @@ Frequently you'll want to create new columns based on the values in existing
276301
columns, for example to do unit conversions or find the ratio of values in two
277302
columns. For this we'll use the `dplyr` function `mutate()`.
278303
279-
We have a column titled "QUAL". This is a Phred-scaled confidence
280-
score that a polymorphism exists at this position given the sequencing
281-
data. Lower QUAL scores indicate low probability of a polymorphism
282-
existing at that site. We can convert the confidence value QUAL
283-
to a probability value according to the formula:
304+
For example, we can convert the polymorphism confidence value QUAL to a
305+
probability value according to the formula:
284306
285307
Probability = 1 - 10 ^ -(QUAL/10)
286308
287-
Let's add a column (`POLPROB`) to our `variants` data frame that shows
309+
We can use `mutate()` to add a column (`POLPROB`) to our `variants` data frame that shows
288310
the probability of a polymorphism at that site given the data.
289311
290312
```{r, purl = FALSE}
@@ -293,7 +315,7 @@ variants %>%
293315
```
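
As a quick sanity check of the formula Probability = 1 - 10 ^ -(QUAL/10), we can plug in a few round numbers. This is only an illustrative sketch: the `qual` vector below is made up and does not come from the `variants` data.

```r
# Hypothetical QUAL scores chosen to give round results
qual <- c(10, 20, 30)

# Same arithmetic as the formula above, applied to every element of the vector
prob <- 1 - 10 ^ -(qual / 10)
prob
```

Each additional 10 points of QUAL cuts the chance of a false call by a factor of ten, so scores of 10, 20, and 30 correspond to probabilities of about 0.9, 0.99, and 0.999.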
294316
295317
> ## Exercise
296-
> There are a lot of columns in our dataset, so let's just look at the
318+
> There are a lot of columns in our data set, so let's just look at the
297319
> `sample_id`, `POS`, `QUAL`, and `POLPROB` columns for now. Add a
298320
> line to the above code to only show those columns.
299321
>
@@ -313,57 +335,50 @@ variants %>%
313335
Many data analysis tasks can be approached using the "split-apply-combine"
314336
paradigm: split the data into groups, apply some analysis to each group, and
315337
then combine the results. `dplyr` makes this very easy through the use of the
316-
`group_by()` function, which splits the data into groups. When the data is
317-
grouped in this way `summarize()` can be used to collapse each group into
318-
a single-row summary. `summarize()` does this by applying an aggregating
319-
or summary function to each group. For example, if we wanted to group
320-
by sample_id and find the number of rows of data for each
321-
sample, we would do:
338+
`group_by()` function, which splits the data into groups.
322339
323-
```{r, purl = FALSE, message = FALSE}
324-
variants %>%
325-
group_by(sample_id) %>%
326-
summarize(n())
327-
```
328-
329-
It can be a bit tricky at first, but we can imagine physically splitting the data
330-
frame by groups and applying a certain function to summarize the data.
331-
332-
<center>
333-
<img src="../fig/split_apply_combine.png" alt="rstudio default session" style="width: 500px;"/>
334-
</center>
335-
^[The figure was adapted from the Software Carpentry lesson, [R for Reproducible Scientific Analysis](https://swcarpentry.github.io/r-novice-gapminder/13-dplyr/)]
336-
337-
Here the summary function used was `n()` to find the count for each
338-
group. Since this is a quite a common operation, there is a simpler method
339-
called `tally()`:
340+
We can use `group_by()` to tally the number of mutations detected in each sample
341+
using the function `tally()`:
340342
341343
```{r, purl = FALSE, message = FALSE}
342344
variants %>%
343-
group_by(ALT) %>%
345+
group_by(sample_id) %>%
344346
tally()
345347
```
346348
347-
To show that there are many ways to achieve the same results, there is another way to approach this, which bypasses `group_by()` using the function `count()`:
349+
Since counting or tallying values is a common use case for `group_by()`, the function `count()` was created as a shortcut that bypasses `group_by()`:
348350

349351
```{r, purl = FALSE, message = FALSE}
350352
variants %>%
351-
count(ALT)
353+
count(sample_id)
352354
```
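
To confirm that the two approaches report the same counts, here is a small sketch on a made-up data frame. The `toy` table is hypothetical and simply reuses the `sample_id` and `INDEL` column names.

```r
library(dplyr)

# Hypothetical toy data: two rows for sample "A", one for sample "B"
toy <- data.frame(
  sample_id = c("A", "A", "B"),
  INDEL     = c(TRUE, FALSE, FALSE)
)

# Explicit grouping followed by tally()
via_tally <- toy %>%
  group_by(sample_id) %>%
  tally()

# count() groups and tallies in one step
via_count <- toy %>%
  count(sample_id)

# Both results store the per-group counts in a column named `n`
via_tally$n
via_count$n
```

The returned objects may differ in class (a tibble versus a plain data frame), but the `sample_id` and `n` columns hold the same values.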
353355

354356
> ## Challenge
355357
>
356-
> * How many mutations are found in each sample?
358+
> * How many mutations are INDELs?
357359
>
358360
>> ## Solution
359361
>>
360362
>> ```{r}
361363
>> variants %>%
362-
>> count(sample_id)
364+
>> count(INDEL)
363365
>> ```
364366
> {: .solution}
365367
{: .challenge}
366368
369+
370+
When the data is grouped, `summarize()` can be used to collapse each group into
371+
a single-row summary. `summarize()` does this by applying an aggregating
372+
or summary function to each group.
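
As a minimal sketch of this, assuming a small made-up data frame rather than the real `variants` table, we can group by sample and use the summary function `n()` to count the rows in each group. The `toy` data and the `n_rows` column name are hypothetical.

```r
library(dplyr)

# Hypothetical toy data: two rows for sample "A", one for sample "B"
toy <- data.frame(
  sample_id = c("A", "A", "B"),
  QUAL      = c(100, 200, 50)
)

# Collapse each group into a single row holding the group's row count
per_sample <- toy %>%
  group_by(sample_id) %>%
  summarize(n_rows = n())
per_sample
```

The result has one row per value of `sample_id`, which is the "combine" step of split-apply-combine.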
373+
374+
It can be a bit tricky at first, but we can imagine physically splitting the data
375+
frame by groups and applying a certain function to summarize the data.
376+
377+
<center>
378+
<img src="../fig/split_apply_combine.png" alt="rstudio default session" style="width: 500px;"/>
379+
</center>
380+
^[The figure was adapted from the Software Carpentry lesson, [R for Reproducible Scientific Analysis](https://swcarpentry.github.io/r-novice-gapminder/13-dplyr/)]
381+
367382
We can also apply many other functions to individual columns to get other
368383
summary statistics. For example, we can use built-in functions like `mean()`,
369384
`median()`, `min()`, and `max()`. These are called "built-in functions" because
