-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathpractical_ggplot2.Rmd
351 lines (209 loc) · 9.96 KB
/
practical_ggplot2.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
---
title: "my answers"
author: "My name"
date: "2021-11-12"
output: html_document
---
> This practical aims at performing exploratory plots and how-to build layer by layer to be familiar with
the grammar of graphics. In the last part, a supplementary exercise will focus on plotting genome-wide CNV.
##### Those kind of questions are optional {.bonus}
# Scatter plots of penguins
The `penguins` dataset is provided by the `palmerpenguins` R package. As for every function, most data-sets shipped with a package contain also a useful help page (`?`).
##### If not done already, install the package `palmerpenguins` and load it.
```{r}
# Write your answer here
```
##### Plot the body mass on the _y_ axis and the bill length on the _x_ axis.
```{r}
# Write your answer here
```
##### Plot again the body mass on the _y_ axis and the bill length on the _x_ axis, but with colour by `species`
```{r}
# Write your answer here
```
##### The `geom_smooth()` layer can be used to add a trend line. Try to overlay it to your scatter plot.
```{block, opts.label = "tip"}
by default `geom_smooth` is using a loess regression (< 1,000 points) and adds standard error intervals.
- The `method` argument can be used to change the regression to a linear one: `method = "lm"`
- to disable the ribbon of standard errors, set `se = FALSE`
Be careful where the aesthetics are located, so the trend linear lines are **also** colored per species.
```
```{r}
# Write your answer here
```
##### Adjust the aesthetics of point in order to
- the `shape` map to the originated `island`
- a fixed size of `3`
- a transparency of 40%
```{block, opts.label = "tip"}
You should still have only 3 coloured linear trend lines. Otherwise check to which layer your are adding the aesthetic `shape`.
Remember that fixed parameters are to be defined outside `aes()`
```
```{r}
# Write your answer here
```
##### Ajust the colour aesthetic to the `ggplot()` call to propagate it to both point and regression line.
Try the scale colour viridis for discrete scale (`scale_colour_viridis_d()`). Try to change the default theme to `theme_bw()`
```{r}
# Write your answer here
```
##### Find a way to produce the following plot:
```{r}
# Write your answer here
```
```{block, opts.label = "tip"}
Remember that:
- all aesthetics defined in the `ggplot(aes())` command will be inherited by all following layers
- `aes()` of individual geoms are specific (and overwrite the global definition if present).
- `labs()` controls of plot annotations
- `theme()` allows to tweak the plot like `theme(plot.caption = element_text(face = "italic"))` to render in italic the caption
```
# Categorical data
We are going to use a dataset from the [TidyTuesday](https://github.com/rfordatascience/tidytuesday/) initiative. Several dataset about the theme **deforestation** on April 2021 were released, we will focus on the csv called `brazil_loss.csv`. The dataset columns are described in the linked [README](https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-04-06/readme.md) and the csv is directly available at this [url](https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-04-06/brazil_loss.csv)
##### Load the `brazil_loss.csv` file, remove the 2 first columns (`entity` and `code` since it is all Brazil) and assign the name `brazil_loss`
```{block, opts.label = "tip"}
Set the data type of `year` to `character()`
```
```{r}
# Write your answer here
```
##### Is this dataset tidy?
```{r}
# Write your answer here
```
##### Pivot the deforestation reasons (columns `commercial_crops` to `small_scale_clearing`) to the long format. Values are areas in hectares (`area_ha` is a good column name). Save as `brazil_loss_long`
```{r}
# Write your answer here
```
##### Plot the deforestation areas per year as bars
```{block, opts.label = "tip"}
- `year` needs to be a categorical data. If you didn't read the data as character for this column, you can convert it with `factor()`
- `geom_col()` requires 2 aesthetics
+ `x` must be **categorical / discrete** (see first item)
+ `y` **must be continuous**
```
```{r}
# Write your answer here
```
##### Same as the plot above but bar filled by the reasons for deforestation
```{r}
# Write your answer here
```
Even if we have too many categories, we can appreciate the amount of `natural_disturbances` versus the reasons induced by humans.
##### Lump the reasons for deforestations, keeping only the top 5 reasons, lumping as "Other" the rest
```{block, opts.label = "tip"}
- The function `fct_lump()` is very useful for this operation. Be careful to weight the categories with the appropriate continuous variable.
- The legend of filled colours could be renamed and suppress if the title is explicit
```
```{r}
# Write your answer here
```
##### Optimize the previous plot by sorting the 5 main deforestation reasons {.bonus}
```{block, opts.label = "tip"}
Unfortunately, `fct_infreq()` does not have a weight argument yet.
You need then to take care of this manually.
One solution would be extract the top 5 main reasons using `dplyr` statements.
Then use this vector to recode the `reasons` with the reason name when part of the top5 or `other` if not. Then `fct_reorder(reasons2, area_ha)` does the correct reordering. You might want to use
`fct_rev()` to have the sorting from top to bottom in the legend.
```
```{r}
# Write your answer here
```
# Supplementary exercises
##### Genome-wide copy number variants (CNV) detection {.bonus}
Let's have a look at a real output file for CNV detection. The used tool is called **Reference Coverage Profiles**: [RCP](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4330915/).
It was developed by analyzing the depth of coverage in over 6000 high quality (>40×) genomes.
In the end, for every _kb_ a state is assigned
and similar states are merged eventually.
**state** means:
- 0, no coverage
- 1, deletion
- 2, expected diploidy
- 3, duplication
- 4, > 3 copies
## Reading data
The file is accessible [here](http://biostat2.uni.lu/practicals/data/CNV.seg).
`CNV.seg` has 5 columns and the first 10 lines look like:

##### Load the file [`CNV.seg`](http://biostat2.uni.lu/practicals/data/CNV.seg) in R, assign to the name `cnv` {.bonus}
```{block, opts.label = "warning"}
several issues must be fixed:
- lines starting by a hash (comment) should be discarded. You can specify in `vroom` `comment = '#'` or `skip = 2L`
- first and last column names are unclean. `#chrom` contains a hash and `length (kb)`. Would be neater to fix this upfront.
By skipping the 2 first lines, we need to redo the header manually, avoiding `vroom` to use a data line as header.
- specify a character vector of length 5 to `col_names = ` to create the header
```
```{r}
# Write your answer here
```
## Exploratory plots
##### Plot the counts of the different states. We expect a majority of diploid states. {.bonus}
```{r}
# Write your answer here
```
##### Plot the counts of the different states per chromosome. Might be worth freeing the **count** scale. {.bonus}
```{r}
# Write your answer here
```
##### Using the previous plot, reorder the levels of chromosomes to let them appear in the karyotype order (1:22, X, Y) {.bonus}
```{block, opts.label = "tip"}
we could explicitly provide the full levels lists in the desired order.
However, in the tibble, the chromosomes appear in the wanted order.
See the `fct_inorder()` function in the **forcats** package to take advantage of this.
```
```{r}
# Write your answer here
```
##### Sexual chromosomes are not informative, collapse them into a **gonosomes** level {.bonus}
```{block, opts.label = "tip"}
See the `fct_collapse()` function in the **forcats**
```
```{r}
# Write your answer here
```
##### plot the genomic segments length per state
```{block, opts.label = "tip"}
- The distributions are completely skewed: transform to log-scale to get a decent plot.
- Add the summary `mean` and `median` with colored hyphens using the **ToothGrowth** example
Of note, `shape = 95` for dots is an hyphen shape.
```
```{r}
# Write your answer here
```
## Count gain / loss summarising events per chromosome
##### Filter the tibble only for *autosomes* and remove segments with no coverage and diploid (_i.e_ states 0 and 2 respectively). Save as `cnv_auto`. {.bonus}
```{r}
# Write your answer here
```
##### We are left with state 1 and 3 and 4. Rename **1** as _loss_ and the others as _gain_ {.bonus}
##### Count the events per chromosome and per state {.bonus}
##### For _loss_ counts, set them to negative so the barplot will be display up / down. Save as `cnv_auto_chr` {.bonus}
```{r}
# Write your answer here
```
##### Plot `cnv_auto_chr` using the count as the `y` variable {.bonus}
```{r}
# Write your answer here
```
this is the final plot, where the following changes were made:
- labels of the `y` axis in absolute numbers
- set `expand = c(0, 0)` on the `x` axis. see [stackoverflow's answer](http://stackoverflow.com/a/22945857/1395352)
- use `theme_classic()`
- set the legend on the top right corner. Use a mix of `legend.position` and `legend.justification` in a `theme()` call.
- remove the label of the `x` axis, you could use _chromosomes_ if you prefer
- change the color of the fill argument with `c("springgreen3", "steelblue2")`
It is now obvious that we have mainly huge **deletions** on chromosome 10 and **amplifications** on chromosome 11.
In order to plot the genomic localisations of these events, we want to focus on the main chromosomes that were
affected by amplifications/deletions.
##### Lump the chromsomes by the number of CNV events (states 1, 3 or 4) keeping the 5 top ones and plot the counts {.bonus}
```{block, opts.label = "tip"}
the function `fct_lump` from _forcats_ ease lumping. Just pick `n = 5` to get the top 5 chromosomes
```
```{r}
# Write your answer here
```
## Genomic plot of top 5 chromosomes
##### Plot the genomic localisations of CNV on the 5 main chromosomes {.bonus}
```{r}
# Write your answer here
```