practical_ggplot2.Rmd

---
title: "my answers"
author: "My name"
date: "2021-11-12"
output: html_document
---


> This practical aims at performing exploratory plots and how-to build layer by layer to be familiar with
the grammar of graphics. In the last part, a supplementary exercise will focus on plotting genome-wide CNV. 

##### Those kind of questions are optional {.bonus}


# Scatter plots of penguins

The `penguins` dataset is provided by the `palmerpenguins` R package. As for every function, most data-sets shipped with a package contain also a useful help page (`?`).

##### If not done already, install the package `palmerpenguins` and load it.

```{r}
# Write your answer here
```


##### Plot the body mass on the _y_ axis and the bill length on the _x_ axis.

```{r}
# Write your answer here
```


##### Plot again the body mass on the _y_ axis and the bill length on the _x_ axis, but with colour by `species`

```{r}
# Write your answer here
```


##### The `geom_smooth()` layer can be used to add a trend line. Try to overlay it to your scatter plot.

```{block, opts.label = "tip"}
by default `geom_smooth` is using a loess regression (< 1,000 points) and adds standard error intervals. 

- The `method` argument can be used to change the regression to a linear one: `method = "lm"`
- to disable the ribbon of standard errors, set `se = FALSE`

Be careful where the aesthetics are located, so the trend linear lines are **also** colored per species.
```

```{r}
# Write your answer here
```


##### Adjust the aesthetics of point in order to

- the `shape` map to the originated `island`
- a fixed size of `3`
- a transparency of 40%

```{block,  opts.label = "tip"}
You should still have only 3 coloured linear trend lines. Otherwise check to which layer your are adding the aesthetic `shape`.
Remember that fixed parameters are to be defined outside `aes()`
```

```{r}
# Write your answer here
```


##### Ajust the colour aesthetic to the `ggplot()` call to propagate it to both point and regression line.
Try the scale colour viridis for discrete scale (`scale_colour_viridis_d()`). Try to change the default theme to `theme_bw()`

```{r}
# Write your answer here
```


##### Find a way to produce the following plot:


```{r}
# Write your answer here
```


```{block,  opts.label = "tip"}
Remember that:

- all aesthetics defined in the `ggplot(aes())` command will be inherited by all following layers
- `aes()` of individual geoms are specific (and overwrite the global definition if present).
- `labs()` controls of plot annotations
- `theme()` allows to tweak the plot like `theme(plot.caption = element_text(face = "italic"))` to render in italic the caption
```

# Categorical data

We are going to use a dataset from the [TidyTuesday](https://github.com/rfordatascience/tidytuesday/) initiative. Several dataset about the theme **deforestation** on April 2021 were released, we will focus on the csv called `brazil_loss.csv`. The dataset columns are described in the linked [README](https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-04-06/readme.md) and the csv is directly available at this [url](https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-04-06/brazil_loss.csv)


##### Load the `brazil_loss.csv` file, remove the 2 first columns (`entity` and `code` since it is all Brazil) and assign the name `brazil_loss`

```{block, opts.label = "tip"}
Set the data type of `year` to `character()`
```

```{r}
# Write your answer here
```


##### Is this dataset tidy?

```{r}
# Write your answer here
```


##### Pivot the deforestation reasons (columns `commercial_crops` to `small_scale_clearing`) to the long format. Values are areas in hectares (`area_ha` is a good column name). Save as `brazil_loss_long`

```{r}
# Write your answer here
```


##### Plot the deforestation areas per year as bars

```{block, opts.label = "tip"}
- `year` needs to be a categorical data. If you didn't read the data as character for this column, you can convert it with `factor()`
- `geom_col()` requires 2 aesthetics
    + `x` must be **categorical / discrete** (see first item)
    + `y` **must be continuous**
```

```{r}
# Write your answer here
```


##### Same as the plot above but bar filled by the reasons for deforestation

```{r}
# Write your answer here
```


Even if we have too many categories, we can appreciate the amount of `natural_disturbances` versus the reasons induced by humans.

##### Lump the reasons for deforestations, keeping only the top 5 reasons, lumping as "Other" the rest

```{block, opts.label = "tip"}
- The function `fct_lump()` is very useful for this operation. Be careful to weight the categories with the appropriate continuous variable.
- The legend of filled colours could be renamed and suppress if the title is explicit
```

```{r}
# Write your answer here
```


##### Optimize the previous plot by sorting the 5 main deforestation reasons {.bonus}

```{block, opts.label = "tip"}
Unfortunately, `fct_infreq()` does not have a weight argument yet. 
You need then to take care of this manually.

One solution would be extract the top 5 main reasons using `dplyr` statements.

Then use this vector to recode the `reasons` with the reason name when part of the top5 or `other` if not. Then `fct_reorder(reasons2, area_ha)` does the correct reordering. You might want to use
`fct_rev()` to have the sorting from top to bottom in the legend.
```

```{r}
# Write your answer here
```


# Supplementary exercises

##### Genome-wide copy number variants (CNV) detection {.bonus}

Let's have a look at a real output file for CNV detection. The used tool is called **Reference Coverage Profiles**: [RCP](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4330915/). 

It was developed by analyzing the depth of coverage in over 6000 high quality (>40×) genomes.

In the end, for every _kb_ a state is assigned 
and similar states are merged eventually.

**state** means:

- 0, no coverage
- 1, deletion
- 2, expected diploidy
- 3, duplication
- 4, > 3 copies

## Reading data

The file is accessible [here](http://biostat2.uni.lu/practicals/data/CNV.seg).

`CNV.seg` has 5 columns and the first 10 lines look like:

![CNV](https://i.imgur.com/vzU4omM.png)

##### Load the file [`CNV.seg`](http://biostat2.uni.lu/practicals/data/CNV.seg) in R, assign to the name `cnv` {.bonus}

```{block, opts.label = "warning"}
several issues must be fixed:

- lines starting by a hash (comment) should be discarded. You can specify in `vroom` `comment = '#'` or `skip = 2L`
- first and last column names are unclean. `#chrom` contains a hash and `length (kb)`. Would be neater to fix this upfront.

By skipping the 2 first lines, we need to redo the header manually, avoiding `vroom` to use a data line as header.

- specify a character vector of length 5 to `col_names = ` to create the header 
```

```{r}
# Write your answer here
```


## Exploratory plots

##### Plot the counts of the different states. We expect a majority of diploid states.  {.bonus}

```{r}
# Write your answer here
```


##### Plot the counts of the different states per chromosome. Might be worth freeing the **count** scale.  {.bonus}
 
```{r}
# Write your answer here
```


##### Using the previous plot, reorder the levels of chromosomes to let them appear in the karyotype order (1:22, X, Y)  {.bonus}

```{block,  opts.label = "tip"}
we could explicitly provide the full levels lists in the desired order. 

However, in the tibble, the chromosomes appear in the wanted order.

See the `fct_inorder()` function in the **forcats** package to take advantage of this.
```

```{r}
# Write your answer here
```


##### Sexual chromosomes are not informative, collapse them into a **gonosomes** level  {.bonus}

```{block,  opts.label = "tip"}
See the `fct_collapse()` function in the **forcats**
```

```{r}
# Write your answer here
```


##### plot the genomic segments length per state

```{block,  opts.label = "tip"}
- The distributions are completely skewed: transform to log-scale to get a decent plot.
- Add the summary `mean` and `median` with colored hyphens using the **ToothGrowth** example

Of note, `shape = 95` for dots is an hyphen shape.
```

```{r}
# Write your answer here
```


## Count gain / loss summarising events per chromosome

##### Filter the tibble only for *autosomes* and remove segments with no coverage and diploid (_i.e_ states 0 and 2 respectively). Save as `cnv_auto`.  {.bonus}

```{r}
# Write your answer here
```


##### We are left with state 1 and 3 and 4. Rename **1** as _loss_ and the others as _gain_ {.bonus}

##### Count the events per chromosome and per state  {.bonus}

##### For _loss_ counts, set them to negative so the barplot will be display up / down. Save as `cnv_auto_chr` {.bonus}

```{r}
# Write your answer here
```


##### Plot `cnv_auto_chr` using the count as the `y` variable {.bonus}

```{r}
# Write your answer here
```


this is the final plot, where the following changes were made:

- labels of the `y` axis in absolute numbers
- set `expand = c(0, 0)` on the `x` axis. see [stackoverflow's answer](http://stackoverflow.com/a/22945857/1395352)
- use `theme_classic()`
- set the legend on the top right corner. Use a mix of `legend.position` and `legend.justification` in a `theme()` call.
- remove the label of the `x` axis, you could use _chromosomes_ if you prefer
- change the color of the fill argument with `c("springgreen3", "steelblue2")`


It is now obvious that we have mainly huge **deletions** on chromosome 10 and **amplifications** on chromosome 11.


In order to plot the genomic localisations of these events, we want to focus on the main chromosomes that were
affected by amplifications/deletions.

##### Lump the chromsomes by the number of CNV events (states 1, 3 or 4) keeping the 5 top ones and plot the counts {.bonus}


```{block,  opts.label = "tip"}
the function `fct_lump` from _forcats_ ease lumping. Just pick `n = 5` to get the top 5 chromosomes 
```

```{r}
# Write your answer here
```


## Genomic plot of top 5 chromosomes

##### Plot the genomic localisations of CNV on the 5 main chromosomes {.bonus}


```{r}
# Write your answer here
```