diff --git a/docs/00-introduction.md b/docs/00-introduction.md index 24707fe..8fc3ffa 100644 --- a/docs/00-introduction.md +++ b/docs/00-introduction.md @@ -1,27 +1,29 @@ # Introducing R and RStudio IDE -!!! info +!!! info "Learning outcomes" - === "Keypoints" + === "Key points" - R is a powerful, popular open-source scripting language - You can customize the layout of RStudio, and use the project feature to manage the files and packages used in your analysis - - RStudio allows you to run R in an easy-to-use interface and makes it + - RStudio allows you to run R in an easy-to-use interface and makes it easy to find help === "Objectives" - Know advantages of analyzing data in R - Know advantages of using RStudio - - Create an RStudio project, and know the benefits of working within a project + - Create an RStudio project, and know the benefits of working within a + project - Be able to customize the RStudio layout - - Be able to locate and change the current working directory with `getwd()` and `setwd()` + - Be able to locate and change the current working directory with + `getwd()` and `setwd()` - Compose an R script file containing comments and commands - Understand what an R function is - Locate help for an R function using `?`, `??`, and `args()` - + ## Getting ready to use R for the first time In this lesson we will take you through the very first things you need @@ -32,11 +34,12 @@ to get R working. [R](https://en.wikipedia.org/wiki/R_(programming_language)) has been around since 1995, and was created by Ross Ihaka and Robert Gentleman at -the University of Auckland, New Zealand. R is based off the S -programming language developed at Bell Labs and was developed to teach -intro statistics. See this [slide -deck](https://www.stat.auckland.ac.nz/~ihaka/downloads/Massey.pdf) by -Ross Ihaka for more info on the subject. +the University of Auckland, New Zealand. R is based off the [S programming +language](https://en.wikipedia.org/wiki/S_(programming_language)) +developed at Bell Labs and was developed to teach introductory statistics. +See this [slide +deck](https://www.stat.auckland.ac.nz/~ihaka/downloads/Massey.pdf) +by Ross Ihaka for more info on the subject. ## Advantages of using R @@ -50,11 +53,14 @@ R: - **R is [open source](https://en.wikipedia.org/wiki/Open-source_software)**. This means R is free - an advantage if you are at an institution where - you have to pay for your own MATLAB or SAS license. Open source, is - important to your colleagues in parts of the world where expensive - software in inaccessible. It also means that R is actively developed - by a community (see [r-project.org](https://www.r-project.org/)), - and there are regular updates. + you have to pay for your own + [MATLAB](https://en.wikipedia.org/wiki/MATLAB) or + [SAS](https://en.wikipedia.org/wiki/SAS_(software)) license. Open + source, is important to your colleagues in parts of the world where + expensive software in inaccessible. It also means that R is actively + developed by a community + (see [r-project.org](https://www.r-project.org/)), and there are + regular updates. - **R is widely used**. Ok, maybe programming is a popularity contest. Because, R is used in many areas (not just bioinformatics), you are more likely to find help online when you need it. Chances are, @@ -71,12 +77,15 @@ R: What has motivated you to learn R? Have you had a research question for which spreadsheet programs such as Excel have proven difficult to - use, or where the size of the data set created issues? {: .discussion} + use, or where the size of the data set created issues? + + ## RStudio on Nesi Logging into the [NeSI Jupyter Interface](https://dinindusenanayake.github.io/ganesi_authesetup-login/1_jupyterlogin/). + ## Introducing RStudio Server In these lessons, we will be making use of a software called @@ -92,11 +101,9 @@ RStudio that can be accessed in your web browser. RStudio Server has the same features of the Desktop version of RStudio you could download as standalone software. - - - ![image](./figures/rstudio_session_default.png){width="700"} + ## Overview and customization of the RStudio layout Here are the major windows (or panes) of the RStudio environment: @@ -123,7 +130,7 @@ Here are the major windows (or panes) of the RStudio environment: memory. You can also see some properties of objects/datasets such as their type and dimensions. The "History" tab contains a history of the R commands you've executed R. - - **Files/Plots/Packages/Help/Viewer**: This multipurpose pane will + - **Files/Plots/Packages/Help/Viewer**: This multi-purpose pane will show you the contents of directories on your computer. You can also use the "Files" tab to navigate and set the working directory. The "Plots" tab will show the output of any plots generated. In @@ -136,7 +143,7 @@ Here are the major windows (or panes) of the RStudio environment: In the "Files" tab you can select a file and download it from your cloud instance (click the "more" button) to your local computer. - Uploads are also possible. {: .callout} + Uploads are also possible. All of the panes in RStudio have configuration options. For example, you can minimize/maximize a pane, or by moving your mouse in the space @@ -167,7 +174,7 @@ to more easily: analysis project - Restart work where you left off - Collaborate, especially if you are using version control such as - [git](http://swcarpentry.github.io/git-novice/). + [git](http://swcarpentry.github.io/git-novice/) !!! info "" @@ -178,21 +185,20 @@ to more easily: 2. In the window that opens select **Existing Directory** - Then select **Browse....** - Choose and then click "~/R4Genomnics". - - 3. Finally click `Create Project`. In the - "Files" tab of your output pane (more about the RStudio layout in a - moment), you should see an RStudio project file, - **R4Genomics.Rproj**. All RStudio projects end with the - "**.Rproj**" file extension. + Then select **Browse...** + Choose and then click `~/R4Genomics`. + 3. Finally click **Create Project**. In the + "Files" tab of your output pane (more about the RStudio layout above), + you should see an RStudio project file, `R4Genomics.Rproj`. All + RStudio projects end with the `.Rproj` file extension. + !!! tip "Make your project more reproducible with renv" One of the most wonderful and also frustrating aspects of working with R is managing packages. We will talk more about them, but packages - (e.g. ggplot2) are add-ons that extend what you can do with R. + (e.g. `ggplot2`) are add-ons that extend what you can do with R. Unfortunately it is very common that you may run into versions of R and/or R packages that are not compatible. This may make it difficult for someone to run your R script using their version of R or a given R @@ -200,10 +206,10 @@ to more easily: machine. [renv](https://rstudio.github.io/renv/) is an RStudio add-on that will associate your packages and project so that your work is more portable and reproducible. To turn on renv click on the - `Tools` menu and select - `Project Options`. Under **Enviornments** + **Tools** menu and select + **Project Options**. Under **Enviornments** check off "**Use renv with this project**" and follow any installation - instructions. {: .callout} + instructions. ## Creating your first R script @@ -211,39 +217,42 @@ Now that we are ready to start exploring R, we will want to keep a record of the commands we are using. To do this we can create an R script: -Click the `File` menu and select -`New File` and then `R -Script`. Before we go any further, save your script by -clicking the save/disk icon that is in the bar above the first line in -the script editor, or click the `File` menu -and select `save`. In the "Save File" window -that opens, name your file **"genomics_r\_basics"**. The new script -**genomics_r\_basics.R** should appear under "files" in the output pane. -By convention, R scripts end with the file extension **.R**. +1. Click the **File** menu and select **New File** and then **R Script**. +1. Before we go any further, save your script by clicking the save/disk icon + that is in the bar above the first line in the script editor, or click the + **File** menu and select **Save**. +1. In the **Save File** window that opens, name your file `genomics_r_basics`. + The new script `genomics_r_basics.R` should appear under **Files** in the + output pane. By convention, R scripts end with the file extension `.R`. -## Getting to work with R: navigating directories +## Getting to work with R: Navigating directories Now that we have covered the more aesthetic aspects of RStudio, we can get to work using some commands. We will write, execute, and save the -commands we learn in our **genomics_r\_basics.R** script that is loaded +commands we learn in our `genomics_r_basics.R` script that is loaded in the Source pane. First, lets see what directory we are in. To do so, type the following command into the script: !!! r-project - `getwd()` + ```r + getwd() + ``` To execute this command, make sure your cursor is on the same line the command is written. Then click the `Run` button that is just above the first line of your script in the header of the Source pane. -In the console, we expect to see the following output : +In the console, we expect to see the following output: -!!! solution "" - `[1] "/home/shared/"` +!!! success "Output" -* Notice, at the Console, you will also see the instruction you executed above the output in blue. + ``` + [1] "/home/shared/" + ``` + +* Notice, at the Console, the instruction you executed above the output in blue. Since we will be learning several commands, we may already want to keep some short notes in our script to explain the purpose of the command. @@ -253,42 +262,48 @@ include a comment on the purpose of commands you are learning, e.g.: !!! r-project - `# this command shows the current working directory getwd()` + ```r + # this command shows the current working directory getwd() + ``` -!!! question "Exercise : Work interactively in R" +!!! question "Exercise : Work interactively in R" - What happens when you try to enter the `getwd()` command in the Console pane? + What happens when you try to enter the `getwd()` command in the Console + pane? ??? success "Solution" - You will get the same output you did as when you ran `getwd()` from - the source. You can run any command in the Console, however, - executing it from the source script will make it easier for us to - record what we have done, and ultimately run an entire script, - instead of entering commands one-by-one. {: .solution} {: - .challenge} + You will get the same output you did as when you ran `getwd()` from + the source. You can run any command in the Console, however, + executing it from the source script will make it easier for us to + record what we have done, and ultimately run an entire script, + instead of entering commands one-by-one. For the purposes of this exercise we want you to be in the directory `"/home/shared//R4Genomics"`. What if you weren't? You can set your home directory using the `setwd()` command. Enter this command in your -script, but *don't run* this yet. +script, but ***don't run*** this yet. !!! r-project - `# This sets the working directory setwd()` + ```r + # This sets the working directory + setwd() + ``` -You may have guessed, you need to tell the `setwd()` command what -directory you want to set as your working directory. To do so, inside of -the parentheses, open a set of quotes. Inside the quotes enter a `/` -which is the root directory for Linux. Next, use the -`Tab` key, to take advantage of RStudio's -Tab-autocompletion method, to select `home`, `dcuser`, and -`dc_genomics_r` directory. The path in your script should look like -this: +You may have guessed, you need to tell the `setwd()` command what directory you +want to set as your working directory. To do so, inside of the parentheses, open +a set of quotes `""`. Inside the quotes enter a `/` which is the root directory for +Linux. Next, use the `Tab` key, to take advantage of RStudio's +tab-autocompletion method, to select `home`, `shared`, your `` and +`R4Genomics` directory. The path in your script should look like this: !!! r-project - `# This sets the working directory setwd("/home/shared//R4Genomics")` + ```r + # This sets the working directory + setwd("/home/shared//R4Genomics") + ``` When you run this command, the console repeats the command, but gives you no output. Instead, you see the blank R prompt: `>`. @@ -298,7 +313,6 @@ step to analyzing your data. !!! tip "Never use `setwd()`" - Wait, what was the last 2 minutes about? Well, setting your working directory is something you need to do, you need to be very careful about using this as a step in your script. For example, what if your @@ -307,12 +321,14 @@ step to analyzing your data. on Windows it is likely `C:\`. This is one of several ways you might cause a script to break because a file path is configured differently than your script anticipates. R packages like - [here](https://cran.r-project.org/package=here) and - [file.path](https://www.rdocumentation.org/packages/base/versions/3.4.3/topics/file.path) - allow you to specify file paths is a way that is more operating system - independent. See Jenny Bryan's [blog + [`here`](https://cran.r-project.org/package=here) and the function + [`file.path()`](https://www.rdocumentation.org/packages/base/versions/3.4.3/topics/file.path) + allow you to specify file paths that are operating system independent. See + Jenny Bryan's [blog post](https://www.tidyverse.org/articles/2017/12/workflow-vs-script/) - for this and other R tips. {: .callout} + for this and other R tips. + + ## Using functions in R, without needing to master them A function in R (or any computing language) is a short program that @@ -325,25 +341,32 @@ understand what is happening in any R script. Try the following functions by writing them in your script. See if you can guess what they do, and make sure to add comments to your script - about your assumed purpose. - `dir()` - `sessionInfo()` - `date()` - - `Sys.time()` + about your assumed purpose. + + * `dir()` + * `sessionInfo()` + * `date()` + * `Sys.time()` + * `.libPaths()` ??? success "Solution" - - `dir()` \# Lists files in the working directory - - `sessionInfo()` \# Gives the version of R and additional info + - `dir()` lists files in the working directory + - `sessionInfo()` gives the version of R and additional info including on attached packages - - `date()` \# Gives the current date - - `Sys.time()` \# Gives the current time - - `.libPaths()` \# Shows what libraries are available + - `date()` gives the current date + - `Sys.time()` gives the current time + - `.libPaths()` shows what libraries are available **Notice**: Commands are case sensitive! -You have hopefully noticed a pattern - an R function has three key -properties: - Functions have a name (e.g. `dir`, `getwd`); note that -functions are case sensitive! - Following the name, functions have a -pair of `()` - Inside the parentheses, a function may take 0 or more -arguments +You have hopefully noticed a pattern: an R function has three key +properties: + +* Functions have a name (e.g. `dir`, `getwd`); note that functions are case + sensitive! +* Following the name, functions have a pair of `()` +* Inside the parentheses, a function may take 0 or more arguments An argument may be a specific input for your function and/or may modify the function's behavior. For example the function `round()` will round a @@ -351,20 +374,30 @@ number with a decimal: !!! r-project "r" - ``` + ```r # This will round a number to the nearest integer round(3.14) ``` + ??? success "Output" + + ``` + [1] 3 + ``` + + ## Getting help with function arguments What if you wanted to round to one significant digit? `round()` can do this, but you may first need to read the help to find out how. To see -the help (In R sometimes also called a "vignette") enter a `?` in front +the help (in R sometimes also called a "vignette") enter a `?` in front of the function name: !!! r-project "r" - `round()` + + ```r + ?round() + ``` The "Help" tab will show you information (often, too much information). You will slowly learn how to read and make sense of help files. Checking @@ -374,18 +407,36 @@ this function to modify its behavior. You can also see a function's argument using the `args()` function: !!! r-project "r" - `args(round)` + ```r + args(round) + ``` + + !!! success "Output" + + ``` + function (x, digits = 0) + NULL + ``` `round()` takes two arguments, `x`, which is the number to be rounded, and a `digits` argument. The `=` sign indicates that a default (in this case 0) is already set. Since `x` is not set, `round()` requires we provide it, in contrast to `digits` where R will use the default value 0 unless you explicitly provide a different value. We can explicitly set -the digits parameter when we call the function: +the digits argument when we call the function: !!! r-project "r" - `round(3.14159, digits = 2)` + + ```r + round(3.14159, digits = 2) + ``` + + !!! success "Output" + + ``` + [1] 3.14 + ``` Or, R accepts what we call "positional arguments", if you pass a @@ -394,25 +445,30 @@ order you saw when we used `args()`. In the case below that means that `x` is 3.14159 and digits is 2. !!! r-project "r" - `round(3.14159, 2)` + + ```r + round(3.14159, 2) + ``` Finally, what if you are using `?` to get help for a function in a package not installed on your system, such as when you are running a script which has dependencies. !!! r-project "r" - `geom_point()` -will return an error: + ```r + ?geom_point() + ``` + +The above will return an error: !!! failure "Error" + ``` Error in .helpForCall(topicExpr, parent.frame()) : no methods for ‘geom_point’ and no documentation for it as a function ``` - - Use two question marks (i.e. `??geom_point()`) and R will return results from a search of the documentation for packages you have installed on your computer in the "Help" tab. Finally, if you think there should be a @@ -426,9 +482,9 @@ function. functions. Remember to put your search query in quotes inside the function's parentheses. - - Chi-Squared test - - Student t-test - - mixed linear model + * Chi-Squared test + * Student t-test + * Mixed linear model ??? success Solution @@ -437,27 +493,30 @@ function. - Chi-Squared test: `stats::Chisquare` - Student t-test: `stats::t.test` - - mixed linear model: `stats::lm.glm` + - Mixed linear model: `stats::lm.glm` We will discuss more on where to look for the libraries and packages that contain functions you want to use. For now, be aware that two -important ones are [CRAN](https://cran.r-project.org/) - the main -repository for R, and [Bioconductor](http://bioconductor.org/) - a -popular repository for bioinformatics-related R packages. +important ones are: + +* [CRAN](https://cran.r-project.org/), the main repository for R +* [Bioconductor](http://bioconductor.org/), a popular repository for + bioinformatics-related R packages. + ## RStudio contextual help Here is one last bonus we will mention about RStudio. It's difficult to remember all of the arguments and definitions associated with a given function. When you start typing the name of a function and hit the -`Tab` key, RStudio will display functions and +Tab key, RStudio will display functions and associated help: ![images](./figures/studio_contexthelp1.png) -Once you type a function, hitting the `Tab` -inside the parentheses will show you the function's arguments and -provide additional help for each of these arguments. +Once you type a function, hitting the Tab inside the parentheses will +show you the function's arguments and provide additional help for each of these +arguments. ![image](./figures/studio_contexthelp2.png) diff --git a/docs/01-r-basics.md b/docs/01-r-basics.md index 0bde865..9983699 100644 --- a/docs/01-r-basics.md +++ b/docs/01-r-basics.md @@ -1,28 +1,21 @@ # R Basics -!!! info +!!! info "Learning objectives" - === "Keypoints" + === "Key points" - - Effectively using R is a journey of months or years. Still, you don't - have to be an expert to use R and you can start using and analyzing - your data with with about a day's worth of training. - - It is important to understand how data are organized by R in a given - object type and how the mode of that type (e.g. numeric, character, - logical, etc.) will determine how R will operate on that data. - - Working with vectors effectively prepares you for understanding how - data are organized in R. + - Effectively using R is a journey of months or years. Still, you don't have to be an expert to use R, and you can start using and analyzing your data with about a day's worth of training. + - It is important to understand how R organizes data in a given object type and how the mode of that type (e.g. numeric, character, logical, etc.) will determine how R will operate on that data. + - Working with vectors effectively prepares you to understand how R organizes data. === "Objectives" - - Be able to create the most common R objects including vectors - - Understand that vectors have modes, which correspond to the type of - data they contain + - Be able to create the most common R objects, such as vectors + - Understand that vectors have modes, which correspond to the type of data they contain - Be able to use arithmetic operators on R objects - - Be able to retrieve (subset), name, or replace, values from a vector + - Be able to retrieve (subset), name, or replace values from a vector - Be able to use logical operators in a subsetting operation - - Understand that lists can hold data of more than one mode and can be - indexed + - Understand that lists can hold data of more than one mode and can be indexed === "Questions" @@ -33,17 +26,8 @@ ## "The fantastic world of R awaits you" OR "Nobody wants to learn how to use R" -Before we begin this lesson, we want you to be clear on the goal of the -workshop and these lessons. We believe that every learner can **achieve -competency with R**. You have reached competency when you find that you -are able to **use R to handle common analysis challenges in a reasonable -amount of time** (which includes time needed to look at learning -materials, search for answers online, and ask colleagues for help). As -you spend more time using R (there is no substitute for regular use and -practice) you will find yourself gaining competency and even expertise. -The more familiar you get, the more complex the analyses you will be -able to carry out, with less frustration, and in less time - the -fantastic world of R awaits you! +Before we begin this lesson, we want you to be clear on the goal of the workshop and these lessons. We believe that every learner can **achieve competency with R**. You have reached competency when you find that you are able to **use R to handle common analysis challenges in a reasonable amount of time** (which includes the time needed to look at learning materials, search for answers online, and ask colleagues for help). As you spend more time using R (there is no substitute for regular use and practice), you will gain competency and expertise. The more familiar you get, the more complex the analyses you will be able to carry out, with less frustration and in less time - the fantastic world of R awaits you! + ## What these lessons will not teach you @@ -58,16 +42,16 @@ the additional learning materials in the tip box below. **Here are some R skills we will *not* cover in these lessons** -- How to create and work with R matrices -- How to create and work with loops and conditional statements, and - the "apply" family of functions (which are super useful, read more - [here](https://www.r-bloggers.com/r-tutorial-on-the-apply-family-of-functions/)) -- How to do basic string manipulations (e.g. finding patterns in text - using grep, replacing text) -- How to plot using the default R graphic tools (we *will* cover plot - creation, but will do so using the popular plotting package - `ggplot2`) -- How to use advanced R statistical functions +* How to create and work with R matrices +* How to create and work with loops and conditional statements, and + the "apply" family of functions (which are super useful, read more + [here](https://www.r-bloggers.com/r-tutorial-on-the-apply-family-of-functions/)) +* How to do basic string manipulations (e.g. finding patterns in text + using `grep()`, and replacing text using `gsub()`) +* How to plot using the default R graphic tools (we *will* cover plot + creation, but will do so using the popular plotting package + `ggplot2`) +* How to use advanced R statistical functions !!! tip "Where to learn more" @@ -75,20 +59,20 @@ the additional learning materials in the tip box below. them can be quite technical, but if you are a regular R user you may ultimately need this technical knowledge. - - [R for Beginners](https://cran.r-project.org/doc/contrib/Paradis-rdebuts_en.pdf): + * [R for Beginners](https://cran.r-project.org/doc/contrib/Paradis-rdebuts_en.pdf): By Emmanuel Paradis and a great starting point - - [The R Manuals](https://cran.r-project.org/manuals.html): Maintained by the R + * [The R Manuals](https://cran.r-project.org/manuals.html): Maintained by the R project - - [R contributed documentation](https://cran.r-project.org/other-docs.html): Also + * [R contributed documentation](https://cran.r-project.org/other-docs.html): Also linked to the R project; importantly there are materials available in several languages - - [R for Data Science](http://r4ds.had.co.nz/): A + * [R for Data Science](http://r4ds.had.co.nz/): A wonderful collection by noted R educators and developers Garrett Grolemund and Hadley Wickham - - [Practical Data Science for Stats](https://peerj.com/collections/50-practicaldatascistats/): Not + * [Practical Data Science for Stats](https://peerj.com/collections/50-practicaldatascistats/): Not exclusively about R usage, but a nice collection of pre-prints on data science and applications for R - - [Programming in R Software Carpentry + * [Programming in R Software Carpentry lesson](https://software-carpentry.org/lessons/): There are several Software Carpentry lessons in R to choose from @@ -101,48 +85,49 @@ the additional learning materials in the tip box below. !!! bell "Reminder" At this point you should be coding along in the - "**genomics_r\_basics.R**" script we created in the last episode. + `genomics_r_basics.R` script we created in the last episode. Writing your commands in the script (and commenting it) will make it easier to record what you did and why. What might be called a variable in many languages is called an **object** in R. -**To create an object you need:** +**To create an object, you need:** -- a name (e.g. 'a') -- a value (e.g. '1') -- the assignment operator ('\<-') +- A name (e.g. `a`) +- A value (e.g. `1`) +- The assignment operator (`<-`) -In your script, "**genomics_r\_basics.R**", using the R assignment -operator '\<-', assign '1' to the object 'a' as shown. Remember to leave -a comment in the line above (using the '\#') to explain what you are +In your script, `genomics_r_basics.R`, use the R assignment +operator `<-` to assign `1` to the object `a` as shown. Remember to leave +a comment in the line above (using the `#`) to explain what you are doing: !!! r-project "r" + ```r # This line creates the object 'a' and assigns it the value '1' a <- 1 ``` - Next, run this line of code in your script. You can run a line of code -by hitting the `Run` button that is just above the first line of your -script in the header of the Source pane or you can use the appropriate shortcut: +by hitting the `Run` button that is above the first line of your +script in the header of the 'Source' pane or you can use the appropriate shortcut: -- Windows execution shortcut: `Ctrl+Enter` -- Mac execution shortcut: `Cmd(⌘)+Enter` +- Windows execution shortcut: Ctrl + Enter +- Mac execution shortcut: Cmd(⌘) + Enter -To run multiple lines of code, you can highlight all the line you wish to run +To run multiple lines of code, highlight all the line you wish to run and then hit Run or use the shortcut key combo listed above. -In the RStudio 'Console' you should see: +In the RStudio 'Console', you should see: -``` -a <- 1 -> -``` +!!! success "" + ``` + a <- 1 + > + ``` The 'Console' will display lines of code run from a script and any outputs or status/warning/error messages (usually in red). @@ -157,7 +142,7 @@ created in R. !!! question "Exercise: Create some objects in R" - Create the following objects; give each object an appropriate name + Create the following objects and give each object an appropriate name (your best guess at what name to use is fine): 1. Create an object that has the value of number of pairs of human chromosomes @@ -169,12 +154,11 @@ created in R. ??? success "Solution" Here as some possible answers to the challenge: - !!! r-project "r" - ``` + ```r human_chr_number <- 23 gene_name <- 'pten' - ensemble_url <- 'ftp://ftp.ensemblgenomes.org/pub/bacteria/release-39/fasta/bacteria_5_collection/escherichia_coli_b_str_rel606/' + ensemble_url <- "ftp://ftp.ensemblgenomes.org/pub/bacteria/release-39/fasta/bacteria_5_collection/escherichia_coli_b_str_rel606/" human_diploid_chr_num <- 2 * human_chr_number ``` @@ -183,52 +167,47 @@ created in R. Here are some important details about naming objects in R -!!! quote "" - - - **Avoid spaces and special characters**: Object names cannot contain spaces or the minus sign (`-`). You can use '_' to make names more readable. You should avoid - using special characters in your object name (e.g. ! @ # . , etc.). Also, - object names cannot begin with a number. - - **Use short, easy-to-understand names**: You should avoid naming your objects - using single letters (e.g. 'n', 'p', etc.). This is mostly to encourage you - to use names that would make sense to anyone reading your code (a colleague, - or even yourself a year from now). Also, avoiding excessively long names will - make your code more readable. - - **Avoid commonly used names**: There are several names that may already have a - definition in the R language (e.g. 'mean', 'min', 'max'). One clue that a name - already has meaning is that if you start typing a name in RStudio and it gets - a colored highlight or RStudio gives you a suggested autocompletion you have - chosen a name that has a reserved meaning. - - **Use the recommended assignment operator**: In R, we use '<- ' as the - preferred assignment operator. '=' works too, but is most commonly used in - passing arguments to functions (more on functions later). There is a shortcut - for the R assignment operator: - - Windows execution shortcut: Alt+- - - Mac execution shortcut: Option+- +- **Avoid spaces and special characters**
+ Object names cannot contain spaces or the minus sign (`-`). You can use + `_` to make names more readable. You should avoid using special characters + in your object name (e.g. `!` `@` `#` `.` `,` etc.). Also, object names + cannot begin with a number. +- **Use short, easy-to-understand names**
+ Avoid naming your objects using single letters (e.g. `n`, `p`, + etc.). This is mostly to encourage you to use names that would make sense to + anyone reading your code (a colleague, or even yourself a year from now). + Also, avoiding excessively long names will make your code more readable. +- **Avoid commonly used names**
+ Several names may already have a definition in the R language + (e.g. `mean`, `min`, `max`). One clue that a name already has meaning is that + if you start typing a name in RStudio and it gets a colored highlight or + RStudio gives you a suggested autocompletion, you have chosen a name that has + a reserved meaning. +- **Use the recommended assignment operator**
+ In R, we use `<-` as the preferred assignment operator. `=` works too, but is + most commonly used in passing arguments to functions (more on functions + later). Shortcuts for the R assignment operator are: + - Windows execution shortcut: Alt + - + - Mac execution shortcut: Option + - -There are a few more suggestions about naming and style you may want to learn -more about as you write more R code. There are several "style guides" that -have advice, and one to start with is the [tidyverse R style guide](http://style.tidyverse.org/index.html). +There are a few more suggestions about naming and style you can learn more about as you write more R code. Several "style guides" +have advice, and one to start with is the [tidyverse R +style guide](http://style.tidyverse.org/index.html). -!!! tip "Pay attention to warnings in the script console" +!!! tip "Pay attention to warnings in the 'Source' panel" - If you enter a line of code in your script that contains an error, RStudio - may give you an error message and underline this mistake. Sometimes these - messages are easy to understand, but often the messages may need some figuring - out. Paying attention to these warnings will help you avoid mistakes. In the example below, our object name has a space, which - is not allowed in R. The error message does not say this directly, - but R is "not sure" - about how to assign the name to "human_ chr_number" when the object name we - want is "human_chr_number". + If you enter a line of code in your script that contains an error, RStudio may give you an error message and underline this mistake. Sometimes, these messages are easy to understand, but often, the messages may need some figuring out. Paying attention to these warnings will help you avoid mistakes. In the example below, our object name has a space, which is not allowed in R. The error message does not say this directly, but R is "not sure" about how to assign the number to `human_ chr_number` when the object name we want is `human_chr_number` ![images](./figures/rstudio_script_warning.png){width="700"} ### Reassigning object names or deleting objects -Once an object has a value, you can change that value by overwriting it. R will not give you a warning or error if you overwriting an object, which may or may not be a good thing depending on how you look at it. +Once an object has a value, you can change that value by overwriting it. R will not give you a warning or error if you overwrite an object, which may or may not be a good thing depending on how you look at it. !!! r-project "r" - ``` + + ```r # gene_name has the value 'pten' or whatever value you used in the challenge # We will now assign the new value 'tp53' @@ -238,8 +217,9 @@ Once an object has a value, you can change that value by overwriting it. R will You can also remove an object from R's memory entirely. The `rm()` function will delete the object. !!! r-project "r" - ``` - # delete the object 'gene_name' + + ```r + # Delete the object 'gene_name' rm(gene_name) ``` @@ -247,7 +227,14 @@ If you run a line of code that has only an object name, R will normally display the contents of that object. In this case, we are told the object no longer exists. +!!! r-project "r" + + ```r + gene_name + ``` + !!! failure "Error" + ``` Error: object 'gene_name' not found ``` @@ -257,41 +244,39 @@ object no longer exists. In R, **every object has two properties**: -- **Length**: How many distinct values are held in that object -- **Mode**: What is the classification (type) of that object. +- **Length**: The number of distinct values are held in that object +- **Mode**: The classification (type) of that object -We will get to the "length" property later in the lesson. The **"mode" -property** **corresponds to the type of data an object represents**. The +We will get to the "length" property later in the lesson. The **"mode" +property corresponds to the type of data an object represents**. The most common modes you will encounter in R are: | Mode (abbreviation) | Type of data | | ------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| Numeric (num) | Numbers such as floating point/decimals (1.0, 0.5, 3.14), there are also more specific numeric types (dbl - Double, int - Integer). These differences are not relevant for most beginners and pertain to how these value are stored in memory | -| Character (chr) | A sequence of letters/numbers in single '' or double " " quotes | -| Logical (logi) | Boolean values - TRUE of FALSE | +| Numeric (`num`) | Numbers such as floating point (i.e. decimals, e.g. 1.0, 0.5, 3.14).
There are also more specific numeric types (`dbl` - Double, `int` - Integer). These differences are not relevant for most beginners and pertain to how these value are stored in memory | +| Character (`chr`) | A sequence of letters/numbers in single `''` or double `""` quotes | +| Logical (`logi`) | Boolean values `TRUE` or `FALSE` | There are a few other modes (i.e. "complex", "raw" etc.) but these are the three we will work with in this lesson. -Data types are familiar in many programming languages, but also in -natural language where we refer to them as the parts of speech, +Data types are common across many programming languages, but also in +natural language, where we refer to them as the parts of speech, e.g. nouns, verbs, adverbs, etc. Once you know if a word - perhaps an unfamiliar one - is a noun, you can probably guess you can count it and make it plural if there is more than one (e.g. 1 [Tuatara](https://en.wikipedia.org/wiki/Tuatara), or 2 Tuataras). If -something is a adjective, you can usually change it into an adverb by +something is an adjective, you can usually change it into an adverb by adding "-ly" -(e.g. [jejune](https://www.merriam-webster.com/dictionary/jejune) vs. +(e.g., [jejune](https://www.merriam-webster.com/dictionary/jejune) vs. jejunely). Depending on the context, you may need to decide if a word is -in one category or another (e.g "cut" may be a noun when it's on your -finger, or a verb when you are preparing vegetables). These concepts +in one category or another (e.g "cut" may be (1) a noun when it's on your +finger or (2) a verb when you are preparing vegetables). These concepts have important analogies when working with R objects. !!! question "Exercise: Create objects and check their modes" - Create the following objects in R, then use the `mode()` function to - verify their modes. Try to guess what the mode will be before you look - at the solution + Create the following objects in R, then verify their modes using the `mode()` function. Try to guess what the mode will be before you look at the solution. 1. `chromosome_name <- 'chr02'` 2. `od_600_value <- 0.47` @@ -301,56 +286,60 @@ have important analogies when working with R objects. ??? success "Solution" - ```r - chromosome_name <- 'chr02' - od_600_value <- 0.47 - chr_position <- '1001701' - spock <- TRUE - pilot <- Earhart` - ```r mode(chromosome_name) mode(od_600_value) mode(chr_position) - mode(spock) mode(pilot) + mode(spock) + mode(pilot) + ``` + + ``` + [1] "character" + [1] "numeric" + [1] "character" + [1] "logical" + Error: object 'pilot' not found ``` Notice from the solution that even if a series of numbers is given as a value R will consider them to be in the "character" mode if they are -enclosed as single or double quotes. Also, notice that you cannot take a -string of alphanumeric characters (e.g. Earhart) and assign as a value -for an object. In this case, R looks for an object named `Earhart` but +enclosed with single or double quotes. Also, notice that you cannot take a +string of alphanumeric characters (i.e., Earhart) and assign it as a value +for an object. In this case, R looks for an object named `Earhart`, but since there is no object, no assignment can be made. If `Earhart` did exist, then the mode of `pilot` would be whatever the mode of `Earhart` was originally. If we want to create an object called `pilot` that was the **name** "Earhart", we need to enclose `Earhart` in quotation marks. !!! r-project "r" + ```r pilot <- "Earhart" mode(pilot) ``` + ### Mathematical and functional operations on objects -Once an object exists (which by definition also means it has a mode), R -can appropriately manipulate that object. For example, objects of the -numeric modes can be added, multiplied, divided, etc. R provides several +Once an object exists (which, by definition, also means it has a mode), R +can manipulate that object using appropriate methods. For example, objects of the numeric modes can be added, multiplied, divided, etc. R provides several mathematical (arithmetic) operators including: -| Operator | Description | -| ---------- | ------------------------------------------------------------ | -| \+ | addition | -| \- | subtraction | -| \* | multiplication | -| / | division | -| \^ or \*\* | exponentiation | -| a %/% b | integer division (division where the remainder is discarded) | -| a %% b | modulus (returns the remainder after division) | +| Operator | Description | +| ----------- | ------------------------------------------------------------ | +| `+` | Addition | +| `-` | Subtraction | +| `*` | Multiplication | +| `/` | Division | +| `^` or `**` | Exponentiation | +| `%/%` | Integer division (division where the remainder is discarded) | +| `%%` | Modulus (returns the remainder after division) | These can be used with literal numbers: !!! r-project "r" + ```r (1 + (5 ** 0.5))/2 ``` @@ -367,14 +356,19 @@ and importantly, can be used on any object that evaluates to human_chr_number * 2 ``` -??? question "Exercise: Compute the golden ratio" +!!! question "Exercise: Compute the golden ratio" - One approximation of the golden ratio (φ) can be found by taking the sum of 1 - and the square root of 5, and dividing by 2 as in the example above. Compute - the golden ratio to 3 digits of precision using the `sqrt()` and `round()` + One approximation of the golden ratio ($\phi$) can be found by taking the sum of + 1 and the square root of 5, and dividing by 2 as in the example above. + + $$ + \phi = \frac{1 + \sqrt{5}}{2} + $$ + + Compute the golden ratio to 3 digits of precision using the `sqrt()` and `round()` functions. - Hint: Remember the `round()` function can take 2 arguments. + Hint: Remember the `round()` function can take **two** arguments. ??? success "Solution" @@ -387,7 +381,7 @@ and importantly, can be used on any object that evaluates to ## Vectors -Vectors are probably the most used commonly used object type in R. **A vector is a collection of values that are all of the same type (numbers, characters, etc.)**. One of the most common ways to create a vector is to use the `c()` function - the "concatenate" or "combine" function. Inside the function you may enter one or more values; for multiple values, separate each value with a comma: +Vectors are probably the most commonly used object type in R. **A vector is a collection of values that are all of the same type (numbers, characters, etc.)**. One of the most common ways to create a vector is to use the `c()` function — the "concatenate" or "combine" function. Inside the function you may enter one or more values; for multiple values, separate each value with a comma: !!! r-project "r" @@ -410,7 +404,7 @@ function that gives both of these pieces of information is the `str()` str(snp_genes) ``` -Vectors are quite important in R. Another data type that we will work +Vectors are very important in R. Another data type that we will work with later in this lesson, data frames, are collections of vectors. What we learn here about vectors will pay off even more when we start working with data frames. @@ -431,10 +425,10 @@ Let's create a few more vectors to play around with: snp_positions <- c(8762685, 66560624, 67545785, 154039662) ``` -Once we have vectors, one thing we may want to do is specifically retrieve one +Once we have vectors, one thing we may want to do is retrieve one or more values from our vector. To do so, we use **bracket notation**. We type the name of the vector followed by square brackets. In those square brackets -we place the index (e.g. a number) in that bracket as follows: +we place the index (i.e., a number that corresponds to the position of the value in the vector) in that bracket as follows: !!! r-project "r" @@ -443,21 +437,21 @@ we place the index (e.g. a number) in that bracket as follows: snps[3] ``` -In R, every item your vector is indexed, starting from the first item +In R, every item in your vector is indexed, starting from the first item (1) through to the final number of items in your vector. You can also retrieve a range of numbers: !!! r-project "r" ```r - # Get the 1st through 3rd value in the 'snps' vector + # Get the 1st to 3rd value in the 'snps' vector snps[1:3] ``` If you want to retrieve several (but not necessarily sequential) items from -a vector, you pass a **vector of indices**; a vector that has the numbered -positions you wish to retrieve. +a vector, you pass a **vector of indices** (i.e., a vector that has the numbered +positions of the elements you wish to retrieve). !!! r-project "r" @@ -470,7 +464,7 @@ positions you wish to retrieve. There are additional (and perhaps less commonly used) ways of subsetting a vector (see [these examples](https://thomasleeper.com/Rcourse/Tutorials/vectorindexing.html)). -Also, several of these subsetting expressions can be combined: +Several of these subsetting expressions can be combined: !!! r-project "r" @@ -504,11 +498,11 @@ We can verify that "snp_genes" contains the new gene entry snp_genes ``` -??? success "Output" + ??? success "Output" - ``` - [1] "OXTR" "ACTN3" "AR" "OPRM1" "CYP1A1" "APOA5" - ``` + ``` + [1] "OXTR" "ACTN3" "AR" "OPRM1" "CYP1A1" "APOA5" + ``` Using a negative index will return a version of a vector with that index's value removed: @@ -519,11 +513,11 @@ index's value removed: snp_genes[-6] ``` -??? success "Output" + ??? success "Output" - ``` - [1] "OXTR" "ACTN3" "AR" "OPRM1" "CYP1A1" - ``` + ``` + [1] "OXTR" "ACTN3" "AR" "OPRM1" "CYP1A1" + ``` We can remove that value from our vector by overwriting it with this expression: @@ -536,11 +530,11 @@ expression: snp_genes ``` -??? success "Output" + ??? success "Output" - ``` - [1] "OXTR" "ACTN3" "AR" "OPRM1" "CYP1A1" - ``` + ``` + [1] "OXTR" "ACTN3" "AR" "OPRM1" "CYP1A1" + ``` We can also explicitly rename or add a value to our index using bracket notation: @@ -553,11 +547,15 @@ bracket notation: snp_genes ``` -??? success "Output" + ??? success "Output" - ``` - [1] "OXTR" "ACTN3" "AR" "OPRM1" "CYP1A1" NA "APOA5" - ``` + ``` + [1] "OXTR" "ACTN3" "AR" "OPRM1" "CYP1A1" NA "APOA5" + ``` + +!!! tip "Filling in the blanks" + + Notice in the example above that there is an `NA` in the 6th position of the vector? Recall that R keeps track of every element in vectors by numeric indexing. This means that when we added a 7th element where there is no 6th element, R fills it with `NA` which functionally means "missing value" or "Not Available". !!! question "Exercise: Examining and subsetting vectors" @@ -577,19 +575,19 @@ bracket notation: A) False - Vectors have both of these properties
B) True
C) True
- D) False - Vectors have only one mode (e.g. numeric, character); all items in - a vector must be of this mode.
+ D) False - Vectors have only one mode (e.g. numeric, character); + all items in a vector must be of this mode.
E) True
F) True -### Logical Subsetting +### Logical subsetting There is one last set of cool subsetting capabilities we want to introduce. It is possible within R to retrieve items in a vector based on a logical evaluation or numerical comparison. For example, let's say we wanted get all of the SNPs in our vector of SNP positions that were -greater than 100,000,000. We could index using the '\>' (greater than) +greater than 100,000,000. We could index using the `>` (greater than) logical operator: !!! r-project "r" @@ -598,51 +596,52 @@ logical operator: snp_positions[snp_positions > 100000000] ``` -??? success "Output" + ??? success "Output" - ``` - [1] 154039662 - ``` + ``` + [1] 154039662 + ``` In the square brackets you place the name of the vector followed by the -comparison operator and (in this case) a numeric value. Some of the most -common logical operators you will use in R are: +comparison operator and (in this case) a numeric value. Some common logical operators in R are: | Operator | Description | | -------- | ------------------------ | -| < | Less than | -| <= | Less than or equal to | -| \> | Greater than | -| \>= | Greater than or equal to | -| == | Exactly equal to | -| != | Not equal to | -| !x | Not x | -| a \| b | a or b | -| a & b | a and b | +| `<` | Less than | +| `<=` | Less than or equal to | +| `>` | Greater than | +| `>=` | Greater than or equal to | +| `==` | Exactly equal to | +| `!=` | Not equal to | +| `!x` | Not x | +| `a | b` | a or b | +| `a & b` | a and b | + +**:magic_wand: The magic of programming** + +The reason why the expression `snp_positions[snp_positions > 100000000]` works +can be better understood if you examine what the expression +`snp_positions > 100000000` evaluates to: -!!! info "The magic of programming" - - The reason why the expression `snp_positions[snp_positions > 100000000]` works can be better understood if you examine what the expression `snp_positions > 100000000` evaluates to: +!!! r-project "r" - !!! r-project "r" + ```r + snp_positions > 100000000 + ``` - ```r - snp_positions > 100000000 - ``` - ??? success "Output" ``` [1] FALSE FALSE FALSE TRUE ``` - - The output above is a logical vector, the 4th element of which is TRUE. When you pass a logical vector as an index, R will return the true values: - !!! r-project "r" +The output above is a logical vector where the 4th element is `TRUE`. When you pass a logical vector as an index, R will return values that are or evaluate to `TRUE`: - ```r - snp_positions[c(FALSE, FALSE, FALSE, TRUE)] - ``` +!!! r-project "r" + + ```r + snp_positions[c(FALSE, FALSE, FALSE, TRUE)] + ``` ??? success "Output" @@ -650,32 +649,42 @@ common logical operators you will use in R are: [1] 154039662 ``` - If you have never coded before, this type of situation starts to expose the “magic” of programming. We mentioned before that in the bracket notation you take your named vector followed by brackets which contain an index: **named_vector[index]**. The “magic” is that the index needs to evaluate to a number. So, even if it does not appear to be an integer (e.g. 1, 2, 3), as long as R can *evaluate* it, we will get a result. That our expression `snp_positions[snp_positions > 100000000]` evaluates to a number can be seen in the following situation. If you wanted to know which **index** (1, 2, 3, or 4) in our vector of SNP positions was the one that was greater than 100,000,000? - - We can use the `which()` function to return the indices of any item that evaluates as TRUE in our comparison: - - !!! r-project "r" +If you have never coded before, this type of situation starts to expose the +“magic” of programming. We mentioned before that in the bracket notation you +take your named vector followed by brackets which contain an index: +**`named_vector[index]`**. The “magic” is that the index needs to evaluate to a +number. So, even if it does not appear to be an integer (e.g. 1, 2, 3), +as long as R can *evaluate* it, we will get a result. That our expression +`snp_positions[snp_positions > 100000000]` evaluates to a number can be seen in +the following situation. If you wanted to know which **index** (1, 2, 3, or 4) +in our vector of SNP positions was the one that was greater than 100,000,000? +We can use the `which()` function to return the indices of any item that +evaluates as `TRUE` in our comparison: - ```r - which(snp_positions > 100000000) - ``` +!!! r-project "r" + ```r + which(snp_positions > 100000000) + ``` + ??? success "Output" ``` [1] 4 ``` - - **Why this is important** - Often in programming we will not know what inputs and values will be used when our code is executed. Rather than put in a pre-determined value (e.g 100000000) we can use an object that can take on whatever value we need. So for example: +**Why this is important** - !!! r-project "r" +Often in programming we will not know what inputs and values will be used when +our code is executed. Rather than put in a pre-determined value (e.g 100000000) +we can use an object that can take on whatever value we need. So for example: - ```r - snp_marker_cutoff <- 100000000 - snp_positions[snp_positions > snp_marker_cutoff] - ``` +!!! r-project "r" + + ```r + snp_marker_cutoff <- 100000000 + snp_positions[snp_positions > snp_marker_cutoff] + ``` ??? success "Output" @@ -683,7 +692,7 @@ common logical operators you will use in R are: [1] 154039662 ``` - Ultimately, it’s putting together flexible, reusable code like this that gets at the “magic” of programming! +Ultimately, it’s putting together flexible, reusable code like this that gets at the “magic” of programming! ### A few final vector tricks @@ -691,9 +700,9 @@ common logical operators you will use in R are: Finally, there are a few other common retrieve or replace operations you may want to know about. First, you can check to see if any of the values of your vector are missing (i.e. are `NA`, that stands for -`not avaliable`). Missing data will get a more detailed treatment later, -but the `is.NA()` function will return a logical vector, with TRUE for -any NA value: +"Not Available"). Missing data will get a more detailed treatment later, +but the `is.na()` function will return a logical vector, with `TRUE` for +any `NA` value: !!! r-project "r" @@ -704,13 +713,15 @@ any NA value: is.na(snp_genes) ``` -??? success "Output" + ??? success "Output" - ``` - [1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE - ``` + ``` + [1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE + ``` -Sometimes, you may wish to find out if a specific value (or several values) is present a vector. You can do this using the comparison operator `%in%`, which will return TRUE for any value in your collection that is in the vector you are searching: +Sometimes, you may wish to find out if a specific value (or several values) is +present in a vector. You can do this using the comparison operator `%in%` +which will return `TRUE` for any value in the vector you are searching: !!! r-project "r" @@ -718,36 +729,44 @@ Sometimes, you may wish to find out if a specific value (or several values) is p # Current value of 'snp_genes' inspected using 'str()': # chr [1:7] "OXTR" "ACTN3" "AR" "OPRM1" "CYP1A1" NA "APOA5" - # Test to see if "ACTN3" or "APO5A" is in the snp_genes vector. - # If you are looking for more than one value, you must pass this as a vector. + # Test to see if "ACTN3" or "APO5A" is in the snp_genes vector + # If you are looking for more than one value, you must pass this as a vector c("ACTN3","APOA5") %in% snp_genes ``` -??? success "Output" + ??? success "Output" - ``` - [1] TRUE TRUE - ``` + ``` + [1] TRUE TRUE + ``` ## Lists -Lists are another form of data structure in R. Although it resembles vectors, two key properties differentiates them and make lists extremely useful for more complex data storage: +Lists are another form of data structure in R. Although it resembles vectors, +two key properties differentiates them and make lists extremely useful for more +complex data storage: - Lists can contain **data of multiple modes**
Remember that vectors can only contain data of one mode. - Lists can be **nested**
- This means that a list can contain other lists. Vectors, on the other hand, are flat (i.e. all values of a vector is laid out on a single level) + This means that a list can contain other lists. Vectors, on the other hand, + are flat (i.e., all values of a vector is laid out on a single level) Due to their flexibility, many complex analyses store data/return results using lists. !!! info "Attribution" - Some of the content here was adapted from an excellent [tutorial](https://r4ds.had.co.nz/vectors.html#lists), which we highly recommend you read. + + Some of the content here was adapted from an excellent + [tutorial](https://r4ds.had.co.nz/vectors.html#lists) which we highly + recommend you read. + ### Creating lists -Let's begin by creating a list using existing vectors created [prior](#creating-and-subsetting-vectors). +Let's begin by creating a list using existing vectors created +[prior](#creating-and-subsetting-vectors). !!! r-project "r" @@ -765,7 +784,9 @@ Let's begin by creating a list using existing vectors created [prior](#creating- ``` !!! info "Copy-pasting code" - Make sure to highlight, copy, paste, then run all 4 lines of the "Create a list" code block. + + Make sure to highlight, copy, paste, then run all lines of the + "Create a list" code block. We can see how R prints a list by calling it. @@ -775,43 +796,48 @@ We can see how R prints a list by calling it. snps_list ``` -??? success "Output" + ??? success "Output" - ``` - $genes - [1] "OXTR" "ACTN3" "AR" "OPRM1" + ``` + $genes + [1] "OXTR" "ACTN3" "AR" "OPRM1" - $reference_snps - [1] "rs53576" "rs1815739" "rs6152" "rs1799971" + $reference_snps + [1] "rs53576" "rs1815739" "rs6152" "rs1799971" - $chromosomes - [1] "3" "11" "X" "6" + $chromosomes + [1] "3" "11" "X" "6" - $positions - [1] 8762685 66560624 67545785 154039662 + $positions + [1] 8762685 66560624 67545785 154039662 - ``` + ``` -Here, we have a *named list* that contains the vectors `genes`, `reference_snps`, `chromosomes`, and `positions`. We can also inspect the modes of data stored in the list using `str()`. +Here, we have a *named list* that contains the vectors `genes`, +`reference_snps`, `chromosomes`, and `positions`. We can also inspect the modes +of data stored in the list using `str()`. !!! r-project "r" + ```r str(snps_list) ``` -??? success "Output" + ??? success "Output" - ``` - List of 4 - $ genes : chr [1:4] "OXTR" "ACTN3" "AR" "OPRM1" - $ reference_snps: chr [1:4] "rs53576" "rs1815739" "rs6152" "rs1799971" - $ chromosomes : chr [1:4] "3" "11" "X" "6" - $ positions : num [1:4] 8.76e+06 6.66e+07 6.75e+07 1.54e+08 - ``` + ``` + List of 4 + $ genes : chr [1:4] "OXTR" "ACTN3" "AR" "OPRM1" + $ reference_snps: chr [1:4] "rs53576" "rs1815739" "rs6152" "rs1799971" + $ chromosomes : chr [1:4] "3" "11" "X" "6" + $ positions : num [1:4] 8.76e+06 6.66e+07 6.75e+07 1.54e+08 + ``` !!! info "Unnamed items in a list" - When we created our list, we assigned a name to each item within the list. However, this is not strictly necessary. Let's try it out and see what happens: + When we created our list, we assigned a name to each item within the list. + However, this is not strictly necessary. Let's try it out and see what + happens: !!! r-project "r" @@ -828,28 +854,32 @@ Here, we have a *named list* that contains the vectors `genes`, `reference_snps` unnamed_snps_list ``` - ??? success "Output" + ??? success "Output" - ``` - [[1]] - [1] "OXTR" "ACTN3" "AR" "OPRM1" - - [[2]] - [1] "rs53576" "rs1815739" "rs6152" "rs1799971" - - [[3]] - [1] "3" "11" "X" "6" - - [[4]] - [1] 8762685 66560624 67545785 154039662 - ``` + ``` + [[1]] + [1] "OXTR" "ACTN3" "AR" "OPRM1" + + [[2]] + [1] "rs53576" "rs1815739" "rs6152" "rs1799971" - Notice that the items are not named and simply given an numeric index within double square brackets `[[1]]`. In fact, you can create a list with a mixture of named and unnamed items. While this is still a valid way to create lists, it is not very user-friendly as we have no way of telling what information is stored in each of the items. Ideally, you should assign names to each item in the list when creating it to easily identify the kinds of information stored within it. + [[3]] + [1] "3" "11" "X" "6" + [[4]] + [1] 8762685 66560624 67545785 154039662 + ``` + + Notice that the items are not named and are simply given a numeric index within double square brackets `[[1]]`. In fact, you can create a list with a mixture of named and unnamed items. While this is still a valid way to create lists, it is not very user-friendly as we have no way of telling what information is stored in each of the items. Ideally, you should assign names to each item in the list when creating it to easily identify the kinds of information stored within it. !!! bell "Importance of correct symbols for adding items to lists" - Notice that we have used the assignment operator `<-` to create `snps_list`, and the equal symbol `=` to add vectors to each item in the list. When creating a vector, you can use either symbol to achieve the same goal. However, when creating lists, the distinction between symbols is ***NOT*** a stylistic choice and will affect the output. We can see what happens if we were to change the `=` operator to `<-` within the brackets. + Notice that we have used the assignment operator `<-` to create + `snps_list` and the equal symbol `=` to add vectors to each item in the + list. When creating a vector, you can use either symbol to achieve the same + goal. However, when creating lists, the distinction between symbols is + ***NOT*** a stylistic choice and will affect the output. We can see what + happens if we were to change the `=` operator to `<-` within the brackets. !!! r-project "r" @@ -862,36 +892,44 @@ Here, we have a *named list* that contains the vectors `genes`, `reference_snps` ) # Call the list - snps_list_2 + wrong_operator ``` - ??? success "Output" - - ``` - [[1]] - [1] "OXTR" "ACTN3" "AR" "OPRM1" - - $reference_snps - [1] "rs53576" "rs1815739" "rs6152" "rs1799971" - - $chromosomes - [1] "3" "11" "X" "6" - - $positions - [1] 8762685 66560624 67545785 154039662 - ``` + ??? success "Output" + + ``` + [[1]] + [1] "OXTR" "ACTN3" "AR" "OPRM1" + + $reference_snps + [1] "rs53576" "rs1815739" "rs6152" "rs1799971" + + $chromosomes + [1] "3" "11" "X" "6" + + $positions + [1] 8762685 66560624 67545785 154039662 + ``` We can observe two things: - - The name for the first list item `genes` is now simply a numeric index `[[1]]`. - - There is an additional object in the global environment called `genes`. + - The name for the first list item `genes` is now simply + a numeric index `[[1]]` + - There is an additional object in the global environment called `genes` - Switching the assignment operator adds an unnamed item in the list, and creates a vector object called `genes` in the global environment. Make certain to use the correct operator when creating lists to avoid unintentionally populating your working environment. + Switching the assignment operator adds an unnamed item in the list, + and creates a vector object called `genes` in the global environment. + Make certain to use the correct operator when creating lists to avoid + unintentionally populating your working environment. ### Refer to items in a list -In the examples above, we refer to values in a vector using their indices (e.g. `vector[c(2, 3)]`). For the named list we created, we can refer to the name of the items themselves. Commonly, this is done using the `$` like so: `list_name$object_name`. For example, we can retrieve gene positions from `snps_list`: +In the examples above, we refer to values in a vector using their indices +(e.g. `vector[c(2, 3)]`). For the named list we created, we can refer to the +name of the items themselves. Commonly, this is done using the `$` like so: +`list_name$object_name`. For example, we can retrieve gene positions from +`snps_list`: !!! r-project "r" @@ -899,13 +937,17 @@ In the examples above, we refer to values in a vector using their indices (e.g. snps_list$positions ``` -??? success "Output" + ??? success "Output" - ``` - [1] 8762685 66560624 67545785 154039662 - ``` + ``` + [1] 8762685 66560624 67545785 154039662 + ``` -There are also ways to refer to objects in a list using square brackets `[]` (like for vectors). However, this method is more complex. [This section](https://r4ds.had.co.nz/vectors.html#lists-of-condiments) of the linked tutorial provides a good analogy on how the brackets are interpreted in R in the context of lists. +There are also ways to refer to objects in a list using square brackets `[]` +(like for vectors). However, this method is more complex. +[This section](https://r4ds.had.co.nz/vectors.html#lists-of-condiments) of the +linked tutorial provides a good analogy on how the brackets are interpreted in +R in the context of lists. ### Add or replace items in a list @@ -925,20 +967,23 @@ We can also add items to lists. Let's add a logical vector to the list. str(snps_list) ``` -??? success "Output" + ??? success "Output" - ``` - List of 5 - $ genes : chr [1:4] "OXTR" "ACTN3" "AR" "OPRM1" - $ reference_snps: chr [1:4] "rs53576" "rs1815739" "rs6152" "rs1799971" - $ chromosomes : chr [1:4] "3" "11" "X" "6" - $ positions : num [1:4] 8.76e+06 6.66e+07 6.75e+07 1.54e+08 - $ DE_genes : logi [1:4] TRUE FALSE FALSE TRUE - ``` + ``` + List of 5 + $ genes : chr [1:4] "OXTR" "ACTN3" "AR" "OPRM1" + $ reference_snps: chr [1:4] "rs53576" "rs1815739" "rs6152" "rs1799971" + $ chromosomes : chr [1:4] "3" "11" "X" "6" + $ positions : num [1:4] 8.76e+06 6.66e+07 6.75e+07 1.54e+08 + $ DE_genes : logi [1:4] TRUE FALSE FALSE TRUE + ``` -Now, `snps_list` has 5 items, the last of which is a logical vector that we just added. +Now, `snps_list` has 5 items, the last of which is a logical vector that we +just added. -To replace items in a list, we refer to the item, then assign it something else. Let's say we want the `DE_genes` to be in numeric, instead of logical mode. +To replace items in a list, we refer to the item, then assign it something +else. Let's say we want the `DE_genes` to be in numeric, instead of logical +mode. !!! r-project "r" @@ -950,20 +995,24 @@ To replace items in a list, we refer to the item, then assign it something else. str(snps_list) ``` -??? success "Output" + ??? success "Output" - ``` - List of 5 - $ genes : chr [1:4] "OXTR" "ACTN3" "AR" "OPRM1" - $ reference_snps: chr [1:4] "rs53576" "rs1815739" "rs6152" "rs1799971" - $ chromosomes : chr [1:4] "3" "11" "X" "6" - $ positions : num [1:4] 8.76e+06 6.66e+07 6.75e+07 1.54e+08 - $ DE_genes : num [1:4] 1 0 0 1 - ``` + ``` + List of 5 + $ genes : chr [1:4] "OXTR" "ACTN3" "AR" "OPRM1" + $ reference_snps: chr [1:4] "rs53576" "rs1815739" "rs6152" "rs1799971" + $ chromosomes : chr [1:4] "3" "11" "X" "6" + $ positions : num [1:4] 8.76e+06 6.66e+07 6.75e+07 1.54e+08 + $ DE_genes : num [1:4] 1 0 0 1 + ``` -Notice that `DE_genes` has turned into numeric mode from it's original logical mode, where `TRUE` is `1` and `FALSE` is `0`. +Notice that `DE_genes` has turned into numeric mode from its original logical +mode, where `TRUE` is `1` and `FALSE` is `0`. -Up till now, we have made a list that contain items with a vector length of 4. There is no requirement that all items in lists to must have the same length. Lists can have any number of items of different lengths. For example, let's add a single number into the list: +Up till now, we have made a list that contain items with a vector length of 4. +There is no requirement that all items in lists to must have the same length. +Lists can have any number of items of different lengths. For example, let's add +a single number into the list: !!! r-project "r" @@ -975,22 +1024,24 @@ Up till now, we have made a list that contain items with a vector length of 4. T str(snps_list) ``` -??? success "Output" + ??? success "Output" - ``` - List of 6 - $ genes : chr [1:4] "OXTR" "ACTN3" "AR" "OPRM1" - $ reference_snps: chr [1:4] "rs53576" "rs1815739" "rs6152" "rs1799971" - $ chromosomes : chr [1:4] "3" "11" "X" "6" - $ positions : num [1:4] 8.76e+06 6.66e+07 6.75e+07 1.54e+08 - $ DE_genes : num [1:4] 1 0 0 1 - $ sig_threshold : num 0.05 - ``` + ``` + List of 6 + $ genes : chr [1:4] "OXTR" "ACTN3" "AR" "OPRM1" + $ reference_snps: chr [1:4] "rs53576" "rs1815739" "rs6152" "rs1799971" + $ chromosomes : chr [1:4] "3" "11" "X" "6" + $ positions : num [1:4] 8.76e+06 6.66e+07 6.75e+07 1.54e+08 + $ DE_genes : num [1:4] 1 0 0 1 + $ sig_threshold : num 0.05 + ``` ### Subset or remove objects from a list -Single square brackets `[]` can be used to subset a list based on the item's index or name. The following code will subset the items `genes`, `reference_snps`, and `sig_threshold` from `snps_list`. +Single square brackets `[]` can be used to subset a list based on the item's +index or name. The following code will subset the items `genes`, +`reference_snps`, and `sig_threshold` from `snps_list`. !!! r-project "r" @@ -1003,13 +1054,14 @@ Single square brackets `[]` can be used to subset a list based on the item's ind str(sub_snps_list) ``` -??? success "Output" + ??? success "Output" - ``` - List of 2 - $ genes : chr [1:4] "OXTR" "ACTN3" "AR" "OPRM1" - $ reference_snps: chr [1:4] "rs53576" "rs1815739" "rs6152" "rs1799971" - ``` + ``` + List of 3 + $ genes : chr [1:4] "OXTR" "ACTN3" "AR" "OPRM1" + $ reference_snps: chr [1:4] "rs53576" "rs1815739" "rs6152" "rs1799971" + $ sig_threshold : num 0.05 + ``` !!! r-project "r" @@ -1021,17 +1073,39 @@ Single square brackets `[]` can be used to subset a list based on the item's ind str(sub_snps_list) ``` -??? success "Output" + ??? success "Output" + + ``` + List of 3 + $ genes : chr [1:4] "OXTR" "ACTN3" "AR" "OPRM1" + $ reference_snps: chr [1:4] "rs53576" "rs1815739" "rs6152" "rs1799971" + $ sig_threshold : num 0.05 + ``` + +!!! tip "List bracket notation" - ``` - List of 2 - $ genes : chr [1:4] "OXTR" "ACTN3" "AR" "OPRM1" - $ reference_snps: chr [1:4] "rs53576" "rs1815739" "rs6152" "rs1799971" - ``` + When subsetting a list using single square brackets `[]`, the result is + always a list. To access the vector directly, we need to use double square + brackets `[[]]`. For example: -!!! bell "Reminder" + !!! r-project "r" + + ```r + # Return the vector in a list + snps_list["chromosomes"] - Be mindful that when subsetting a list using single square brackets `[]`, the result is always a list! + # Return the vector directly + snps_list[["chromosomes"]] + ``` + + ??? success "Output" + + ``` + $chromosomes + [1] "3" "11" "X" "6" + + [1] "3" "11" "X" "6" + ``` To remove objects in a list, we first refer to it, then assign it as `NULL`. @@ -1045,20 +1119,21 @@ To remove objects in a list, we first refer to it, then assign it as `NULL`. str(sub_snps_list) ``` -??? success "Output" + ??? success "Output" - ``` - List of 2 - $ genes : chr [1:4] "OXTR" "ACTN3" "AR" "OPRM1" - $ reference_snps: chr [1:4] "rs53576" "rs1815739" "rs6152" "rs1799971" - ``` + ``` + List of 2 + $ genes : chr [1:4] "OXTR" "ACTN3" "AR" "OPRM1" + $ reference_snps: chr [1:4] "rs53576" "rs1815739" "rs6152" "rs1799971" + ``` -Et voilà! The numeric vector `sig_threshold` is no longer a part of `sub_snps_list`. +Et voilà! The numeric vector `sig_threshold` is no longer a part of +`sub_snps_list`. ### Nested list -A defining feature of lists is that it can be nested (i.e. lists can be stored inside of a list). This property allows the storage of data that are structured hierarchically or in a tree-like fashion (i.e. dendrograms). It is also useful for storing different types of data in an organised manner (e.g. a list of model values with a nested list that contains sample information or metadata). You will undoubtedly encounter complex nested list structures in your R journey. The majority of the time, we are only concerned with referring to or subsetting nested lists. With that in mind, let's start with creating a nested list followed by an example on subsetting. +A defining feature of lists is that it can be nested (i.e., lists can be stored inside of a list). This property allows the storage of data that are structured hierarchically or in a tree-like fashion (i.e., dendrograms). It is also useful for storing different types of data in an organised manner (e.g., a list of model values with a nested list that contains sample information or metadata). You will undoubtedly encounter complex nested list structures in your R journey. The majority of the time, we are only concerned with referring to or subsetting nested lists. With that in mind, let's start with creating a nested list followed by an example on subsetting. !!! r-project "r" @@ -1072,7 +1147,7 @@ A defining feature of lists is that it can be nested (i.e. lists can be stored i position = snp_positions ), result = list( - DE = DE_genes, + DE = snps_DE, threshold = 0.05 ), metadata = list( @@ -1086,6 +1161,7 @@ A defining feature of lists is that it can be nested (i.e. lists can be stored i ``` !!! bell "Reminder" + Again, make sure to highlight, copy, paste, and run the entire code block! Inspect the list structure: @@ -1096,35 +1172,40 @@ Inspect the list structure: str(nested_snps_list) ``` -??? success "Output" + ??? success "Output" - ``` - List of 4 - $ gene : chr [1:4] "OXTR" "ACTN3" "AR" "OPRM1" - $ snp_info:List of 3 - ..$ reference : chr [1:4] "rs53576" "rs1815739" "rs6152" "rs1799971" - ..$ chromosome: chr [1:4] "3" "11" "X" "6" - ..$ position : num [1:4] 8.76e+06 6.66e+07 6.75e+07 1.54e+08 - $ result :List of 2 - ..$ DE : logi [1:4] TRUE FALSE FALSE TRUE - ..$ threshold: num 0.05 - $ metadata:List of 1 - ..$ sample_group:List of 3 - .. ..$ A: chr [1:2] "S1" "S2" - .. ..$ B: chr "S3" - .. ..$ C: chr "S4" - ``` + ``` + List of 4 + $ gene : chr [1:4] "OXTR" "ACTN3" "AR" "OPRM1" + $ snp_info:List of 3 + ..$ reference : chr [1:4] "rs53576" "rs1815739" "rs6152" "rs1799971" + ..$ chromosome: chr [1:4] "3" "11" "X" "6" + ..$ position : num [1:4] 8.76e+06 6.66e+07 6.75e+07 1.54e+08 + $ result :List of 2 + ..$ DE : logi [1:4] TRUE FALSE FALSE TRUE + ..$ threshold: num 0.05 + $ metadata:List of 1 + ..$ sample_group:List of 3 + .. ..$ A: chr [1:2] "S1" "S2" + .. ..$ B: chr "S3" + .. ..$ C: chr "S4" + ``` Phew! There's quite a bit of information here! Let's break it down. -- We have created a list with 4 objects, each of which contain different objects and modes (list, character, numeric, and logical). -- `snp_info` and `result` are nested lists with structures we have encountered before. +- We have created a list with 4 objects, each of which contain different + objects and modes: list, character, numeric, and logical. +- `snp_info` and `result` are nested lists with structures we have encountered + before. - They each have vectors of varying modes. - `result` has vectors of varying lengths. - The nested list `metadata` is the *parent* of `sample_group`. - - `sample_group` has *children* `A`, `B`, and `C`, all of which are character vectors. + - `sample_group` has *children* `A`, `B`, and `C`, all of which are + character vectors. -Now we can subset/retrieve some of the information in the nested list. As we are working with named lists (and sub-lists), it is easier to use the `$` method. Let's subset the `chromosome` vector +Now we can subset/retrieve some of the information in the nested list. As we +are working with named lists (and sub-lists), it is easier to use the `$` +method. Let's subset the `chromosome` vector !!! r-project "r" @@ -1132,11 +1213,11 @@ Now we can subset/retrieve some of the information in the nested list. As we are nested_snps_list$snp_info$chromosome ``` -??? success "Output" + ??? success "Output" - ``` - [1] "3" "11" "X" "6" - ``` + ``` + [1] "3" "11" "X" "6" + ``` If you want to use the index and square brackets method: @@ -1146,15 +1227,17 @@ If you want to use the index and square brackets method: nested_snps_list[[2]][[2]] ``` -??? success "Output" + ??? success "Output" - ``` - [1] "3" "11" "X" "6" - ``` + ``` + [1] "3" "11" "X" "6" + ``` -You can read the above code as "the second item (here, a character vector) of the second item (a list)" +You can read the above code as "the second item (here, a character vector) of +the second item (a list)" -You can retrieve objects from deeply nested lists as long as you know the object's name or index, and how it is structured (parent-child hierarchy). +You can retrieve objects from deeply nested lists as long as you know the +object's name or index, and how it is structured (parent-child hierarchy). ## Review exercises @@ -1186,8 +1269,13 @@ You can retrieve objects from deeply nested lists as long as you know the object ??? success "Solution" ```r + # Answer (a) snps <- c(snps, "rs662799") + + # Answer (b) snp_chromosomes <- c(snp_chromosomes, "11") # Did you use quotes? + + # Answer (c) snp_positions <- c(snp_positions, 116792991) ``` @@ -1197,8 +1285,13 @@ You can retrieve objects from deeply nested lists as long as you know the object Hint: Your vector should look like this in ‘Environment’:
`chr [1:7] "OXTR" "ACTN3" "AR" "OPRM1" "CYP1A1" NA "APOA5"`.
- If not recreate the vector by running this expression:
- `snp_genes <- c("OXTR", "ACTN3", "AR", "OPRM1", "CYP1A1", NA, "APOA5")` + If not recreate the vector by running this:
+ + !!! r-project "r" + + ```r + snp_genes <- c("OXTR", "ACTN3", "AR", "OPRM1", "CYP1A1", NA, "APOA5") + ``` a. Create a new version of `snp_genes` that does not contain "CYP1A1" and then
b. Add 2 NA values to the end of `snp_genes` @@ -1206,13 +1299,18 @@ You can retrieve objects from deeply nested lists as long as you know the object ??? success "Solution" ```r + # Answer (a) snp_genes <- snp_genes[-5] + # or use logical subsetting + snp_genes <- snp_genes[snp_genes != "CYP1A1"] + + # Answer (c) snp_genes <- c(snp_genes, NA, NA) ``` !!! question "Exercise 4" - Using indexing, create a new vector named combined that contains: + Using indexing, create a new vector named `combined` that contains: - The the 1st value in `snp_genes` - The 1st value in `snps` @@ -1222,12 +1320,17 @@ You can retrieve objects from deeply nested lists as long as you know the object ??? success "Solution" ```r - combined <- c(snp_genes[1], snps[1], snp_chromosomes[1], snp_positions[1]) + combined <- c( + snp_genes[1], + snps[1], + snp_chromosomes[1], + snp_positions[1] + ) ``` !!! question "Exercise 5" - What type of data is combined? + What type of data is `combined`? ??? success "Solution" diff --git a/docs/02-data-prelude.md b/docs/02-data-prelude.md index fe10fcc..3fbe569 100644 --- a/docs/02-data-prelude.md +++ b/docs/02-data-prelude.md @@ -1,8 +1,8 @@ # Introduction to the example dataset and file type -!!! info +!!! info "Learning outcomes" - === "Keypoints" + === "Key points" - The dataset comes from a real world experiment in *E. coli*. - Publicly available FASTQ files can be downloaded from NCBI SRA. diff --git a/docs/03-basics-factors-dataframes.md b/docs/03-basics-factors-dataframes.md index 933389d..9f6e103 100644 --- a/docs/03-basics-factors-dataframes.md +++ b/docs/03-basics-factors-dataframes.md @@ -1,9 +1,9 @@ # R Basics continued - factors and data frames -!!! info +!!! info "Learning outcomes" - === "Keypoints" + === "Key points" - It is easy to import data into R from tabular formats including Excel. However, you still need to check that R has imported and interpreted @@ -16,7 +16,7 @@ === "Objectives" - - Explain the basic principle of tidy datasets + - Understand the basic principle of tidy datasets - Be able to load a tabular dataset using base R functions - Be able to determine the structure of a data frame including its dimensions and the datatypes of variables @@ -34,7 +34,7 @@ === "Questions" - - How do I get started with tabular data (e.g. spreadsheets) in R? + - How do I get started with tabular data (e.g., spreadsheets) in R? - What are some best practices for reading data into R? - How do I save tabular data generated in R? @@ -53,7 +53,7 @@ our first set of example data: This is principle number one because if you can't tell which files are the original raw data, you risk making some serious mistakes -(e.g. drawing conclusion from data which have been manipulated in some +(e.g., drawing conclusion from data which have been manipulated in some unknown way). **2) Keep spreadsheet data Tidy** @@ -76,21 +76,21 @@ assumptions about your data - the range of values you expect, how many values should have been recorded, etc. Of course, as the data get larger our human ability to keep track will start to fail (and yes, it can fail for small data sets too). R will help you to examine your data so that -you can have greater confidence in your analysis, and its +you can have greater confidence in your analysis and its reproducibility. !!! tip "Keeping your raw data separate" When you work with data in R, you are not changing the original file - you loaded that data from. This is different than (for example) + you loaded that data from. This is different from (for example) working with a spreadsheet program where changing the value of the cell leaves you one "save"-click away from overwriting the original file. You have to purposely use a writing function - (e.g. `write.csv()`) to save data loaded into R. In that case, be sure + (e.g., `write.csv()`) to save data loaded into R. In that case, be sure to save the manipulated data into a new file. More on this later in the lesson. -## Importing tabular data into R +### Importing tabular data into R There are several ways to import data into R. For our purpose here, we will focus on using the tools every R installation comes with (so called @@ -116,21 +116,21 @@ function called `read.csv()`. delimited by semicolons (;) rather than commas? C) What argument would you have to change to read file in which - numbers used commas for decimal separation (i.e. 1,00)? + numbers used commas for decimal separation (i.e., 1,00)? D) What argument would you have to change to read in only the first 10,000 rows of a very large file? ??? success "Solution" - A) The `read.csv()` function has the argument 'header' set to TRUE - by default, this means the function always assumes the first row - is header information, (i.e. column names) + A) The `read.csv()` function has the argument 'header' set to `TRUE` + by default. This means the function always assumes the first row + is header information, (i.e., column names) B) The `read.csv()` function has the argument 'sep' set to ",". This means the function assumes commas are used as delimiters, - as you would expect. Changing this parameter (e.g. `sep=";"`) - would now interpret semicolons as delimiters. + as you would expect. Changing this parameter (e.g., `sep=";"`) + would tell R to interpret semicolons as delimiters. C) Although it is not listed in the `read.csv()` usage, `read.csv()` is a "version" of the function `read.table()` and @@ -138,20 +138,18 @@ function called `read.csv()`. the decimal operator. We'd probably assume the delimiter is some other character. - D) You can set `nrow` to a numeric value (e.g. `nrow=10000`) to + D) You can set `nrow` to a numeric value (e.g., `nrow=10000`) to choose how many rows of a file you read in. This may be useful for very large files where not all the data is needed to test some data cleaning steps you are applying. Hopefully, this exercise gets you thinking about using the provided help documentation in R. There are many arguments that exist, but - which we wont have time to cover. Look here to get familiar with - functions you use frequently, you may be surprised at what you find - they can do. + which we wont have time to cover. Now, let's read in the file `combined_tidy_vcf.csv` which will be -located in `/home/shared/$USER/R4Genomics/`. Call this data `variants`. The first -argument to pass to our `read.csv()` function is the file path for our +located in `/home/shared/$USER/R4Genomics/`. Call/assign this data `variants`. +The first argument to pass to our `read.csv()` function is the file path for our data. The file path must be in quotes and now is a good time to remember to use tab autocompletion. **If you use tab autocompletion you avoid typos and errors in file paths.** Use it! @@ -171,12 +169,12 @@ name of the object will open a view of the data in a new tab. ![images](figures/rstudio_dataframeview.png){ width="1000" } -## Summarizing, subsetting, and determining the structure of a data frame. +### Summarizing and determining the structure of a data frame. -A **data frame is the standard way in R to store tabular data**. A data -fame could also be thought of as a collection of vectors, all of which +A **data frame is the standard way to store tabular data in R**. A data +frame could also be thought of as a collection of vectors, all of which have the same length. Using only two functions, we can learn a lot about -out data frame including some summary statistics as well as well as the +out data frame including some summary statistics as well as the "structure" of the data frame. Let's examine what each of these functions can tell us: @@ -188,70 +186,69 @@ functions can tell us: summary(variants) ``` -??? success "Output" + ??? success "Output" - ``` - sample_id CHROM POS ID REF - Length:801 Length:801 Min. : 1521 Mode:logical Length:801 - Class :character Class :character 1st Qu.:1115970 NA's:801 Class :character - Mode :character Mode :character Median :2290361 Mode :character - Mean :2243682 - 3rd Qu.:3317082 - Max. :4629225 - - ALT QUAL FILTER INDEL IDV IMF - Length:801 Min. : 4.385 Mode:logical Mode :logical Min. : 2.000 Min. :0.5714 - Class :character 1st Qu.:139.000 NA's:801 FALSE:700 1st Qu.: 7.000 1st Qu.:0.8824 - Mode :character Median :195.000 TRUE :101 Median : 9.000 Median :1.0000 - Mean :172.276 Mean : 9.396 Mean :0.9219 - 3rd Qu.:225.000 3rd Qu.:11.000 3rd Qu.:1.0000 - Max. :228.000 Max. :20.000 Max. :1.0000 - NA's :700 NA's :700 - DP VDB RPB MQB BQB - Min. : 2.00 Min. :0.0005387 Min. :0.0000 Min. :0.0000 Min. :0.1153 - 1st Qu.: 7.00 1st Qu.:0.2180410 1st Qu.:0.3776 1st Qu.:0.1070 1st Qu.:0.6963 - Median :10.00 Median :0.4827410 Median :0.8663 Median :0.2872 Median :0.8615 - Mean :10.57 Mean :0.4926291 Mean :0.6970 Mean :0.5330 Mean :0.7784 - 3rd Qu.:13.00 3rd Qu.:0.7598940 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 - Max. :79.00 Max. :0.9997130 Max. :1.0000 Max. :1.0000 Max. :1.0000 - NA's :773 NA's :773 NA's :773 - MQSB SGB MQ0F ICB HOB AC - Min. :0.01348 Min. :-0.6931 Min. :0.00000 Mode:logical Mode:logical Min. :1 - 1st Qu.:0.95494 1st Qu.:-0.6762 1st Qu.:0.00000 NA's:801 NA's:801 1st Qu.:1 - Median :1.00000 Median :-0.6620 Median :0.00000 Median :1 - Mean :0.96428 Mean :-0.6444 Mean :0.01127 Mean :1 - 3rd Qu.:1.00000 3rd Qu.:-0.6364 3rd Qu.:0.00000 3rd Qu.:1 - Max. :1.01283 Max. :-0.4536 Max. :0.66667 Max. :1 - NA's :48 - AN DP4 MQ Indiv gt_PL gt_GT - Min. :1 Length:801 Min. :10.00 Length:801 Length:801 Min. :1 - 1st Qu.:1 Class :character 1st Qu.:60.00 Class :character Class :character 1st Qu.:1 - Median :1 Mode :character Median :60.00 Mode :character Mode :character Median :1 - Mean :1 Mean :58.19 Mean :1 - 3rd Qu.:1 3rd Qu.:60.00 3rd Qu.:1 - Max. :1 Max. :60.00 Max. :1 - - gt_GT_alleles - Length:801 - Class :character - Mode :character - ``` + ``` + sample_id CHROM POS ID REF + Length:801 Length:801 Min. : 1521 Mode:logical Length:801 + Class :character Class :character 1st Qu.:1115970 NA's:801 Class :character + Mode :character Mode :character Median :2290361 Mode :character + Mean :2243682 + 3rd Qu.:3317082 + Max. :4629225 + + ALT QUAL FILTER INDEL IDV IMF + Length:801 Min. : 4.385 Mode:logical Mode :logical Min. : 2.000 Min. :0.5714 + Class :character 1st Qu.:139.000 NA's:801 FALSE:700 1st Qu.: 7.000 1st Qu.:0.8824 + Mode :character Median :195.000 TRUE :101 Median : 9.000 Median :1.0000 + Mean :172.276 Mean : 9.396 Mean :0.9219 + 3rd Qu.:225.000 3rd Qu.:11.000 3rd Qu.:1.0000 + Max. :228.000 Max. :20.000 Max. :1.0000 + NA's :700 NA's :700 + DP VDB RPB MQB BQB + Min. : 2.00 Min. :0.0005387 Min. :0.0000 Min. :0.0000 Min. :0.1153 + 1st Qu.: 7.00 1st Qu.:0.2180410 1st Qu.:0.3776 1st Qu.:0.1070 1st Qu.:0.6963 + Median :10.00 Median :0.4827410 Median :0.8663 Median :0.2872 Median :0.8615 + Mean :10.57 Mean :0.4926291 Mean :0.6970 Mean :0.5330 Mean :0.7784 + 3rd Qu.:13.00 3rd Qu.:0.7598940 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 + Max. :79.00 Max. :0.9997130 Max. :1.0000 Max. :1.0000 Max. :1.0000 + NA's :773 NA's :773 NA's :773 + MQSB SGB MQ0F ICB HOB AC + Min. :0.01348 Min. :-0.6931 Min. :0.00000 Mode:logical Mode:logical Min. :1 + 1st Qu.:0.95494 1st Qu.:-0.6762 1st Qu.:0.00000 NA's:801 NA's:801 1st Qu.:1 + Median :1.00000 Median :-0.6620 Median :0.00000 Median :1 + Mean :0.96428 Mean :-0.6444 Mean :0.01127 Mean :1 + 3rd Qu.:1.00000 3rd Qu.:-0.6364 3rd Qu.:0.00000 3rd Qu.:1 + Max. :1.01283 Max. :-0.4536 Max. :0.66667 Max. :1 + NA's :48 + AN DP4 MQ Indiv gt_PL gt_GT + Min. :1 Length:801 Min. :10.00 Length:801 Length:801 Min. :1 + 1st Qu.:1 Class :character 1st Qu.:60.00 Class :character Class :character 1st Qu.:1 + Median :1 Mode :character Median :60.00 Mode :character Mode :character Median :1 + Mean :1 Mean :58.19 Mean :1 + 3rd Qu.:1 3rd Qu.:60.00 3rd Qu.:1 + Max. :1 Max. :60.00 Max. :1 + + gt_GT_alleles + Length:801 + Class :character + Mode :character + ``` -Our data frame had 29 variables, so we get 29 fields that summarize the data. +Our data frame has 29 variables, so we get 29 fields that summarize the data. The `QUAL`, `IMF`, and `VDB` variables (and several others) are -numerical data and so you get summary statistics on the min and max values for -these columns, as well as mean, median, and interquartile ranges. Many of the -other variables (e.g. `sample_id`) are treated as characters data (more on this -in a bit). -There is a lot to work with, so we will subset the first three columns into a -new data frame using the `data.frame()` function. +numerical data and so you get summary statistics on the minimum and maximum +values for these columns, as well as mean, median, and 1st and 3rd quantile. +Many of the other variables (e.g., `sample_id`) are treated as character data +(more on this in a bit). There is a lot to work with, so we will subset the +first three columns into a new data frame using the `data.frame()` function. !!! r-project "r" ```r - # Put the first three columns of variants into a new data frame called subset + # Put the first three columns of variants into a new data frame called subset_variants - subset <- data.frame(variants[, c(1:3, 6)]) + subset_variants <- variants[, c(1:3, 6)] ``` Now, let's use the `str()` (structure) function to look a little more @@ -262,249 +259,31 @@ closely at how data frames work: ```r # Get the structure of a data frame - str(subset) + str(subset_variants) ``` -??? success "Output" + ??? success "Output" - ``` - 'data.frame': 801 obs. of 4 variables: - $ sample_id: chr "SRR2584863" "SRR2584863" "SRR2584863" "SRR2584863" ... - $ CHROM : chr "CP000819.1" "CP000819.1" "CP000819.1" "CP000819.1" ... - $ POS : int 9972 263235 281923 433359 473901 648692 1331794 1733343 2103887 2333538 ... - $ ALT : chr "G" "T" "T" "CTTTTTTTT" ... - ``` + ``` + 'data.frame': 801 obs. of 4 variables: + $ sample_id: chr "SRR2584863" "SRR2584863" "SRR2584863" "SRR2584863" ... + $ CHROM : chr "CP000819.1" "CP000819.1" "CP000819.1" "CP000819.1" ... + $ POS : int 9972 263235 281923 433359 473901 648692 1331794 1733343 2103887 2333538 ... + $ ALT : chr "G" "T" "T" "CTTTTTTTT" ... + ``` -Ok, thats a lot up unpack! Some things to notice. +Ok, that is a lot up unpack! Some things to notice. - The object type `data.frame` is displayed in the first row along with its dimensions, in this case 801 observations (rows) and 4 variables (columns). -- Each variable (column) has a name (e.g. `sample_id`). This is followed - by the object mode (e.g. chr, int, etc.). Notice that before each - variable name there is a `$` - this will be important later. - - -## Introducing Factors - -Factors are the final major data structure we will introduce in our R genomics -lessons. Factors can be thought of as vectors which are specialized for -categorical data. Given R's specialization for statistics, this make sense since -categorial and continuous variables are usually treated differently. Sometimes -you may want to have data treated as a factor, but in other cases, this may be -undesirable. -Let's see the value of treating some of which are categorical in nature as -factors. Let's take a look at just the alternate alleles - -!!! r-project "r" - - ```r - # Extract the "ALT" column to a new object - - alt_alleles <- subset$ALT - ``` - -Let's look at the first few items in our factor using `head()`: - -!!! r-project "r" - - ```r - head(alt_alleles) - ``` - -??? success "Output" - - ``` - [1] "G" "T" "T" "CTTTTTTTT" "CCGCGC" "T" - ``` - -There are 801 alleles (one for each row). To simplify, lets look at just -the single-nuleotide alleles (SNPs). We can use some of the vector -indexing skills from the last episode. - -!!! r-project "r" - - ```r - snps <- c(alt_alleles[alt_alleles=="A"], - alt_alleles[alt_alleles=="T"], - alt_alleles[alt_alleles=="G"], - alt_alleles[alt_alleles=="C"]) - ``` - -This leaves us with a vector of the 701 alternative alleles which were -single nucleotides. Right now, they are being treated a characters, but -we could treat them as categories of SNP. Doing this will enable some -nice features. For example, we can try to generate a plot of this -character vector as it is right now: - -!!! r-project "r" - - ```r - plot(snps) - ``` - -??? success "Output" - - ``` - Error in plot.window(...) : need finite 'ylim' values - In addition: Warning messages: - 1: In xy.coords(x, y, xlabel, ylabel, log) : NAs introduced by coercion - 2: In min(x) : no non-missing arguments to min; returning Inf - 3: In max(x) : no non-missing arguments to max; returning -Inf - ``` - -Whoops! Though the `plot()` function will do its best to give us a quick -plot, it is unable to do so here. One way to fix this it to tell R to -treat the SNPs as categories (i.e. a factor vector); we will create a -new object to avoid confusion using the `factor()` function: - -!!! r-project "r" - - ```r - factor_snps <- factor(snps) - ``` - -Let's learn a little more about this new type of vector: - -!!! r-project "r" - - ```r - str(factor_snps) - ``` - -??? success "Output" - - ``` - Factor w/ 4 levels "A","C","G","T": 1 1 1 1 1 1 1 1 1 1 ... - ``` - -What we get back are the categories ("A","C","G","T") in our factor; -these are called "Levels". **Levels are the different categories -contained in a factor**. By default, R will organize the levels in a -factor in alphabetical order. So the first level in this factor is "A". - -For the sake of efficiency, R stores the content of a factor as a vector -of integers, which an integer is assigned to each of the possible -levels. Recall levels are assigned in alphabetical order. In this case, -the first item in our `factor_snps` object is "A", which happens to be -the 1st level of our factor, ordered alphabetically. This explains the -sequence of "1"s ("Factor w/ 4 levels"A","C","G","T": 1 1 1 1 1 1 1 1 1 -1 ..."), since"A" is the first level, and the first few items in our -factor are all "A"s. - -We can see how many items in our vector fall into each category: - -!!! r-project "r" - - ```r - summary(factor_snps) - ``` - -??? success "Output" - - ``` - A C G T - 211 139 154 203 - ``` - -As you can imagine, this is already useful when you want to generate a -tally. - -!!! tip "Treating objects as categories without changing their mode" - - You don't have to make an object a factor to get the benefits of - treating an object as a factor. See what happens when you use the - `as.factor()` function on `factor_snps`. To generate a tally, you can - sometimes also use the `table()` function; though sometimes you may - need to combine both (i.e. `table(as.factor(object))`) - -## Plotting and ordering factors - -One of the most common uses for factors will be when you plot -categorical values. For example, suppose we want to know how many of our -variants had each possible SNP we could generate a plot: - -!!! r-project "r" - - ```r - plot(factor_snps) - ``` - -??? success "Output" - - ![images](figures/factor_snps_barplot.png){ width="600"} - -This isn't a particularly pretty example of a plot but it works. We'll -be learning much more about creating nice, publication-quality graphics -later in this lesson. - -If you recall, factors are ordered alphabetically. That might make -sense, but categories (e.g., "red", "blue", "green") often do not have -an intrinsic order. What if we wanted to order our plot according to the -numerical value (i.e., in descending order of SNP frequency)? We can -enforce an order on our factors: - -!!! r-project "r" - - ```r - ordered_factor_snps <- factor(factor_snps, levels = names(sort(table(factor_snps)))) - ``` - -Let's deconstruct this from the inside out (you can try each of these -commands to see why this works): - -1. We create a table of `factor_snps` to get the frequency of each SNP: - `table(factor_snps)` -2. We sort this table: `sort(table(factor_snps))`; use the - `decreasing =` parameter for this function if you wanted to change - from the default of FALSE -3. Using the `names` function gives us just the character names of the - table sorted by frequencies:`names(sort(table(factor_snps)))` -4. The `factor` function is what allows us to create a factor. We give - it the `factor_snps` object as input, and use the `levels=` - parameter to enforce the ordering of the levels. - -Now we see our plot has be reordered: - -!!! r-project "r" - - ```r - plot(ordered_factor_snps) - ``` - -??? success "Output" +- Each variable (column) has a name (e.g., `sample_id`). This is followed + by the object mode (e.g., chr, int, etc.). Notice that before each + variable name there is a `$` (this will be important later). - ![images](figures/ordered_factor_snps_barplot.png){ width="600" } - -Factors come in handy in many places when using R. Even using more -sophisticated plotting packages such as ggplot2 will sometimes require -you to understand how to manipulate factors. - -!!! tip "Packages in R -- what are they and why do we use them?" - - Packages are simply collections of functions and/or data that can be - used to extend the capabilities of R beyond the core functionality - that comes with it by default. There are useful R packages available - that span all types of statistical analysis, data visualization, and - more. The main place that R packages are installed from is a website - called [CRAN](https://cran.r-project.org/) (the Comprehensive R - Archive Network). Many thousands of R packages are available there, - and when you use the built-in R function `install.packages()`, it will - look for a CRAN repository to install from. So, for example, to - install [tidyverse](https://www.tidyverse.org) packages such as - `dplyr` and `ggplot2` (which you'll do in the next few lessons), you - would use the following command: - - !!! r-project "r" - - ```r - # Install a package from CRAN - install.packages("ggplot2") - ``` - -## Subsetting data frames +### Subsetting data frames Next, we are going to talk about how you can get specific values from -data frames, and where necessary, change the mode of a column of values. +data frames. The first thing to remember is that a data frame is two-dimensional (rows and columns). Therefore, to select a specific value we will will @@ -515,21 +294,21 @@ one value (except in some cases where we are taking a range). **Try the following indices and functions and try to figure out what they return** - a. `variants[1,1]` + a. `variants[1, 1]` - b. `variants[2,4]` + b. `variants[2, 4]` - c. `variants[801,29]` + c. `variants[801, 29]` - d. `variants[2, ]` + d. `variants[2,]` - e. `variants[-1, ]` + e. `variants[-1,]` - f. `variants[1:4,1]` + f. `variants[1:4, 1]` - g. `variants[1:10,c("REF","ALT")]` + g. `variants[1:10, c("REF","ALT")]` - h. `variants[,c("sample_id")]` + h. `variants[, c("sample_id")]` i. `head(variants)` @@ -737,111 +516,322 @@ one value (except in some cases where we are taking a range). 19 1 C ``` -The subsetting notation is very similar to what we learned for vectors. -The key differences include: +The subsetting notation is very similar to what we learned for vectors. +The key differences include: + +- Typically provide two values separated by commas: `data.frame[row, + column]` +- In cases where you are taking a continuous range of numbers use a + colon between the numbers (start:stop, inclusive) +- For a non continuous set of numbers, pass a vector using `c()` +- Index using the name of a column(s) by passing them as vectors using + `c()` + +Finally, in all of the subsetting exercises above, we printed values to +the screen. You can create a new data frame object by assigning them to +a new object name: + +!!! r-project "r" + + ```r + # Create a new data frame containing only observations from "SRR2584863" + SRR2584863_variants <- variants[variants$sample_id == "SRR2584863", ] + + # Check the dimension of the data frame + dim(SRR2584863_variants) + + # Get a summary of the data frame + summary(SRR2584863_variants) + ``` + + ??? success "Output" + + ``` + [1] 25 29 + + sample_id CHROM POS ID + Length:25 Length:25 Min. : 9972 Mode:logical + Class :character Class :character 1st Qu.:1331794 NA's:25 + Mode :character Mode :character Median :2618472 + Mean :2464989 + 3rd Qu.:3488669 + Max. :4616538 + + REF ALT QUAL FILTER + Length:25 Length:25 Min. : 31.89 Mode:logical + Class :character Class :character 1st Qu.:104.00 NA's:25 + Mode :character Mode :character Median :211.00 + Mean :172.97 + 3rd Qu.:225.00 + Max. :228.00 + + INDEL IDV IMF DP + Mode :logical Min. : 2.00 Min. :0.6667 Min. : 2.0 + FALSE:19 1st Qu.: 3.25 1st Qu.:0.9250 1st Qu.: 9.0 + TRUE :6 Median : 8.00 Median :1.0000 Median :10.0 + Mean : 7.00 Mean :0.9278 Mean :10.4 + 3rd Qu.: 9.75 3rd Qu.:1.0000 3rd Qu.:12.0 + Max. :12.00 Max. :1.0000 Max. :20.0 + NA's :19 NA's :19 + VDB RPB MQB BQB + Min. :0.01627 Min. :0.9008 Min. :0.04979 Min. :0.7507 + 1st Qu.:0.07140 1st Qu.:0.9275 1st Qu.:0.09996 1st Qu.:0.7627 + Median :0.37674 Median :0.9542 Median :0.15013 Median :0.7748 + Mean :0.40429 Mean :0.9517 Mean :0.39997 Mean :0.8418 + 3rd Qu.:0.65951 3rd Qu.:0.9771 3rd Qu.:0.57507 3rd Qu.:0.8874 + Max. :0.99604 Max. :1.0000 Max. :1.00000 Max. :1.0000 + NA's :22 NA's :22 NA's :22 + MQSB SGB MQ0F ICB + Min. :0.5000 Min. :-0.6904 Min. :0.00000 Mode:logical + 1st Qu.:0.9599 1st Qu.:-0.6762 1st Qu.:0.00000 NA's:25 + Median :0.9962 Median :-0.6620 Median :0.00000 + Mean :0.9442 Mean :-0.6341 Mean :0.04667 + 3rd Qu.:1.0000 3rd Qu.:-0.6168 3rd Qu.:0.00000 + Max. :1.0128 Max. :-0.4536 Max. :0.66667 + NA's :3 + HOB AC AN DP4 MQ + Mode:logical Min. :1 Min. :1 Length:25 Min. :10.00 + NA's:25 1st Qu.:1 1st Qu.:1 Class :character 1st Qu.:60.00 + Median :1 Median :1 Mode :character Median :60.00 + Mean :1 Mean :1 Mean :55.52 + 3rd Qu.:1 3rd Qu.:1 3rd Qu.:60.00 + Max. :1 Max. :1 Max. :60.00 + + Indiv gt_PL gt_GT gt_GT_alleles + Length:25 Length:25 Min. :1 Length:25 + Class :character Class :character 1st Qu.:1 Class :character + Mode :character Mode :character Median :1 Mode :character + Mean :1 + 3rd Qu.:1 + Max. :1 + ``` + +## Introducing factors + +Factors are the final major data structure we will introduce in our Introduction +to R lessons. Factors can be thought of as vectors which are specialized for +categorical data. Given R's specialization for statistics, this make sense since +categorial and continuous variables are usually treated differently. Sometimes +you may want to have data treated as a factor, but in other cases, this may be +undesirable. Let's see the value of treating some of which are categorical in +nature as factors. Let's take a look at just the alternate alleles. + +!!! r-project "r" + + ```r + # Extract the "ALT" column to a new object + + alt_alleles <- subset_variants$ALT + ``` + +Let's look at the first few items in our factor using `head()`: + +!!! r-project "r" + + ```r + head(alt_alleles) + ``` + + ??? success "Output" + + ``` + [1] "G" "T" "T" "CTTTTTTTT" "CCGCGC" "T" + ``` + +There are 801 alleles (one for each row). To simplify, lets look at just +the single-nucleotide alleles (SNPs). We can use some of the vector +indexing skills from the last episode. + +!!! r-project "r" + + ```r + alt_snps <- c(alt_alleles[alt_alleles=="A"], + alt_alleles[alt_alleles=="T"], + alt_alleles[alt_alleles=="G"], + alt_alleles[alt_alleles=="C"]) + ``` + +This leaves us with a vector of the 701 alternative alleles which were +single nucleotides. Right now, they are being treated a characters, but +we could treat them as categories of SNP. Doing this will enable some +nice features. For example, we can try to generate a plot of this +character vector as it is right now: + +!!! r-project "r" + + ```r + plot(alt_snps) + ``` + + ??? failure "Output" + + ``` + Error in plot.window(...) : need finite 'ylim' values + In addition: Warning messages: + 1: In xy.coords(x, y, xlabel, ylabel, log) : NAs introduced by coercion + 2: In min(x) : no non-missing arguments to min; returning Inf + 3: In max(x) : no non-missing arguments to max; returning -Inf + ``` + +Whoops! Though the `plot()` function will do its best to give us a quick +plot, it is unable to do so here. One way to fix this it to tell R to +treat the SNPs as categories (i.e., a factor vector); we will create a +new object to avoid confusion using the `factor()` function: + +!!! r-project "r" + + ```r + factor_snps <- factor(alt_snps) + ``` + +Let's learn a little more about this new type of vector: + +!!! r-project "r" + + ```r + str(factor_snps) + ``` + +??? success "Output" + + ``` + Factor w/ 4 levels "A","C","G","T": 1 1 1 1 1 1 1 1 1 1 ... + ``` + +What we get back are the categories ("A","C","G","T") in our factor; +these are called "levels". **Levels are the different categories +contained in a factor**. By default, R will organize the levels in a +factor in alphabetical order. So the first level in this factor is "A". -- Typically provide two values separated by commas: data.frame\[row, - column\] -- In cases where you are taking a continuous range of numbers use a - colon between the numbers (start:stop, inclusive) -- For a non continuous set of numbers, pass a vector using `c()` -- Index using the name of a column(s) by passing them as vectors using - `c()` +For the sake of efficiency, R stores the content of a factor as a vector +of integers, which an integer is assigned to each of the possible +levels. Recall levels are assigned in alphabetical order. In this case, +the first item in our `factor_snps` object is "A", which happens to be +the 1st level of our factor, ordered alphabetically. This explains the +sequence of "1"s (`Factor w/ 4 levels"A","C","G","T": 1 1 1 1 1 1 1 1 1 +1 ...`), since "A" is the first level and the first few items in our +factor are all "A"s. -Finally, in all of the subsetting exercises above, we printed values to -the screen. You can create a new data frame object by assigning them to -a new object name: +We can see how many items in our vector fall into each category: !!! r-project "r" ```r - # Create a new data frame containing only observations from "SRR2584863" + summary(factor_snps) + ``` - SRR2584863_variants <- variants[variants$sample_id == "SRR2584863", ] + ??? success "Output" - # Check the dimension of the data frame + ``` + A C G T + 211 139 154 203 + ``` - dim(SRR2584863_variants) +As you can imagine, this is already useful when you want to generate a +tally. - # Get a summary of the data frame +!!! tip "Treating objects as categories without changing their mode" - summary(SRR2584863_variants) + You don't have to make an object a factor to get the benefits of + treating an object as a factor. See what happens when you use the + `as.factor()` function on `factor_snps`. To generate a tally, you can + sometimes also use the `table()` function; though sometimes you may + need to combine both (i.e., `table(as.factor(object))`) + +### Plotting and ordering factors + +One of the most common uses for factors will be when you plot +categorical values. For example, suppose we want to know how many of our +variants had each possible SNP we could generate a plot: + +!!! r-project "r" + + ```r + plot(factor_snps) ``` -??? success "Output" + ??? success "Output" + + ![images](figures/factor_snps_barplot.png){ width="600"} + +This isn't a particularly pretty example of a plot but it works. We'll +be learning much more about creating nice, publication-quality graphics +later in this workshop. + +If you recall, factors are ordered alphabetically. That might make +sense, but categories (e.g., "red", "blue", "green") often do not have +an intrinsic order. What if we wanted to order our plot according to the +numerical value (i.e., in descending order of SNP frequency)? We can +enforce an order on our factors: +!!! r-project "r" + + ```r + ordered_factor_snps <- factor( + factor_snps, + levels = names(sort(table(factor_snps))) + ) ``` - [1] 25 29 - - sample_id CHROM POS ID - Length:25 Length:25 Min. : 9972 Mode:logical - Class :character Class :character 1st Qu.:1331794 NA's:25 - Mode :character Mode :character Median :2618472 - Mean :2464989 - 3rd Qu.:3488669 - Max. :4616538 - - REF ALT QUAL FILTER - Length:25 Length:25 Min. : 31.89 Mode:logical - Class :character Class :character 1st Qu.:104.00 NA's:25 - Mode :character Mode :character Median :211.00 - Mean :172.97 - 3rd Qu.:225.00 - Max. :228.00 - - INDEL IDV IMF DP - Mode :logical Min. : 2.00 Min. :0.6667 Min. : 2.0 - FALSE:19 1st Qu.: 3.25 1st Qu.:0.9250 1st Qu.: 9.0 - TRUE :6 Median : 8.00 Median :1.0000 Median :10.0 - Mean : 7.00 Mean :0.9278 Mean :10.4 - 3rd Qu.: 9.75 3rd Qu.:1.0000 3rd Qu.:12.0 - Max. :12.00 Max. :1.0000 Max. :20.0 - NA's :19 NA's :19 - VDB RPB MQB BQB - Min. :0.01627 Min. :0.9008 Min. :0.04979 Min. :0.7507 - 1st Qu.:0.07140 1st Qu.:0.9275 1st Qu.:0.09996 1st Qu.:0.7627 - Median :0.37674 Median :0.9542 Median :0.15013 Median :0.7748 - Mean :0.40429 Mean :0.9517 Mean :0.39997 Mean :0.8418 - 3rd Qu.:0.65951 3rd Qu.:0.9771 3rd Qu.:0.57507 3rd Qu.:0.8874 - Max. :0.99604 Max. :1.0000 Max. :1.00000 Max. :1.0000 - NA's :22 NA's :22 NA's :22 - MQSB SGB MQ0F ICB - Min. :0.5000 Min. :-0.6904 Min. :0.00000 Mode:logical - 1st Qu.:0.9599 1st Qu.:-0.6762 1st Qu.:0.00000 NA's:25 - Median :0.9962 Median :-0.6620 Median :0.00000 - Mean :0.9442 Mean :-0.6341 Mean :0.04667 - 3rd Qu.:1.0000 3rd Qu.:-0.6168 3rd Qu.:0.00000 - Max. :1.0128 Max. :-0.4536 Max. :0.66667 - NA's :3 - HOB AC AN DP4 MQ - Mode:logical Min. :1 Min. :1 Length:25 Min. :10.00 - NA's:25 1st Qu.:1 1st Qu.:1 Class :character 1st Qu.:60.00 - Median :1 Median :1 Mode :character Median :60.00 - Mean :1 Mean :1 Mean :55.52 - 3rd Qu.:1 3rd Qu.:1 3rd Qu.:60.00 - Max. :1 Max. :1 Max. :60.00 - - Indiv gt_PL gt_GT gt_GT_alleles - Length:25 Length:25 Min. :1 Length:25 - Class :character Class :character 1st Qu.:1 Class :character - Mode :character Mode :character Median :1 Mode :character - Mean :1 - 3rd Qu.:1 - Max. :1 + +Let's deconstruct this from the *inside out* (you can try each of these +commands to see why this works): + +1. We create a table of `factor_snps` to get the frequency of each SNP: + `table(factor_snps)` +2. We sort this table: `sort(table(factor_snps))`; use the + `decreasing =` parameter for this function if you wanted to change + from the default of FALSE +3. Using the `names()` function gives us just the character names of the + table sorted by frequencies:`names(sort(table(factor_snps)))` +4. The `factor` function is what allows us to create a factor. We give + it the `factor_snps` object as input, and use the `levels=` + parameter to enforce the ordering of the levels. + +Now we see our plot has be reordered: + +!!! r-project "r" + + ```r + plot(ordered_factor_snps) ``` -## Coercing values in data frames + ??? success "Output" + + ![images](figures/ordered_factor_snps_barplot.png){ width="600" } + +Factors come in handy in many places when using R. Even using more +sophisticated plotting packages such as `ggplot2` will sometimes require +you to understand how to manipulate factors. + +!!! tip "Packages in R – what are they and why do we use them?" + + Packages are simply collections of functions and/or data that can be + used to extend the capabilities of R beyond the core functionality + that comes with it by default. There are useful R packages available + that span all types of statistical analysis, data visualization, and + more. The main place that R packages are installed from is a website + called [CRAN](https://cran.r-project.org/) (the Comprehensive R + Archive Network). Many thousands of R packages are available there, + and when you use the built-in R function `install.packages()`, it will + look for a CRAN repository to install from. So, for example, to + install [tidyverse](https://www.tidyverse.org) packages such as + `dplyr` and `ggplot2` (which you'll do in the next few lessons), you + would use the following command: + + !!! r-project "r" -!!! tip "Coercion isn't limited to data frames" + ```r + # Install a package from CRAN + install.packages("ggplot2") + ``` - While we are going to address coercion in the context of data frames - most of these methods apply to other data structures, such as vectors +## Coercing values Sometimes, it is possible that R will misinterpret the type of data represented in a data frame, or store that data in a mode which prevents you from operating on the data the way you wish. For example, a long list of gene names isn't usually thought of as a categorical variable, the way that your -experimental condition (e.g. control, treatment) might be. More importantly, +experimental condition (e.g., control, treatment) might be. More importantly, some R packages you use to analyze your data may expect characters as input, not factors. At other times (such as plotting or some statistical analyses) a factor may be more appropriate. Ultimately, you should know how to change the @@ -858,11 +848,11 @@ is not what you expect. Consider: typeof(snp_chromosomes) ``` -??? success "Output" + ??? success "Output" - ``` - [1] "character" - ``` + ``` + [1] "character" + ``` Although there are several numbers in our vector, they are all in quotes, so we have explicitly told R to consider them as characters. @@ -874,25 +864,15 @@ everything into a character: ```r snp_chromosomes_2 <- c(3, 11, 'X', 6) typeof(snp_chromosomes_2) - ``` - -??? success "Output" - - ``` - [1] "character" - ``` - -!!! r-project "r" - - ```r snp_chromosomes_2[1] ``` -??? success "Output" + ??? success "Output" - ``` - [1] "3" - ``` + ``` + [1] "character" + [1] "3" + ``` We can use the `as.` functions to explicitly coerce values from one form into another. Consider the following vector of characters, which all @@ -903,27 +883,17 @@ happen to be valid numbers: ```r snp_positions_2 <- c("8762685", "66560624", "67545785", "154039662") typeof(snp_positions_2) - ``` - -??? success "Output" - - ``` - [1] "character" - ``` - -!!! r-project "r" - - ```r snp_positions_2[1] ``` -??? success "Output" + ??? success "Output" - ``` - [1] "8762685" - ``` + ``` + [1] "character" + [1] "8762685" + ``` -Now we can coerce `snp_positions_2` into a numeric type using +Now we can coerce `snp_positions_2` into a numeric mode using `as.numeric()`: !!! r-project "r" @@ -931,25 +901,15 @@ Now we can coerce `snp_positions_2` into a numeric type using ```r snp_positions_2 <- as.numeric(snp_positions_2) typeof(snp_positions_2) - ``` - -??? success "Output" - - ``` - [1] "double" - ``` - -!!! r-project "r" - - ```r snp_positions_2[1] ``` -??? success "Output" + ??? success "Output" - ``` - [1] 8762685 - ``` + ``` + [1] "double" + [1] 8762685 + ``` Sometimes coercion is straight forward, but what would happen if we tried using `as.numeric()` on `snp_chromosomes_2` @@ -960,11 +920,12 @@ tried using `as.numeric()` on `snp_chromosomes_2` snp_chromosomes_2 <- as.numeric(snp_chromosomes_2) ``` -??? success "Output" + ??? success "Output" - ``` - Warning: NAs introduced by coercion - ``` + ``` + Warning message: + NAs introduced by coercion + ``` If we check, we will see that an `NA` value (R's default value for missing data) has been introduced. @@ -975,55 +936,52 @@ missing data) has been introduced. snp_chromosomes_2 ``` -??? success "Output" + ??? success "Output" - ``` - [1] 3 11 NA 6 - ``` + ``` + [1] 3 11 NA 6 + ``` Trouble can really start when we try to coerce a factor. For example, -when we try to coerce the `sample_id` column in our data frame into a -numeric mode look at the result: -<-- sample_id is not a factor. I've no idea why they did that. --> +when we try to coerce the `factor_snps` into a numeric mode look at the +result: !!! r-project "r" ```r - as.numeric(variants$sample_id) + as.numeric(factor_snps) ``` ??? success "Output" ``` - [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA - [31] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA - [61] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA - [91] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA - [121] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA - [151] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA - [181] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA - [211] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA - [241] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA - [271] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA - [301] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA - [331] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA - [361] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA - [391] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA - [421] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA - [451] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA - [481] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA - [511] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA - [541] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA - [571] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA - [601] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA - [631] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA - [661] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA - [691] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA - [721] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA - [751] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA - [781] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA - Warning message: - NAs introduced by coercion + [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 + [28] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 + [55] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 + [82] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 + [109] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 + [136] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 + [163] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 + [190] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 4 4 4 4 4 + [217] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 + [244] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 + [271] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 + [298] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 + [325] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 + [352] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 + [379] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 + [406] 4 4 4 4 4 4 4 4 4 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 + [433] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 + [460] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 + [487] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 + [514] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 + [541] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 + [568] 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 + [595] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 + [622] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 + [649] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 + [676] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 + [703] 2 2 2 2 2 ``` Strangely, it works! Almost. Instead of giving an error message, R @@ -1039,37 +997,35 @@ expression like this one: !!! r-project "r" ```r - # Make the 'REF' column a character type column - - variants$REF <- as.character(variants$REF) + # Make the 'sample_id' column a factor type column + variants$sample_id <- as.factor(variants$sample_id) - # check the type of the column - - typeof(variants$REF) + # check the structure of the column + str(variants$sample_id) ``` -??? success "Output" - - ``` - [1] "character" - ``` + ??? success "Output" + ``` + Factor w/ 3 levels "SRR2584863","SRR2584866",..: 1 1 1 1 1 1 1 1 1 1 ... + ``` -## StringsAsFactors = ? +### Lesson summary: Data coercion Lets summarize this section on coercion with a few take home messages. - When you explicitly coerce one data type into another (this is known as -**explicit coercion**), be careful to check the result. Ideally, you should try to see if its possible to avoid steps in your analysis that force you to -coerce. + **explicit coercion**), be careful to check the result. Ideally, you should + try to see if its possible to avoid steps in your analysis that force you to + coerce. - R will sometimes coerce without you asking for it. This is called -(appropriately) **implicit coercion**. For example when we tried to create -a vector with multiple data types, R chose one type through implicit -coercion. -- Check the structure (`str()`) of your data frames before working with them! + (appropriately) **implicit coercion**. For example when we tried to create + a vector with multiple data types, R chose one type through implicit + coercion. +- Check the structure (`str()`) of your data before working with them! -!!! tip "Coercion isn't limited to data frames" +!!! tip "`StringsAsFactors`" Prior to R 4.0 when importing a data frame using any one of the `read.table()` functions such as `read.csv()` , the argument `StringsAsFactors` was by @@ -1077,7 +1033,7 @@ coercion. set to true TRUE. Setting it to FALSE will treat any non-numeric column to a character type. `read.csv()` documentation, you will also see you can explicitly type your columns using the `colClasses` argument. Other R packages - (such as the Tidyverse "readr") don't have this particular conversion issue, + (such as the Tidyverse `readr`) don't have this particular conversion issue, but many packages will still try to guess a data type. @@ -1088,13 +1044,11 @@ to know. There are lots of arithmetic functions you may want to apply to your data frame, covering those would be a course in itself (there is some starting -material [here](https://swcarpentry.github.io/r-novice-inflammation/15-supp-loops-in-depth/)). Our lessons will cover some additional summary statistical functions in -a subsequent lesson, but overall we will focus on data cleaning and -visualization. +material [here](https://swcarpentry.github.io/r-novice-inflammation/15-supp-loops-in-depth/)). You can use functions like `mean()`, `min()`, `max()` on an -individual column. Let's look at the "DP" or filtered depth. This value shows the number of filtered -reads that support each of the reported variants. +individual column. Let's look at the "DP" or filtered depth. This value shows +the number of filtered reads that support each of the reported variants. !!! r-project "r" @@ -1102,11 +1056,11 @@ reads that support each of the reported variants. max(variants$DP) ``` -??? success "Output" + ??? success "Output" - ``` - [1] 79 - ``` + ``` + [1] 79 + ``` You can sort a data frame using the `order()` function: @@ -1117,11 +1071,11 @@ You can sort a data frame using the `order()` function: head(sorted_by_DP$DP) ``` -??? success "Output" + ??? success "Output" - ``` - [1] 2 2 2 2 2 2 - ``` + ``` + [1] 2 2 2 2 2 2 + ``` !!! question "Exercise" @@ -1142,36 +1096,63 @@ You can sort a data frame using the `order()` function: [1] 79 46 41 29 29 27 ``` -You can rename columns: +!!! tip "These functions work on vectors too!" + + When using these functions (i.e., `mean()`, `min()`, `max()`, and `order()`), + replace the input with a vector and it will do the same job! For example: + + !!! r-project "r" + + ```r + # Minimum position of a SNP + min(snp_positions) + + # Decreasing order of SNPs by positions + snp_positions[order(snp_positions, decreasing = TRUE)] + ``` + + !!! success "Output" + + ``` + [1] 8762685 + [1] 154039662 67545785 66560624 8762685 + ``` + +You can rename columns by logical subsetting or index: !!! r-project "r" ```r + # By logical subsetting colnames(variants)[colnames(variants) == "sample_id"] <- "strain" - # Check the column name (hint names are returned as a vector) + # By index + colnames(variants)[2] <- "chromosome" + # Check the column name (hint names are returned as a vector) colnames(variants) ``` -??? success "Output" + ??? success "Output" - ``` - [1] "strain" "CHROM" "POS" "ID" - [5] "REF" "ALT" "QUAL" "FILTER" - [9] "INDEL" "IDV" "IMF" "DP" - [13] "VDB" "RPB" "MQB" "BQB" - [17] "MQSB" "SGB" "MQ0F" "ICB" - [21] "HOB" "AC" "AN" "DP4" - [25] "MQ" "Indiv" "gt_PL" "gt_GT" - [29] "gt_GT_alleles" - ``` + ``` + [1] "sample_id" "CHROM" "POS" + [4] "ID" "REF" "ALT" + [7] "QUAL" "FILTER" "INDEL" + [10] "IDV" "IMF" "DP" + [13] "VDB" "RPB" "MQB" + [16] "BQB" "MQSB" "SGB" + [19] "MQ0F" "ICB" "HOB" + [22] "AC" "AN" "DP4" + [25] "MQ" "Indiv" "gt_PL" + [28] "gt_GT" "gt_GT_alleles" + ``` ## Saving your data frame to a file We can save data to a file. We will save our `SRR2584863_variants` object -to a .csv file using the `write.csv()` function: +to a `.csv` (comma-separated values) file using the `write.csv()` function: !!! r-project "r" @@ -1188,18 +1169,21 @@ file name, the file will be written in the current working directory). Excel is one of the most common formats, so we need to discuss how to make these files play nicely with R. The simplest way to import data -from Excel is to **save your Excel file in .csv format**\*. You can then +from Excel is to **save your Excel file in `.csv` format**. You can then import into R right away. Sometimes you may not be able to do this (imagine you have data in 300 Excel files, are you going to open and export all of them?). One common R package (a set of code with features you can download and -add to your R installation) is the [readxl package](https://CRAN.R-project.org/package=readxl) which can open and +add to your R installation) is the [readxl +package](https://CRAN.R-project.org/package=readxl) which can open and import Excel files. Rather than addressing package installation this second (we'll discuss this soon!), we can take advantage of RStudio's -import feature which integrates this package. (Note: this feature is -available only in the latest versions of RStudio such as is installed on -our cloud instance). +import feature which integrates this package. + +!!! bell "`readxl`-RStudio integration" + + This feature is only available on RStudio version 1.0.44 or later. First, in the RStudio menu go to **File**, select **Import Dataset**, and choose **From Excel...** (notice there are several other options you @@ -1223,7 +1207,7 @@ import this file. We could have written this code and imported the Excel file without the RStudio import function, but now you can choose your preference. -In this exercise, we will leave the title of the data frame as +In this exercise, we will leave the name of the data frame as **Ecoli_metadata**, and there are no other options we need to adjust. Click the Import button to import the data. @@ -1233,32 +1217,32 @@ frame: !!! r-project "r" ```r - head(Rcoli_metadata) + head(Ecoli_metadata) ``` -??? success "Output" + ??? success "Output" - ``` - # A tibble: 6 × 7 - sample generation clade strain cit run genome_size - - 1 REL606 0 NA REL606 unknown 4.62 - 2 REL1166A 2000 unknown REL606 unknown SRR098028 4.63 - 3 ZDB409 5000 unknown REL606 unknown SRR098281 4.6 - 4 ZDB429 10000 UC REL606 unknown SRR098282 4.59 - 5 ZDB446 15000 UC REL606 unknown SRR098283 4.66 - 6 ZDB458 20000 (C1,C2) REL606 unknown SRR098284 4.63 - ``` + ``` + # A tibble: 6 × 7 + sample generation clade strain cit run genome_size + + 1 REL606 0 NA REL606 unknown NA 4.62 + 2 REL1166A 2000 unknown REL606 unknown SRR098028 4.63 + 3 ZDB409 5000 unknown REL606 unknown SRR098281 4.6 + 4 ZDB429 10000 UC REL606 unknown SRR098282 4.59 + 5 ZDB446 15000 UC REL606 unknown SRR098283 4.66 + 6 ZDB458 20000 (C1,C2) REL606 unknown SRR098284 4.63 + ``` -The type of this object is 'tibble', a type of data frame we will talk -more about in the 'dplyr' section. If you needed a true R data frame you -could coerce with `as.data.frame()`. +The type of this object is **tibble**, a type of data frame we will talk +more about in the [`dplyr` section](appendix/05-dplyr.md). If you needed a true +R data frame you could coerce with `as.data.frame()`. !!! question "Exercise: Putting it all together - data frames" **Using the `Ecoli_metadata` data frame created above, answer the following questions** - A) What are the dimensions (# rows, \# columns) of the data frame? + A) What are the dimensions (# rows, # columns) of the data frame? B) What are categories are there in the `cit` column? *hint*: treat column as factor @@ -1266,11 +1250,11 @@ could coerce with `as.data.frame()`. D) What is the genome size for the 7th observation in this data set? - E) What is the median value of the variable `genome_size` + E) What is the median value of the variable `genome_size`? - F) Rename the column `sample` to `sample_id` + F) Rename the column `sample` to `sample_id`. - G) Create a new column (name genome_size_bp) and set it equal to the genome_size multiplied by 1,000,000 + G) Create a new column named `genome_size_bp` and set it equal to the genome_size multiplied by 1,000,000. H) Save the edited `Ecoli_metadata` data frame as "exercise_solution.csv" in your current working directory. @@ -1284,11 +1268,11 @@ could coerce with `as.data.frame()`. dim(Ecoli_metadata) ``` - !!! success "Output" + ??? success "Output" - ``` - [1] 30 7 - ``` + ``` + [1] 30 7 + ``` B) @@ -1298,11 +1282,11 @@ could coerce with `as.data.frame()`. levels(as.factor(Ecoli_metadata$cit)) ``` - !!! success "Output" + ??? success "Output" - ``` - [1] "minus" "plus" "unknown" - ``` + ``` + [1] "minus" "plus" "unknown" + ``` C) @@ -1312,29 +1296,29 @@ could coerce with `as.data.frame()`. table(as.factor(Ecoli_metadata$cit)) ``` - !!! success "Output" + ??? success "Output" - ``` - minus plus unknown - 9 9 12 - ``` + ``` + minus plus unknown + 9 9 12 + ``` D) !!! r-project "r" ```r - Ecoli_metadata[7,7] + Ecoli_metadata[7, 7] ``` - !!! success "Output" + ??? success "Output" - ``` - # A tibble: 1 × 1 - genome_size - - 1 4.62 - ``` + ``` + # A tibble: 1 × 1 + genome_size + + 1 4.62 + ``` E) @@ -1344,11 +1328,11 @@ could coerce with `as.data.frame()`. median(Ecoli_metadata$genome_size) ``` - !!! success "Output" + ??? success "Output" - ``` - [1] 4.625 - ``` + ``` + [1] 4.625 + ``` F) @@ -1356,6 +1340,50 @@ could coerce with `as.data.frame()`. ```r colnames(Ecoli_metadata)[colnames(Ecoli_metadata) == "sample"] <- "sample_id" + + # Check the column names + colnames(Ecoli_metadata) + ``` + + ??? success "Output" + + ``` + [1] "sample_id" "generation" "clade" "strain" + [5] "cit" "run" "genome_size" + ``` + + G) + + !!! r-project "r" + + ```r Ecoli_metadata$genome_size_bp <- Ecoli_metadata$genome_size * 1000000 + + # Check the first few rows + head(Ecoli_metadata) + ``` + + ??? success "Output" + + ``` + # A tibble: 6 × 8 + sample_id generation clade strain cit run genome_size genome_size_bp + + 1 REL606 0 NA REL606 unknown NA 4.62 4620000 + 2 REL1166A 2000 unknown REL606 unknown SRR098028 4.63 4630000 + 3 ZDB409 5000 unknown REL606 unknown SRR098281 4.6 4600000 + 4 ZDB429 10000 UC REL606 unknown SRR098282 4.59 4590000 + 5 ZDB446 15000 UC REL606 unknown SRR098283 4.66 4660000 + 6 ZDB458 20000 (C1,C2) REL606 unknown SRR098284 4.63 4630000 + ``` + + H) + + !!! r-project "r" + + ```r write.csv(Ecoli_metadata, file = "exercise_solution.csv") ``` + + Check the Files tab on the bottom-right panel to see if the file is + there. diff --git a/docs/06-data-visualization.md b/docs/06-data-visualization.md index 6e3ca30..051a29c 100644 --- a/docs/06-data-visualization.md +++ b/docs/06-data-visualization.md @@ -2,20 +2,20 @@ !!! info - === "Keypoints" + === "Key points" - - ggplot2 is a powerful tool for high-quality plots - - ggplot2 provides a flexiable and readable grammar to build plots + - `ggplot2` is a powerful tool for high-quality plots + - `ggplot2` provides a flexible and readable grammar to build plots === "Objectives" - Describe the role of data, aesthetics, and geoms in ggplot functions. - Choose the correct aesthetics and alter the geom parameters for a - scatter plot, histogram, or box plot. + scatter plot, bar chart, density, or box plot. - Layer multiple geometries in a single plot. - - Customize plot scales, titles, themes, and fonts. + - Customize plot titles, themes, and fonts. - Apply a facet to a plot. - - Apply additional ggplot2-compatible plotting libraries. + - Apply additional `ggplot2`-compatible plotting libraries. - Save a ggplot to a file. - List several resources for getting help with ggplot. - List several resources for creating informative scientific plots. @@ -27,33 +27,53 @@ - What is the process of creating a publication-quality plots with ggplot in R? - ## Introduction to **`ggplot2`** ![images](https://ggplot2.tidyverse.org/logo.png){ align="right" } -**`ggplot2`** is a plotting package that makes it simple to create complex plots from data in a data frame. It provides a more programmatic interface for specifying what variables to plot, how they are displayed, and general visual properties. Therefore, we only need minimal changes if the underlying data change or if we decide to change from a bar plot to a scatter plot. This helps in creating publication-quality plots with minimal amounts of adjustments and tweaking. - -The **gg** in "**ggplot**" stands for "Grammar of Graphics," which is an elegant yet powerful way to describe the making of scientific plots. In short, the grammar of graphics breaks down every plot into a few components, namely, a dataset, a set of geoms (visual marks that represent the data points), and a coordinate system. You can imagine this is a grammar that gives unique names to each component appearing in a plot and conveys specific information about data. With **ggplot**, graphics are built step by step by adding new elements. - -The idea of **mapping** is crucial in **ggplot**. One familiar example is to *map* the value of one variable in a dataset to $x$ and the other to $y$. However, we often encounter datasets that include multiple (more than two) variables. In this case, **ggplot** allows you to *map* those other variables to visual marks such as **color** and **shape** (**aesthetics** or `aes`). One thing you may want to remember is the difference between **discrete** and **continuous** variables. Some aesthetics, such as the shape of dots, do not accept continuous variables. If forced to do so, R will give an error. This is easy to understand; we cannot create a continuum of shapes for a variable, unlike, say, color. +**`ggplot2`** is a plotting package that makes it simple to create complex plots +from data in a data frame. It provides a more programmatic interface for +specifying what variables to plot, how they are displayed, and general visual +properties. Therefore, we only need minimal changes if the underlying data +change or if we decide to change from a bar plot to a scatter plot. This helps +in creating publication-quality plots with minimal adjustments and tweaking. + +The **gg** in "**ggplot**" stands for "Grammar of Graphics," +which is an elegant yet powerful way to describe the making of scientific plots. +In short, the grammar of graphics breaks down every plot into a few components, +namely, a dataset, a set of geoms (or *geometric objects*; visual marks that +represent the data points), and a coordinate system. You can imagine this is a +grammar that gives unique names to each component appearing in a plot and +conveys specific information about data. With **ggplot**, graphics are built +step-by-step by adding new elements. + +The idea of **mapping** is crucial in **ggplot**. One familiar example is to +*map* the value of one variable in a dataset to $x$ and the other to $y$. +However, we often encounter datasets that include more than two variables. +In this case, **ggplot** allows you to *map* those other variables to visual +marks such as **color** and **shape**. Along with the mapped coordinates +$x$ and $y$, these visual marks constitute the **aesthetics** of the figure +(specified using `aes()`). One thing you may want to remember is the difference +between **discrete** and **continuous** variables. Some aesthetics, +such as the shape of dots, do not accept continuous variables. +If forced to do so, R will give an error. This is easy to understand; +we cannot create a continuum of shapes for a variable, unlike, say, color. !!! tip "Checking continuous/discrete variables" - When having doubts about whether a variable is [continuous or discrete](https://en.wikipedia.org/wiki/Continuous_or_discrete_variable), a quick way to check is to use the [`summary()`](https://www.geeksforgeeks.org/get-summary-of-results-produced-by-functions-in-r-programming-summary-function/) function. Continuous variables have descriptive statistics but not the discrete variables. + When having doubts about whether a variable is [continuous or + discrete](https://en.wikipedia.org/wiki/Continuous_or_discrete_variable), + a quick way to check is to use the [`summary()`](https://www.geeksforgeeks.org/get-summary-of-results-produced-by-functions-in-r-programming-summary-function/) + function. Continuous variables have descriptive statistics + (e.g., max, min, mean) but not the discrete variables. !!! tip "Installing `ggplot` on your local machine" - Here, we do not need to install `ggplot` while working within NeSI's RStudio. However, if you would like to work on your own local R/RStudio, you can install this package (or any other packages) like so: + Here, we do not need to install `ggplot` while working within + NeSI's RStudio. However, if you would like to work on + your own local R/RStudio, you can install this package (or any other + packages) like so: !!! r-project "r" @@ -61,7 +81,9 @@ The idea of **mapping** is crucial in **ggplot**. One familiar example is to *ma install.packages("ggplot2") ``` - **`ggplot2`** belongs to the [**`tidyverse`** framework](https://www.tidyverse.org/), a suite of packages that can help you with data import and manipulation. + **`ggplot2`** belongs to the [**`tidyverse`** + framework](https://www.tidyverse.org/), a suite of packages that + can help you with data import and manipulation. -Explore the *structure* (types of columns and number of rows) of the dataset using `str()`. +Explore the *structure* (types of columns and number of rows) of the dataset +using `str()`. !!! r-project "r" @@ -148,40 +172,40 @@ Explore the *structure* (types of columns and number of rows) of the dataset usi str(variants) ``` -??? success "Output" + ??? success "Output" - ``` - 'data.frame': 801 obs. of 29 variables: - $ sample_id : chr "SRR2584863" "SRR2584863" "SRR2584863" "SRR2584863" ... - $ CHROM : chr "CP000819.1" "CP000819.1" "CP000819.1" "CP000819.1" ... - $ POS : int 9972 263235 281923 433359 473901 648692 1331794 1733343 2103887 2333538 ... - $ ID : logi NA NA NA NA NA NA ... - $ REF : chr "T" "G" "G" "CTTTTTTT" ... - $ ALT : chr "G" "T" "T" "CTTTTTTTT" ... - $ QUAL : num 91 85 217 64 228 210 178 225 56 167 ... - $ FILTER : logi NA NA NA NA NA NA ... - $ INDEL : logi FALSE FALSE FALSE TRUE TRUE FALSE ... - $ IDV : int NA NA NA 12 9 NA NA NA 2 7 ... - $ IMF : num NA NA NA 1 0.9 ... - $ DP : int 4 6 10 12 10 10 8 11 3 7 ... - $ VDB : num 0.0257 0.0961 0.7741 0.4777 0.6595 ... - $ RPB : num NA 1 NA NA NA NA NA NA NA NA ... - $ MQB : num NA 1 NA NA NA NA NA NA NA NA ... - $ BQB : num NA 1 NA NA NA NA NA NA NA NA ... - $ MQSB : num NA NA 0.975 1 0.916 ... - $ SGB : num -0.556 -0.591 -0.662 -0.676 -0.662 ... - $ MQ0F : num 0 0.167 0 0 0 ... - $ ICB : logi NA NA NA NA NA NA ... - $ HOB : logi NA NA NA NA NA NA ... - $ AC : int 1 1 1 1 1 1 1 1 1 1 ... - $ AN : int 1 1 1 1 1 1 1 1 1 1 ... - $ DP4 : chr "0,0,0,4" "0,1,0,5" "0,0,4,5" "0,1,3,8" ... - $ MQ : int 60 33 60 60 60 60 60 60 60 60 ... - $ Indiv : chr "/home/dcuser/dc_workshop/results/bam/SRR2584863.aligned.sorted.bam" "/home/dcuser/dc_workshop/results/bam/SRR2584863.aligned.sorted.bam" "/home/dcuser/dc_workshop/results/bam/SRR2584863.aligned.sorted.bam" "/home/dcuser/dc_workshop/results/bam/SRR2584863.aligned.sorted.bam" ... - $ gt_PL : chr "121,0" "112,0" "247,0" "91,0" ... - $ gt_GT : int 1 1 1 1 1 1 1 1 1 1 ... - $ gt_GT_alleles: chr "G" "T" "T" "CTTTTTTTT" ... - ``` + ``` + 'data.frame': 801 obs. of 29 variables: + $ sample_id : chr "SRR2584863" "SRR2584863" "SRR2584863" "SRR2584863" ... + $ CHROM : chr "CP000819.1" "CP000819.1" "CP000819.1" "CP000819.1" ... + $ POS : int 9972 263235 281923 433359 473901 648692 1331794 1733343 2103887 2333538 ... + $ ID : logi NA NA NA NA NA NA ... + $ REF : chr "T" "G" "G" "CTTTTTTT" ... + $ ALT : chr "G" "T" "T" "CTTTTTTTT" ... + $ QUAL : num 91 85 217 64 228 210 178 225 56 167 ... + $ FILTER : logi NA NA NA NA NA NA ... + $ INDEL : logi FALSE FALSE FALSE TRUE TRUE FALSE ... + $ IDV : int NA NA NA 12 9 NA NA NA 2 7 ... + $ IMF : num NA NA NA 1 0.9 ... + $ DP : int 4 6 10 12 10 10 8 11 3 7 ... + $ VDB : num 0.0257 0.0961 0.7741 0.4777 0.6595 ... + $ RPB : num NA 1 NA NA NA NA NA NA NA NA ... + $ MQB : num NA 1 NA NA NA NA NA NA NA NA ... + $ BQB : num NA 1 NA NA NA NA NA NA NA NA ... + $ MQSB : num NA NA 0.975 1 0.916 ... + $ SGB : num -0.556 -0.591 -0.662 -0.676 -0.662 ... + $ MQ0F : num 0 0.167 0 0 0 ... + $ ICB : logi NA NA NA NA NA NA ... + $ HOB : logi NA NA NA NA NA NA ... + $ AC : int 1 1 1 1 1 1 1 1 1 1 ... + $ AN : int 1 1 1 1 1 1 1 1 1 1 ... + $ DP4 : chr "0,0,0,4" "0,1,0,5" "0,0,4,5" "0,1,3,8" ... + $ MQ : int 60 33 60 60 60 60 60 60 60 60 ... + $ Indiv : chr "/home/dcuser/dc_workshop/results/bam/SRR2584863.aligned.sorted.bam" "/home/dcuser/dc_workshop/results/bam/SRR2584863.aligned.sorted.bam" "/home/dcuser/dc_workshop/results/bam/SRR2584863.aligned.sorted.bam" "/home/dcuser/dc_workshop/results/bam/SRR2584863.aligned.sorted.bam" ... + $ gt_PL : chr "121,0" "112,0" "247,0" "91,0" ... + $ gt_GT : int 1 1 1 1 1 1 1 1 1 1 ... + $ gt_GT_alleles: chr "G" "T" "T" "CTTTTTTTT" ... + ``` Alternatively, we can display the first a few rows (vertically) of the table using [`head()`](https://www.geeksforgeeks.org/get-the-first-parts-of-a-data-set-in-r-programming-head-function/): @@ -192,45 +216,48 @@ table using [`head()`](https://www.geeksforgeeks.org/get-the-first-parts-of-a-da head(variants) ``` -??? success "Output" - - ``` - sample_id CHROM POS ID REF ALT QUAL FILTER INDEL IDV IMF DP VDB RPB - 1 SRR2584863 CP000819.1 9972 NA T G 91 NA FALSE NA NA 4 0.0257451 NA - 2 SRR2584863 CP000819.1 263235 NA G T 85 NA FALSE NA NA 6 0.0961330 1 - 3 SRR2584863 CP000819.1 281923 NA G T 217 NA FALSE NA NA 10 0.7740830 NA - 4 SRR2584863 CP000819.1 433359 NA CTTTTTTT CTTTTTTTT 64 NA TRUE 12 1.0 12 0.4777040 NA - 5 SRR2584863 CP000819.1 473901 NA CCGC CCGCGC 228 NA TRUE 9 0.9 10 0.6595050 NA - 6 SRR2584863 CP000819.1 648692 NA C T 210 NA FALSE NA NA 10 0.2680140 NA - MQB BQB MQSB SGB MQ0F ICB HOB AC AN DP4 MQ - 1 NA NA NA -0.556411 0.000000 NA NA 1 1 0,0,0,4 60 - 2 1 1 NA -0.590765 0.166667 NA NA 1 1 0,1,0,5 33 - 3 NA NA 0.974597 -0.662043 0.000000 NA NA 1 1 0,0,4,5 60 - 4 NA NA 1.000000 -0.676189 0.000000 NA NA 1 1 0,1,3,8 60 - 5 NA NA 0.916482 -0.662043 0.000000 NA NA 1 1 1,0,2,7 60 - 6 NA NA 0.916482 -0.670168 0.000000 NA NA 1 1 0,0,7,3 60 - Indiv gt_PL gt_GT gt_GT_alleles - 1 /home/dcuser/dc_workshop/results/bam/SRR2584863.aligned.sorted.bam 121,0 1 G - 2 /home/dcuser/dc_workshop/results/bam/SRR2584863.aligned.sorted.bam 112,0 1 T - 3 /home/dcuser/dc_workshop/results/bam/SRR2584863.aligned.sorted.bam 247,0 1 T - 4 /home/dcuser/dc_workshop/results/bam/SRR2584863.aligned.sorted.bam 91,0 1 CTTTTTTTT - 5 /home/dcuser/dc_workshop/results/bam/SRR2584863.aligned.sorted.bam 255,0 1 CCGCGC - 6 /home/dcuser/dc_workshop/results/bam/SRR2584863.aligned.sorted.bam 240,0 1 T - ``` + ??? success "Output" -**`ggplot2`** functions like data in the **long** format, i.e., a column for every dimension (variable), and a row for every observation. Well-structured data will save you time when making figures with **`ggplot2`** + ``` + sample_id CHROM POS ID REF ALT QUAL FILTER INDEL IDV IMF DP VDB RPB + 1 SRR2584863 CP000819.1 9972 NA T G 91 NA FALSE NA NA 4 0.0257451 NA + 2 SRR2584863 CP000819.1 263235 NA G T 85 NA FALSE NA NA 6 0.0961330 1 + 3 SRR2584863 CP000819.1 281923 NA G T 217 NA FALSE NA NA 10 0.7740830 NA + 4 SRR2584863 CP000819.1 433359 NA CTTTTTTT CTTTTTTTT 64 NA TRUE 12 1.0 12 0.4777040 NA + 5 SRR2584863 CP000819.1 473901 NA CCGC CCGCGC 228 NA TRUE 9 0.9 10 0.6595050 NA + 6 SRR2584863 CP000819.1 648692 NA C T 210 NA FALSE NA NA 10 0.2680140 NA + MQB BQB MQSB SGB MQ0F ICB HOB AC AN DP4 MQ + 1 NA NA NA -0.556411 0.000000 NA NA 1 1 0,0,0,4 60 + 2 1 1 NA -0.590765 0.166667 NA NA 1 1 0,1,0,5 33 + 3 NA NA 0.974597 -0.662043 0.000000 NA NA 1 1 0,0,4,5 60 + 4 NA NA 1.000000 -0.676189 0.000000 NA NA 1 1 0,1,3,8 60 + 5 NA NA 0.916482 -0.662043 0.000000 NA NA 1 1 1,0,2,7 60 + 6 NA NA 0.916482 -0.670168 0.000000 NA NA 1 1 0,0,7,3 60 + Indiv gt_PL gt_GT gt_GT_alleles + 1 /home/dcuser/dc_workshop/results/bam/SRR2584863.aligned.sorted.bam 121,0 1 G + 2 /home/dcuser/dc_workshop/results/bam/SRR2584863.aligned.sorted.bam 112,0 1 T + 3 /home/dcuser/dc_workshop/results/bam/SRR2584863.aligned.sorted.bam 247,0 1 T + 4 /home/dcuser/dc_workshop/results/bam/SRR2584863.aligned.sorted.bam 91,0 1 CTTTTTTTT + 5 /home/dcuser/dc_workshop/results/bam/SRR2584863.aligned.sorted.bam 255,0 1 CCGCGC + 6 /home/dcuser/dc_workshop/results/bam/SRR2584863.aligned.sorted.bam 240,0 1 T + ``` -**`ggplot2`** graphics are built step-by-step by adding new elements. Adding layers in this fashion allows for extensive flexibility and customization of plots, and more equally important the readability of the code. +**`ggplot2`** functions like data in the **long** format, i.e., a column for +every dimension (variable), and a row for every observation. Well-structured +data will save you time when making figures with **`ggplot2`**. -To build a ggplot, we will use the following basic template that can be used for different types of plots: +**`ggplot2`** graphics are built step-by-step by adding new elements. Adding +layers in this fashion allows for extensive flexibility and customization of +plots, and more equally important the readability of the code. -!!! r-project "r" +To build a ggplot, we will use the following basic template that can be used for +different types of plots: - ```r - ggplot(data = , mapping = aes()) + () - ``` +```r +ggplot(data = , mapping = aes()) + () +``` -- use the `ggplot()` function and bind the plot to a specific data +- Use the `ggplot()` function and bind the plot to a specific data frame using the `data` argument !!! r-project "r" @@ -239,9 +266,9 @@ To build a ggplot, we will use the following basic template that can be used for ggplot(data = variants) ``` -- define a mapping (using the aesthetic (`aes`) function), by +- Define a mapping (using the aesthetic (`aes`) function), by selecting the variables to be plotted and specifying how to present - them in the graph, e.g. as x and y positions or characteristics such + them in the graph, e.g., as x and y positions or characteristics such as size, shape, color, etc. !!! r-project "r" @@ -250,7 +277,7 @@ To build a ggplot, we will use the following basic template that can be used for ggplot(data = variants, aes(x = POS, y = DP)) ``` -- add 'geoms' -- graphical representations of the data in the plot +- Add 'geoms' – graphical representations of the data in the plot (points, lines, bars). **`ggplot2`** offers many different geoms; we will use some common ones today, including: - [`geom_point()`](https://ggplot2.tidyverse.org/reference/geom_point.html) @@ -290,9 +317,19 @@ plots, so the above plot can also be generated with code like this: !!! info "Notes" - - Anything you put in the [`ggplot()`](https://ggplot2.tidyverse.org/reference/ggplot.html) function can be seen by any geom layers that you add (i.e., these are universal plot settings). This includes the x- and y-axis mapping you set up in [`aes()`](https://ggplot2.tidyverse.org/reference/aes.html). - - You can also specify mappings for a given geom independently of the mappings defined globally in the [`ggplot()`](https://ggplot2.tidyverse.org/reference/ggplot.html) function. - - The `+` sign used to add new layers must be placed at the end of the line containing the *previous* layer. If, instead, the `+` sign is added at the beginning of the line containing the new layer, **`ggplot2`** will not add the new layer and will return an error message. + - Anything you put in the + [`ggplot()`](https://ggplot2.tidyverse.org/reference/ggplot.html) + function can be seen by any geom layers that you add (i.e., these are + universal plot settings). This includes the x- and y-axis mapping you + set up in [`aes()`](https://ggplot2.tidyverse.org/reference/aes.html). + - You can also specify mappings for a given geom independently of the + mappings defined globally in the + [`ggplot()`](https://ggplot2.tidyverse.org/reference/ggplot.html) + function. + - The `+` sign used to add new layers must be placed at the end of the line + containing the *previous* layer. If, instead, the `+` sign is added at the + beginning of the line containing the new layer, **`ggplot2`** will not add + the new layer and will return an error message. !!! r-project "r" @@ -333,15 +370,16 @@ For instance, we can add transparency (`alpha`) to avoid over-plotting: We can also add colors for all the points: -\`\`\`{r adding-colors, purl=FALSE} - !!! r-project "r" ```r ggplot(data = variants, aes(x = POS, y = DP)) + geom_point(alpha = 0.5, color = "blue") ``` -Or to color each species in the plot differently, you could use a vector as an input to the argument **color**. **`ggplot2`** will provide a different color corresponding to different values in the vector. Here is an example where we color with **`sample_id`**: +Or to color each species in the plot differently, you could use a vector as an +input to the argument `color`. **`ggplot2`** will provide a different color +corresponding to different values in the vector. Here is an example where we +color with **`sample_id`**: !!! r-project "r" @@ -353,12 +391,11 @@ Or to color each species in the plot differently, you could use a vector as an i Notice that we can change the geom layer and colors will be still determined by **`sample_id`** - !!! r-project "r" ```r ggplot(data = variants, aes(x = POS, y = DP, color = sample_id)) + - geom_point(alpha = 0.5) + geom_line(alpha = 0.5) ``` To make our plot more readable, we can add axis labels: @@ -385,13 +422,30 @@ To add a *main* title to the plot, we use ggtitle("Read Depth vs. Position") ``` -Now the figure is complete and ready to be exported and saved to a file. +!!! tip "Using `labs()` for plot labels" + + We can also use + [`labs()`](https://ggplot2.tidyverse.org/reference/labs.html?q=labs#null) to + create/modify labels beyond `x` and `y`, such as title, subtitle, caption, + etc. For example: + + !!! r-project "r" + + ```r + ggplot(data = variants, aes(x = POS, y = DP, color = sample_id)) + + geom_point(alpha = 0.5) + + labs(x = "Base Pair Position", + y = "Read Depth (DP)", + title = "Read Depth vs. Position") + ``` + +Now the figure is complete, we can export and save it to a file. This can be achieved easily using [`ggsave()`](https://ggplot2.tidyverse.org/reference/ggsave.html), which can write, by default, the most recent generated figure into different formats (e.g., `jpeg`, `png`, `pdf`) according to the file extension. So, for example, to create a pdf version of the above figure with a -dimension of $6 \times 4$ inches: +dimension of 6 $\times$ 4 inches: !!! r-project "r" @@ -402,6 +456,20 @@ dimension of $6 \times 4$ inches: If we check the *current working directory*, there should be a newly created file called `depth.pdf` with the above plot. +!!! tip "Saving a plot using different units and formats" + + === "Size units" + + By default, `ggsave()` measures lengths in inches. To change that, we + can use the `units =` arguments. This argument will take `in`, `cm`, + `mm`, and `px`. + + === "File formats" + + The example above writes the plot to a PDF. We can also save it to other + formats such as `jpeg`, `tiff`, `bmp`, `png`, etc. by modifying the + suffix (i.e. file extension). + !!! question "Challenge" Use what you just learned to create a scatter plot of mapping quality (`MQ`) over position (`POS`) with the samples showing in different colors. Make sure to give your plot relevant axis labels. @@ -427,7 +495,7 @@ To further customize the plot, we can change the default font format: labs(x = "Base Pair Position", y = "Read Depth (DP)") + ggtitle("Read Depth vs. Position") + - theme(text = element_text(family = "Bookman")) + theme(text = element_text(family = "mono")) ``` @@ -485,8 +553,8 @@ Additionally, you can remove the grid: !!! question "Challenge" Use what you just learned to create a scatter plot of PHRED scaled - quality (`QUAL`) over position (`POS`) with the samples showing in - different colors. Make sure to give your plot relevant axis labels. + quality (`QUAL`) over position (`POS`) with the points colored and faceted + based on samples. Make sure to give your plot relevant axis labels. ??? success "Solution" @@ -495,15 +563,16 @@ Additionally, you can remove the grid: ```r ggplot(data = variants, aes(x = POS, y = QUAL, color = sample_id)) + geom_point() + - labs(x = "Base Pair Position", y = "PHRED-sacled Quality (QUAL)") + + labs(x = "Base Pair Position", + y = "PHRED-sacled Quality (QUAL)") + facet_grid(sample_id ~ .) ``` -## Barplots +## Bar charts -We can create barplots using the +We can create bar charts using the [`geom_bar`](https://ggplot2.tidyverse.org/reference/geom_bar.html) -geom. Let's make a barplot showing the number of variants for each +geom. Let's make a bar chart showing the number of variants for each sample that are indels. !!! r-project "r" @@ -518,7 +587,7 @@ sample that are indels. Since we already have the `sample_id` labels on the individual plot facets, we don’t need the legend. Use the help file for geom_bar and any other online resources you want to use to remove the legend from the plot. - ??? success "Output" + ??? success "Solution" !!! r-project "r" @@ -528,6 +597,10 @@ sample that are indels. facet_grid(sample_id ~ .) ``` +Notice that we did not need to map a variable to `y` in the aesthetics. This is +because `geom_bar()` recognises, by default, that it should be counting the +number of observations that are indels (`TRUE`) or not (`FALSE`). Take a look at +the help page for `geom_bar()` to find out how this behaviour is controlled. ## Density @@ -548,9 +621,10 @@ variants is about 10 reads. !!! question "Challenge" - Use geom_density to plot the distribution of DP with a different fill for each sample. Use a white background for the plot. + Use `geom_density()` to plot the distribution of DP with a different fill + for each sample. Use a white background for the plot. - ??? success "Output" + ??? success "Solution" !!! r-project "r" @@ -560,6 +634,19 @@ variants is about 10 reads. theme_bw() ``` +## Box plot + +A box plot helps us to visualise the spread of grouped data. Let's take a look +at the spread of read depth in different samples and whether or not the spread +looks different if it is an indel. + +!!! r-project "r" + + ```r + ggplot(data = variants, aes(x = sample_id, y = DP, fill = INDEL)) + + geom_boxplot(alpha = 0.5) + + theme_bw() + ``` ## **`ggplot2`** themes @@ -597,16 +684,16 @@ of interest for biologists (although not covered in this lesson) that are worth exploring, such as - [`geom_tile()`](https://ggplot2.tidyverse.org/reference/geom_tile.html), - for heatmaps, + for heatmaps - [`geom_jitter()`](https://ggplot2.tidyverse.org/reference/geom_jitter.html), - for strip charts, and + for strip charts - [`geom_violin()`](https://ggplot2.tidyverse.org/reference/geom_violin.html), for violin plots !!! earth-americas "Resources" - - [ggplot2: Elegant Graphics for Data Analysis](https://www.amazon.com/gp/product/331924275X/) ([online version](https://ggplot2-book.org/)) - - [The Grammar of Graphics (Statistics and Computing)](https://www.amazon.com/Grammar-Graphics-Statistics-Computing/dp/0387245448/) - - [Data Visualization: A Practical Introduction](https://www.amazon.com/gp/product/0691181624/) ([online version](https://socviz.co/)) - - [The R Graph Gallery](https://r-graph-gallery.com/) ([the book](https://bookdown.org/content/b298e479-b1ab-49fa-b83d-a57c2b034d49/)) + - [ggplot2: Elegant Graphics for Data Analysis](https://www.amazon.com/gp/product/331924275X/) ([online version](https://ggplot2-book.org/)) + - [The Grammar of Graphics (Statistics and Computing)](https://www.amazon.com/Grammar-Graphics-Statistics-Computing/dp/0387245448/) + - [Data Visualization: A Practical Introduction](https://www.amazon.com/gp/product/0691181624/) ([online version](https://socviz.co/)) + - [The R Graph Gallery](https://r-graph-gallery.com/) ([the book](https://bookdown.org/content/b298e479-b1ab-49fa-b83d-a57c2b034d49/)) diff --git a/docs/07-knitr-markdown.md b/docs/07-knitr-markdown.md index 56e51ac..abd14f3 100644 --- a/docs/07-knitr-markdown.md +++ b/docs/07-knitr-markdown.md @@ -2,7 +2,7 @@ !!! info - === "Keypoints" + === "Key points" - Keep reporting and R software together in one document using R Markdown. @@ -51,14 +51,14 @@ can just re-compile the report and get the new or corrected results (versus having to reconstruct figures, paste them into a Word document, and further hand-edit various detailed results). -The key tool for R is [knitr](http://yihui.name/knitr/), which allows +The key tool for R is [`knitr`](http://yihui.name/knitr/), which allows you to create a document that is a mixture of text and some chunks of -code. When the document is processed by knitr, chunks of R code will be +code. When the document is processed by `knitr`, chunks of R code will be executed, and graphs or other results inserted. This sort of idea has been called "literate programming". -knitr allows you to mix basically any sort of text with any sort of +`knitr` allows you to mix basically any sort of text with any sort of code, but we recommend that you use R Markdown, which mixes Markdown with R. Markdown is a light-weight mark-up language for creating web pages. @@ -68,9 +68,7 @@ pages. Within R Studio, click File → New File → R Markdown and you'll get a dialog box like this: - +![Alt text](figures/open_new_Rmd.png) You can stick with the default (HTML output), but give it a title. @@ -185,7 +183,7 @@ You can make a hyperlink like this: You can include an image file like this: `![caption](http://url/for/file)` -You can do subscripts (e.g., F~2~) with `F~2` and superscripts (e.g., +You can do subscripts (e.g., F~2~) with `F~2~` and superscripts (e.g., F^2^) with `F^2^`. If you know how to write equations in diff --git a/docs/08-r-help.md b/docs/08-r-help.md index b50c566..9e51796 100644 --- a/docs/08-r-help.md +++ b/docs/08-r-help.md @@ -2,7 +2,7 @@ !!! info - === "Keypoints" + === "Key points" - R provides thousands of functions for analyzing data, and provides several way to get help - Using R will mean searching for online help, and there are tips and resources on how to search effectively @@ -55,10 +55,10 @@ Often, in order to duplicate the issue you are having, someone may need to see the data you are working with or verify the versions of R or R packages you are using. The following R functions will help with this: -You can **check the version of R** you are working with using the -`sessionInfo()` function. Actually, it is good to save this information -as part of your notes on any analysis you are doing. When you run the -same script that has worked fine a dozen times before, looking back at +You can **check the version of R** (and any loaded packages) you are working +with using the `sessionInfo()` function. Actually, it is good to save this +information as part of your notes on any analysis you are doing. When you run +the same script that has worked fine a dozen times before, looking back at these notes will remind you that you upgraded R and forget to check your script. @@ -91,12 +91,11 @@ script. Many times, there may be some issues with your data and the way it is formatted. In that case, you may want to share that data with someone else. However, you may not need to share the whole dataset; looking at a -subset of your 50,000 row, 10,000 column dataframe may be TMI (too much -information)! You can take an object you have in memory such as -dataframe (if you don't know what this means yet, we will get to it!) -and save it to a file. In our example we will use the `dput()` function -on the `iris` dataframe which is an example dataset that is installed in -R: +subset of your 50,000 row, 10,000 column data frame may be TMI (too much +information)! You can take an object you have in memory such as a +data frame and save it to a file. In our example we will use the `dput()` +function on the `iris` data frame which is an example dataset that is installed +in R: !!! r-project "r" @@ -163,15 +162,15 @@ them here because they come up commonly: 1:101 # generates the sequence of numbers from 1 to 101 ``` -!!! success "Output" - - ``` - [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 - [23] 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 - [45] 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 - [67] 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 - [89] 89 90 91 92 93 94 95 96 97 98 99 100 101 - ``` + !!! success "Output" + + ``` + [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 + [23] 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 + [45] 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 + [67] 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 + [89] 89 90 91 92 93 94 95 96 97 98 99 100 101 + ``` In the output above, `[89]` indicates that the first value on that line is the 89th item in your result diff --git a/docs/appendix/04-bioconductor-vcfr.md b/docs/appendix/04-bioconductor-vcfr.md index 9982c84..089f350 100644 --- a/docs/appendix/04-bioconductor-vcfr.md +++ b/docs/appendix/04-bioconductor-vcfr.md @@ -2,7 +2,7 @@ !!! info - === "Keypoints" + === "Keyp oints" - Bioconductor is an alternative package repository for bioinformatics packages. - Installing packages from Bioconductor requires a new method, since it is not compatible with the `install.packages()` function used for CRAN. diff --git a/docs/appendix/05-dplyr.md b/docs/appendix/05-dplyr.md index 07ad593..de69084 100644 --- a/docs/appendix/05-dplyr.md +++ b/docs/appendix/05-dplyr.md @@ -2,7 +2,7 @@ !!! info - === "Keypoints" + === "Key points" - Use the `dplyr` package to manipulate data frames. - Use `glimpse()` to quickly look at your data frame. @@ -12,6 +12,7 @@ - Use `group_by()` and `summarize()` to work with subsets of data. === "Objectives" + - Describe what the `dplyr` package in R is used for. - Apply common `dplyr` functions to manipulate data in R. - Employ the 'pipe' operator to link together a sequence of functions. @@ -88,25 +89,23 @@ package. many useful collection of packages for this lesson and beyond. However, when teaching or following this lesson, we advise that participants install `dplyr`, `readr`, `ggplot2`, and `tidyr` - individually as shown above. Otherwise, a substaial amount of the - lesson will be spend waiting for the installation to complete. + individually as shown above. Otherwise, a substantial amount of the + lesson will be spent waiting for the installation to complete. ## What is dplyr? -The package `dplyr` is a fairly new (2014) package that tries to provide -easy tools for the most common data manipulation tasks. This package is -also included in the [`tidyverse` package](https://www.tidyverse.org/), -which is a collection of eight different packages (`dplyr`, `ggplot2`, -`tibble`, `tidyr`, `readr`, `purrr`, `stringr`, and `forcats`). It is -built to work directly with data frames. The thinking behind it was -largely inspired by the package `plyr` which has been in use for some -time but suffered from being slow in some cases.`dplyr` addresses this -by porting much of the computation to C++. An additional feature is the -ability to work with data stored directly in an external database. The -benefits of doing this are that the data can be managed natively in a -relational database, queries can be conducted on that database, and only -the results of the query returned. +The package `dplyr` tries to provide easy tools for the most common data +manipulation tasks. This package is also included in the [`tidyverse` +package](https://www.tidyverse.org/), which is a collection of eight different +packages (`dplyr`, `ggplot2`, `tibble`, `tidyr`, `readr`, `purrr`, `stringr`, +and `forcats`). It is built to work directly with data frames. The thinking +behind it was largely inspired by the package `plyr` which has been in use for +some time but suffered from being slow in some cases.`dplyr` addresses this by +porting much of the computation to C++. An additional feature is the ability to +work with data stored directly in an external database. The benefits of doing +this are that the data can be managed natively in a relational database, queries +can be conducted on that database, and only the results of the query returned. This addresses a common problem with R in that all operations are conducted in memory and thus the amount of data you can work with is @@ -117,8 +116,8 @@ analysis in R. ## Loading .csv files in tidy style -The Tidyverse's `readr` package provides its own unique way of loading -.csv files in to R using `read_csv()`, which is similar to `read.csv()`. +Tidyverse's `readr` package provides its own unique way of loading +`.csv` files in to R using `read_csv()`, which is similar to `read.csv()`. `read_csv()` allows users to load in their data faster, doesn't create row names, and allows you to access non-standard variable names (ie. variables that start with numbers of contain spaces), and outputs your @@ -133,21 +132,21 @@ Now let's load our vcf .csv file using `read_csv()`: variants <- read_csv("combined_tidy_vcf.csv") ``` -??? success "Output" + ??? success "Output" - ``` - Rows: 801 Columns: 29 - - ── Column specification ──────────────────────────────────────────────────────────────────────── - Delimiter: "," - chr (7): sample_id, CHROM, REF, ALT, DP4, Indiv, gt_GT_alleles - dbl (16): POS, QUAL, IDV, IMF, DP, VDB, RPB, MQB, BQB, MQSB, SGB, MQ0F, AC, AN, MQ, gt_GT - num (1): gt_PL - lgl (5): ID, FILTER, INDEL, ICB, HOB - - ℹ Use `spec()` to retrieve the full column specification for this data. - ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message. - ``` + ``` + Rows: 801 Columns: 29 + + ── Column specification ──────────────────────────────────────────────────────────────────────── + Delimiter: "," + chr (7): sample_id, CHROM, REF, ALT, DP4, Indiv, gt_GT_alleles + dbl (16): POS, QUAL, IDV, IMF, DP, VDB, RPB, MQB, BQB, MQSB, SGB, MQ0F, AC, AN, MQ, gt_GT + num (1): gt_PL + lgl (5): ID, FILTER, INDEL, ICB, HOB + + ℹ Use `spec()` to retrieve the full column specification for this data. + ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message. + ``` ## Taking a quick look at data frames @@ -160,41 +159,41 @@ function that (as the name suggests) gives a glimpse of the data frame. glimpse(variants) ``` -??? success "Output" - - ``` - Rows: 801 - Columns: 29 - $ sample_id "SRR2584863", "SRR2584863", "SRR2584863", "SRR2584863", "SRR2584863", "S… - $ CHROM "CP000819.1", "CP000819.1", "CP000819.1", "CP000819.1", "CP000819.1", "C… - $ POS 9972, 263235, 281923, 433359, 473901, 648692, 1331794, 1733343, 2103887,… - $ ID NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, … - $ REF "T", "G", "G", "CTTTTTTT", "CCGC", "C", "C", "G", "ACAGCCAGCCAGCCAGCCAGC… - $ ALT "G", "T", "T", "CTTTTTTTT", "CCGCGC", "T", "A", "A", "ACAGCCAGCCAGCCAGCC… - $ QUAL 91.0000, 85.0000, 217.0000, 64.0000, 228.0000, 210.0000, 178.0000, 225.0… - $ FILTER NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, … - $ INDEL FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE,… - $ IDV NA, NA, NA, 12, 9, NA, NA, NA, 2, 7, NA, NA, NA, NA, NA, NA, NA, NA, NA,… - $ IMF NA, NA, NA, 1.000000, 0.900000, NA, NA, NA, 0.666667, 1.000000, NA, NA, … - $ DP 4, 6, 10, 12, 10, 10, 8, 11, 3, 7, 9, 20, 12, 19, 15, 10, 14, 9, 13, 2, … - $ VDB 0.0257451, 0.0961330, 0.7740830, 0.4777040, 0.6595050, 0.2680140, 0.6240… - $ RPB NA, 1.000000, NA, NA, NA, NA, NA, NA, NA, NA, 0.900802, NA, 0.954207, NA… - $ MQB NA, 1.0000000, NA, NA, NA, NA, NA, NA, NA, NA, 0.1501340, NA, 0.0497871,… - $ BQB NA, 1.000000, NA, NA, NA, NA, NA, NA, NA, NA, 0.750668, NA, 0.774755, NA… - $ MQSB NA, NA, 0.974597, 1.000000, 0.916482, 0.916482, 0.900802, 1.007750, 1.00… - $ SGB -0.556411, -0.590765, -0.662043, -0.676189, -0.662043, -0.670168, -0.651… - $ MQ0F 0.000000, 0.166667, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.… - $ ICB NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, … - $ HOB NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, … - $ AC 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, … - $ AN 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, … - $ DP4 "0,0,0,4", "0,1,0,5", "0,0,4,5", "0,1,3,8", "1,0,2,7", "0,0,7,3", "0,0,3… - $ MQ 60, 33, 60, 60, 60, 60, 60, 60, 60, 60, 25, 60, 10, 60, 60, 60, 60, 60, … - $ Indiv "/home/dcuser/dc_workshop/results/bam/SRR2584863.aligned.sorted.bam", "/… - $ gt_PL 1210, 1120, 2470, 910, 2550, 2400, 2080, 2550, 11128, 1940, 1310, 2550, … - $ gt_GT 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, … - $ gt_GT_alleles "G", "T", "T", "CTTTTTTTT", "CCGCGC", "T", "A", "A", "ACAGCCAGCCAGCCAGCC… - ``` + ??? success "Output" + + ``` + Rows: 801 + Columns: 29 + $ sample_id "SRR2584863", "SRR2584863", "SRR2584863", "SRR2584863", "SRR2584863", "S… + $ CHROM "CP000819.1", "CP000819.1", "CP000819.1", "CP000819.1", "CP000819.1", "C… + $ POS 9972, 263235, 281923, 433359, 473901, 648692, 1331794, 1733343, 2103887,… + $ ID NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, … + $ REF "T", "G", "G", "CTTTTTTT", "CCGC", "C", "C", "G", "ACAGCCAGCCAGCCAGCCAGC… + $ ALT "G", "T", "T", "CTTTTTTTT", "CCGCGC", "T", "A", "A", "ACAGCCAGCCAGCCAGCC… + $ QUAL 91.0000, 85.0000, 217.0000, 64.0000, 228.0000, 210.0000, 178.0000, 225.0… + $ FILTER NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, … + $ INDEL FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE,… + $ IDV NA, NA, NA, 12, 9, NA, NA, NA, 2, 7, NA, NA, NA, NA, NA, NA, NA, NA, NA,… + $ IMF NA, NA, NA, 1.000000, 0.900000, NA, NA, NA, 0.666667, 1.000000, NA, NA, … + $ DP 4, 6, 10, 12, 10, 10, 8, 11, 3, 7, 9, 20, 12, 19, 15, 10, 14, 9, 13, 2, … + $ VDB 0.0257451, 0.0961330, 0.7740830, 0.4777040, 0.6595050, 0.2680140, 0.6240… + $ RPB NA, 1.000000, NA, NA, NA, NA, NA, NA, NA, NA, 0.900802, NA, 0.954207, NA… + $ MQB NA, 1.0000000, NA, NA, NA, NA, NA, NA, NA, NA, 0.1501340, NA, 0.0497871,… + $ BQB NA, 1.000000, NA, NA, NA, NA, NA, NA, NA, NA, 0.750668, NA, 0.774755, NA… + $ MQSB NA, NA, 0.974597, 1.000000, 0.916482, 0.916482, 0.900802, 1.007750, 1.00… + $ SGB -0.556411, -0.590765, -0.662043, -0.676189, -0.662043, -0.670168, -0.651… + $ MQ0F 0.000000, 0.166667, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.… + $ ICB NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, … + $ HOB NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, … + $ AC 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, … + $ AN 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, … + $ DP4 "0,0,0,4", "0,1,0,5", "0,0,4,5", "0,1,3,8", "1,0,2,7", "0,0,7,3", "0,0,3… + $ MQ 60, 33, 60, 60, 60, 60, 60, 60, 60, 60, 25, 60, 10, 60, 60, 60, 60, 60, … + $ Indiv "/home/dcuser/dc_workshop/results/bam/SRR2584863.aligned.sorted.bam", "/… + $ gt_PL 1210, 1120, 2470, 910, 2550, 2400, 2080, 2550, 11128, 1940, 1310, 2550, … + $ gt_GT 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, … + $ gt_GT_alleles "G", "T", "T", "CTTTTTTTT", "CCGCGC", "T", "A", "A", "ACAGCCAGCCAGCCAGCC… + ``` In the above output, we can already gather some information about `variants`, such as the number of rows and columns, column names, type @@ -214,25 +213,25 @@ arguments are the columns to keep. select(variants, sample_id, REF, ALT, DP) ``` -??? success "Output" - - ``` - # A tibble: 801 × 4 - sample_id REF ALT DP - - 1 SRR2584863 T G 4 - 2 SRR2584863 G T 6 - 3 SRR2584863 G T 10 - 4 SRR2584863 CTTTTTTT CTTTTTTTT 12 - 5 SRR2584863 CCGC CCGCGC 10 - 6 SRR2584863 C T 10 - 7 SRR2584863 C A 8 - 8 SRR2584863 G A 11 - 9 SRR2584863 ACAGCCAGCCAGCCAGCCAGCCAGCCAGCCAG ACAGCCAGCCAGCCAGCCAGCCAGCCAGCCAGCCAGCCAGCC… 3 - 10 SRR2584863 AT ATT 7 - # … with 791 more rows - # ℹ Use `print(n = ...)` to see more rows - ``` + ??? success "Output" + + ``` + # A tibble: 801 × 4 + sample_id REF ALT DP + + 1 SRR2584863 T G 4 + 2 SRR2584863 G T 6 + 3 SRR2584863 G T 10 + 4 SRR2584863 CTTTTTTT CTTTTTTTT 12 + 5 SRR2584863 CCGC CCGCGC 10 + 6 SRR2584863 C T 10 + 7 SRR2584863 C A 8 + 8 SRR2584863 G A 11 + 9 SRR2584863 ACAGCCAGCCAGCCAGCCAGCCAGCCAGCCAG ACAGCCAGCCAGCCAGCCAGCCAGCCAGCCAGCCAGCCAGCC… 3 + 10 SRR2584863 AT ATT 7 + # … with 791 more rows + # ℹ Use `print(n = ...)` to see more rows + ``` To select all columns *except* certain ones, put a "-" in front of the variable to exclude it. @@ -243,27 +242,27 @@ variable to exclude it. select(variants, -CHROM) ``` -??? success "Output" - - ``` - # A tibble: 801 × 28 - sample_id POS ID REF ALT QUAL FILTER INDEL IDV IMF DP VDB RPB MQB - - 1 SRR2584863 9972 NA T G 91 NA FALSE NA NA 4 0.0257 NA NA - 2 SRR2584863 263235 NA G T 85 NA FALSE NA NA 6 0.0961 1 1 - 3 SRR2584863 281923 NA G T 217 NA FALSE NA NA 10 0.774 NA NA - 4 SRR2584863 433359 NA CTTT… CTTT… 64 NA TRUE 12 1 12 0.478 NA NA - 5 SRR2584863 473901 NA CCGC CCGC… 228 NA TRUE 9 0.9 10 0.660 NA NA - 6 SRR2584863 648692 NA C T 210 NA FALSE NA NA 10 0.268 NA NA - 7 SRR2584863 1331794 NA C A 178 NA FALSE NA NA 8 0.624 NA NA - 8 SRR2584863 1733343 NA G A 225 NA FALSE NA NA 11 0.992 NA NA - 9 SRR2584863 2103887 NA ACAG… ACAG… 56 NA TRUE 2 0.667 3 0.902 NA NA - 10 SRR2584863 2333538 NA AT ATT 167 NA TRUE 7 1 7 0.568 NA NA - # … with 791 more rows, and 14 more variables: BQB , MQSB , SGB , MQ0F , - # ICB , HOB , AC , AN , DP4 , MQ , Indiv , gt_PL , - # gt_GT , gt_GT_alleles - # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names - ``` + ??? success "Output" + + ``` + # A tibble: 801 × 28 + sample_id POS ID REF ALT QUAL FILTER INDEL IDV IMF DP VDB RPB MQB + + 1 SRR2584863 9972 NA T G 91 NA FALSE NA NA 4 0.0257 NA NA + 2 SRR2584863 263235 NA G T 85 NA FALSE NA NA 6 0.0961 1 1 + 3 SRR2584863 281923 NA G T 217 NA FALSE NA NA 10 0.774 NA NA + 4 SRR2584863 433359 NA CTTT… CTTT… 64 NA TRUE 12 1 12 0.478 NA NA + 5 SRR2584863 473901 NA CCGC CCGC… 228 NA TRUE 9 0.9 10 0.660 NA NA + 6 SRR2584863 648692 NA C T 210 NA FALSE NA NA 10 0.268 NA NA + 7 SRR2584863 1331794 NA C A 178 NA FALSE NA NA 8 0.624 NA NA + 8 SRR2584863 1733343 NA G A 225 NA FALSE NA NA 11 0.992 NA NA + 9 SRR2584863 2103887 NA ACAG… ACAG… 56 NA TRUE 2 0.667 3 0.902 NA NA + 10 SRR2584863 2333538 NA AT ATT 167 NA TRUE 7 1 7 0.568 NA NA + # … with 791 more rows, and 14 more variables: BQB , MQSB , SGB , MQ0F , + # ICB , HOB , AC , AN , DP4 , MQ , Indiv , gt_PL , + # gt_GT , gt_GT_alleles + # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names + ``` `dplyr` also provides useful functions to select columns based on their names. For instance, `ends_with()` allows you to select columns that @@ -276,25 +275,25 @@ columns that end with the letter "B": select(variants, ends_with("B")) ``` -??? success "Output" - - ``` - # A tibble: 801 × 8 - VDB RPB MQB BQB MQSB SGB ICB HOB - - 1 0.0257 NA NA NA NA -0.556 NA NA - 2 0.0961 1 1 1 NA -0.591 NA NA - 3 0.774 NA NA NA 0.975 -0.662 NA NA - 4 0.478 NA NA NA 1 -0.676 NA NA - 5 0.660 NA NA NA 0.916 -0.662 NA NA - 6 0.268 NA NA NA 0.916 -0.670 NA NA - 7 0.624 NA NA NA 0.901 -0.651 NA NA - 8 0.992 NA NA NA 1.01 -0.670 NA NA - 9 0.902 NA NA NA 1 -0.454 NA NA - 10 0.568 NA NA NA 1.01 -0.617 NA NA - # … with 791 more rows - # ℹ Use `print(n = ...)` to see more rows - ``` + ??? success "Output" + + ``` + # A tibble: 801 × 8 + VDB RPB MQB BQB MQSB SGB ICB HOB + + 1 0.0257 NA NA NA NA -0.556 NA NA + 2 0.0961 1 1 1 NA -0.591 NA NA + 3 0.774 NA NA NA 0.975 -0.662 NA NA + 4 0.478 NA NA NA 1 -0.676 NA NA + 5 0.660 NA NA NA 0.916 -0.662 NA NA + 6 0.268 NA NA NA 0.916 -0.670 NA NA + 7 0.624 NA NA NA 0.901 -0.651 NA NA + 8 0.992 NA NA NA 1.01 -0.670 NA NA + 9 0.902 NA NA NA 1 -0.454 NA NA + 10 0.568 NA NA NA 1.01 -0.617 NA NA + # … with 791 more rows + # ℹ Use `print(n = ...)` to see more rows + ``` !!! question "Challenge" @@ -317,25 +316,25 @@ columns that end with the letter "B": variants_result ``` - !!! success "Output" - - ``` - # A tibble: 801 × 7 - POS sample_id ID INDEL IDV IMF ICB - - 1 9972 SRR2584863 NA FALSE NA NA NA - 2 263235 SRR2584863 NA FALSE NA NA NA - 3 281923 SRR2584863 NA FALSE NA NA NA - 4 433359 SRR2584863 NA TRUE 12 1 NA - 5 473901 SRR2584863 NA TRUE 9 0.9 NA - 6 648692 SRR2584863 NA FALSE NA NA NA - 7 1331794 SRR2584863 NA FALSE NA NA NA - 8 1733343 SRR2584863 NA FALSE NA NA NA - 9 2103887 SRR2584863 NA TRUE 2 0.667 NA - 10 2333538 SRR2584863 NA TRUE 7 1 NA - # … with 791 more rows - # ℹ Use `print(n = ...)` to see more rows - ``` + ??? success "Output" + + ``` + # A tibble: 801 × 7 + POS sample_id ID INDEL IDV IMF ICB + + 1 9972 SRR2584863 NA FALSE NA NA NA + 2 263235 SRR2584863 NA FALSE NA NA NA + 3 281923 SRR2584863 NA FALSE NA NA NA + 4 433359 SRR2584863 NA TRUE 12 1 NA + 5 473901 SRR2584863 NA TRUE 9 0.9 NA + 6 648692 SRR2584863 NA FALSE NA NA NA + 7 1331794 SRR2584863 NA FALSE NA NA NA + 8 1733343 SRR2584863 NA FALSE NA NA NA + 9 2103887 SRR2584863 NA TRUE 2 0.667 NA + 10 2333538 SRR2584863 NA TRUE 7 1 NA + # … with 791 more rows + # ℹ Use `print(n = ...)` to see more rows + ``` ??? success "Alternative solution" @@ -354,27 +353,27 @@ To choose rows, use `filter()`: filter(variants, sample_id == "SRR2584863") ``` -??? success "Output" - - ``` - # A tibble: 25 × 29 - sample_id CHROM POS ID REF ALT QUAL FILTER INDEL IDV IMF DP VDB RPB - - 1 SRR2584863 CP000… 9.97e3 NA T G 91 NA FALSE NA NA 4 0.0257 NA - 2 SRR2584863 CP000… 2.63e5 NA G T 85 NA FALSE NA NA 6 0.0961 1 - 3 SRR2584863 CP000… 2.82e5 NA G T 217 NA FALSE NA NA 10 0.774 NA - 4 SRR2584863 CP000… 4.33e5 NA CTTT… CTTT… 64 NA TRUE 12 1 12 0.478 NA - 5 SRR2584863 CP000… 4.74e5 NA CCGC CCGC… 228 NA TRUE 9 0.9 10 0.660 NA - 6 SRR2584863 CP000… 6.49e5 NA C T 210 NA FALSE NA NA 10 0.268 NA - 7 SRR2584863 CP000… 1.33e6 NA C A 178 NA FALSE NA NA 8 0.624 NA - 8 SRR2584863 CP000… 1.73e6 NA G A 225 NA FALSE NA NA 11 0.992 NA - 9 SRR2584863 CP000… 2.10e6 NA ACAG… ACAG… 56 NA TRUE 2 0.667 3 0.902 NA - 10 SRR2584863 CP000… 2.33e6 NA AT ATT 167 NA TRUE 7 1 7 0.568 NA - # … with 15 more rows, and 15 more variables: MQB , BQB , MQSB , SGB , - # MQ0F , ICB , HOB , AC , AN , DP4 , MQ , Indiv , - # gt_PL , gt_GT , gt_GT_alleles - # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names - ``` + ??? success "Output" + + ``` + # A tibble: 25 × 29 + sample_id CHROM POS ID REF ALT QUAL FILTER INDEL IDV IMF DP VDB RPB + + 1 SRR2584863 CP000… 9.97e3 NA T G 91 NA FALSE NA NA 4 0.0257 NA + 2 SRR2584863 CP000… 2.63e5 NA G T 85 NA FALSE NA NA 6 0.0961 1 + 3 SRR2584863 CP000… 2.82e5 NA G T 217 NA FALSE NA NA 10 0.774 NA + 4 SRR2584863 CP000… 4.33e5 NA CTTT… CTTT… 64 NA TRUE 12 1 12 0.478 NA + 5 SRR2584863 CP000… 4.74e5 NA CCGC CCGC… 228 NA TRUE 9 0.9 10 0.660 NA + 6 SRR2584863 CP000… 6.49e5 NA C T 210 NA FALSE NA NA 10 0.268 NA + 7 SRR2584863 CP000… 1.33e6 NA C A 178 NA FALSE NA NA 8 0.624 NA + 8 SRR2584863 CP000… 1.73e6 NA G A 225 NA FALSE NA NA 11 0.992 NA + 9 SRR2584863 CP000… 2.10e6 NA ACAG… ACAG… 56 NA TRUE 2 0.667 3 0.902 NA + 10 SRR2584863 CP000… 2.33e6 NA AT ATT 167 NA TRUE 7 1 7 0.568 NA + # … with 15 more rows, and 15 more variables: MQB , BQB , MQSB , SGB , + # MQ0F , ICB , HOB , AC , AN , DP4 , MQ , Indiv , + # gt_PL , gt_GT , gt_GT_alleles + # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names + ``` `filter()` will keep all the rows that match the conditions that are provided. Here are a few examples: @@ -428,7 +427,7 @@ logical operator, you can specify it explicitly: Select all the mutations that occurred between the positions 1e6 (one million) and 2e6 (inclusive) that have a QUAL greater than 200, and exclude INDEL mutations. Hint: to flip logical values such as TRUE to - a FALSE, we can use to negation symbol "!". (eg. !TRUE == FALSE). + a FALSE, we can use to negation symbol `!`. (eg. `!TRUE == FALSE`). ??? success "Solution" @@ -438,43 +437,39 @@ logical operator, you can specify it explicitly: filter(variants, POS >= 1e6 & POS <= 2e6, QUAL > 200, !INDEL) ``` - !!! success "Output" - - ``` - # A tibble: 77 × 29 - sampl…¹ CHROM POS ID REF ALT QUAL FILTER INDEL IDV IMF DP VDB RPB MQB - - 1 SRR258… CP00… 1.73e6 NA G A 225 NA FALSE NA NA 11 0.992 NA NA - 2 SRR258… CP00… 1.00e6 NA A G 225 NA FALSE NA NA 15 0.481 NA NA - 3 SRR258… CP00… 1.02e6 NA A G 225 NA FALSE NA NA 12 0.242 NA NA - 4 SRR258… CP00… 1.06e6 NA C T 225 NA FALSE NA NA 17 0.346 NA NA - 5 SRR258… CP00… 1.06e6 NA A G 206 NA FALSE NA NA 9 0.630 NA NA - 6 SRR258… CP00… 1.07e6 NA G T 225 NA FALSE NA NA 11 0.349 NA NA - 7 SRR258… CP00… 1.07e6 NA T C 225 NA FALSE NA NA 12 0.196 NA NA - 8 SRR258… CP00… 1.10e6 NA C T 225 NA FALSE NA NA 15 0.454 NA NAincluding - 9 SRR258… CP00… 1.11e6 NA C T 212 NA FALSE NA NA 9 0.179 NA NA - 10 SRR258… CP00… 1.11e6 NA A G 225 NA FALSE NA NA 14 0.909 NA NA - # … with 67 more rows, 14 more variables: BQB , MQSB , SGB , MQ0F , - # ICB , HOB , AC , AN , DP4 , MQ , Indiv , gt_PL , - # gt_GT , gt_GT_alleles , and abbreviated variable name ¹​sample_id - # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names - ``` + ??? success "Output" + + ``` + # A tibble: 77 × 29 + sampl…¹ CHROM POS ID REF ALT QUAL FILTER INDEL IDV IMF DP VDB RPB MQB + + 1 SRR258… CP00… 1.73e6 NA G A 225 NA FALSE NA NA 11 0.992 NA NA + 2 SRR258… CP00… 1.00e6 NA A G 225 NA FALSE NA NA 15 0.481 NA NA + 3 SRR258… CP00… 1.02e6 NA A G 225 NA FALSE NA NA 12 0.242 NA NA + 4 SRR258… CP00… 1.06e6 NA C T 225 NA FALSE NA NA 17 0.346 NA NA + 5 SRR258… CP00… 1.06e6 NA A G 206 NA FALSE NA NA 9 0.630 NA NA + 6 SRR258… CP00… 1.07e6 NA G T 225 NA FALSE NA NA 11 0.349 NA NA + 7 SRR258… CP00… 1.07e6 NA T C 225 NA FALSE NA NA 12 0.196 NA NA + 8 SRR258… CP00… 1.10e6 NA C T 225 NA FALSE NA NA 15 0.454 NA NAincluding + 9 SRR258… CP00… 1.11e6 NA C T 212 NA FALSE NA NA 9 0.179 NA NA + 10 SRR258… CP00… 1.11e6 NA A G 225 NA FALSE NA NA 14 0.909 NA NA + # … with 67 more rows, 14 more variables: BQB , MQSB , SGB , MQ0F , + # ICB , HOB , AC , AN , DP4 , MQ , Indiv , gt_PL , + # gt_GT , gt_GT_alleles , and abbreviated variable name ¹​sample_id + # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names + ``` ## Pipes -But what if you wanted to select and filter? We can do this with pipes. -Pipes, are a fairly recent addition to R. Pipes let you take the output -of one function and send it directly to the next, which is useful when -you need to many things to the same data set. It was possible to do this -before pipes were added to R, but it was much messier and more -difficult. Pipes in R look like `%>%` and are made available via the -`magrittr` package, which is installed as part of `dplyr`. If you use -RStudio, you can type the pipe with Ctrl + -Shift + M if -you're using a PC, or Cmd + -Shift + M if -you're using a Mac. +But what if you wanted to select and filter? We can do this with pipes. Pipes +let you take the output of one function and send it directly to the next, which +is useful when you need to many things to the same data set. It was possible to +do this before pipes were added to R, but it was much messier and more +difficult. Pipes in R look like `%>%` and are made available via the `magrittr` +package, which is installed as part of `dplyr`. If you use RStudio, you can type +the pipe with Ctrl + Shift + M if you're using +a PC, or Cmd + Shift + M if you're using a Mac. !!! r-project "r" @@ -484,25 +479,25 @@ you're using a Mac. select(REF, ALT, DP) ``` -??? success "Output" - - ``` - # A tibble: 25 × 3 - REF ALT DP - - 1 T G 4 - 2 G T 6 - 3 G T 10 - 4 CTTTTTTT CTTTTTTTT 12 - 5 CCGC CCGCGC 10 - 6 C T 10 - 7 C A 8 - 8 G A 11 - 9 ACAGCCAGCCAGCCAGCCAGCCAGCCAGCCAG ACAGCCAGCCAGCCAGCCAGCCAGCCAGCCAGCCAGCCAGCCAGCCAGCCAGC… 3 - 10 AT ATT 7 - # … with 15 more rows - # ℹ Use `print(n = ...)` to see more rows - ``` + ??? success "Output" + + ``` + # A tibble: 25 × 3 + REF ALT DP + + 1 T G 4 + 2 G T 6 + 3 G T 10 + 4 CTTTTTTT CTTTTTTTT 12 + 5 CCGC CCGCGC 10 + 6 C T 10 + 7 C A 8 + 8 G A 11 + 9 ACAGCCAGCCAGCCAGCCAGCCAGCCAGCCAG ACAGCCAGCCAGCCAGCCAGCCAGCCAGCCAGCCAGCCAGCCAGCCAGCCAGC… 3 + 10 AT ATT 7 + # … with 15 more rows + # ℹ Use `print(n = ...)` to see more rows + ``` In the above code, we use the pipe to send the `variants` data set first through `filter()`, to keep rows where `sample_id` matches a particular @@ -512,10 +507,10 @@ the first argument to the function on its right, we don't need to explicitly include the data frame as an argument to the `filter()` and `select()` functions any more. -Some may find it helpful to read the pipe like the word "then". For +Some may find it helpful to read the pipe like the phrase "*then*". For instance, in the above example, we took the data frame `variants`, -*then* we `filter`ed for rows where `sample_id` was SRR2584863, *then* -we `select`ed the `REF`, `ALT`, and `DP` columns, *then* we showed only +*then* we `filter()`ed for rows where `sample_id` was SRR2584863, *then* +we `select()`ed the `REF`, `ALT`, and `DP` columns, *then* we showed only the first six rows. The **`dplyr`** functions by themselves are somewhat simple, but by combining them into linear workflows with the pipe, we can accomplish more complex manipulations of data frames. @@ -540,25 +535,25 @@ just the first six rows to confirm it's what we want: SRR2584863_variants ``` -??? success "Output" - - ``` - # A tibble: 25 × 3 - REF ALT DP - - 1 T G 4 - 2 G T 6 - 3 G T 10 - 4 CTTTTTTT CTTTTTTTT 12 - 5 CCGC CCGCGC 10 - 6 C T 10 - 7 C A 8 - 8 G A 11 - 9 ACAGCCAGCCAGCCAGCCAGCCAGCCAGCCAG ACAGCCAGCCAGCCAGCCAGCCAGCCAGCCAGCCAGCCAGCCAGCCAGCCAGC… 3 - 10 AT ATT 7 - # … with 15 more rows - # ℹ Use `print(n = ...)` to see more rows - ``` + ??? success "Output" + + ``` + # A tibble: 25 × 3 + REF ALT DP + + 1 T G 4 + 2 G T 6 + 3 G T 10 + 4 CTTTTTTT CTTTTTTTT 12 + 5 CCGC CCGCGC 10 + 6 C T 10 + 7 C A 8 + 8 G A 11 + 9 ACAGCCAGCCAGCCAGCCAGCCAGCCAGCCAG ACAGCCAGCCAGCCAGCCAGCCAGCCAGCCAGCCAGCCAGCCAGCCAGCCAGC… 3 + 10 AT ATT 7 + # … with 15 more rows + # ℹ Use `print(n = ...)` to see more rows + ``` Similar to `head()` and `tail()` functions, we can also look at the first or last six rows using tidyverse function `slice()`. Slice is a @@ -570,19 +565,19 @@ more versatile function that allows users to specify a range to view: SRR2584863_variants %>% slice(1:6) ``` -??? success "Output" + ??? success "Output" - ``` - # A tibble: 6 × 3 - REF ALT DP - - 1 T G 4 - 2 G T 6 - 3 G T 10 - 4 CTTTTTTT CTTTTTTTT 12 - 5 CCGC CCGCGC 10 - 6 C T 10 - ``` + ``` + # A tibble: 6 × 3 + REF ALT DP + + 1 T G 4 + 2 G T 6 + 3 G T 10 + 4 CTTTTTTT CTTTTTTTT 12 + 5 CCGC CCGCGC 10 + 6 C T 10 + ``` !!! r-project "r" @@ -590,29 +585,29 @@ more versatile function that allows users to specify a range to view: SRR2584863_variants %>% slice(10:25) ``` -??? success "Output" - - ``` - # A tibble: 16 × 3 - REF ALT DP - - 1 AT ATT 7 - 2 A C 9 - 3 A C 20 - 4 G T 12 - 5 A T 19 - 6 G A 15 - 7 A C 10 - 8 C A 14 - 9 A G 9 - 10 A C 13 - 11 A AC 2 - 12 G T 10 - 13 A G 16 - 14 A C 11 - 15 TGG T 10 - 16 A C 9 - ``` + ??? success "Output" + + ``` + # A tibble: 16 × 3 + REF ALT DP + + 1 AT ATT 7 + 2 A C 9 + 3 A C 20 + 4 G T 12 + 5 A T 19 + 6 G A 15 + 7 A C 10 + 8 C A 14 + 9 A G 9 + 10 A C 13 + 11 A AC 2 + 12 G T 10 + 13 A G 16 + 14 A C 11 + 15 TGG T 10 + 16 A C 9 + ``` !!! question "Exercise: Pipe and filter" @@ -632,20 +627,20 @@ more versatile function that allows users to specify a range to view: select(sample_id, DP, REF, ALT, POS) ``` - !!! success "Output" - - ``` - # A tibble: 7 × 5 - sample_id DP REF ALT POS - - 1 SRR2584863 11 G A 1733343 - 2 SRR2584863 20 A C 2446984 - 3 SRR2584863 12 G T 2618472 - 4 SRR2584863 19 A T 2665639 - 5 SRR2584863 15 G A 2999330 - 6 SRR2584863 10 A C 3339313 - 7 SRR2584863 14 C A 3401754 - ``` + ??? success "Output" + + ``` + # A tibble: 7 × 5 + sample_id DP REF ALT POS + + 1 SRR2584863 11 G A 1733343 + 2 SRR2584863 20 A C 2446984 + 3 SRR2584863 12 G T 2618472 + 4 SRR2584863 19 A T 2665639 + 5 SRR2584863 15 G A 2999330 + 6 SRR2584863 10 A C 3339313 + 7 SRR2584863 14 C A 3401754 + ``` ## Mutate @@ -660,7 +655,7 @@ probability value according to the formula: $$\textrm {Probability} = 1- 10 ^ {-\frac{\textrm{QUAL}}{10}}$$ -We can use `mutate` to add a column (`POLPROB`) to our `variants` data +We can use `mutate()` to add a column (`POLPROB`) to our `variants` data frame that shows the probability of a polymorphism at that site given the data. @@ -687,25 +682,25 @@ the data. select(sample_id, POS, QUAL, POLPROB) ``` - !!! success "Output" - - ``` - # A tibble: 801 × 4 - sample_id POS QUAL POLPROB - - 1 SRR2584863 9972 91 1.00 - 2 SRR2584863 263235 85 1.00 - 3 SRR2584863 281923 217 1 - 4 SRR2584863 433359 64 1.00 - 5 SRR2584863 473901 228 1 - 6 SRR2584863 648692 210 1 - 7 SRR2584863 1331794 178 1 - 8 SRR2584863 1733343 225 1 - 9 SRR2584863 2103887 56 1.00 - 10 SRR2584863 2333538 167 1 - # … with 791 more rows - # ℹ Use `print(n = ...)` to see more rows - ``` + ??? success "Output" + + ``` + # A tibble: 801 × 4 + sample_id POS QUAL POLPROB + + 1 SRR2584863 9972 91 1.00 + 2 SRR2584863 263235 85 1.00 + 3 SRR2584863 281923 217 1 + 4 SRR2584863 433359 64 1.00 + 5 SRR2584863 473901 228 1 + 6 SRR2584863 648692 210 1 + 7 SRR2584863 1331794 178 1 + 8 SRR2584863 1733343 225 1 + 9 SRR2584863 2103887 56 1.00 + 10 SRR2584863 2333538 167 1 + # … with 791 more rows + # ℹ Use `print(n = ...)` to see more rows + ``` ## `group_by()` and `summarize()` functions @@ -727,16 +722,16 @@ each sample using the function `tally()`: tally() ``` -??? success "Output" + ??? success "Output" - ``` - # A tibble: 3 × 2 - sample_id n - - 1 SRR2584863 25 - 2 SRR2584866 766 - 3 SRR2589044 10 - ``` + ``` + # A tibble: 3 × 2 + sample_id n + + 1 SRR2584863 25 + 2 SRR2584866 766 + 3 SRR2589044 10 + ``` Since counting or tallying values is a common use case for `group_by()`, an alternative function was created to bypasses `group_by()` using the @@ -784,7 +779,7 @@ the data. ![split_apply_combine](../figures/split_apply_combine.png){width="500"}[^1] We can also apply many other functions to individual columns to get -other summary statistics. For example,we can use built-in functions like +other summary statistics. For example, we can use built-in functions like `mean()`, `median()`, `min()`, and `max()`. These are called "built-in functions" because they come with R and don't require that you install any additional packages. By default, all **R functions operating on @@ -810,16 +805,16 @@ for each sample: ) ``` -??? success "Output" + ??? success "Output" - ``` - # A tibble: 3 × 5 - sample_id mean_DP median_DP min_DP max_DP - - 1 SRR2584863 10.4 10 2 20 - 2 SRR2584866 10.6 10 2 79 - 3 SRR2589044 9.3 9.5 3 16 - ``` + ``` + # A tibble: 3 × 5 + sample_id mean_DP median_DP min_DP max_DP + + 1 SRR2584863 10.4 10 2 20 + 2 SRR2584866 10.6 10 2 79 + 3 SRR2589044 9.3 9.5 3 16 + ``` ## Reshaping data frames @@ -842,14 +837,14 @@ name that will become the cells in the wide data. variants_wide ``` -??? success "Output" + ??? success "Output" - ``` - # A tibble: 1 × 4 - CHROM SRR2584863 SRR2584866 SRR2589044 - - 1 CP000819.1 10.4 10.6 9.3 - ``` + ``` + # A tibble: 1 × 4 + CHROM SRR2584863 SRR2584866 SRR2589044 + + 1 CP000819.1 10.4 10.6 9.3 + ``` The opposite operation of `pivot_wider()` is taken care by `pivot_longer()`. We specify the names of the new columns, and here add @@ -863,16 +858,17 @@ The opposite operation of `pivot_wider()` is taken care by pivot_longer(-CHROM, names_to = "sample_id", values_to = "mean_DP") ``` -??? success "Output" + ??? success "Output" + + ``` + # A tibble: 3 × 3 + CHROM sample_id mean_DP + + 1 CP000819.1 SRR2584863 10.4 + 2 CP000819.1 SRR2584866 10.6 + 3 CP000819.1 SRR2589044 9.3 + ``` - ``` - # A tibble: 3 × 3 - CHROM sample_id mean_DP - - 1 CP000819.1 SRR2584863 10.4 - 2 CP000819.1 SRR2584866 10.6 - 3 CP000819.1 SRR2589044 9.3 - ``` ## Resources - [Handy dplyr diff --git a/docs/figures/open_new_Rmd.png b/docs/figures/open_new_Rmd.png new file mode 100644 index 0000000..215e196 Binary files /dev/null and b/docs/figures/open_new_Rmd.png differ