---
title: "Tidyverse Tibbles and Bits"
subtitle: "Bioinformatics Coffee Hour"
date: "Mar 2, 2021"
author: "Danielle Khost; Brian Arnold"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Introduction
One of the reasons R is so useful is that many people have built extensions for R in the form of packages that you can install. These packages may implement broad statistical analyses or handle specific data types (e.g. comparative phylogenetics).
This lesson will introduce some data manipulation tools within the [tidyverse](https://www.tidyverse.org), which contains several R packages. These packages include:
* tidyr to reshape data objects, e.g. converting between long and wide formats
* dplyr to transform and filter data objects, as well as to summarize their contents
For an overview of how to manipulate data with tidyverse, this [cheat sheet](https://rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf) is particularly helpful.
To install new packages you have never used before in R, use the command `install.packages()`:
```{r}
# NOTE that tidyverse is already installed, this is just to illustrate installation
# install.packages("tidyverse")
```
For packages that are already installed, we need to load them every session (i.e. if you restart R). It's good practice to load all packages at the top of your R script. More generally, the [tidyverse style guide](http://style.tidyverse.org) offers some good advice on coding style to help make your code easier to read and write.
```{r}
library(tidyverse)
```
# Data tables: wide format vs long format
When using tidyverse, you will usually want your data to be "tidy". Tidy data is data where:
1. Every column is a variable.
2. Every row is an observation.
3. Every cell is a single value.
This is also referred to as **"long format"** (in contrast to "wide format"). Most of the functions in tidyverse assume that your data is in long format; if it isn't, you will want to pre-process it before doing further analysis!
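As a quick illustration with made-up values (not part of the housing data we are about to load), here is the same small dataset in wide format and in long/tidy format:
```{r}
# Toy example: hypothetical measurements for two samples and two genes
# Wide format: one row per sample, one column per gene
wide_example <- data.frame(sample = c("s1", "s2"), geneA = c(5.1, 4.8), geneB = c(2.3, 2.9))
wide_example
# Long ("tidy") format: one row per observation, one value per cell
long_example <- data.frame(sample = c("s1", "s1", "s2", "s2"),
                           gene   = c("geneA", "geneB", "geneA", "geneB"),
                           value  = c(5.1, 2.3, 4.8, 2.9))
long_example
```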
Let's load in some data to use:
```{r}
# Read US housing price index data directly from GitHub
housing <- read.csv("https://raw.githubusercontent.com/datasets/house-prices-us/master/data/cities-month.csv", stringsAsFactors=F, strip.white = T)
# Keep the Date and city columns plus the final National.US column, dropping the two columns just before it
housing <- housing[c(1:(length(housing)-3), length(housing))]
View(housing)
```
### Tibbles
The tidyverse uses its own version of the data frame (from base R) that is similar but has several properties that make it superior. This object is a **tibble**. Let's make a data frame called 'df1' and change it into a tibble to see what it looks like.
```{r}
df1<-data.frame(label=c("rep1", "rep2", "rep3", "rep4"), data=c(23, 34, 15, 19))
tbdf1 <- as_tibble(df1)
tbdf1
class(tbdf1)
```
You see here that just printing the tibble to screen displays the data types in each of the columns and the dimensions. Although not apparent with this small dataset, another very handy feature of tibbles is that by default they will only print out the first 10 rows and as many columns as fit in your window. Many packages will accept tibbles instead of data frames with no problems, but if you come across an older package that requires a data frame, it is easy to revert with the `as.data.frame()` function.
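For instance, a quick check that converting back works as expected:
```{r}
# Convert the tibble back to a base-R data frame and confirm its class
df1_again <- as.data.frame(tbdf1)
class(df1_again)
```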
#### Side Note: %>%
One important piece of syntax is the %>% operator, which acts like a Unix pipe in R. This means you can pass the output of one command to another in linear fashion, as opposed to having to use either nested operations or temporary objects. This makes your code much easier to read!
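As a small sketch using the `df1` data frame from above, these two expressions are equivalent; the piped version reads left to right instead of inside out:
```{r}
# Nested calls: read from the inside out
nrow(head(df1, 2))
# Piped version: read left to right
df1 %>% head(2) %>% nrow()
```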
---
If we look at the dataset we just downloaded, we can see that it is not in proper long, tidy format.
```{r, echo = TRUE}
View(housing)
```
Let's polish it using some tidyr functions so that it is easier to work with.
## Going from wide to long: pivot_longer()
The pivot_longer() function takes a tibble (or data frame) and lengthens it, i.e. increases the number of rows and decreases the number of columns:
```{r, echo = TRUE}
housing_clean <- housing %>% as_tibble %>%
pivot_longer(cols=c(-Date, -National.US), names_to="location", values_to="local_index")
View(housing_clean)
```
The **cols** argument describes which columns need to be reshaped, i.e. which columns should be gathered into rows. For us, we want to keep the US national average and the date as columns. We give it a list using the **c()** (combine) function, and by putting a **-** in front of the column names, we tell it to reshape every column *except* Date and National.US.
The **names_to** argument gives a name for the new variable (i.e. column) that stores the data that formerly was in the column names. In our case, we make a variable called "location" that holds all the city names.
The **values_to** argument gives a name for the new variable (i.e. column) that stores the data that was in the cell values, which for us was the local index values in each city.
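To see this on a tiny table, here is the same reshaping applied to the toy `wide_example` data frame from earlier (a sketch, not part of the housing analysis):
```{r}
# Pivot every column except "sample"; the old column names go into "gene",
# and the old cell values go into "expression"
wide_example %>%
  as_tibble() %>%
  pivot_longer(cols = -sample, names_to = "gene", values_to = "expression")
```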
## Turning a column into multiple columns: separate()
The separate() function separates a single character column into several, splitting on a given regular expression. In the following command we run the function twice, separating two different columns:
```{r, echo = TRUE}
housing_clean2 <- housing_clean %>%
separate(Date, into=c("year", "month"), extra="drop", remove=F) %>%
separate(location, into=c("state", "city"),extra="merge")
View(housing_clean2)
```
The first argument is the name of the column in the tibble that we are separating ("Date" for the first command, "location" for the second).
The **into** argument gives the names (as a character vector) of the columns that we are separating into, i.e. what the names of the new columns will be. Note that the number of elements in the vector determines how many columns the target column will be split into (in this case, two).
The **extra** argument tells the function what to do with the "pieces" that are left over after splitting: "merge" tells the function to add the leftover pieces back to the final column, while "drop" discards them.
To better understand this, we should look at how separate() splits a character column. You might be wondering how separate() "knows" to split our columns on a period ("."). This is because, by default, separate() splits a character column at every non-alphanumeric character (e.g. space, period, dash). This behavior is controlled by the **sep** argument, which we did not include in our code, so separate() just uses the default (remember that to get the whole list of arguments for a function, you can type ?functionname, e.g. ?separate).
Finally, the **remove** argument determines whether the original column is retained or not (it defaults to TRUE). So for our data, the original "location" column is discarded and we keep only the split columns, while the original "Date" column is retained alongside the new year and month columns.
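As a small sketch with made-up dates (not the housing data), here is separate() with an explicit `sep` and `remove = FALSE`:
```{r}
# Split on "-" explicitly and keep the original column
data.frame(Date = c("2020-03-01", "2021-12-15")) %>%
  separate(Date, into = c("year", "month", "day"), sep = "-", remove = FALSE)
```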
Now let's put it all together. We could do things one function at a time, as above, but it is much cleaner and easier to read if we create our final, cleaned dataset in a single pipeline:
```{r, echo = TRUE}
housing_clean <- housing %>% as_tibble %>%
pivot_longer(cols=c(-Date, -National.US), names_to="location", values_to="local_index") %>%
separate(Date, c("year", "month"), extra="drop", remove=F) %>%
separate(location, c("state", "city"),extra="merge")
View(housing_clean)
```
We can observe the differences between housing and housing_clean to see what's being done here, and a more in-depth description of what these functions do can be found in the cheat sheet.
## Subsetting and Manipulating Data with dplyr
dplyr is a package included in the tidyverse, and is useful for manipulating our data. Each action gets its own (verb) function -- so, for example, filtering data by rows is done with the filter() function. All of these functions have a syntax very similar to the tidyr functions.
arrange(), rename(), select(), and mutate() are used to sort data, rename columns, select columns, and update or create columns, respectively.
We'll use the housing dataset to look at how these functions work.
arrange() sorts by one or more columns, with subsequent columns used to break ties in earlier columns. E.g.,
```{r}
housing_clean %>% arrange(year, month)
housing_clean %>% arrange(city, year)
housing_clean %>% arrange(month,state)
housing_clean %>% arrange(desc(year))
```
rename() renames a column:
```{r}
housing_clean %>% arrange(year, month, state, city) %>%
rename(national_index = National.US)
```
select() selects columns to keep. Note that we can simultaneously rename and reorder columns:
```{r}
housing_clean %>% arrange(year, month, state, city) %>%
select(year, month, city, state, local_index, national_index = National.US)
```
distinct() can be used to identify all unique values (or unique combinations of values) in your data set:
```{r}
housing_clean %>%
select(state,city) %>%
  distinct()
```
mutate() is an especially useful function that can do a variety of things, usually in combination with other functions. It can create new columns or update existing ones, while keeping everything else unchanged. For instance, say we want to change our "month" column from the integer value to the three-letter abbreviation:
```{r}
housing_clean %>% arrange(year, month, state, city) %>%
select(year, month, city, state, local_index, national_index = National.US) %>%
mutate(month = month.abb[as.integer(month)])
```
R has several built-in constants, one of which is month.abb, a character vector holding the three-letter abbreviations of the months of the Gregorian calendar. What this mutate() call is doing is using the integer values of the month column as indices into month.abb, which returns the corresponding abbreviations.
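For example, indexing into month.abb with integers returns the matching abbreviations:
```{r}
month.abb
month.abb[c(1, 8, 12)]
```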
Let's look at some other things mutate() can do. We can include several mutate calls at once:
```{r}
housing_clean %>% arrange(year, month, state, city) %>%
select(year, month, city, state, local_index, national_index = National.US) %>%
mutate(month = month.abb[as.integer(month)], city=sub(".", "_", city, fixed=TRUE), rel_index=local_index/national_index)
```
This command uses a single mutate() call to modify or create three columns:
1) The first expression changes the months, as above
2) In the second expression, we pair mutate() with the sub() function, which substitutes characters in a string; here we replace the "." character with a "_" character in the "city" character vector (i.e. our city column). Note that sub() replaces only the first match in each string; gsub() replaces every match (see the short sketch after this list)
3) In the third expression, we use mutate() to create an entirely new column called "rel_index", whose value is local_index / national_index
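Here is that sub() vs. gsub() difference on a made-up city name:
```{r}
# sub() replaces only the first match; gsub() replaces every match
sub(".", "_", "New.York.City", fixed = TRUE)
gsub(".", "_", "New.York.City", fixed = TRUE)
```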
Now that we've gone through all the dplyr functions, let's update our nice clean housing dataset.
```{r}
housing_clean <- housing_clean %>% arrange(year, month, state, city) %>%
select(year, month, city, state, local_index, national_index = National.US) %>%
mutate(month = month.abb[as.integer(month)], city=sub(".", "_", city, fixed=TRUE), rel_index=local_index/national_index)
```
## Filtering functions
Sometimes we need to filter datasets by selecting only the rows that meet certain logical conditions, which we express as logical tests.
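As a brief sketch, a logical test returns a TRUE/FALSE vector, and filter() keeps the rows where the test is TRUE:
```{r}
# Comparison operators applied to vectors return logical vectors
c(1, 5, 10) > 4
c("MA", "NY", "CA") == "MA"
```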
The filter() function can be used to select rows meeting these conditions:
```{r}
housing_clean %>%
filter(city == "Boston")
```
We can give filter() several logical tests, separated by commas. These are combined with a logical AND (all must be TRUE for a row to be kept):
```{r}
housing_clean %>%
filter(city=="Boston", month=="Aug")
```
If we want to use other Boolean operators, such as | (OR), we put them within a single statement, e.g.:
```{r}
housing_clean %>%
filter(city=="Boston", month == "Aug" | month == "Jan")
```
Note that we can combine filter() with the other verbs to get useful subsets. For example, let's find the three years in which the relative index in Boston was highest (as of January):
```{r}
housing_clean %>%
filter(city=="Boston", month=="Jan") %>%
arrange(desc(rel_index)) %>%
select(year,rel_index) %>%
head(n=3)
```
Finally, you often want to filter out missing data (typically indicated by `NA` in R) before conducting any analyses. To do this, we can use the `is.na()` function, which returns a vector with TRUE wherever a value is NA and FALSE otherwise. For example, let's identify all rows that are missing the local_index in housing_clean.
```{r}
housing_clean %>%
filter(is.na(local_index))
```
To get the inverse of this (or any logical statement), simply add a `!` to the front.
```{r}
housing_clean %>%
filter(!is.na(local_index))
```