forked from harvardinformatics/bioinformatics-coffee-hour
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathindex.Rmd
320 lines (180 loc) · 9.5 KB
/
index.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
---
title: "Intro to R"
subtitle: "Bioinformatics Coffee Hour"
date: "Augus 4, 2020"
author: "Danielle Khost"
output: html_document
---
##Setting up our working directory and saving objects vs files
R is a functional programming language, which means that most of what one does is apply functions to objects.
We will begin with a very brief introduction to R objects and how functions work, and then focus on getting data into R, manipulating that data in R, plotting, and analysis.
First, we need to make sure we are working out of the correct directory. You can see the working directory above the console in RStudio, or use the `getwd()` function to see the current working directory. If you are not currently working in a good location to read and save files, you can use `setwd()` to change the working directory to the location of your choice.
```{r}
getwd()
```
## Vectors
Let's start by creating one of the simplest R objects, a vector of numbers.
```{r}
```
v1 is an *object* which we created by using the ```<-``` operator (less followed by a dash).
v1 now contains the output of the *function* ```c(1,2,3,4,5)``` (c, for **c**ombined, combines elements into a vector
Let's display the contents of v1:
```{r}
```
We can also just type the name of an object (in this case ```v1```) to display it.
Let's make a few more objects:
```{r}
```
With this last variable, we had to use ```""``` to specify that we want to create a character vector. Otherwise it thinks we are looking for variables to combine called a,b,c and d.
```{r, eval=FALSE}
```
We might want to get a list of all the objects we've created at this point. In R, the function ls() returns a character vector with the names of all the objects in a specified environment (by default, the set of user-defined objects and functions).
```{r}
ls()
```
## Getting Help
R has very good built-in documentation that describes what functions do.
To get help about a particular function, use ? followed by the function name, like so:
?ls
If you are not exactly sure of the function name, you can perform a search:
??mean
We can manipulate objects like so:
```{r}
```
Note that the value of ```x``` is not modified here. We did not save the output to a new object, so it is printed to the screen.
If we want to update the value of ```x```, we need to use our assignment operator:
```{r}
```
R handles vector math for us automatically:
```{r}
```
> **Vector exercises:**
>
>1. Create a new vector, (called nums), that contains any 5 numbers you like
>
>2. There are a number of functions that operate on vectors, including length(), max(), min(), and mean(), all of which do exactly what they sound like they do. So for example length(nums) should return 5, because nums is a 5-element vector. Use these functions to get the minimum, maximum, and mean value for the nums vector you created.
```{r}
```
## Object Types
All objects have a type. Object types are a complex topic and we are only going to scratch the surface today. To slightly simplify, all data objects in R are either atomic vectors (contain only a single type of data), or data structures that combine atomic vectors in various ways. We'll walk through a few examples. First, let's consider a couple of the vectors that we've already made.
```{r}
```
*Numeric* and *character* data types are two of the most common we'll encounter, and are just what they sound like. Another useful type is *logical* data, as in TRUE/FALSE. We can create a logical vector directly like so:
```{r}
```
Or we can use logical tests.
```{r}
```
Note that the logical test here (>2) is applied independently to each element in the vector. We'll come back to this when we talk about data subsets.
> **Object type exercises:**
>
>1. Construct a logical vector with the same number of elements as your nums vector, that is TRUE if the corresponding element in the nums vector is less than the mean of the nums vector, and FALSE otherwise. Call this vector big_nums
```{r}
```
## Subsetting and Logical Vectors
We can also extract portions of vectors that we've created using ```[]``` following the vector, with the positions we wish to extract. For example, we can extract portions of the nums vector:
```{r}
```
Now it can be useful to specify which positions you would like to include, but often it is more useful to create new objects based on some sort of rule. That is where logical vectors come in, and really how you will mostly be using logical vectors going forward.
For example, in the last exercise, we constructed a logical vector that was the same length as our nums vector that was ```TRUE``` with if the corresponding element in the nums vector is less than the mean of the nums vector, and ```FALSE``` otherwise.
Let's take a look at both of these vectors again.
```{r}
```
Let's say we actually want to extract all of the numbers that are bigger than the mean (so all of the ```TRUE```) values. All we have to do is put the logical vector in brackets.
```{r}
```
We don't necessarily have to define this vector ahead of time, but can put the rule in the brackets as well.
```{r}
```
> **Vector Subsetting Data exercises:**
>
> 1. Subset nums for all numbers that are larger than the minimum.
>
> Hints -- logical tests in R - greater than, less than, equal to, not equal to
>
> > '> , < , == , !='
```{r}
```
## Data Frames
So far we've been talking about atomic vectors, which only contain a single data type (every element is logical, or character, or numeric). However, data sets will usually have multiple different data types: numeric for continunous data, character for categorical data and sample labels. Depending on how underlying types are combined, we can have four different "higher-level" data types in R:
**Dimensions** | **Homogeneous** | **Heterogeneous**
---------------|-----------------|-------------------
1-D | atomic vector | list
2-D | matrix | data frame / tibble
We'll focus on data frames for today, but lists and matrices can also be very powerful. A data frame is a collection of vectors, which can be (but don't have to be) different types, but all have to have the same length. Let's make a couple of toy data frames.
One way to do this is with the data.frame() function.
```{r}
```
str() gives lots of information about the data type of the constituent parts of a data frame
```{r}
```
We can use the function head() to look at part of a a dataframe (or any R object). This can be very useful if you have a very large or long dataframe. You also can control how many lines of the dataframe you view with ```n=#```, although the default is 6.
```{r}
```
The summary() function can also be very useful to get a snapshot of the data in your dataframe.
```{r}
```
###Reading files into R
So far we have been working with small objects we created by hand. A more common way to create dataframes is by reading from a file. There are a few functions to do this in R: read.table() and read.csv() make this easy.
```{r}
```
> **File reading exercise**
>
>1. Try using head() to see the first 20 rows and summary() to see some information about the data.
```{r}
```
Again, summary() is a good way to see some basic information about a dataframe
```{r}
```
The view function can be useful for viewing dataframes in a spread-sheet like format.
```{r, eval=FALSE}
```
### Subsetting and Manipulating Dataframes
We learned about how to subset vectors earlier, now let's look at ways to get subsets of data from two-dimensional data. R uses ```[]``` to index rows and columns, like so:
```{r, eval=FALSE}
```
The first number refers to rows, and the second to columns in the case for 2-dimensional objects.
Notice that when you are extracting data from a single column, the data are returned as a vector of the same type as the column. However, if you extract more than one column, the data are returend as a data frame.
```{r, eval=FALSE}
```
Often we want to select a set of rows from a data frame matching some criteria. We can use the subset() function for this. Let's start by just getting the data for key bus routes.
```{r}
```
Sometimes we might want something more complicated. Let's say we want to get data for all "local" bus routes with at least 1000 daily riders.
```{r}
```
> **Subsetting Data exercises:**
>
> 1. Subset bus when ridership is greater than 2500.
>
> 2. Subset bus when bus routes are NOT classified as Key.
>
> 3. Subset bus using the conditions of both #1 and #2, save as bus_2500_notKey
>
> Hints -- logical tests in R
>
> > '> , < , == , !='
>
> > '& is and -->> z <- x & y TRUE only when both x and y are true'
>
> > '| is or -->> z <- x | y TRUE if either x or y is true'
>
```{r}
```
We can add and manipulate the data frame as well, but we will save that for the next session, and use some packages specifically built for that purpose.
## Writing Data
We've added several variables now, and we might want to write our updated dataset to a file. We do this with the write.table() function, which writes out a data.frame.
```{r}
```
But where did R put the file on our computer? R does all file operations without a full path in a working directory. RStudio has a preference to set the default working directory, which is typically something like /Users/Danielle/R.
To see the current working directory, use:
```{r}
getwd()
```
We can change the working directory with ```setwd()```.
>**File operators exercises:**
>
>1. Write a copy of bus_2500_notKey, but **be sure to give it a different name than the original!** Otherwise you will overwrite your original data.
>
```{r}
```