Commit 13da4d8

committed
initial contents
1 parent 0b0ee34 commit 13da4d8

10 files changed

+8199
-0
lines changed

20_minutes_to_R.Rmd

+537
Large diffs are not rendered by default.

20_minutes_to_R.nb.html

+3,086
Large diffs are not rendered by default.

CAC SCU R Basics.pptx

1.22 MB
Binary file not shown.

R-basics.Rmd

+222
@@ -0,0 +1,222 @@
1+
---
2+
title: "R Basics"
3+
output: html_document
4+
---
5+
6+
# RStudio Interface
7+
8+
Posit (formerly RStudio, PBC) publishes helpful and extremely detailed cheatsheets, e.g. <https://posit.co/wp-content/uploads/2022/10/rstudio-ide-1.pdf>.
9+
10+
1. **Notice:** Working Directory at top of Console
11+
2. **Demo:** Start a new R notebook
12+
3. **Demo:** Use Packages tab to install a package (tidyverse, titanic, gmodels)
13+
14+
```{r message=FALSE, warning=FALSE}
15+
#install.packages("tidyverse") #uncomment (remove leading #) to run
16+
require(tidyverse)
17+
```
18+
19+
## Data import
20+
21+
- **Demo:** Import Dataset Wizard in Upper Tab Pane: Environment
22+
- nutrient.txt (fixed width format) - use base
23+
24+
- registration_times.csv (can set some datatypes on import)
25+
26+
### Output generated by Base wizard for nutrient.txt
27+
28+
```{r paged.print=FALSE}
29+
n_df <- read.table("~/Documents/CAC/Projects/scu_dev/r_basics/nutrient.txt", quote="\"", comment.char="")
30+
head(n_df)
31+
names(n_df) # Column names are not great
32+
```
33+
34+
```{r paged.print=FALSE}
35+
# replace the names with a vector of new names
36+
names(n_df) = c("caseID", "calcium", "iron", "protein", "vitA", "vitC")
37+
head(n_df)
38+
str(n_df)
39+
```
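The rename step above can also be folded into the import itself. The following is a minimal sketch, assuming a whitespace-delimited file with six columns; the two sample rows are made-up stand-ins for `nutrient.txt`, and `n_demo` is a hypothetical name used only here.

```{r}
# Hypothetical two-row sample standing in for nutrient.txt (values are made up)
sample_txt <- "1 522.29 10.188 42.561 349.13 54.141
2 343.32 4.113 67.793 266.99 24.839"

# col.names assigns the names during import, so no V1..V6 placeholders appear
n_demo <- read.table(
  text = sample_txt,
  col.names = c("caseID", "calcium", "iron", "protein", "vitA", "vitC")
)
str(n_demo)
```

To read the real file, replace `text = sample_txt` with the file path.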
40+
41+
### Output from readr import wizard:
42+
43+
```{r}
44+
# This help file (from the lubridate package) explains the tokens available for parsing date-times
45+
?parse_date_time
46+
```
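To see what format tokens such as `%Y-%m-%d %H:%M:%S` do, here is a small sketch using base R's `as.POSIXct()`. It assumes a made-up timestamp. The tokens shown are strptime-style codes. The readr and lubridate parsers use similar codes, but they are not guaranteed to behave identically.

```{r}
# Hypothetical timestamp laid out as year-month-day hour:minute:second
stamp <- "2023-01-15 09:30:00"
parsed <- as.POSIXct(stamp, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

format(parsed, "%H")  # extract the hour
format(parsed, "%Y")  # extract the four-digit year

# A string that does not match the format yields NA instead of an error
bad <- as.POSIXct("15/01/2023", format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
is.na(bad)
```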
47+
48+
```{r}
49+
# code from import wizard
50+
require(readr)
51+
registration_times <- read_csv(
52+
"registration_times.csv",
53+
col_types = cols(`Registration Time` = col_datetime(format = "%Y-%m-%d %H:%M:%S")
54+
))
55+
```
56+
57+
```{r paged.print=FALSE}
58+
summary(registration_times)
59+
head(registration_times)
60+
```
61+
62+
```{r}
63+
# "org" variable might be better represented as a factor
64+
# check the unique values:
65+
unique(registration_times$org)
66+
```
67+
68+
```{r}
69+
registration_times$org = factor(registration_times$org, levels=c('wcm', 'cu', 'other'))
70+
71+
# While we are at it, let's rename the first column from `Registration Time` to just `time`:
72+
names(registration_times)[1] = "time"
73+
74+
head(registration_times)
75+
```
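One thing to watch when passing `levels=` to `factor()`: any value not listed becomes `NA`. This is why checking `unique()` first is worthwhile. A minimal sketch, assuming a made-up vector in which `"nyp"` is a hypothetical value missing from the level list:

```{r}
# Hypothetical org values; "nyp" is deliberately absent from the levels below
org_demo <- c("wcm", "cu", "other", "cu", "nyp")
org_f <- factor(org_demo, levels = c("wcm", "cu", "other"))

levels(org_f)                  # levels keep the order given
is.na(org_f)                   # the unlisted value was converted to NA
table(org_f, useNA = "ifany")  # count NAs alongside the real levels
```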
76+
77+
## Describing Data
78+
79+
### Numeric data
80+
81+
```{r paged.print=FALSE}
82+
# Basic summary of dataframe
83+
summary(n_df)
84+
```
85+
86+
```{r}
87+
# Base R approach using apply functions (see also sapply, lapply)
88+
apply(n_df, 2, mean) # "2" applies function "by column"
89+
apply(n_df, 2, sd)
90+
```
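The `sapply` alternative mentioned in the comment above might look like the sketch below. It assumes a small made-up data frame (`demo` is a hypothetical name) and drops the ID column first, because the mean of an identifier is not meaningful.

```{r}
# Hypothetical data with an ID column and two numeric measures
demo <- data.frame(
  caseID  = 1:3,
  calcium = c(500, 600, 700),
  iron    = c(10, 12, 14)
)

# sapply iterates over the columns of a data frame and returns a named vector
col_means <- sapply(demo[, -1], mean)
col_sds   <- sapply(demo[, -1], sd)
col_means
col_sds
```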
91+
92+
```{r}
93+
gg = (
94+
ggplot(n_df, aes(x=calcium))
95+
+ geom_histogram(bins=50)
96+
+ ggtitle("Distribution of Calcium Intake")
97+
)
98+
gg
99+
100+
```
101+
102+
```{r}
103+
# Visual Description
104+
require(ggplot2)
105+
gg = (
106+
ggplot(n_df, aes(x=calcium, y=iron))
107+
+ geom_point()
108+
+ ggtitle("Scatterplot of Iron and Calcium Intake")
109+
)
110+
gg
111+
```
112+
113+
### Categorical Data
114+
115+
```{r}
116+
require(titanic)
117+
df = titanic_train
118+
str(df)
119+
head(df)
120+
```
121+
122+
Again, data types are not as precise as they could be.
123+
124+
The types of `Sex`, `Survived`, and `Pclass` are character, int, and int, but all three are really factors.
125+
126+
```{r}
127+
# use dplyr functions and the "pipe" operator `%>%`
128+
# alternative: head(select(df, Sex, Survived, Pclass))
129+
df %>% select( Sex, Survived, Pclass) %>% head
130+
df %>% select( Sex, Survived, Pclass) %>% summary
131+
```
132+
133+
```{r}
134+
# less-than-ideal data types lead to less-than-ideal summaries
135+
table(df$Survived)
136+
```
137+
138+
```{r}
139+
# Create factors from the columns
140+
df$Sex = factor(df$Sex, levels=c("male", "female"))
141+
df$Survived = factor(df$Survived, levels=c(0, 1), labels=c("No", "Yes"))
142+
df$Pclass = factor(df$Pclass, levels=c(1,2,3), ordered=TRUE)
143+
144+
#Check the summary now:
145+
df %>% select( Sex, Survived, Pclass) %>% summary
146+
```
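The effect of `labels=` and `ordered=TRUE` is easier to see on a tiny vector. This is a sketch with made-up values. `pclass_demo` and `surv_demo` are hypothetical names.

```{r}
# ordered = TRUE makes comparisons between levels meaningful
pclass_demo <- factor(c(3, 1, 2, 3), levels = c(1, 2, 3), ordered = TRUE)
pclass_demo < "3"  # TRUE for classes that sort before level "3"

# labels replaces the raw codes with readable names
surv_demo <- factor(c(0, 1, 1), levels = c(0, 1), labels = c("No", "Yes"))
as.character(surv_demo)
```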
147+
148+
Check for missing data:
149+
150+
```{r}
151+
nrow(df)
152+
colSums(is.na(df))
153+
```
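The `colSums(is.na(...))` idiom works because `is.na()` returns a logical matrix and `TRUE` counts as 1 when summed. A minimal sketch on a made-up data frame (`df_demo` is a hypothetical name):

```{r}
# Hypothetical data frame with one missing value in each column
df_demo <- data.frame(
  a = c(1, NA, 3),
  b = c("x", "y", NA)
)

is.na(df_demo)                       # logical matrix, TRUE where a value is missing
na_counts <- colSums(is.na(df_demo)) # missing values per column
na_counts
sum(complete.cases(df_demo))         # rows with no missing values at all
```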
154+
155+
```{r}
156+
#Single variable count tables
157+
table(df$Sex)
158+
table(df$Survived)
159+
```
160+
161+
#### Table and Prop.table
162+
163+
```{r}
164+
sex_surv = table(df$Sex, df$Survived, dnn=c("Sex", "Survived"))
165+
sex_surv
166+
addmargins(sex_surv)
167+
writeLines("")
168+
169+
prop.table(sex_surv, 1 ) # The "1" means row proportions
170+
prop.table(sex_surv, 2) # The "2" means column proportions
171+
prop.table(sex_surv) # skip the argument to get proportion of table total
172+
173+
round(prop.table(sex_surv, 1), 2)
174+
```
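To check which margin is which, it helps to confirm that the proportions sum to 1 along the expected dimension. The sketch below uses a made-up 2x2 table of counts, not the Titanic data.

```{r}
# Hypothetical counts: rows are Sex, columns are Survived
tab <- as.table(matrix(
  c(30, 10,
    20, 40),
  nrow = 2, byrow = TRUE,
  dimnames = list(Sex = c("male", "female"), Survived = c("No", "Yes"))
))

row_p <- prop.table(tab, 1)  # each row sums to 1
col_p <- prop.table(tab, 2)  # each column sums to 1
all_p <- prop.table(tab)     # the whole table sums to 1

rowSums(row_p)
colSums(col_p)
sum(all_p)
```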
175+
176+
#### CrossTable (gmodels package)
177+
178+
```{r}
179+
# gmodels package gives output more like SPSS/SAS/STATA
180+
require(gmodels) #show install
181+
CrossTable(df$Sex, df$Survived, digits=2, expected=TRUE, chisq=TRUE)
182+
```
183+
184+
#### Xtabs
185+
186+
```{r}
187+
# We need to know the variable names:
188+
names(df)
189+
```
190+
191+
```{r}
192+
surv_class_sex = xtabs(~Survived+Pclass+Sex, data=df)
193+
surv_class_sex
194+
ftable(surv_class_sex)
195+
```
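The formula interface to `xtabs` can be tried without the titanic package. This sketch substitutes the built-in `mtcars` dataset as a stand-in. The variables `cyl`, `gear`, and `am` belong to that dataset, not to the tutorial's data.

```{r}
# Two-way table: one-sided formula, variables joined with +
cyl_gear <- xtabs(~ cyl + gear, data = mtcars)
cyl_gear

# Every row of the data lands in exactly one cell
sum(cyl_gear) == nrow(mtcars)

# A third variable adds a dimension; ftable flattens it for printing
ftable(xtabs(~ cyl + gear + am, data = mtcars))
```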
196+
197+
#### Dplyr
198+
199+
```{r paged.print=FALSE}
200+
(
201+
df
202+
%>% group_by(Pclass, Sex, Survived)
203+
%>% summarize(n = n())
204+
%>% group_by(Pclass, Sex)
205+
%>% mutate( Rate = n/sum(n))
206+
#%>% filter(Survived=='Yes')
207+
)
208+
```
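The second `group_by` followed by `mutate` divides each count by its group total. A base-R sketch of the same idea, using `ave()` on made-up counts (the numbers and the `counts` data frame are hypothetical):

```{r}
# Hypothetical counts per class and survival outcome
counts <- data.frame(
  Pclass   = c(1, 1, 2, 2),
  Survived = c("No", "Yes", "No", "Yes"),
  n        = c(80, 136, 97, 87)
)

# ave() applies FUN within each Pclass group and keeps the original row order
counts$Rate <- ave(counts$n, counts$Pclass, FUN = function(x) x / sum(x))
counts

# Rates within each class sum to 1
tapply(counts$Rate, counts$Pclass, sum)
```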
209+
210+
```{r paged.print=FALSE}
211+
df %>% group_by(Sex) %>% summarize(age = mean(Age)) # returns NA because Age has missing values
212+
df %>% group_by(Sex) %>% summarize(age = mean(Age, na.rm=TRUE)) # drop missing values before averaging
213+
```
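The difference between the two calls comes from how `mean()` treats missing values. A minimal sketch on a made-up vector (`ages` is a hypothetical name):

```{r}
# Hypothetical ages with one missing value
ages <- c(22, 38, NA, 35)

mean(ages)                # NA: a single missing value makes the result missing
mean(ages, na.rm = TRUE)  # mean of the three observed values
```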
214+
215+
#### Regression model
216+
217+
(Note: proper model fitting and interpretation is beyond the scope of this tutorial)
218+
219+
```{r}
220+
m1 = glm(Survived ~ Sex + Pclass + Age, family = 'binomial', data=df)
221+
summary(m1)
222+
```
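Logistic regression coefficients are on the log-odds scale, so a common step is to exponentiate them into odds ratios. The sketch below uses the built-in `mtcars` dataset as a stand-in, with `am` as a 0/1 outcome and `wt` as a predictor. It illustrates the mechanics only and is not a recommended model.

```{r}
# Stand-in model: probability of a manual transmission as a function of weight
m_demo <- glm(am ~ wt, family = binomial, data = mtcars)

coef(m_demo)                      # coefficients on the log-odds scale
odds_ratios <- exp(coef(m_demo))  # odds ratios
odds_ratios
```

An odds ratio below 1 means the odds of the outcome fall as the predictor increases.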
