---
title: "Mini Data-Analysis Deliverable 1"
author: "Gopal Khanal"
date: "2023-10-05"
output: github_document
---
### Welcome to your (maybe) first-ever data analysis project!
And hopefully the first of many. Let's get started:
1. Install the [`datateachr`](https://github.com/UBC-MDS/datateachr) package by typing the following into your **R terminal**:
<!-- -->

    install.packages("devtools")
    devtools::install_github("UBC-MDS/datateachr")
2. Load the packages below.
```{r, echo=FALSE, results='hide'}
library(datateachr)
library(tidyverse)
library('VIM') # package to see missing values
```
3. Make a repository in the <https://github.com/stat545ubc-2023> Organization. You can do this by following the steps found on Canvas in the entry called [MDA: Create a repository](https://canvas.ubc.ca/courses/126199/pages/mda-create-a-repository). Once completed, your repository should automatically be listed as part of the stat545ubc-2023 Organization.
# Instructions
## For Both Milestones
- Each milestone has explicit tasks. Tasks that are more challenging will often be allocated more points.
- Each milestone will be also graded for reproducibility, cleanliness, and coherence of the overall Github submission.
- While the two milestones will be submitted as independent deliverables, the analysis itself is a continuum - think of it as two chapters to a story. Each chapter, or in this case, portion of your analysis, should be easily followed through by someone unfamiliar with the content. [Here](https://swcarpentry.github.io/r-novice-inflammation/06-best-practices-R/) is a good resource for what constitutes "good code". Learning good coding practices early in your career will save you hassle later on!
- The milestones will be equally weighted.
## For Milestone 1
**To complete this milestone**, edit [this very `.Rmd` file](https://raw.githubusercontent.com/UBC-STAT/stat545.stat.ubc.ca/master/content/mini-project/mini-project-1.Rmd) directly. Fill in the sections that are tagged with `<!--- start your work below --->`.
**To submit this milestone**, make sure to knit this `.Rmd` file to an `.md` file by changing the YAML output settings from `output: html_document` to `output: github_document`. Commit and push all of your work to the mini-analysis GitHub repository you made earlier, and tag a release on GitHub. Then, submit a link to your tagged release on canvas.
**Points**: This milestone is worth 36 points: 30 for your analysis, and 6 for overall reproducibility, cleanliness, and coherence of the Github submission.
# Learning Objectives
By the end of this milestone, you should:
- Become familiar with your dataset of choice
- Select 4 questions that you would like to answer with your data
- Generate a reproducible and clear report using R Markdown
- Become familiar with manipulating and summarizing your data in tibbles using `dplyr`, with a research question in mind.
# Task 1: Choose your favorite dataset
The `datateachr` package by Hayley Boyce and Jordan Bourak is currently composed of 7 semi-tidy datasets for educational purposes. Here is a brief description of each dataset:
- *apt_buildings*: Acquired courtesy of The City of Toronto's Open Data Portal. It currently has 3455 rows and 37 columns.
- *building_permits*: Acquired courtesy of The City of Vancouver's Open Data Portal. It currently has 20680 rows and 14 columns.
- *cancer_sample*: Acquired courtesy of UCI Machine Learning Repository. It currently has 569 rows and 32 columns.
- *flow_sample*: Acquired courtesy of The Government of Canada's Historical Hydrometric Database. It currently has 218 rows and 7 columns.
- *parking_meters*: Acquired courtesy of The City of Vancouver's Open Data Portal. It currently has 10032 rows and 22 columns.
- *steam_games*: Acquired courtesy of Kaggle. It currently has 40833 rows and 21 columns.
- *vancouver_trees*: Acquired courtesy of The City of Vancouver's Open Data Portal. It currently has 146611 rows and 20 columns.
**Things to keep in mind**
- We hope that this project will serve as practice for carrying out your own *independent* data analysis. Remember to comment your code, be explicit about what you are doing, and write notes in this markdown document when you feel that context is required. As you advance in the project, prompts and hints to do this will be diminished - it'll be up to you!
- Before choosing a dataset, you should always keep in mind **your goal**, or in other ways, *what you wish to achieve with this data*. This mini data-analysis project focuses on *data wrangling*, *tidying*, and *visualization*. In short, it's a way for you to get your feet wet with exploring data on your own.
And that is exactly the first thing that you will do!
1.1 **(1 point)** Out of the 7 datasets available in the `datateachr` package, choose **4** that appeal to you based on their description. Write your choices below:
**Note**: We encourage you to use the ones in the `datateachr` package, but if you have a dataset that you'd really like to use, you can include it here. But, please check with a member of the teaching team to see whether the dataset is of appropriate complexity. Also, include a **brief** description of the dataset here to help the teaching team understand your data.
<!-------------------------- Start your work below ---------------------------->
**Four chosen data**
1. building_permits
2. cancer_sample
3. flow_sample
4. vancouver_trees
<!----------------------------------------------------------------------------->
1.2 **(6 points)** One way of narrowing down your selection is to *explore* the datasets. Use your knowledge of dplyr to find out at least *3* attributes about each of these datasets (an attribute is something such as number of rows, variables, class type...). The goal here is to have an idea of *what the data looks like*.
*Hint:* This is one of those times when you should think about the cleanliness of your analysis. I added a single code chunk for you below, but do you want to use more than one? Would you like to write more comments outside of the code chunk?
<!-------------------------- Start your work below ---------------------------->
**Exploring the attributes of chosen data**
### Flow data
```{r}
# Overview of the data: rows, columns, and column names
glimpse(flow_sample) # The flow_sample data has 7 variables (columns) and 218 observations (rows)

# Class of the dataset
class(flow_sample) # The data is both a tibble and a standard R data frame

# Group mean flow by the variable 'extreme_type'
flow_sample %>%
  group_by(extreme_type) %>%
  summarise(flow = mean(flow, na.rm = TRUE)) # Mean extreme maximum flow is 212.07 m3/sec; mean extreme minimum flow is 6.27 m3/sec

# Other base R ways to inspect attributes of the dataset
head(flow_sample) # Quick look at the first few rows and columns of the data
str(flow_sample) # Structure of each variable; for example, flow is numeric
summary(flow_sample) # Summary statistics (mean, max, etc.) for each variable
```
### Vancouver trees data
```{r}
# Overview of rows, columns, and column names
glimpse(vancouver_trees) # There are 20 columns (variables) and 146611 observations. Main variables are tree id, species name, and diameter, among others.

# Dimensions of the data
dim(vancouver_trees) # 146611 rows and 20 columns

# Mean diameter of all trees, irrespective of species and type
vancouver_trees %>%
  summarise(avg = mean(diameter, na.rm = TRUE)) # The mean diameter of trees is 11.49 inches

# Average diameter of trees with and without a curb
# Trees with a curb have a smaller average diameter (11.38 inches) than trees without one (12.57 inches)
vancouver_trees %>%
  group_by(curb) %>%
  summarise(avg = mean(diameter, na.rm = TRUE))

# Other base R ways to inspect attributes of the dataset
class(vancouver_trees) # The data is both a tibble and a standard data frame
head(vancouver_trees) # First few rows and columns of the data
str(vancouver_trees) # Structure of each variable (numeric, character, or other)
```
### Cancer sample data
```{r}
# Overview of rows and columns of the data
# The data contains 32 columns (variables) and 569 observations (samples).
# Main variables are the tumor diagnosis type (malignant or benign), the mean radius of the tumor, and the mean area, among others.
glimpse(cancer_sample)

# Dimensions of the data
dim(cancer_sample) # 569 rows and 32 columns

# Does the mean area of the cell nuclei differ between benign and malignant tumors?
# Malignant tumors have a larger mean area (978.37 sq. units) than benign tumors (462.79 sq. units)
cancer_sample %>%
  group_by(diagnosis) %>%
  summarise(avg = mean(area_mean, na.rm = TRUE))

# Maximum of the "worst" tumor area
# The output shows that the largest worst-area value is 4254 sq. units
cancer_sample %>%
  summarise(maximum.area = max(area_worst, na.rm = TRUE))

# Other base R ways to inspect attributes of the dataset
head(cancer_sample) # Quick look at the first few rows and columns of the data
str(cancer_sample) # Structure of each variable; for example, area_mean is numeric
```
### Building permits data
```{r}
# Overview of rows and columns of the data
# The data has 20680 building permits (observations) and 14 variables (attributes) associated with the permits
glimpse(building_permits)

# Total number of building permits issued each year
# The year 2018 had the highest number of permits (6758), whereas 2020 had the lowest (1616)
building_permits %>%
  group_by(year) %>%
  summarise(n = n())

# Other base R ways to inspect attributes of the dataset
head(building_permits) # Quick look at the first few rows and columns of the data
str(building_permits) # Structure of each variable; for example, year is numeric
```
<!----------------------------------------------------------------------------->
1.3 **(1 point)** Now that you've explored the 4 datasets that you were initially most interested in, let's narrow it down to 1. What led you to choose this one? Briefly explain your choice below.
<!-------------------------- Start your work below ---------------------------->
**Reason for selection of data**
_I've chosen the flow_sample data because it records annual flow extremes over a long period of time, and I want to see how flow varies across years and months. There are also some missing values to account for, which gives me something to practice data wrangling on._
<!----------------------------------------------------------------------------->
1.4 **(2 points)** Time for a final decision! Going back to the beginning, it's important to have an *end goal* in mind. For example, if I had chosen the `titanic` dataset for my project, I might've wanted to explore the relationship between survival and other variables. Try to think of 1 research question that you would want to answer with your dataset. Note it down below.
<!-------------------------- Start your work below ---------------------------->
_Using the flow data, I want to assess whether there is any significant temporal trend in the annual extremes (minimum and maximum) of flow. I also want to determine whether extreme flows differ among decades, and whether the extreme flows consistently occur in particular months across years or whether their timing shifts._
<!----------------------------------------------------------------------------->
# Important note
Read Tasks 2 and 3 *fully* before starting to complete either of them. Probably also a good point to grab a coffee to get ready for the fun part!
This project is semi-guided, but meant to be *independent*. For this reason, you will complete tasks 2 and 3 below (under the **START HERE** mark) as if you were writing your own exploratory data analysis report, and this guidance never existed! Feel free to add a brief introduction section to your project, format the document with markdown syntax as you deem appropriate, and structure the analysis as you deem appropriate. If you feel lost, you can find a sample data analysis [here](https://www.kaggle.com/headsortails/tidy-titarnic) to have a better idea. However, bear in mind that it is **just an example** and you will not be required to have that level of complexity in your project.
# Task 2: Exploring your dataset
If we rewind and go back to the learning objectives, you'll see that by the end of this deliverable, you should have formulated *4* research questions about your data that you may want to answer during your project. However, it may be handy to do some more exploration on your dataset of choice before creating these questions - by looking at the data, you may get more ideas. **Before you start this task, read all instructions carefully until you reach START HERE under Task 3**.
2.1 **(12 points)** Complete *4 out of the following 8 exercises* to dive deeper into your data. All datasets are different and therefore, not all of these tasks may make sense for your data - which is why you should only answer *4*.
Make sure that you're using dplyr and ggplot2 rather than base R for this task. Outside of this project, you may find that you prefer using base R functions for certain tasks, and that's just fine! But part of this project is for you to practice the tools we learned in class, which is dplyr and ggplot2.
1. Plot the distribution of a numeric variable.
2. Create a new variable based on other variables in your data (only if it makes sense)
3. Investigate how many missing values there are per variable. Can you find a way to plot this?
4. Explore the relationship between 2 variables in a plot.
5. Filter observations in your data according to your own criteria. Think of what you'd like to explore - again, if this was the `titanic` dataset, I may want to narrow my search down to passengers born in a particular year...
6. Use a boxplot to look at the frequency of different observations within a single variable. You can do this for more than one variable if you wish!
7. Make a new tibble with a subset of your data, with variables and observations that you are interested in exploring.
8. Use a density plot to explore any of your variables (that are suitable for this type of plot).
2.2 **(4 points)** For each of the 4 exercises that you complete, provide a *brief explanation* of why you chose that exercise in relation to your data (in other words, why does it make sense to do that?), and sufficient comments for a reader to understand your reasoning and code.
<!-------------------------- Start your work below ---------------------------->
**Q1. Plot the distribution of a numeric variable.**
```{r}
# Plotting the flow (m3/sec) irrespective of whether it is the extreme minimum or maximum
# It is clear in the histogram that minimum and maximum flows are distributed distinctly
flow_sample %>%
  ggplot(aes(x = flow)) +
  geom_histogram(binwidth = 30, colour = "black", fill = "white")

# Filter and plot the distribution of maximum flow
flow_sample %>%
  filter(extreme_type == "maximum") %>%
  ggplot(aes(x = flow)) +
  geom_histogram(binwidth = 50, colour = "black", fill = "white")

# Filter and plot the distribution of minimum flow
flow_sample %>%
  filter(extreme_type == "minimum") %>%
  ggplot(aes(x = flow)) +
  geom_histogram(binwidth = 1, colour = "black", fill = "white")
```
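
An optional variant (not one of the three plots above, just a sketch of an alternative): mapping `extreme_type` to the fill aesthetic shows both distributions in a single histogram, which makes their separation easier to see.

```{r, warning=FALSE}
# Sketch: overlaid histogram with extreme_type mapped to fill,
# so the minimum and maximum flow distributions can be compared directly
flow_sample %>%
  ggplot(aes(x = flow, fill = extreme_type)) +
  geom_histogram(binwidth = 30, colour = "black", position = "identity", alpha = 0.5)
```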
**Q2. Investigate how many missing values there are per variable. Can you find a way to plot this?**
```{r}
# Check whether there are any missing values in the flow variable
any(is.na(flow_sample$flow))

# Map the number of missing observations for each variable and sort them (using the VIM package)
aggr(flow_sample, prop = FALSE, combined = TRUE, numbers = TRUE, sortVars = TRUE, sortCombs = TRUE)
```
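
Since Task 2 emphasizes dplyr and ggplot2, here is a minimal tidyverse sketch of the same idea, offered as an alternative to `VIM::aggr()`: count the `NA`s per variable and plot the counts.

```{r}
# Count missing values per variable with dplyr/tidyr, then plot the counts with ggplot2
flow_sample %>%
  summarise(across(everything(), ~ sum(is.na(.x)))) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "n_missing") %>%
  ggplot(aes(x = reorder(variable, n_missing), y = n_missing)) +
  geom_col() +
  coord_flip() +
  labs(x = "Variable", y = "Number of missing values")
```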
**Q3. Explore the relationship between 2 variables in a plot.**
```{r, warning=FALSE}
# Here, I explored whether there is any relationship between year and flow.
# The flow variability over the years doesn't seem to show any unusual trend (based on eyeballing).
# But we have to be careful with this eyeballing, because flow type (extreme_type) could be obscuring the actual trend.
# We ideally need to plot flow separately by extreme type, which I'll do in the subsequent exercises.
flow_sample %>%
  ggplot(aes(x = year, y = flow)) +
  geom_point(aes(color = extreme_type))
```
**Q4. Filter observations in your data according to your own criteria**
```{r, warning=FALSE}
# I would like to narrow down to the maximum flow of each year and
# see whether there is any statistical trend in it
flow_sample %>%
  filter(extreme_type == "maximum") %>%
  ggplot(aes(year, flow)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
# There appears to be a decreasing trend in the yearly maximum flow over the years
```
```{r, warning=FALSE}
## How about the trend in minimum flow over the years?
flow_sample %>%
  filter(extreme_type == "minimum") %>%
  ggplot(aes(x = year, y = flow)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
## It looks like there is a positive trend in minimum flow over the years.
## But I haven't checked whether the trend is statistically significant.
## I'll run a regression analysis to see if there are statistical trends in flow over the years.
```
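
As a rough preview of that regression check (the proper analysis is left for Milestone 2), one possible sketch is to fit a simple linear model of flow on year for each extreme type and inspect the `year` coefficient and its p-value.

```{r}
# Sketch: simple linear trend check for each extreme type (rows with missing flow are dropped by lm)
max_fit <- lm(flow ~ year, data = filter(flow_sample, extreme_type == "maximum"))
min_fit <- lm(flow ~ year, data = filter(flow_sample, extreme_type == "minimum"))
summary(max_fit)$coefficients # slope of 'year' and its p-value for the maximum series
summary(min_fit)$coefficients # slope of 'year' and its p-value for the minimum series
```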
<!----------------------------------------------------------------------------->
# Task 3: Choose research questions
**(4 points)** So far, you have chosen a dataset and gotten familiar with it through exploring the data. You have also brainstormed one research question that interested you (Task 1.4). Now it's time to pick 4 research questions that you would like to explore in Milestone 2! Write the 4 questions and any additional comments below.
<!--- *****START HERE***** --->
**Eyeballing the scatter plots of flow over the years suggests that the minimum flow appears to be increasing while the maximum flow appears to be decreasing. But we don't know whether these trends are statistically significant. I'm therefore interested in examining the following four questions in the next deliverable (a rough code sketch of how they might be approached appears after the list).**
1. Is there a statistically significant trend in the extreme minimum and extreme maximum flow (m3/sec) over the years? I'll use a linear regression model to assess annual trends and check the significance of the beta (slope) coefficient. I'll also try the non-parametric Mann-Kendall test and Sen's slope method to determine whether there is a significant positive or negative trend in flow.
2. Compare flow patterns among decades: are there some decades (say 1970-1980) with unusual extreme minimum and maximum flows? Which years had above-average extreme maximum flow? Are extreme flows (both minimum and maximum) after 2000 different from those before 2000? If there has been a consistent increase in extreme maximum flow after 2000, this MAY be a signature of increased extreme rainfall after 2000, or MAY reflect extreme runoff in the rivers due to deforestation in the catchment area.
3. Are flow patterns consistent across years with regard to months? Or has there been a shift in which months have the highest extreme maximum or minimum flow?
4. Which months have the extreme minimum and extreme maximum flows? Do particular months have unusually high extreme maximum flow over the years?
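
Below is a rough, unevaluated sketch of how these questions might be approached in Milestone 2. It assumes the `flow_sample` columns used above (`year`, `month`, `extreme_type`, `flow`). Dedicated Mann-Kendall and Sen's slope functions are available in packages such as `Kendall` and `trend`; here base R's `cor.test(..., method = "kendall")` stands in as an equivalent non-parametric trend check for Question 1.

```{r, eval=FALSE}
# Q1: non-parametric trend check (Kendall's tau of flow against year), one series at a time
flow_sample %>%
  filter(extreme_type == "maximum") %>%
  with(cor.test(year, flow, method = "kendall"))

# Q2: compare mean extreme flows among decades
flow_sample %>%
  mutate(decade = floor(year / 10) * 10) %>%
  group_by(decade, extreme_type) %>%
  summarise(mean_flow = mean(flow, na.rm = TRUE), .groups = "drop")

# Q3/Q4: in which months do the annual extremes occur, and does that timing shift?
flow_sample %>%
  count(extreme_type, month) %>%
  ggplot(aes(x = month, y = n)) +
  geom_col() +
  facet_wrap(~ extreme_type, scales = "free_y")
```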
<!----------------------------->
# Overall reproducibility/Cleanliness/Coherence Checklist
## Coherence (0.5 points)
The document should read sensibly from top to bottom, with no major continuity errors. An example of a major continuity error is having a data set listed for Task 3 that is not part of one of the data sets listed in Task 1.
## Error-free code (3 points)
For full marks, all code in the document should run without error. 1 point deduction if most code runs without error, and 2 points deduction if more than 50% of the code throws an error.
## Main README (1 point)
There should be a file named `README.md` at the top level of your repository. Its contents should automatically appear when you visit the repository on GitHub.
Minimum contents of the README file:
- In a sentence or two, explains what this repository is, so that future-you or someone else stumbling on your repository can be oriented to the repository.
- In a sentence or two (or more??), briefly explains how to engage with the repository. You can assume the person reading knows the material from STAT 545A. Basically, if a visitor to your repository wants to explore your project, what should they know?
Once you get in the habit of making README files, and seeing more README files in other projects, you'll wonder how you ever got by without them! They are tremendously helpful.
## Output (1 point)
All output is readable, recent and relevant:
- All Rmd files have been `knit`ted to their output md files.
- All knitted md files are viewable without errors on Github. Examples of errors: Missing plots, "Sorry about that, but we can't show files that are this big right now" messages, error messages from broken R code
- All of these output files are up-to-date -- that is, they haven't fallen behind after the source (Rmd) files have been updated.
- There should be no relic output files. For example, if you were knitting an Rmd to html, but then changed the output to be only a markdown file, then the html file is a relic and should be deleted.
(0.5 point deduction if any of the above criteria are not met. 1 point deduction if most or all of the above criteria are not met.)
Our recommendation: right before submission, delete all output files, and re-knit each milestone's Rmd file, so that everything is up to date and relevant. Then, after your final commit and push to Github, CHECK on Github to make sure that everything looks the way you intended!
## Tagged release (0.5 points)
You've tagged a release for Milestone 1.
### Attribution
Thanks to Icíar Fernández Boyano for mostly putting this together, and Vincenzo Coia for launching.