---
title: "Bellabeat Case Study"
output:
html_document:
df_print: paged
---
## Business task
Analyze consumer usage data from another brand's smart devices in order to inform Bellabeat's future marketing strategy.
> One angle to explore: number of steps vs. sleep, compared with activity intensity vs. sleep.
## The data
The data used in this study is the "FitBit Fitness Tracker Data" set from Kaggle, made available by Mobius. It was collected through a survey distributed via Amazon Mechanical Turk. The sample size is not ideal (just 30 eligible FitBit users), but it is enough to get an idea of trends surrounding smart device usage. Keeping the business task in mind, only 4 of the 18 files in the data set will be used in this study.
## Cleaning and transforming the data
For this analysis, R seems the easiest and most efficient tool to use: the data is not too big, but it is spread across multiple spreadsheets, so any merging and cleaning can be done quickly in R. The following packages will be used: tidyverse for importing, wrangling, and plotting the data; snakecase for cleaning column names; and skimr for quick summaries.
```{r}
library(tidyverse)   # import, wrangling, and plotting (readr, dplyr, ggplot2, lubridate, ...)
library(snakecase)   # to_snake_case() for cleaning column names
library(skimr)       # n_unique() and quick data summaries
```
```{r}
daily_activity <- read_csv("Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")
daily_sleep <- read_csv("Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv")
heart_rate <- read_csv("Fitabase Data 4.12.16-5.12.16/heartrate_seconds_merged.csv")
hourly_steps <- read_csv("Fitabase Data 4.12.16-5.12.16/hourlySteps_merged.csv")
```
I will start the cleaning process by previewing each data set to check the column names and values: whether they are spelled and formatted consistently, whether they contain uppercase characters, and whether there is any disorder in general.
```{r}
head(daily_activity)
head(daily_sleep)
head(heart_rate)
head(hourly_steps)
```
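For a more detailed structural overview (column types, missing-value counts, value ranges), the already-loaded skimr package can help. A minimal sketch using skim_without_charts(), which skips the inline histograms that can render poorly in some outputs:
```{r}
# optional: a fuller summary of daily_activity than head() provides
skim_without_charts(daily_activity)
```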
Next, it is worth confirming how many users actually appear in each data set, to see whether all of them really have 30 participants. Counting the unique ids is enough for this.
```{r}
n_unique(daily_activity$Id)
n_unique(daily_sleep$Id)
n_unique(heart_rate$Id)
n_unique(hourly_steps$Id)
```
The hourly_steps and daily_activity data sets have sufficient participants. The daily_sleep and heart_rate data sets have fewer than 30 participants, which is not ideal; for the sake of this study, however, they will still be used.
The previews of these data frames show that the column names should be converted from camel case (e.g. ActivityDate) to snake case (e.g. activity_date).
```{r}
names(daily_activity) <- to_snake_case(names(daily_activity))
names(daily_sleep) <- to_snake_case(names(daily_sleep))
names(heart_rate) <- to_snake_case(names(heart_rate))
names(hourly_steps) <- to_snake_case(names(hourly_steps))
```
Also, even though daily_sleep contains daily data, its sleep_day column stores a time of day alongside the date, which is irrelevant and unnecessary here. It is better to convert that column from datetime to plain date.
```{r}
daily_sleep <- daily_sleep %>%
  mutate(sleep_day = as_date(sleep_day, format = "%m/%d/%Y %I:%M:%S %p"))
```
While at it, it is good to parse the date columns of the other data sets into proper date formats as well.
```{r}
daily_activity <- daily_activity %>%
  mutate(activity_date = as_date(activity_date, format = "%m/%d/%Y"))
hourly_steps <- hourly_steps %>%
  mutate(activity_hour = as_datetime(activity_hour, format = "%m/%d/%Y %I:%M:%S %p"))
```
The next step in the cleaning process is checking for missing values and getting rid of any that turn up.
```{r}
sum(is.na(daily_activity))
sum(is.na(daily_sleep))
sum(is.na(heart_rate))
sum(is.na(hourly_steps))
```
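As an aside, if any of these sums had come back non-zero, per-column counts could locate the gaps so the affected rows could be removed; a minimal sketch:
```{r}
# per-column missing-value counts; rows with gaps could then be removed
# with tidyr's drop_na(), which is attached with the tidyverse
colSums(is.na(daily_activity))
```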
According to the results, the data sets do not contain any missing values. Finally, it is time to check for duplicate rows.
```{r}
sum(duplicated(daily_activity))
sum(duplicated(daily_sleep))
sum(duplicated(hourly_steps))
```
Since daily_sleep has 3 duplicate rows, they should be removed, and the removal verified.
```{r}
daily_sleep <- distinct(daily_sleep)
sum(duplicated(daily_sleep))
```
Good: the daily_sleep data set now also consists only of distinct rows. As a final transformation, a weekday column is added to both daily data sets for the day-of-week analysis below.
```{r}
# add a weekday column to both daily data sets
daily_activity <- daily_activity %>% mutate(weekday = weekdays(activity_date))
daily_sleep <- daily_sleep %>% mutate(weekday = weekdays(sleep_day))
```
## Analyzing
Questions:

- Which days have the least active minutes (to promote activity more on those days)?
- Which days have the least amount of sleep on average (to promote sleep more on those days)?
- Which activity level is the least popular?
- Which hour of the day is the least active (to increase activity during it)?
- How often are heart rates out of the ordinary, and what can be done to warn the user?
```{r}
# fixed weekday order for the x-axis (named to avoid clashing with the weekday column)
weekday_order <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")
ggplot(data = daily_activity, aes(x = weekday)) +
  geom_bar(fill = "orange") +
  scale_x_discrete(limits = weekday_order) +
  labs(title = "Frequency of Activity Records by Weekday", x = "Weekday", y = "Number of Records")
```
```{r}
ggplot(data = daily_sleep, aes(x = weekday)) +
  geom_bar(color = "purple", fill = "pink") +
  scale_x_discrete(limits = weekday_order) +
  labs(title = "Frequency of Sleep Records by Weekday", x = "Weekday", y = "Number of Records")
```
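To address the average-sleep question directly, minutes asleep can be averaged per weekday. A minimal sketch, assuming total_minutes_asleep is the sleepDay column name after the snake_case conversion above:
```{r}
# average minutes asleep per weekday
daily_sleep %>%
  group_by(weekday) %>%
  summarise(average_minutes_asleep = mean(total_minutes_asleep)) %>%
  ggplot(aes(x = weekday, y = average_minutes_asleep)) +
  geom_col(color = "purple", fill = "pink") +
  scale_x_discrete(limits = weekday_order) +
  labs(title = "Average Minutes Asleep by Weekday", x = "Weekday", y = "Average Minutes Asleep")
```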
To get a per-user overview, the average sedentary and active minutes can be summarized for each id:
```{r}
average_daily_activity <- daily_activity %>%
  group_by(id) %>%
  summarise(
    average_sedentary_minutes = mean(sedentary_minutes),
    # overall active minutes = very + fairly + lightly active minutes
    average_active_minutes = mean(very_active_minutes + fairly_active_minutes + lightly_active_minutes)
  )
head(average_daily_activity)
```
Instead of analyzing every single user individually, the users can be categorized into five groups based on their average daily steps. According to [this article](https://pubmed.ncbi.nlm.nih.gov/14715035/), the classification can be done as follows:

- <5,000 steps/day: 'sedentary';
- 5,000-7,499 steps/day: 'low active';
- 7,500-9,999 steps/day: 'somewhat active';
- ≥10,000 steps/day: 'active';
- >12,500 steps/day: 'highly active'.
```{r}
user_type <- daily_activity %>%
  group_by(id) %>%
  summarise(average_steps = mean(total_steps)) %>%
  mutate(user_type = case_when(
    average_steps < 5000 ~ "sedentary",
    average_steps >= 5000 & average_steps < 7500 ~ "low active",
    average_steps >= 7500 & average_steps < 10000 ~ "somewhat active",
    average_steps >= 10000 & average_steps <= 12500 ~ "active",
    average_steps > 12500 ~ "highly active"
  ))
```
These per-user labels and averages can then be merged back into daily_activity by id:
```{r}
daily_activity <- merge(x = daily_activity, y = user_type, by = "id")
daily_activity <- merge(x = daily_activity, y = average_daily_activity, by = "id")
```
```{r}
activity_level <- c("sedentary", "low active", "somewhat active", "active", "highly active")
ggplot(data = daily_activity, aes(x = user_type)) +
  geom_bar(color = "purple", fill = "pink") +
  scale_x_discrete(limits = activity_level) +
  labs(title = "Activity Levels", x = "User Type", y = "Number of Daily Records")
```
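The hourly-activity question can be approached by averaging steps per hour of day. A minimal sketch, assuming step_total and activity_hour are the hourlySteps column names after the snake_case conversion, and that lubridate's hour() is available via the tidyverse:
```{r}
# average steps per hour of day, to spot the least active hours
hourly_steps %>%
  mutate(hour_of_day = hour(activity_hour)) %>%
  group_by(hour_of_day) %>%
  summarise(average_steps = mean(step_total)) %>%
  ggplot(aes(x = hour_of_day, y = average_steps)) +
  geom_col(fill = "orange") +
  labs(title = "Average Steps by Hour of Day", x = "Hour of Day", y = "Average Steps")
```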
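The heart-rate question could get a rough first pass by counting readings outside an assumed "ordinary" range. The 40-100 bpm bounds below are purely illustrative, not a clinical threshold, and value is assumed to be the heartrate_seconds column name after the snake_case conversion:
```{r}
# share of each user's heart-rate readings outside an assumed 40-100 bpm range
heart_rate %>%
  group_by(id) %>%
  summarise(share_out_of_range = mean(value < 40 | value > 100)) %>%
  arrange(desc(share_out_of_range))
```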