---
title: "Clustering Your Users/Customers"
author: "Trevor Paulsen"
output: html_notebook
---
In this notebook, we're going to cluster our users using an unsupervised machine learning technique, then find users that are statistically likely to convert. Finally, we'll upload those clusters and propensity scores to Analysis Workspace using something called "Customer Attributes" so that you can use them in reporting!
```{r}
# First step is to create visitor-level aggregations to work with:
library(dplyr)   # group_by/summarize/%>% (harmless if already loaded)
library(stringr) # str_match for pulling revenue out of event_list
user_rollups = stitched_data_feed %>%
  group_by(stitched_id) %>%
  summarize(
    visits = n_distinct(visit_num),
    hits = n(),
    product_views = sum(ifelse(grepl(",2,", event_list), 1, 0)),
    # For Postgres or Spark SQL you'd use %~% or %regexp% as below:
    # product_views = sum(ifelse(event_list %~% ",2,", 1, 0))
    email_sign_ups = sum(ifelse(grepl(",201,", event_list), 1, 0)),
    internal_searches = sum(ifelse(grepl(",202,", event_list), 1, 0)),
    discounts = sum(ifelse(grepl(",203,", event_list), 1, 0)),
    distinct_channels = n_distinct(channel),
    orders = sum(ifelse(grepl(",1,", event_list), 1, 0)),
    # Revenue is carried as ",200=<amount>," inside event_list
    revenue = sum(ifelse(grepl(",200=", event_list), as.numeric(str_match(event_list, ",200=(.*?),")[,2]), 0))
    # For Postgres or Spark SQL, use:
    # revenue = sum(ifelse(event_list %~% ",200=", as.numeric(regexp_extract(event_list, ",200=(.*?),")), 0))
  ) %>% collect()
head(user_rollups)
```
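To make the pattern matching above concrete, here is a small base R sketch on made-up `event_list` strings. The example values are hypothetical, and the sketch assumes each `event_list` is wrapped in leading and trailing commas, which is what patterns like `",2,"` rely on. It uses `regmatches()`/`regexec()` instead of `str_match()` only so that it runs without extra packages.

```{r}
# Hypothetical event_list values (assumed to be padded with commas on both ends)
toy_events = c(",2,201,", ",1,200=49.99,", ",12,202,")

# ",2," only matches event 2 itself, not 12, 202 or 200=...
toy_has_product_view = grepl(",2,", toy_events, fixed = TRUE)

# Pull the amount out of ",200=<amount>,"; rows without it count as 0 revenue
toy_matches = regmatches(toy_events, regexec(",200=(.*?),", toy_events, perl = TRUE))
toy_revenue = sapply(toy_matches, function(m) if (length(m) == 2) as.numeric(m[2]) else 0)

toy_has_product_view
toy_revenue
```

If your feed does not pad `event_list` with commas, an event at the very start or end of the list would be missed by these patterns, so it is worth checking a few raw rows first.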
# Cleaning the data
Clustering data from web interactions can be extremely problematic - the data is very skewed, conversions are relatively rare, and there is a lot of noise. Most cluster models fall apart under those conditions. To avoid that, we'll use a combination of data filtering and a maximum likelihood cluster estimate to arrive at useful clusters.
To start, I need to throw out a lot of the junk. Most sites have a lot of visitors with a single hit - those users are not useful and can gunk up a cluster or propensity model, so we'll remove them first. I'll also remove the user ID field, since it's not useful for modeling either.
```{r}
clean_sample = user_rollups %>%
  filter(hits > 1) %>%   # drop single-hit visitors
  select(-stitched_id)   # the ID is not a modeling feature
```
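Before filtering, it can help to see how much of the data the single-hit rule removes. The sketch below does that on a tiny, hypothetical stand-in for `user_rollups` (the `toy_rollups` values are invented), using base R so it runs on its own.

```{r}
# Hypothetical stand-in for user_rollups
toy_rollups = data.frame(
  stitched_id = c("a", "b", "c", "d", "e"),
  hits = c(1, 1, 4, 12, 1),
  revenue = c(0, 0, 0, 59.5, 0)
)

# Share of users that the hits > 1 filter will drop
toy_single_hit_share = mean(toy_rollups$hits == 1)

# Same filtering as above: keep multi-hit users, drop the ID column
toy_clean = toy_rollups[toy_rollups$hits > 1, setdiff(names(toy_rollups), "stitched_id")]

toy_single_hit_share
toy_clean
```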
# Visualizing Users with tSNE
Next we're going to use a really awesome dimensionality reduction technique that's relatively new in the machine learning world. It's called "t-distributed stochastic neighbor embedding" (tSNE). There's an awesome Google TechTalk about this approach here: https://www.youtube.com/watch?v=RJVL80Gg3lA - basically, it maps high-dimensional data into two dimensions while keeping similar points close together, so the relationships show up spatially - which is awesome for visualization.
In our case, we're going to map all of the user attributes (total revenue, visits, hits, product views, etc.) into a two-dimensional chart to see if we can identify interesting clusters of users.
```{r message=FALSE}
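library(Rtsne)
library(plotly) # plot_ly/layout (harmless if already loaded)
# Rtsne does not accept duplicate rows, so dedupe first
deduped_clean_sample = unique(clean_sample)
set.seed(200) # for reproducibility
transformed_data = Rtsne(deduped_clean_sample, dims=2)
plot_data = as.data.frame(transformed_data$Y)
p = plot_ly(
  data = plot_data,
  x = ~V1,
  y = ~V2,
  type = "scatter",
  mode = "markers"
) %>% layout(
  xaxis = list(
    # range sets the visible axis limits (domain is a plot-area fraction, 0 to 1)
    range = c(-60,60),
    title = "tSNE V1"
  ),
  yaxis = list(
    range = c(-50,50),
    title = "tSNE V2",
    scaleanchor = "x"
  )
)
p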
```
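One thing worth keeping in mind: tSNE works from distances between rows, so a column with large values (like revenue) can dominate columns with small values (like visits). The chunk above feeds the raw rollups to `Rtsne`; standardizing first is an optional variation, not part of the original workflow. A minimal sketch on hypothetical values, using base R's `scale()`:

```{r}
# Hypothetical feature values on very different scales
toy_features = data.frame(
  visits = c(1, 2, 3, 10),
  revenue = c(0, 0, 120, 950)
)

# Center each column to mean 0 and rescale to standard deviation 1
toy_scaled = as.data.frame(scale(toy_features))

round(colMeans(toy_scaled), 10)
sapply(toy_scaled, sd)
```

Note that `scale()` returns `NaN` for a column with zero variance, so constant columns would need to be dropped before trying this.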
# Expectation Maximization Clustering
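Notice how the tSNE algorithm has clumped a bunch of users together in an interesting and visual way. With this, we're ready to cluster our users. To start, I'll try a great implementation of the expectation maximization (EM) model found in the "mclust" package. Passing `G=1:cluster_count` lets Mclust try every cluster count from 1 up to `cluster_count` and keep the best-fitting model: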
```{r message=FALSE}
# You may want to do this first on a sample if you have a bajillion users
library(mclust)
cluster_count = 5
cluster_model1 = Mclust(plot_data, G=1:cluster_count)
cluster_data1 = plot_data
cluster_data1$cluster = cluster_model1$classification
p <- plot_ly(
  data = cluster_data1,
  x = ~V1,
  y = ~V2,
  type = "scatter",
  mode = "markers",
  color = ~as.factor(cluster)
) %>% layout(
  xaxis = list(
    # range sets the visible axis limits (domain is a plot-area fraction, 0 to 1)
    range = c(-60,60),
    title = "tSNE V1"
  ),
  yaxis = list(
    range = c(-50,50),
    title = "tSNE V2",
    scaleanchor = "x"
  )
)
p
```
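If expectation maximization is new to you, the toy sketch below shows the idea in one dimension with two Gaussian components, written in base R. It is only an illustration on simulated data with hand-picked starting values, not what `Mclust` does internally: `Mclust` fits multivariate mixtures, considers several covariance structures, and picks the number of components by BIC.

```{r}
set.seed(1)
# Simulated data: two groups centered at -3 and 3
toy_x = c(rnorm(200, mean = -3), rnorm(200, mean = 3))

# Hand-picked starting guesses for the two components
toy_mu = c(-1, 1)
toy_sigma = c(1, 1)
toy_weight = c(0.5, 0.5)

for (iteration in 1:50) {
  # E-step: how responsible is each component for each point?
  dens1 = toy_weight[1] * dnorm(toy_x, toy_mu[1], toy_sigma[1])
  dens2 = toy_weight[2] * dnorm(toy_x, toy_mu[2], toy_sigma[2])
  resp1 = dens1 / (dens1 + dens2)
  resp2 = 1 - resp1

  # M-step: re-estimate weights, means and standard deviations
  toy_weight = c(mean(resp1), mean(resp2))
  toy_mu = c(sum(resp1 * toy_x) / sum(resp1), sum(resp2 * toy_x) / sum(resp2))
  toy_sigma = c(
    sqrt(sum(resp1 * (toy_x - toy_mu[1])^2) / sum(resp1)),
    sqrt(sum(resp2 * (toy_x - toy_mu[2])^2) / sum(resp2))
  )
}

# Hard assignment: each point goes to the more responsible component
toy_cluster = ifelse(resp1 > 0.5, 1, 2)
toy_mu
table(toy_cluster)
```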
# Density Based Clustering
You can see the EM cluster model works pretty well, but it's not exactly what I want. It kinda missed the group in the middle and lumped together some of the outlying groups that I'd probably want kept separate. Typically, tSNE output is best clustered by a density based approach, which we'll try next:
```{r warning=FALSE,message=FALSE,error=FALSE}
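library(dbscan)
# eps is the neighborhood radius; minPts is the minimum points for a dense region
cluster_model2 = dbscan(plot_data, minPts = 15, eps=2.3)
clustered_data2 = plot_data
clustered_data2$cluster = cluster_model2$cluster
p = plot_ly(
  data = clustered_data2,
  x = ~V1,
  y = ~V2,
  type = "scatter",
  mode = "markers",
  color = ~as.factor(cluster)
) %>% layout(
  xaxis = list(
    # range sets the visible axis limits (domain is a plot-area fraction, 0 to 1)
    range = c(-60,60),
    title = "tSNE V1"
  ),
  yaxis = list(
    range = c(-50,50),
    title = "tSNE V2",
    scaleanchor = "x"
  )
)
p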
```
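One detail to watch when reading the chart: `dbscan` labels points that don't belong to any dense region as cluster 0 (noise). A quick way to check how much noise a given `eps`/`minPts` choice produces is sketched below. The `toy_labels` vector is a hypothetical stand-in for `cluster_model2$cluster`.

```{r}
# Hypothetical stand-in for cluster_model2$cluster (0 = noise)
toy_labels = c(0, 1, 1, 2, 0, 2, 2, 1)

# Share of users left unclustered
toy_noise_share = mean(toy_labels == 0)

# Sizes of the real clusters, ignoring noise
toy_cluster_sizes = table(toy_labels[toy_labels != 0])

toy_noise_share
toy_cluster_sizes
```

If the noise share is large, a bigger `eps` or a smaller `minPts` is usually the first thing to try.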
# Making sense of the clusters
So this looks much better, but the problem is that the tSNE coordinates by themselves don't tell me anything useful about what these clusters mean. To find out, we'll append the cluster labels to the user rollup and plot a metric on a third axis. (Drag this one around to explore.)
```{r warning=FALSE,message=FALSE,error=FALSE}
# Rows line up because the tSNE output keeps the row order of deduped_clean_sample
clustered_user_rollup = cbind(deduped_clean_sample, clustered_data2)
# Visualizing the clusters for each metric (swap the z line to change metric):
plot_ly(
  data = clustered_user_rollup,
  x = ~V1,
  y = ~V2,
  #z = ~as.numeric(hits),
  #z = ~as.numeric(visits),
  #z = ~as.numeric(email_sign_ups),
  z = ~as.numeric(revenue),
  #z = ~as.numeric(internal_searches),
  #z = ~as.numeric(orders),
  #z = ~as.numeric(product_views),
  type = "scatter3d",
  mode = "markers",
  color = ~as.factor(cluster)
) %>% layout(
  height = 650,
  scene = list(
    xaxis = list(
      # range sets the visible axis limits (domain is a plot-area fraction, 0 to 1)
      range = c(-60,60),
      title = "tSNE V1"
    ),
    yaxis = list(
      # scaleanchor is not an option for 3D scene axes, so it is left out here
      range = c(-50,50),
      title = "tSNE V2"
    ),
    zaxis = list(
      title = "Revenue"
    )
  )
)
```
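Alongside the 3D chart, a per-cluster summary table is often the quickest way to describe each group. Here is a minimal sketch on a hypothetical `toy_clustered` data frame (a stand-in for `clustered_user_rollup` with invented values), using base R's `aggregate()` so it runs without extra packages:

```{r}
# Hypothetical stand-in for clustered_user_rollup
toy_clustered = data.frame(
  cluster = c(1, 1, 2, 2, 2),
  visits = c(1, 3, 8, 10, 12),
  revenue = c(0, 0, 100, 150, 50)
)

# Average of each metric within each cluster
toy_profile = aggregate(cbind(visits, revenue) ~ cluster, data = toy_clustered, FUN = mean)
toy_profile
```

With the real data, you'd likely want medians as well as means, since the metrics are heavily skewed.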