## Introduction

In this tutorial, you use unsupervised learning to discover groupings and anomalies in data. Unsupervised learning is when there is no *ground truth* or labeled data set that shows you the expected result. Instead, you take the raw data and use various algorithms to uncover clusters of data. If you want to learn about the theory and ideas behind unsupervised learning, read [Unsupervised learning for data classification](https://developer.ibm.com/articles/cc-unsupervised-learning-data-classification/).

As the clusters become apparent (for example, in 3-D charts), you might be tempted to apply labels to clusters. That is an example of using unsupervised learning to discover hidden features in your data. If you don't care about the clusters, but want to learn more about the outliers, then you are looking at *anomaly detection*.

It's important to note that when you try to visualize data, you tend to use two-dimensional or three-dimensional data, which is the easiest to plot. But the techniques we use here apply to multidimensional or hyperdimensional data. It is harder to visualize, but you can still measure the mathematical distribution of data in the clusters and use the discovered groupings and outliers in much the same way.

Our first example uses the data set that was generated with scikit-learn's `make_blobs` function.

The following code trains a k-means model and runs prediction on the data set. The chart uses color to show the predicted cluster membership and a red X to show the cluster center.
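
The Notebook cell itself isn't reproduced in this README, but a minimal sketch of that step (assuming `make_blobs` data and scikit-learn's default `KMeans` settings, which may differ from the Notebook's) looks something like this:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate three well-separated blobs (sample size and seeds are placeholders).
X, _ = make_blobs(n_samples=500, centers=3, random_state=42)

# Train k-means and predict a cluster for every point. Try changing k and
# rerunning to see how the blobs get divided or combined.
kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(X)

# Color by predicted cluster and mark each cluster center with a red X.
plt.scatter(X[:, 0], X[:, 1], c=labels, s=10)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            c="red", marker="x", s=200)
plt.show()
```
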
That's how easy it is when you have the right clustering algorithm for the data. If you are running the Notebook, feel free to change k and rerun the cell to see how the blobs must be divided or combined when k is not equal to 3.

The second big limitation is that k-means separates the points into spherical shapes.

Visually, it is obvious that the data points form two shapes, and with k=2 you would like to see the predicted clusters separate the smile from the frown. Unfortunately, you cannot do that separation using spherical clusters.
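
As a quick illustration (not the Notebook's exact code), forcing k-means onto scikit-learn's two-moons data with k=2 shows the problem; the two spherical clusters cut across both moons instead of separating them:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

# Two interleaving half-circles: the "smile" and the "frown".
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# With k=2, k-means still draws a roughly straight boundary between two
# spherical clusters, so each predicted cluster contains parts of both moons.
labels = KMeans(n_clusters=2, random_state=42).fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=labels, s=10)
plt.show()
```
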
To look at a less-contrived example, we've used part of a customer data set that includes customer demographics, account activity, and stock-trading profit. Once again, we have more than three dimensions, but we've chosen three significant ones to help with the visualization. The following image shows our result with k=3. We'll use this to show how some of the other algorithms compare. After you get the hang of it, try it out with your own data.

In some use cases, the cluster shapes are useful. For example, clustering is often part of image recognition where the goal is to recognize shapes. However, for our customer example, the shapes help us demonstrate cluster separation and density, but the real goal would be to identify groups of customers so that we can use those groupings for a business purpose. In our example, we had a churn risk label that was not included in the data set for training and prediction. Now, we'll compare our predicted clusters with our known churn risk just to see if we found a customer grouping that might be interesting. Did we uncover a hidden feature or common trait in our customer groupings?

It turns out that these three clusters are great for separating out the low-risk customers. The high-risk cluster looks useful, but we'd need more information to know if the medium/high cluster is useful. We'll use this external evaluation along with scatter diagrams to help illustrate the differences as we try some other clustering algorithms.
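
As a rough sketch of that external evaluation (the file name, feature columns, and churn-risk column below are placeholders, not the tutorial's actual schema), you could cluster three chosen features with k=3 and cross-tabulate the predicted clusters against the held-out risk label:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical customer file and columns; substitute your own data set.
df = pd.read_csv("customers.csv")
features = ["AGE", "ACTIVITY_LEVEL", "TOTAL_UNITS_TRADED"]

# Scale the chosen features so no single one dominates the distance measure.
X = StandardScaler().fit_transform(df[features])

# Predict three customer clusters.
df["CLUSTER"] = KMeans(n_clusters=3, random_state=42).fit_predict(X)

# External evaluation: compare clusters with the churn-risk label that was
# deliberately left out of training.
print(pd.crosstab(df["CLUSTER"], df["CHURN_RISK"]))
```
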
Let's look at the customer clusters that were predicted with mean shift.

Next, notice that we used the non-default `cluster_all=False` parameter setting. This allowed the algorithm to take data points in sparse regions and label them as orphans (cluster -1). In the chart, the sparse green points on the left and on the right were orphans. The animated matplotlib chart shows how separating the orphans adds clarity to the remaining clusters and also identifies data points that might be considered anomalies.

When comparing to the risk label, it is not obvious if these groupings are significantly more useful than our original three (from k-means), but it would make sense that the orphans and the mini-clusters on the edges would be less likely to be consistent with the more densely clustered customers. Separating them helps clarify our groupings.

Because mean shift at least considers density, you might think it would handle the moons data set better than k-means, but actually, we had to customize the parameters quite a bit before it would recognize two clusters. Even then, you see that a centroid-based approach does not work well to separate the smile from the frown. Notice the long tails of orphans that we got with the `cluster_all=False` setting.
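
For reference, here is a self-contained sketch of how `cluster_all=False` produces orphans, using made-up data and a placeholder bandwidth estimate rather than the Notebook's settings:

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

# Dense blobs plus a sprinkling of points in sparse regions.
X, _ = make_blobs(n_samples=500, centers=3, random_state=42)
X = np.vstack([X, np.random.RandomState(0).uniform(-12, 12, size=(25, 2))])

# Estimate a bandwidth from the data; the quantile here is a placeholder.
bandwidth = estimate_bandwidth(X, quantile=0.2)

# cluster_all=False lets mean shift label points in sparse regions as -1
# (orphans) instead of assigning them to the nearest cluster.
labels = MeanShift(bandwidth=bandwidth, cluster_all=False).fit_predict(X)

print("clusters found:", len(set(labels) - {-1}))
print("orphans (label -1):", int((labels == -1).sum()))
```
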
Fortunately, there are density-based algorithms for tackling such problems.

DBSCAN stands for "density-based spatial clustering of applications with noise."

Let's get right to what DBSCAN can do that our centroid-based algorithms failed to do. The following image shows how DBSCAN separated the smile from the frown and also found three points to label as outliers.

This result was exactly what we wanted! However, it's important to note that we had to customize the `eps` parameter to get this result. DBSCAN uses density to automatically determine the clusters, but `eps` is used to tell it what we consider "dense."
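
A minimal sketch of that DBSCAN step on the two-moons data follows; the `eps` value here is a placeholder, and how many points end up labeled -1 depends on `eps`, `min_samples`, and the noise in the sample:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# eps defines what "dense" means: two points within eps of each other are
# neighbors, and points with too few neighbors become outliers (label -1).
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

print("outliers found:", int((labels == -1).sum()))
plt.scatter(X[:, 0], X[:, 1], c=labels, s=10)
plt.show()
```
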
The advantages of DBSCAN include:

- It uses the density of the data to determine the clusters, so you don't have to choose the number of clusters up front.
- It can separate clusters of arbitrary shape, not just spherical ones.
- It labels points in sparse regions as noise instead of forcing them into a cluster.

In our Notebook, we also used DBSCAN to remove the noise and get a different clustering of the customer data set.
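
One way to read that step (sketched here with placeholder data and parameters rather than the Notebook's) is to run DBSCAN and simply drop the points it labels as noise:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Stand-in for the scaled customer features.
X, _ = make_blobs(n_samples=500, centers=3, cluster_std=1.5, random_state=1)

# Points that don't belong to any dense region come back labeled -1.
labels = DBSCAN(eps=0.8, min_samples=10).fit_predict(X)

# Keep only the dense points and their DBSCAN cluster assignments.
mask = labels != -1
X_clean, labels_clean = X[mask], labels[mask]

print("noise points removed:", int((~mask).sum()))
print("clusters kept:", len(set(labels_clean)))
```
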
In our Notebook, we use scikit-learn's implementation of agglomerative clustering.

One of the interesting things about agglomerative clustering is that you get different cluster sizes. Our customer data demo with agglomerative clustering is interesting because we ended up with 14 clusters of various shapes and sizes.
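
Here is a sketch of how a variable number of clusters can fall out of agglomerative clustering; the `distance_threshold` below is a placeholder, not the value that produced our 14 clusters:

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Stand-in for the scaled customer features.
X, _ = make_blobs(n_samples=500, centers=5, cluster_std=1.5, random_state=7)

# With n_clusters=None, the merge tree is cut at distance_threshold, so the
# number (and size) of clusters comes from the data rather than being fixed.
agg = AgglomerativeClustering(n_clusters=None, distance_threshold=15.0,
                              linkage="ward")
labels = agg.fit_predict(X)

print("clusters found:", agg.n_clusters_)
print("cluster sizes:", sorted(int((labels == c).sum()) for c in set(labels)))
```
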
Once again, we see that this works fine to separate our low-risk customers from the high-risk ones. The higher number of clusters appears unnecessary, but it does help with finer distinctions in our customer groupings. The smallest clusters look less significant, but they help clarify the other customer groupings.