22 | 22 | " 2. Other readings that the MSU Machine Learning Group has used can be found [here](https://drive.google.com/drive/u/0/folders/1F2hcSpIa_jWyVCVS51oEO1dXzWP_kOu2).\n",
23 | 23 | "\n",
24 | 24 | "## Data\n",
25 | | - "\n",
26 | | - "In the data folder, you will find a few different data sets. To conceptually understand the cluster algorithms you will use the `sample_algorthm.csv` dataset. For the second part of the tutorial, you will use the `sample_5features.csv` dataset. The \"Challenges\" at the end of the tutorial will utilize th `sample_challenge.csv`dataset.\n",
| 25 | + "In the data folder, you will find a few different data sets. To conceptually understand the clustering algorithms, you will use `clustering_data.pkl`. The clustering data is a simulated dataset of education-like samples. Using college GPA and the percentage of passed courses, you will attempt to cluster students by whether or not they earn a degree. This is encoded in the `degree` column of the dataset as a `1` for those who earn a degree and `0` otherwise. The \"Challenges\" at the end of the tutorial will utilize the `sample_challenge.csv` dataset.\n",
27 | 26 | "\n",
28 | 27 | "## K-Means Clustering\n",
29 | 28 | "\n",
30 | 29 | "*K*-means clustering is a simple approach for partitioning a data set into *K* distinct, non-overlappying clusters. To perform *k*-means clustering, the researcher must specify the desired number of clusters before running the algorithm. There are alternative approaches to clustering which does not require that you commit to a particular number of clusters. We will explore this later but for now, let's focus on *k*-means clustering.\n",
31 | 30 | "\n",
32 | | - "### Task 1: Conceptually understanding the *k*-means clustering algorithm\n",
| 31 | + "### Task 1: Scaling the data\n",
| 32 | + "The *K*-means algorithm is centroid-based, which means it works by measuring distances between data points. As such, the scale of the features is very important, so that small variations in a variable with a large magnitude do not overpower larger variations in features of a smaller scale. \n",
33 | 33 | "1. Import the `sample_algorithm.csv` file and visualize the data using `plt.plot` used in the previous tutorials. This dataset has only 2 features so a simple 2-D plot will work. Does this dataset appear to have a clustering structure? If so, how many clusters do you think are in the dataset?\n",
34 | | - "2. It looks as though there might be 3 clusters in the data. Choose three random points in the x-y plane to be used as the initial centers. Plot them on the previous graph as an 'X'. *Note: Each center needs to be initialized at differnt point in space.*\n",
35 | | - "3. Define three new \"distance\" columns in the dataset to calculate the distances between each of the three centroids and each observation. The most common distance metric used in a clustering analysis is the Euclidean distance. Below is a function already written for you to use.\n",
| 34 | + "2. Does the data exist on different scales? If so, standardize each feature using the z-score transformation $f(x) = \\frac{x - \\bar{x}}{\\sigma}$, where $\\bar{x}$ is the sample mean and $\\sigma$ is the sample standard deviation. A sketch of this step appears below.\n",
| 35 | + "\n",
| 36 | + "### Task 2: Conceptually understanding the *k*-means clustering algorithm\n", |
| 37 | + "\n", |
| 38 | + "1. It looks as though there might be 2 clusters in the data. Choose two random points in the x-y plane to be used as the initial centers. Plot them on the previous graph as an 'X'. *Note: Each center needs to be initialized at differnt point in space.*\n", |
| 39 | + "2. Define three new \"distance\" columns in the dataset to calculate the distances between each of the three centroids and each observation. The most common distance metric used in a clustering analysis is the Euclidean distance. Below is a function already written for you to use.\n", |
36 | 40 | "```python\n",
37 | 41 | "def calculate_distance(initial, X, Y):\n",
38 | | - "    distances = []\n",
39 | 42 | "    c_x, c_y = initial\n",
40 | | - "    for x, y in list(zip(X, Y)):\n",
41 | | - "        root_diff_x = (x - c_x) ** 2\n",
42 | | - "        root_diff_y = (y - c_y) ** 2\n",
43 | | - "        distance = np.sqrt(root_diff_x + root_diff_y)\n",
44 | | - "        distances.append(distance)\n",
45 | | - "    return distances\n",
| 43 | + "    root_diff_x = (X - c_x) ** 2\n",
| 44 | + "    root_diff_y = (Y - c_y) ** 2\n",
| 45 | + "    distance = np.sqrt(root_diff_x + root_diff_y)\n",
| 46 | + "    return distance\n",
46 | 47 | "```\n",
47 | 48 | "4. For each observation, compare the three distances and chose the *smallest* one (use [`np.argmin`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.argmin.html).) Using the `map` function, label the centroids accordingly in a new column called \"Clusters\".\n",
48 | 49 | "5. Find the new centroid points by taking the means for both features of each of the three clusters. Make a new plot of the data, coloring the three clusters and labeling the new centroids as 'D'. What happened to the three centroids as a result of this algorithm?\n",
119 | 120 | "name": "python",
120 | 121 | "nbconvert_exporter": "python",
121 | 122 | "pygments_lexer": "ipython3",
122 | | - "version": "3.5.5"
| 123 | + "version": "3.7.3"
123 | 124 | }
124 | 125 | },
125 | 126 | "nbformat": 4,