22 | 22 | " 2. Other readings that the MSU Machine Learning Group has used can be found [here](https://drive.google.com/drive/u/0/folders/1F2hcSpIa_jWyVCVS51oEO1dXzWP_kOu2).\n",
23 | 23 | "\n",
24 | 24 | "## Data\n",
25 | | - "\n",
26 | | - "In the data folder, you will find a few different data sets. To conceptually understand the cluster algorithms you will use the `sample_algorthm.csv` dataset. For the second part of the tutorial, you will use the `sample_5features.csv` dataset. The \"Challenges\" at the end of the tutorial will utilize th `sample_challenge.csv`dataset.\n",
| 25 | + "In the data folder, you will find a few different data sets. To conceptually understand the clustering algorithms, you will use `clustering_data.pkl`. The clustering data is a simulated dataset of education-like samples. Using college GPA and the percentage of passed courses, you will attempt to cluster students by whether or not they earn a degree. This is encoded in the `degree` column of the dataset as a `1` for those who earn a degree and `0` otherwise. The \"Challenges\" at the end of the tutorial will utilize the `sample_challenge.csv` dataset.\n",
27 | 26 | "\n",
28 | 27 | "## K-Means Clustering\n",
29 | 28 | "\n",
30 | 29 | "*K*-means clustering is a simple approach for partitioning a data set into *K* distinct, non-overlappying clusters. To perform *k*-means clustering, the researcher must specify the desired number of clusters before running the algorithm. There are alternative approaches to clustering which does not require that you commit to a particular number of clusters. We will explore this later but for now, let's focus on *k*-means clustering.\n",
31 | 30 | "\n",
32 | | - "### Task 1: Conceptually understanding the *k*-means clustering algorithm\n",
| 31 | + "### Task 1: Scaling the data\n",
| 32 | + "The *K*-means algorithm is centroid-based, which means it works by measuring distances between data points. As such, the scale of the features is very important, so that small variations in a variable with a large magnitude do not overpower larger variations in features of a smaller scale. \n",
33 | 33 | "1. Import the `sample_algorithm.csv` file and visualize the data using `plt.plot` used in the previous tutorials. This dataset has only 2 features so a simple 2-D plot will work. Does this dataset appear to have a clustering structure? If so, how many clusters do you think are in the dataset?\n",
34 | | - "2. It looks as though there might be 3 clusters in the data. Choose three random points in the x-y plane to be used as the initial centers. Plot them on the previous graph as an 'X'. *Note: Each center needs to be initialized at differnt point in space.*\n",
35 | | - "3. Define three new \"distance\" columns in the dataset to calculate the distances between each of the three centroids and each observation. The most common distance metric used in a clustering analysis is the Euclidean distance. Below is a function already written for you to use.\n",
| 34 | + "2. Does the data exist on different scales? If so, standardize each feature using the z-score transformation $f(x) = \\frac{x - \\bar{x}}{\\sigma}$, where $\\bar{x}$ is the sample mean and $\\sigma$ is the sample standard deviation. A sketch of this step appears below.\n",
| 35 | + "\n",
| 36 | + "### Task 2: Conceptually understanding the *k*-means clustering algorithm\n", |
| 37 | + "\n", |
| 38 | + "1. It looks as though there might be 2 clusters in the data. Choose two random points in the x-y plane to be used as the initial centers. Plot them on the previous graph as an 'X'. *Note: Each center needs to be initialized at differnt point in space.*\n", |
| 39 | + "2. Define three new \"distance\" columns in the dataset to calculate the distances between each of the three centroids and each observation. The most common distance metric used in a clustering analysis is the Euclidean distance. Below is a function already written for you to use.\n", |
36 | 40 | "```python\n",
37 | 41 | "def calculate_distance(initial, X, Y):\n",
38 | | - "    distances = []\n",
39 | 42 | "    c_x, c_y = initial\n",
40 | | - "    for x, y in list(zip(X, Y)):\n",
41 | | - "        root_diff_x = (x - c_x) ** 2\n",
42 | | - "        root_diff_y = (y - c_y) ** 2\n",
43 | | - "        distance = np.sqrt(root_diff_x + root_diff_y)\n",
44 | | - "        distances.append(distance)\n",
45 | | - "    return distances\n",
| 43 | + "    root_diff_x = (X - c_x) ** 2\n",
| 44 | + "    root_diff_y = (Y - c_y) ** 2\n",
| 45 | + "    distance = np.sqrt(root_diff_x + root_diff_y)\n",
| 46 | + "    return distance\n",
46 | 47 | "```\n",
47 | 48 | "4. For each observation, compare the three distances and chose the *smallest* one (use [`np.argmin`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.argmin.html).) Using the `map` function, label the centroids accordingly in a new column called \"Clusters\".\n",
48 | 49 | "5. Find the new centroid points by taking the means for both features of each of the three clusters. Make a new plot of the data, coloring the three clusters and labeling the new centroids as 'D'. What happened to the three centroids as a result of this algorithm?\n",
119 | 120 | "name": "python",
120 | 121 | "nbconvert_exporter": "python",
121 | 122 | "pygments_lexer": "ipython3",
122 | | - "version": "3.5.5"
| 123 | + "version": "3.7.3"
123 | 124 | }
124 | 125 | },
125 | 126 | "nbformat": 4,