
Commit 69652f3

Merge pull request #20 from learningmachineslab/student_data
Student data
2 parents 0753131 + 0e25eef commit 69652f3

14 files changed: +3792 −1130 lines changed

clustering/clustering.ipynb

Lines changed: 14 additions & 13 deletions
Original file line number · Diff line number · Diff line change
@@ -22,27 +22,28 @@
2222
" 2. Other readings that the MSU Machine Learning Group has used can be found [here](https://drive.google.com/drive/u/0/folders/1F2hcSpIa_jWyVCVS51oEO1dXzWP_kOu2).\n",
2323
"\n",
2424
"## Data\n",
25-
"\n",
26-
"In the data folder, you will find a few different data sets. To conceptually understand the cluster algorithms you will use the `sample_algorthm.csv` dataset. For the second part of the tutorial, you will use the `sample_5features.csv` dataset. The \"Challenges\" at the end of the tutorial will utilize th `sample_challenge.csv`dataset.\n",
25+
"In the data folder, you will find a few different data sets. To conceptually understand the cluster algorithms you will use the `clustering_data.pkl`. The clustering data is a simulated dataset of education-like samples. From college GPA and the percentage of passed courses you will attempt to cluster students on whether they get a degree or not. This is encoded in the `degree` column of the dataset as a `1` for those who get a degree and `0` otherwise. The \"Challenges\" at the end of the tutorial will utilize th `sample_challenge.csv`dataset.\n",
2726
"\n",
2827
"## K-Means Clustering\n",
2928
"\n",
3029
"*K*-means clustering is a simple approach for partitioning a data set into *K* distinct, non-overlappying clusters. To perform *k*-means clustering, the researcher must specify the desired number of clusters before running the algorithm. There are alternative approaches to clustering which does not require that you commit to a particular number of clusters. We will explore this later but for now, let's focus on *k*-means clustering.\n",
3130
"\n",
32-
"### Task 1: Conceptually understanding the *k*-means clustering algorithm\n",
31+
"### Task 1: Scaling the data\n",
32+
"The *K*-means algorithm is centroid based, which means it work off measuring distances between datapoints. As such the scale of the features are very important such that small variations in a variable with a large magnitude does not overpower larger variations in features of smaller scale. \n",
3333
"1. Import the `sample_algorithm.csv` file and visualize the data using `plt.plot` used in the previous tutorials. This dataset has only 2 features so a simple 2-D plot will work. Does this dataset appear to have a clustering structure? If so, how many clusters do you think are in the dataset?\n",
34-
"2. It looks as though there might be 3 clusters in the data. Choose three random points in the x-y plane to be used as the initial centers. Plot them on the previous graph as an 'X'. *Note: Each center needs to be initialized at differnt point in space.*\n",
35-
"3. Define three new \"distance\" columns in the dataset to calculate the distances between each of the three centroids and each observation. The most common distance metric used in a clustering analysis is the Euclidean distance. Below is a function already written for you to use.\n",
34+
"2. Does the data exist on different scales? If so rescale the data to a uniform interval using the z-scaling function $f(x) = \\frac{x - \\bar{x}}{\\sigma}$ where $\\bar{x}$ is the sample mean and $\\sigma$ the sample standard deviation \n",
35+
"\n",
36+
"### Task 2: Conceptually understanding the *k*-means clustering algorithm\n",
37+
"\n",
38+
"1. It looks as though there might be 2 clusters in the data. Choose two random points in the x-y plane to be used as the initial centers. Plot them on the previous graph as an 'X'. *Note: Each center needs to be initialized at differnt point in space.*\n",
39+
"2. Define three new \"distance\" columns in the dataset to calculate the distances between each of the three centroids and each observation. The most common distance metric used in a clustering analysis is the Euclidean distance. Below is a function already written for you to use.\n",
3640
"```python\n",
3741
"def calculate_distance(initial, X, Y):\n",
38-
" distances = []\n",
3942
" c_x, c_y = initial\n",
40-
" for x, y in list(zip(X, Y)):\n",
41-
" root_diff_x = (x - c_x) ** 2\n",
42-
" root_diff_y = (y - c_y) ** 2\n",
43-
" distance = np.sqrt(root_diff_x + root_diff_y)\n",
44-
" distances.append(distance)\n",
45-
" return distances\n",
43+
" root_diff_x = (X - c_x) ** 2\n",
44+
" root_diff_y = (Y - c_y) ** 2\n",
45+
" distance = np.sqrt(root_diff_x + root_diff_y)\n",
46+
" return distance\n",
4647
"```\n",
4748
"4. For each observation, compare the three distances and chose the *smallest* one (use [`np.argmin`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.argmin.html).) Using the `map` function, label the centroids accordingly in a new column called \"Clusters\".\n",
4849
"5. Find the new centroid points by taking the means for both features of each of the three clusters. Make a new plot of the data, coloring the three clusters and labeling the new centroids as 'D'. What happened to the three centroids as a result of this algorithm?\n",
@@ -119,7 +120,7 @@
119120
"name": "python",
120121
"nbconvert_exporter": "python",
121122
"pygments_lexer": "ipython3",
122-
"version": "3.5.5"
123+
"version": "3.7.3"
123124
}
124125
},
125126
"nbformat": 4,

clustering/clustering_with_solutions.ipynb

Lines changed: 283 additions & 223 deletions
Large diffs are not rendered by default.

clustering/data/clustering_data.pkl

94.5 KB
Binary file not shown.

clustering/data/make_clustering_data.ipynb

Lines changed: 746 additions & 0 deletions
Large diffs are not rendered by default.

exploring-data/data/make_regression_data.ipynb

Lines changed: 960 additions & 0 deletions
Large diffs are not rendered by default.
220 KB
Binary file not shown.

exploring-data/src/.ipynb_checkpoints/data_exploration_tutorial-checkpoint.ipynb

Lines changed: 31 additions & 586 deletions
Large diffs are not rendered by default.

exploring-data/src/data_exploration_tutorial.ipynb

Lines changed: 60 additions & 13 deletions
Large diffs are not rendered by default.

exploring-data/src/data_exploration_tutorial_w_solutions.ipynb

Lines changed: 577 additions & 114 deletions
Large diffs are not rendered by default.
