docs/src/index.md
+36-30
@@ -5,26 +5,29 @@ Depth = 4
5
5
```
6
6
7
7
## Motivation
8
+
8
9
It's actually a funny story that led to the development of this package.
9
-
What started off as a personal toy project trying to re-construct the K-Means algorithm in native Julia blew up after a heated discussion on the Julia Discourse forum when I asked for Julia optimizaition tips. Long story short, Julia community is an amazing one! Andrey offered his help and together, we decided to push the speed limits of Julia with a parallel implementation of the most famous clustering algorithm. The initial results were mind blowing so we have decided to tidy up the implementation and share with the world as a maintained Julia pacakge.
10
+
What started off as a personal toy project trying to reconstruct the K-Means algorithm in native Julia blew up after a heated discussion on the Julia Discourse forum when I asked for Julia optimization tips. Long story short, the Julia community is an amazing one! Andrey offered his help and together we decided to push the speed limits of Julia with a parallel implementation of the most famous clustering algorithm. The initial results were mind-blowing, so we decided to tidy up the implementation and share it with the world as a maintained Julia package.
10
11
11
12
Say hello to `ParallelKMeans`!
12
13
13
14
This package aims to utilize the speed of Julia and parallelization (both CPU & GPU) to offer an extremely fast implementation of the K-Means clustering algorithm and its variations via a friendly interface for practitioners.
14
15
15
-
In short, we hope this package will eventually mature as the "one stop" shop for everything KMeans on both CPUs and GPUs.
16
+
In short, we hope this package will eventually mature as the "one stop" shop for everything K-Means on both CPUs and GPUs.
16
17
17
18
## K-Means Algorithm Implementation Notes
19
+
18
20
Since Julia is a column-major language, the package expects the input (design matrix) in the following format:
19
21
20
22
- Design matrix X of size n×m, where the i-th column of X (`X[:, i]`) is a single data point in n-dimensional space.
21
23
- Thus, the rows of the design matrix represent the features, and the columns represent the training examples in this feature space.
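The layout described above can be illustrated with plain Julia arrays. This is only a sketch: the sizes and variable names (`X`, `rows_as_points`, `X_from_rows`) are made up for the example, and nothing here calls the package itself.

```julia
# 3 features (rows) and 5 data points (columns): the expected layout.
X = rand(3, 5)
n_features, n_points = size(X)

# One data point is one column of X.
second_point = X[:, 2]

# Data stored one observation per row (common in tabular workflows)
# needs to be transposed before it matches this layout.
rows_as_points = rand(5, 3)                # 5 observations, 3 features
X_from_rows = permutedims(rows_as_points)  # now 3×5, one column per observation
```

Using `permutedims` gives a regular `Matrix` rather than a lazy transpose wrapper, which keeps each data point contiguous in memory.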
22
24
23
-
One of the pitfalls of K-Means algorithm is that it can fall into a local minima.
25
+
One of the pitfalls of the K-Means algorithm is that it can fall into a local minimum.
24
26
This implementation inherits this problem like every implementation does.
25
27
As a result, it is useful in practice to restart it several times and keep the best result.
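The restart advice can be sketched in plain Julia. This is a minimal illustration and not the package's implementation: `sqdist`, `lloyd` and `best_of_restarts` are hypothetical helpers written for this example, using textbook Lloyd iterations with random initial centers, and the toy data is generated on the spot.

```julia
using Random

# Squared Euclidean distance between column i of X and column j of C.
function sqdist(X, i, C, j)
    s = 0.0
    for d in axes(X, 1)
        s += (X[d, i] - C[d, j])^2
    end
    return s
end

# One run of plain Lloyd iterations from a random start (hypothetical helper).
# Returns (centers, assignments, totalcost).
function lloyd(X, k; maxiter = 100, rng = Random.default_rng())
    m = size(X, 2)
    centers = X[:, randperm(rng, m)[1:k]]  # k distinct points as initial centers
    assignments = zeros(Int, m)
    totalcost = Inf
    for _ in 1:maxiter
        # Assignment step: attach every point to its nearest center.
        newcost = 0.0
        for i in 1:m
            best, bestj = Inf, 0
            for j in 1:k
                d = sqdist(X, i, centers, j)
                if d < best
                    best, bestj = d, j
                end
            end
            assignments[i] = bestj
            newcost += best
        end
        converged = newcost ≈ totalcost
        totalcost = newcost
        converged && break
        # Update step: move each center to the mean of its members.
        for j in 1:k
            members = findall(==(j), assignments)
            isempty(members) && continue  # keep the old center for an empty cluster
            centers[:, j] = vec(sum(X[:, members], dims = 2)) ./ length(members)
        end
    end
    return centers, assignments, totalcost
end

# Run several random restarts and keep the run with the lowest total cost.
function best_of_restarts(X, k; restarts = 10, rng = Random.default_rng())
    best = lloyd(X, k; rng = rng)
    for _ in 2:restarts
        candidate = lloyd(X, k; rng = rng)
        if candidate[3] < best[3]
            best = candidate
        end
    end
    return best
end

# Toy data: two blobs of 50 points each, one point per column.
rng = MersenneTwister(42)
X = hcat(randn(rng, 2, 50), randn(rng, 2, 50) .+ 8.0)
centers, assignments, cost = best_of_restarts(X, 2; restarts = 10, rng = rng)
```

Each restart can converge to a different local minimum, so comparing the total cost across runs is what makes the extra runs worthwhile.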
26
28
27
29
## Installation
30
+
28
31
You can grab the latest stable version of this package from the Julia registries by simply running:
29
32
30
33
*NB:* Don't forget to enter Julia's package manager with `]`
@@ -40,32 +43,35 @@ dev git@github.com:PyDataBlog/ParallelKMeans.jl.git
40
43
```
41
44
42
45
Don't forget to check out the experimental branch and you are good to go with bleeding-edge features and breakages!
46
+
43
47
```bash
44
48
git checkout experimental
45
49
```
46
50
47
51
## Features
52
+
48
53
- Lightning-fast implementation of the K-Means clustering algorithm, even on a single thread, in native Julia.
49
-
- Support for multi-theading implementation of Kmeans clustering algorithm.
54
+
- Support for a multi-threaded implementation of the K-Means clustering algorithm.
50
55
- 'Kmeans++' initialization for faster and better convergence.
51
-
- Modified version of Elkan's Triangle inequality to speed up K-Means algorithm.
52
-
56
+
- Implementation of available classic and contemporary variants of the K-Means algorithm.
53
57
54
58
## Pending Features
55
-
-[X] Implementation of [Hamerly implementation](https://www.researchgate.net/publication/220906984_Making_k-means_Even_Faster).
59
+
60
+
-[X] Implementation of [Hamerly implementation](https://www.researchgate.net/publication/220906984_Making_k-means_Even_Faster).
61
+
-[X] Interface for inclusion in Alan Turing Institute's [MLJModels](https://github.com/alan-turing-institute/MLJModels.jl#who-is-this-repo-for).
56
62
-[ ] Full implementation of the triangle inequality based on [Elkan (2003), "Using the Triangle Inequality to Accelerate K-Means"](https://www.aaai.org/Papers/ICML/2003/ICML03-022.pdf).
57
63
-[ ] Implementation of [Geometric methods to accelerate k-means algorithm](http://cs.baylor.edu/~hamerly/papers/sdm2016_rysavy_hamerly.pdf).
58
-
-[ ]Support for DataFrame inputs.
64
+
-[ ] Native support for tabular data inputs outside of MLJModels' interface.
59
65
-[ ] Refactoring and finalization of API design.
60
66
-[ ] GPU support.
61
-
-[ ] Even faster Kmeans implementation based on current literature.
67
+
-[ ] Even faster K-Means implementation based on recent literature.
62
68
-[ ] Optimization of code base.
63
69
-[ ] Improved Documentation
64
70
-[ ] More benchmark tests
65
71
66
-
67
72
## How To Use
68
-
Taking advantage of Julia's brilliant multiple dispatch system, the package exposes users to a very easy to use API.
73
+
74
+
Taking advantage of Julia's brilliant multiple dispatch system, the package offers users a very easy-to-use API.
69
75
70
76
```julia
71
77
using ParallelKMeans
@@ -83,7 +89,7 @@ The main design goal is to offer all available variations of the KMeans algorith
Currently, this package is benchmarked against similar implementation in both Python and Julia. All reproducible benchmarks can be found in [ParallelKMeans/extras](https://github.com/PyDataBlog/ParallelKMeans.jl/tree/master/extras) directory. More tests in various languages are planned beyond the initial release version (`0.1.0`).
146
-
147
-
*Note*: All benchmark tests are made on the same computer to help eliminate any bias.
148
152
153
+
Currently, this package is benchmarked against similar implementations in both Python and Julia. All reproducible benchmarks can be found in the [ParallelKMeans/extras](https://github.com/PyDataBlog/ParallelKMeans.jl/tree/master/extras) directory. More tests in various languages are planned beyond the initial release version (`0.1.0`).
149
154
150
-
Currently, the benchmark speed tests are based on the search for optimal number of clusters using the [Elbow Method](https://en.wikipedia.org/wiki/Elbow_method_(clustering)) since this is a practical use case for most practioners employing the K-Means algorithm.
155
+
*Note*: All benchmark tests are made on the same computer to help eliminate any bias.
151
156
157
+
Currently, the benchmark speed tests are based on the search for the optimal number of clusters using the [Elbow Method](https://en.wikipedia.org/wiki/Elbow_method_(clustering)) since this is a practical use case for most practitioners employing the K-Means algorithm.
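For readers unfamiliar with the Elbow Method, here is a rough sketch of one common way to pick the elbow automatically: take the value of k where the cost curve bends most sharply, measured by the largest second difference. The `elbow` function is a hypothetical helper, it assumes evenly spaced values of k, and the costs below are made-up numbers for illustration only, not benchmark results from this package.

```julia
# Pick the "elbow" as the k with the largest second difference of the costs.
# Hypothetical helper; assumes `ks` is evenly spaced and sorted.
function elbow(ks, costs)
    length(ks) == length(costs) || throw(ArgumentError("ks and costs must have equal length"))
    length(ks) >= 3 || throw(ArgumentError("need at least three values of k"))
    # Second difference at each interior point: c[i-1] - 2c[i] + c[i+1].
    bends = [costs[i - 1] - 2 * costs[i] + costs[i + 1] for i in 2:(length(costs) - 1)]
    return ks[argmax(bends) + 1]  # +1 because bends starts at the second k
end

ks = 2:7
# Made-up total costs for k = 2, ..., 7 (illustration only).
toy_costs = [1000.0, 420.0, 380.0, 350.0, 330.0, 315.0]
best_k = elbow(ks, toy_costs)
```

In practice the costs would come from running K-Means once per candidate k, which is exactly the repeated workload the benchmarks time.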
Ultimately, we see this package as potentially the one-stop shop for everything related to the K-Means algorithm and its sped-up variants. We are open to new implementations and ideas from anyone interested in this project.
178
184
179
185
Detailed contribution guidelines will be added in upcoming releases.