
Commit 6268743 (2 parents: 9c61a5a + d62e628)

Merge pull request #45 from PyDataBlog/experimental

Patch release `0.1.1` offering interface support for MLJModels.

11 files changed: +375 / -46 lines

Project.toml (+7 -4)

```diff
@@ -1,19 +1,22 @@
 name = "ParallelKMeans"
 uuid = "42b8e9d4-006b-409a-8472-7f34b3fb58af"
 authors = ["Bernard Brenyah", "Andrey Oskin"]
-version = "0.1.0"
+version = "0.1.1"
 
 [deps]
+Distances = "b4f34e82-e78d-54a5-968a-f98e89d6e8f7"
+MLJModelInterface = "e80e1ace-859a-464e-9ed9-23947d8ae3ea"
 StatsBase = "2913bbd2-ae8a-5f71-8c99-4fb6c76f3a91"
 
 [compat]
 StatsBase = "0.32, 0.33"
-julia = "1.3"
+julia = "1.3, 1.4"
 
 [extras]
+MLJBase = "a7f614a8-145f-11e9-1d2a-a57a1082229d"
 Random = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"
-Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"
 Suppressor = "fd094767-a336-5f1f-9728-57cf17d0bbfb"
+Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"
 
 [targets]
-test = ["Test", "Random", "Suppressor"]
+test = ["Test", "Random", "Suppressor", "MLJBase"]
```

README.md (+5)

````diff
@@ -10,6 +10,7 @@
 _________________________________________________________________________________________________________
 
 ## Table Of Content
+
 1. [Documentation](#Documentation)
 2. [Installation](#Installation)
 3. [Features](#Features)
@@ -18,13 +19,15 @@
 _________________________________________________________________________________________________________
 
 ### Documentation
+
 - Stable Documentation: [![Stable](https://img.shields.io/badge/docs-stable-blue.svg)](https://PyDataBlog.github.io/ParallelKMeans.jl/stable)
 
 - Experimental Documentation: [![Dev](https://img.shields.io/badge/docs-dev-blue.svg)](https://PyDataBlog.github.io/ParallelKMeans.jl/dev)
 
 _________________________________________________________________________________________________________
 
 ### Installation
+
 You can grab the latest stable version of this package by simply running in Julia.
 Don't forget to Julia's package manager with `]`
 
@@ -39,9 +42,11 @@ pkg> dev git@github.com:PyDataBlog/ParallelKMeans.jl.git
 ```
 
 Don't forget to checkout the experimental branch and you are good to go with bleeding edge features and breaks!
+
 ```bash
 git checkout experimental
 ```
+
 _________________________________________________________________________________________________________
 
 ### Features
````

docs/src/index.md (+36 -30)

````diff
@@ -5,26 +5,29 @@ Depth = 4
 ```
 
 ## Motivation
+
 It's actually a funny story led to the development of this package.
-What started off as a personal toy project trying to re-construct the K-Means algorithm in native Julia blew up after a heated discussion on the Julia Discourse forum when I asked for Julia optimizaition tips. Long story short, Julia community is an amazing one! Andrey offered his help and together, we decided to push the speed limits of Julia with a parallel implementation of the most famous clustering algorithm. The initial results were mind blowing so we have decided to tidy up the implementation and share with the world as a maintained Julia pacakge.
+What started off as a personal toy project trying to re-construct the K-Means algorithm in native Julia blew up after a heated discussion on the Julia Discourse forum when I asked for Julia optimizaition tips. Long story short, Julia community is an amazing one! Andrey offered his help and together, we decided to push the speed limits of Julia with a parallel implementation of the most famous clustering algorithm. The initial results were mind blowing so we have decided to tidy up the implementation and share with the world as a maintained Julia pacakge.
 
 Say hello to `ParallelKMeans`!
 
 This package aims to utilize the speed of Julia and parallelization (both CPU & GPU) to offer an extremely fast implementation of the K-Means clustering algorithm and its variations via a friendly interface for practioners.
 
-In short, we hope this package will eventually mature as the "one stop" shop for everything KMeans on both CPUs and GPUs.
+In short, we hope this package will eventually mature as the "one stop" shop for everything K-Means on both CPUs and GPUs.
 
 ## K-Means Algorithm Implementation Notes
+
 Since Julia is a column major language, the input (design matrix) expected by the package in the following format;
 
 - Design matrix X of size n×m, the i-th column of X `(X[:, i])` is a single data point in n-dimensional space.
 - Thus, the rows of the design design matrix represents the feature space with the columns representing all the training examples in this feature space.
 
-One of the pitfalls of K-Means algorithm is that it can fall into a local minima.
+One of the pitfalls of K-Means algorithm is that it can fall into a local minima.
 This implementation inherits this problem like every implementation does.
 As a result, it is useful in practice to restart it several times to get the correct results.
 
 ## Installation
+
 You can grab the latest stable version of this package from Julia registries by simply running;
 
 *NB:* Don't forget to Julia's package manager with `]`
@@ -40,32 +43,35 @@ dev git@github.com:PyDataBlog/ParallelKMeans.jl.git
 ```
 
 Don't forget to checkout the experimental branch and you are good to go with bleeding edge features and breaks!
+
 ```bash
 git checkout experimental
 ```
 
 ## Features
+
 - Lightening fast implementation of Kmeans clustering algorithm even on a single thread in native Julia.
-- Support for multi-theading implementation of Kmeans clustering algorithm.
+- Support for multi-theading implementation of K-Means clustering algorithm.
 - 'Kmeans++' initialization for faster and better convergence.
-- Modified version of Elkan's Triangle inequality to speed up K-Means algorithm.
-
+- Implementation of available classic and contemporary variants of the K-Means algorithm.
 
 ## Pending Features
-- [X] Implementation of [Hamerly implementation](https://www.researchgate.net/publication/220906984_Making_k-means_Even_Faster).
+
+- [X] Implementation of [Hamerly implementation](https://www.researchgate.net/publication/220906984_Making_k-means_Even_Faster).
+- [X] Interface for inclusion in Alan Turing Institute's [MLJModels](https://github.com/alan-turing-institute/MLJModels.jl#who-is-this-repo-for).
 - [ ] Full Implementation of Triangle inequality based on [Elkan - 2003 Using the Triangle Inequality to Accelerate K-Means"](https://www.aaai.org/Papers/ICML/2003/ICML03-022.pdf).
 - [ ] Implementation of [Geometric methods to accelerate k-means algorithm](http://cs.baylor.edu/~hamerly/papers/sdm2016_rysavy_hamerly.pdf).
-- [ ] Support for DataFrame inputs.
+- [ ] Native support for tabular data inputs outside of MLJModels' interface.
 - [ ] Refactoring and finalizaiton of API desgin.
 - [ ] GPU support.
-- [ ] Even faster Kmeans implementation based on current literature.
+- [ ] Even faster Kmeans implementation based on recent literature.
 - [ ] Optimization of code base.
 - [ ] Improved Documentation
 - [ ] More benchmark tests
 
-
 ## How To Use
-Taking advantage of Julia's brilliant multiple dispatch system, the package exposes users to a very easy to use API.
+
+Taking advantage of Julia's brilliant multiple dispatch system, the package exposes users to a very easy to use API.
 
 ```julia
 using ParallelKMeans
@@ -83,7 +89,7 @@ The main design goal is to offer all available variations of the KMeans algorith
 some_results = kmeans([algo], input_matrix, k; kwargs)
 
 # example
-r = kmeans(Lloyd(), X, 3) # same result as the default
+r = kmeans(Lloyd(), X, 3) # same result as the default
 ```
 
 ```julia
@@ -95,30 +101,31 @@ r.iterations # number of elapsed iterations
 r.converged # whether the procedure converged
 ```
 
-### Supported KMeans algorithm variations.
-- [Lloyd()](https://cs.nyu.edu/~roweis/csc2515-2006/readings/lloyd57.pdf)
-- [Hamerly()](https://www.researchgate.net/publication/220906984_Making_k-means_Even_Faster)
+### Supported KMeans algorithm variations
+
+- [Lloyd()](https://cs.nyu.edu/~roweis/csc2515-2006/readings/lloyd57.pdf)
+- [Hamerly()](https://www.researchgate.net/publication/220906984_Making_k-means_Even_Faster)
 - [Geometric()](http://cs.baylor.edu/~hamerly/papers/sdm2016_rysavy_hamerly.pdf) - (Coming soon)
-- [Elkan()](https://www.aaai.org/Papers/ICML/2003/ICML03-022.pdf) - (Coming soon)
+- [Elkan()](https://www.aaai.org/Papers/ICML/2003/ICML03-022.pdf) - (Coming soon)
 - [MiniBatch()](https://www.eecs.tufts.edu/~dsculley/papers/fastkmeans.pdf) - (Coming soon)
 
-
 ### Practical Usage Examples
+
 Some of the common usage examples of this package are as follows:
 
 #### Clustering With A Desired Number Of Groups
 
-```julia
+```julia
 using ParallelKMeans, RDatasets, Plots
 
 # load the data
-iris = dataset("datasets", "iris");
+iris = dataset("datasets", "iris");
 
 # features to use for clustering
-features = collect(Matrix(iris[:, 1:4])');
+features = collect(Matrix(iris[:, 1:4])');
 
 # various artificats can be accessed from the result ie assigned labels, cost value etc
-result = kmeans(features, 3);
+result = kmeans(features, 3);
 
 # plot with the point color mapped to the assigned cluster index
 scatter(iris.PetalLength, iris.PetalWidth, marker_z=result.assignments,
@@ -129,6 +136,7 @@ scatter(iris.PetalLength, iris.PetalWidth, marker_z=result.assignments,
 ![Image description](iris_example.jpg)
 
 #### Elbow Method For The Selection Of optimal number of clusters
+
 ```julia
 using ParallelKMeans
 
@@ -140,21 +148,18 @@ c = [ParallelKMeans.kmeans(X, i; tol=1e-6, max_iters=300, verbose=false).totalco
 
 ```
 
-
 ## Benchmarks
-Currently, this package is benchmarked against similar implementation in both Python and Julia. All reproducible benchmarks can be found in [ParallelKMeans/extras](https://github.com/PyDataBlog/ParallelKMeans.jl/tree/master/extras) directory. More tests in various languages are planned beyond the initial release version (`0.1.0`).
-
-*Note*: All benchmark tests are made on the same computer to help eliminate any bias.
 
+Currently, this package is benchmarked against similar implementation in both Python and Julia. All reproducible benchmarks can be found in [ParallelKMeans/extras](https://github.com/PyDataBlog/ParallelKMeans.jl/tree/master/extras) directory. More tests in various languages are planned beyond the initial release version (`0.1.0`).
 
-Currently, the benchmark speed tests are based on the search for optimal number of clusters using the [Elbow Method](https://en.wikipedia.org/wiki/Elbow_method_(clustering)) since this is a practical use case for most practioners employing the K-Means algorithm.
+*Note*: All benchmark tests are made on the same computer to help eliminate any bias.
 
+Currently, the benchmark speed tests are based on the search for optimal number of clusters using the [Elbow Method](https://en.wikipedia.org/wiki/Elbow_method_(clustering)) since this is a practical use case for most practioners employing the K-Means algorithm.
 
 ### Benchmark Results
 
 ![benchmark_image.png](benchmark_image.png)
 
-
 _________________________________________________________________________________________________________
 
 | 1 million (ms) | 100k (ms) | 10k (ms) | 1k (ms) | package | language |
@@ -168,17 +173,18 @@
 
 _________________________________________________________________________________________________________
 
+## Release History
 
-## Release History
 - 0.1.0 Initial release
-
+- 0.1.1 Added interface for MLJ
 
 ## Contributing
+
 Ultimately, we see this package as potentially the one stop shop for everything related to KMeans algorithm and its speed up variants. We are open to new implementations and ideas from anyone interested in this project.
 
 Detailed contribution guidelines will be added in upcoming releases.
 
-<!--- Insert Contribution Guidelines Below --->
+<!--- TODO: Contribution Guidelines --->
 
 ```@index
 ```
````
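The Elbow Method hunk above is cut off mid-line by the diff view (the comprehension ends at `.totalco`). A sketch of the typical complete usage, assuming a random 2×100 design matrix as hypothetical stand-in data:

```julia
using ParallelKMeans

# Hypothetical stand-in data: 100 points in 2-D, column-major as the package expects.
X = rand(2, 100)

# Total clustering cost for each candidate k; the "elbow" is where the
# decrease in cost flattens out.
costs = [ParallelKMeans.kmeans(X, k; tol=1e-6, max_iters=300, verbose=false).totalcost for k in 2:10]
```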

src/ParallelKMeans.jl (+3)

```diff
@@ -1,13 +1,16 @@
 module ParallelKMeans
 
 using StatsBase
+using MLJModelInterface
 import Base.Threads: @spawn
+import Distances
 
 include("seeding.jl")
 include("kmeans.jl")
 include("lloyd.jl")
 include("light_elkan.jl")
 include("hamerly.jl")
+include("mlj_interface.jl")
 
 export kmeans
 export Lloyd, LightElkan, Hamerly
```
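The new `src/mlj_interface.jl` is one of the files this commit adds, but it is not shown in this truncated view. As a rough sketch of what wiring an unsupervised model into MLJModelInterface generally looks like (all names and fields below are illustrative assumptions, not necessarily the package's actual interface):

```julia
import MLJModelInterface
const MMI = MLJModelInterface

# Hypothetical model struct; the real interface may expose different fields.
mutable struct ParallelKMeansModel <: MMI.Unsupervised
    k::Int
    max_iters::Int
end

function MMI.fit(model::ParallelKMeansModel, verbosity::Int, X)
    # MLJ hands over tabular data; the core algorithm wants a column-major
    # matrix, hence the transpose.
    Xmat = MMI.matrix(X, transpose = true)
    result = kmeans(Xmat, model.k; max_iters = model.max_iters, verbose = verbosity > 0)
    fitresult = result.centers   # centroids, reused later by transform/predict
    report = (totalcost = result.totalcost, iterations = result.iterations)
    return fitresult, nothing, report
end
```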

src/hamerly.jl (+5 -4)

```diff
@@ -41,12 +41,13 @@ function kmeans!(alg::Hamerly, containers, design_matrix, k;
     @parallelize n_threads ncol chunk_initialize!(alg, containers, centroids, design_matrix)
 
     converged = false
-    niters = 1
+    niters = 0
     J_previous = 0.0
     p = containers.p
 
     # Update centroids & labels with closest members until convergence
-    while niters <= max_iters
+    while niters < max_iters
+        niters += 1
         update_containers!(containers, alg, centroids, n_threads)
         @parallelize n_threads ncol chunk_update_centroids!(centroids, containers, alg, design_matrix)
         collect_containers(alg, containers, n_threads)
@@ -58,7 +59,7 @@ function kmeans!(alg::Hamerly, containers, design_matrix, k;
         @parallelize n_threads ncol chunk_update_bounds!(containers, r1, r2, pr1, pr2)
 
         if verbose
-            # Show progress and terminate if J stopped decreasing.
+            # Show progress and terminate if J stops decreasing as specified by the tolerance level.
             println("Iteration $niters: Jclust = $J")
         end
 
@@ -69,7 +70,7 @@ function kmeans!(alg::Hamerly, containers, design_matrix, k;
         end
 
         J_previous = J
-        niters += 1
+
     end
 
     @parallelize n_threads ncol sum_of_squares(containers, design_matrix, containers.labels, centroids)
```

src/kmeans.jl (+6 -4)

```diff
@@ -173,7 +173,7 @@ end
 
 
 """
-    Kmeans!(alg::AbstractKMeansAlg, containers, design_matrix, k; n_threads = nthreads(), k_init="k-means++", max_iters=300, tol=1e-6, verbose=true)
+    Kmeans!(alg::AbstractKMeansAlg, containers, design_matrix, k; n_threads = nthreads(), k_init="k-means++", max_iters=300, tol=1e-6, verbose=false)
 
 Mutable version of `kmeans` function. Definition of arguments and results can be
 found in `kmeans`.
@@ -189,12 +189,14 @@ function kmeans!(alg, containers, design_matrix, k;
     centroids = init == nothing ? smart_init(design_matrix, k, n_threads, init=k_init).centroids : deepcopy(init)
 
     converged = false
-    niters = 1
+    niters = 0
     J_previous = 0.0
 
     # Update centroids & labels with closest members until convergence
 
-    while niters <= max_iters
+    while niters < max_iters
+        niters += 1
+
         update_containers!(containers, alg, centroids, n_threads)
         J = update_centroids!(centroids, containers, alg, design_matrix, n_threads)
 
@@ -210,7 +212,7 @@ function kmeans!(alg, containers, design_matrix, k;
         end
 
         J_previous = J
-        niters += 1
+
     end
 
     totalcost = sum_of_squares(design_matrix, containers.labels, centroids)
```
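The `niters` hunks here and in `src/hamerly.jl` do more than move a line. With the old bottom-of-loop increment, a run that exhausts `max_iters` without converging exits with `niters == max_iters + 1`, one more than the iterations actually performed; moving the increment to the top of the loop leaves `niters` equal to the executed iteration count in both the converged and non-converged cases. A minimal standalone sketch of the new pattern (`count_iters` and `converge_at` are hypothetical names, not part of the package):

```julia
# New pattern: increment at the top of the loop, so `niters` always tracks
# the iteration currently being executed.
function count_iters(max_iters; converge_at = nothing)
    niters = 0
    while niters < max_iters
        niters += 1
        # pretend convergence is detected on iteration `converge_at`
        converge_at !== nothing && niters == converge_at && break
    end
    return niters
end

count_iters(300)                   # 300; the old bottom-of-loop bump would leave 301
count_iters(300; converge_at = 3)  # 3; early break behaves the same in both patterns
```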
