
Commit 2a564dc

Merge pull request #42 from PyDataBlog/mlj-port

Mlj port beta

2 parents 4866573 + 714322c

10 files changed, +287 −35 lines changed

Project.toml

Lines changed: 7 additions & 4 deletions
@@ -4,16 +4,19 @@ authors = ["Bernard Brenyah", "Andrey Oskin"]
 version = "0.1.0"
 
 [deps]
+Distances = "b4f34e82-e78d-54a5-968a-f98e89d6e8f7"
+MLJModelInterface = "e80e1ace-859a-464e-9ed9-23947d8ae3ea"
 StatsBase = "2913bbd2-ae8a-5f71-8c99-4fb6c76f3a91"
 
 [compat]
-StatsBase = "0.32"
-julia = "1.3"
+StatsBase = "0.32, 0.33"
+julia = "1.3, 1.4"
 
 [extras]
+MLJBase = "a7f614a8-145f-11e9-1d2a-a57a1082229d"
 Random = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"
-Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"
 Suppressor = "fd094767-a336-5f1f-9728-57cf17d0bbfb"
+Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"
 
 [targets]
-test = ["Test", "Random", "Suppressor"]
+test = ["Test", "Random", "Suppressor", "MLJBase"]
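
With MLJBase added only to `[extras]` and the `test` target, it is installed for the test suite but not for regular users of the package. A minimal sketch of how that plays out in practice, assuming a local clone of the repository (nothing below is part of the commit itself):

```julia
# Sketch only: assumes you are in the root directory of a local ParallelKMeans.jl checkout.
using Pkg

Pkg.activate(".")   # activate the environment described by this Project.toml
Pkg.instantiate()   # installs [deps] only; [extras] such as MLJBase are not installed here
Pkg.test()          # the test target pulls in Test, Random, Suppressor, and now MLJBase
```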

README.md

Lines changed: 5 additions & 0 deletions
@@ -10,6 +10,7 @@ ________________________________________________________________________________
 _________________________________________________________________________________________________________
 
 ## Table Of Content
+
 1. [Documentation](#Documentation)
 2. [Installation](#Installation)
 3. [Features](#Features)
@@ -18,13 +19,15 @@ ________________________________________________________________________________
 _________________________________________________________________________________________________________
 
 ### Documentation
+
 - Stable Documentation: [![Stable](https://img.shields.io/badge/docs-stable-blue.svg)](https://PyDataBlog.github.io/ParallelKMeans.jl/stable)
 
 - Experimental Documentation: [![Dev](https://img.shields.io/badge/docs-dev-blue.svg)](https://PyDataBlog.github.io/ParallelKMeans.jl/dev)
 
 _________________________________________________________________________________________________________
 
 ### Installation
+
 You can grab the latest stable version of this package by simply running in Julia.
 Don't forget to Julia's package manager with `]`
 
@@ -39,9 +42,11 @@ pkg> dev git@github.com:PyDataBlog/ParallelKMeans.jl.git
 ```
 
 Don't forget to checkout the experimental branch and you are good to go with bleeding edge features and breaks!
+
 ```bash
 git checkout experimental
 ```
+
 _________________________________________________________________________________________________________
 
 ### Features

docs/Manifest.toml

Lines changed: 4 additions & 4 deletions
@@ -19,9 +19,9 @@ version = "0.8.1"
 
 [[Documenter]]
 deps = ["Base64", "Dates", "DocStringExtensions", "InteractiveUtils", "JSON", "LibGit2", "Logging", "Markdown", "REPL", "Test", "Unicode"]
-git-tree-sha1 = "d497bcc45bb98a1fbe19445a774cfafeabc6c6df"
+git-tree-sha1 = "646ebc3db49889ffeb4c36f89e5d82c6a26295ff"
 uuid = "e30172f5-a6a5-5a46-863b-614d45cd2de4"
-version = "0.24.5"
+version = "0.24.7"
 
 [[InteractiveUtils]]
 deps = ["Markdown"]
@@ -51,9 +51,9 @@ uuid = "a63ad114-7e13-5084-954f-fe012c677804"
 
 [[Parsers]]
 deps = ["Dates", "Test"]
-git-tree-sha1 = "d112c19ccca00924d5d3a38b11ae2b4b268dda39"
+git-tree-sha1 = "0c16b3179190d3046c073440d94172cfc3bb0553"
 uuid = "69de0a69-1ddd-5017-9359-2bf0b02dc9f0"
-version = "0.3.11"
+version = "0.3.12"
 
 [[Pkg]]
 deps = ["Dates", "LibGit2", "Libdl", "Logging", "Markdown", "Printf", "REPL", "Random", "SHA", "Test", "UUIDs"]

docs/src/index.md

Lines changed: 28 additions & 24 deletions
@@ -5,8 +5,9 @@ Depth = 4
 ```
 
 ## Motivation
+
 It's actually a funny story led to the development of this package.
-What started off as a personal toy project trying to re-construct the K-Means algorithm in native Julia blew up after a heated discussion on the Julia Discourse forum when I asked for Julia optimizaition tips. Long story short, Julia community is an amazing one! Andrey offered his help and together, we decided to push the speed limits of Julia with a parallel implementation of the most famous clustering algorithm. The initial results were mind blowing so we have decided to tidy up the implementation and share with the world as a maintained Julia pacakge.
+What started off as a personal toy project trying to re-construct the K-Means algorithm in native Julia blew up after a heated discussion on the Julia Discourse forum when I asked for Julia optimizaition tips. Long story short, Julia community is an amazing one! Andrey offered his help and together, we decided to push the speed limits of Julia with a parallel implementation of the most famous clustering algorithm. The initial results were mind blowing so we have decided to tidy up the implementation and share with the world as a maintained Julia pacakge.
 
 Say hello to `ParallelKMeans`!
 
@@ -15,16 +16,18 @@ This package aims to utilize the speed of Julia and parallelization (both CPU &
 In short, we hope this package will eventually mature as the "one stop" shop for everything KMeans on both CPUs and GPUs.
 
 ## K-Means Algorithm Implementation Notes
+
 Since Julia is a column major language, the input (design matrix) expected by the package in the following format;
 
 - Design matrix X of size n×m, the i-th column of X `(X[:, i])` is a single data point in n-dimensional space.
 - Thus, the rows of the design design matrix represents the feature space with the columns representing all the training examples in this feature space.
 
-One of the pitfalls of K-Means algorithm is that it can fall into a local minima.
+One of the pitfalls of K-Means algorithm is that it can fall into a local minima.
 This implementation inherits this problem like every implementation does.
 As a result, it is useful in practice to restart it several times to get the correct results.
 
 ## Installation
+
 You can grab the latest stable version of this package from Julia registries by simply running;
 
 *NB:* Don't forget to Julia's package manager with `]`
@@ -40,19 +43,21 @@ dev git@github.com:PyDataBlog/ParallelKMeans.jl.git
 ```
 
 Don't forget to checkout the experimental branch and you are good to go with bleeding edge features and breaks!
+
 ```bash
 git checkout experimental
 ```
 
 ## Features
+
 - Lightening fast implementation of Kmeans clustering algorithm even on a single thread in native Julia.
 - Support for multi-theading implementation of Kmeans clustering algorithm.
 - 'Kmeans++' initialization for faster and better convergence.
 - Modified version of Elkan's Triangle inequality to speed up K-Means algorithm.
 
-
 ## Pending Features
-- [X] Implementation of [Hamerly implementation](https://www.researchgate.net/publication/220906984_Making_k-means_Even_Faster).
+
+- [X] Implementation of [Hamerly implementation](https://www.researchgate.net/publication/220906984_Making_k-means_Even_Faster).
 - [ ] Full Implementation of Triangle inequality based on [Elkan - 2003 Using the Triangle Inequality to Accelerate K-Means"](https://www.aaai.org/Papers/ICML/2003/ICML03-022.pdf).
 - [ ] Implementation of [Geometric methods to accelerate k-means algorithm](http://cs.baylor.edu/~hamerly/papers/sdm2016_rysavy_hamerly.pdf).
 - [ ] Support for DataFrame inputs.
@@ -63,9 +68,9 @@ git checkout experimental
 - [ ] Improved Documentation
 - [ ] More benchmark tests
 
-
 ## How To Use
-Taking advantage of Julia's brilliant multiple dispatch system, the package exposes users to a very easy to use API.
+
+Taking advantage of Julia's brilliant multiple dispatch system, the package exposes users to a very easy to use API.
 
 ```julia
 using ParallelKMeans
@@ -83,7 +88,7 @@ The main design goal is to offer all available variations of the KMeans algorith
 some_results = kmeans([algo], input_matrix, k; kwargs)
 
 # example
-r = kmeans(Lloyd(), X, 3) # same result as the default
+r = kmeans(Lloyd(), X, 3) # same result as the default
 ```
 
 ```julia
@@ -95,30 +100,31 @@ r.iterations # number of elapsed iterations
 r.converged # whether the procedure converged
 ```
 
-### Supported KMeans algorithm variations.
-- [Lloyd()](https://cs.nyu.edu/~roweis/csc2515-2006/readings/lloyd57.pdf)
-- [Hamerly()](https://www.researchgate.net/publication/220906984_Making_k-means_Even_Faster)
+### Supported KMeans algorithm variations
+
+- [Lloyd()](https://cs.nyu.edu/~roweis/csc2515-2006/readings/lloyd57.pdf)
+- [Hamerly()](https://www.researchgate.net/publication/220906984_Making_k-means_Even_Faster)
 - [Geometric()](http://cs.baylor.edu/~hamerly/papers/sdm2016_rysavy_hamerly.pdf) - (Coming soon)
-- [Elkan()](https://www.aaai.org/Papers/ICML/2003/ICML03-022.pdf) - (Coming soon)
+- [Elkan()](https://www.aaai.org/Papers/ICML/2003/ICML03-022.pdf) - (Coming soon)
 - [MiniBatch()](https://www.eecs.tufts.edu/~dsculley/papers/fastkmeans.pdf) - (Coming soon)
 
-
 ### Practical Usage Examples
+
 Some of the common usage examples of this package are as follows:
 
 #### Clustering With A Desired Number Of Groups
 
-```julia
+```julia
 using ParallelKMeans, RDatasets, Plots
 
 # load the data
-iris = dataset("datasets", "iris");
+iris = dataset("datasets", "iris");
 
 # features to use for clustering
-features = collect(Matrix(iris[:, 1:4])');
+features = collect(Matrix(iris[:, 1:4])');
 
 # various artificats can be accessed from the result ie assigned labels, cost value etc
-result = kmeans(features, 3);
+result = kmeans(features, 3);
 
 # plot with the point color mapped to the assigned cluster index
 scatter(iris.PetalLength, iris.PetalWidth, marker_z=result.assignments,
@@ -129,6 +135,7 @@ scatter(iris.PetalLength, iris.PetalWidth, marker_z=result.assignments,
 ![Image description](iris_example.jpg)
 
 #### Elbow Method For The Selection Of optimal number of clusters
+
 ```julia
 using ParallelKMeans
 
@@ -140,21 +147,18 @@ c = [ParallelKMeans.kmeans(X, i; tol=1e-6, max_iters=300, verbose=false).totalco
 
 ```
 
-
 ## Benchmarks
-Currently, this package is benchmarked against similar implementation in both Python and Julia. All reproducible benchmarks can be found in [ParallelKMeans/extras](https://github.com/PyDataBlog/ParallelKMeans.jl/tree/master/extras) directory. More tests in various languages are planned beyond the initial release version (`0.1.0`).
-
-*Note*: All benchmark tests are made on the same computer to help eliminate any bias.
 
+Currently, this package is benchmarked against similar implementation in both Python and Julia. All reproducible benchmarks can be found in [ParallelKMeans/extras](https://github.com/PyDataBlog/ParallelKMeans.jl/tree/master/extras) directory. More tests in various languages are planned beyond the initial release version (`0.1.0`).
 
-Currently, the benchmark speed tests are based on the search for optimal number of clusters using the [Elbow Method](https://en.wikipedia.org/wiki/Elbow_method_(clustering)) since this is a practical use case for most practioners employing the K-Means algorithm.
+*Note*: All benchmark tests are made on the same computer to help eliminate any bias.
 
+Currently, the benchmark speed tests are based on the search for optimal number of clusters using the [Elbow Method](https://en.wikipedia.org/wiki/Elbow_method_(clustering)) since this is a practical use case for most practioners employing the K-Means algorithm.
 
 ### Benchmark Results
 
 ![benchmark_image.png](benchmark_image.png)
 
-
 _________________________________________________________________________________________________________
 
 | 1 million (ms) | 100k (ms) | 10k (ms) | 1k (ms) | package | language |
@@ -168,12 +172,12 @@ ________________________________________________________________________________
 
 _________________________________________________________________________________________________________
 
+## Release History
 
-## Release History
 - 0.1.0 Initial release
 
-
 ## Contributing
+
 Ultimately, we see this package as potentially the one stop shop for everything related to KMeans algorithm and its speed up variants. We are open to new implementations and ideas from anyone interested in this project.
 
 Detailed contribution guidelines will be added in upcoming releases.
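
The docs hunks above stress that the design matrix is column-major (each column is one observation) and show which result fields the examples rely on. A minimal standalone sketch of that convention, using hypothetical random data rather than anything from the commit:

```julia
using ParallelKMeans

# Each *column* of the design matrix is one observation: 3 features, 100 points.
X = rand(3, 100)

# Row-major data (observations in rows) must be transposed and materialized first.
X_rows = rand(100, 3)
X_cols = collect(X_rows')

r = kmeans(X, 3; max_iters = 300, tol = 1e-6, verbose = false)
r.assignments   # cluster index assigned to each of the 100 points
r.totalcost     # final value of the objective being minimised
r.converged     # whether the tolerance was reached within max_iters
```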

src/ParallelKMeans.jl

Lines changed: 3 additions & 0 deletions
@@ -1,13 +1,16 @@
 module ParallelKMeans
 
 using StatsBase
+using MLJModelInterface
 import Base.Threads: @spawn
+import Distances
 
 include("seeding.jl")
 include("kmeans.jl")
 include("lloyd.jl")
 include("light_elkan.jl")
 include("hamerly.jl")
+include("mlj_interface.jl")
 
 export kmeans
 export Lloyd, LightElkan, Hamerly
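
The new `include("mlj_interface.jl")` is the heart of this PR, but that file's contents are not part of the hunks shown here. Below is only a rough sketch of how an MLJModelInterface wrapper around `kmeans` is typically structured; the struct name, its fields, and the `centers` field on the fit result are all assumptions, not the actual code added by the commit.

```julia
# Illustrative only -- not the real src/mlj_interface.jl from this commit.
using MLJModelInterface
const MMI = MLJModelInterface
using ParallelKMeans

# Hypothetical hyperparameter struct, mirroring the keyword arguments of `kmeans`.
mutable struct ParallelKMeansModel <: MMI.Unsupervised
    k::Int
    max_iters::Int
    tol::Float64
end

ParallelKMeansModel(; k = 3, max_iters = 300, tol = 1e-6) =
    ParallelKMeansModel(k, max_iters, tol)

function MMI.fit(model::ParallelKMeansModel, verbosity::Int, X)
    # MLJ tables are observation-per-row; `kmeans` expects observation-per-column.
    data = permutedims(MMI.matrix(X))
    result = kmeans(data, model.k;
                    max_iters = model.max_iters,
                    tol       = model.tol,
                    verbose   = verbosity > 0)
    cache  = nothing
    report = (iterations = result.iterations, totalcost = result.totalcost)
    return result, cache, report
end

# Assign new observations to the nearest learned center; `centers` as a field of
# the fit result is also an assumption about the internal result struct.
function MMI.transform(model::ParallelKMeansModel, fitresult, Xnew)
    data    = permutedims(MMI.matrix(Xnew))
    centers = fitresult.centers
    dists   = [sum(abs2, data[:, j] .- centers[:, c])
               for j in axes(data, 2), c in axes(centers, 2)]
    return [argmin(view(dists, j, :)) for j in axes(dists, 1)]
end
```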

src/kmeans.jl

Lines changed: 1 addition & 1 deletion
@@ -173,7 +173,7 @@ end
 
 
 """
-Kmeans!(alg::AbstractKMeansAlg, containers, design_matrix, k; n_threads = nthreads(), k_init="k-means++", max_iters=300, tol=1e-6, verbose=true)
+Kmeans!(alg::AbstractKMeansAlg, containers, design_matrix, k; n_threads = nthreads(), k_init="k-means++", max_iters=300, tol=1e-6, verbose=false)
 
 Mutable version of `kmeans` function. Definition of arguments and results can be
 found in `kmeans`.