[do not merge] Large scale #175

Closed
120 commits
b3500dd
Creating branch for large scale measurements
ethanglaser Aug 30, 2024
4bd6c7f
strong scaling, config updates, minor revisions
ethanglaser Sep 18, 2024
984aab1
Merge branch 'IntelPython:large-scale' into large-scale
ethanglaser Sep 18, 2024
3cd955c
knn and forest config updates
ethanglaser Sep 21, 2024
e39dc2b
lint
md-shafiul-alam Sep 23, 2024
6e0fbf8
just gpu for regular
md-shafiul-alam Sep 23, 2024
7bb8fb4
juremove cuml
md-shafiul-alam Sep 23, 2024
535c1e4
Add incremental algorithms support
olegkkruglov Sep 23, 2024
d6952ac
Fix win yml
olegkkruglov Sep 23, 2024
9cf382e
refactor and kmeans strong
md-shafiul-alam Sep 23, 2024
6c8f529
refactor and add config
md-shafiul-alam Sep 23, 2024
3867a86
strong reduce nodes
md-shafiul-alam Sep 23, 2024
ed875b4
forest reg config
md-shafiul-alam Sep 23, 2024
c596a56
forest reg config
md-shafiul-alam Sep 23, 2024
4fee991
KNN weak
md-shafiul-alam Sep 23, 2024
fce0651
KNN strong
md-shafiul-alam Sep 23, 2024
e1ff9a0
experiment with ppn
md-shafiul-alam Sep 24, 2024
e3d9a35
experiment with ppn
md-shafiul-alam Sep 24, 2024
817710b
bf16
md-shafiul-alam Sep 24, 2024
aaa0039
bf16
md-shafiul-alam Sep 24, 2024
03a152a
Remove samples/ms info
olegkkruglov Sep 24, 2024
b7d962e
knn
md-shafiul-alam Sep 24, 2024
3ac5c23
Remove BS from config (need to add after pip version update)
olegkkruglov Sep 24, 2024
87b6fa6
basic stat single
md-shafiul-alam Sep 25, 2024
9461fad
Add condition for finalize
olegkkruglov Sep 25, 2024
b82d772
Fix num_batches usage
olegkkruglov Sep 25, 2024
c70e122
Creating branch for large scale measurements
ethanglaser Aug 30, 2024
8d74f6d
strong scaling, config updates, minor revisions
ethanglaser Sep 18, 2024
192744f
knn and forest config updates
ethanglaser Sep 21, 2024
b1f2c15
lint
md-shafiul-alam Sep 23, 2024
f3be737
just gpu for regular
md-shafiul-alam Sep 23, 2024
ee8c74b
juremove cuml
md-shafiul-alam Sep 23, 2024
93eae2f
Add metrics to list for proper report generation
olegkkruglov Sep 24, 2024
574ff2a
batch for online
md-shafiul-alam Sep 26, 2024
da7f425
online vs spmd
md-shafiul-alam Sep 26, 2024
2377a9e
spmd vs online fix
md-shafiul-alam Sep 26, 2024
3e4333e
batch vs online fix
md-shafiul-alam Sep 26, 2024
40ad9d5
increase online data size
md-shafiul-alam Sep 26, 2024
894ed1d
batch vs online fix
md-shafiul-alam Sep 26, 2024
36c57c3
separate nodes
md-shafiul-alam Sep 26, 2024
08f0aa8
minor
md-shafiul-alam Sep 26, 2024
3302212
dbscan
md-shafiul-alam Sep 26, 2024
1779a9f
config fixes
md-shafiul-alam Sep 27, 2024
4ac119e
config fix
md-shafiul-alam Sep 27, 2024
902f0ec
forest regression
md-shafiul-alam Sep 27, 2024
d40389e
forest regression
md-shafiul-alam Sep 27, 2024
906de02
forest regression
md-shafiul-alam Sep 27, 2024
7348b42
kmeans and logreg update
md-shafiul-alam Oct 1, 2024
270c841
forest reg data same as cls
md-shafiul-alam Oct 2, 2024
d172d2a
knn bf16
md-shafiul-alam Oct 2, 2024
29ea288
cov regular prev
md-shafiul-alam Oct 2, 2024
35282a0
add incremental
md-shafiul-alam Oct 4, 2024
13c0514
Update logreg.json
icfaust Oct 7, 2024
8532908
Update ensemble.json
icfaust Oct 7, 2024
c3ac4bb
Update kmeans.json
icfaust Oct 7, 2024
a8d898b
Update knn.json
icfaust Oct 7, 2024
fe90de2
Update logreg.json
icfaust Oct 7, 2024
7ab1cc3
Update pca.json
icfaust Oct 7, 2024
595a7ee
Update linear_model.json
icfaust Oct 7, 2024
8025719
dbscan large scale support and logreg details
ethanglaser Oct 7, 2024
fcaa907
reformat
md-shafiul-alam Oct 8, 2024
a4653a1
knn bf16
md-shafiul-alam Oct 8, 2024
4f65e1f
add bf16 cases
md-shafiul-alam Oct 8, 2024
105d203
merge changes from root and add config updates
ethanglaser Oct 8, 2024
c852279
forest bf16
md-shafiul-alam Oct 8, 2024
698d884
incremental
md-shafiul-alam Oct 8, 2024
5592d31
spmd online
md-shafiul-alam Oct 8, 2024
c47649a
fix
md-shafiul-alam Oct 8, 2024
687178b
incremental spmd
md-shafiul-alam Oct 10, 2024
5c97aed
incremental spmd test
md-shafiul-alam Oct 10, 2024
907b35a
incremental spmd
md-shafiul-alam Oct 10, 2024
7ed0235
incremental spmd
md-shafiul-alam Oct 10, 2024
d732e0f
rebase
ethanglaser Oct 15, 2024
e68edd5
configs nearly finalized + minor job updates
ethanglaser Oct 15, 2024
e834493
<=
ethanglaser Oct 16, 2024
75f2f10
lint
ethanglaser Oct 16, 2024
9b70f7e
Merge pull request #161 from ethanglaser/large-scale
ethanglaser Oct 16, 2024
fdd32d1
Update knn.json
icfaust Oct 16, 2024
99fdb89
Update linear_model.json
icfaust Oct 16, 2024
d419a01
minor
md-shafiul-alam Oct 17, 2024
72dfdd2
Merge branch 'main' into large-scale
ethanglaser Dec 12, 2024
fd59a64
Added updated configs.
KateBlueSky Mar 17, 2025
985db07
Added shift.
KateBlueSky Mar 17, 2025
34a30c7
Added center box.
KateBlueSky Mar 17, 2025
d47face
Removed the inertia for Kmeans.
KateBlueSky Mar 18, 2025
e617791
fixed config locations.
KateBlueSky Mar 18, 2025
00ac46d
Updated configs.
KateBlueSky Mar 18, 2025
f37f964
Moved large scale files.
KateBlueSky Mar 18, 2025
1c5552b
Added line.
KateBlueSky Mar 18, 2025
dcfef94
Added large scale 2k parameters sample shift
KateBlueSky Mar 18, 2025
4ba3fe4
Fixed imports.
KateBlueSky Mar 18, 2025
5c04a35
Updated format.
KateBlueSky Mar 18, 2025
72d65c1
Merge branch 'main' into large-scale
ethanglaser Mar 18, 2025
af48e96
Added the math import.
KateBlueSky Mar 18, 2025
c7f38f4
Rolled back the accidental changes to the ranked_based distributed_sp…
KateBlueSky Mar 20, 2025
264701e
Updated large scale 2k parameters for the full 24576 tiles.
KateBlueSky Mar 20, 2025
20419a9
Updated config files.
KateBlueSky Mar 20, 2025
4e93858
cleaned up diff.
KateBlueSky Mar 20, 2025
428f3df
Reformatted correctly.
KateBlueSky Mar 21, 2025
2f8c68b
Fixed if else.
KateBlueSky Mar 21, 2025
816c6dc
Updated format.
KateBlueSky Mar 21, 2025
5d3bf52
Added mpi4py
KateBlueSky Mar 21, 2025
edceece
Merge branch 'large-scale' of https://github.com/IntelPython/scikit-l…
KateBlueSky Mar 21, 2025
a937963
fixed mpi4py
KateBlueSky Mar 21, 2025
3809d17
Rolled back mpi4py.
KateBlueSky Mar 21, 2025
c128748
Formatted file.
KateBlueSky Mar 21, 2025
15db792
Removed environment from diff.
KateBlueSky Mar 21, 2025
30b0b80
initial alignment of configs to final results (#176)
ethanglaser Mar 21, 2025
f0fccdd
Revert "Removed environment from diff."
KateBlueSky Mar 21, 2025
548d824
Merged change from large-scale
KateBlueSky Mar 21, 2025
4d675ec
Removed extra code for sample_shift.
KateBlueSky Mar 21, 2025
e8fbd0b
Changes for sample_shift.
KateBlueSky Mar 21, 2025
a7cea17
Updated sample shift.
KateBlueSky Mar 21, 2025
f3c2757
Updated sample shift.
KateBlueSky Mar 21, 2025
2ae3c39
Removed extra code.
KateBlueSky Mar 21, 2025
39cc4f2
Added comment for sample_shift.
KateBlueSky Mar 21, 2025
3fc7c42
Added back in x_train in sample_shift.
KateBlueSky Mar 21, 2025
1bd5aa1
Updated description of sample_shift.
KateBlueSky Mar 21, 2025
06944c1
Added predict back in.
KateBlueSky Mar 21, 2025
2edb597
Merge pull request #174 from IntelPython/dev/large_scale_kmeans
KateBlueSky Mar 22, 2025
2 changes: 1 addition & 1 deletion configs/README.md
@@ -104,7 +104,7 @@ Configs have the three highest parameter keys:
| `data`:`format` | `pandas` | `pandas`, `numpy`, `cudf` | Data format to use in benchmark. |
| `data`:`order` | `F` | `C`, `F` | Data order to use in benchmark: contiguous(C) or Fortran. |
| `data`:`dtype` | `float64` | | Data type to use in benchmark. |
| `data`:`distributed_split` | None | None, `rank_based` | Split type used to distribute data between machines in distributed algorithm. `None` type means usage of all data without split on all machines. `rank_based` type splits the data equally between machines with split sequence based on rank id from MPI. |
| `data`:`distributed_split` | None | None, `rank_based`, `sample_shift` | Split type used to distribute data between machines in a distributed algorithm. `sample_shift` shifts each data point on each rank by (sqrt(rank id) * 0.003) + 1. `None` uses all data on all machines without a split. `rank_based` splits the data equally between machines, with the split sequence based on the rank id from MPI. |
|<h3>Algorithm parameters</h3>||||
| `algorithm`:`library` | None | | Python module containing measured entity (class or function). |
| `algorithm`:`device` | `default` | `default`, `cpu`, `gpu` | Device selected for computation. |
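
The split modes in the `data`:`distributed_split` row can be illustrated in plain Python. This is a sketch under stated assumptions, not the benchmark's implementation: `rank_based_split` and `sample_shift` are hypothetical helper names, `rank` and `size` stand in for the values MPI would report, the split is assumed to use contiguous chunks with the last rank taking any remainder, and the shift is assumed to be additive with the README formula read as `sqrt(rank) * 0.003 + 1`.

```python
import math


def rank_based_split(rows, rank, size):
    """Return the contiguous chunk of rows assigned to this rank (assumed layout)."""
    chunk = len(rows) // size
    start = rank * chunk
    # Assumption: the last rank takes any remainder so that no row is dropped.
    end = len(rows) if rank == size - 1 else start + chunk
    return rows[start:end]


def sample_shift(rows, rank):
    """Shift every value by sqrt(rank) * 0.003 + 1 (assumed to be additive)."""
    offset = math.sqrt(rank) * 0.003 + 1
    return [[value + offset for value in row] for row in rows]


if __name__ == "__main__":
    data = [[float(i), float(i) * 2] for i in range(8)]
    for rank in range(4):
        local = rank_based_split(data, rank, size=4)
        shifted = sample_shift(data[:1], rank)
        print(rank, local, shifted)
```

With `rank_based`, each rank sees a different slice of one dataset; with `sample_shift` as sketched here, every rank keeps the full data but at a slightly different offset, so the per-rank distributions differ without any rows being removed.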
5 changes: 5 additions & 0 deletions configs/common/sklearn.json
@@ -12,6 +12,11 @@
{ "library": "sklearnex", "device": ["cpu", "gpu"] }
]
},
"sklearn-ex[gpu] implementations": {
"algorithm": [
{ "library": "sklearnex", "device": ["gpu"] }
]
},
"sklearn-ex[preview] implementations": {
"algorithm": [
{ "library": "sklearn", "device": "cpu" },
85 changes: 85 additions & 0 deletions configs/regular/batch_for_online.json
@@ -0,0 +1,85 @@
{
"INCLUDE": ["../common/sklearn.json"],
"PARAMETERS_SETS": {
"common": {"bench": {"n_runs": 10}},
"basic_statistics data": {
"data": {
"source": "make_blobs",
"generation_kwargs": {
"centers": 1,
"n_samples": 12000000,
"n_features": [10, 100]
},
"split_kwargs": {"ignore": true}
}
},
"linear_regression data": {
"data": {
"source": "make_regression",
"split_kwargs": {"train_size": 0.2, "test_size": 0.8},
"generation_kwargs": {
"n_samples": 12000000,
"n_features": [10, 100],
"n_informative": 5,
"noise": 2.0
}
}
},
"pca data": {
"data": {
"source": "make_blobs",
"generation_kwargs": {
"centers": 1,
"n_samples": 12000000,
"n_features": [10, 100]
},
"split_kwargs": {"ignore": true}
}
},
"basic_statistics": {
"algorithm": [
{
"estimator": "BasicStatistics",
"library": "sklearnex.basic_statistics",
"estimator_methods": {"training": "fit"}
}
]
},
"covariance": {
"algorithm": [
{
"estimator": "EmpiricalCovariance",
"library": "sklearnex.preview.covariance",
"estimator_methods": {"training": "fit"}
}
]
},
"linear_regression": {
"algorithm": [
{
"estimator": "LinearRegression",
"library": "sklearnex.linear_model",
"estimator_methods": {"training": "fit"}
}
]
},
"pca": {
"algorithm": [
{
"estimator": "PCA",
"library": "sklearnex.decomposition",
"estimator_methods": {"training": "fit"}
}
]
}
},
"TEMPLATES": {
"basic_statistics": {"SETS": ["common", "basic_statistics", "basic_statistics data", "sklearn-ex[gpu] implementations"]},
"covariance": {"SETS": ["common", "basic_statistics data", "sklearn-ex[gpu] implementations", "covariance"]},
"linear_regression": {
"SETS": ["common", "linear_regression", "linear_regression data", "sklearn-ex[gpu] implementations"]
},
"pca": {"SETS": ["common", "pca", "pca data", "sklearn-ex[gpu] implementations"]}
}
}
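
A rough way to read this file: each entry in `TEMPLATES` names the `PARAMETERS_SETS` to combine, and list-valued parameters such as `"n_features": [10, 100]` stand for several benchmark cases. The sketch below illustrates that reading in plain Python. It is not the benchmark's own loader: `deep_merge`, `flatten`, and `expand_cases` are hypothetical helpers, the "later set wins" merge rule and the Cartesian expansion of lists are assumptions, and the sets are trimmed copies written as plain dicts (the real `sklearn-ex[gpu] implementations` set wraps its `algorithm` entry in a list).

```python
import copy
import itertools


def deep_merge(base, extra):
    """Recursively merge `extra` into a copy of `base`; later values win."""
    merged = copy.deepcopy(base)
    for key, value in extra.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def flatten(tree, prefix=()):
    """Yield (path, leaf) pairs for every non-dict value in a nested dict."""
    for key, value in tree.items():
        if isinstance(value, dict):
            yield from flatten(value, prefix + (key,))
        else:
            yield prefix + (key,), value


def expand_cases(sets):
    """Merge parameter sets in order, then expand list-valued leaves into cases."""
    merged = {}
    for params in sets:
        merged = deep_merge(merged, params)
    leaves = list(flatten(merged))
    paths = [path for path, _ in leaves]
    # Scalars become single-option axes so the product covers every leaf.
    options = [value if isinstance(value, list) else [value] for _, value in leaves]
    return [dict(zip(paths, combo)) for combo in itertools.product(*options)]


# Trimmed, simplified copies of the sets used by the "pca" template.
common = {"bench": {"n_runs": 10}}
pca = {"algorithm": {"estimator": "PCA", "estimator_methods": {"training": "fit"}}}
pca_data = {"data": {"source": "make_blobs",
                     "generation_kwargs": {"centers": 1, "n_samples": 12000000,
                                           "n_features": [10, 100]}}}
gpu_impl = {"algorithm": {"library": "sklearnex", "device": ["gpu"]}}

cases = expand_cases([common, pca, pca_data, gpu_impl])
for case in cases:
    print(case[("algorithm", "device")],
          case[("data", "generation_kwargs", "n_features")])
```

Under these assumptions the `pca` template yields one case per `n_features` value, all on the `gpu` device.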

27 changes: 27 additions & 0 deletions configs/regular/bf16/basic_statistics.json
@@ -0,0 +1,27 @@
{
"INCLUDE": ["../../common/sklearn.json"],
"PARAMETERS_SETS": {
"basic stats parameters": {
"algorithm": {
"estimator": "BasicStatistics"
},
"data": {
"dtype": ["float32"]
}
},
"synthetic data": {
"data": [
{ "source": "make_blobs", "generation_kwargs": { "n_samples": 10000000, "n_features": 10, "centers": 1 } }
]
}
},
"TEMPLATES": {
"basic_statistics": {
"SETS": [
"sklearn-ex[gpu] implementations",
"basic stats parameters",
"synthetic data"
]
}
}
}
28 changes: 28 additions & 0 deletions configs/regular/bf16/covariance.json
@@ -0,0 +1,28 @@
{
"INCLUDE": ["../../common/sklearn.json"],
"PARAMETERS_SETS": {
"covariance parameters": {
"algorithm": {
"estimator": "EmpiricalCovariance",
"library": "sklearnex.preview.covariance"
},
"data": {
"dtype": ["float32"]
}
},
"synthetic data": {
"data": [
{ "source": "make_blobs", "generation_kwargs": { "n_samples": 10000000, "n_features": 10, "centers": 1 } }
]
}
},
"TEMPLATES": {
"covariance": {
"SETS": [
"sklearn-ex[gpu] implementations",
"covariance parameters",
"synthetic data"
]
}
}
}
41 changes: 41 additions & 0 deletions configs/regular/bf16/dbscan.json
@@ -0,0 +1,41 @@
{
"INCLUDE": ["../../common/sklearn.json"],
"PARAMETERS_SETS": {
"common dbscan parameters": {
"algorithm": {
"estimator": "DBSCAN",
"estimator_params": {
"eps": "[SPECIAL_VALUE]distances_quantile:0.01",
"min_samples": 5,
"metric": "euclidean"
}
},
"data": {
"dtype": ["float32"]
}
},
"sklearn dbscan parameters": {
"algorithm": {
"estimator_params": {
"algorithm": "brute",
"n_jobs": "[SPECIAL_VALUE]physical_cpus"
}
}
},
"synthetic dataset": {
"data": [
{ "source": "make_blobs", "generation_kwargs": { "n_samples": 100000, "n_features": 10, "centers": 10 }, "algorithm": { "eps": 5, "min_samples": 5 } }
]
}
},
"TEMPLATES": {
"sklearn dbscan": {
"SETS": [
"sklearn-ex[gpu] implementations",
"common dbscan parameters",
"sklearn dbscan parameters",
"synthetic dataset"
]
}
}
}
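
The `eps` entry above uses `[SPECIAL_VALUE]distances_quantile:0.01`. One plausible reading, offered here only as an assumption, is "the 0.01 quantile of pairwise Euclidean distances between samples". The sketch below computes that quantity on a small random sample in plain Python. `distances_quantile` is a hypothetical helper, the nearest-rank quantile rule is a choice made for this example, and the benchmark may subsample or compute the value differently.

```python
import itertools
import math
import random


def distances_quantile(points, q):
    """Return the q-quantile of all pairwise Euclidean distances (nearest-rank)."""
    distances = sorted(math.dist(a, b) for a, b in itertools.combinations(points, 2))
    # Nearest-rank rule: smallest distance that covers a fraction q of all pairs.
    index = max(0, math.ceil(q * len(distances)) - 1)
    return distances[index]


random.seed(0)
# Small stand-in for the 10-feature dataset; 200 points give 19900 pairs.
sample = [[random.gauss(0, 1) for _ in range(10)] for _ in range(200)]
eps = distances_quantile(sample, 0.01)
print(f"eps for the 0.01 quantile: {eps:.3f}")
```

A low quantile keeps `eps` small relative to typical distances, so only the closest 1% of point pairs would count as neighbors under this reading.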
34 changes: 34 additions & 0 deletions configs/regular/bf16/forest.json
@@ -0,0 +1,34 @@
{
"INCLUDE": ["../../common/sklearn.json"],
"PARAMETERS_SETS": {
"common forest params": {
"data": {
"dtype": ["float32"]
}
},
"forest classifier params": {
"algorithm": {"estimator": "RandomForestClassifier"},
"data": { "source": "make_classification", "split_kwargs": { "train_size": 500000, "test_size": 1000 }, "generation_kwargs": { "n_samples": 501000, "n_features": 10, "n_classes": 2 }, "algorithm": { "estimator_params": { "n_estimators": 20, "max_depth": 4 } } }
},
"forest regression params": {
"algorithm": {"estimator": "RandomForestRegressor"},
"data": { "source": "make_regression", "generation_kwargs": { "n_samples": 501000, "n_features": 10, "noise": 1.25 }, "split_kwargs": { "train_size": 500000, "test_size": 1000 }, "algorithm": { "estimator_params": { "n_estimators": 20, "max_depth": 4 } }}
}
},
"TEMPLATES": {
"forest cls": {
"SETS": [
"sklearn-ex[gpu] implementations",
"common forest params",
"forest classifier params"
]
},
"forest reg": {
"SETS": [
"sklearn-ex[gpu] implementations",
"common forest params",
"forest regression params"
]
}
}
}
40 changes: 40 additions & 0 deletions configs/regular/bf16/kmeans.json
@@ -0,0 +1,40 @@
{
"INCLUDE": ["../../common/sklearn.json"],
"PARAMETERS_SETS": {
"common kmeans parameters": {
"algorithm": {
"estimator": "KMeans",
"estimator_params": {
"n_clusters": "[SPECIAL_VALUE]auto",
"n_init": 1,
"max_iter": 30,
"tol": 1e-3,
"random_state": 42
},
"estimator_methods": { "inference": "predict" }
},
"data": {
"dtype": ["float32"],
"preprocessing_kwargs": { "normalize": true }
}
},
"sklearn kmeans parameters": {
"algorithm": { "estimator_params": { "init": "k-means++", "algorithm": "lloyd" } }
},
"synthetic data": {
"data": [
{ "source": "make_blobs", "generation_kwargs": { "n_samples": 1000000, "n_features": 100, "centers": 100 }, "algorithm": { "n_clusters": 100, "max_iter": 100 } }
]
}
},
"TEMPLATES": {
"sklearn kmeans": {
"SETS": [
"sklearn-ex[gpu] implementations",
"common kmeans parameters",
"sklearn kmeans parameters",
"synthetic data"
]
}
}
}
56 changes: 56 additions & 0 deletions configs/regular/bf16/knn.json
@@ -0,0 +1,56 @@
{
"INCLUDE": ["../../common/sklearn.json"],
"PARAMETERS_SETS": {
"common knn parameters": {
"algorithm": {
"estimator_params": {
"n_neighbors": [10, 100],
"weights": "uniform"
}
},
"data": {
"dtype": ["float32"],
"preprocessing_kwargs": { "normalize": true }
}
},
"sklearn knn parameters": {
"algorithm": { "estimator_params": { "n_jobs": "[SPECIAL_VALUE]physical_cpus" } }
},
"synthetic classification data": {
"algorithm": {
"estimator": "KNeighborsClassifier",
"estimator_params": { "algorithm": "brute", "metric": "minkowski", "p": [1, 2] }
},
"data": [
{ "source": "make_classification", "split_kwargs": { "train_size": 5000000, "test_size": 1000 }, "generation_kwargs": { "n_samples": 5001000, "n_features": 100, "n_classes": 2, "n_informative": "[SPECIAL_VALUE]0.5" } }
]
},
"synthetic regression data": {
"algorithm": {
"estimator": "KNeighborsRegressor",
"estimator_params": { "algorithm": "brute", "metric": "minkowski", "p": [1, 2] }
},
"data": [
{ "source": "make_regression", "split_kwargs": { "train_size": 5000000, "test_size": 1000 }, "generation_kwargs": { "n_samples": 5001000, "n_features": 100, "noise":1.5 } }
]
}
},
"TEMPLATES": {
"sklearn brute knn clsf": {
"SETS": [
"sklearn-ex[gpu] implementations",
"common knn parameters",
"sklearn knn parameters",
"synthetic classification data"
]
},
"sklearn brute knn reg": {
"SETS": [
"sklearn-ex[gpu] implementations",
"common knn parameters",
"sklearn knn parameters",
"synthetic regression data"
]
}
}
}
33 changes: 33 additions & 0 deletions configs/regular/bf16/linear_model.json
@@ -0,0 +1,33 @@
{
"INCLUDE": ["../../common/sklearn.json"],
"PARAMETERS_SETS": {
"synthetic data": {
"data": [
{ "source": "make_regression", "generation_kwargs": { "n_samples": 3005000, "n_features": 10, "noise": 1.25 }, "split_kwargs": { "train_size": 3000000, "test_size": 5000 } }
]
},
"common linear parameters": {
"algorithm": {
"estimator": "LinearRegression",
"estimator_params": { "fit_intercept": true, "copy_X": true }
},
"data": {
"dtype": ["float32"],
"order": "C"
}
},
"sklearn linear parameters": {
"estimator_params": { "n_jobs": "[SPECIAL_VALUE]physical_cpus" }
}
},
"TEMPLATES": {
"sklearn linear": {
"SETS": [
"sklearn-ex[gpu] implementations",
"common linear parameters",
"sklearn linear parameters",
"synthetic data"
]
}
}
}
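
The data entries in these configs pair `n_samples` with `train_size` and `test_size` values that add up to it (here 3000000 + 5000 = 3005000). A small consistency check along those lines can be sketched as follows. `check_split` is a hypothetical helper, not part of the benchmark, and it assumes that integer sizes are absolute row counts and float sizes are fractions of the dataset.

```python
import json


def check_split(entry):
    """Return True if train/test sizes are consistent with n_samples."""
    n_samples = entry["generation_kwargs"]["n_samples"]
    split = entry.get("split_kwargs", {})
    train, test = split.get("train_size"), split.get("test_size")
    if train is None or test is None:
        return True  # nothing to check, for example {"ignore": true}
    if isinstance(train, float) or isinstance(test, float):
        # Fractions must not request more than the whole dataset.
        return train + test <= 1.0 + 1e-9
    # Absolute counts must not exceed the number of generated rows.
    return train + test <= n_samples


entry = json.loads("""
{ "source": "make_regression",
  "generation_kwargs": { "n_samples": 3005000, "n_features": 10, "noise": 1.25 },
  "split_kwargs": { "train_size": 3000000, "test_size": 5000 } }
""")
print(check_split(entry))
```

Running a check like this over every data entry before a long benchmark job is a cheap way to catch a size that was edited in one place but not the other.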