[Merge only onto large-scale] Large scale Kmeans changes. #174


Merged
merged 36 commits on Mar 22, 2025
Changes from 31 commits
Commits (36)
fd59a64
Added updated configs.
KateBlueSky Mar 17, 2025
985db07
Added shift.
KateBlueSky Mar 17, 2025
34a30c7
Added center box.
KateBlueSky Mar 17, 2025
d47face
Removed the inertia for Kmeans.
KateBlueSky Mar 18, 2025
e617791
fixed config locations.
KateBlueSky Mar 18, 2025
00ac46d
Updated configs.
KateBlueSky Mar 18, 2025
f37f964
Moved large scale files.
KateBlueSky Mar 18, 2025
1c5552b
Added line.
KateBlueSky Mar 18, 2025
dcfef94
Added large scale 2k parameters sample shift
KateBlueSky Mar 18, 2025
4ba3fe4
Fixed imports.
KateBlueSky Mar 18, 2025
5c04a35
Updated format.
KateBlueSky Mar 18, 2025
af48e96
Added the math import.
KateBlueSky Mar 18, 2025
c7f38f4
Rolled back the accidental changes to the ranked_based distributed_sp…
KateBlueSky Mar 20, 2025
264701e
Updated large scale 2k parameters for the full 24576 tiles.
KateBlueSky Mar 20, 2025
20419a9
Updated config files.
KateBlueSky Mar 20, 2025
4e93858
cleaned up diff.
KateBlueSky Mar 20, 2025
428f3df
Reformatted correctly.
KateBlueSky Mar 21, 2025
2f8c68b
Fixed if else.
KateBlueSky Mar 21, 2025
816c6dc
Updated format.
KateBlueSky Mar 21, 2025
5d3bf52
Added mpi4py
KateBlueSky Mar 21, 2025
edceece
Merge branch 'large-scale' of https://github.com/IntelPython/scikit-l…
KateBlueSky Mar 21, 2025
a937963
fixed mpi4py
KateBlueSky Mar 21, 2025
3809d17
Rolled back mpi4py.
KateBlueSky Mar 21, 2025
c128748
Formatted file.
KateBlueSky Mar 21, 2025
15db792
Removed environment from diff.
KateBlueSky Mar 21, 2025
f0fccdd
Revert "Removed environment from diff."
KateBlueSky Mar 21, 2025
548d824
Merged change from large-scale
KateBlueSky Mar 21, 2025
4d675ec
Removed extra code for sample_shift.
KateBlueSky Mar 21, 2025
e8fbd0b
Changes for sample_shift.
KateBlueSky Mar 21, 2025
a7cea17
Updated sample shift.
KateBlueSky Mar 21, 2025
f3c2757
Updated sample shift.
KateBlueSky Mar 21, 2025
2ae3c39
Removed extra code.
KateBlueSky Mar 21, 2025
39cc4f2
Added comment for sample_shift.
KateBlueSky Mar 21, 2025
3fc7c42
Added back in x_train in sample_shift.
KateBlueSky Mar 21, 2025
1bd5aa1
Updated description of sample_shift.
KateBlueSky Mar 21, 2025
06944c1
Added predict back in.
KateBlueSky Mar 21, 2025
2 changes: 1 addition & 1 deletion configs/README.md
@@ -104,7 +104,7 @@ Configs have the three highest parameter keys:
| `data`:`format` | `pandas` | `pandas`, `numpy`, `cudf` | Data format to use in benchmark. |
| `data`:`order` | `F` | `C`, `F` | Data order to use in benchmark: contiguous(C) or Fortran. |
| `data`:`dtype` | `float64` | | Data type to use in benchmark. |
| `data`:`distributed_split` | None | None, `rank_based` | Split type used to distribute data between machines in distributed algorithm. `None` type means usage of all data without split on all machines. `rank_based` type splits the data equally between machines with split sequence based on rank id from MPI. |
| `data`:`distributed_split` | None | None, `rank_based`, `sample_shift` | Split type used to distribute data between machines in the distributed algorithm. `None` means all machines use the full dataset without splitting. `rank_based` splits the data equally between machines, with the split sequence based on the rank id from MPI. `sample_shift` keeps the full dataset on each rank but multiplies every data point by `(sqrt(rank_id) * 0.003) + 1`. |
|<h3>Algorithm parameters</h3>||||
| `algorithm`:`library` | None | | Python module containing measured entity (class or function). |
| `algorithm`:`device` | `default` | `default`, `cpu`, `gpu` | Device selected for computation. |
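For reference, a minimal sketch (not part of the benchmark code) of the multiplier that `sample_shift` applies per MPI rank, using the formula from `sklbench/datasets/transformer.py` in this PR; the factor is exactly 1.0 on rank 0 and grows slowly with the rank id:

```python
import math

# sample_shift multiplier per MPI rank: (sqrt(rank) * 0.003) + 1
for rank in (0, 1, 4, 100, 1024):
    factor = (math.sqrt(rank) * 0.003) + 1
    print(rank, round(factor, 6))
# 0 -> 1.0, 1 -> 1.003, 4 -> 1.006, 100 -> 1.03, 1024 -> 1.096
```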
33 changes: 33 additions & 0 deletions configs/spmd/large_scale/kmeans_narrow_weak.json
@@ -0,0 +1,33 @@
{
"INCLUDE": ["../../common/sklearn.json", "large_scale.json"],
"PARAMETERS_SETS": {
"spmd kmeans parameters": {
"algorithm": {
"estimator": "KMeans",
"estimator_params": {
"algorithm": "lloyd",
"max_iter": 20,
"n_clusters": 10,
"random_state": 42
},
"estimator_methods": { "training": "fit", "inference": "" },
"sklearnex_context": { "use_raw_input": true }
}
},
"synthetic data": {
"data": [
{ "source": "make_blobs", "generation_kwargs": { "n_samples": 2000000, "n_features": 100, "centers": 2000, "cluster_std": 3, "center_box": 100.0}}
]
}
},
"TEMPLATES": {
"kmeans": {
"SETS": [
"synthetic data",
"sklearnex spmd implementation",
"large scale 2k parameters sample shift",
"spmd kmeans parameters"
]
}
}
}
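As a rough sketch of what the `kmeans_narrow_weak` template amounts to on a single process (assuming scikit-learn's `make_blobs` and `KMeans`; the real run goes through sklbench with the sklearnex SPMD estimator and raw input, so this only approximates the per-rank workload):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Approximation of the "kmeans_narrow_weak" workload on one process.
# The scalar center_box of 100.0 in the config is expanded by sklbench to the
# symmetric range (-100.0, 100.0) before make_blobs is called (see
# sklbench/datasets/__init__.py below).
X, _ = make_blobs(
    n_samples=2_000_000,
    n_features=100,
    centers=2000,
    cluster_std=3,
    center_box=(-100.0, 100.0),
)

km = KMeans(algorithm="lloyd", max_iter=20, n_clusters=10, random_state=42)
km.fit(X)  # the config benchmarks "fit" only ("inference": "")
```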
16 changes: 9 additions & 7 deletions configs/spmd/large_scale/kmeans_strong.json
@@ -5,16 +5,17 @@
"algorithm": {
"estimator": "KMeans",
"estimator_params": {
"algorithm": "lloyd"
"algorithm": "lloyd",
"max_iter": 20,
"n_clusters": 100
},
"estimator_methods": { "training": "fit", "inference": "predict" }
"estimator_methods": { "training": "fit", "inference": "predict" },
"sklearnex_context": { "use_raw_input": true }
}
},
"synthetic data": {
},
"synthetic data": {
"data": [
{ "source": "make_blobs", "generation_kwargs": { "n_samples": 5000000, "n_features": 10, "centers": 10 }, "algorithm": { "n_clusters": 10, "max_iter": 10 } },
{ "source": "make_blobs", "generation_kwargs": { "n_samples": 30000, "n_features": 1000, "centers": 10 }, "algorithm": { "n_clusters": 10, "max_iter": 10 } },
{ "source": "make_blobs", "generation_kwargs": { "n_samples": 1000000, "n_features": 100, "centers": 100 }, "algorithm": { "n_clusters": 100, "max_iter": 100 } }
{ "source": "make_blobs", "generation_kwargs": { "n_samples": 25000000, "n_features": 100, "centers": 100 }}
]
}
},
@@ -29,3 +30,4 @@
}
}
}

@@ -5,15 +5,18 @@
"algorithm": {
"estimator": "KMeans",
"estimator_params": {
"algorithm": "lloyd"
"algorithm": "lloyd",
"max_iter": 20,
"n_clusters": 10,
"random_state": 42
},
"estimator_methods": { "training": "fit", "inference": "predict" }
"estimator_methods": { "training": "fit", "inference": "" },
"sklearnex_context": { "use_raw_input": true }
}
},
"synthetic data": {
},
"synthetic data": {
"data": [
{ "source": "make_blobs", "generation_kwargs": { "n_samples": 5000000, "n_features": 10, "centers": 10 }, "algorithm": { "n_clusters": 10, "max_iter": 10 } },
{ "source": "make_blobs", "generation_kwargs": { "n_samples": 30000, "n_features": 1000, "centers": 10 }, "algorithm": { "n_clusters": 10, "max_iter": 10 } }
{ "source": "make_blobs", "generation_kwargs": { "n_samples": 1000000, "n_features": 1000, "centers": 2000}}
]
}
},
@@ -28,3 +31,4 @@
}
}
}

9 changes: 9 additions & 0 deletions configs/spmd/large_scale/large_scale.json
@@ -27,6 +27,15 @@
"mpi_params": {"n": [1,2,6,12,24,48,96,192,384,768,1536,3072,6144,12288,24576], "ppn": 12, "-hostfile": "", "-cpu-bind=list:0-7,104-111:8-15,112-119:16-23,120-127:24-31,128-135:32-39,136-143:40-47,144-151:52-59,156-163:60-67,164-171:68-75,172-179:76-83,180-187:84-91,188-195:92-99,196-203": "--envall gpu_tile_compact.sh" }
}
},
"large scale 2k parameters sample shift": {
"data": {
"dtype": "float64",
"distributed_split": "sample_shift"
},
"bench": {
"mpi_params": {"n": [1,2,6,12,24,48,96,192,384,768,1536,3072,6144,12288,24576], "ppn": 12, "-hostfile": "", "-cpu-bind=list:0-7,104-111:8-15,112-119:16-23,120-127:24-31,128-135:32-39,136-143:40-47,144-151:52-59,156-163:60-67,164-171:68-75,172-179:76-83,180-187:84-91,188-195:92-99,196-203": "--envall gpu_tile_compact.sh" }
}
},
"large scale 32 parameters": {
"data": {
"dtype": "float64",
13 changes: 0 additions & 13 deletions sklbench/benchmarks/sklearn_estimator.py
@@ -191,19 +191,6 @@ def get_subset_metrics_of_estimator(
}
)
elif task == "clustering":
if hasattr(estimator_instance, "inertia_"):
# compute inertia manually using distances to cluster centers
# provided by KMeans.transform
metrics.update(
{
"inertia": float(
np.power(
convert_to_numpy(estimator_instance.transform(x)).min(axis=1),
2,
).sum()
)
}
)
if hasattr(estimator_instance, "predict"):
y_pred = convert_to_numpy(estimator_instance.predict(x))
metrics.update(
5 changes: 5 additions & 0 deletions sklbench/datasets/__init__.py
@@ -67,6 +67,11 @@ def load_data(bench_case: BenchCase) -> Tuple[Dict, Dict]:
generation_kwargs = get_bench_case_value(
bench_case, "data:generation_kwargs", dict()
)
if "center_box" in generation_kwargs:
generation_kwargs["center_box"] = (
-1 * generation_kwargs["center_box"],
generation_kwargs["center_box"],
)
return load_sklearn_synthetic_data(
function_name=source,
input_kwargs=generation_kwargs,
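In effect, a scalar `center_box` value from a config is expanded into the symmetric `(-value, value)` range that `make_blobs` expects. A small standalone sketch of the conversion, using the value from `kmeans_narrow_weak.json`:

```python
# Sketch of the scalar-to-range conversion applied to generation_kwargs.
generation_kwargs = {"n_samples": 2000000, "n_features": 100,
                     "centers": 2000, "cluster_std": 3, "center_box": 100.0}

if "center_box" in generation_kwargs:
    generation_kwargs["center_box"] = (
        -1 * generation_kwargs["center_box"],
        generation_kwargs["center_box"],
    )

print(generation_kwargs["center_box"])  # (-100.0, 100.0)
```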
22 changes: 21 additions & 1 deletion sklbench/datasets/transformer.py
@@ -14,6 +14,7 @@
# limitations under the License.
# ===============================================================================

import math
import os

import numpy as np
@@ -113,7 +114,26 @@ def split_and_transform_data(bench_case, data, data_description):
# "KNeighbors" in get_bench_case_value(bench_case, "algorithm:estimator", "")
# and int(get_bench_case_value(bench_case, "bench:mpi_params:n", 1)) > 1
# )
if distributed_split == "rank_based":
if distributed_split == "sample_shift":
from mpi4py import MPI

rank = MPI.COMM_WORLD.Get_rank()
adjust_number = (math.sqrt(rank) * 0.003) + 1

if "y" in data:
x_train, y_train = (
x_train * adjust_number,
y_train,
)

x_test, y_test = (
x_test * adjust_number,
y_test,
)
else:
x_test = x_test * adjust_number

elif distributed_split == "rank_based":
from mpi4py import MPI

comm = MPI.COMM_WORLD
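A standalone sketch of the `sample_shift` branch without MPI (simulating a few rank ids instead of `MPI.COMM_WORLD.Get_rank()`), to show that every rank keeps the full dataset, only the features are scaled, and labels are left untouched:

```python
import math

import numpy as np

# Simulate the per-rank scaling applied by the sample_shift split.
x_train = np.array([[1.0, 2.0], [3.0, 4.0]])
y_train = np.array([0, 1])  # labels are not modified by sample_shift

for rank in range(4):  # stand-in for MPI.COMM_WORLD.Get_rank()
    adjust_number = (math.sqrt(rank) * 0.003) + 1
    x_scaled = x_train * adjust_number
    print(rank, adjust_number, x_scaled[0])
# rank 0 keeps the original data (factor 1.0); higher ranks are scaled slightly.
```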