
Commit eddb9e8

Refactoring of benchmarks (#133)
1 parent 1d29e5c commit eddb9e8


219 files changed (+8158 -20705 lines)


Diff for: .github/CODEOWNERS (+7 -9)

@@ -1,9 +1,7 @@
-#owners and reviewers
-cuml_bench/* @Alexsandruss
-daal4py_bench/* @Alexsandruss @samir-nasibli
-datasets/* @Alexsandruss
-modelbuilders_bench/* @Alexsandruss
-report_generator/* @Alexsandruss
-sklearn_bench/* @Alexsandruss @samir-nasibli
-xgboost_bench/* @Alexsandruss
-*.md @Alexsandruss @maria-Petrova
+# owners and reviewers
+configs @Alexsandruss
+configs/spmd* @Alexsandruss @ethanglaser
+sklbench @Alexsandruss
+*.md @Alexsandruss @samir-nasibli
+requirements*.txt @Alexsandruss @ethanglaser
+conda-env-*.yml @Alexsandruss @ethanglaser

Diff for: .gitignore (+8 -8)

@@ -1,18 +1,18 @@
-# Logs
-*.log
-
 # Release and work directories
 __pycache__*
 __work*

 # Visual Studio related files, e.g., ".vscode"
 .vs*

-# Datasets
-data
+# Dataset files
+data_cache
 *.csv
 *.npy
+*.npz

-# Results
-results*.json
-*.xlsx
+# Results at repo root
+vtune_results
+/*.json
+/*.xlsx
+/*.ipynb

Diff for: .pre-commit-config.yaml (+27, new file)

@@ -0,0 +1,27 @@
+#===============================================================================
+# Copyright 2024 Intel Corporation
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#===============================================================================
+
+repos:
+  - repo: https://github.com/psf/black
+    rev: 23.7.0
+    hooks:
+      - id: black
+        language_version: python3.10
+  - repo: https://github.com/PyCQA/isort
+    rev: 5.12.0
+    hooks:
+      - id: isort
+        language_version: python3.10
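The configuration above only declares the black and isort hooks; it does not show how they are invoked. A minimal sketch of how a contributor would typically enable and run them locally (the standalone `pre-commit` tool itself is an assumption here, since installing it is not part of this diff):

```bash
# install the pre-commit tool and register the hooks declared in .pre-commit-config.yaml
pip install pre-commit
pre-commit install

# run black and isort once over the whole repository
pre-commit run --all-files
```

After `pre-commit install`, the same hooks run automatically on each `git commit`.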

Diff for: README.md (+66 -108)

@@ -1,147 +1,105 @@
-
-# Machine Learning Benchmarks <!-- omit in toc -->
+# Machine Learning Benchmarks

 [![Build Status](https://dev.azure.com/daal/scikit-learn_bench/_apis/build/status/IntelPython.scikit-learn_bench?branchName=main)](https://dev.azure.com/daal/scikit-learn_bench/_build/latest?definitionId=8&branchName=main)

-**Machine Learning Benchmarks** contains implementations of machine learning algorithms
-across data analytics frameworks. Scikit-learn_bench can be extended to add new frameworks
-and algorithms. It currently supports the [scikit-learn](https://scikit-learn.org/),
-[DAAL4PY](https://intelpython.github.io/daal4py/), [cuML](https://github.com/rapidsai/cuml),
-and [XGBoost](https://github.com/dmlc/xgboost) frameworks for commonly used
-[machine learning algorithms](#supported-algorithms).
-
-## Follow us on Medium <!-- omit in toc -->
-
-We publish blogs on Medium, so [follow us](https://medium.com/intel-analytics-software/tagged/machine-learning) to learn tips and tricks for more efficient data analysis. Here are our latest blogs:
+**Scikit-learn_bench** is a benchmark tool for libraries and frameworks implementing Scikit-learn-like APIs and other workloads.

-- [Save Time and Money with Intel Extension for Scikit-learn](https://medium.com/intel-analytics-software/save-time-and-money-with-intel-extension-for-scikit-learn-33627425ae4)
-- [Superior Machine Learning Performance on the Latest Intel Xeon Scalable Processors](https://medium.com/intel-analytics-software/superior-machine-learning-performance-on-the-latest-intel-xeon-scalable-processor-efdec279f5a3)
-- [Leverage Intel Optimizations in Scikit-Learn](https://medium.com/intel-analytics-software/leverage-intel-optimizations-in-scikit-learn-f562cb9d5544)
-- [Optimizing CatBoost Performance](https://medium.com/intel-analytics-software/optimizing-catboost-performance-4f73f0593071)
-- [Intel Gives Scikit-Learn the Performance Boost Data Scientists Need](https://medium.com/intel-analytics-software/intel-gives-scikit-learn-the-performance-boost-data-scientists-need-42eb47c80b18)
-- [From Hours to Minutes: 600x Faster SVM](https://medium.com/intel-analytics-software/from-hours-to-minutes-600x-faster-svm-647f904c31ae)
-- [Improve the Performance of XGBoost and LightGBM Inference](https://medium.com/intel-analytics-software/improving-the-performance-of-xgboost-and-lightgbm-inference-3b542c03447e)
-- [Accelerate Kaggle Challenges Using Intel AI Analytics Toolkit](https://medium.com/intel-analytics-software/accelerate-kaggle-challenges-using-intel-ai-analytics-toolkit-beb148f66d5a)
-- [Accelerate Your scikit-learn Applications](https://medium.com/intel-analytics-software/improving-the-performance-of-xgboost-and-lightgbm-inference-3b542c03447e)
-- [Optimizing XGBoost Training Performance](https://medium.com/intel-analytics-software/new-optimizations-for-cpu-in-xgboost-1-1-81144ea21115)
-- [Accelerate Linear Models for Machine Learning](https://medium.com/intel-analytics-software/accelerating-linear-models-for-machine-learning-5a75ff50a0fe)
-- [Accelerate K-Means Clustering](https://medium.com/intel-analytics-software/accelerate-k-means-clustering-6385088788a1)
-- [Fast Gradient Boosting Tree Inference](https://medium.com/intel-analytics-software/fast-gradient-boosting-tree-inference-for-intel-xeon-processors-35756f174f55)
+Benefits:
+- Full control of benchmarks suite through CLI
+- Flexible and powerful benchmark config structure
+- Available with advanced profiling tools, such as Intel(R) VTune* Profiler
+- Automated benchmarks report generation

-## Table of content <!-- omit in toc -->
+### 📜 Table of Contents

-- [How to create conda environment for benchmarking](#how-to-create-conda-environment-for-benchmarking)
-- [Running Python benchmarks with runner script](#running-python-benchmarks-with-runner-script)
-- [Benchmark supported algorithms](#benchmark-supported-algorithms)
-- [Scikit-learn benchmakrs](#scikit-learn-benchmakrs)
-- [Algorithm parameters](#algorithm-parameters)
+- [Machine Learning Benchmarks](#machine-learning-benchmarks)
+- [🔧 Create a Python Environment](#-create-a-python-environment)
+- [🚀 How To Use Scikit-learn\_bench](#-how-to-use-scikit-learn_bench)
+- [Benchmarks Runner](#benchmarks-runner)
+- [Report Generator](#report-generator)
+- [Scikit-learn\_bench High-Level Workflow](#scikit-learn_bench-high-level-workflow)
+- [📚 Benchmark Types](#-benchmark-types)
+- [📑 Documentation](#-documentation)

-## How to create conda environment for benchmarking
+## 🔧 Create a Python Environment

-Create a suitable conda environment for each framework to test. Each item in the list below links to instructions to create an appropriate conda environment for the framework.
+How to create a usable Python environment with the following required frameworks:

-- [**scikit-learn**](sklearn_bench#how-to-create-conda-environment-for-benchmarking)
+- **sklearn, sklearnex, and gradient boosting frameworks**:

 ```bash
-pip install -r sklearn_bench/requirements.txt
-# or
-conda install -c intel scikit-learn scikit-learn-intelex pandas tqdm
+# with pip
+pip install -r envs/requirements-sklearn.txt
+# or with conda
+conda env create -n sklearn -f envs/conda-env-sklearn.yml
 ```

-- [**daal4py**](daal4py_bench#how-to-create-conda-environment-for-benchmarking)
+- **RAPIDS**:

 ```bash
-conda install -c conda-forge scikit-learn daal4py pandas tqdm
+conda env create -n rapids --solver=libmamba -f envs/conda-env-rapids.yml
 ```

-- [**cuml**](cuml_bench#how-to-create-conda-environment-for-benchmarking)
+## 🚀 How To Use Scikit-learn_bench

-```bash
-conda install -c rapidsai -c conda-forge cuml pandas cudf tqdm
-```
+### Benchmarks Runner

-- [**xgboost**](xgboost_bench#how-to-create-conda-environment-for-benchmarking)
+How to run benchmarks using the `sklbench` module and a specific configuration:

 ```bash
-pip install -r xgboost_bench/requirements.txt
-# or
-conda install -c conda-forge xgboost scikit-learn pandas tqdm
+python -m sklbench --config configs/sklearn_example.json
 ```

-## Running Python benchmarks with runner script
-
-Run `python runner.py --configs configs/config_example.json [--output-file result.json --verbose INFO --report]` to launch benchmarks.
-
-Options:
-
-- ``--configs``: specify the path to a configuration file or a folder that contains configuration files.
-- ``--no-intel-optimized``: use Scikit-learn without [Intel(R) Extension for Scikit-learn*](#intelr-extension-for-scikit-learn-support). Now available for [scikit-learn benchmarks](https://github.com/IntelPython/scikit-learn_bench/tree/main/sklearn_bench). By default, the runner uses Intel(R) Extension for Scikit-learn.
-- ``--output-file``: specify the name of the output file for the benchmark result. The default name is `result.json`
-- ``--report``: create an Excel report based on benchmark results. The `openpyxl` library is required.
-- ``--dummy-run``: run configuration parser and dataset generation without benchmarks running.
-- ``--verbose``: *WARNING*, *INFO*, *DEBUG*. Print out additional information when the benchmarks are running. The default is *INFO*.
-
-| Level | Description |
-|-----------|---------------|
-| *DEBUG* | etailed information, typically of interest only when diagnosing problems. Usually at this level the logging output is so low level that it’s not useful to users who are not familiar with the software’s internals. |
-| *INFO* | Confirmation that things are working as expected. |
-| *WARNING* | An indication that something unexpected happened, or indicative of some problem in the near future (e.g. ‘disk space low’). The software is still working as expected. |
-
-Benchmarks currently support the following frameworks:
+The default output is a file with JSON-formatted results of benchmarking cases. To generate a better human-readable report, use the following command:

-- **scikit-learn**
-- **daal4py**
-- **cuml**
-- **xgboost**
+```bash
+python -m sklbench --config configs/sklearn_example.json --report
+```

-The configuration of benchmarks allows you to select the frameworks to run, select datasets for measurements and configure the parameters of the algorithms.
+By default, output and report file paths are `result.json` and `report.xlsx`. To specify custom file paths, run:

-You can configure benchmarks by editing a config file. Check [config.json schema](https://github.com/IntelPython/scikit-learn_bench/blob/main/configs/README.md) for more details.
+```bash
+python -m sklbench --config configs/sklearn_example.json --report --result-file result_example.json --report-file report_example.xlsx
+```

-## Benchmark supported algorithms
+For a description of all benchmarks runner arguments, refer to [documentation](sklbench/runner/README.md#arguments).

-| algorithm | benchmark name | sklearn (CPU) | sklearn (GPU) | daal4py | cuml | xgboost |
-|---|---|---|---|---|---|---|
-|**[DBSCAN](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html)**|dbscan|:white_check_mark:|:white_check_mark:|:white_check_mark:|:white_check_mark:|:x:|
-|**[RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)**|df_clfs|:white_check_mark:|:x:|:white_check_mark:|:white_check_mark:|:x:|
-|**[RandomForestRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html)**|df_regr|:white_check_mark:|:x:|:white_check_mark:|:white_check_mark:|:x:|
-|**[pairwise_distances](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise_distances.html)**|distances|:white_check_mark:|:x:|:white_check_mark:|:x:|:x:|
-|**[KMeans](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html)**|kmeans|:white_check_mark:|:white_check_mark:|:white_check_mark:|:white_check_mark:|:x:|
-|**[KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)**|knn_clsf|:white_check_mark:|:x:|:x:|:white_check_mark:|:x:|
-|**[LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)**|linear|:white_check_mark:|:white_check_mark:|:white_check_mark:|:white_check_mark:|:x:|
-|**[LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)**|log_reg|:white_check_mark:|:white_check_mark:|:white_check_mark:|:white_check_mark:|:x:|
-|**[PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html)**|pca|:white_check_mark:|:x:|:white_check_mark:|:white_check_mark:|:x:|
-|**[Ridge](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html)**|ridge|:white_check_mark:|:x:|:white_check_mark:|:white_check_mark:|:x:|
-|**[SVM](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)**|svm|:white_check_mark:|:x:|:white_check_mark:|:white_check_mark:|:x:|
-|**[TSNE](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html)**|tsne|:white_check_mark:|:x:|:x:|:white_check_mark:|:x:|
-|**[train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)**|train_test_split|:white_check_mark:|:x:|:x:|:white_check_mark:|:x:|
-|**[GradientBoostingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html)**|gbt|:x:|:x:|:x:|:x:|:white_check_mark:|
-|**[GradientBoostingRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html)**|gbt|:x:|:x:|:x:|:x:|:white_check_mark:|
+### Report Generator

-### Scikit-learn benchmakrs
+To combine raw result files gathered from different environments, call the report generator:

-When you run scikit-learn benchmarks on CPU, [Intel(R) Extension for Scikit-learn](https://github.com/intel/scikit-learn-intelex) is used by default. Use the ``--no-intel-optimized`` option to run the benchmarks without the extension.
+```bash
+python -m sklbench.report --result-files result_1.json result_2.json --report-file report_example.xlsx
+```

-For the algorithms with both CPU and GPU support, you may use the same [configuration file](https://github.com/IntelPython/scikit-learn_bench/blob/main/configs/skl_xpu_config.json) to run the scikit-learn benchmarks on CPU and GPU.
+For a description of all report generator arguments, refer to [documentation](sklbench/report/README.md#arguments).

-## Algorithm parameters
+### Scikit-learn_bench High-Level Workflow

-You can launch benchmarks for each algorithm separately.
-To do this, go to the directory with the benchmark:
+```mermaid
+flowchart TB
+A[User] -- High-level arguments --> B[Benchmarks runner]
+B -- Generated benchmarking cases --> C["Benchmarks collection"]
+C -- Raw JSON-formatted results --> D[Report generator]
+D -- Human-readable report --> A

-```bash
-cd <framework>
+classDef userStyle fill:#44b,color:white,stroke-width:2px,stroke:white;
+class A userStyle
 ```

-Run the following command:
+## 📚 Benchmark Types

-```bash
-python <benchmark_file> --dataset-name <path to the dataset> <other algorithm parameters>
-```
+**Scikit-learn_bench** supports the following types of benchmarks:

-The list of supported parameters for each algorithm you can find here:
+- **Scikit-learn estimator** - Measures performance and quality metrics of the [sklearn-like estimator](https://scikit-learn.org/stable/glossary.html#term-estimator).
+- **Function** - Measures performance metrics of specified function.

-- [**scikit-learn**](sklearn_bench#algorithms-parameters)
-- [**daal4py**](daal4py_bench#algorithms-parameters)
-- [**cuml**](cuml_bench#algorithms-parameters)
-- [**xgboost**](xgboost_bench#algorithms-parameters)
+## 📑 Documentation
+[Scikit-learn_bench](README.md):
+- [Configs](configs/README.md)
+- [Benchmarks Runner](sklbench/runner/README.md)
+- [Report Generator](sklbench/report/README.md)
+- [Benchmarks](sklbench/benchmarks/README.md)
+- [Data Processing](sklbench/datasets/README.md)
+- [Emulators](sklbench/emulators/README.md)
+- [Developer Guide](docs/README.md)
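The refactored README describes the runner and the report generator as separate steps. A minimal end-to-end sketch of that workflow, using only commands shown in the diff above (the `result_1.json` / `result_2.json` file names are illustrative placeholders for results gathered in different environments):

```bash
# runner step: execute a benchmark configuration and write raw JSON results
python -m sklbench --config configs/sklearn_example.json --result-file result_1.json

# repeat in another environment or with another config, then combine the raw
# result files into a single Excel report (report generator step)
python -m sklbench.report --result-files result_1.json result_2.json --report-file report_example.xlsx
```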
