Commit 525bb66

Updated README with MLperf description and results
1 parent 94e800d commit 525bb66

1 file changed: +34 -1 lines changed
README.md

Lines changed: 34 additions & 1 deletion
@@ -12,5 +12,38 @@ on the MOC and NERC, specifically related to AI applications.

 ## MLPerf Storage / `dlio_benchmark`

+MLPerf Storage benchmarks storage performance for training workloads.
+It does so by generating a dataset and simulating a training run over it.
+It does not use GPUs: the time a GPU would have spent training on each sample is replaced with a sleep.
+Processing times measured on actual hardware (A100 and H100) are used to set the sleep duration for each sample.
+Training runs for 5 epochs and does not perform checkpointing.
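
The simulation described above can be sketched roughly as follows. This is a minimal illustration, not the actual `dlio_benchmark` implementation: the function name `simulate_training`, its parameters, and the one-sample-per-step loop without prefetching are all assumptions made for the example.

```python
import time


def simulate_training(sample_paths, compute_time_s, epochs=5, sleep=time.sleep):
    """Read every sample once per epoch and sleep in place of GPU compute.

    Hypothetical sketch: `compute_time_s` stands in for the per-sample
    processing time measured on real hardware. Returns the accumulated
    (storage_load_time, accelerator_time) in seconds.
    """
    storage_load_time = 0.0
    accelerator_time = 0.0
    for _ in range(epochs):
        for path in sample_paths:
            # Time spent waiting on storage for this sample.
            start = time.perf_counter()
            with open(path, "rb") as f:
                f.read()
            storage_load_time += time.perf_counter() - start
            # Stand-in for the GPU work: sleep for the measured duration.
            sleep(compute_time_s)
            accelerator_time += compute_time_s
    return storage_load_time, accelerator_time
```

Passing a different `sleep` callable (for example a no-op) makes the loop easy to exercise without waiting for real time to elapse.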

-## End to End
+The main metric for the MLPerf Storage experiment is `Accelerator Utilization` (AU).
+It is the fraction of the experiment's total duration that the simulated GPU would spend processing, as defined by the formula `AU = Accelerator Total Time / Total Duration = Accelerator Total Time / (Accelerator Total Time + Storage Load Time)`.
+
+By default, MLPerf Storage treats an accelerator utilization score below 90% as a fail.
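
As a worked illustration of the AU formula and the 90% threshold, here is a small sketch. The function names are hypothetical, and it assumes the simplified two-term decomposition of total duration given in the formula above.

```python
def accelerator_utilization(accelerator_total_time, storage_load_time):
    """AU = accelerator time / (accelerator time + storage load time)."""
    total_duration = accelerator_total_time + storage_load_time
    if total_duration <= 0:
        raise ValueError("total duration must be positive")
    return accelerator_total_time / total_duration


def passes(au, threshold=0.90):
    """An AU below the threshold counts as a fail."""
    return au >= threshold


def load_to_compute_ratio(au):
    """Storage load time per unit of accelerator time implied by an AU."""
    return (1.0 - au) / au


# 90 s of simulated compute and 10 s of storage waits sit exactly on the threshold.
au = accelerator_utilization(90.0, 10.0)
print(f"AU = {au:.2%}, pass = {passes(au)}")
```

Under this simplified formula, an AU of 10.81% corresponds to `load_to_compute_ratio(0.1081)`, or roughly 8 seconds of waiting on storage for every second of simulated compute.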
+
+In our setup, the dataset is loaded from a Persistent Volume Claim (PVC) hosted on the NESE Ceph cluster.
+The training workload is unet3d with 1 simulated GPU, run against datasets of 1000 samples (~140 GB) and 3500 samples (~500 GB).
+Each sample ranges in size from around 80 MB to 200 MB.
+The Kubernetes job and PVC definitions can be found in the [k8s/](k8s) folder.
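
The dataset figures above can be sanity-checked with a quick calculation. This sketch assumes decimal units (1 GB = 1000 MB) and treats the quoted totals as approximate; `mean_sample_size_mb` is a hypothetical helper name.

```python
def mean_sample_size_mb(total_gb, num_samples):
    """Average sample size in MB, assuming 1 GB = 1000 MB."""
    return total_gb * 1000 / num_samples


for total_gb, num_samples in [(140, 1000), (500, 3500)]:
    mean_mb = mean_sample_size_mb(total_gb, num_samples)
    # Both averages should fall inside the quoted 80 MB to 200 MB range.
    print(f"{num_samples} samples, ~{total_gb} GB: ~{mean_mb:.0f} MB per sample")
```

Both dataset sizes work out to an average of roughly 140 MB per sample, which is consistent with the stated per-sample range.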
+
+The results can be found in the [results/](results) folder.
+
+| Simulated GPU | Samples | Storage Type       | AU (%) | MB/s   |
+|---------------|---------|--------------------|--------|--------|
+| A100          | 3500    | NESE Ceph PVC      | 10.81  | 165.35 |
+| H100          | 3500    | NESE Ceph PVC      | 5.58   | 168.05 |
+| A100          | 1000    | Local EmptyDir PVC | 24.51  | 729.10 |
+|               |         | Weka PVC           |        |        |
+|               |         | Weka PVC           |        |        |
+
+The results below were not run on the NERC and are provided purely for reference.
+
+| Simulated GPU | Samples | Storage Type         | AU (%) | MB/s    |
+|---------------|---------|----------------------|--------|---------|
+| A100          | 1200    | MacBook Pro 14" NVMe | 99.16  | 1495.76 |
+
+Results contributed by other organizations can be found on the [MLPerf Storage website](https://mlcommons.org/benchmarks/storage/).
+
+## Actual Inference Workload
