README.md: 34 additions & 1 deletion
@@ -12,5 +12,38 @@ on the MOC and NERC, specifically related to AI applications.
## MLPerf Storage / `dlio_benchmark`
+MLPerf Storage benchmarks storage performance for machine learning training workloads.
+It does so by generating a dataset and then simulating training over it.
+It does not use GPUs: the time a GPU would have spent training on each sample of the dataset is replaced with a sleep.
+The sleep time for each sample is derived from processing times measured on actual hardware (A100 and H100).
+Training runs for 5 epochs and does not perform checkpointing.
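
The simulation described above can be sketched roughly as follows. This is a minimal illustration, not the actual `dlio_benchmark` implementation: the function name `simulate_training`, the `compute_time_s` parameter, and the injectable `sleep` argument are all hypothetical and chosen only for this example.

```python
import time
from pathlib import Path
from typing import Callable, Iterable


def simulate_training(
    sample_paths: Iterable[Path],
    compute_time_s: float,
    epochs: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Read every sample once per epoch and sleep in place of GPU compute.

    Hypothetical illustration: `compute_time_s` stands in for the per-sample
    processing time that was measured on real hardware.
    """
    paths = list(sample_paths)
    load_time = 0.0
    accelerator_time = 0.0
    for _ in range(epochs):
        for path in paths:
            start = time.perf_counter()
            path.read_bytes()  # storage load: the part being benchmarked
            load_time += time.perf_counter() - start

            sleep(compute_time_s)  # stands in for the GPU training step
            accelerator_time += compute_time_s
    return {
        "samples_processed": epochs * len(paths),
        "storage_load_time_s": load_time,
        "accelerator_time_s": accelerator_time,
    }
```

Passing `sleep` as an argument is only a convenience for this sketch: it lets the loop be exercised quickly with a no-op in place of `time.sleep`.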

-## End to End
+The main metric for the MLPerf Storage experiment is `Accelerator Utilization` (AU).
+It is the fraction of the experiment's total duration that the GPU would spend processing: `AU = Accelerator Total Time / Total Duration = Accelerator Total Time / (Accelerator Total Time + Storage Load Time)`.
+
+By default, MLPerf Storage treats an accelerator utilization score below 90% as a fail.
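
The AU formula and the 90% threshold above can be worked through in a few lines. This is a sketch under stated assumptions: the function names `accelerator_utilization` and `passes` are hypothetical, and the timings in the usage note are made-up illustrative values, not measured results.

```python
def accelerator_utilization(accelerator_total_time: float, storage_load_time: float) -> float:
    """AU = Accelerator Total Time / (Accelerator Total Time + Storage Load Time)."""
    total_duration = accelerator_total_time + storage_load_time
    if total_duration <= 0:
        raise ValueError("total duration must be positive")
    return accelerator_total_time / total_duration


def passes(au: float, threshold: float = 0.90) -> bool:
    """A score below the threshold is a fail, so exactly 90% still passes."""
    return au >= threshold
```

For example, with hypothetical timings of 95 s of simulated accelerator time and 5 s of storage load time, `accelerator_utilization(95.0, 5.0)` is 95/100, which passes. With 80 s and 20 s it is 80/100, which fails.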
+
+In our setup, the dataset is loaded from a Persistent Volume Claim hosted on the NESE Ceph cluster.
+The training workload is unet3d with 1 simulated GPU, run against datasets of 1000 samples (~140GB) and 3500 samples (~500GB).
+Each sample ranges in size from around 80MB to 200MB.
+The Kubernetes job and PVC definitions can be found in the [k8s/](k8s) folder.
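
As a quick sanity check on the dataset figures above, the average sample size implied by each dataset can be computed from its approximate total size and sample count. This sketch assumes 1 GB = 1000 MB and uses the rounded totals quoted in the text. The helper name `average_sample_size_mb` is hypothetical.

```python
def average_sample_size_mb(total_gb: float, samples: int) -> float:
    """Average sample size in MB, assuming 1 GB = 1000 MB."""
    if samples <= 0:
        raise ValueError("samples must be positive")
    return total_gb * 1000 / samples


# Approximate totals quoted above: (total size in GB, number of samples).
datasets = [(140, 1000), (500, 3500)]
averages = [average_sample_size_mb(gb, n) for gb, n in datasets]
```

Both averages come out at roughly 140MB per sample, inside the stated 80MB to 200MB range.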
+
+The results can be found in the [results/](results) folder.
+
+| Simulated GPU | Samples | Storage Type | AU (%) | MB/s |