[TEST PR] ignore #2645

EuphoricThinking · 2025-01-30T14:55:43Z

No description provided.

github-actions · 2025-01-30T15:24:33Z

Compute Benchmarks level_zero run (with params: --iterations-stddev 2 --iterations 2):
https://github.com/oneapi-src/unified-runtime/actions/runs/13055486959

github-actions · 2025-01-30T16:02:09Z

Compute Benchmarks level_zero run (--iterations-stddev 2 --iterations 2):
https://github.com/oneapi-src/unified-runtime/actions/runs/13055486959
Job status: success. Test status: success.

Summary

Total 148 benchmarks in mean.
Geomean 89.971%.
Improved 24 Regressed 38 (threshold 2.00%)

(result is better)
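
The Relative perf and Change columns in the tables of this report appear to follow a simple ratio. The sketch below is a minimal illustration, not the benchmark script's actual code. It assumes that time-like and count-like units (μs, ms, s, ns, instr) are lower-is-better and that throughput units (GB/s, token/s, M keys/sec) are higher-is-better. The names `relative_perf` and `classify`, and the strict comparison against the 2.00% threshold, are hypothetical.

```python
def relative_perf(this_pr: float, baseline: float, lower_is_better: bool) -> float:
    """Return relative performance in percent; above 100 means this PR is better."""
    if lower_is_better:
        return baseline / this_pr * 100.0
    return this_pr / baseline * 100.0


def classify(rel_perf: float, threshold: float = 2.0) -> str:
    """Label a result by its change (in percent) relative to the threshold.

    Whether the real script treats a change of exactly 2.00% as
    improved/regressed is not known; strict comparison is an assumption.
    """
    change = rel_perf - 100.0
    if change > threshold:
        return "improved"
    if change < -threshold:
        return "regressed"
    return "unchanged"


# Values copied from the api and memory tables of the first run.
print(round(relative_perf(16.547, 16.785, lower_is_better=True), 2))  # SubmitKernel in order (ur)
print(round(relative_perf(3.242, 3.070, lower_is_better=False), 2))   # StreamMemory Triad
print(classify(relative_perf(2.652, 1.673, lower_is_better=True)))    # ExecImmediateCopyQueue D2H
```

Under these assumptions the first two calls reproduce the 101.44% and 105.60% shown in the tables, which suggests the direction of the ratio flips with the unit type.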

Performance change in benchmark groups

Relative perf in group api (12): 96.139%

Benchmark	This PR	baseline	Relative perf	Change	-
api_overhead_benchmark_ur SubmitKernel in order	16.547000 μs	16.785 μs	101.44%	1.44%	.
api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024	2.123000 μs	2.149 μs	101.22%	1.22%	.
api_overhead_benchmark_ur SubmitKernel in order with measure completion	21.259000 μs	21.495 μs	101.11%	1.11%	.
api_overhead_benchmark_ur SubmitKernel out of order	15.749000 μs	15.866 μs	100.74%	0.74%	.
api_overhead_benchmark_ur SubmitKernel out of order CPU count	104663.000000 instr	104663.000 instr	100.00%	0.00%	.
api_overhead_benchmark_ur SubmitKernel in order CPU count	110006.000000 instr	110006.000 instr	100.00%	0.00%	.
api_overhead_benchmark_ur SubmitKernel in order with measure completion CPU count	123190.000 instr	123166.000000 instr	99.98%	-0.02%	.
api_overhead_benchmark_sycl SubmitKernel out of order	23.608 μs	23.506000 μs	99.57%	-0.43%	.
api_overhead_benchmark_sycl SubmitKernel in order	24.521 μs	24.407000 μs	99.54%	-0.46%	.
api_overhead_benchmark_l0 SubmitKernel in order	11.632 μs	11.395000 μs	97.96%	-2.04%	.
api_overhead_benchmark_l0 SubmitKernel out of order	11.680 μs	11.369000 μs	97.34%	-2.66%	.
api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024	2.652 μs	1.673000 μs	63.08%	-36.92%	.
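
For post-processing, each result row can be split on tabs. The following is a rough sketch that assumes the column order shown in the header (Benchmark, This PR, baseline, Relative perf, Change, marker) and that rows without a baseline carry only a `-` in the third cell. The `Row` type and the `parse_row` helper are hypothetical and not part of the benchmark tooling.

```python
from typing import NamedTuple, Optional, Tuple


class Row(NamedTuple):
    name: str
    this_pr: float
    unit: str
    baseline: Optional[float]   # None when the report shows "-"
    rel_perf: Optional[float]   # percent, None when there is no baseline


def parse_value(cell: str) -> Tuple[float, str]:
    """Split a cell such as '16.547000 μs' into (16.547, 'μs')."""
    number, _, unit = cell.strip().partition(" ")
    return float(number), unit


def parse_row(line: str) -> Row:
    """Parse one tab-separated result row of the report."""
    cells = line.rstrip("\n").split("\t")
    name = cells[0]
    this_pr, unit = parse_value(cells[1])
    # Rows for benchmarks missing from the baseline have only three cells.
    if len(cells) < 4 or cells[2].strip() == "-":
        return Row(name, this_pr, unit, None, None)
    baseline, _ = parse_value(cells[2])
    rel_perf = float(cells[3].rstrip("%"))
    return Row(name, this_pr, unit, baseline, rel_perf)


row = parse_row("api_overhead_benchmark_l0 SubmitKernel in order\t11.632 μs\t11.395000 μs\t97.96%\t-2.04%\t.")
print(row.name, row.this_pr, row.baseline, row.rel_perf)
```

Splitting the value cell on the first space keeps multi-word units such as `bw GB/s` or `M keys/sec` intact.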

Relative perf in group memory (4): 114.033%

Benchmark	This PR	baseline	Relative perf	Change	-
memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024	134.861000 μs	219.832 μs	163.01%	63.01%	.
memory_benchmark_sycl StreamMemory, placement Device, type Triad, size 10240	3.242000 GB/s	3.070 GB/s	105.60%	5.60%	.
memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024	254.750 μs	252.914000 μs	99.28%	-0.72%	.
memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024	5.963 μs	5.900000 μs	98.94%	-1.06%	.
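
The per-group figure looks like a geometric mean of the rows' relative perf values. As a quick check, assuming that interpretation, the four memory rows above can be combined as follows (the `geomean` helper is illustrative):

```python
import math


def geomean(values):
    """Geometric mean, computed via logarithms for numerical stability."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Relative perf (percent) of the four rows in the memory group.
memory_group = [163.01, 105.60, 99.28, 98.94]
print(f"{geomean(memory_group):.3f}%")
```

The result lands within rounding of the reported 114.033%. The small difference is expected because the table values are already rounded to two decimals. A geometric mean is the usual choice for averaging ratios, since a 2x speedup and a 2x slowdown cancel out, but it also lets a single extreme row pull a small group a long way.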

Relative perf in group miscellaneous (1): 99.795%

Benchmark	This PR	baseline	Relative perf	Change	-
miscellaneous_benchmark_sycl VectorSum	859.782 bw GB/s	858.023000 bw GB/s	99.80%	-0.20%	.

Relative perf in group multithread (10): 100.557%

Benchmark	This PR	baseline	Relative perf	Change	-
multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:1, allocSize:1024 srcUSM:0 dstUSM:1 without events	40929.325000 μs	42602.254 μs	104.09%	4.09%	.
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:0 dstUSM:1	7475.640000 μs	7766.797 μs	103.89%	3.89%	.
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:0 dstUSM:1	26036.929000 μs	27030.035 μs	103.81%	3.81%	.
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:0 dstUSM:1	8794.927000 μs	8883.578 μs	101.01%	1.01%	.
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:1 dstUSM:1	6907.362 μs	6896.127000 μs	99.84%	-0.16%	.
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:0 dstUSM:1	1203.921 μs	1199.669000 μs	99.65%	-0.35%	.
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:1 dstUSM:1	47195.783 μs	46811.855000 μs	99.19%	-0.81%	.
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:1 dstUSM:1	17377.891 μs	17165.065000 μs	98.78%	-1.22%	.
multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:4, allocSize:1024 srcUSM:0 dstUSM:1 without events	114673.544 μs	112408.658000 μs	98.02%	-1.98%	.
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:1 dstUSM:1	2098.660 μs	2047.766000 μs	97.57%	-2.43%	.

Relative perf in group graph (10): 108.656%

Benchmark	This PR	baseline	Relative perf	Change	-
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:10	4039.886000 μs	5631.730 μs	139.40%	39.40%	.
graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:0, numKernels:10	4052.131000 μs	5621.320 μs	138.73%	38.73%	.
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:100	46811.230000 μs	56454.921 μs	120.60%	20.60%	.
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:10	62.355000 μs	62.493 μs	100.22%	0.22%	.
graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:10	71750.799 μs	71746.038000 μs	99.99%	-0.01%	.
graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:100	353400.657 μs	353349.563000 μs	99.99%	-0.01%	.
graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:100	353387.985 μs	353086.695000 μs	99.91%	-0.09%	.
graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:10	72724.829 μs	72583.103000 μs	99.81%	-0.19%	.
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:100	682.492 μs	677.203000 μs	99.23%	-0.77%	.
graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:1, numKernels:10	55.700 μs	55.253000 μs	99.20%	-0.80%	.

Relative perf in group Velocity-Bench (9): 99.981%

Benchmark	This PR	baseline	Relative perf	Change	-
Velocity-Bench Bitcracker	35.461400 s	38.359 s	108.17%	8.17%	.
Velocity-Bench Hashtable	380.134197 M keys/sec	363.340 M keys/sec	104.62%	4.62%	.
Velocity-Bench CudaSift	201.277000 ms	203.947 ms	101.33%	1.33%	.
Velocity-Bench Sobel Filter	596.660000 ms	603.076 ms	101.08%	1.08%	.
Velocity-Bench QuickSilver	117.170000 MMS/CTT	116.460 MMS/CTT	100.61%	0.61%	.
Velocity-Bench dl-cifar	23.601100 s	23.630 s	100.12%	0.12%	.
Velocity-Bench dl-mnist	2.730 s	2.710000 s	99.27%	-0.73%	.
Velocity-Bench Easywave	229.000 ms	227.000000 ms	99.13%	-0.87%	.
Velocity-Bench svm	0.156 s	0.135900 s	86.89%	-13.11%	.

Relative perf in group Runtime (8): 98.026%

Benchmark	This PR	baseline	Relative perf	Change	-
Runtime_IndependentDAGTaskThroughput_NDRangeParallelFor	274.365000 ms	276.461 ms	100.76%	0.76%	.
Runtime_IndependentDAGTaskThroughput_HierarchicalParallelFor	276.091 ms	275.173000 ms	99.67%	-0.33%	.
Runtime_IndependentDAGTaskThroughput_SingleTask	263.435 ms	259.444000 ms	98.49%	-1.51%	.
Runtime_DAGTaskThroughput_HierarchicalParallelFor	1743.424 ms	1710.439000 ms	98.11%	-1.89%	.
Runtime_DAGTaskThroughput_SingleTask	1685.683 ms	1648.643000 ms	97.80%	-2.20%	.
Runtime_DAGTaskThroughput_NDRangeParallelFor	1712.514 ms	1673.462000 ms	97.72%	-2.28%	.
Runtime_DAGTaskThroughput_BasicParallelFor	1766.802 ms	1704.436000 ms	96.47%	-3.53%	.
Runtime_IndependentDAGTaskThroughput_BasicParallelFor	287.813 ms	274.274000 ms	95.30%	-4.70%	.

Relative perf in group MicroBench (14): 95.682%

Benchmark	This PR	baseline	Relative perf	Change	-
MicroBench_HostDeviceBandwidth_2D_H2D_Strided	4.738000 ms	4.940 ms	104.26%	4.26%	.
MicroBench_HostDeviceBandwidth_3D_H2D_Strided	4.789000 ms	4.909 ms	102.51%	2.51%	.
MicroBench_LocalMem_int32_4096	29.855000 ms	29.862 ms	100.02%	0.02%	.
MicroBench_HostDeviceBandwidth_2D_H2D_Contiguous	4.587 ms	4.585000 ms	99.96%	-0.04%	.
MicroBench_HostDeviceBandwidth_3D_D2H_Contiguous	617.867 ms	617.442000 ms	99.93%	-0.07%	.
MicroBench_HostDeviceBandwidth_2D_D2H_Contiguous	617.882 ms	617.437000 ms	99.93%	-0.07%	.
MicroBench_HostDeviceBandwidth_3D_D2H_Strided	617.277 ms	616.784000 ms	99.92%	-0.08%	.
MicroBench_HostDeviceBandwidth_2D_D2H_Strided	617.554 ms	616.834000 ms	99.88%	-0.12%	.
MicroBench_HostDeviceBandwidth_1D_D2H_Contiguous	4.464 ms	4.456000 ms	99.82%	-0.18%	.
MicroBench_LocalMem_fp32_4096	30.015 ms	29.902000 ms	99.62%	-0.38%	.
MicroBench_HostDeviceBandwidth_3D_H2D_Contiguous	4.400 ms	4.376000 ms	99.45%	-0.55%	.
MicroBench_HostDeviceBandwidth_1D_H2D_Strided	4.547 ms	4.276000 ms	94.04%	-5.96%	.
MicroBench_HostDeviceBandwidth_1D_D2H_Strided	5.133 ms	4.716000 ms	91.88%	-8.12%	.
MicroBench_HostDeviceBandwidth_1D_H2D_Contiguous	7.641 ms	4.526000 ms	59.23%	-40.77%	.

Relative perf in group Pattern (10): 101.767%

Benchmark	This PR	baseline	Relative perf	Change	-
Pattern_Reduction_Hierarchical_int32	13.403000 ms	16.339 ms	121.91%	21.91%	.
Pattern_Reduction_NDRange_int32	16.264000 ms	16.339 ms	100.46%	0.46%	.
Pattern_SegmentedReduction_Hierarchical_fp32	11.593 ms	11.587000 ms	99.95%	-0.05%	.
Pattern_SegmentedReduction_Hierarchical_int64	11.795 ms	11.782000 ms	99.89%	-0.11%	.
Pattern_SegmentedReduction_Hierarchical_int16	11.819 ms	11.796000 ms	99.81%	-0.19%	.
Pattern_SegmentedReduction_Hierarchical_int32	11.613 ms	11.588000 ms	99.78%	-0.22%	.
Pattern_SegmentedReduction_NDRange_int64	2.344 ms	2.337000 ms	99.70%	-0.30%	.
Pattern_SegmentedReduction_NDRange_int32	2.172 ms	2.165000 ms	99.68%	-0.32%	.
Pattern_SegmentedReduction_NDRange_fp32	2.176 ms	2.168000 ms	99.63%	-0.37%	.
Pattern_SegmentedReduction_NDRange_int16	2.292 ms	2.265000 ms	98.82%	-1.18%	.

Relative perf in group ScalarProduct (6): 99.760%

Benchmark	This PR	baseline	Relative perf	Change	-
ScalarProduct_Hierarchical_int32	10.532000 ms	10.541 ms	100.09%	0.09%	.
ScalarProduct_Hierarchical_fp32	10.166000 ms	10.167 ms	100.01%	0.01%	.
ScalarProduct_NDRange_int32	3.770 ms	3.765000 ms	99.87%	-0.13%	.
ScalarProduct_NDRange_fp32	3.758 ms	3.749000 ms	99.76%	-0.24%	.
ScalarProduct_Hierarchical_int64	11.535 ms	11.490000 ms	99.61%	-0.39%	.
ScalarProduct_NDRange_int64	5.467 ms	5.425000 ms	99.23%	-0.77%	.

Relative perf in group USM (7): 90.095%

Benchmark	This PR	baseline	Relative perf	Change	-
USM_Instr_Mix_fp32_host_1:1mix_no_init_no_prefetch	1.201000 ms	1.258 ms	104.75%	4.75%	.
USM_Instr_Mix_fp32_host_1:1mix_with_init_no_prefetch	1.043000 ms	1.087 ms	104.22%	4.22%	.
USM_Instr_Mix_fp32_device_1:1mix_no_init_no_prefetch	1.865000 ms	1.893 ms	101.50%	1.50%	.
USM_Allocation_latency_fp32_host	37.971 ms	37.623000 ms	99.08%	-0.92%	.
USM_Allocation_latency_fp32_device	0.067 ms	0.065000 ms	97.01%	-2.99%	.
USM_Allocation_latency_fp32_shared	0.069 ms	0.062000 ms	89.86%	-10.14%	.
USM_Instr_Mix_fp32_device_1:1mix_with_init_no_prefetch	3.450 ms	1.737000 ms	50.35%	-49.65%	.

Relative perf in group VectorAddition (3): 100.179%

Benchmark	This PR	baseline	Relative perf	Change	-
VectorAddition_int32	1.463000 ms	1.477 ms	100.96%	0.96%	.
VectorAddition_fp32	1.470000 ms	1.480 ms	100.68%	0.68%	.
VectorAddition_int64	3.122 ms	3.088000 ms	98.91%	-1.09%	.

Relative perf in group Polybench (3): 99.051%

Benchmark	This PR	baseline	Relative perf	Change	-
Polybench_3mm	1.486 ms	1.477000 ms	99.39%	-0.61%	.
Polybench_Atax	6.467 ms	6.402000 ms	98.99%	-1.01%	.
Polybench_2mm	1.052 ms	1.039000 ms	98.76%	-1.24%	.

Relative perf in group Kmeans (1): 99.654%

Benchmark	This PR	baseline	Relative perf	Change	-
Kmeans_fp32	14.160 ms	14.111000 ms	99.65%	-0.35%	.

Relative perf in group LinearRegressionCoeff (1): 102.215%

Benchmark	This PR	baseline	Relative perf	Change	-
LinearRegressionCoeff_fp32	862.805000 ms	881.915 ms	102.21%	2.21%	.

Relative perf in group MolecularDynamics (1): 53.571%

Benchmark	This PR	baseline	Relative perf	Change	-
MolecularDynamics	0.056 ms	0.030000 ms	53.57%	-46.43%	.

Relative perf in group llama.cpp (6): 100.104%

Benchmark	This PR	baseline	Relative perf	Change	-
llama.cpp Text Generation Batched 512	63.033973 token/s	62.789 token/s	100.39%	0.39%	.
llama.cpp Text Generation Batched 256	63.013882 token/s	62.777 token/s	100.38%	0.38%	.
llama.cpp Text Generation Batched 128	62.969580 token/s	62.791 token/s	100.28%	0.28%	.
llama.cpp Prompt Processing Batched 128	832.145879 token/s	830.097 token/s	100.25%	0.25%	.
llama.cpp Prompt Processing Batched 256	878.241 token/s	878.291089 token/s	99.99%	-0.01%	.
llama.cpp Prompt Processing Batched 512	432.819 token/s	435.723514 token/s	99.33%	-0.67%	.

Relative perf in group alloc/size:10000/0/4096/iterations:200000/threads:4 (7): 176.175%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/size:10000/0/4096/iterations:200000/threads:4 umfProxy	136.944000 ns	2688.530 ns	1963.23%	1863.23%	++++++++++
alloc/size:10000/0/4096/iterations:200000/threads:4 os_provider	2078.040000 ns	2113.560 ns	101.71%	1.71%	.
alloc/size:10000/0/4096/iterations:200000/threads:4 proxy_pool<os_provider>	3205.220 ns	3097.620000 ns	96.64%	-3.36%	.
alloc/size:10000/0/4096/iterations:200000/threads:4 scalable_pool<os_provider>	306.302 ns	287.722000 ns	93.93%	-6.07%	.
alloc/size:10000/0/4096/iterations:200000/threads:4 glibc	2631.780 ns	2464.050000 ns	93.63%	-6.37%	.
alloc/size:10000/0/4096/iterations:200000/threads:4 disjoint_pool<os_provider>	4701.140000 ns	-
alloc/size:10000/0/4096/iterations:200000/threads:4 jemalloc_pool<os_provider>	3663.010000 ns	-

Relative perf in group alloc/size:10000/0/4096/iterations:200000/threads:1 (7): 143.213%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/size:10000/0/4096/iterations:200000/threads:1 umfProxy	104.170000 ns	705.635 ns	677.39%	577.39%	+++
alloc/size:10000/0/4096/iterations:200000/threads:1 scalable_pool<os_provider>	211.610 ns	208.759000 ns	98.65%	-1.35%	.
alloc/size:10000/0/4096/iterations:200000/threads:1 os_provider	196.961 ns	191.313000 ns	97.13%	-2.87%	.
alloc/size:10000/0/4096/iterations:200000/threads:1 proxy_pool<os_provider>	282.053 ns	272.237000 ns	96.52%	-3.48%	.
alloc/size:10000/0/4096/iterations:200000/threads:1 glibc	726.325 ns	698.410000 ns	96.16%	-3.84%	.
alloc/size:10000/0/4096/iterations:200000/threads:1 disjoint_pool<os_provider>	507.879000 ns	-
alloc/size:10000/0/4096/iterations:200000/threads:1 jemalloc_pool<os_provider>	118.136000 ns	-

Relative perf in group alloc/size:10000/100000/4096/iterations:200000/threads:4 (7): 155.774%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/size:10000/100000/4096/iterations:200000/threads:4 umfProxy	120.540000 ns	1226.080 ns	1017.16%	917.16%	+++++
alloc/size:10000/100000/4096/iterations:200000/threads:4 os_provider	1852.650000 ns	2038.360 ns	110.02%	10.02%	.
alloc/size:10000/100000/4096/iterations:200000/threads:4 proxy_pool<os_provider>	3476.710 ns	3338.690000 ns	96.03%	-3.97%	.
alloc/size:10000/100000/4096/iterations:200000/threads:4 scalable_pool<os_provider>	277.559 ns	261.553000 ns	94.23%	-5.77%	.
alloc/size:10000/100000/4096/iterations:200000/threads:4 glibc	1407.250 ns	1274.570000 ns	90.57%	-9.43%	.
alloc/size:10000/100000/4096/iterations:200000/threads:4 disjoint_pool<os_provider>	4636.620000 ns	-
alloc/size:10000/100000/4096/iterations:200000/threads:4 jemalloc_pool<os_provider>	3705.340000 ns	-

Relative perf in group alloc/size:10000/100000/4096/iterations:200000/threads:1 (7): 120.900%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/size:10000/100000/4096/iterations:200000/threads:1 umfProxy	241.115000 ns	707.467 ns	293.41%	193.41%	+
alloc/size:10000/100000/4096/iterations:200000/threads:1 glibc	720.411 ns	706.907000 ns	98.13%	-1.87%	.
alloc/size:10000/100000/4096/iterations:200000/threads:1 os_provider	193.486 ns	189.545000 ns	97.96%	-2.04%	.
alloc/size:10000/100000/4096/iterations:200000/threads:1 scalable_pool<os_provider>	204.454 ns	196.551000 ns	96.13%	-3.87%	.
alloc/size:10000/100000/4096/iterations:200000/threads:1 proxy_pool<os_provider>	326.361 ns	310.903000 ns	95.26%	-4.74%	.
alloc/size:10000/100000/4096/iterations:200000/threads:1 disjoint_pool<os_provider>	529.648000 ns	-
alloc/size:10000/100000/4096/iterations:200000/threads:1 jemalloc_pool<os_provider>	118.489000 ns	-

Relative perf in group alloc/min (8): 81.507%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 umfProxy	546.234000 ns	832.725 ns	152.45%	52.45%	.
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 scalable_pool<os_provider>	962.529 ns	958.800000 ns	99.61%	-0.39%	.
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 glibc	177.724 ns	174.753000 ns	98.33%	-1.67%	.
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 glibc	812.237 ns	797.092000 ns	98.14%	-1.86%	.
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 scalable_pool<os_provider>	1032.840 ns	965.779000 ns	93.51%	-6.49%	.
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 umfProxy	827.745 ns	177.130000 ns	21.40%	-78.60%	.
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 jemalloc_pool<os_provider>	4285.610000 ns	-
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 jemalloc_pool<os_provider>	350.082000 ns	-

Relative perf in group multiple (22): 26.627%

Benchmark	This PR	baseline	Relative perf	Change	-
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 scalable_pool<os_provider>	14944.400000 ns	16418.600 ns	109.86%	9.86%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 glibc	32440.600000 ns	33153.600 ns	102.20%	2.20%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 glibc	4229.770000 ns	4283.690 ns	101.27%	1.27%	.
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 scalable_pool<os_provider>	75433.700000 ns	75451.700 ns	100.02%	0.02%	.
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 glibc	139261.000 ns	138360.000000 ns	99.35%	-0.65%	.
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 scalable_pool<os_provider>	25830.100 ns	25525.500000 ns	98.82%	-1.18%	.
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 glibc	31342.800 ns	30910.300000 ns	98.62%	-1.38%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 os_provider	1196070.000 ns	1174970.000000 ns	98.24%	-1.76%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 os_provider	150500.000 ns	146423.000000 ns	97.29%	-2.71%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 proxy_pool<os_provider>	1202440.000 ns	1162100.000000 ns	96.65%	-3.35%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 proxy_pool<os_provider>	169941.000 ns	162279.000000 ns	95.49%	-4.51%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 scalable_pool<os_provider>	48117.700 ns	41438.000000 ns	86.12%	-13.88%	.
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 umfProxy	10832400.000 ns	140162.000000 ns	1.29%	-98.71%	-
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 umfProxy	2732150.000 ns	30121.800000 ns	1.10%	-98.90%	-
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 umfProxy	7935540.000 ns	27477.700000 ns	0.35%	-99.65%	-
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 umfProxy	2723490.000 ns	4208.520000 ns	0.15%	-99.85%	-
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 disjoint_pool<os_provider>	1763760.000000 ns	-
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 disjoint_pool<os_provider>	218003.000000 ns	-
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 jemalloc_pool<os_provider>	524139.000000 ns	-
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 jemalloc_pool<os_provider>	24479.700000 ns	-
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 jemalloc_pool<os_provider>	632496.000000 ns	-
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 jemalloc_pool<os_provider>	60170.100000 ns	-

QS_DEVICE=GPU

Command:

/home/pmdk/bench_workdir/QuickSilver/qs -i /home/pmdk/bench_workdir/velocity-bench-repo/QuickSilver/Examples/AllScattering/scatteringOnly.inp

Velocity-Bench Sobel Filter

Environment Variables:

OPENCV_IO_MAX_IMAGE_PIXELS=1677721600

Command:

/home/pmdk/bench_workdir/sobel_filter/sobel_filter -i /home/pmdk/bench_workdir/data/sobel_filter/sobel_filter_data/silverfalls_32Kx32K.png -n 5

Velocity-Bench dl-cifar

Environment Variables:

Command:

/home/pmdk/bench_workdir/dl-cifar/dl-cifar_sycl

Velocity-Bench dl-mnist

Environment Variables:

LD_PRELOAD=/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/lib/libumf_proxy.so

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv --benchmark_filter=glibc

github-actions · 2025-01-30T16:02:29Z

Compute Benchmarks level_zero run (with params: --iterations-stddev 2 --iterations 2):
https://github.com/oneapi-src/unified-runtime/actions/runs/13055887853

github-actions · 2025-01-30T16:40:56Z

Compute Benchmarks level_zero run (--iterations-stddev 2 --iterations 2):
https://github.com/oneapi-src/unified-runtime/actions/runs/13055887853
Job status: success. Test status: success.

Summary

Total 148 benchmarks in mean.
Geomean 90.065%.
Improved 26 Regressed 37 (threshold 2.00%)

(result is better)

Performance change in benchmark groups

Relative perf in group api (12): 99.961%

Benchmark	This PR	baseline	Relative perf	Change	-
api_overhead_benchmark_ur SubmitKernel in order with measure completion	21.073000 μs	21.495 μs	102.00%	2.00%	.
api_overhead_benchmark_sycl SubmitKernel out of order	23.192000 μs	23.506 μs	101.35%	1.35%	.
api_overhead_benchmark_ur SubmitKernel in order	16.646000 μs	16.785 μs	100.84%	0.84%	.
api_overhead_benchmark_ur SubmitKernel out of order	15.766000 μs	15.866 μs	100.63%	0.63%	.
api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024	1.666000 μs	1.673 μs	100.42%	0.42%	.
api_overhead_benchmark_ur SubmitKernel in order with measure completion CPU count	122876.000000 instr	123166.000 instr	100.24%	0.24%	.
api_overhead_benchmark_ur SubmitKernel out of order CPU count	104663.000000 instr	104663.000 instr	100.00%	0.00%	.
api_overhead_benchmark_ur SubmitKernel in order CPU count	110006.000000 instr	110006.000 instr	100.00%	0.00%	.
api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024	2.158 μs	2.149000 μs	99.58%	-0.42%	.
api_overhead_benchmark_sycl SubmitKernel in order	24.558 μs	24.407000 μs	99.39%	-0.61%	.
api_overhead_benchmark_l0 SubmitKernel in order	11.478 μs	11.395000 μs	99.28%	-0.72%	.
api_overhead_benchmark_l0 SubmitKernel out of order	11.851 μs	11.369000 μs	95.93%	-4.07%	.

Relative perf in group memory (4): 114.451%

Benchmark	This PR	baseline	Relative perf	Change	-
memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024	135.850000 μs	219.832 μs	161.82%	61.82%	.
memory_benchmark_sycl StreamMemory, placement Device, type Triad, size 10240	3.184000 GB/s	3.070 GB/s	103.71%	3.71%	.
memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024	5.776000 μs	5.900 μs	102.15%	2.15%	.
memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024	252.684000 μs	252.914 μs	100.09%	0.09%	.

Relative perf in group miscellaneous (1): 99.727%

Benchmark	This PR	baseline	Relative perf	Change	-
miscellaneous_benchmark_sycl VectorSum	860.370 bw GB/s	858.023000 bw GB/s	99.73%	-0.27%	.

Relative perf in group multithread (10): 100.556%

Benchmark	This PR	baseline	Relative perf	Change	-
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:0 dstUSM:1	25956.135000 μs	27030.035 μs	104.14%	4.14%	.
multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:1, allocSize:1024 srcUSM:0 dstUSM:1 without events	40999.622000 μs	42602.254 μs	103.91%	3.91%	.
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:0 dstUSM:1	7489.297000 μs	7766.797 μs	103.71%	3.71%	.
multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:4, allocSize:1024 srcUSM:0 dstUSM:1 without events	111874.103000 μs	112408.658 μs	100.48%	0.48%	.
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:0 dstUSM:1	8905.592 μs	8883.578000 μs	99.75%	-0.25%	.
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:1 dstUSM:1	6927.748 μs	6896.127000 μs	99.54%	-0.46%	.
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:0 dstUSM:1	1211.929 μs	1199.669000 μs	98.99%	-1.01%	.
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:1 dstUSM:1	47402.251 μs	46811.855000 μs	98.75%	-1.25%	.
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:1 dstUSM:1	2081.600 μs	2047.766000 μs	98.37%	-1.63%	.
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:1 dstUSM:1	17485.423 μs	17165.065000 μs	98.17%	-1.83%	.

Relative perf in group graph (10): 107.589%

Benchmark	This PR	baseline	Relative perf	Change	-
graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:0, numKernels:10	4070.135000 μs	5621.320 μs	138.11%	38.11%	.
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:10	4080.166000 μs	5631.730 μs	138.03%	38.03%	.
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:100	49630.523000 μs	56454.921 μs	113.75%	13.75%	.
graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:10	71762.825 μs	71746.038000 μs	99.98%	-0.02%	.
graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:100	353323.951 μs	353086.695000 μs	99.93%	-0.07%	.
graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:100	353679.319 μs	353349.563000 μs	99.91%	-0.09%	.
graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:10	72663.276 μs	72583.103000 μs	99.89%	-0.11%	.
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:100	681.692 μs	677.203000 μs	99.34%	-0.66%	.
graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:1, numKernels:10	55.834 μs	55.253000 μs	98.96%	-1.04%	.
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:10	63.914 μs	62.493000 μs	97.78%	-2.22%	.

Relative perf in group Velocity-Bench (9): 99.016%

Benchmark	This PR	baseline	Relative perf	Change	-
Velocity-Bench Bitcracker	35.470800 s	38.359 s	108.14%	8.14%	.
Velocity-Bench QuickSilver	118.570000 MMS/CTT	116.460 MMS/CTT	101.81%	1.81%	.
Velocity-Bench CudaSift	201.729000 ms	203.947 ms	101.10%	1.10%	.
Velocity-Bench Easywave	227.000000 ms	227.000 ms	100.00%	0.00%	.
Velocity-Bench dl-cifar	23.680 s	23.630300 s	99.79%	-0.21%	.
Velocity-Bench Sobel Filter	606.368 ms	603.076000 ms	99.46%	-0.54%	.
Velocity-Bench Hashtable	361.069 M keys/sec	363.339623 M keys/sec	99.37%	-0.63%	.
Velocity-Bench dl-mnist	2.740 s	2.710000 s	98.91%	-1.09%	.
Velocity-Bench svm	0.161 s	0.135900 s	84.25%	-15.75%	.

Relative perf in group Runtime (8): 95.078%

Benchmark	This PR	baseline	Relative perf	Change	-
Runtime_DAGTaskThroughput_HierarchicalParallelFor	1750.403 ms	1710.439000 ms	97.72%	-2.28%	.
Runtime_DAGTaskThroughput_NDRangeParallelFor	1725.472 ms	1673.462000 ms	96.99%	-3.01%	.
Runtime_DAGTaskThroughput_SingleTask	1705.268 ms	1648.643000 ms	96.68%	-3.32%	.
Runtime_IndependentDAGTaskThroughput_BasicParallelFor	284.474 ms	274.274000 ms	96.41%	-3.59%	.
Runtime_IndependentDAGTaskThroughput_SingleTask	271.448 ms	259.444000 ms	95.58%	-4.42%	.
Runtime_IndependentDAGTaskThroughput_NDRangeParallelFor	290.329 ms	276.461000 ms	95.22%	-4.78%	.
Runtime_DAGTaskThroughput_BasicParallelFor	1791.302 ms	1704.436000 ms	95.15%	-4.85%	.
Runtime_IndependentDAGTaskThroughput_HierarchicalParallelFor	315.245 ms	275.173000 ms	87.29%	-12.71%	.

Relative perf in group MicroBench (14): 92.622%

Benchmark	This PR	baseline	Relative perf	Change	-
MicroBench_LocalMem_fp32_4096	29.825000 ms	29.902 ms	100.26%	0.26%	.
MicroBench_HostDeviceBandwidth_3D_D2H_Contiguous	618.172 ms	617.442000 ms	99.88%	-0.12%	.
MicroBench_HostDeviceBandwidth_2D_D2H_Strided	617.586 ms	616.834000 ms	99.88%	-0.12%	.
MicroBench_HostDeviceBandwidth_2D_D2H_Contiguous	618.214 ms	617.437000 ms	99.87%	-0.13%	.
MicroBench_HostDeviceBandwidth_3D_D2H_Strided	617.585 ms	616.784000 ms	99.87%	-0.13%	.
MicroBench_LocalMem_int32_4096	29.931 ms	29.862000 ms	99.77%	-0.23%	.
MicroBench_HostDeviceBandwidth_3D_H2D_Strided	5.092 ms	4.909000 ms	96.41%	-3.59%	.
MicroBench_HostDeviceBandwidth_2D_H2D_Strided	5.229 ms	4.940000 ms	94.47%	-5.53%	.
MicroBench_HostDeviceBandwidth_2D_H2D_Contiguous	4.934 ms	4.585000 ms	92.93%	-7.07%	.
MicroBench_HostDeviceBandwidth_1D_D2H_Contiguous	4.841 ms	4.456000 ms	92.05%	-7.95%	.
MicroBench_HostDeviceBandwidth_1D_D2H_Strided	5.155 ms	4.716000 ms	91.48%	-8.52%	.
MicroBench_HostDeviceBandwidth_3D_H2D_Contiguous	4.821 ms	4.376000 ms	90.77%	-9.23%	.
MicroBench_HostDeviceBandwidth_1D_H2D_Strided	4.807 ms	4.276000 ms	88.95%	-11.05%	.
MicroBench_HostDeviceBandwidth_1D_H2D_Contiguous	7.580 ms	4.526000 ms	59.71%	-40.29%	.

Relative perf in group Pattern (10): 99.939%

Benchmark	This PR	baseline	Relative perf	Change	-
Pattern_Reduction_Hierarchical_int32	16.116000 ms	16.339 ms	101.38%	1.38%	.
Pattern_Reduction_NDRange_int32	16.281000 ms	16.339 ms	100.36%	0.36%	.
Pattern_SegmentedReduction_Hierarchical_int64	11.784 ms	11.782000 ms	99.98%	-0.02%	.
Pattern_SegmentedReduction_Hierarchical_fp32	11.602 ms	11.587000 ms	99.87%	-0.13%	.
Pattern_SegmentedReduction_NDRange_int64	2.341 ms	2.337000 ms	99.83%	-0.17%	.
Pattern_SegmentedReduction_Hierarchical_int32	11.613 ms	11.588000 ms	99.78%	-0.22%	.
Pattern_SegmentedReduction_NDRange_fp32	2.173 ms	2.168000 ms	99.77%	-0.23%	.
Pattern_SegmentedReduction_NDRange_int32	2.170 ms	2.165000 ms	99.77%	-0.23%	.
Pattern_SegmentedReduction_Hierarchical_int16	11.825 ms	11.796000 ms	99.75%	-0.25%	.
Pattern_SegmentedReduction_NDRange_int16	2.290 ms	2.265000 ms	98.91%	-1.09%	.

Relative perf in group ScalarProduct (6): 99.670%

Benchmark	This PR	baseline	Relative perf	Change	-
ScalarProduct_NDRange_int32	3.760000 ms	3.765 ms	100.13%	0.13%	.
ScalarProduct_Hierarchical_fp32	10.157000 ms	10.167 ms	100.10%	0.10%	.
ScalarProduct_Hierarchical_int32	10.542 ms	10.541000 ms	99.99%	-0.01%	.
ScalarProduct_Hierarchical_int64	11.517 ms	11.490000 ms	99.77%	-0.23%	.
ScalarProduct_NDRange_fp32	3.766 ms	3.749000 ms	99.55%	-0.45%	.
ScalarProduct_NDRange_int64	5.508 ms	5.425000 ms	98.49%	-1.51%	.

Relative perf in group USM (7): 87.700%

Benchmark	This PR	baseline	Relative perf	Change	-
USM_Instr_Mix_fp32_host_1:1mix_no_init_no_prefetch	1.231000 ms	1.258 ms	102.19%	2.19%	.
USM_Instr_Mix_fp32_host_1:1mix_with_init_no_prefetch	1.077000 ms	1.087 ms	100.93%	0.93%	.
USM_Allocation_latency_fp32_host	37.865 ms	37.623000 ms	99.36%	-0.64%	.
USM_Instr_Mix_fp32_device_1:1mix_no_init_no_prefetch	1.917 ms	1.893000 ms	98.75%	-1.25%	.
USM_Allocation_latency_fp32_device	0.071 ms	0.065000 ms	91.55%	-8.45%	.
USM_Allocation_latency_fp32_shared	0.074 ms	0.062000 ms	83.78%	-16.22%	.
USM_Instr_Mix_fp32_device_1:1mix_with_init_no_prefetch	3.379 ms	1.737000 ms	51.41%	-48.59%	.

Relative perf in group VectorAddition (3): 100.605%

Benchmark	This PR	baseline	Relative perf	Change	-
VectorAddition_int32	1.460000 ms	1.477 ms	101.16%	1.16%	.
VectorAddition_int64	3.070000 ms	3.088 ms	100.59%	0.59%	.
VectorAddition_fp32	1.479000 ms	1.480 ms	100.07%	0.07%	.

Relative perf in group Polybench (3): 98.868%

Benchmark	This PR	baseline	Relative perf	Change	-
Polybench_3mm	1.491 ms	1.477000 ms	99.06%	-0.94%	.
Polybench_Atax	6.475 ms	6.402000 ms	98.87%	-1.13%	.
Polybench_2mm	1.053 ms	1.039000 ms	98.67%	-1.33%	.

Relative perf in group Kmeans (1): 99.993%

Benchmark	This PR	baseline	Relative perf	Change	-
Kmeans_fp32	14.112 ms	14.111000 ms	99.99%	-0.01%	.

Relative perf in group LinearRegressionCoeff (1): 98.403%

Benchmark	This PR	baseline	Relative perf	Change	-
LinearRegressionCoeff_fp32	896.225 ms	881.915000 ms	98.40%	-1.60%	.

Relative perf in group MolecularDynamics (1): 50.000%

Benchmark	This PR	baseline	Relative perf	Change	-
MolecularDynamics	0.060 ms	0.030000 ms	50.00%	-50.00%	.

Relative perf in group llama.cpp (6): 99.564%

Benchmark	This PR	baseline	Relative perf	Change	-
llama.cpp Text Generation Batched 256	63.031606 token/s	62.777 token/s	100.41%	0.41%	.
llama.cpp Text Generation Batched 512	63.024801 token/s	62.789 token/s	100.38%	0.38%	.
llama.cpp Text Generation Batched 128	62.993262 token/s	62.791 token/s	100.32%	0.32%	.
llama.cpp Prompt Processing Batched 512	432.189 token/s	435.723514 token/s	99.19%	-0.81%	.
llama.cpp Prompt Processing Batched 256	867.135 token/s	878.291089 token/s	98.73%	-1.27%	.
llama.cpp Prompt Processing Batched 128	816.653 token/s	830.097430 token/s	98.38%	-1.62%	.

Relative perf in group alloc/size:10000/0/4096/iterations:200000/threads:4 (7): 187.944%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/size:10000/0/4096/iterations:200000/threads:4 umfProxy	126.330000 ns	2688.530 ns	2128.18%	2028.18%	++++++++++
alloc/size:10000/0/4096/iterations:200000/threads:4 os_provider	1970.000000 ns	2113.560 ns	107.29%	7.29%	.
alloc/size:10000/0/4096/iterations:200000/threads:4 glibc	2394.250000 ns	2464.050 ns	102.92%	2.92%	.
alloc/size:10000/0/4096/iterations:200000/threads:4 proxy_pool<os_provider>	3043.580000 ns	3097.620 ns	101.78%	1.78%	.
alloc/size:10000/0/4096/iterations:200000/threads:4 scalable_pool<os_provider>	293.438 ns	287.722000 ns	98.05%	-1.95%	.
alloc/size:10000/0/4096/iterations:200000/threads:4 disjoint_pool<os_provider>	4632.600000 ns	-
alloc/size:10000/0/4096/iterations:200000/threads:4 jemalloc_pool<os_provider>	3747.240000 ns	-

Relative perf in group alloc/size:10000/0/4096/iterations:200000/threads:1 (7): 144.481%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/size:10000/0/4096/iterations:200000/threads:1 umfProxy	102.333000 ns	705.635 ns	689.55%	589.55%	+++
alloc/size:10000/0/4096/iterations:200000/threads:1 proxy_pool<os_provider>	276.470 ns	272.237000 ns	98.47%	-1.53%	.
alloc/size:10000/0/4096/iterations:200000/threads:1 glibc	711.427 ns	698.410000 ns	98.17%	-1.83%	.
alloc/size:10000/0/4096/iterations:200000/threads:1 scalable_pool<os_provider>	212.825 ns	208.759000 ns	98.09%	-1.91%	.
alloc/size:10000/0/4096/iterations:200000/threads:1 os_provider	198.680 ns	191.313000 ns	96.29%	-3.71%	.
alloc/size:10000/0/4096/iterations:200000/threads:1 disjoint_pool<os_provider>	491.472000 ns	-
alloc/size:10000/0/4096/iterations:200000/threads:1 jemalloc_pool<os_provider>	120.857000 ns	-

Relative perf in group alloc/size:10000/100000/4096/iterations:200000/threads:4 (7): 166.841%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/size:10000/100000/4096/iterations:200000/threads:4 umfProxy	123.843000 ns	1226.080 ns	990.03%	890.03%	++++
alloc/size:10000/100000/4096/iterations:200000/threads:4 os_provider	1662.350000 ns	2038.360 ns	122.62%	22.62%	.
alloc/size:10000/100000/4096/iterations:200000/threads:4 proxy_pool<os_provider>	3152.380000 ns	3338.690 ns	105.91%	5.91%	.
alloc/size:10000/100000/4096/iterations:200000/threads:4 glibc	1240.040000 ns	1274.570 ns	102.78%	2.78%	.
alloc/size:10000/100000/4096/iterations:200000/threads:4 scalable_pool<os_provider>	267.369 ns	261.553000 ns	97.82%	-2.18%	.
alloc/size:10000/100000/4096/iterations:200000/threads:4 disjoint_pool<os_provider>	4538.790000 ns	-
alloc/size:10000/100000/4096/iterations:200000/threads:4 jemalloc_pool<os_provider>	3568.870000 ns	-

Relative perf in group alloc/size:10000/100000/4096/iterations:200000/threads:1 (7): 126.217%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/size:10000/100000/4096/iterations:200000/threads:1 umfProxy	197.885000 ns	707.467 ns	357.51%	257.51%	+
alloc/size:10000/100000/4096/iterations:200000/threads:1 proxy_pool<os_provider>	303.412000 ns	310.903 ns	102.47%	2.47%	.
alloc/size:10000/100000/4096/iterations:200000/threads:1 os_provider	195.379 ns	189.545000 ns	97.01%	-2.99%	.
alloc/size:10000/100000/4096/iterations:200000/threads:1 glibc	739.729 ns	706.907000 ns	95.56%	-4.44%	.
alloc/size:10000/100000/4096/iterations:200000/threads:1 scalable_pool<os_provider>	208.399 ns	196.551000 ns	94.31%	-5.69%	.
alloc/size:10000/100000/4096/iterations:200000/threads:1 disjoint_pool<os_provider>	503.244000 ns	-
alloc/size:10000/100000/4096/iterations:200000/threads:1 jemalloc_pool<os_provider>	123.388000 ns	-

Relative perf in group alloc/min (8): 81.758%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 umfProxy	637.153000 ns	832.725 ns	130.69%	30.69%	.
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 scalable_pool<os_provider>	954.465000 ns	958.800 ns	100.45%	0.45%	.
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 glibc	176.066 ns	174.753000 ns	99.25%	-0.75%	.
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 scalable_pool<os_provider>	999.902 ns	965.779000 ns	96.59%	-3.41%	.
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 glibc	892.006 ns	797.092000 ns	89.36%	-10.64%	.
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 umfProxy	667.010 ns	177.130000 ns	26.56%	-73.44%	.
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 jemalloc_pool<os_provider>	4173.800000 ns	-
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 jemalloc_pool<os_provider>	395.250000 ns	-

Relative perf in group multiple (22): 26.824%

Benchmark	This PR	baseline	Relative perf	Change	-
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 scalable_pool<os_provider>	15177.600000 ns	16418.600 ns	108.18%	8.18%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 os_provider	138901.000000 ns	146423.000 ns	105.42%	5.42%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 os_provider	1151840.000000 ns	1174970.000 ns	102.01%	2.01%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 proxy_pool<os_provider>	162815.000 ns	162279.000000 ns	99.67%	-0.33%	.
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 scalable_pool<os_provider>	75741.900 ns	75451.700000 ns	99.62%	-0.38%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 glibc	4310.020 ns	4283.690000 ns	99.39%	-0.61%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 proxy_pool<os_provider>	1171730.000 ns	1162100.000000 ns	99.18%	-0.82%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 glibc	33449.000 ns	33153.600000 ns	99.12%	-0.88%	.
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 glibc	31315.700 ns	30910.300000 ns	98.71%	-1.29%	.
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 scalable_pool<os_provider>	25911.300 ns	25525.500000 ns	98.51%	-1.49%	.
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 glibc	142900.000 ns	138360.000000 ns	96.82%	-3.18%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 scalable_pool<os_provider>	46556.300 ns	41438.000000 ns	89.01%	-10.99%	.
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 umfProxy	10794200.000 ns	140162.000000 ns	1.30%	-98.70%	.
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 umfProxy	2716370.000 ns	30121.800000 ns	1.11%	-98.89%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 umfProxy	8007430.000 ns	27477.700000 ns	0.34%	-99.66%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 umfProxy	2737030.000 ns	4208.520000 ns	0.15%	-99.85%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 disjoint_pool<os_provider>	1739660.000000 ns	-
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 disjoint_pool<os_provider>	206854.000000 ns	-
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 jemalloc_pool<os_provider>	516154.000000 ns	-
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 jemalloc_pool<os_provider>	24325.300000 ns	-
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 jemalloc_pool<os_provider>	648229.000000 ns	-
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 jemalloc_pool<os_provider>	62128.900000 ns	-

QS_DEVICE=GPU

Command:

/home/pmdk/bench_workdir/QuickSilver/qs -i /home/pmdk/bench_workdir/velocity-bench-repo/QuickSilver/Examples/AllScattering/scatteringOnly.inp

Velocity-Bench Sobel Filter

Environment Variables:

OPENCV_IO_MAX_IMAGE_PIXELS=1677721600

Command:

/home/pmdk/bench_workdir/sobel_filter/sobel_filter -i /home/pmdk/bench_workdir/data/sobel_filter/sobel_filter_data/silverfalls_32Kx32K.png -n 5

Velocity-Bench dl-cifar

Environment Variables:

Command:

/home/pmdk/bench_workdir/dl-cifar/dl-cifar_sycl

Velocity-Bench dl-mnist

Environment Variables:

LD_PRELOAD=/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/lib/libumf_proxy.so

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv --benchmark_filter=glibc

github-actions · 2025-01-30T16:50:32Z

Compute Benchmarks level_zero run (with params: --iterations-stddev 2 --iterations 2):
https://github.com/oneapi-src/unified-runtime/actions/runs/13057141939

github-actions · 2025-01-30T17:02:38Z

Compute Benchmarks level_zero run (--iterations-stddev 2 --iterations 2):
https://github.com/oneapi-src/unified-runtime/actions/runs/13057141939
Job status: success. Test status: success.

Summary

Total 42 benchmarks in mean.
Geomean 70.728%.
Improved 7 Regressed 23 (threshold 2.00%)

(in each row, the better result is the value shown with six decimal places)

Performance change in benchmark groups

Relative perf in group alloc/size:10000/0/4096/iterations:200000/threads:4 (7): 177.072%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/size:10000/0/4096/iterations:200000/threads:4 umfProxy	117.389000 ns	2688.530 ns	2290.27%	2190.27%	++++++++++
alloc/size:10000/0/4096/iterations:200000/threads:4 os_provider	2137.750 ns	2113.560000 ns	98.87%	-1.13%	.
alloc/size:10000/0/4096/iterations:200000/threads:4 scalable_pool<os_provider>	307.692 ns	287.722000 ns	93.51%	-6.49%	.
alloc/size:10000/0/4096/iterations:200000/threads:4 glibc	2685.060 ns	2464.050000 ns	91.77%	-8.23%	.
alloc/size:10000/0/4096/iterations:200000/threads:4 proxy_pool<os_provider>	3457.650 ns	3097.620000 ns	89.59%	-10.41%	.
alloc/size:10000/0/4096/iterations:200000/threads:4 disjoint_pool<os_provider>	4937.160000 ns	-
alloc/size:10000/0/4096/iterations:200000/threads:4 jemalloc_pool<os_provider>	3506.160000 ns	-

Relative perf in group alloc/size:10000/0/4096/iterations:200000/threads:1 (7): 143.876%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/size:10000/0/4096/iterations:200000/threads:1 umfProxy	104.384000 ns	705.635 ns	676.00%	576.00%	+++
alloc/size:10000/0/4096/iterations:200000/threads:1 glibc	701.882 ns	698.410000 ns	99.51%	-0.49%	.
alloc/size:10000/0/4096/iterations:200000/threads:1 proxy_pool<os_provider>	278.621 ns	272.237000 ns	97.71%	-2.29%	.
alloc/size:10000/0/4096/iterations:200000/threads:1 scalable_pool<os_provider>	213.676 ns	208.759000 ns	97.70%	-2.30%	.
alloc/size:10000/0/4096/iterations:200000/threads:1 os_provider	199.261 ns	191.313000 ns	96.01%	-3.99%	.
alloc/size:10000/0/4096/iterations:200000/threads:1 disjoint_pool<os_provider>	491.052000 ns	-
alloc/size:10000/0/4096/iterations:200000/threads:1 jemalloc_pool<os_provider>	120.083000 ns	-

Relative perf in group alloc/size:10000/100000/4096/iterations:200000/threads:4 (7): 148.526%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/size:10000/100000/4096/iterations:200000/threads:4 umfProxy	133.241000 ns	1226.080 ns	920.20%	820.20%	++++
alloc/size:10000/100000/4096/iterations:200000/threads:4 proxy_pool<os_provider>	3420.390 ns	3338.690000 ns	97.61%	-2.39%	.
alloc/size:10000/100000/4096/iterations:200000/threads:4 glibc	1315.010 ns	1274.570000 ns	96.92%	-3.08%	.
alloc/size:10000/100000/4096/iterations:200000/threads:4 os_provider	2199.040 ns	2038.360000 ns	92.69%	-7.31%	.
alloc/size:10000/100000/4096/iterations:200000/threads:4 scalable_pool<os_provider>	292.019 ns	261.553000 ns	89.57%	-10.43%	.
alloc/size:10000/100000/4096/iterations:200000/threads:4 disjoint_pool<os_provider>	4770.270000 ns	-
alloc/size:10000/100000/4096/iterations:200000/threads:4 jemalloc_pool<os_provider>	3227.210000 ns	-

Relative perf in group alloc/size:10000/100000/4096/iterations:200000/threads:1 (7): 126.308%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/size:10000/100000/4096/iterations:200000/threads:1 umfProxy	196.181000 ns	707.467 ns	360.62%	260.62%	+
alloc/size:10000/100000/4096/iterations:200000/threads:1 proxy_pool<os_provider>	311.369 ns	310.903000 ns	99.85%	-0.15%	.
alloc/size:10000/100000/4096/iterations:200000/threads:1 glibc	726.483 ns	706.907000 ns	97.31%	-2.69%	.
alloc/size:10000/100000/4096/iterations:200000/threads:1 os_provider	196.378 ns	189.545000 ns	96.52%	-3.48%	.
alloc/size:10000/100000/4096/iterations:200000/threads:1 scalable_pool<os_provider>	206.764 ns	196.551000 ns	95.06%	-4.94%	.
alloc/size:10000/100000/4096/iterations:200000/threads:1 disjoint_pool<os_provider>	506.238000 ns	-
alloc/size:10000/100000/4096/iterations:200000/threads:1 jemalloc_pool<os_provider>	119.273000 ns	-

Relative perf in group alloc/min (8): 81.100%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 umfProxy	625.028000 ns	832.725 ns	133.23%	33.23%	.
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 glibc	170.467000 ns	174.753 ns	102.51%	2.51%	.
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 scalable_pool<os_provider>	958.816 ns	958.800000 ns	100.00%	-0.00%	.
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 glibc	810.623 ns	797.092000 ns	98.33%	-1.67%	.
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 scalable_pool<os_provider>	1019.430 ns	965.779000 ns	94.74%	-5.26%	.
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 umfProxy	792.064 ns	177.130000 ns	22.36%	-77.64%	.
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 jemalloc_pool<os_provider>	4415.900000 ns	-
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 jemalloc_pool<os_provider>	346.002000 ns	-

Relative perf in group multiple (22): 26.729%

Benchmark	This PR	baseline	Relative perf	Change	-
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 scalable_pool<os_provider>	15224.200000 ns	16418.600 ns	107.85%	7.85%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 os_provider	145167.000000 ns	146423.000 ns	100.87%	0.87%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 proxy_pool<os_provider>	161032.000000 ns	162279.000 ns	100.77%	0.77%	.
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 scalable_pool<os_provider>	75164.700000 ns	75451.700 ns	100.38%	0.38%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 glibc	33117.300000 ns	33153.600 ns	100.11%	0.11%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 glibc	4333.700 ns	4283.690000 ns	98.85%	-1.15%	.
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 scalable_pool<os_provider>	25961.700 ns	25525.500000 ns	98.32%	-1.68%	.
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 glibc	31494.200 ns	30910.300000 ns	98.15%	-1.85%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 proxy_pool<os_provider>	1186360.000 ns	1162100.000000 ns	97.96%	-2.04%	.
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 glibc	142303.000 ns	138360.000000 ns	97.23%	-2.77%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 os_provider	1209150.000 ns	1174970.000000 ns	97.17%	-2.83%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 scalable_pool<os_provider>	47311.300 ns	41438.000000 ns	87.59%	-12.41%	.
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 umfProxy	10725700.000 ns	140162.000000 ns	1.31%	-98.69%	.
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 umfProxy	2669490.000 ns	30121.800000 ns	1.13%	-98.87%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 umfProxy	8126240.000 ns	27477.700000 ns	0.34%	-99.66%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 umfProxy	2632540.000 ns	4208.520000 ns	0.16%	-99.84%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 disjoint_pool<os_provider>	1749830.000000 ns	-
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 disjoint_pool<os_provider>	209833.000000 ns	-
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 jemalloc_pool<os_provider>	496101.000000 ns	-
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 jemalloc_pool<os_provider>	24181.900000 ns	-
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 jemalloc_pool<os_provider>	621195.000000 ns	-
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 jemalloc_pool<os_provider>	60798.000000 ns	-

Relative perf in group api (12): cannot calculate

Benchmark	This PR	baseline
api_overhead_benchmark_l0 SubmitKernel out of order	-	11.369000 μs
api_overhead_benchmark_l0 SubmitKernel in order	-	11.395000 μs
api_overhead_benchmark_sycl SubmitKernel out of order	-	23.506000 μs
api_overhead_benchmark_sycl SubmitKernel in order	-	24.407000 μs
api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024	-	2.149000 μs
api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024	-	1.673000 μs
api_overhead_benchmark_ur SubmitKernel out of order CPU count	-	104663.000000 instr
api_overhead_benchmark_ur SubmitKernel out of order	-	15.866000 μs
api_overhead_benchmark_ur SubmitKernel in order CPU count	-	110006.000000 instr
api_overhead_benchmark_ur SubmitKernel in order	-	16.785000 μs
api_overhead_benchmark_ur SubmitKernel in order with measure completion CPU count	-	123166.000000 instr
api_overhead_benchmark_ur SubmitKernel in order with measure completion	-	21.495000 μs

Relative perf in group memory (4): cannot calculate

Benchmark	This PR	baseline
memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024	-	252.914000 μs
memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024	-	219.832000 μs
memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024	-	5.900000 μs
memory_benchmark_sycl StreamMemory, placement Device, type Triad, size 10240	-	3.070000 GB/s

Relative perf in group miscellaneous (1): cannot calculate

Benchmark	This PR	baseline
miscellaneous_benchmark_sycl VectorSum	-	858.023000 bw GB/s

Relative perf in group multithread (10): cannot calculate

Benchmark	This PR	baseline
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:1 dstUSM:1	-	6896.127000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:1 dstUSM:1	-	17165.065000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:1 dstUSM:1	-	46811.855000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:1 dstUSM:1	-	2047.766000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:0 dstUSM:1	-	7766.797000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:0 dstUSM:1	-	8883.578000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:0 dstUSM:1	-	27030.035000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:0 dstUSM:1	-	1199.669000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:1, allocSize:1024 srcUSM:0 dstUSM:1 without events	-	42602.254000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:4, allocSize:1024 srcUSM:0 dstUSM:1 without events	-	112408.658000 μs

Relative perf in group graph (10): cannot calculate

Benchmark	This PR	baseline
graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:10	-	71746.038000 μs
graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:10	-	72583.103000 μs
graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:100	-	353349.563000 μs
graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:100	-	353086.695000 μs
graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:1, numKernels:10	-	55.253000 μs
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:10	-	62.493000 μs
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:100	-	677.203000 μs
graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:0, numKernels:10	-	5621.320000 μs
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:10	-	5631.730000 μs
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:100	-	56454.921000 μs

Relative perf in group Velocity-Bench (9): cannot calculate

Benchmark	This PR	baseline
Velocity-Bench Hashtable	-	363.339623 M keys/sec
Velocity-Bench Bitcracker	-	38.359400 s
Velocity-Bench CudaSift	-	203.947000 ms
Velocity-Bench Easywave	-	227.000000 ms
Velocity-Bench QuickSilver	-	116.460000 MMS/CTT
Velocity-Bench Sobel Filter	-	603.076000 ms
Velocity-Bench dl-cifar	-	23.630300 s
Velocity-Bench dl-mnist	-	2.710000 s
Velocity-Bench svm	-	0.135900 s

Relative perf in group Runtime (8): cannot calculate

Benchmark	This PR	baseline
Runtime_IndependentDAGTaskThroughput_SingleTask	-	259.444000 ms
Runtime_IndependentDAGTaskThroughput_BasicParallelFor	-	274.274000 ms
Runtime_IndependentDAGTaskThroughput_HierarchicalParallelFor	-	275.173000 ms
Runtime_IndependentDAGTaskThroughput_NDRangeParallelFor	-	276.461000 ms
Runtime_DAGTaskThroughput_SingleTask	-	1648.643000 ms
Runtime_DAGTaskThroughput_BasicParallelFor	-	1704.436000 ms
Runtime_DAGTaskThroughput_HierarchicalParallelFor	-	1710.439000 ms
Runtime_DAGTaskThroughput_NDRangeParallelFor	-	1673.462000 ms

Relative perf in group MicroBench (14): cannot calculate

Benchmark	This PR	baseline
MicroBench_HostDeviceBandwidth_1D_H2D_Contiguous	-	4.526000 ms
MicroBench_HostDeviceBandwidth_2D_H2D_Contiguous	-	4.585000 ms
MicroBench_HostDeviceBandwidth_3D_H2D_Contiguous	-	4.376000 ms
MicroBench_HostDeviceBandwidth_1D_D2H_Contiguous	-	4.456000 ms
MicroBench_HostDeviceBandwidth_2D_D2H_Contiguous	-	617.437000 ms
MicroBench_HostDeviceBandwidth_3D_D2H_Contiguous	-	617.442000 ms
MicroBench_HostDeviceBandwidth_1D_H2D_Strided	-	4.276000 ms
MicroBench_HostDeviceBandwidth_2D_H2D_Strided	-	4.940000 ms
MicroBench_HostDeviceBandwidth_3D_H2D_Strided	-	4.909000 ms
MicroBench_HostDeviceBandwidth_1D_D2H_Strided	-	4.716000 ms
MicroBench_HostDeviceBandwidth_2D_D2H_Strided	-	616.834000 ms
MicroBench_HostDeviceBandwidth_3D_D2H_Strided	-	616.784000 ms
MicroBench_LocalMem_int32_4096	-	29.862000 ms
MicroBench_LocalMem_fp32_4096	-	29.902000 ms

Relative perf in group Pattern (10): cannot calculate

Benchmark	This PR	baseline
Pattern_Reduction_NDRange_int32	-	16.339000 ms
Pattern_Reduction_Hierarchical_int32	-	16.339000 ms
Pattern_SegmentedReduction_NDRange_int16	-	2.265000 ms
Pattern_SegmentedReduction_NDRange_int32	-	2.165000 ms
Pattern_SegmentedReduction_NDRange_int64	-	2.337000 ms
Pattern_SegmentedReduction_NDRange_fp32	-	2.168000 ms
Pattern_SegmentedReduction_Hierarchical_int16	-	11.796000 ms
Pattern_SegmentedReduction_Hierarchical_int32	-	11.588000 ms
Pattern_SegmentedReduction_Hierarchical_int64	-	11.782000 ms
Pattern_SegmentedReduction_Hierarchical_fp32	-	11.587000 ms

Relative perf in group ScalarProduct (6): cannot calculate

Benchmark	This PR	baseline
ScalarProduct_NDRange_int32	-	3.765000 ms
ScalarProduct_NDRange_int64	-	5.425000 ms
ScalarProduct_NDRange_fp32	-	3.749000 ms
ScalarProduct_Hierarchical_int32	-	10.541000 ms
ScalarProduct_Hierarchical_int64	-	11.490000 ms
ScalarProduct_Hierarchical_fp32	-	10.167000 ms

Relative perf in group USM (7): cannot calculate

Benchmark	This PR	baseline
USM_Allocation_latency_fp32_device	-	0.065000 ms
USM_Allocation_latency_fp32_host	-	37.623000 ms
USM_Allocation_latency_fp32_shared	-	0.062000 ms
USM_Instr_Mix_fp32_device_1:1mix_with_init_no_prefetch	-	1.737000 ms
USM_Instr_Mix_fp32_host_1:1mix_with_init_no_prefetch	-	1.087000 ms
USM_Instr_Mix_fp32_device_1:1mix_no_init_no_prefetch	-	1.893000 ms
USM_Instr_Mix_fp32_host_1:1mix_no_init_no_prefetch	-	1.258000 ms

Relative perf in group VectorAddition (3): cannot calculate

Benchmark	This PR	baseline
VectorAddition_int32	-	1.477000 ms
VectorAddition_int64	-	3.088000 ms
VectorAddition_fp32	-	1.480000 ms

Relative perf in group Polybench (3): cannot calculate

Benchmark	This PR	baseline
Polybench_2mm	-	1.039000 ms
Polybench_3mm	-	1.477000 ms
Polybench_Atax	-	6.402000 ms

Relative perf in group Kmeans (1): cannot calculate

Benchmark	This PR	baseline
Kmeans_fp32	-	14.111000 ms

Relative perf in group LinearRegressionCoeff (1): cannot calculate

Benchmark	This PR	baseline
LinearRegressionCoeff_fp32	-	881.915000 ms

Relative perf in group MolecularDynamics (1): cannot calculate

Benchmark	This PR	baseline
MolecularDynamics	-	0.030000 ms

Relative perf in group llama.cpp (6): cannot calculate

Benchmark	This PR	baseline
llama.cpp Prompt Processing Batched 128	-	830.097430 token/s
llama.cpp Text Generation Batched 128	-	62.790938 token/s
llama.cpp Prompt Processing Batched 256	-	878.291089 token/s
llama.cpp Text Generation Batched 256	-	62.777001 token/s
llama.cpp Prompt Processing Batched 512	-	435.723514 token/s
llama.cpp Text Generation Batched 512	-	62.788791 token/s

LD_PRELOAD=/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/lib/libumf_proxy.so

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv --benchmark_filter=glibc

github-actions · 2025-01-30T17:15:19Z

Compute Benchmarks level_zero run (with params: --iterations-stddev 2 --iterations 2):
https://github.com/oneapi-src/unified-runtime/actions/runs/13057580018

github-actions · 2025-01-30T17:21:47Z

Compute Benchmarks level_zero run (--iterations-stddev 2 --iterations 2):
https://github.com/oneapi-src/unified-runtime/actions/runs/13057580018
Job status: success. Test status: success.

Summary

Total 42 benchmarks in mean.
Geomean 97.691%.
Improved 5 Regressed 22 (threshold 2.00%)

(in each row, the better result is the value shown with six decimal places)
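
The "Improved 5 Regressed 22" counts in this summary can be reproduced from the Change column of the tables in this comment with a simple threshold test. A minimal sketch, assuming a benchmark counts as improved when its change exceeds +2.00% and as regressed when it falls below -2.00%; whether the actual script treats the boundary as inclusive is not stated here, and `classify` is a hypothetical name:

```python
from collections import Counter

def classify(change_pct, threshold=2.0):
    """Label one benchmark from its 'Change' value in percent (assumed strict bounds)."""
    if change_pct > threshold:
        return "improved"
    if change_pct < -threshold:
        return "regressed"
    return "unchanged"

# A few Change values taken from the tables in this comment.
sample = [1.51, -5.83, 6.10, -0.96, 2.09, -12.00]
print(Counter(classify(c) for c in sample))
```

Rows without a baseline value have no Change figure and are left out of the count, which matches the "Total 42 benchmarks in mean" line.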

Performance change in benchmark groups

Relative perf in group alloc/size:10000/0/4096/iterations:200000/threads:4 (6): 94.616%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/size:10000/0/4096/iterations:200000/threads:4 umfProxy	2648.410000 ns	2688.530 ns	101.51%	1.51%	.
alloc/size:10000/0/4096/iterations:200000/threads:4 proxy_pool<os_provider>	3064.420000 ns	3097.620 ns	101.08%	1.08%	.
alloc/size:10000/0/4096/iterations:200000/threads:4 scalable_pool<os_provider>	305.545 ns	287.722000 ns	94.17%	-5.83%	-----
alloc/size:10000/0/4096/iterations:200000/threads:4 os_provider	2370.280 ns	2113.560000 ns	89.17%	-10.83%	---------
alloc/size:10000/0/4096/iterations:200000/threads:4 glibc	2799.920 ns	2464.050000 ns	88.00%	-12.00%	----------
alloc/size:10000/0/4096/iterations:200000/threads:4 jemalloc	117.189000 ns	-
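
The last column of the table above appears to be a small bar chart of the Change value. The sketch below is an inference from the rows in this comment, not the benchmark script's code: changes within the 2.00% threshold print `.`, and larger ones print `+` or `-` repeated in proportion to the largest absolute change in the run (12.00% here, shown as ten characters). `change_bar` and its parameters are hypothetical names.

```python
def change_bar(change_pct, max_abs_change, threshold=2.0, width=10):
    """Render a Change value as a '+'/'-' bar; '.' when within threshold or too small."""
    if abs(change_pct) <= threshold:
        return "."
    n = round(abs(change_pct) / max_abs_change * width)
    if n == 0:
        return "."
    return ("+" if change_pct > 0 else "-") * n

print(change_bar(-12.00, 12.00))  # ----------
print(change_bar(-10.83, 12.00))  # ---------
print(change_bar(-5.83, 12.00))   # -----
```

Under this reading, a run containing one very large outlier compresses every other bar to `.`, which would explain why the earlier reports show `.` even for changes of tens of percent.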

Relative perf in group alloc/size:10000/0/4096/iterations:200000/threads:1 (6): 96.714%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/size:10000/0/4096/iterations:200000/threads:1 umfProxy	721.182 ns	705.635000 ns	97.84%	-2.16%	--
alloc/size:10000/0/4096/iterations:200000/threads:1 glibc	717.457 ns	698.410000 ns	97.35%	-2.65%	--
alloc/size:10000/0/4096/iterations:200000/threads:1 proxy_pool<os_provider>	282.656 ns	272.237000 ns	96.31%	-3.69%	---
alloc/size:10000/0/4096/iterations:200000/threads:1 scalable_pool<os_provider>	216.960 ns	208.759000 ns	96.22%	-3.78%	---
alloc/size:10000/0/4096/iterations:200000/threads:1 os_provider	199.570 ns	191.313000 ns	95.86%	-4.14%	---
alloc/size:10000/0/4096/iterations:200000/threads:1 jemalloc	86.443300 ns	-

Relative perf in group alloc/size:10000/100000/4096/iterations:200000/threads:4 (6): 100.102%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/size:10000/100000/4096/iterations:200000/threads:4 os_provider	1921.220000 ns	2038.360 ns	106.10%	6.10%	+++++
alloc/size:10000/100000/4096/iterations:200000/threads:4 glibc	1231.370000 ns	1274.570 ns	103.51%	3.51%	+++
alloc/size:10000/100000/4096/iterations:200000/threads:4 proxy_pool<os_provider>	3290.330000 ns	3338.690 ns	101.47%	1.47%	.
alloc/size:10000/100000/4096/iterations:200000/threads:4 umfProxy	1238.010 ns	1226.080000 ns	99.04%	-0.96%	.
alloc/size:10000/100000/4096/iterations:200000/threads:4 scalable_pool<os_provider>	287.178 ns	261.553000 ns	91.08%	-8.92%	-------
alloc/size:10000/100000/4096/iterations:200000/threads:4 jemalloc	109.364000 ns	-

Relative perf in group alloc/size:10000/100000/4096/iterations:200000/threads:1 (6): 97.198%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/size:10000/100000/4096/iterations:200000/threads:1 scalable_pool<os_provider>	194.876000 ns	196.551 ns	100.86%	0.86%	.
alloc/size:10000/100000/4096/iterations:200000/threads:1 proxy_pool<os_provider>	313.688 ns	310.903000 ns	99.11%	-0.89%	.
alloc/size:10000/100000/4096/iterations:200000/threads:1 umfProxy	736.269 ns	707.467000 ns	96.09%	-3.91%	---
alloc/size:10000/100000/4096/iterations:200000/threads:1 glibc	739.125 ns	706.907000 ns	95.64%	-4.36%	----
alloc/size:10000/100000/4096/iterations:200000/threads:1 os_provider	200.720 ns	189.545000 ns	94.43%	-5.57%	-----
alloc/size:10000/100000/4096/iterations:200000/threads:1 jemalloc	84.055000 ns	-

Relative perf in group alloc/min (8): 97.779%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 scalable_pool<os_provider>	970.888 ns	965.779000 ns	99.47%	-0.53%	.
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 glibc	177.158 ns	174.753000 ns	98.64%	-1.36%	.
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 scalable_pool<os_provider>	981.498 ns	958.800000 ns	97.69%	-2.31%	--
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 umfProxy	854.006 ns	832.725000 ns	97.51%	-2.49%	--
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 glibc	821.160 ns	797.092000 ns	97.07%	-2.93%	--
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 umfProxy	183.892 ns	177.130000 ns	96.32%	-3.68%	---
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 jemalloc	435.088000 ns	-
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 jemalloc	280.303000 ns	-

Relative perf in group multiple (20): 98.353%

Benchmark	This PR	baseline	Relative perf	Change	-
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 scalable_pool<os_provider>	15097.900000 ns	16418.600 ns	108.75%	8.75%	+++++++
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 os_provider	141189.000000 ns	146423.000 ns	103.71%	3.71%	+++
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 proxy_pool<os_provider>	158959.000000 ns	162279.000 ns	102.09%	2.09%	++
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 glibc	32597.200000 ns	33153.600 ns	101.71%	1.71%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 glibc	4227.420000 ns	4283.690 ns	101.33%	1.33%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 umfProxy	4199.780000 ns	4208.520 ns	100.21%	0.21%	.
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 glibc	139441.000 ns	138360.000000 ns	99.22%	-0.78%	.
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 umfProxy	142050.000 ns	140162.000000 ns	98.67%	-1.33%	.
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 scalable_pool<os_provider>	76545.000 ns	75451.700000 ns	98.57%	-1.43%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 scalable_pool<os_provider>	42248.200 ns	41438.000000 ns	98.08%	-1.92%	.
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 glibc	31636.100 ns	30910.300000 ns	97.71%	-2.29%	--
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 umfProxy	28264.900 ns	27477.700000 ns	97.21%	-2.79%	--
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 umfProxy	31783.800 ns	30121.800000 ns	94.77%	-5.23%	----
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 scalable_pool<os_provider>	27169.100 ns	25525.500000 ns	93.95%	-6.05%	-----
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 os_provider	1309000.000 ns	1174970.000000 ns	89.76%	-10.24%	---------
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 proxy_pool<os_provider>	1295170.000 ns	1162100.000000 ns	89.73%	-10.27%	---------
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 jemalloc	30644.600000 ns	-
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 jemalloc	24425.600000 ns	-
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 jemalloc	47559.500000 ns	-
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 jemalloc	26449.600000 ns	-

Relative perf in group api (12): cannot calculate

Benchmark	This PR	baseline
api_overhead_benchmark_l0 SubmitKernel out of order	-	11.369000 μs
api_overhead_benchmark_l0 SubmitKernel in order	-	11.395000 μs
api_overhead_benchmark_sycl SubmitKernel out of order	-	23.506000 μs
api_overhead_benchmark_sycl SubmitKernel in order	-	24.407000 μs
api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024	-	2.149000 μs
api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024	-	1.673000 μs
api_overhead_benchmark_ur SubmitKernel out of order CPU count	-	104663.000000 instr
api_overhead_benchmark_ur SubmitKernel out of order	-	15.866000 μs
api_overhead_benchmark_ur SubmitKernel in order CPU count	-	110006.000000 instr
api_overhead_benchmark_ur SubmitKernel in order	-	16.785000 μs
api_overhead_benchmark_ur SubmitKernel in order with measure completion CPU count	-	123166.000000 instr
api_overhead_benchmark_ur SubmitKernel in order with measure completion	-	21.495000 μs

Relative perf in group memory (4): cannot calculate

Benchmark	This PR	baseline
memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024	-	252.914000 μs
memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024	-	219.832000 μs
memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024	-	5.900000 μs
memory_benchmark_sycl StreamMemory, placement Device, type Triad, size 10240	-	3.070000 GB/s

Relative perf in group miscellaneous (1): cannot calculate

Benchmark	This PR	baseline
miscellaneous_benchmark_sycl VectorSum	-	858.023000 bw GB/s

Relative perf in group multithread (10): cannot calculate

Benchmark	This PR	baseline
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:1 dstUSM:1	-	6896.127000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:1 dstUSM:1	-	17165.065000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:1 dstUSM:1	-	46811.855000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:1 dstUSM:1	-	2047.766000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:0 dstUSM:1	-	7766.797000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:0 dstUSM:1	-	8883.578000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:0 dstUSM:1	-	27030.035000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:0 dstUSM:1	-	1199.669000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:1, allocSize:1024 srcUSM:0 dstUSM:1 without events	-	42602.254000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:4, allocSize:1024 srcUSM:0 dstUSM:1 without events	-	112408.658000 μs

Relative perf in group graph (10): cannot calculate

Benchmark	This PR	baseline
graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:10	-	71746.038000 μs
graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:10	-	72583.103000 μs
graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:100	-	353349.563000 μs
graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:100	-	353086.695000 μs
graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:1, numKernels:10	-	55.253000 μs
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:10	-	62.493000 μs
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:100	-	677.203000 μs
graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:0, numKernels:10	-	5621.320000 μs
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:10	-	5631.730000 μs
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:100	-	56454.921000 μs

Relative perf in group Velocity-Bench (9): cannot calculate

Benchmark	This PR	baseline
Velocity-Bench Hashtable	-	363.339623 M keys/sec
Velocity-Bench Bitcracker	-	38.359400 s
Velocity-Bench CudaSift	-	203.947000 ms
Velocity-Bench Easywave	-	227.000000 ms
Velocity-Bench QuickSilver	-	116.460000 MMS/CTT
Velocity-Bench Sobel Filter	-	603.076000 ms
Velocity-Bench dl-cifar	-	23.630300 s
Velocity-Bench dl-mnist	-	2.710000 s
Velocity-Bench svm	-	0.135900 s

Relative perf in group Runtime (8): cannot calculate

Benchmark	This PR	baseline
Runtime_IndependentDAGTaskThroughput_SingleTask	-	259.444000 ms
Runtime_IndependentDAGTaskThroughput_BasicParallelFor	-	274.274000 ms
Runtime_IndependentDAGTaskThroughput_HierarchicalParallelFor	-	275.173000 ms
Runtime_IndependentDAGTaskThroughput_NDRangeParallelFor	-	276.461000 ms
Runtime_DAGTaskThroughput_SingleTask	-	1648.643000 ms
Runtime_DAGTaskThroughput_BasicParallelFor	-	1704.436000 ms
Runtime_DAGTaskThroughput_HierarchicalParallelFor	-	1710.439000 ms
Runtime_DAGTaskThroughput_NDRangeParallelFor	-	1673.462000 ms

Relative perf in group MicroBench (14): cannot calculate

Benchmark	This PR	baseline
MicroBench_HostDeviceBandwidth_1D_H2D_Contiguous	-	4.526000 ms
MicroBench_HostDeviceBandwidth_2D_H2D_Contiguous	-	4.585000 ms
MicroBench_HostDeviceBandwidth_3D_H2D_Contiguous	-	4.376000 ms
MicroBench_HostDeviceBandwidth_1D_D2H_Contiguous	-	4.456000 ms
MicroBench_HostDeviceBandwidth_2D_D2H_Contiguous	-	617.437000 ms
MicroBench_HostDeviceBandwidth_3D_D2H_Contiguous	-	617.442000 ms
MicroBench_HostDeviceBandwidth_1D_H2D_Strided	-	4.276000 ms
MicroBench_HostDeviceBandwidth_2D_H2D_Strided	-	4.940000 ms
MicroBench_HostDeviceBandwidth_3D_H2D_Strided	-	4.909000 ms
MicroBench_HostDeviceBandwidth_1D_D2H_Strided	-	4.716000 ms
MicroBench_HostDeviceBandwidth_2D_D2H_Strided	-	616.834000 ms
MicroBench_HostDeviceBandwidth_3D_D2H_Strided	-	616.784000 ms
MicroBench_LocalMem_int32_4096	-	29.862000 ms
MicroBench_LocalMem_fp32_4096	-	29.902000 ms

Relative perf in group Pattern (10): cannot calculate

Benchmark	This PR	baseline
Pattern_Reduction_NDRange_int32	-	16.339000 ms
Pattern_Reduction_Hierarchical_int32	-	16.339000 ms
Pattern_SegmentedReduction_NDRange_int16	-	2.265000 ms
Pattern_SegmentedReduction_NDRange_int32	-	2.165000 ms
Pattern_SegmentedReduction_NDRange_int64	-	2.337000 ms
Pattern_SegmentedReduction_NDRange_fp32	-	2.168000 ms
Pattern_SegmentedReduction_Hierarchical_int16	-	11.796000 ms
Pattern_SegmentedReduction_Hierarchical_int32	-	11.588000 ms
Pattern_SegmentedReduction_Hierarchical_int64	-	11.782000 ms
Pattern_SegmentedReduction_Hierarchical_fp32	-	11.587000 ms

Relative perf in group ScalarProduct (6): cannot calculate

Benchmark	This PR	baseline
ScalarProduct_NDRange_int32	-	3.765000 ms
ScalarProduct_NDRange_int64	-	5.425000 ms
ScalarProduct_NDRange_fp32	-	3.749000 ms
ScalarProduct_Hierarchical_int32	-	10.541000 ms
ScalarProduct_Hierarchical_int64	-	11.490000 ms
ScalarProduct_Hierarchical_fp32	-	10.167000 ms

Relative perf in group USM (7): cannot calculate

Benchmark	This PR	baseline
USM_Allocation_latency_fp32_device	-	0.065000 ms
USM_Allocation_latency_fp32_host	-	37.623000 ms
USM_Allocation_latency_fp32_shared	-	0.062000 ms
USM_Instr_Mix_fp32_device_1:1mix_with_init_no_prefetch	-	1.737000 ms
USM_Instr_Mix_fp32_host_1:1mix_with_init_no_prefetch	-	1.087000 ms
USM_Instr_Mix_fp32_device_1:1mix_no_init_no_prefetch	-	1.893000 ms
USM_Instr_Mix_fp32_host_1:1mix_no_init_no_prefetch	-	1.258000 ms

Relative perf in group VectorAddition (3): cannot calculate

Benchmark	This PR	baseline
VectorAddition_int32	-	1.477000 ms
VectorAddition_int64	-	3.088000 ms
VectorAddition_fp32	-	1.480000 ms

Relative perf in group Polybench (3): cannot calculate

Benchmark	This PR	baseline
Polybench_2mm	-	1.039000 ms
Polybench_3mm	-	1.477000 ms
Polybench_Atax	-	6.402000 ms

Relative perf in group Kmeans (1): cannot calculate

Benchmark	This PR	baseline
Kmeans_fp32	-	14.111000 ms

Relative perf in group LinearRegressionCoeff (1): cannot calculate

Benchmark	This PR	baseline
LinearRegressionCoeff_fp32	-	881.915000 ms

Relative perf in group MolecularDynamics (1): cannot calculate

Benchmark	This PR	baseline
MolecularDynamics	-	0.030000 ms

Relative perf in group llama.cpp (6): cannot calculate

Benchmark	This PR	baseline
llama.cpp Prompt Processing Batched 128	-	830.097430 token/s
llama.cpp Text Generation Batched 128	-	62.790938 token/s
llama.cpp Prompt Processing Batched 256	-	878.291089 token/s
llama.cpp Text Generation Batched 256	-	62.777001 token/s
llama.cpp Prompt Processing Batched 512	-	435.723514 token/s
llama.cpp Text Generation Batched 512	-	62.788791 token/s

LD_PRELOAD=libjemalloc.so

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv --benchmark_filter=glibc
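
The command above asks for CSV output. A minimal sketch of reading such output in Python follows. It assumes the usual Google Benchmark CSV columns (`name`, `real_time`, `time_unit`, and so on); the sample rows and the `parse_benchmark_csv` helper are hypothetical and not taken from this run, so check the header against real output before relying on it.

```python
import csv
import io

# Hypothetical sample in the CSV layout Google Benchmark is assumed to emit.
sample = '''name,iterations,real_time,cpu_time,time_unit,bytes_per_second,items_per_second,label,error_occurred,error_message
"example_bench/threads:1",200000,706.907,706.5,ns,,,,,
"example_bench/threads:4",200000,1274.57,5090.1,ns,,,,,
'''

def parse_benchmark_csv(text):
    """Map benchmark name -> (real_time, time_unit); skip rows without a time."""
    results = {}
    for row in csv.DictReader(io.StringIO(text)):
        if row.get("real_time"):
            results[row["name"]] = (float(row["real_time"]), row["time_unit"])
    return results

results = parse_benchmark_csv(sample)
for name, (value, unit) in results.items():
    print(f"{name}\t{value:.3f} {unit}")
```

Keying the result by benchmark name makes it straightforward to join a PR run against a baseline run later.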

github-actions · 2025-01-30T17:25:11Z

Compute Benchmarks level_zero run (with params: --iterations-stddev 2 --iterations 2):
https://github.com/oneapi-src/unified-runtime/actions/runs/13057758246

github-actions · 2025-01-30T17:37:27Z

Compute Benchmarks level_zero run (--iterations-stddev 2 --iterations 2):
https://github.com/oneapi-src/unified-runtime/actions/runs/13057758246
Job status: success. Test status: success.

Summary

Total 42 benchmarks in mean.
Geomean 70.702%.
Improved 7 Regressed 21 (threshold 2.00%)
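
The Improved and Regressed counts above are consistent with comparing each benchmark's Change value against the ±2.00% threshold. A minimal Python sketch of that classification follows. It assumes a strict comparison (the bot's actual implementation is not shown in this thread), the `classify` helper is hypothetical, and the inputs are the Change values of the 42 rows in this comment that have both a PR and a baseline result.

```python
from collections import Counter

THRESHOLD = 2.00  # percent, as stated in the summary line

def classify(change_pct, threshold=THRESHOLD):
    """Label one benchmark by its Change value (hypothetical helper)."""
    if change_pct > threshold:
        return "improved"
    if change_pct < -threshold:
        return "regressed"
    return "unchanged"

# Change column (%) of every row with both a PR and a baseline value,
# copied from the tables of this comment, one line per group.
changes = [
    2049.48, 0.08, -1.34, -9.75, -14.08,
    572.82, -0.43, -1.51, -2.21, -3.53,
    917.66, 8.32, -0.16, -2.37, -5.96,
    249.90, 0.65, -1.92, -3.30, -3.56,
    48.97, -1.52, -2.77, -6.21, -8.80, -76.97,
    5.72, 0.50, -0.00, -0.48, -0.85, -1.04, -1.45, -4.28,
    -8.46, -10.72, -12.66, -12.89, -98.69, -98.85, -99.66, -99.84,
]

counts = Counter(classify(c) for c in changes)
print(len(changes), counts["improved"], counts["regressed"])
```

Rows that have no baseline (for example the jemalloc entries) carry no Change value and are left out of the input.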

(result is better)

Performance change in benchmark groups

Relative perf in group alloc/size:10000/0/4096/iterations:200000/threads:4 (8): 175.093%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/size:10000/0/4096/iterations:200000/threads:4 umfProxy	125.078000 ns	2688.530 ns	2149.48%	2049.48%	++++++++++
alloc/size:10000/0/4096/iterations:200000/threads:4 scalable_pool<os_provider>	287.486000 ns	287.722 ns	100.08%	0.08%	.
alloc/size:10000/0/4096/iterations:200000/threads:4 proxy_pool<os_provider>	3139.720 ns	3097.620000 ns	98.66%	-1.34%	.
alloc/size:10000/0/4096/iterations:200000/threads:4 os_provider	2341.950 ns	2113.560000 ns	90.25%	-9.75%	.
alloc/size:10000/0/4096/iterations:200000/threads:4 glibc	2867.930 ns	2464.050000 ns	85.92%	-14.08%	.
alloc/size:10000/0/4096/iterations:200000/threads:4 disjoint_pool<os_provider>	4886.400000 ns	-
alloc/size:10000/0/4096/iterations:200000/threads:4 jemalloc_pool<os_provider>	3675.650000 ns	-
alloc/size:10000/0/4096/iterations:200000/threads:4 jemalloc	115.317000 ns	-
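
The figures in the table above are consistent with relative perf being baseline divided by this PR (these are times in ns, so lower is better) and with the group figure being the geometric mean over the rows that have both values. A minimal sketch under those assumptions follows; `relative_perf` and `group_geomean` are hypothetical helpers, not the bot's code, and a throughput metric such as GB/s would presumably need the ratio inverted.

```python
import math

def relative_perf(pr, baseline):
    """Relative perf for a lower-is-better metric: baseline / PR."""
    return baseline / pr

def group_geomean(rows):
    """Geometric mean of relative perf over rows that have both values."""
    ratios = [relative_perf(pr, base) for pr, base in rows
              if pr is not None and base is not None]
    if not ratios:
        return None  # nothing to compare, i.e. "cannot calculate"
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# (This PR, baseline) in ns for the threads:4 group above; None stands for "-".
rows = [
    (125.078, 2688.530),   # umfProxy
    (287.486, 287.722),    # scalable_pool<os_provider>
    (3139.720, 3097.620),  # proxy_pool<os_provider>
    (2341.950, 2113.560),  # os_provider
    (2867.930, 2464.050),  # glibc
    (4886.400, None),      # disjoint_pool<os_provider>
    (3675.650, None),      # jemalloc_pool<os_provider>
    (115.317, None),       # jemalloc
]

print(f"umfProxy: {relative_perf(*rows[0]) * 100:.2f}%")
print(f"group:    {group_geomean(rows) * 100:.3f}%")
```

With these inputs the group value lands close to the 175.093% reported for the group, and a group where every PR value is missing returns `None`.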

Relative perf in group alloc/size:10000/0/4096/iterations:200000/threads:1 (8): 144.153%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/size:10000/0/4096/iterations:200000/threads:1 umfProxy	104.878000 ns	705.635 ns	672.82%	572.82%	+++
alloc/size:10000/0/4096/iterations:200000/threads:1 proxy_pool<os_provider>	273.410 ns	272.237000 ns	99.57%	-0.43%	.
alloc/size:10000/0/4096/iterations:200000/threads:1 glibc	709.090 ns	698.410000 ns	98.49%	-1.51%	.
alloc/size:10000/0/4096/iterations:200000/threads:1 os_provider	195.642 ns	191.313000 ns	97.79%	-2.21%	.
alloc/size:10000/0/4096/iterations:200000/threads:1 scalable_pool<os_provider>	216.395 ns	208.759000 ns	96.47%	-3.53%	.
alloc/size:10000/0/4096/iterations:200000/threads:1 disjoint_pool<os_provider>	504.066000 ns	-
alloc/size:10000/0/4096/iterations:200000/threads:1 jemalloc_pool<os_provider>	119.657000 ns	-
alloc/size:10000/0/4096/iterations:200000/threads:1 jemalloc	83.827600 ns	-

Relative perf in group alloc/size:10000/100000/4096/iterations:200000/threads:4 (8): 158.822%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/size:10000/100000/4096/iterations:200000/threads:4 umfProxy	120.480000 ns	1226.080 ns	1017.66%	917.66%	++++
alloc/size:10000/100000/4096/iterations:200000/threads:4 os_provider	1881.740000 ns	2038.360 ns	108.32%	8.32%	.
alloc/size:10000/100000/4096/iterations:200000/threads:4 glibc	1276.630 ns	1274.570000 ns	99.84%	-0.16%	.
alloc/size:10000/100000/4096/iterations:200000/threads:4 proxy_pool<os_provider>	3419.580 ns	3338.690000 ns	97.63%	-2.37%	.
alloc/size:10000/100000/4096/iterations:200000/threads:4 scalable_pool<os_provider>	278.118 ns	261.553000 ns	94.04%	-5.96%	.
alloc/size:10000/100000/4096/iterations:200000/threads:4 disjoint_pool<os_provider>	4683.850000 ns	-
alloc/size:10000/100000/4096/iterations:200000/threads:4 jemalloc_pool<os_provider>	3476.140000 ns	-
alloc/size:10000/100000/4096/iterations:200000/threads:4 jemalloc	105.754000 ns	-

Relative perf in group alloc/size:10000/100000/4096/iterations:200000/threads:1 (8): 126.359%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/size:10000/100000/4096/iterations:200000/threads:1 umfProxy	202.191000 ns	707.467 ns	349.90%	249.90%	+
alloc/size:10000/100000/4096/iterations:200000/threads:1 proxy_pool<os_provider>	308.898000 ns	310.903 ns	100.65%	0.65%	.
alloc/size:10000/100000/4096/iterations:200000/threads:1 glibc	720.780 ns	706.907000 ns	98.08%	-1.92%	.
alloc/size:10000/100000/4096/iterations:200000/threads:1 scalable_pool<os_provider>	203.250 ns	196.551000 ns	96.70%	-3.30%	.
alloc/size:10000/100000/4096/iterations:200000/threads:1 os_provider	196.533 ns	189.545000 ns	96.44%	-3.56%	.
alloc/size:10000/100000/4096/iterations:200000/threads:1 disjoint_pool<os_provider>	497.786000 ns	-
alloc/size:10000/100000/4096/iterations:200000/threads:1 jemalloc_pool<os_provider>	124.563000 ns	-
alloc/size:10000/100000/4096/iterations:200000/threads:1 jemalloc	85.477400 ns	-

Relative perf in group alloc/min (10): 80.930%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 umfProxy	558.978000 ns	832.725 ns	148.97%	48.97%	.
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 glibc	177.448 ns	174.753000 ns	98.48%	-1.52%	.
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 scalable_pool<os_provider>	986.078 ns	958.800000 ns	97.23%	-2.77%	.
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 scalable_pool<os_provider>	1029.760 ns	965.779000 ns	93.79%	-6.21%	.
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 glibc	874.017 ns	797.092000 ns	91.20%	-8.80%	.
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 umfProxy	769.181 ns	177.130000 ns	23.03%	-76.97%	.
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 jemalloc_pool<os_provider>	4336.230000 ns	-
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 jemalloc_pool<os_provider>	356.478000 ns	-
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 jemalloc	369.412000 ns	-
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 jemalloc	259.044000 ns	-

Relative perf in group multiple (26): 26.243%

Benchmark	This PR	baseline	Relative perf	Change	-
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 scalable_pool<os_provider>	15530.100000 ns	16418.600 ns	105.72%	5.72%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 os_provider	145697.000000 ns	146423.000 ns	100.50%	0.50%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 glibc	4283.890 ns	4283.690000 ns	100.00%	-0.00%	.
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 scalable_pool<os_provider>	75817.200 ns	75451.700000 ns	99.52%	-0.48%	.
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 scalable_pool<os_provider>	25743.700 ns	25525.500000 ns	99.15%	-0.85%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 glibc	33503.600 ns	33153.600000 ns	98.96%	-1.04%	.
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 glibc	140397.000 ns	138360.000000 ns	98.55%	-1.45%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 proxy_pool<os_provider>	169540.000 ns	162279.000000 ns	95.72%	-4.28%	.
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 glibc	33766.800 ns	30910.300000 ns	91.54%	-8.46%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 proxy_pool<os_provider>	1301690.000 ns	1162100.000000 ns	89.28%	-10.72%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 os_provider	1345340.000 ns	1174970.000000 ns	87.34%	-12.66%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 scalable_pool<os_provider>	47572.100 ns	41438.000000 ns	87.11%	-12.89%	.
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 umfProxy	10698100.000 ns	140162.000000 ns	1.31%	-98.69%	.
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 umfProxy	2622980.000 ns	30121.800000 ns	1.15%	-98.85%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 umfProxy	8020940.000 ns	27477.700000 ns	0.34%	-99.66%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 umfProxy	2607090.000 ns	4208.520000 ns	0.16%	-99.84%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 disjoint_pool<os_provider>	1780830.000000 ns	-
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 disjoint_pool<os_provider>	216329.000000 ns	-
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 jemalloc_pool<os_provider>	517181.000000 ns	-
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 jemalloc_pool<os_provider>	24685.200000 ns	-
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 jemalloc_pool<os_provider>	637670.000000 ns	-
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 jemalloc_pool<os_provider>	60627.000000 ns	-
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 jemalloc	30705.300000 ns	-
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 jemalloc	24048.400000 ns	-
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 jemalloc	48735.200000 ns	-
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 jemalloc	26814.000000 ns	-

Relative perf in group api (12): cannot calculate

Benchmark	This PR	baseline
api_overhead_benchmark_l0 SubmitKernel out of order	-	11.369000 μs
api_overhead_benchmark_l0 SubmitKernel in order	-	11.395000 μs
api_overhead_benchmark_sycl SubmitKernel out of order	-	23.506000 μs
api_overhead_benchmark_sycl SubmitKernel in order	-	24.407000 μs
api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024	-	2.149000 μs
api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024	-	1.673000 μs
api_overhead_benchmark_ur SubmitKernel out of order CPU count	-	104663.000000 instr
api_overhead_benchmark_ur SubmitKernel out of order	-	15.866000 μs
api_overhead_benchmark_ur SubmitKernel in order CPU count	-	110006.000000 instr
api_overhead_benchmark_ur SubmitKernel in order	-	16.785000 μs
api_overhead_benchmark_ur SubmitKernel in order with measure completion CPU count	-	123166.000000 instr
api_overhead_benchmark_ur SubmitKernel in order with measure completion	-	21.495000 μs

Relative perf in group memory (4): cannot calculate

Benchmark	This PR	baseline
memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024	-	252.914000 μs
memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024	-	219.832000 μs
memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024	-	5.900000 μs
memory_benchmark_sycl StreamMemory, placement Device, type Triad, size 10240	-	3.070000 GB/s

Relative perf in group miscellaneous (1): cannot calculate

Benchmark	This PR	baseline
miscellaneous_benchmark_sycl VectorSum	-	858.023000 bw GB/s

Relative perf in group multithread (10): cannot calculate

Benchmark	This PR	baseline
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:1 dstUSM:1	-	6896.127000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:1 dstUSM:1	-	17165.065000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:1 dstUSM:1	-	46811.855000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:1 dstUSM:1	-	2047.766000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:0 dstUSM:1	-	7766.797000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:0 dstUSM:1	-	8883.578000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:0 dstUSM:1	-	27030.035000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:0 dstUSM:1	-	1199.669000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:1, allocSize:1024 srcUSM:0 dstUSM:1 without events	-	42602.254000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:4, allocSize:1024 srcUSM:0 dstUSM:1 without events	-	112408.658000 μs

Relative perf in group graph (10): cannot calculate

Benchmark	This PR	baseline
graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:10	-	71746.038000 μs
graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:10	-	72583.103000 μs
graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:100	-	353349.563000 μs
graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:100	-	353086.695000 μs
graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:1, numKernels:10	-	55.253000 μs
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:10	-	62.493000 μs
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:100	-	677.203000 μs
graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:0, numKernels:10	-	5621.320000 μs
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:10	-	5631.730000 μs
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:100	-	56454.921000 μs

Relative perf in group Velocity-Bench (9): cannot calculate

Benchmark	This PR	baseline
Velocity-Bench Hashtable	-	363.339623 M keys/sec
Velocity-Bench Bitcracker	-	38.359400 s
Velocity-Bench CudaSift	-	203.947000 ms
Velocity-Bench Easywave	-	227.000000 ms
Velocity-Bench QuickSilver	-	116.460000 MMS/CTT
Velocity-Bench Sobel Filter	-	603.076000 ms
Velocity-Bench dl-cifar	-	23.630300 s
Velocity-Bench dl-mnist	-	2.710000 s
Velocity-Bench svm	-	0.135900 s

Relative perf in group Runtime (8): cannot calculate

Benchmark	This PR	baseline
Runtime_IndependentDAGTaskThroughput_SingleTask	-	259.444000 ms
Runtime_IndependentDAGTaskThroughput_BasicParallelFor	-	274.274000 ms
Runtime_IndependentDAGTaskThroughput_HierarchicalParallelFor	-	275.173000 ms
Runtime_IndependentDAGTaskThroughput_NDRangeParallelFor	-	276.461000 ms
Runtime_DAGTaskThroughput_SingleTask	-	1648.643000 ms
Runtime_DAGTaskThroughput_BasicParallelFor	-	1704.436000 ms
Runtime_DAGTaskThroughput_HierarchicalParallelFor	-	1710.439000 ms
Runtime_DAGTaskThroughput_NDRangeParallelFor	-	1673.462000 ms

Relative perf in group MicroBench (14): cannot calculate

Benchmark	This PR	baseline
MicroBench_HostDeviceBandwidth_1D_H2D_Contiguous	-	4.526000 ms
MicroBench_HostDeviceBandwidth_2D_H2D_Contiguous	-	4.585000 ms
MicroBench_HostDeviceBandwidth_3D_H2D_Contiguous	-	4.376000 ms
MicroBench_HostDeviceBandwidth_1D_D2H_Contiguous	-	4.456000 ms
MicroBench_HostDeviceBandwidth_2D_D2H_Contiguous	-	617.437000 ms
MicroBench_HostDeviceBandwidth_3D_D2H_Contiguous	-	617.442000 ms
MicroBench_HostDeviceBandwidth_1D_H2D_Strided	-	4.276000 ms
MicroBench_HostDeviceBandwidth_2D_H2D_Strided	-	4.940000 ms
MicroBench_HostDeviceBandwidth_3D_H2D_Strided	-	4.909000 ms
MicroBench_HostDeviceBandwidth_1D_D2H_Strided	-	4.716000 ms
MicroBench_HostDeviceBandwidth_2D_D2H_Strided	-	616.834000 ms
MicroBench_HostDeviceBandwidth_3D_D2H_Strided	-	616.784000 ms
MicroBench_LocalMem_int32_4096	-	29.862000 ms
MicroBench_LocalMem_fp32_4096	-	29.902000 ms

Relative perf in group Pattern (10): cannot calculate

Benchmark	This PR	baseline
Pattern_Reduction_NDRange_int32	-	16.339000 ms
Pattern_Reduction_Hierarchical_int32	-	16.339000 ms
Pattern_SegmentedReduction_NDRange_int16	-	2.265000 ms
Pattern_SegmentedReduction_NDRange_int32	-	2.165000 ms
Pattern_SegmentedReduction_NDRange_int64	-	2.337000 ms
Pattern_SegmentedReduction_NDRange_fp32	-	2.168000 ms
Pattern_SegmentedReduction_Hierarchical_int16	-	11.796000 ms
Pattern_SegmentedReduction_Hierarchical_int32	-	11.588000 ms
Pattern_SegmentedReduction_Hierarchical_int64	-	11.782000 ms
Pattern_SegmentedReduction_Hierarchical_fp32	-	11.587000 ms

Relative perf in group ScalarProduct (6): cannot calculate

Benchmark	This PR	baseline
ScalarProduct_NDRange_int32	-	3.765000 ms
ScalarProduct_NDRange_int64	-	5.425000 ms
ScalarProduct_NDRange_fp32	-	3.749000 ms
ScalarProduct_Hierarchical_int32	-	10.541000 ms
ScalarProduct_Hierarchical_int64	-	11.490000 ms
ScalarProduct_Hierarchical_fp32	-	10.167000 ms

Relative perf in group USM (7): cannot calculate

Benchmark	This PR	baseline
USM_Allocation_latency_fp32_device	-	0.065000 ms
USM_Allocation_latency_fp32_host	-	37.623000 ms
USM_Allocation_latency_fp32_shared	-	0.062000 ms
USM_Instr_Mix_fp32_device_1:1mix_with_init_no_prefetch	-	1.737000 ms
USM_Instr_Mix_fp32_host_1:1mix_with_init_no_prefetch	-	1.087000 ms
USM_Instr_Mix_fp32_device_1:1mix_no_init_no_prefetch	-	1.893000 ms
USM_Instr_Mix_fp32_host_1:1mix_no_init_no_prefetch	-	1.258000 ms

Relative perf in group VectorAddition (3): cannot calculate

Benchmark	This PR	baseline
VectorAddition_int32	-	1.477000 ms
VectorAddition_int64	-	3.088000 ms
VectorAddition_fp32	-	1.480000 ms

Relative perf in group Polybench (3): cannot calculate

Benchmark	This PR	baseline
Polybench_2mm	-	1.039000 ms
Polybench_3mm	-	1.477000 ms
Polybench_Atax	-	6.402000 ms

Relative perf in group Kmeans (1): cannot calculate

Benchmark	This PR	baseline
Kmeans_fp32	-	14.111000 ms

Relative perf in group LinearRegressionCoeff (1): cannot calculate

Benchmark	This PR	baseline
LinearRegressionCoeff_fp32	-	881.915000 ms

Relative perf in group MolecularDynamics (1): cannot calculate

Benchmark	This PR	baseline
MolecularDynamics	-	0.030000 ms

Relative perf in group llama.cpp (6): cannot calculate

Benchmark	This PR	baseline
llama.cpp Prompt Processing Batched 128	-	830.097430 token/s
llama.cpp Text Generation Batched 128	-	62.790938 token/s
llama.cpp Prompt Processing Batched 256	-	878.291089 token/s
llama.cpp Text Generation Batched 256	-	62.777001 token/s
llama.cpp Prompt Processing Batched 512	-	435.723514 token/s
llama.cpp Text Generation Batched 512	-	62.788791 token/s

LD_PRELOAD=libjemalloc.so

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv --benchmark_filter=glibc

github-actions · 2025-01-31T12:36:21Z

Compute Benchmarks level_zero run (with params: --iterations-stddev 2 --iterations 2):
https://github.com/oneapi-src/unified-runtime/actions/runs/13072873192

github-actions · 2025-01-31T12:51:59Z

Compute Benchmarks level_zero run (--iterations-stddev 2 --iterations 2):
https://github.com/oneapi-src/unified-runtime/actions/runs/13072873192
Job status: success. Test status: success.

Summary

Total 42 benchmarks in mean.
Geomean 70.284%.
Improved 9 Regressed 22 (threshold 2.00%)

(result is better)

Performance change in benchmark groups

Relative perf in group alloc/size:10000/0/4096/iterations:200000/threads:4 (8): 175.757%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/size:10000/0/4096/iterations:200000/threads:4 umfProxy	132.930000 ns	2735.530 ns	2057.87%	1957.87%	++++++++++
alloc/size:10000/0/4096/iterations:200000/threads:4 scalable_pool<os_provider>	299.151000 ns	306.767 ns	102.55%	2.55%	.
alloc/size:10000/0/4096/iterations:200000/threads:4 os_provider	2342.330 ns	2192.650000 ns	93.61%	-6.39%	.
alloc/size:10000/0/4096/iterations:200000/threads:4 glibc	2834.960 ns	2620.060000 ns	92.42%	-7.58%	.
alloc/size:10000/0/4096/iterations:200000/threads:4 proxy_pool<os_provider>	3455.850 ns	3174.620000 ns	91.86%	-8.14%	.
alloc/size:10000/0/4096/iterations:200000/threads:4 disjoint_pool<os_provider>	4683.370000 ns	-
alloc/size:10000/0/4096/iterations:200000/threads:4 jemalloc_pool<os_provider>	3318.150000 ns	-
alloc/size:10000/0/4096/iterations:200000/threads:4 jemalloc	116.434000 ns	-

Relative perf in group alloc/size:10000/0/4096/iterations:200000/threads:1 (8): 146.171%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/size:10000/0/4096/iterations:200000/threads:1 umfProxy	102.201000 ns	711.693 ns	696.37%	596.37%	+++
alloc/size:10000/0/4096/iterations:200000/threads:1 os_provider	190.621000 ns	195.988 ns	102.82%	2.82%	.
alloc/size:10000/0/4096/iterations:200000/threads:1 glibc	709.139000 ns	710.790 ns	100.23%	0.23%	.
alloc/size:10000/0/4096/iterations:200000/threads:1 scalable_pool<os_provider>	216.513 ns	213.992000 ns	98.84%	-1.16%	.
alloc/size:10000/0/4096/iterations:200000/threads:1 proxy_pool<os_provider>	288.393 ns	271.315000 ns	94.08%	-5.92%	.
alloc/size:10000/0/4096/iterations:200000/threads:1 disjoint_pool<os_provider>	494.155000 ns	-
alloc/size:10000/0/4096/iterations:200000/threads:1 jemalloc_pool<os_provider>	120.109000 ns	-
alloc/size:10000/0/4096/iterations:200000/threads:1 jemalloc	83.974500 ns	-

Relative perf in group alloc/size:10000/100000/4096/iterations:200000/threads:4 (8): 153.395%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/size:10000/100000/4096/iterations:200000/threads:4 umfProxy	127.946000 ns	1230.060 ns	961.39%	861.39%	++++
alloc/size:10000/100000/4096/iterations:200000/threads:4 proxy_pool<os_provider>	3392.580 ns	3386.980000 ns	99.83%	-0.17%	.
alloc/size:10000/100000/4096/iterations:200000/threads:4 glibc	1290.880 ns	1267.280000 ns	98.17%	-1.83%	.
alloc/size:10000/100000/4096/iterations:200000/threads:4 scalable_pool<os_provider>	266.420 ns	253.226000 ns	95.05%	-4.95%	.
alloc/size:10000/100000/4096/iterations:200000/threads:4 os_provider	2042.030 ns	1936.480000 ns	94.83%	-5.17%	.
alloc/size:10000/100000/4096/iterations:200000/threads:4 disjoint_pool<os_provider>	4713.150000 ns	-
alloc/size:10000/100000/4096/iterations:200000/threads:4 jemalloc_pool<os_provider>	3310.840000 ns	-
alloc/size:10000/100000/4096/iterations:200000/threads:4 jemalloc	107.088000 ns	-

Relative perf in group alloc/size:10000/100000/4096/iterations:200000/threads:1 (8): 111.552%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/size:10000/100000/4096/iterations:200000/threads:1 umfProxy	203.756000 ns	730.895 ns	358.71%	258.71%	+
alloc/size:10000/100000/4096/iterations:200000/threads:1 os_provider	192.770000 ns	192.935 ns	100.09%	0.09%	.
alloc/size:10000/100000/4096/iterations:200000/threads:1 proxy_pool<os_provider>	322.983 ns	299.838000 ns	92.83%	-7.17%	.
alloc/size:10000/100000/4096/iterations:200000/threads:1 scalable_pool<os_provider>	269.559 ns	206.336000 ns	76.55%	-23.45%	.
alloc/size:10000/100000/4096/iterations:200000/threads:1 glibc	1075.200 ns	727.999000 ns	67.71%	-32.29%	.
alloc/size:10000/100000/4096/iterations:200000/threads:1 disjoint_pool<os_provider>	512.818000 ns	-
alloc/size:10000/100000/4096/iterations:200000/threads:1 jemalloc_pool<os_provider>	119.699000 ns	-
alloc/size:10000/100000/4096/iterations:200000/threads:1 jemalloc	83.848400 ns	-

Relative perf in group alloc/min (10): 84.067%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 umfProxy	584.576000 ns	834.560 ns	142.76%	42.76%	.
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 scalable_pool<os_provider>	1051.670000 ns	1128.250 ns	107.28%	7.28%	.
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 scalable_pool<os_provider>	963.425000 ns	968.189 ns	100.49%	0.49%	.
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 glibc	177.373 ns	177.227000 ns	99.92%	-0.08%	.
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 glibc	882.652 ns	809.442000 ns	91.71%	-8.29%	.
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 umfProxy	728.316 ns	182.287000 ns	25.03%	-74.97%	.
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 jemalloc_pool<os_provider>	4316.560000 ns	-
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 jemalloc_pool<os_provider>	408.532000 ns	-
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 jemalloc	454.859000 ns	-
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 jemalloc	260.657000 ns	-

Relative perf in group multiple (26): 26.626%

Benchmark	This PR	baseline	Relative perf	Change	-
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 os_provider	141791.000000 ns	144859.000 ns	102.16%	2.16%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 scalable_pool<os_provider>	15048.000000 ns	15279.900 ns	101.54%	1.54%	.
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 scalable_pool<os_provider>	25436.400 ns	25041.800000 ns	98.45%	-1.55%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 proxy_pool<os_provider>	1201200.000 ns	1181150.000000 ns	98.33%	-1.67%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 proxy_pool<os_provider>	163895.000 ns	160647.000000 ns	98.02%	-1.98%	.
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 scalable_pool<os_provider>	77240.700 ns	75687.100000 ns	97.99%	-2.01%	.
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 glibc	30853.700 ns	30222.700000 ns	97.95%	-2.05%	.
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 glibc	142499.000 ns	139089.000000 ns	97.61%	-2.39%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 glibc	4309.030 ns	4200.920000 ns	97.49%	-2.51%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 os_provider	1199190.000 ns	1162710.000000 ns	96.96%	-3.04%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 glibc	33278.800 ns	31133.200000 ns	93.55%	-6.45%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 scalable_pool<os_provider>	47485.600 ns	41527.800000 ns	87.45%	-12.55%	.
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 umfProxy	10815000.000 ns	138580.000000 ns	1.28%	-98.72%	-
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 umfProxy	2624880.000 ns	31018.400000 ns	1.18%	-98.82%	-
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 umfProxy	7634970.000 ns	27865.300000 ns	0.36%	-99.64%	-
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 umfProxy	2619500.000 ns	4241.250000 ns	0.16%	-99.84%	-
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 disjoint_pool<os_provider>	1757030.000000 ns	-
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 disjoint_pool<os_provider>	216294.000000 ns	-
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 jemalloc_pool<os_provider>	522324.000000 ns	-
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 jemalloc_pool<os_provider>	24844.800000 ns	-
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 jemalloc_pool<os_provider>	636413.000000 ns	-
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 jemalloc_pool<os_provider>	59566.400000 ns	-
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 jemalloc	29733.900000 ns	-
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 jemalloc	25184.600000 ns	-
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 jemalloc	49547.600000 ns	-
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 jemalloc	26554.300000 ns	-

Relative perf in group api (12): cannot calculate

Benchmark	This PR	baseline
api_overhead_benchmark_l0 SubmitKernel out of order	-	11.868000 μs
api_overhead_benchmark_l0 SubmitKernel in order	-	11.418000 μs
api_overhead_benchmark_sycl SubmitKernel out of order	-	22.969000 μs
api_overhead_benchmark_sycl SubmitKernel in order	-	24.133000 μs
api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024	-	2.113000 μs
api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024	-	1.679000 μs
api_overhead_benchmark_ur SubmitKernel out of order CPU count	-	104663.000000 instr
api_overhead_benchmark_ur SubmitKernel out of order	-	15.750000 μs
api_overhead_benchmark_ur SubmitKernel in order CPU count	-	110006.000000 instr
api_overhead_benchmark_ur SubmitKernel in order	-	16.241000 μs
api_overhead_benchmark_ur SubmitKernel in order with measure completion CPU count	-	122876.000000 instr
api_overhead_benchmark_ur SubmitKernel in order with measure completion	-	21.005000 μs

Relative perf in group memory (4): cannot calculate

Benchmark	This PR	baseline
memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024	-	251.872000 μs
memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024	-	132.472000 μs
memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024	-	5.573000 μs
memory_benchmark_sycl StreamMemory, placement Device, type Triad, size 10240	-	3.158000 GB/s

Relative perf in group miscellaneous (1): cannot calculate

Benchmark	This PR	baseline
miscellaneous_benchmark_sycl VectorSum	-	860.664000 bw GB/s

Relative perf in group multithread (10): cannot calculate

Benchmark	This PR	baseline
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:1 dstUSM:1	-	6939.950000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:1 dstUSM:1	-	17154.077000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:1 dstUSM:1	-	46935.372000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:1 dstUSM:1	-	2093.086000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:0 dstUSM:1	-	7472.404000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:0 dstUSM:1	-	8689.121000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:0 dstUSM:1	-	25587.435000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:0 dstUSM:1	-	1201.865000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:1, allocSize:1024 srcUSM:0 dstUSM:1 without events	-	40846.653000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:4, allocSize:1024 srcUSM:0 dstUSM:1 without events	-	112790.682000 μs

Relative perf in group graph (10): cannot calculate

Benchmark	This PR	baseline
graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:10	-	71747.470000 μs
graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:10	-	72642.878000 μs
graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:100	-	353339.946000 μs
graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:100	-	353502.721000 μs
graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:1, numKernels:10	-	54.566000 μs
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:10	-	62.367000 μs
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:100	-	674.284000 μs
graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:0, numKernels:10	-	5721.966000 μs
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:10	-	5688.177000 μs
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:100	-	57817.523000 μs

Relative perf in group Velocity-Bench (9): cannot calculate

Benchmark	This PR	baseline
Velocity-Bench Hashtable	-	358.375158 M keys/sec
Velocity-Bench Bitcracker	-	35.965200 s
Velocity-Bench CudaSift	-	201.701000 ms
Velocity-Bench Easywave	-	226.000000 ms
Velocity-Bench QuickSilver	-	117.580000 MMS/CTT
Velocity-Bench Sobel Filter	-	611.944000 ms
Velocity-Bench dl-cifar	-	23.442800 s
Velocity-Bench dl-mnist	-	2.720000 s
Velocity-Bench svm	-	0.134300 s

Relative perf in group Runtime (8): cannot calculate

Benchmark	This PR	baseline
Runtime_IndependentDAGTaskThroughput_SingleTask	-	268.614000 ms
Runtime_IndependentDAGTaskThroughput_BasicParallelFor	-	277.626000 ms
Runtime_IndependentDAGTaskThroughput_HierarchicalParallelFor	-	277.078000 ms
Runtime_IndependentDAGTaskThroughput_NDRangeParallelFor	-	277.264000 ms
Runtime_DAGTaskThroughput_SingleTask	-	1688.724000 ms
Runtime_DAGTaskThroughput_BasicParallelFor	-	1764.745000 ms
Runtime_DAGTaskThroughput_HierarchicalParallelFor	-	1737.282000 ms
Runtime_DAGTaskThroughput_NDRangeParallelFor	-	1705.559000 ms

Relative perf in group MicroBench (14): cannot calculate

Benchmark	This PR	baseline
MicroBench_HostDeviceBandwidth_1D_H2D_Contiguous	-	5.241000 ms
MicroBench_HostDeviceBandwidth_2D_H2D_Contiguous	-	4.991000 ms
MicroBench_HostDeviceBandwidth_3D_H2D_Contiguous	-	4.763000 ms
MicroBench_HostDeviceBandwidth_1D_D2H_Contiguous	-	4.863000 ms
MicroBench_HostDeviceBandwidth_2D_D2H_Contiguous	-	618.230000 ms
MicroBench_HostDeviceBandwidth_3D_D2H_Contiguous	-	618.282000 ms
MicroBench_HostDeviceBandwidth_1D_H2D_Strided	-	4.928000 ms
MicroBench_HostDeviceBandwidth_2D_H2D_Strided	-	5.197000 ms
MicroBench_HostDeviceBandwidth_3D_H2D_Strided	-	5.079000 ms
MicroBench_HostDeviceBandwidth_1D_D2H_Strided	-	5.207000 ms
MicroBench_HostDeviceBandwidth_2D_D2H_Strided	-	617.816000 ms
MicroBench_HostDeviceBandwidth_3D_D2H_Strided	-	617.727000 ms
MicroBench_LocalMem_int32_4096	-	29.924000 ms
MicroBench_LocalMem_fp32_4096	-	29.864000 ms

Relative perf in group Pattern (10): cannot calculate

Benchmark	This PR	baseline
Pattern_Reduction_NDRange_int32	-	16.761000 ms
Pattern_Reduction_Hierarchical_int32	-	16.736000 ms
Pattern_SegmentedReduction_NDRange_int16	-	2.264000 ms
Pattern_SegmentedReduction_NDRange_int32	-	2.166000 ms
Pattern_SegmentedReduction_NDRange_int64	-	2.337000 ms
Pattern_SegmentedReduction_NDRange_fp32	-	2.165000 ms
Pattern_SegmentedReduction_Hierarchical_int16	-	11.801000 ms
Pattern_SegmentedReduction_Hierarchical_int32	-	11.589000 ms
Pattern_SegmentedReduction_Hierarchical_int64	-	11.771000 ms
Pattern_SegmentedReduction_Hierarchical_fp32	-	11.590000 ms

Relative perf in group ScalarProduct (6): cannot calculate

Benchmark	This PR	baseline
ScalarProduct_NDRange_int32	-	3.744000 ms
ScalarProduct_NDRange_int64	-	5.440000 ms
ScalarProduct_NDRange_fp32	-	3.760000 ms
ScalarProduct_Hierarchical_int32	-	10.507000 ms
ScalarProduct_Hierarchical_int64	-	11.485000 ms
ScalarProduct_Hierarchical_fp32	-	10.152000 ms

Relative perf in group USM (7): cannot calculate

Benchmark	This PR	baseline
USM_Allocation_latency_fp32_device	-	0.066000 ms
USM_Allocation_latency_fp32_host	-	37.402000 ms
USM_Allocation_latency_fp32_shared	-	0.065000 ms
USM_Instr_Mix_fp32_device_1:1mix_with_init_no_prefetch	-	1.681000 ms
USM_Instr_Mix_fp32_host_1:1mix_with_init_no_prefetch	-	1.056000 ms
USM_Instr_Mix_fp32_device_1:1mix_no_init_no_prefetch	-	1.838000 ms
USM_Instr_Mix_fp32_host_1:1mix_no_init_no_prefetch	-	1.205000 ms

Relative perf in group VectorAddition (3): cannot calculate

Benchmark	This PR	baseline
VectorAddition_int32	-	1.492000 ms
VectorAddition_int64	-	3.061000 ms
VectorAddition_fp32	-	1.434000 ms

Relative perf in group Polybench (3): cannot calculate

Benchmark	This PR	baseline
Polybench_2mm	-	1.039000 ms
Polybench_3mm	-	1.482000 ms
Polybench_Atax	-	6.416000 ms

Relative perf in group Kmeans (1): cannot calculate

Benchmark	This PR	baseline
Kmeans_fp32	-	14.144000 ms

Relative perf in group LinearRegressionCoeff (1): cannot calculate

Benchmark	This PR	baseline
LinearRegressionCoeff_fp32	-	899.874000 ms

Relative perf in group MolecularDynamics (1): cannot calculate

Benchmark	This PR	baseline
MolecularDynamics	-	0.029000 ms

Relative perf in group llama.cpp (6): cannot calculate

Benchmark	This PR	baseline
llama.cpp Prompt Processing Batched 128	-	824.202968 token/s
llama.cpp Text Generation Batched 128	-	62.990615 token/s
llama.cpp Prompt Processing Batched 256	-	870.375426 token/s
llama.cpp Text Generation Batched 256	-	62.990517 token/s
llama.cpp Prompt Processing Batched 512	-	429.991968 token/s
llama.cpp Text Generation Batched 512	-	62.959741 token/s

LD_PRELOAD=libjemalloc.so

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv --benchmark_filter=glibc
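
A minimal sketch of how a run like the one above could be launched from Python. It assumes that LD_PRELOAD is set so that the benchmarks selected by the glibc filter exercise the preloaded allocator. That reading is an inference from the result tables, not something the log states. The helper name `build_preload_run` and the short binary name are hypothetical; the two `--benchmark_*` flags are the ones shown in the command.

```python
import os


def build_preload_run(binary, preload_lib, benchmark_filter="glibc"):
    """Build the argv list and environment for a preloaded benchmark run.

    Hypothetical helper: it only assembles the inputs and does not start
    the process, so it can be inspected without the benchmark binary.
    """
    # Copy the current environment and add LD_PRELOAD on top of it.
    env = dict(os.environ, LD_PRELOAD=preload_lib)
    cmd = [
        binary,
        "--benchmark_format=csv",
        f"--benchmark_filter={benchmark_filter}",
    ]
    return cmd, env


# "umf-benchmark" stands in for the full path printed in the log.
cmd, env = build_preload_run("umf-benchmark", "libjemalloc.so")
print(env["LD_PRELOAD"], " ".join(cmd))
```

The pair can then be passed to `subprocess.run(cmd, env=env, capture_output=True, text=True)` on a machine where the binary and the library are available.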

lukaszstolarczuk · 2025-01-31T13:32:14Z

.github/workflows/benchmarks-reusable.yml

@@ -176,6 +176,16 @@ jobs:
        -B${{github.workspace}}/umf_build
        -DUMF_BUILD_BENCHMARKS=ON
        -DUMF_TESTS_FAIL_ON_SKIP=ON


FYI, you don't need UMF_TESTS_FAIL_ON_SKIP=ON if you disabled tests 😉

github-actions · 2025-01-31T13:53:59Z

Compute Benchmarks level_zero run (with params: --iterations-stddev 2 --iterations 2):
https://github.com/oneapi-src/unified-runtime/actions/runs/13074127788

github-actions · 2025-01-31T14:01:17Z

Compute Benchmarks level_zero run (--iterations-stddev 2 --iterations 2):
https://github.com/oneapi-src/unified-runtime/actions/runs/13074127788
Job status: success. Test status: success.

Summary

Total 42 benchmarks in mean.
Geomean 98.966%.
Improved 7 Regressed 13 (threshold 2.00%)

(in each row, the better result is the one printed with six decimal places)
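
The "Improved" and "Regressed" counts appear to come from comparing each row's Change value against the threshold. The sketch below assumes a strict comparison and only counts rows that have both a PR and a baseline value. It is not the script the workflow runs; the function name `classify` is hypothetical, and the values are copied from the Change column of the tables in this comment.

```python
# Sketch only: classify per-benchmark changes against the summary threshold.
THRESHOLD = 2.00  # percent, as stated in the summary line


def classify(changes, threshold=THRESHOLD):
    """Return (improved, regressed) counts for a list of percentage changes."""
    improved = sum(1 for c in changes if c > threshold)
    regressed = sum(1 for c in changes if c < -threshold)
    return improved, regressed


# Change column of every row in this run that has both values.
changes = [
    11.02, 6.53, 0.30, -2.35, -2.76,            # alloc size:10000/0, threads:4
    0.91, 0.23, 0.18, -0.23, -0.92,             # alloc size:10000/0, threads:1
    3.01, 2.76, 1.09, 1.07, -2.55,              # alloc size:10000/100000, threads:4
    1.86, 0.31, -0.28, -1.97, -17.36,           # alloc size:10000/100000, threads:1
    3.94, -0.09, -2.70, -3.51, -4.55, -5.97,    # alloc/min
    2.91, 2.14, 0.59, 0.46, 0.40, -0.20, -0.34, -0.83,
    -1.23, -1.24, -1.97, -2.53, -3.24, -4.00, -8.39, -9.83,  # multiple
]

improved, regressed = classify(changes)
print(len(changes), improved, regressed)  # prints: 42 7 13
```

Under these assumptions the counts match the summary: 42 benchmarks in the mean, 7 improved, 13 regressed.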

Performance change in benchmark groups

Relative perf in group alloc/size:10000/0/4096/iterations:200000/threads:4 (7): 102.411%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/size:10000/0/4096/iterations:200000/threads:4 os_provider	1974.950000 ns	2192.650 ns	111.02%	11.02%	++++++
alloc/size:10000/0/4096/iterations:200000/threads:4 proxy_pool<os_provider>	2980.010000 ns	3174.620 ns	106.53%	6.53%	++++
alloc/size:10000/0/4096/iterations:200000/threads:4 glibc	2612.180000 ns	2620.060 ns	100.30%	0.30%	.
alloc/size:10000/0/4096/iterations:200000/threads:4 umfProxy	2801.300 ns	2735.530000 ns	97.65%	-2.35%	-
alloc/size:10000/0/4096/iterations:200000/threads:4 scalable_pool<os_provider>	315.463 ns	306.767000 ns	97.24%	-2.76%	--
alloc/size:10000/0/4096/iterations:200000/threads:4 jemalloc	116.435000 ns	-
alloc/size:10000/0/4096/iterations:200000/threads:4 tbbProxy	285.909000 ns	-
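
The figures in this table are consistent with relative perf being baseline divided by "This PR" for time-based metrics (lower is better), and with the group value being the geometric mean of the rows that have both results. A minimal sketch under that assumption follows. The function names are hypothetical and this is not taken from the workflow's own script. The jemalloc and tbbProxy rows are left out because they have no baseline.

```python
import math


def relative_perf(pr_value, baseline_value, lower_is_better=True):
    """Relative performance in percent; above 100 means the PR is better."""
    if lower_is_better:
        ratio = baseline_value / pr_value
    else:
        ratio = pr_value / baseline_value
    return 100.0 * ratio


def geomean(values):
    """Geometric mean, computed through logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# (This PR, baseline) in ns, copied from the table above.
rows = {
    "os_provider": (1974.95, 2192.65),
    "proxy_pool<os_provider>": (2980.01, 3174.62),
    "glibc": (2612.18, 2620.06),
    "umfProxy": (2801.30, 2735.53),
    "scalable_pool<os_provider>": (315.463, 306.767),
}

perfs = {name: relative_perf(pr, base) for name, (pr, base) in rows.items()}
for name, perf in perfs.items():
    # Change is the relative perf minus 100 percentage points.
    print(f"{name}: {perf:.2f}% ({perf - 100:+.2f}%)")
print(f"group geomean: {geomean(list(perfs.values())):.3f}%")
```

For throughput metrics such as GB/s or token/s the ratio would presumably be inverted, which is what the `lower_is_better` flag is for.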

Relative perf in group alloc/size:10000/0/4096/iterations:200000/threads:1 (7): 100.033%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/size:10000/0/4096/iterations:200000/threads:1 os_provider	194.213000 ns	195.988 ns	100.91%	0.91%	.
alloc/size:10000/0/4096/iterations:200000/threads:1 proxy_pool<os_provider>	270.684000 ns	271.315 ns	100.23%	0.23%	.
alloc/size:10000/0/4096/iterations:200000/threads:1 umfProxy	710.416000 ns	711.693 ns	100.18%	0.18%	.
alloc/size:10000/0/4096/iterations:200000/threads:1 glibc	712.439 ns	710.790000 ns	99.77%	-0.23%	.
alloc/size:10000/0/4096/iterations:200000/threads:1 scalable_pool<os_provider>	215.977 ns	213.992000 ns	99.08%	-0.92%	.
alloc/size:10000/0/4096/iterations:200000/threads:1 jemalloc	83.198600 ns	-
alloc/size:10000/0/4096/iterations:200000/threads:1 tbbProxy	200.493000 ns	-

Relative perf in group alloc/size:10000/100000/4096/iterations:200000/threads:4 (7): 101.053%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/size:10000/100000/4096/iterations:200000/threads:4 os_provider	1879.980000 ns	1936.480 ns	103.01%	3.01%	++
alloc/size:10000/100000/4096/iterations:200000/threads:4 proxy_pool<os_provider>	3296.140000 ns	3386.980 ns	102.76%	2.76%	++
alloc/size:10000/100000/4096/iterations:200000/threads:4 glibc	1253.630000 ns	1267.280 ns	101.09%	1.09%	.
alloc/size:10000/100000/4096/iterations:200000/threads:4 scalable_pool<os_provider>	250.554000 ns	253.226 ns	101.07%	1.07%	.
alloc/size:10000/100000/4096/iterations:200000/threads:4 umfProxy	1262.270 ns	1230.060000 ns	97.45%	-2.55%	-
alloc/size:10000/100000/4096/iterations:200000/threads:4 jemalloc	107.085000 ns	-
alloc/size:10000/100000/4096/iterations:200000/threads:4 tbbProxy	283.559000 ns	-

Relative perf in group alloc/size:10000/100000/4096/iterations:200000/threads:1 (7): 96.234%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/size:10000/100000/4096/iterations:200000/threads:1 os_provider	189.414000 ns	192.935 ns	101.86%	1.86%	.
alloc/size:10000/100000/4096/iterations:200000/threads:1 umfProxy	728.650000 ns	730.895 ns	100.31%	0.31%	.
alloc/size:10000/100000/4096/iterations:200000/threads:1 glibc	730.075 ns	727.999000 ns	99.72%	-0.28%	.
alloc/size:10000/100000/4096/iterations:200000/threads:1 proxy_pool<os_provider>	305.863 ns	299.838000 ns	98.03%	-1.97%	.
alloc/size:10000/100000/4096/iterations:200000/threads:1 scalable_pool<os_provider>	249.688 ns	206.336000 ns	82.64%	-17.36%	----------
alloc/size:10000/100000/4096/iterations:200000/threads:1 jemalloc	83.063600 ns	-
alloc/size:10000/100000/4096/iterations:200000/threads:1 tbbProxy	237.189000 ns	-

Relative perf in group alloc/min (10): 97.801%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 umfProxy	175.385000 ns	182.287 ns	103.94%	3.94%	++
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 glibc	177.380 ns	177.227000 ns	99.91%	-0.09%	.
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 scalable_pool<os_provider>	995.080 ns	968.189000 ns	97.30%	-2.70%	--
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 scalable_pool<os_provider>	1169.240 ns	1128.250000 ns	96.49%	-3.51%	--
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 umfProxy	874.349 ns	834.560000 ns	95.45%	-4.55%	---
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 glibc	860.804 ns	809.442000 ns	94.03%	-5.97%	---
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 jemalloc	424.589000 ns	-
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 jemalloc	269.304000 ns	-
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 tbbProxy	1002.640000 ns	-
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 tbbProxy	553.923000 ns	-

Relative perf in group multiple (24): 98.237%

Benchmark	This PR	baseline	Relative perf	Change	-
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 umfProxy	27077.900000 ns	27865.300 ns	102.91%	2.91%	++
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 scalable_pool<os_provider>	14959.100000 ns	15279.900 ns	102.14%	2.14%	+
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 umfProxy	30837.900000 ns	31018.400 ns	100.59%	0.59%	.
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 scalable_pool<os_provider>	75342.600000 ns	75687.100 ns	100.46%	0.46%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 umfProxy	4224.170000 ns	4241.250 ns	100.40%	0.40%	.
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 umfProxy	138852.000 ns	138580.000000 ns	99.80%	-0.20%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 proxy_pool<os_provider>	1185160.000 ns	1181150.000000 ns	99.66%	-0.34%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 os_provider	146076.000 ns	144859.000000 ns	99.17%	-0.83%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 scalable_pool<os_provider>	42043.500 ns	41527.800000 ns	98.77%	-1.23%	.
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 glibc	30600.800 ns	30222.700000 ns	98.76%	-1.24%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 glibc	4285.380 ns	4200.920000 ns	98.03%	-1.97%	.
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 glibc	142705.000 ns	139089.000000 ns	97.47%	-2.53%	-
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 os_provider	1201590.000 ns	1162710.000000 ns	96.76%	-3.24%	--
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 proxy_pool<os_provider>	167342.000 ns	160647.000000 ns	96.00%	-4.00%	--
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 glibc	33983.500 ns	31133.200000 ns	91.61%	-8.39%	-----
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 scalable_pool<os_provider>	27771.100 ns	25041.800000 ns	90.17%	-9.83%	------
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 jemalloc	31597.600000 ns	-
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 jemalloc	24584.800000 ns	-
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 jemalloc	49996.400000 ns	-
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 jemalloc	26547.700000 ns	-
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 tbbProxy	41531.500000 ns	-
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 tbbProxy	7802.280000 ns	-
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 tbbProxy	71042.700000 ns	-
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 tbbProxy	21488.000000 ns	-

Relative perf in group api (12): cannot calculate

Benchmark	This PR	baseline
api_overhead_benchmark_l0 SubmitKernel out of order	-	11.868000 μs
api_overhead_benchmark_l0 SubmitKernel in order	-	11.418000 μs
api_overhead_benchmark_sycl SubmitKernel out of order	-	22.969000 μs
api_overhead_benchmark_sycl SubmitKernel in order	-	24.133000 μs
api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024	-	2.113000 μs
api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024	-	1.679000 μs
api_overhead_benchmark_ur SubmitKernel out of order CPU count	-	104663.000000 instr
api_overhead_benchmark_ur SubmitKernel out of order	-	15.750000 μs
api_overhead_benchmark_ur SubmitKernel in order CPU count	-	110006.000000 instr
api_overhead_benchmark_ur SubmitKernel in order	-	16.241000 μs
api_overhead_benchmark_ur SubmitKernel in order with measure completion CPU count	-	122876.000000 instr
api_overhead_benchmark_ur SubmitKernel in order with measure completion	-	21.005000 μs

Relative perf in group memory (4): cannot calculate

Benchmark	This PR	baseline
memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024	-	251.872000 μs
memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024	-	132.472000 μs
memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024	-	5.573000 μs
memory_benchmark_sycl StreamMemory, placement Device, type Triad, size 10240	-	3.158000 GB/s

Relative perf in group miscellaneous (1): cannot calculate

Benchmark	This PR	baseline
miscellaneous_benchmark_sycl VectorSum	-	860.664000 bw GB/s

Relative perf in group multithread (10): cannot calculate

Benchmark	This PR	baseline
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:1 dstUSM:1	-	6939.950000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:1 dstUSM:1	-	17154.077000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:1 dstUSM:1	-	46935.372000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:1 dstUSM:1	-	2093.086000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:0 dstUSM:1	-	7472.404000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:0 dstUSM:1	-	8689.121000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:0 dstUSM:1	-	25587.435000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:0 dstUSM:1	-	1201.865000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:1, allocSize:1024 srcUSM:0 dstUSM:1 without events	-	40846.653000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:4, allocSize:1024 srcUSM:0 dstUSM:1 without events	-	112790.682000 μs

Relative perf in group graph (10): cannot calculate

Benchmark	This PR	baseline
graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:10	-	71747.470000 μs
graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:10	-	72642.878000 μs
graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:100	-	353339.946000 μs
graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:100	-	353502.721000 μs
graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:1, numKernels:10	-	54.566000 μs
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:10	-	62.367000 μs
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:100	-	674.284000 μs
graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:0, numKernels:10	-	5721.966000 μs
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:10	-	5688.177000 μs
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:100	-	57817.523000 μs

Relative perf in group Velocity-Bench (9): cannot calculate

Benchmark	This PR	baseline
Velocity-Bench Hashtable	-	358.375158 M keys/sec
Velocity-Bench Bitcracker	-	35.965200 s
Velocity-Bench CudaSift	-	201.701000 ms
Velocity-Bench Easywave	-	226.000000 ms
Velocity-Bench QuickSilver	-	117.580000 MMS/CTT
Velocity-Bench Sobel Filter	-	611.944000 ms
Velocity-Bench dl-cifar	-	23.442800 s
Velocity-Bench dl-mnist	-	2.720000 s
Velocity-Bench svm	-	0.134300 s

Relative perf in group Runtime (8): cannot calculate

Benchmark	This PR	baseline
Runtime_IndependentDAGTaskThroughput_SingleTask	-	268.614000 ms
Runtime_IndependentDAGTaskThroughput_BasicParallelFor	-	277.626000 ms
Runtime_IndependentDAGTaskThroughput_HierarchicalParallelFor	-	277.078000 ms
Runtime_IndependentDAGTaskThroughput_NDRangeParallelFor	-	277.264000 ms
Runtime_DAGTaskThroughput_SingleTask	-	1688.724000 ms
Runtime_DAGTaskThroughput_BasicParallelFor	-	1764.745000 ms
Runtime_DAGTaskThroughput_HierarchicalParallelFor	-	1737.282000 ms
Runtime_DAGTaskThroughput_NDRangeParallelFor	-	1705.559000 ms

Relative perf in group MicroBench (14): cannot calculate

Benchmark	This PR	baseline
MicroBench_HostDeviceBandwidth_1D_H2D_Contiguous	-	5.241000 ms
MicroBench_HostDeviceBandwidth_2D_H2D_Contiguous	-	4.991000 ms
MicroBench_HostDeviceBandwidth_3D_H2D_Contiguous	-	4.763000 ms
MicroBench_HostDeviceBandwidth_1D_D2H_Contiguous	-	4.863000 ms
MicroBench_HostDeviceBandwidth_2D_D2H_Contiguous	-	618.230000 ms
MicroBench_HostDeviceBandwidth_3D_D2H_Contiguous	-	618.282000 ms
MicroBench_HostDeviceBandwidth_1D_H2D_Strided	-	4.928000 ms
MicroBench_HostDeviceBandwidth_2D_H2D_Strided	-	5.197000 ms
MicroBench_HostDeviceBandwidth_3D_H2D_Strided	-	5.079000 ms
MicroBench_HostDeviceBandwidth_1D_D2H_Strided	-	5.207000 ms
MicroBench_HostDeviceBandwidth_2D_D2H_Strided	-	617.816000 ms
MicroBench_HostDeviceBandwidth_3D_D2H_Strided	-	617.727000 ms
MicroBench_LocalMem_int32_4096	-	29.924000 ms
MicroBench_LocalMem_fp32_4096	-	29.864000 ms

Relative perf in group Pattern (10): cannot calculate

Benchmark	This PR	baseline
Pattern_Reduction_NDRange_int32	-	16.761000 ms
Pattern_Reduction_Hierarchical_int32	-	16.736000 ms
Pattern_SegmentedReduction_NDRange_int16	-	2.264000 ms
Pattern_SegmentedReduction_NDRange_int32	-	2.166000 ms
Pattern_SegmentedReduction_NDRange_int64	-	2.337000 ms
Pattern_SegmentedReduction_NDRange_fp32	-	2.165000 ms
Pattern_SegmentedReduction_Hierarchical_int16	-	11.801000 ms
Pattern_SegmentedReduction_Hierarchical_int32	-	11.589000 ms
Pattern_SegmentedReduction_Hierarchical_int64	-	11.771000 ms
Pattern_SegmentedReduction_Hierarchical_fp32	-	11.590000 ms

Relative perf in group ScalarProduct (6): cannot calculate

Benchmark	This PR	baseline
ScalarProduct_NDRange_int32	-	3.744000 ms
ScalarProduct_NDRange_int64	-	5.440000 ms
ScalarProduct_NDRange_fp32	-	3.760000 ms
ScalarProduct_Hierarchical_int32	-	10.507000 ms
ScalarProduct_Hierarchical_int64	-	11.485000 ms
ScalarProduct_Hierarchical_fp32	-	10.152000 ms

Relative perf in group USM (7): cannot calculate

Benchmark	This PR	baseline
USM_Allocation_latency_fp32_device	-	0.066000 ms
USM_Allocation_latency_fp32_host	-	37.402000 ms
USM_Allocation_latency_fp32_shared	-	0.065000 ms
USM_Instr_Mix_fp32_device_1:1mix_with_init_no_prefetch	-	1.681000 ms
USM_Instr_Mix_fp32_host_1:1mix_with_init_no_prefetch	-	1.056000 ms
USM_Instr_Mix_fp32_device_1:1mix_no_init_no_prefetch	-	1.838000 ms
USM_Instr_Mix_fp32_host_1:1mix_no_init_no_prefetch	-	1.205000 ms

Relative perf in group VectorAddition (3): cannot calculate

Benchmark	This PR	baseline
VectorAddition_int32	-	1.492000 ms
VectorAddition_int64	-	3.061000 ms
VectorAddition_fp32	-	1.434000 ms

Relative perf in group Polybench (3): cannot calculate

Benchmark	This PR	baseline
Polybench_2mm	-	1.039000 ms
Polybench_3mm	-	1.482000 ms
Polybench_Atax	-	6.416000 ms

Relative perf in group Kmeans (1): cannot calculate

Benchmark	This PR	baseline
Kmeans_fp32	-	14.144000 ms

Relative perf in group LinearRegressionCoeff (1): cannot calculate

Benchmark	This PR	baseline
LinearRegressionCoeff_fp32	-	899.874000 ms

Relative perf in group MolecularDynamics (1): cannot calculate

Benchmark	This PR	baseline
MolecularDynamics	-	0.029000 ms

Relative perf in group llama.cpp (6): cannot calculate

Benchmark	This PR	baseline
llama.cpp Prompt Processing Batched 128	-	824.202968 token/s
llama.cpp Text Generation Batched 128	-	62.990615 token/s
llama.cpp Prompt Processing Batched 256	-	870.375426 token/s
llama.cpp Text Generation Batched 256	-	62.990517 token/s
llama.cpp Prompt Processing Batched 512	-	429.991968 token/s
llama.cpp Text Generation Batched 512	-	62.959741 token/s

LD_PRELOAD=libtbbmalloc_proxy.so

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv --benchmark_filter=glibc

github-actions · 2025-01-31T14:26:18Z

Compute Benchmarks level_zero run (with params: --iterations-stddev 2 --iterations 2):
https://github.com/oneapi-src/unified-runtime/actions/runs/13074686372

github-actions · 2025-01-31T14:28:13Z

Compute Benchmarks level_zero run (--iterations-stddev 2 --iterations 2):
https://github.com/oneapi-src/unified-runtime/actions/runs/13074686372
Job status: cancelled. Test status: skipped.

github-actions · 2025-01-31T14:28:41Z

Compute Benchmarks level_zero run (with params: --iterations-stddev 2 --iterations 2):
https://github.com/oneapi-src/unified-runtime/actions/runs/13074730535

github-actions · 2025-01-31T14:41:04Z

Compute Benchmarks level_zero run (--iterations-stddev 2 --iterations 2):
https://github.com/oneapi-src/unified-runtime/actions/runs/13074730535
Job status: success. Test status: success.

Summary

Total 42 benchmarks in mean.
Geomean 70.872%.
Improved 7 Regressed 22 (threshold 2.00%)

(result is better)
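
The improved/regressed counts above appear to follow from comparing each benchmark's relative change against the 2.00% threshold. A minimal sketch of that classification, assuming lower-is-better timings so that relative perf is baseline divided by this PR's value (this formula is inferred from the reported numbers, not taken from the benchmark scripts; `classify` and `samples` are hypothetical names):

```python
# Sketch only: classify timing pairs against the 2% threshold from the summary.
# Assumption: the metric is a time (ns), so a smaller "This PR" value is better
# and relative perf = baseline / this_pr.
THRESHOLD = 0.02


def classify(pr_ns, baseline_ns, threshold=THRESHOLD):
    """Return 'improved', 'regressed', or 'unchanged' for one benchmark row."""
    change = baseline_ns / pr_ns - 1.0
    if change > threshold:
        return "improved"
    if change < -threshold:
        return "regressed"
    return "unchanged"


# (this PR, baseline) values copied from three rows of the threads:4 alloc group.
samples = {
    "umfProxy": (129.151, 2735.530),
    "proxy_pool<os_provider>": (3114.580, 3174.620),
    "glibc": (2770.360, 2620.060),
}

labels = {name: classify(pr, base) for name, (pr, base) in samples.items()}
for name, label in labels.items():
    print(f"{name}: {label}")
```

Rows that lack a baseline or a PR value (shown as `-`) cannot be classified this way and would have to be skipped.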

Performance change in benchmark groups

Relative perf in group alloc/size:10000/0/4096/iterations:200000/threads:4 (9): 180.497%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/size:10000/0/4096/iterations:200000/threads:4 umfProxy	129.151000 ns	2735.530 ns	2118.09%	2018.09%	++++++++++
alloc/size:10000/0/4096/iterations:200000/threads:4 proxy_pool<os_provider>	3114.580000 ns	3174.620 ns	101.93%	1.93%	.
alloc/size:10000/0/4096/iterations:200000/threads:4 scalable_pool<os_provider>	302.353000 ns	306.767 ns	101.46%	1.46%	.
alloc/size:10000/0/4096/iterations:200000/threads:4 glibc	2770.360 ns	2620.060000 ns	94.57%	-5.43%	.
alloc/size:10000/0/4096/iterations:200000/threads:4 os_provider	2370.970 ns	2192.650000 ns	92.48%	-7.52%	.
alloc/size:10000/0/4096/iterations:200000/threads:4 disjoint_pool<os_provider>	4937.450000 ns	-
alloc/size:10000/0/4096/iterations:200000/threads:4 jemalloc_pool<os_provider>	3693.380000 ns	-
alloc/size:10000/0/4096/iterations:200000/threads:4 jemalloc	119.817000 ns	-
alloc/size:10000/0/4096/iterations:200000/threads:4 tbbProxy	292.652000 ns	-
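
The per-row "Relative perf" and the group figure in this table are consistent with a baseline/PR ratio and a geometric mean over the rows that have both values. A rough sketch under that assumption (the formula is inferred from the numbers in the table, not from the reporting script; `relative_perf` and `geomean` are hypothetical helper names):

```python
import math

# (this PR ns, baseline ns) for the five rows above that have both values.
rows = [
    (129.151, 2735.530),   # umfProxy
    (3114.580, 3174.620),  # proxy_pool<os_provider>
    (302.353, 306.767),    # scalable_pool<os_provider>
    (2770.360, 2620.060),  # glibc
    (2370.970, 2192.650),  # os_provider
]


def relative_perf(pr, baseline):
    """Lower-is-better metric: a ratio above 1.0 means the PR is faster."""
    return baseline / pr


def geomean(values):
    """Geometric mean computed in log space to avoid overflow on long lists."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


rel = [relative_perf(pr, base) for pr, base in rows]
group = geomean(rel)

for (pr, base), r in zip(rows, rel):
    print(f"{pr:>10.3f} ns vs {base:>10.3f} ns -> {r:.2%}")
print(f"group geomean: {group:.3%}")
```

Because a geometric mean multiplies ratios, the single large umfProxy ratio pulls the group figure well above 100% even though two of the five rows regressed.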

Relative perf in group alloc/size:10000/0/4096/iterations:200000/threads:1 (9): 143.572%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/size:10000/0/4096/iterations:200000/threads:1 umfProxy	111.827000 ns	711.693 ns	636.42%	536.42%	+++
alloc/size:10000/0/4096/iterations:200000/threads:1 glibc	703.831000 ns	710.790 ns	100.99%	0.99%	.
alloc/size:10000/0/4096/iterations:200000/threads:1 scalable_pool<os_provider>	215.418 ns	213.992000 ns	99.34%	-0.66%	.
alloc/size:10000/0/4096/iterations:200000/threads:1 proxy_pool<os_provider>	274.851 ns	271.315000 ns	98.71%	-1.29%	.
alloc/size:10000/0/4096/iterations:200000/threads:1 os_provider	202.487 ns	195.988000 ns	96.79%	-3.21%	.
alloc/size:10000/0/4096/iterations:200000/threads:1 disjoint_pool<os_provider>	509.297000 ns	-
alloc/size:10000/0/4096/iterations:200000/threads:1 jemalloc_pool<os_provider>	119.301000 ns	-
alloc/size:10000/0/4096/iterations:200000/threads:1 jemalloc	85.880000 ns	-
alloc/size:10000/0/4096/iterations:200000/threads:1 tbbProxy	193.541000 ns	-

Relative perf in group alloc/size:10000/100000/4096/iterations:200000/threads:4 (9): 157.782%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/size:10000/100000/4096/iterations:200000/threads:4 umfProxy	114.166000 ns	1230.060 ns	1077.43%	977.43%	+++++
alloc/size:10000/100000/4096/iterations:200000/threads:4 glibc	1213.010000 ns	1267.280 ns	104.47%	4.47%	.
alloc/size:10000/100000/4096/iterations:200000/threads:4 proxy_pool<os_provider>	3347.390000 ns	3386.980 ns	101.18%	1.18%	.
alloc/size:10000/100000/4096/iterations:200000/threads:4 scalable_pool<os_provider>	268.784 ns	253.226000 ns	94.21%	-5.79%	.
alloc/size:10000/100000/4096/iterations:200000/threads:4 os_provider	2124.880 ns	1936.480000 ns	91.13%	-8.87%	.
alloc/size:10000/100000/4096/iterations:200000/threads:4 disjoint_pool<os_provider>	4721.800000 ns	-
alloc/size:10000/100000/4096/iterations:200000/threads:4 jemalloc_pool<os_provider>	3626.490000 ns	-
alloc/size:10000/100000/4096/iterations:200000/threads:4 jemalloc	107.922000 ns	-
alloc/size:10000/100000/4096/iterations:200000/threads:4 tbbProxy	301.013000 ns	-

Relative perf in group alloc/size:10000/100000/4096/iterations:200000/threads:1 (9): 125.679%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/size:10000/100000/4096/iterations:200000/threads:1 umfProxy	203.178000 ns	730.895 ns	359.73%	259.73%	+
alloc/size:10000/100000/4096/iterations:200000/threads:1 glibc	724.707000 ns	727.999 ns	100.45%	0.45%	.
alloc/size:10000/100000/4096/iterations:200000/threads:1 scalable_pool<os_provider>	207.595 ns	206.336000 ns	99.39%	-0.61%	.
alloc/size:10000/100000/4096/iterations:200000/threads:1 os_provider	195.211 ns	192.935000 ns	98.83%	-1.17%	.
alloc/size:10000/100000/4096/iterations:200000/threads:1 proxy_pool<os_provider>	339.452 ns	299.838000 ns	88.33%	-11.67%	.
alloc/size:10000/100000/4096/iterations:200000/threads:1 disjoint_pool<os_provider>	498.022000 ns	-
alloc/size:10000/100000/4096/iterations:200000/threads:1 jemalloc_pool<os_provider>	119.925000 ns	-
alloc/size:10000/100000/4096/iterations:200000/threads:1 jemalloc	85.368600 ns	-
alloc/size:10000/100000/4096/iterations:200000/threads:1 tbbProxy	235.938000 ns	-

Relative perf in group alloc/min (12): 80.852%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 umfProxy	673.706000 ns	834.560 ns	123.88%	23.88%	.
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 scalable_pool<os_provider>	1071.980000 ns	1128.250 ns	105.25%	5.25%	.
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 scalable_pool<os_provider>	1000.410 ns	968.189000 ns	96.78%	-3.22%	.
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 glibc	186.226 ns	177.227000 ns	95.17%	-4.83%	.
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 glibc	870.143 ns	809.442000 ns	93.02%	-6.98%	.
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 umfProxy	728.926 ns	182.287000 ns	25.01%	-74.99%	.
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 jemalloc_pool<os_provider>	4429.480000 ns	-
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 jemalloc_pool<os_provider>	359.652000 ns	-
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 jemalloc	436.643000 ns	-
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 jemalloc	265.077000 ns	-
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 tbbProxy	855.811000 ns	-
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 tbbProxy	585.770000 ns	-

Relative perf in group multiple (30): 26.301%

Benchmark	This PR	baseline	Relative perf	Change	-
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 os_provider	142890.000000 ns	144859.000 ns	101.38%	1.38%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 proxy_pool<os_provider>	1183620.000 ns	1181150.000000 ns	99.79%	-0.21%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 os_provider	1171440.000 ns	1162710.000000 ns	99.25%	-0.75%	.
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 glibc	141479.000 ns	139089.000000 ns	98.31%	-1.69%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 proxy_pool<os_provider>	164147.000 ns	160647.000000 ns	97.87%	-2.13%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 scalable_pool<os_provider>	15632.200 ns	15279.900000 ns	97.75%	-2.25%	.
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 glibc	31080.900 ns	30222.700000 ns	97.24%	-2.76%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 glibc	4351.410 ns	4200.920000 ns	96.54%	-3.46%	.
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 scalable_pool<os_provider>	78417.600 ns	75687.100000 ns	96.52%	-3.48%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 glibc	32525.400 ns	31133.200000 ns	95.72%	-4.28%	.
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 scalable_pool<os_provider>	27680.600 ns	25041.800000 ns	90.47%	-9.53%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 scalable_pool<os_provider>	46880.000 ns	41527.800000 ns	88.58%	-11.42%	.
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 umfProxy	10839200.000 ns	138580.000000 ns	1.28%	-98.72%	.
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 umfProxy	2738240.000 ns	31018.400000 ns	1.13%	-98.87%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 umfProxy	8036080.000 ns	27865.300000 ns	0.35%	-99.65%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 umfProxy	2667030.000 ns	4241.250000 ns	0.16%	-99.84%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 disjoint_pool<os_provider>	1738330.000000 ns	-
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 disjoint_pool<os_provider>	214090.000000 ns	-
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 jemalloc_pool<os_provider>	493122.000000 ns	-
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 jemalloc_pool<os_provider>	24593.300000 ns	-
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 jemalloc_pool<os_provider>	618549.000000 ns	-
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 jemalloc_pool<os_provider>	61579.600000 ns	-
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 jemalloc	30269.100000 ns	-
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 jemalloc	24191.300000 ns	-
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 jemalloc	52034.000000 ns	-
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 jemalloc	26243.700000 ns	-
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 tbbProxy	42244.900000 ns	-
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 tbbProxy	7734.190000 ns	-
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 tbbProxy	71309.600000 ns	-
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 tbbProxy	21424.600000 ns	-

Relative perf in group api (12): cannot calculate

Benchmark	This PR	baseline
api_overhead_benchmark_l0 SubmitKernel out of order	-	11.868000 μs
api_overhead_benchmark_l0 SubmitKernel in order	-	11.418000 μs
api_overhead_benchmark_sycl SubmitKernel out of order	-	22.969000 μs
api_overhead_benchmark_sycl SubmitKernel in order	-	24.133000 μs
api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024	-	2.113000 μs
api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024	-	1.679000 μs
api_overhead_benchmark_ur SubmitKernel out of order CPU count	-	104663.000000 instr
api_overhead_benchmark_ur SubmitKernel out of order	-	15.750000 μs
api_overhead_benchmark_ur SubmitKernel in order CPU count	-	110006.000000 instr
api_overhead_benchmark_ur SubmitKernel in order	-	16.241000 μs
api_overhead_benchmark_ur SubmitKernel in order with measure completion CPU count	-	122876.000000 instr
api_overhead_benchmark_ur SubmitKernel in order with measure completion	-	21.005000 μs

Relative perf in group memory (4): cannot calculate

Benchmark	This PR	baseline
memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024	-	251.872000 μs
memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024	-	132.472000 μs
memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024	-	5.573000 μs
memory_benchmark_sycl StreamMemory, placement Device, type Triad, size 10240	-	3.158000 GB/s

Relative perf in group miscellaneous (1): cannot calculate

Benchmark	This PR	baseline
miscellaneous_benchmark_sycl VectorSum	-	860.664000 bw GB/s

Relative perf in group multithread (10): cannot calculate

Benchmark	This PR	baseline
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:1 dstUSM:1	-	6939.950000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:1 dstUSM:1	-	17154.077000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:1 dstUSM:1	-	46935.372000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:1 dstUSM:1	-	2093.086000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:0 dstUSM:1	-	7472.404000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:0 dstUSM:1	-	8689.121000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:0 dstUSM:1	-	25587.435000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:0 dstUSM:1	-	1201.865000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:1, allocSize:1024 srcUSM:0 dstUSM:1 without events	-	40846.653000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:4, allocSize:1024 srcUSM:0 dstUSM:1 without events	-	112790.682000 μs

Relative perf in group graph (10): cannot calculate

Benchmark	This PR	baseline
graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:10	-	71747.470000 μs
graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:10	-	72642.878000 μs
graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:100	-	353339.946000 μs
graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:100	-	353502.721000 μs
graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:1, numKernels:10	-	54.566000 μs
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:10	-	62.367000 μs
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:100	-	674.284000 μs
graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:0, numKernels:10	-	5721.966000 μs
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:10	-	5688.177000 μs
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:100	-	57817.523000 μs

Relative perf in group Velocity-Bench (9): cannot calculate

Benchmark	This PR	baseline
Velocity-Bench Hashtable	-	358.375158 M keys/sec
Velocity-Bench Bitcracker	-	35.965200 s
Velocity-Bench CudaSift	-	201.701000 ms
Velocity-Bench Easywave	-	226.000000 ms
Velocity-Bench QuickSilver	-	117.580000 MMS/CTT
Velocity-Bench Sobel Filter	-	611.944000 ms
Velocity-Bench dl-cifar	-	23.442800 s
Velocity-Bench dl-mnist	-	2.720000 s
Velocity-Bench svm	-	0.134300 s

Relative perf in group Runtime (8): cannot calculate

Benchmark	This PR	baseline
Runtime_IndependentDAGTaskThroughput_SingleTask	-	268.614000 ms
Runtime_IndependentDAGTaskThroughput_BasicParallelFor	-	277.626000 ms
Runtime_IndependentDAGTaskThroughput_HierarchicalParallelFor	-	277.078000 ms
Runtime_IndependentDAGTaskThroughput_NDRangeParallelFor	-	277.264000 ms
Runtime_DAGTaskThroughput_SingleTask	-	1688.724000 ms
Runtime_DAGTaskThroughput_BasicParallelFor	-	1764.745000 ms
Runtime_DAGTaskThroughput_HierarchicalParallelFor	-	1737.282000 ms
Runtime_DAGTaskThroughput_NDRangeParallelFor	-	1705.559000 ms

Relative perf in group MicroBench (14): cannot calculate

Benchmark	This PR	baseline
MicroBench_HostDeviceBandwidth_1D_H2D_Contiguous	-	5.241000 ms
MicroBench_HostDeviceBandwidth_2D_H2D_Contiguous	-	4.991000 ms
MicroBench_HostDeviceBandwidth_3D_H2D_Contiguous	-	4.763000 ms
MicroBench_HostDeviceBandwidth_1D_D2H_Contiguous	-	4.863000 ms
MicroBench_HostDeviceBandwidth_2D_D2H_Contiguous	-	618.230000 ms
MicroBench_HostDeviceBandwidth_3D_D2H_Contiguous	-	618.282000 ms
MicroBench_HostDeviceBandwidth_1D_H2D_Strided	-	4.928000 ms
MicroBench_HostDeviceBandwidth_2D_H2D_Strided	-	5.197000 ms
MicroBench_HostDeviceBandwidth_3D_H2D_Strided	-	5.079000 ms
MicroBench_HostDeviceBandwidth_1D_D2H_Strided	-	5.207000 ms
MicroBench_HostDeviceBandwidth_2D_D2H_Strided	-	617.816000 ms
MicroBench_HostDeviceBandwidth_3D_D2H_Strided	-	617.727000 ms
MicroBench_LocalMem_int32_4096	-	29.924000 ms
MicroBench_LocalMem_fp32_4096	-	29.864000 ms

Relative perf in group Pattern (10): cannot calculate

Benchmark	This PR	baseline
Pattern_Reduction_NDRange_int32	-	16.761000 ms
Pattern_Reduction_Hierarchical_int32	-	16.736000 ms
Pattern_SegmentedReduction_NDRange_int16	-	2.264000 ms
Pattern_SegmentedReduction_NDRange_int32	-	2.166000 ms
Pattern_SegmentedReduction_NDRange_int64	-	2.337000 ms
Pattern_SegmentedReduction_NDRange_fp32	-	2.165000 ms
Pattern_SegmentedReduction_Hierarchical_int16	-	11.801000 ms
Pattern_SegmentedReduction_Hierarchical_int32	-	11.589000 ms
Pattern_SegmentedReduction_Hierarchical_int64	-	11.771000 ms
Pattern_SegmentedReduction_Hierarchical_fp32	-	11.590000 ms

Relative perf in group ScalarProduct (6): cannot calculate

Benchmark	This PR	baseline
ScalarProduct_NDRange_int32	-	3.744000 ms
ScalarProduct_NDRange_int64	-	5.440000 ms
ScalarProduct_NDRange_fp32	-	3.760000 ms
ScalarProduct_Hierarchical_int32	-	10.507000 ms
ScalarProduct_Hierarchical_int64	-	11.485000 ms
ScalarProduct_Hierarchical_fp32	-	10.152000 ms

Relative perf in group USM (7): cannot calculate

Benchmark	This PR	baseline
USM_Allocation_latency_fp32_device	-	0.066000 ms
USM_Allocation_latency_fp32_host	-	37.402000 ms
USM_Allocation_latency_fp32_shared	-	0.065000 ms
USM_Instr_Mix_fp32_device_1:1mix_with_init_no_prefetch	-	1.681000 ms
USM_Instr_Mix_fp32_host_1:1mix_with_init_no_prefetch	-	1.056000 ms
USM_Instr_Mix_fp32_device_1:1mix_no_init_no_prefetch	-	1.838000 ms
USM_Instr_Mix_fp32_host_1:1mix_no_init_no_prefetch	-	1.205000 ms

Relative perf in group VectorAddition (3): cannot calculate

Benchmark	This PR	baseline
VectorAddition_int32	-	1.492000 ms
VectorAddition_int64	-	3.061000 ms
VectorAddition_fp32	-	1.434000 ms

Relative perf in group Polybench (3): cannot calculate

Benchmark	This PR	baseline
Polybench_2mm	-	1.039000 ms
Polybench_3mm	-	1.482000 ms
Polybench_Atax	-	6.416000 ms

Relative perf in group Kmeans (1): cannot calculate

Benchmark	This PR	baseline
Kmeans_fp32	-	14.144000 ms

Relative perf in group LinearRegressionCoeff (1): cannot calculate

Benchmark	This PR	baseline
LinearRegressionCoeff_fp32	-	899.874000 ms

Relative perf in group MolecularDynamics (1): cannot calculate

Benchmark	This PR	baseline
MolecularDynamics	-	0.029000 ms

Relative perf in group llama.cpp (6): cannot calculate

Benchmark	This PR	baseline
llama.cpp Prompt Processing Batched 128	-	824.202968 token/s
llama.cpp Text Generation Batched 128	-	62.990615 token/s
llama.cpp Prompt Processing Batched 256	-	870.375426 token/s
llama.cpp Text Generation Batched 256	-	62.990517 token/s
llama.cpp Prompt Processing Batched 512	-	429.991968 token/s
llama.cpp Text Generation Batched 512	-	62.959741 token/s

LD_PRELOAD=libtbbmalloc_proxy.so

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv --benchmark_filter=glibc

martygrant · 2025-02-20T11:53:03Z

Unified Runtime -> intel/llvm Repo Move Notice

Information

The source code of Unified Runtime has been moved to intel/llvm under the unified-runtime top-level directory;
all future development will now be carried out there. This was done in intel/llvm#17043.

The code will be mirrored to oneapi-src/unified-runtime and the specification will continue to be hosted at oneapi-src.github.io/unified-runtime.

The contribution guide has been updated with new instructions for contributing to Unified Runtime.

PR Migration

All open PRs including this one will be labelled auto-close and shall be automatically closed after 30 days.
To allow for some breathing space, this automation will not be enabled until next week (27/02/2025).

Should you wish to continue with your PR you will need to migrate it to intel/llvm.
We have provided a script to help automate this process.

This is an automated comment.

martygrant · 2025-02-28T10:45:30Z

Unified Runtime -> intel/llvm Repo Move Notice

Following on from the previous notice, we have now enabled workflows to automatically label and close PRs because the Unified Runtime source code has moved to intel/llvm.

This PR has now been marked with the auto-close label and will be automatically closed after 30 days.

Please review the previous notice for more information, including assistance with migrating your PR to intel/llvm.

Should there be a reason for this PR to remain open, manually remove the auto-close label.

This is an automated comment.

github-actions · 2025-03-31T00:32:05Z

Automatic PR Closure Notice

Information

This PR has been closed automatically. It was marked with the auto-close label 30 days ago as part of the Unified Runtime source code migration to the intel/llvm repository - intel/llvm#17043.

All Unified Runtime development should be done in intel/llvm; details can be found in the updated contribution guide.
This repository will continue to exist as a mirror and will host the specification documentation.

Next Steps

Should you wish to re-open this PR, it must be moved to intel/llvm. We have provided a script to help automate this process; otherwise, no action is required.

This is an automated comment.

fix missing cmake options

fe1484f

EuphoricThinking requested a review from a team as a code owner January 30, 2025 14:55

github-actions bot added the ci/cd Continuous integration/delivery label Jan 30, 2025

Results printed inside benchmark

75669c2

Add jemalloc

6e67fda

A little bit more output for umf data

3c0f838

add and test tbbProxy

61190c4

lukaszstolarczuk reviewed Jan 31, 2025

actually test tbbProxy

caeb644

Looking for tbb

a28c119

martygrant added the auto-close label Feb 28, 2025

github-actions bot closed this Mar 31, 2025
