## Summary
While copy-pasting the A and B tile prefetching code from `xe_mma.hpp` into
`xe_mma_w8a8.hpp` to bring the two files closer together (they are nearly
identical, except that `xe_mma_w8a8.hpp` converts FP8 `A` and `B` to FP16;
whether the two files can be refactored and merged is beyond the scope of
this PR), I observed a performance boost of ~16% for many input shapes on
the Intel GPU Max 1550.
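The idea behind the change can be illustrated with a CPU analogy. The sketch below is *not* the actual CUTLASS/SYCL code from `xe_mma.hpp`; all names are hypothetical, and `__builtin_prefetch` stands in for the GPU copy-tile prefetches. It shows the pattern: before computing on the current k-tile, issue prefetches for the tiles of the next k iteration so the loads overlap with the MMA work.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical CPU analogy: multiply an MxK matrix A by a KxN matrix B in
// k-tiles, prefetching the next A and B k-tiles before computing on the
// current one (the real kernel prefetches whole copy tiles ahead of the MMA).
std::vector<float> gemm_prefetch_ahead(const std::vector<float>& A,
                                       const std::vector<float>& B,
                                       int M, int N, int K, int k_tile) {
  std::vector<float> C(static_cast<size_t>(M) * N, 0.0f);
  for (int k0 = 0; k0 < K; k0 += k_tile) {
    int k_next = k0 + k_tile;
    if (k_next < K) {
      // Hint the hardware to start fetching the next k-tiles of A and B.
      __builtin_prefetch(&A[k_next], /*rw=*/0, /*locality=*/3);
      __builtin_prefetch(&B[static_cast<size_t>(k_next) * N], 0, 3);
    }
    // Compute on the current k-tile while the next one is (ideally) in flight.
    int k_end = std::min(k0 + k_tile, K);
    for (int m = 0; m < M; ++m)
      for (int k = k0; k < k_end; ++k)
        for (int n = 0; n < N; ++n)
          C[static_cast<size_t>(m) * N + n] +=
              A[static_cast<size_t>(m) * K + k] *
              B[static_cast<size_t>(k) * N + n];
  }
  return C;
}
```

On the GPU the benefit is much larger than on a CPU, because the prefetches hide global-memory latency for an entire subgroup's tiles rather than a single cache line.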
## Performance data
On PVC (Intel GPU Max 1550), with the DPC++ nightly of March 23, 2025.
I used the existing benchmark, e.g.
`./examples/sycl/08_pvc_gemm_f8/08_pvc_gemm_f8 --iterations=1 --m=1024 --n=7168 --k=128`
I benchmarked by running each GEMM problem only once, to verify that the
change (adding prefetching) indeed yields a speedup (an alternative would
have been multiple iterations with a cache flush between each iteration).
As a digression: in real workloads, before a linear op the activation is
likely to already be in cache while the weights are likely to be in global
memory, and neither of the two approaches I considered simulates that
scenario.
| M | N | K | L | Latency of one invocation (before) | Latency of one invocation (after) | Speedup |
|--|--|--|--|-----|-----|---|
|1024|1536|7168|1|3.76 ms |3.2304 ms | 1.16x |
|1024|1536|1536|1|0.824 ms |0.7034 ms | 1.17x|
|1024|576|7168|1|3.75 ms | 3.2274 ms| 1.16x |
|1024|2048|512|1|0.2853 ms |0.2458 ms | 1.16x |
|1024|7168|1024|1|1.54 ms | 1.2762 ms| 1.20x |
|1024|256|7168|1| 3.76 ms| 3.2237 ms| 1.16x |
|1024|7168|128|1|0.2270 ms |0.1997 ms | 1.13x |
|1|1536|7168|1| 3.76 ms|3.1790 ms | 1.18x |
|1|1536|1536|1|0.8206 ms |0.6925 ms | 1.18x |
|1|576|7168|1| 3.76 ms| 2.7831 ms| 1.35x |
|1|2048|512|1| 0.2802 ms| 0.2413 ms| 1.16x |
|1|7168|1024|1|0.5504 ms |0.4669 ms | 1.17x |
|1|256|7168|1| 3.7701 ms| 3.1678 ms| 1.19x |
|1|7168|128|1|0.0733 ms | 0.0683 ms| 1.07x |
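To reproduce the runs above, a small sweep helper can be used. This is a hypothetical convenience script, not part of the PR; it assumes the example binary has been built as described in the build instructions below, and it only prints the command lines (swap `echo` for direct invocation on a PVC machine).

```shell
#!/bin/sh
# Hypothetical sweep over a few (M, N, K) shapes from the table above.
# Assumes the 08_pvc_gemm_f8 example has been built in the current tree.
sweep() {
  BIN=./examples/sycl/08_pvc_gemm_f8/08_pvc_gemm_f8
  for shape in "1024 1536 7168" "1 1536 7168" "1 7168 128"; do
    set -- $shape
    # Single invocation per shape, matching the methodology above.
    echo "$BIN --iterations=1 --m=$1 --n=$2 --k=$3"
  done
}
sweep
```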
## Build instructions
```bash
export IGC_ExtraOCLOptions="-cl-intel-256-GRF-per-thread"
export IGC_VectorAliasBBThreshold=1200
export IGC_VISAOptions="-perfmodel"
mkdir build && cd build
CC=clang CXX=clang++ cmake .. -GNinja \
  -DCUTLASS_ENABLE_EXAMPLES=ON \
  -DCMAKE_EXPORT_COMPILE_COMMANDS=ON \
  -DCUTLASS_ENABLE_SYCL=ON \
  -DCUTLASS_SYCL_PROFILING_ENABLED=ON \
  -DDPCPP_SYCL_TARGET=intel_gpu_pvc \
  -DCUTLASS_ENABLE_BENCHMARKS=OFF \
  -DCMAKE_BUILD_TYPE=RelWithDebInfo \
  -DCMAKE_CXX_FLAGS="-ftemplate-backtrace-limit=0 -fdiagnostics-color=always"
```
cc @pengzhao-intel
Thanks!
Co-authored-by: Tadej Ciglarič <[email protected]>