High-performance CUDA matrix multiplication kernels using shared-memory tiling and register blocking, with Roofline Model analysis and benchmarks against cuBLAS.
c-plus-plus machine-learning deep-learning hpc parallel-computing cuda nvidia matrix-multiplication high-performance-computing cuda-kernels gpu-computing gpu-optimization roofline-model memory-coalescing tiled-matmul
Updated Apr 30, 2026 - Cuda