GPU Computing Project – CMPS 224 (AUB)
Authors: Mohammad Khaled Charaf · Mohammad Ayman Charaf · Jad Abou Hawili
Supervisor: Dr. Izzat El Hajj
This project implements and progressively optimizes sparse × thin-sparse matrix multiplication on NVIDIA GPUs (CUDA). We move from a sequential baseline to warp-level kernels that improve parallelism, memory locality, and synchronization, culminating in strong speedups over the baseline.
Why it matters: Sparse ops are ubiquitous in graph analytics, PDE solvers, ML (e.g., NLP/recommenders), and scientific computing. Efficient sparse multiplication reduces both compute and memory overhead.
- Implement a correct baseline for sparse × thin-sparse multiplication.
- Design five CUDA kernels with increasing sophistication:
- Thread-per-row → Block-per-row → Shared-memory privatization → Warp-level coalescing → Multi-warp write-back.
- Benchmark and analyze performance, occupancy, and bottlenecks.
- Start with COO × CSR → COO, then switch to CSR × CSR for faster row access and reduced memory overhead.
- Per-row temporary accumulators replace hashmap merges (the CSR layout assumed throughout is sketched below).
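For reference, here is a minimal sketch of a CSR layout consistent with the description above; the struct and field names are illustrative, not the project's actual types.

```cuda
// Illustrative CSR container (hypothetical names, not the project's actual types).
// Row i's nonzeros live in colIdx[rowPtr[i] .. rowPtr[i+1]) with matching values,
// so a row can be scanned with two pointer loads instead of a COO search.
struct CSRMatrix {
    int    numRows;
    int    numCols;
    int   *rowPtr;  // device pointer, length numRows + 1
    int   *colIdx;  // device pointer, length nnz
    float *values;  // device pointer, length nnz
};
```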
- Each thread processes one row of A (CSR), multiplying its nonzeros against the corresponding rows of B (CSR).
- Column-chunking controls memory usage; `atomicAdd` is used to write out results (see the sketch after this list).
- Limitation: Under-parallelization and load imbalance on rows with uneven sparsity.
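A minimal sketch of the thread-per-row idea, under stated assumptions: a dense per-row accumulator over one column chunk (zero-initialized before launch), hypothetical parameter names, and a per-element `atomicAdd` on the output counter. It illustrates the scheme described above, not the project's actual kernel.

```cuda
// Hypothetical sketch of a thread-per-row SpGEMM kernel over one column chunk.
// Assumes acc (numRowsA * chunkWidth floats) is zero-initialized, and output
// slots in a COO-style result are reserved via atomicAdd on outCount.
__global__ void spgemmThreadPerRow(
    const int *aRowPtr, const int *aColIdx, const float *aVal,
    const int *bRowPtr, const int *bColIdx, const float *bVal,
    float *acc, int chunkStart, int chunkEnd,
    int *outRow, int *outCol, float *outVal, int *outCount, int numRowsA)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= numRowsA) return;

    int chunkWidth = chunkEnd - chunkStart;
    float *rowAcc = acc + (size_t)row * chunkWidth;

    // Accumulate: for each nonzero A[row][k], scan row k of B.
    for (int i = aRowPtr[row]; i < aRowPtr[row + 1]; ++i) {
        int k = aColIdx[i];
        float a = aVal[i];
        for (int j = bRowPtr[k]; j < bRowPtr[k + 1]; ++j) {
            int col = bColIdx[j];
            if (col >= chunkStart && col < chunkEnd)
                rowAcc[col - chunkStart] += a * bVal[j];
        }
    }

    // Write out this row's nonzeros: one global atomicAdd per output element.
    for (int c = 0; c < chunkWidth; ++c) {
        float v = rowAcc[c];
        if (v != 0.0f) {
            int slot = atomicAdd(outCount, 1);
            outRow[slot] = row; outCol[slot] = chunkStart + c; outVal[slot] = v;
            rowAcc[c] = 0.0f;  // reset the accumulator for the next chunk
        }
    }
}
```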
- One thread block per row; threads collaborate via shared memory to accumulate column chunks.
- Fewer global memory reads; still uses `atomicAdd` on a global output counter (see the sketch below).
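A sketch of the block-per-row scheme under the same assumptions as above: the column-chunk accumulator is privatized in shared memory, threads split the row's nonzeros, and each output element still costs one global `atomicAdd`. The chunk width and all names are illustrative.

```cuda
#define CHUNK 512  // illustrative chunk width

// Hypothetical sketch: one block owns one row of A; threads privatize the
// column-chunk accumulator in shared memory (one launch per chunk for brevity).
__global__ void spgemmBlockPerRow(
    const int *aRowPtr, const int *aColIdx, const float *aVal,
    const int *bRowPtr, const int *bColIdx, const float *bVal,
    int chunkStart,
    int *outRow, int *outCol, float *outVal, int *outCount)
{
    __shared__ float acc[CHUNK];
    int row = blockIdx.x;

    // Cooperative zeroing of the shared accumulator.
    for (int c = threadIdx.x; c < CHUNK; c += blockDim.x) acc[c] = 0.0f;
    __syncthreads();

    // Threads split the nonzeros of A's row; each scans one row of B.
    for (int i = aRowPtr[row] + threadIdx.x; i < aRowPtr[row + 1]; i += blockDim.x) {
        int k = aColIdx[i];
        float a = aVal[i];
        for (int j = bRowPtr[k]; j < bRowPtr[k + 1]; ++j) {
            int c = bColIdx[j] - chunkStart;
            if (c >= 0 && c < CHUNK)
                atomicAdd(&acc[c], a * bVal[j]);  // cheap shared-memory atomic
        }
    }
    __syncthreads();

    // Flush the chunk: still one global atomicAdd per output element.
    for (int c = threadIdx.x; c < CHUNK; c += blockDim.x) {
        if (acc[c] != 0.0f) {
            int slot = atomicAdd(outCount, 1);
            outRow[slot] = row; outCol[slot] = chunkStart + c; outVal[slot] = acc[c];
        }
    }
}
```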
- Add a block-local shared counter to batch output reservation (see the sketch after this list):
- Threads increment a shared counter.
- One global `atomicAdd` per block reserves a contiguous region.
- Threads write directly into their reserved slice.
- Result: Drastically lower contention on the global counter.
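A sketch of the batched reservation, written as a hypothetical device helper called by every thread of the block after accumulation. It assumes one accumulator element per thread (`chunkWidth <= blockDim.x`) to keep the sketch short; all names are illustrative.

```cuda
// Hypothetical device helper: all threads of a block call it after the
// shared-memory accumulation phase (previous sketch).
__device__ void flushWithBatchedReservation(
    const float *acc, int row, int chunkStart, int chunkWidth,
    int *outRow, int *outCol, float *outVal, int *outCount)
{
    __shared__ int localCount;  // block-local counter for this flush
    __shared__ int baseSlot;    // start of the block's reserved output region

    if (threadIdx.x == 0) localCount = 0;
    __syncthreads();

    // Phase 1: each thread holding a nonzero claims a local position
    // (cheap shared-memory atomic, no global contention).
    int c = threadIdx.x;
    float v = (c < chunkWidth) ? acc[c] : 0.0f;
    int localPos = (v != 0.0f) ? atomicAdd(&localCount, 1) : -1;
    __syncthreads();

    // Phase 2: a single global atomicAdd reserves a contiguous region
    // for the whole block.
    if (threadIdx.x == 0) baseSlot = atomicAdd(outCount, localCount);
    __syncthreads();

    // Phase 3: threads write directly into their slot of the reservation.
    if (localPos >= 0) {
        int slot = baseSlot + localPos;
        outRow[slot] = row;
        outCol[slot] = chunkStart + c;
        outVal[slot] = v;
    }
}
```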
- Warp-per-element: a warp collaboratively processes one nonzero of A’s row, traversing B’s corresponding row with coalesced loads.
- Use warp intrinsics `__ballot_sync`, `__popc`, and `__shfl_sync` (see the sketch after this list) to:
- Compute compacted write positions.
- Perform one `atomicAdd` per warp per sub-chunk.
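A sketch of the warp-level compaction, assuming each lane holds at most one candidate nonzero and that all 32 lanes of a warp reach the call together. Names are illustrative.

```cuda
// Hypothetical warp-level output compaction: __ballot_sync finds the lanes
// with output, __popc turns the mask into compact offsets, and a single
// atomicAdd per warp reserves the whole slice.
__device__ void warpCompactWrite(
    int row, int col, float v, bool hasOutput,
    int *outRow, int *outCol, float *outVal, int *outCount)
{
    unsigned mask = __ballot_sync(0xFFFFFFFFu, hasOutput);  // bit i set iff lane i writes
    int lane = threadIdx.x % 32;

    // This lane's offset within the warp's outputs: count set bits below it.
    int offset = __popc(mask & ((1u << lane) - 1));

    // Lane 0 reserves space for the whole warp; broadcast the base with shfl.
    int base = 0;
    if (lane == 0) base = atomicAdd(outCount, __popc(mask));
    base = __shfl_sync(0xFFFFFFFFu, base, 0);

    if (hasOutput) {
        int slot = base + offset;
        outRow[slot] = row; outCol[slot] = col; outVal[slot] = v;
    }
}
```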
- `kernel3.cu`: only the first warp writes back (capped at 32 columns per chunk).
- `kernel4.cu`: multi-warp write-back scales to larger column chunks (e.g., 512), utilizing all warps in the block (sketched below).
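A sketch of the multi-warp write-back idea behind `kernel4.cu`: warps stride across 32-column sub-chunks of the shared accumulator, reusing the hypothetical `warpCompactWrite` helper from the previous sketch so every warp in the block participates.

```cuda
// Hypothetical multi-warp write-back: each warp sweeps 32-wide sub-chunks of
// the shared accumulator, performing one atomicAdd per warp per sub-chunk.
__device__ void multiWarpWriteBack(
    const float *acc, int row, int chunkStart, int chunkWidth,
    int *outRow, int *outCol, float *outVal, int *outCount)
{
    int warpId = threadIdx.x / 32;
    int lane   = threadIdx.x % 32;
    int warpsPerBlock = blockDim.x / 32;

    // Warps stride across 32-column sub-chunks so all of them participate.
    for (int sub = warpId * 32; sub < chunkWidth; sub += warpsPerBlock * 32) {
        int c = sub + lane;
        float v = (c < chunkWidth) ? acc[c] : 0.0f;
        warpCompactWrite(row, chunkStart + c, v, v != 0.0f,
                         outRow, outCol, outVal, outCount);
    }
}
```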
- Kernels 1–4 substantially outperform the baseline; occupancy increases from ~29% (baseline) to ~68–72% (best kernel).
- Kernel3 provides the strongest balance of compute utilization and memory throughput.
- Kernel4 improves small/medium cases further; for very large datasets, global memory traffic to B becomes the main bottleneck.
Key bottleneck: non-coalesced, repeated global reads from the second matrix (CSR).
Future work: data reordering/tiling for better locality, hybrid sparse formats, cache-aware traversal.
- CUDA Toolkit (nvcc), NVIDIA GPU (≥ 8 GB device memory recommended), C++17 compiler.
📘 For detailed explanations, implementation notes, and performance analysis of all milestones, please refer to the Final Report.