CUTLASS and CuTe are libraries advertised for grid-level GEMMs and convolutions. But the primitives they offer, namely abstractions over the tensor core intrinsics and the register partitioning they require, are useful for many other operations that can employ tensor cores.
This repository contains an implementation of a batched 2D discrete cosine transform (DCT)
using CuTe. The kernel is in cuda-kernels/cuda/dct.cu, while the rest of the repository
is driver code to test its throughput.
The kernel performs two GEMMs for each tile, transposing the tile in between; the two passes implement the row-wise and column-wise halves of the separable 2D DCT. The tile data is carefully laid out in shared memory with CuTe swizzling to reduce bank conflicts during that transpose.
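The two-GEMM structure can be sketched in plain Python (not the kernel itself, just an illustrative model of its math): applying the same DCT-II basis matrix twice, with a transpose in between, reproduces the standard 2D DCT. The helper names below are hypothetical, not taken from the repository.

```python
import math

def dct_matrix(n):
    # Orthonormal DCT-II basis: C[k][i] = s_k * cos(pi*(2i+1)*k / (2n)),
    # with s_0 = sqrt(1/n) and s_k = sqrt(2/n) otherwise.
    c = []
    for k in range(n):
        s = math.sqrt((1 if k == 0 else 2) / n)
        c.append([s * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                  for i in range(n)])
    return c

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def dct2_two_gemms(tile):
    # Same GEMM applied twice with a transpose in between:
    # C @ tile transforms the columns; transposing and repeating
    # transforms the rows, yielding (C @ tile @ C^T)^T. One final
    # transpose recovers the standard 2D DCT-II, C @ tile @ C^T.
    c = dct_matrix(len(tile))
    return transpose(matmul(c, transpose(matmul(c, tile))))

def dct2_direct(tile):
    # Reference: the 2D DCT-II written as an explicit double sum.
    n = len(tile)
    def s(k):
        return math.sqrt((1 if k == 0 else 2) / n)
    return [[s(u) * s(v) * sum(tile[i][j]
                               * math.cos(math.pi * (2 * i + 1) * u / (2 * n))
                               * math.cos(math.pi * (2 * j + 1) * v / (2 * n))
                               for i in range(n) for j in range(n))
             for v in range(n)] for u in range(n)]
```

Because both passes use the identical GEMM, the kernel only needs one set of tensor core tiles; the transpose between passes is exactly where the swizzled shared-memory layout pays off.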
Thanks to the high throughput of tensor cores even on consumer GPUs, the kernel is completely memory-bound up to a block size of 128x128 (per batch item), the largest block size I'm aware of in any video or image codec.
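A back-of-the-envelope arithmetic-intensity estimate makes the memory-bound claim plausible. The sketch below assumes fp32 tiles that are read and written once each; the kernel's actual data types and traffic may differ.

```python
def arithmetic_intensity(n, bytes_per_elem=4):
    # Two N x N GEMMs per tile: each is 2*n^3 FLOPs (multiply + add),
    # against one read and one write of the n x n tile in global memory.
    # Assumption: fp32 elements, no extra traffic for the DCT matrices.
    flops = 2 * (2 * n ** 3)
    bytes_moved = 2 * (n * n) * bytes_per_elem
    return flops / bytes_moved  # simplifies to n / 2 for fp32
```

Intensity grows linearly with the block size (n/2 FLOPs per byte under these assumptions), so larger blocks only make it easier for the tensor cores to outrun memory bandwidth.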