
caelunshun/cutlass-dct


DCT demo using CUTLASS/CuTe

CUTLASS and CuTe are libraries advertised for performing grid-level GEMMs and convolutions. But the primitives they offer, namely abstractions over the tensor core intrinsics and the register partitioning those intrinsics require, are useful for many other operations that can employ tensor cores.

This repository contains an implementation of a batched 2D discrete cosine transform (DCT) using CuTe. The kernel is in cuda-kernels/cuda/dct.cu, while the rest of the repository is driver code to test its throughput.
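For reference, here is a minimal CPU sketch of what a 2D DCT of one square tile computes, written as matrix products. It assumes the orthonormal DCT-II and `double` precision; the repository's kernel may use a different DCT variant, scaling, or element type. The names `dct_matrix`, `gemm_bt`, `transpose`, and `dct2d` are illustrative and do not come from the repository.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Matrix = std::vector<double>;  // row-major n x n

// Orthonormal DCT-II matrix (an assumption about the variant):
// c[k][j] = s_k * cos((2j + 1) * k * pi / (2n)), s_0 = sqrt(1/n), s_k = sqrt(2/n).
Matrix dct_matrix(std::size_t n) {
    const double pi = std::acos(-1.0);
    Matrix c(n * n);
    for (std::size_t k = 0; k < n; ++k) {
        const double scale = std::sqrt((k == 0 ? 1.0 : 2.0) / n);
        for (std::size_t j = 0; j < n; ++j)
            c[k * n + j] = scale * std::cos((2.0 * j + 1.0) * k * pi / (2.0 * n));
    }
    return c;
}

// Returns a * b^T for row-major n x n matrices.
Matrix gemm_bt(const Matrix& a, const Matrix& b, std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n);
    Matrix out(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k)
            for (std::size_t j = 0; j < n; ++j)
                out[i * n + k] += a[i * n + j] * b[k * n + j];
    return out;
}

Matrix transpose(const Matrix& a, std::size_t n) {
    assert(a.size() == n * n);
    Matrix out(n * n);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            out[j * n + i] = a[i * n + j];
    return out;
}

// 2D DCT of one tile: Y = C * X * C^T, computed as two row-wise passes.
Matrix dct2d(const Matrix& x, std::size_t n) {
    const Matrix c = dct_matrix(n);
    const Matrix rows = gemm_bt(x, c, n);        // X * C^T: DCT of every row
    const Matrix flipped = transpose(rows, n);   // original columns become rows
    const Matrix cols = gemm_bt(flipped, c, n);  // DCT along the original columns
    return transpose(cols, n);                   // restore the original orientation
}
```

A batched transform would apply `dct2d` independently to every tile in the batch. The final transpose is a choice made for this sketch so that the output matches `C * X * C^T` directly; a GPU kernel could equally fold it into how results are written back.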

For each tile, the kernel performs two GEMMs, one for the row-wise DCT and one for the column-wise DCT, transposing the tile in between. It lays out the tile data in shared memory with CuTe swizzling to reduce bank conflicts during that transpose.
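To illustrate why swizzling helps with the transpose, the host-side sketch below counts how many shared memory banks a column access touches under a plain row-major layout versus an XOR-swizzled one. It assumes a 32x32 tile of 4-byte elements and the 32 four-byte banks of NVIDIA shared memory. This is the generic XOR idea only, not the layout used in dct.cu; CuTe expresses such patterns through its `Swizzle<B, M, S>` functor. The helper names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

constexpr std::size_t kBanks = 32;  // shared memory banks, 4 bytes wide each
constexpr std::size_t kTile = 32;   // assumed tile edge, in 4-byte elements

// Plain row-major layout: element (r, c) lives in bank c.
std::size_t plain_index(std::size_t r, std::size_t c) {
    return r * kTile + c;
}

// XOR swizzle: permute the columns within each row by the row index.
std::size_t swizzled_index(std::size_t r, std::size_t c) {
    return r * kTile + (c ^ r);
}

// Number of distinct banks hit when 32 threads read one column,
// thread r reading element (r, col). 32 means conflict-free.
template <class IndexFn>
std::size_t banks_touched_by_column(IndexFn index, std::size_t col) {
    assert(col < kTile);
    std::set<std::size_t> banks;
    for (std::size_t r = 0; r < kTile; ++r)
        banks.insert(index(r, col) % kBanks);
    return banks.size();
}
```

With the plain layout, `banks_touched_by_column(plain_index, col)` is 1 for any column, so the 32 reads serialize. With the swizzled layout every row maps the same column to a different bank, so the same call with `swizzled_index` returns 32, while row accesses remain conflict-free because XOR with a fixed row index is a permutation of the columns.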

Thanks to the high throughput of tensor cores, even on consumer GPUs, the kernel is fast enough to be completely memory-bound up to a block size of 128x128 (per batch item), which is the largest block size I'm aware of any video/image codec using.
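A rough way to reason about the memory-bound claim is to estimate the arithmetic intensity of one tile. The sketch below assumes two dense N x N x N GEMMs per tile (2N^3 FLOPs each), that each tile is read from and written to global memory exactly once, and that traffic for the DCT matrix is negligible. The element size is a parameter because the README does not state it. The function names are hypothetical and this is a back-of-the-envelope model, not a measurement.

```cpp
#include <cassert>
#include <cstddef>

// FLOPs per byte of global memory traffic for one n x n tile.
double arithmetic_intensity(std::size_t n, std::size_t bytes_per_element) {
    assert(n > 0 && bytes_per_element > 0);
    const double dim = static_cast<double>(n);
    const double flops = 2.0 * (2.0 * dim * dim * dim);             // two GEMMs
    const double bytes = 2.0 * dim * dim * static_cast<double>(bytes_per_element);  // one read, one write
    return flops / bytes;  // simplifies to 2n / bytes_per_element
}

// Under this model the kernel is memory-bound while its intensity stays
// below the machine balance: peak FLOP/s divided by memory bandwidth in bytes/s.
bool is_memory_bound(std::size_t n, std::size_t bytes_per_element,
                     double peak_flops_per_s, double bandwidth_bytes_per_s) {
    assert(peak_flops_per_s > 0.0 && bandwidth_bytes_per_s > 0.0);
    return arithmetic_intensity(n, bytes_per_element) <
           peak_flops_per_s / bandwidth_bytes_per_s;
}
```

Under these assumptions the intensity grows linearly with the tile edge: a 128x128 tile of 2-byte elements comes to 128 FLOPs per byte. Plugging a specific GPU's tensor core peak and memory bandwidth into `is_memory_bound` gives a first estimate of where the crossover to compute-bound would sit.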

About

Fast 2D discrete cosine transforms on GPUs using CUTLASS and Tensor Cores
