CUTLASS and CuTe are libraries advertised for grid-level GEMMs and convolutions. But the primitives they offer, namely abstractions over the tensor core intrinsics and the register partitioning they require, are useful for many other operations that can employ tensor cores.
This repository contains an implementation of a batched 2D discrete cosine transform (DCT)
using CuTe. The kernel is in cuda-kernels/cuda/dct.cu, while the rest of the repository
is driver code to test its throughput.
The kernel performs two GEMMs for each tile, transposing the tile in between; the two passes implement the row-wise and column-wise halves of the separable 2D DCT. The tile data is carefully laid out in shared memory with CuTe swizzling to reduce bank conflicts during that transpose.
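The two-GEMM structure can be sketched in plain Python (not the kernel itself, just an illustrative model of its math): applying the same DCT-II basis matrix twice, with a transpose in between, reproduces the standard 2D DCT. The helper names below are hypothetical, not taken from the repository.

```python
import math

def dct_matrix(n):
    # Orthonormal DCT-II basis: C[k][i] = s_k * cos(pi*(2i+1)*k / (2n)),
    # with s_0 = sqrt(1/n) and s_k = sqrt(2/n) otherwise.
    c = []
    for k in range(n):
        s = math.sqrt((1 if k == 0 else 2) / n)
        c.append([s * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                  for i in range(n)])
    return c

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def dct2_two_gemms(tile):
    # Same GEMM applied twice with a transpose in between:
    # C @ tile transforms the columns; transposing and repeating
    # transforms the rows, yielding (C @ tile @ C^T)^T. One final
    # transpose recovers the standard 2D DCT-II, C @ tile @ C^T.
    c = dct_matrix(len(tile))
    return transpose(matmul(c, transpose(matmul(c, tile))))

def dct2_direct(tile):
    # Reference: the 2D DCT-II written as an explicit double sum.
    n = len(tile)
    def s(k):
        return math.sqrt((1 if k == 0 else 2) / n)
    return [[s(u) * s(v) * sum(tile[i][j]
                               * math.cos(math.pi * (2 * i + 1) * u / (2 * n))
                               * math.cos(math.pi * (2 * j + 1) * v / (2 * n))
                               for i in range(n) for j in range(n))
             for v in range(n)] for u in range(n)]
```

Because both passes use the identical GEMM, the kernel only needs one set of tensor core tiles; the transpose between passes is exactly where the swizzled shared-memory layout pays off.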
Thanks to the high throughput of tensor cores even on consumer GPUs, the kernel is completely memory-bound up to a block size of 128x128 (per batch item), the largest block size I'm aware of in any video or image codec.
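A back-of-the-envelope arithmetic-intensity estimate makes the memory-bound claim plausible. The sketch below assumes fp32 tiles that are read and written once each; the kernel's actual data types and traffic may differ.

```python
def arithmetic_intensity(n, bytes_per_elem=4):
    # Two N x N GEMMs per tile: each is 2*n^3 FLOPs (multiply + add),
    # against one read and one write of the n x n tile in global memory.
    # Assumption: fp32 elements, no extra traffic for the DCT matrices.
    flops = 2 * (2 * n ** 3)
    bytes_moved = 2 * (n * n) * bytes_per_elem
    return flops / bytes_moved  # simplifies to n / 2 for fp32
```

Intensity grows linearly with the block size (n/2 FLOPs per byte under these assumptions), so larger blocks only make it easier for the tensor cores to outrun memory bandwidth.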