[fp8 blockwise training] improve fp8 blockwise gemm perf; add GEMM bench script #2730
Repro for the Triton compiler upcasting fp8 inputs to f16 and doing mma.sync f16.f16.fp32 instead of an fp8 MMA:
python benchmarks/blockwise_fp8_training/bench_gemms.py
Then use NCU to inspect the PTX emitted for blockwise_fp8_gemm_1x128_128x1_kernel
(accepts "A" scaled with 1x128 granularity and "B" scaled with 128x1 granularity).