
Conversation

@juuso-oskari
Contributor

This PR provides finetuned configs for DeepSeek FP8 blockscale. It also fixes the benchmarking script (K = intermediate dimension // 2 for fc2 when a GLU is used, NOT N = intermediate dimension * 2 for fc1).
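A minimal sketch of the fc1/fc2 shape logic the fix describes, assuming the config stores the fused (gate + up) intermediate width; the function and argument names are hypothetical and not taken from the actual benchmarking script:

```python
# Hypothetical helper illustrating the GLU shape fix; names are illustrative.
def mlp_gemm_shapes(hidden_dim: int, intermediate_dim: int, use_glu: bool = True):
    """Return (N, K) GEMM shapes for fc1 and fc2 of an MLP block.

    With a GLU, the gate halves the activation width between fc1 and fc2,
    so fc2 contracts over K = intermediate_dim // 2, while fc1's output
    width N stays intermediate_dim (it is NOT intermediate_dim * 2).
    """
    fc1 = (intermediate_dim, hidden_dim)   # fc1: hidden -> fused gate+up
    fc2_k = intermediate_dim // 2 if use_glu else intermediate_dim
    fc2 = (hidden_dim, fc2_k)              # fc2: gated half -> hidden
    return fc1, fc2
```

For example, with hidden_dim = 7168 and a fused intermediate of 4224, fc2 contracts over K = 2112 rather than N = 8448.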

@anhminhnguyenhoang

Tensor shapes M=[64, 128, 256], N=2112, K=7168 for the DeepSeek model have been tuned with Triton block configs, giving at least a 2x performance uplift over the current state of the main branch.

# before
     M     N     K  throughput (TFLOPs)  time (ms)  bandwidth (GB/s)
0   64  2112  7168            14.984021   0.128799        123.609343
1  128  2112  7168            30.529371   0.127317        131.801976
2  256  2112  7168            66.172665   0.117496        154.754403

# after
     M     N     K  throughput (TFLOPs)  time (ms)  bandwidth (GB/s)
0   64  2112  7168            27.073608   0.068483        221.219118
1  128  2112  7168            89.067354   0.043015        372.392590
2  256  2112  7168           180.079694   0.045219        414.613027

PS: Results were collected on asrock-1w300-e0-3.mkm.dcgpu (mi350xas2).
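As a sanity check, the throughput column follows from the GEMM FLOP count (2·M·N·K) divided by the measured time. A small sketch of that conversion (my own helper, not part of the benchmark script); recomputed values land within a few percent of the table, with small differences likely due to how times are aggregated:

```python
def gemm_tflops(M: int, N: int, K: int, time_ms: float) -> float:
    """Convert a GEMM timing into TFLOPs: 2*M*N*K FLOPs over time_ms milliseconds."""
    flops = 2 * M * N * K
    return flops / (time_ms * 1e-3) / 1e12
```

For the first "before" row, gemm_tflops(64, 2112, 7168, 0.128799) gives roughly 15.0 TFLOPs, consistent with the reported 14.98.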

@anhminhnguyenhoang anhminhnguyenhoang marked this pull request as ready for review October 23, 2025 12:04