What's Changed
- Update gfx942 FA fwd kernel by @slippedJim in #648
- Fix Precision Issue in RoPE Tests by @ruanjm in #627
- [TRITON]: add json config and refactor by @rahulbatra85 in #595
- [TRITON] Refactor Triton RMSNorm and LayerNorm unit tests by @lucas-santos-amd in #598
- [TRITON]: Add Triton PodAttention by @valechen in #651
- Update MI300 FA fwd kernel by @slippedJim in #655
- update moe sorting and CK by @junhaha666 in #660
- refactor by @fsx950223 in #664
- [Triton] DS fused custom ops by @k50112113 in #607
- Fix ck_gemm_a4w4_blockscale tune with splitK by @ukannika-amd in #653
- add fmoe_int8_g1u1_smf_subGU_256 by @valarLip in #667
- Add option to choose between CK RMSNorm pipelines by @ClementLinCF in #647
- Update CK by @poyenc in #669
- edit gemm_a4w8_asm api by @junhaha666 in #672
- Optimize the topK Softmax kernel to reduce one round of topK reduce (idea by Cui Cu) by @junhaha666 in #673
- Remove dpad==dvpad limit in CK FA bwd codegen by @slippedJim in #677
- [TRITON]: Benchmarking scripts updates by @willzhou-amd in #650
- [TRITON]: Adding Lean + Paged Attention, for decode by @alexdutu in #376
- [TRITON] Tune fp4xfp4 GEMM by @willzhou-amd in #641
- slice acc into two parts to reduce vgpr usage by @xiaohuguo2023 in #659
- fix gemm a4w4 compile issue by @rocking5566 in #681
- FA bwd asm kernel update by @slippedJim in #679
- Gemm a8w8 bpreshuffle api fix by @junhaha666 in #682
- Refine FA impl by @slippedJim in #683
- fix fmoe a8w8 ck stage2 not supporting inter_dim % 256 = 0 by @junhaha666 in #684
- [TRITON]: add hstu attn op to aiter by @scxiao in #629
- add support for loading json.gz by @valarLip in #687
- add blockscale ps asm moe by @junhaha666 in #624
- Pa fp8 mfma by @fsx950223 in #694
- [fea]: new kernel for allreduce optimize by @TennyWang1223 in #699
- fmoe_codegen_asm by @amd-ruitang3 in #690
- add moe_fuse_gate_topK from sglang by @junhaha666 in #700
- fix prebuild file path by @fsx950223 in #692
- [TRITON]: Add benchmark test for leanAttention by @valechen in #688
- [TRITON] Add LayerNorm Backward Triton Kernels by @lucas-santos-amd in #546
- [TRITON] Add Torch unit test reference to PA Prefill Triton Kernels by @lucas-santos-amd in #676
- [TRITON]: Add missing GEMM benchmarks by @willzhou-amd in #680
- A4w4_asm_pro by @zufayu in #649
- fix topk bug by @junhaha666 in #708
- Fix swa condition in FA bwd v3 api by @slippedJim in #707
- use ck_tile::get_warp_size() by @junhaha666 in #710
- fix bug in splitK select by @zufayu in #717
- enable gemm_a4w4 asm kernel to tune splitk by @yzhou103 in #662
- refine moe by @valarLip in #701
- [TRITON]: extend attention bf16 test fix by @Chi-Chu319 in #705
- [Bugfix] Skinny GEMM in tuned gemm.py: add output conversion to tuned_gemm.mm by @vllmellm in #665
- [TRITON]: Add logging to GEMM ops by @rahulbatra85 in #722
- [TRITON] Shaoclee/ds mxfp4 gemm tune by @k50112113 in #693
- [TRITON] shaoclee/triton gemm a8w8 dev by @k50112113 in #709
- [TRITON]: enable buffer ops for lean attention by @xiaohuguo2023 in #725
- update ptpc bpreshuffle gemm tune by @valarLip in #719
- Try to get cu num from env first by @slippedJim in #739
- [fea]: new ar interface by @TennyWang1223 in #750
- A4w4_asm_pro_max_v2 by @zufayu in #741
- asm_fmoe_codegen by @amd-ruitang3 in #702
- Fix fmha codegen when pip install aiter by @slippedJim in #734
- Add sglang ci tests by @gyohuangxin in #735
- [TRITON]: LeanAttention implement loop unrolling to reduce VGPR usage by @valechen in #744
- increase build core num by @valarLip in #730
- [TRITON] mha benchmark fix by @Chi-Chu319 in #748
- fix conflict between AITER_REBUILD and gen_func by @valarLip in #761
- add more bpreshuffle instances by @solinzby1 in #747
- fix random precision issues: 192/224x256 tile asm .so files by @zufayu in #751
- [TRITON]: MLA and Lean Attention updates by @willzhou-amd in #720
- [TRITON]: Add fused GEMMs to optimize FF block by @willzhou-amd in #736
- [TRITON]: Clear cache allocator in Triton tests by @rahulbatra85 in #743
- mdf_UT_args by @amd-ruitang3 in #752
- Enable custom op and avoid graph breaks by @ZhangLirong-amd in #740
- Create docs folder and the doc 'Build and Run the Aiter Container as a Non-root User' by @gyohuangxin in #760
- fix quant_type=1x128 (128x128) not being able to use the tuned_fmoe cfg by @junhaha666 in #758
- add prebuild options in ck_moe by @lalala-sh in #732
- optimize test args by @amd-ruitang3 in #768
- [TRITON]: Add logging info to Triton Kernels by @rahulbatra85 in #729
- fix multiprocess tuning problem by @yzhou103 in #733
- add layout limitation for FA fwd v3 by @slippedJim in #764
- Sampling by @fsx950223 in #727
- Fix issues in sglang ci test when it's from a forked repo by @gyohuangxin in #769
- Support torch.library.infer_schema for torch < 2.5 by @ZhangLirong-amd in #773
- Fix FA fwd asm limitation by @slippedJim in #782
- LeanAttention code modularization by @valechen in #765
- fix arg parser in pa_v1.py main entry by @842974287 in #772
- fix missing-braces warning during compilation by @842974287 in #770
- Fix MHA build failure by @ZhangLirong-amd in #787
- Wrap import torch to avoid build issue by @ZhangLirong-amd in #780
- Add assert to prevent users from forgetting to return lse for training by @rocking5566 in #776
- fix test_rmsnorm2dFusedAddQuant.py --mode 3 by @valarLip in #794
- Make Gemm and other ops return Tensor, and fix graph breaks by @ZhangLirong-amd in #783
- Batch gemm tuning in parallel by @yzhou103 in #711
- fix type hint for rmsnorm2d_fwd_with_add_smoothquant by @valarLip in #796
- Fix issues in sglang test by @gyohuangxin in #800
- Add receipt for pytorch by @alugorey in #791
- [TRITON]: Benchmarking changes for performance CI by @willzhou-amd in #762
- fix ep test by @valarLip in #799
- [TRITON] Add Chunked PA Prefill Triton Kernel by @lucas-santos-amd in #745
- update ck and compiler to c++20 by @rocking5566 in #803
- update the supported-arguments configuration for aiter params in the readme by @minmengdie in #789
- Enable FA multi target build by @slippedJim in #774
- Optimize topksoftmax: top-K-only softmax + 32B vector loads by @CuiCu-618 in #804 (sketched below)
- Fix get_num, gfx, get_padded_m and other breaks in dynamo by @ZhangLirong-amd in #797
- [fix]: fix ar 1stage sync error by @TennyWang1223 in #807
- update CK to fix fa fwd build error by @slippedJim in #810
- Fix issues in Triton Test by @gyohuangxin in #813
- LeanAttention optimization by @valechen in #817
- update ck to improve mha bwd by @rocking5566 in #808
- Dispatch dq_shuffle kernel based on hdim_q by @slippedJim in #812
- fix time calculation when multiple identical kernels run in one test by @yzhou103 in #802
- Fmoe update by @junhaha666 in #821
- fmoe fp8 g1u1 vskip by @amd-ruitang3 in #798
- [TRITON] Add non-TN layout tests to Triton GEMMs by @lucas-santos-amd in #824
- [TRITON]: Disable mha bkwd UT by @rahulbatra85 in #831
- Sglang Test Enhancement by @gyohuangxin in #818
- Fix mha running without bwd and torch 2.9 incompatibility by @ZhangLirong-amd in #820
- Fix torch 2.4 not supporting infer_schema with str having a default value by @ZhangLirong-amd in #827
- [TRITON]: extend_attention.py and mla_decode_rope.py tuning for mi350 by @juuso-oskari in #696
- [fix]: replace 512 with smem_gpu_loop_stride by @TennyWang1223 in #828
- modify the swa condition in mha readme by @minmengdie in #806
- update gfx950 fmha_v3_bwd co file, which is generated from gfx942 asm code by @minmengdie in #816
- Pa v0 fp8 by @fsx950223 in #814
- Disable getHipblasltKernelName to fix tune error by @ZhangLirong-amd in #846
- Fix FA codegen import path by @slippedJim in #845
- [TRITON]: CI Test Set & Benchmark Speedups by @willzhou-amd in #825
- [TRITON] add test scripts for fp8 bmm prequant kernel by @k50112113 in #786
- [TRITON]: End-to-end fused feed-forward kernel by @willzhou-amd in #778
- fix gfx950 by @fsx950223 in #849
- add moe 1stage implementation to tune by @yzhou103 in #837
- [TRITON] Moe fp8 tuning mi350 by @Chi-Chu319 in #790
- [TRITON]: Reduce usage of hardcoded XCD by @rahulbatra85 in #852
- [TRITON]: Lean Attention spatial2 by @valechen in #853
- fix fp4 gemm precision issue by @junhaha666 in #861
- PA asm arg refine by @valarLip in #856
- Fix old torch versions not supporting ignore_method by @ZhangLirong-amd in #857
- Update ck_tile::make_kernel interface to fix the build by @linqun in #854
- Fix mha gen_fake impl error and rope acc issue in torch.compile by @ZhangLirong-amd in #860
- fix typo by @fsx950223 in #865
- Enable FA bwd gfx950 kernels padding on seqlen by @minmengdie in #815
- fix fmoe autotune and tuned_gemm by @yzhou103 in #863
- fix init_dist_env world_size==1 use car and comment out get_distribu… by @junhaha666 in #876
- fix splitk kernel precision issues by @zufayu in #877
- [TRITON]: Add Json file for batched gemm a8w8 by @rahulbatra85 in #871
- Enable group mode seqlen padding between batches by @slippedJim in #859
- Enable concurrency in Sglang CI to abort previous runs on new workflow triggers by @gyohuangxin in #869
- Update CK to Fix Build Error for Instances with ELEMENTWISE_BIAS by @DDEle in #874
- refine moe tuner by @lalala-sh in #858
- refine moe codegen by @lalala-sh in #866
- Add Quick Gemm Performance Tuning for Popular Workloads by @wuhuikx in #838
- kvcache_minor_opt by @valarLip in #881
- Use env var to index asm path by @slippedJim in #880
- moe_op fix typo by @lalala-sh in #883
- [TRITON]: remove print from mha bwd by @juuso-oskari in #868
- [TRITON]: Disable hanging testcases for gemm_a8w8_blockscale by @rahulbatra85 in #872
- opt_act_and_mul by @valarLip in #885
- add fp8 test for rms_fuse_quant by @Bernard-Liu in #715
- AITER_JIT_DIR by @amd-ruitang3 in #889
- uint8 compatibility by @zufayu in #888
- Enable CK exclude build in FA cpp api by @slippedJim in #894
- [GemmTuner] Immediately dump a result row once it is tuned by @xziya in #882
- [Triton] Constexpr function bug fix by @k50112113 in #890
- Enable nano-vllm model by @ZhangLirong-amd in #891
- update jit error log style by @valarLip in #905
- update batch & group mode fwd kernel padding on seqlen dimension by @minmengdie in #870
- [TRITON]: Add Gluon Blockscale A8W8 GEMM by @willzhou-amd in #886
- Rename variable scoring_func to is_softmax by @ZhangLirong-amd in #904
- Fix FA cpp api build error by @slippedJim in #903
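
A note on the "top-K-only softmax" optimization behind #673 and #804: because softmax is order-preserving, selecting the top k experts on the raw router logits picks the same set as selecting them after a full-row softmax, so the kernel can normalize only the k selected values and skip one full reduction pass over the expert dimension. Below is a minimal PyTorch sketch of the idea; the function name `topk_softmax` and the renormalize-over-k convention are illustrative assumptions, not the actual aiter kernel API.

```python
import torch

def topk_softmax(router_logits: torch.Tensor, k: int):
    """Sketch of top-K-only softmax (illustrative; not the aiter kernel API).

    softmax is monotonic, so top-k over raw logits selects the same experts
    as top-k over softmax(logits); normalizing just the k selected logits
    avoids a softmax pass over the full expert dimension.
    """
    topk_vals, topk_ids = torch.topk(router_logits, k, dim=-1)  # (num_tokens, k)
    topk_weights = torch.softmax(topk_vals, dim=-1)  # normalize over k values only
    return topk_weights, topk_ids
```

Whether the fused kernel renormalizes over the selected k (as above) or divides by the full-row denominator is an implementation detail of those PRs; the 32B vector loads mentioned in #804 are a memory-access optimization orthogonal to this math.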
New Contributors
- @ClementLinCF made their first contribution in #647
- @alexdutu made their first contribution in #376
- @xiaohuguo2023 made their first contribution in #659
- @scxiao made their first contribution in #629
- @vllmellm made their first contribution in #665
- @gyohuangxin made their first contribution in #735
- @ZhangLirong-amd made their first contribution in #740
- @842974287 made their first contribution in #772
- @alugorey made their first contribution in #791
- @minmengdie made their first contribution in #789
- @linqun made their first contribution in #854
- @DDEle made their first contribution in #874
- @wuhuikx made their first contribution in #838
- @Bernard-Liu made their first contribution in #715
- @xziya made their first contribution in #882
Full Changelog: v0.1.4...v0.1.5