Releases · ROCm/aiter
v0.1.5 release
What's Changed
- Update gfx942 FA fwd kernel by @slippedJim in #648
- Fix Precision Issue in RoPE Tests by @ruanjm in #627
- [TRITON]: add json config and refactor by @rahulbatra85 in #595
- [TRITON] Refactor Triton RMSNorm and LayerNorm unit tests by @lucas-santos-amd in #598
- [TRITON]: Add Triton PodAttention by @valechen in #651
- Update MI300 FA fwd kernel by @slippedJim in #655
- update moe sorting and CK by @junhaha666 in #660
- refactor by @fsx950223 in #664
- [Triton] DS fused custom ops by @k50112113 in #607
- Fix ck_gemm_a4w4_blockscale tune with splitK by @ukannika-amd in #653
- add fmoe_int8_g1u1_smf_subGU_256 by @valarLip in #667
- Add option to choose between CK RMSNorm pipelines by @ClementLinCF in #647
- Update CK by @poyenc in #669
- edit gemm_a4w8_asm api by @junhaha666 in #672
- Optimize the topK Softmax kernel to reduce one round of topK reduce (idea by Cui Cu) by @junhaha666 in #673
- Remove dpad==dvpad limit in CK FA bwd codegen by @slippedJim in #677
- [TRITON]: Benchmarking scripts updates by @willzhou-amd in #650
- [TRITON]: Adding Lean + Paged Attention for decode by @alexdutu in #376
- [TRITON] Tune fp4xfp4 GEMM by @willzhou-amd in #641
- slice acc into two parts to reduce vgpr usage by @xiaohuguo2023 in #659
- fix gemm a4w4 compile issue by @rocking5566 in #681
- FA bwd asm kernel update by @slippedJim in #679
- Gemm a8w8 bpreshuffle api fix by @junhaha666 in #682
- Refine FA impl by @slippedJim in #683
- fix fmoe a8w8 CK stage2 not supporting inter_dim % 256 = 0 by @junhaha666 in #684
- [TRITON]: add hstu attn op to aiter by @scxiao in #629
- add support for load json.gz by @valarLip in #687
- add blockscale ps asm moe by @junhaha666 in #624
- Pa fp8 mfma by @fsx950223 in #694
- [fea]: new kernel for all-reduce optimization by @TennyWang1223 in #699
- fmoe_codegen_asm by @amd-ruitang3 in #690
- add moe_fuse_gate_topK from sglang by @junhaha666 in #700
- fix prebuild file path by @fsx950223 in #692
- [TRITON]: Add benchmark test for leanAttention by @valechen in #688
- [TRITON] Add LayerNorm Backward Triton Kernels by @lucas-santos-amd in #546
- [TRITON] Add Torch unit test reference to PA Prefill Triton Kernels by @lucas-santos-amd in #676
- [TRITON]: Add missing GEMM benchmarks by @willzhou-amd in #680
- A4w4_asm_pro by @zufayu in #649
- fix topk bug by @junhaha666 in #708
- Fix swa condition in FA bwd v3 api by @slippedJim in #707
- use ck_tile::get_warp_size() by @junhaha666 in #710
- fix bug in splitK select by @zufayu in #717
- enable gemm_a4w4 asm kernel to tune splitk by @yzhou103 in #662
- refine moe by @valarLip in #701
- [TRITON]: extend attention bf16 test fix by @Chi-Chu319 in #705
- [Bugfix] Skinny GEMM in tuned gemm.py: add output conversion to tuned_gemm.mm by @vllmellm in #665
- [TRITON]: Add logging to GEMM ops by @rahulbatra85 in #722
- [TRITON] Shaoclee/ds mxfp4 gemm tune by @k50112113 in #693
- [TRITON] shaoclee/triton gemm a8w8 dev by @k50112113 in #709
- [TRITON]: enable buffer ops for lean attention by @xiaohuguo2023 in #725
- update ptpc bpreshuffle gemm tune by @valarLip in #719
- Try to get cu num from env first by @slippedJim in #739
- [fea]: new ar interface by @TennyWang1223 in #750
- A4w4_asm_pro_max_v2 by @zufayu in #741
- asm_fmoe_codegen by @amd-ruitang3 in #702
- Fix fmha codegen when pip install aiter by @slippedJim in #734
- Add sglang ci tests by @gyohuangxin in #735
- [TRITON]: LeanAttention implement loop unrolling to reduce VGPR usage by @valechen in #744
- increase build core num by @valarLip in #730
- [TRITON] mha benchmark fix by @Chi-Chu319 in #748
- fix conflict between AITER_REBUILD and gen_func by @valarLip in #761
- add more bpreshuffle instances by @solinzby1 in #747
- fix random precision issues: 192/224x256 tile asm .so files by @zufayu in #751
- [TRITON]: MLA and Lean Attention updates by @willzhou-amd in #720
- [TRITON]: Add fused GEMMs to optimize FF block by @willzhou-amd in #736
- [TRITON]: Clear cache allocator in Triton tests by @rahulbatra85 in #743
- mdf_UT_args by @amd-ruitang3 in #752
- Enable custom op and avoid graph breaks by @ZhangLirong-amd in #740
- Create docs folder and the doc 'Build and Run the Aiter Container as a Non-root User' by @gyohuangxin in #760
- fix quant_type=1x128 (128x128) unable to use tuned_fmoe cfg by @junhaha666 in #758
- add prebuild options in ck_moe by @lalala-sh in #732
- optimize test args by @amd-ruitang3 in #768
- [TRITON]: Add logging info to Triton Kernels by @rahulbatra85 in #729
- fix multiprocess tuning problem by @yzhou103 in #733
- add layout limitation for FA fwd v3 by @slippedJim in #764
- Sampling by @fsx950223 in #727
- Fix issues in sglang ci test when it's from a forked repo. by @gyohuangxin in #769
- Support torch.library.infer_schema for torch < 2.5 by @ZhangLirong-amd in #773
- Fix FA fwd asm limitation by @slippedJim in #782
- LeanAttention code modularization by @valechen in #765
- fix arg parser in pa_v1.py main entry by @842974287 in #772
- fix missing-braces warning during compilation by @842974287 in #770
- Fix MHA build failed by @ZhangLirong-amd in #787
- Wrapper import torch to avoid build issue by @ZhangLirong-amd in #780
- Add assert to prevent users forgetting to return lse for training by @rocking5566 in #776
- fix test_rmsnorm2dFusedAddQuant.py --mode 3 by @valarLip in #794
- Make GEMM and other ops return Tensor and fix graph breaks by @ZhangLirong-amd in #783
- Batch gemm tuning in parallel by @yzhou103 in #711
- fix typehint for rmsnorm2d_fwd_with_add_smoothquant by @valarLip in #796
- Fix issues in sglang test by @gyohuangxin in #800
- Add receipt for pytorch by @alugorey in #791
- [TRITON]: Benchmarking changes for performance CI by @willzhou-amd in #762
- fix ep test by @valarLip in #799
- [TRITON] Add Chunked PA Prefill Triton Kernel by @lucas-santos-amd in #745
- update ck and compiler to c++20 by @rocking5566 in #803
- update aiter supported arguments configuration in README by @minmengdie in #789
- Enable FA multi target build by @slippedJim in #774
- Optimize topksoftmax: top-K-only softmax + 32B vector loads by @CuiCu-618 in #804 (see the sketch after this list)
- Fix get_num, gfx, get_padded_m and other breaks in dynamo by @ZhangLirong-amd in #797
- [fix]: fix ar 1stage sync error by @TennyWang1223 in #807
- update CK to fix fa fwd build error by @slippedJim in #810
- Fix issues in Triton Test by @gyohuangxin in #813
- LeanAttention optimization by @valechen in https://github.com/R...
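For context on the "top-K-only softmax" pattern optimized above (#673, #804): rather than computing softmax over the whole row and then selecting, the kernel selects the top-k logits first and normalizes only over those k values. A minimal PyTorch reference of the idea, assuming the usual MoE routing semantics (illustrative only; the fused aiter kernel does this on-GPU in a single pass):

```python
import torch

def topk_softmax_ref(logits: torch.Tensor, k: int):
    # Top-K-only softmax: normalize over just the k selected logits,
    # equivalent to masking the rest to -inf but without a full-row pass.
    vals, ids = torch.topk(logits, k, dim=-1)
    weights = torch.softmax(vals, dim=-1)
    return weights, ids
```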
v0.1.4 July release
- mxfp4 enabled for gfx950, including GEMM, MoE, and per-1x32 quantization (see the sketch after this list)
- multi-GPU tuning enabled for most kinds of GEMMs
- fp8 all-reduce
- a number of new Triton kernels
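As a rough illustration of the per-1x32 scheme above: every 32 consecutive values along the last dimension share one power-of-two scale, as in the MX (microscaling) formats. A conceptual PyTorch sketch, not aiter's kernel; `FP4_MAX` and the exponent rounding are assumptions:

```python
import torch

FP4_MAX = 6.0  # largest magnitude of the e2m1 element format used by mxfp4

def quant_per_1x32(x: torch.Tensor):
    # Group the last dim into blocks of 32; each block shares one scale.
    g = x.reshape(*x.shape[:-1], -1, 32)
    amax = g.abs().amax(dim=-1, keepdim=True)
    # MX scales are powers of two (E8M0); the clamp guards all-zero blocks.
    scale = torch.exp2(torch.ceil(torch.log2(amax / FP4_MAX)).clamp(min=-127))
    q = (g / scale).clamp(-FP4_MAX, FP4_MAX)  # a real kernel casts to fp4 here
    return q.reshape_as(x), scale.squeeze(-1)
```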
What's Changed
- [TRITON] Add Triton Topk Kernel by @hubertlu-tw in #458
- Find executable in rocm home when not found in PATH by @xli in #549
- [TRITON]: Disable int4 moe UT by @rahulbatra85 in #563
- add a4w4 asm_moe by @valarLip in #482
- Improved detection of setup.py install by @ekuznetsov139 in #534
- Disable mha related modules in prebuild by @slippedJim in #567
- Fix format error in .clang-format by @poyenc in #568
- update pa asm by @amd-ruitang3 in #553
- [TRITON]: Reorg mha code and use common fp8 type by @rahulbatra85 in #561
- [TRITON]: Gemm refactor by @rahulbatra85 in #558
- [Triton]: Add has_attr check in get_config by @rahulbatra85 in #572
- [TRITON]: GEMM updates for DS by @rahulbatra85 in #573
- update_codegen by @amd-ruitang3 in #581
- mi350_pa by @amd-ruitang3 in #579
- Change input tensor format to [B,S,H,d] and add batch support for causal by @valechen in #578
- update tune config file by @solinzby1 in #569
- [TRITON] Add RMSNorm bwd Triton Kernels by @lucas-santos-amd in #576
- fix prebuild by @junhaha666 in #592
- [TRITON]: Quantization updates(add int8 and use common fp8 dtypes) by @rahulbatra85 in #588
- Dispatch combine by @junhaha666 in #571
- update args by @amd-ruitang3 in #590
- Pa rocm refresh4 by @fsx950223 in #591
- [update]: update all-reduce by @TennyWang1223 in #552
- Fix compile error in MI350 with ROCm7 by @rocking5566 in #599
- new codegen for elementwise by @TennyWang1223 in #585
- [fix]: elementwise prebuild slow by @TennyWang1223 in #609
- [TRITON]: Fp4gemm m=256 tuning by @Chi-Chu319 in #533
- add MI350 support for skinny_gemm by @yanguahe in #602
- Fix prebuild 350 by @junhaha666 in #608
- [fix]: change ar namespace by @TennyWang1223 in #611
- compile flag clean up by @valarLip in #615
- DIY_args by @amd-ruitang3 in #596
- fix NUM_Q_HEADS - 1 in remap_xcd in _attn_fwd by @juuso-oskari in #612
- add ck gemm a4w4 blockscale with splitK support by @ukannika-amd in #603
- [TRITON]: pid grid fix by @Chi-Chu319 in #618
- Refine ck instance and update a8w8_bpreshuffle_tuned_gemm.csv by @solinzby1 in #621
- merge moe from 350 launch by @lalala-sh in #580
- Remove seqlen limit on FA fwd kernel by @slippedJim in #622
- [Triton] RoPE dev by @k50112113 in #606
- [TRITON]: Fix num_warps typo which was causing performance issues by @valechen in #604
- Topksoftmax_opt by @junhaha666 in #626
- update hip quant for corner case by @valarLip in #633
- [TRITON]: use int64 strides by default for MHA by @rahulbatra85 in #634
- [TRITON]: Standardize GEMM weight shape to (N, K) and TN memory layout (by default) by @willzhou-amd in #597
- [TRITON] Add Softmax Triton Kernel by @lucas-santos-amd in #605
- Enable gfx942 FA fwd asm kernels by @slippedJim in #619
- Update CK by @poyenc in #635
- Fix error message for rocminfo by @Rohan138 in #636
- [TRITON]: Moe tuning mi350 by @Chi-Chu319 in #610
- Fix test_pa_ragged.py use_alibi=True test cases by @poyenc in #639
- Fix FA fwd nan issue by @slippedJim in #646
- fix for fp8 e4m3fn by @valarLip in #640
- [TRITON]: Kernel benchmarking improvements (for op_benchmarks/triton) by @willzhou-amd in #594
- [Triton]: Disable fused+causal for MHA bkwd by @rahulbatra85 in #642
- enable parallel tuning on CK kernels by @yzhou103 in #625 (see the sketch after this list)
- Pa fix2 by @fsx950223 in #645
- Update dependencies and add backup for unknown hw by @kunaltyagi in #623
- Optimize topksoftmax WARPS_PER_TB for higher occupancy and remove redundant precision conversion by @CuiCu-618 in #652
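The parallel-tuning entries above (#625, and #711 in v0.1.5) follow a common pattern: shard the shapes to tune across worker processes, each pinned to its own GPU. A hedged sketch of that pattern; `tune_one_shape` is a stand-in, not aiter's tuner, and pinning via `HIP_VISIBLE_DEVICES` is an assumption:

```python
import os
from multiprocessing import Pool

NUM_GPUS = 8  # assumption: eight visible devices

def tune_one_shape(args):
    idx, (m, n, k) = args
    # Pin this task to one GPU before any GPU runtime loads in the worker.
    os.environ["HIP_VISIBLE_DEVICES"] = str(idx % NUM_GPUS)
    # ... a real tuner would time candidate kernel configs here ...
    return (m, n, k), f"best_cfg_{m}x{n}x{k}"

if __name__ == "__main__":
    shapes = [(1024, 1024, 1024), (4096, 4096, 8192)]
    # maxtasksperchild=1 gives each task a fresh process, so the env var
    # takes effect before any GPU library initializes.
    with Pool(NUM_GPUS, maxtasksperchild=1) as pool:
        print(pool.map(tune_one_shape, list(enumerate(shapes))))
```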
New Contributors
- @hubertlu-tw made their first contribution in #458
- @xli made their first contribution in #549
- @ekuznetsov139 made their first contribution in #534
- @valechen made their first contribution in #578
- @willzhou-amd made their first contribution in #597
- @Rohan138 made their first contribution in #636
- @yzhou103 made their first contribution in #625
- @kunaltyagi made their first contribution in #623
- @CuiCu-618 made their first contribution in #652
Full Changelog: v0.1.3...v0.1.4