What's Changed
- Update gfx942 FA fwd kernel by @slippedJim in #648
- Fix Precision Issue in RoPE Tests by @ruanjm in #627
- [TRITON]: add json config and refactor by @rahulbatra85 in #595
- [TRITON] Refactor Triton RMSNorm and LayerNorm unit tests by @lucas-santos-amd in #598
- [TRITON]: Add Triton PodAttention by @valechen in #651
- Update MI300 FA fwd kernel by @slippedJim in #655
- update moe sorting and CK by @junhaha666 in #660
- refactor by @fsx950223 in #664
- [Triton] DS fused custom ops by @k50112113 in #607
- Fix ck_gemm_a4w4_blockscale tune with splitK by @ukannika-amd in #653
- add fmoe_int8_g1u1_smf_subGU_256 by @valarLip in #667
- Add option to choose between CK RMSNorm pipelines by @ClementLinCF in #647
- Update CK by @poyenc in #669
- edit gemm_a4w8_asm api by @junhaha666 in #672
- Optimize the topK Softmax kernel to reduce one round of topK reduce (idea by Cui Cu) by @junhaha666 in #673
- Remove dpad==dvpad limit in CK FA bwd codegen by @slippedJim in #677
- [TRITON]: Benchmarking scripts updates by @willzhou-amd in #650
- [TRITON]: Adding Lean + Paged Attention, for decode by @alexdutu in #376
- [TRITON] Tune fp4xfp4 GEMM by @willzhou-amd in #641
- slice acc into two parts to reduce vgpr usage by @xiaohuguo2023 in #659
- fix gemm a4w4 compile issue by @rocking5566 in #681
- FA bwd asm kernel update by @slippedJim in #679
- Gemm a8w8 bpreshuffle api fix by @junhaha666 in #682
- Refine FA impl by @slippedJim in #683
- fix fmoe a8w8 ck stage2 not supporting inter_dim % 256 = 0 by @junhaha666 in #684
- [TRITON]: add hstu attn op to aiter by @scxiao in #629
- add support for loading json.gz by @valarLip in #687
- add blockscale ps asm moe by @junhaha666 in #624
- Pa fp8 mfma by @fsx950223 in #694
- [fea]: new kernel for allreduce optimize by @TennyWang1223 in #699
- fmoe_codegen_asm by @amd-ruitang3 in #690
- add moe_fuse_gate_topK from sglang by @junhaha666 in #700
- fix prebuild file path by @fsx950223 in #692
- [TRITON]: Add benchmark test for leanAttention by @valechen in #688
- [TRITON] Add LayerNorm Backward Triton Kernels by @lucas-santos-amd in #546
- [TRITON] Add Torch unit test reference to PA Prefill Triton Kernels by @lucas-santos-amd in #676
- [TRITON]: Add missing GEMM benchmarks by @willzhou-amd in #680
- A4w4_asm_pro by @zufayu in #649
- fix topk bug by @junhaha666 in #708
- Fix swa condition in FA bwd v3 api by @slippedJim in #707
- use ck_tile::get_warp_size() by @junhaha666 in #710
- fix bug in splitK select by @zufayu in #717
- enable gemm_a4w4 asm kernel to tune splitk by @yzhou103 in #662
- refine moe by @valarLip in #701
- [TRITON]: extend attention bf16 test fix by @Chi-Chu319 in #705
- [Bugfix] Skinny GEMM in tuned gemm.py: add output conversion to tuned_gemm.mm by @vllmellm in #665
- [TRITON]: Add logging to GEMM ops by @rahulbatra85 in #722
- [TRITON] Shaoclee/ds mxfp4 gemm tune by @k50112113 in #693
- [TRITON] shaoclee/triton gemm a8w8 dev by @k50112113 in #709
- [TRITON]: enable buffer ops for lean attention by @xiaohuguo2023 in #725
- update ptpc bpreshuffle gemm tune by @valarLip in #719
- Try to get cu num from env first by @slippedJim in #739
- [fea]: new ar interface by @TennyWang1223 in #750
- A4w4_asm_pro_max_v2 by @zufayu in #741
- asm_fmoe_codegen by @amd-ruitang3 in #702
- Fix fmha codegen when pip install aiter by @slippedJim in #734
- Add sglang ci tests by @gyohuangxin in #735
- [TRITON]: LeanAttention implement loop unrolling to reduce VGPR usage by @valechen in #744
- increase build core num by @valarLip in #730
- [TRITON] mha benchmark fix by @Chi-Chu319 in #748
- fix conflict between AITER_REBUILD and gen_func by @valarLip in #761
- add more bpreshuffle instances by @solinzby1 in #747
- fix random precision issues: 192/224x256 tile asm .so files by @zufayu in #751
- [TRITON]: MLA and Lean Attention updates by @willzhou-amd in #720
- [TRITON]: Add fused GEMMs to optimize FF block by @willzhou-amd in #736
- [TRITON]: Clear cache allocator in Triton tests by @rahulbatra85 in #743
- mdf_UT_args by @amd-ruitang3 in #752
- Enable custom op and avoid graph breaks by @ZhangLirong-amd in #740
- Create docs folder and the doc 'Build and Run the Aiter Container as a Non-root User' by @gyohuangxin in #760
- fix quant_type=1x128 (128x128) not being able to use the tuned_fmoe cfg by @junhaha666 in #758
- add prebuild options in ck_moe by @lalala-sh in #732
- optimize test args by @amd-ruitang3 in #768
- [TRITON]: Add logging info to Triton Kernels by @rahulbatra85 in #729
- fix multiprocess tuning problem by @yzhou103 in #733
- add layout limitation for FA fwd v3 by @slippedJim in #764
- Sampling by @fsx950223 in #727
- Fix issues in sglang ci test when it's from a forked repo by @gyohuangxin in #769
- Support torch.library.infer_schema for torch < 2.5 by @ZhangLirong-amd in #773
- Fix FA fwd asm limitation by @slippedJim in #782
- LeanAttention code modularization by @valechen in #765
- fix arg parser in pa_v1.py main entry by @842974287 in #772
- fix missing-braces warning during compilation by @842974287 in #770
- Fix MHA build failure by @ZhangLirong-amd in #787
- Wrap import torch to avoid build issue by @ZhangLirong-amd in #780
- Add assert to prevent users from forgetting to return lse for training by @rocking5566 in #776
- fix test_rmsnorm2dFusedAddQuant.py --mode 3 by @valarLip in #794
- Make Gemm and other ops return Tensor, and fix graph breaks by @ZhangLirong-amd in #783
- Batch gemm tuning in parallel by @yzhou103 in #711
- fix type hint for rmsnorm2d_fwd_with_add_smoothquant by @valarLip in #796
- Fix issues in sglang test by @gyohuangxin in #800
- Add receipt for pytorch by @alugorey in #791
- [TRITON]: Benchmarking changes for performance CI by @willzhou-amd in #762
- fix ep test by @valarLip in #799
- [TRITON] Add Chunked PA Prefill Triton Kernel by @lucas-santos-amd in #745
- update ck and compiler to c++20 by @rocking5566 in #803
- update the supported-arguments configuration for aiter params in the readme by @minmengdie in #789
- Enable FA multi target build by @slippedJim in #774
- Optimize topksoftmax: top-K-only softmax + 32B vector loads by @CuiCu-618 in #804 (sketched below)
- Fix get_num, gfx, get_padded_m and other breaks in dynamo by @ZhangLirong-amd in #797
- [fix]: fix ar 1stage sync error by @TennyWang1223 in #807
- update CK to fix fa fwd build error by @slippedJim in #810
- Fix issues in Triton Test by @gyohuangxin in #813
- LeanAttention optimization by @valechen in #817
- update ck to improve mha bwd by @rocking5566 in #808
- Dispatch dq_shuffle kernel based on hdim_q by @slippedJim in #812
- fix time calculation when multiple identical kernels run in one test by @yzhou103 in #802
- Fmoe update by @junhaha666 in #821
- fmoe fp8 g1u1 vskip by @amd-ruitang3 in #798
- [TRITON] Add non-TN layout tests to Triton GEMMs by @lucas-santos-amd in #824
- [TRITON]: Disable mha bkwd UT by @rahulbatra85 in #831
- Sglang Test Enhancement by @gyohuangxin in #818
- Fix mha running without bwd and torch 2.9 incompatibility by @ZhangLirong-amd in #820
- Fix torch 2.4 not supporting infer_schema with str having a default value by @ZhangLirong-amd in #827
- [TRITON]: extend_attention.py and mla_decode_rope.py tuning for mi350 by @juuso-oskari in #696
- [fix]: replace 512 with smem_gpu_loop_stride by @TennyWang1223 in #828
- modify the swa condition in mha readme by @minmengdie in #806
- update gfx950 fmha_v3_bwd co file, which is generated from gfx942 asm code by @minmengdie in #816
- Pa v0 fp8 by @fsx950223 in #814
- Disable getHipblasltKernelName to fix tune error by @ZhangLirong-amd in #846
- Fix FA codegen import path by @slippedJim in #845
- [TRITON]: CI Test Set & Benchmark Speedups by @willzhou-amd in #825
- [TRITON] add test scripts for fp8 bmm prequant kernel by @k50112113 in #786
- [TRITON]: End-to-end fused feed-forward kernel by @willzhou-amd in #778
- fix gfx950 by @fsx950223 in #849
- add moe 1stage implementation to tune by @yzhou103 in #837
- [TRITON] Moe fp8 tuning mi350 by @Chi-Chu319 in #790
- [TRITON]: Reduce usage of hardcoded XCD by @rahulbatra85 in #852
- [TRITON]: Lean Attention spatial2 by @valechen in #853
- fix fp4 gemm precision issue by @junhaha666 in #861
- PA asm arg refine by @valarLip in #856
- Fix old torch versions not supporting ignore_method by @ZhangLirong-amd in #857
- Update ck_tile::make_kernel interface to fix the build by @linqun in #854
- Fix mha gen_fake impl error and rope acc issue in torch.compile by @ZhangLirong-amd in #860
- fix typo by @fsx950223 in #865
- Enable FA bwd gfx950 kernels padding on seqlen by @minmengdie in #815
- fix fmoe autotune and tuned_gemm by @yzhou103 in #863
- fix init_dist_env world_size==1 use car and comment out get_distribu… by @junhaha666 in #876
- fix splitk kernel precision issues by @zufayu in #877
- [TRITON]: Add Json file for batched gemm a8w8 by @rahulbatra85 in #871
- Enable group mode seqlen padding between batches by @slippedJim in #859
- Enable concurrency in Sglang CI to abort previous runs on new workflow triggers by @gyohuangxin in #869
- Update CK to Fix Build Error for Instances with ELEMENTWISE_BIAS by @DDEle in #874
- refine moe tuner by @lalala-sh in #858
- refine moe codegen by @lalala-sh in #866
- Add Quick Gemm Performance Tuning for Popular Workloads by @wuhuikx in #838
- kvcache_minor_opt by @valarLip in #881
- Use env var to index asm path by @slippedJim in #880
- moe_op fix typo by @lalala-sh in #883
- [TRITON]: remove print from mha bwd by @juuso-oskari in #868
- [TRITON]: Disable hanging testcases for gemm_a8w8_blockscale by @rahulbatra85 in #872
- opt_act_and_mul by @valarLip in #885
- add fp8 test for rms_fuse_quant by @Bernard-Liu in #715
- AITER_JIT_DIR by @amd-ruitang3 in #889
- uint8 compatibility by @zufayu in #888
- Enable CK exclude build in FA cpp api by @slippedJim in #894
- [GemmTuner] Immediately dump a result row once it is tuned by @xziya in #882
- [Triton] Constexpr function bug fix by @k50112113 in #890
- Enable nano-vllm model by @ZhangLirong-amd in #891
- update jit error log style by @valarLip in #905
- update batch & group mode fwd kernel padding on seqlen dimension by @minmengdie in #870
- [TRITON]: Add Gluon Blockscale A8W8 GEMM by @willzhou-amd in #886
- Rename variable scoring_func to is_softmax by @ZhangLirong-amd in #904
- Fix FA cpp api build error by @slippedJim in #903
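
A note on the "top-K-only softmax" optimization behind #673 and #804: because softmax is order-preserving, selecting the top k experts on the raw router logits picks the same set as selecting them after a full-row softmax, so the kernel can normalize only the k selected values and skip one full reduction pass over the expert dimension. Below is a minimal PyTorch sketch of the idea; the function name `topk_softmax` and the renormalize-over-k convention are illustrative assumptions, not the actual aiter kernel API.

```python
import torch

def topk_softmax(router_logits: torch.Tensor, k: int):
    """Sketch of top-K-only softmax (illustrative; not the aiter kernel API).

    softmax is monotonic, so top-k over raw logits selects the same experts
    as top-k over softmax(logits); normalizing just the k selected logits
    avoids a softmax pass over the full expert dimension.
    """
    topk_vals, topk_ids = torch.topk(router_logits, k, dim=-1)  # (num_tokens, k)
    topk_weights = torch.softmax(topk_vals, dim=-1)  # normalize over k values only
    return topk_weights, topk_ids
```

Whether the fused kernel renormalizes over the selected k (as above) or divides by the full-row denominator is an implementation detail of those PRs; the 32B vector loads mentioned in #804 are a memory-access optimization orthogonal to this math.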
New Contributors
- @ClementLinCF made their first contribution in #647
- @alexdutu made their first contribution in #376
- @xiaohuguo2023 made their first contribution in #659
- @scxiao made their first contribution in #629
- @vllmellm made their first contribution in #665
- @gyohuangxin made their first contribution in #735
- @ZhangLirong-amd made their first contribution in #740
- @842974287 made their first contribution in #772
- @alugorey made their first contribution in #791
- @minmengdie made their first contribution in #789
- @linqun made their first contribution in #854
- @DDEle made their first contribution in #874
- @wuhuikx made their first contribution in #838
- @Bernard-Liu made their first contribution in #715
- @xziya made their first contribution in #882
Full Changelog: v0.1.4...v0.1.5