Re-enable FlashInfer for Llama4 on Blackwell in e2e fusion tests #28966
Purpose
Fix #28604 (the underlying issue was resolved by #28739): re-enable FlashInfer as the attention backend for Llama4 on Blackwell platforms in the e2e fusion tests.
Test Plan
Existing e2e fusion tests in `tests/compile/test_fusions_e2e.py` will validate the change:

- `test_attn_quant` - Tests attention+quant fusion with FlashInfer on Blackwell
- `test_tp2_attn_quant_allreduce_rmsnorm` - Tests multi-GPU fusion patterns
- `test_tp2_attn_quant_async_tp` - Tests async TP with FlashInfer
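These tests already exist; for reference, a minimal sketch of running just the relevant cases through pytest's Python API (the `-k` selection expression is inferred from the test names above, not taken from the PR, and the `tp2_*` cases need a 2-GPU Blackwell node):

```python
# Sketch only: invoke the relevant e2e fusion tests from Python.
# The -k expression is an assumption based on the test names above;
# the tp2_* cases additionally require 2 GPUs.
import pytest

pytest.main([
    "tests/compile/test_fusions_e2e.py",
    "-k", "test_attn_quant or tp2_attn_quant",
    "-v",
])
```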
Test Result

Changes are minimal and follow the existing pattern used for Llama3. The Llama4 model configuration now uses a capability-gated attention backend:
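A minimal sketch of that selection, assuming the `_Backend` enum and the `current_platform.is_device_capability()` helper used by the existing Llama3 entry (import paths vary between vLLM versions, and the actual diff may differ):

```python
# Sketch, not the literal diff: choose the attention backend for the Llama4
# e2e fusion test entry based on GPU compute capability.
# Import paths are assumptions and may differ across vLLM versions.
from vllm.attention.backends.registry import _Backend
from vllm.platforms import current_platform

# SM 10.0 (Blackwell): FlashInfer attn+quant fusion works after #28739.
# Other GPUs (e.g. Hopper) keep TRITON_ATTN while #28568 remains open.
llama4_attn_backend = (
    _Backend.FLASHINFER
    if current_platform.is_device_capability(100)
    else _Backend.TRITON_ATTN
)
```

This mirrors how the Llama3 entry gates its backend, so both models follow the same pattern in the test file.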
This enables FlashInfer on Blackwell (where #28739 fixed the issue) while preserving TRITON_ATTN on Hopper (where #28568 remains unresolved).
Changes: