# [Bug/Workaround] FlashInfer & Triton JIT compilation failure on heterogeneous GPUs (e.g., RTX 3090 + 4090) in TP mode

### Reminder

- [x] I have read the above rules and searched the existing issues.

### System Info

### Environment
* **OS:** Ubuntu 24.04
* **GPU:** 1x RTX 4090 (sm_89) + 1x RTX 3090 (sm_86)
* **Python:** 3.12
* **ktransformers:** 0.6.2
* **FlashInfer:** 0.6.3
* **Model:** Qwen3.5-MoE (Hybrid model with Mamba/GDN + standard attention)

### Reproduction

### Description
When running ktransformers 0.6.2 with Tensor Parallelism on heterogeneous GPUs (3090 + 4090), the server crashes with `no kernel image is available for execution on the device` or fails during JIT compilation with missing headers (e.g., `fatal error: flashinfer/page.cuh: No such file or directory`).

This issue is caused by the interaction of **two separate JIT compilers** (FlashInfer C++ and Triton) incorrectly handling mixed compute capabilities.

### Root Cause Analysis

1. **FlashInfer C++ JIT Bug (Arch Mismatch in Cache):** 
   In FlashInfer 0.6.3, when compiling the sampling kernel in a multi-GPU environment, it detects the highest architecture (sm_89 from the 4090) and uses it for compilation, even if the cache directory is named `86/`. 
   As a result, `~/.cache/flashinfer/0.6.3/86/sampling.so` is compiled with sm_89 instructions. When the 3090 (sm_86) tries to execute this kernel, it crashes.
   *(Note: In kt 0.5.3 with older FlashInfer, the cache was `86_89/` and correctly compiled sm_86, so this didn't happen).*

2. **Triton JIT Bug (Primary GPU Detection):**
   SGLang/kt also uses Triton for MoE and Mamba/GDN linear attention kernels. Triton determines the compilation architecture based solely on the primary GPU (GPU 0). 
   By default, GPU 0 is the 4090 (sm_89), so Triton compiles sm_89 kernels. When the 3090 (GPU 1) tries to execute these Triton kernels, it crashes with the same "no kernel image" error.
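
A quick way to confirm that the two cards report different compute capabilities is to query the driver directly. The sketch below assumes a driver recent enough for `nvidia-smi` to support the `compute_cap` query field. The `lowest_arch` helper is a hypothetical name introduced here for illustration and is not part of any of the tools involved.

```shell
#!/bin/sh
# Hypothetical helper: read one compute capability per line (e.g. "8.9")
# on stdin and print the lowest one, comparing major then minor numerically.
lowest_arch() {
  sort -t. -k1,1n -k2,2n | head -n 1
}

# Only query the driver when nvidia-smi is present.
if command -v nvidia-smi >/dev/null 2>&1; then
  # One line per GPU: index, name, compute capability.
  nvidia-smi --query-gpu=index,name,compute_cap --format=csv,noheader || true
  # Lowest capability across all GPUs: the target that every card can run.
  nvidia-smi --query-gpu=compute_cap --format=csv,noheader | lowest_arch
fi
```

On a mixed 3090 + 4090 machine, the per-GPU lines should show two different capabilities, and the final line gives the value that both JIT compilers would need to target.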

### Reproduction Steps
1. Set up a 3090 + 4090 system with kt 0.6.2 and FlashInfer 0.6.3.
2. Clear the FlashInfer cache:

       rm -rf ~/.cache/flashinfer/0.6.3/

3. Launch the server with the default GPU order:

       python -m sglang.launch_server \
         --tensor-parallel-size 2 \
         --attention-backend flashinfer \
         ... other args

4. The server crashes on startup or during the first inference request.
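
To check the FlashInfer half of the analysis on a machine that has just crashed, the cached libraries can be inspected for the architectures they actually contain. This is a sketch that assumes `cuobjdump` from the CUDA toolkit is on `PATH` and that its `--list-elf` output embeds the target in each entry name as `sm_XX`. `archs_in_listing` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: extract the distinct sm_XX targets mentioned in
# text read from stdin (intended for `cuobjdump --list-elf` output).
archs_in_listing() {
  grep -o 'sm_[0-9][0-9]*' | sort -u
}

cache_dir="$HOME/.cache/flashinfer"

# Only inspect when the tool and the cache directory both exist.
if command -v cuobjdump >/dev/null 2>&1 && [ -d "$cache_dir" ]; then
  find "$cache_dir" -name '*.so' | while IFS= read -r lib; do
    echo "$lib"
    # List the embedded cubins and reduce them to their sm_XX targets.
    cuobjdump --list-elf "$lib" 2>/dev/null | archs_in_listing | sed 's/^/  /'
  done
fi
```

If the root cause analysis holds, a library under the `86/` cache directory would be expected to list `sm_89` rather than `sm_86`.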

### Workaround / Solution
To solve this, explicitly constrain both JIT compilers to generate only sm_86 code. The 4090 can run these kernels because CUDA binaries built for sm_86 are compatible with later devices of the same major architecture, such as sm_89.

Clear the corrupted cache first:

    rm -rf ~/.cache/flashinfer/0.6.3/

Launch with both environment variables:

    CUDA_VISIBLE_DEVICES=1,0 \
    TORCH_CUDA_ARCH_LIST="8.6" \
    python -m sglang.launch_server \
      --host 0.0.0.0 --port 8080 \
      --tensor-parallel-size 2 \
      --attention-backend flashinfer \
      ... other args

* `CUDA_VISIBLE_DEVICES=1,0`: Makes the 3090 the primary GPU (GPU 0). This forces Triton to compile sm_86 kernels for MoE and Mamba/GDN layers.
* `TORCH_CUDA_ARCH_LIST="8.6"`: Forces FlashInfer C++ JIT to strictly compile sm_86 kernels, preventing it from generating sm_89 instructions for the 3090.

With this configuration, the server starts successfully, and both GPUs operate normally using sm_86 kernels.
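
The two settings can also be derived from the hardware instead of being hard-coded. The following sketch assumes `nvidia-smi` supports the `compute_cap` query field, and it sets `CUDA_DEVICE_ORDER=PCI_BUS_ID` on the assumption that CUDA should number the devices the same way `nvidia-smi` does. `order_by_arch` is a hypothetical helper, and the script only prints the settings so they can be reviewed before use.

```shell
#!/bin/sh
# Hypothetical helper: read "index, compute_cap" lines on stdin and print
# the GPU indices as a comma-separated list, lowest capability first.
# Capabilities are compared as decimal numbers (single-digit minor assumed).
order_by_arch() {
  tr -d ' ' | LC_ALL=C sort -t, -k2,2n | cut -d, -f1 | paste -sd, -
}

if command -v nvidia-smi >/dev/null 2>&1; then
  # GPU indices ordered so the oldest architecture becomes device 0.
  devices=$(nvidia-smi --query-gpu=index,compute_cap --format=csv,noheader | order_by_arch)
  # Lowest compute capability present in the system.
  arch=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | LC_ALL=C sort -n | head -n 1)
  # Print the settings instead of launching the server.
  echo "CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=$devices TORCH_CUDA_ARCH_LIST=\"$arch\""
fi
```

The printed line can be placed in front of the `python -m sglang.launch_server` command in place of the hand-written variables.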

**Note on performance:** In this specific heterogeneous setup, the Triton attention backend generated slightly faster than FlashInfer.
