# [Bug/Workaround] FlashInfer & Triton JIT compilation failure on heterogeneous GPUs (e.g., RTX 3090 + 4090) in TP mode #2039

@y8nqd4jtfn-cell

Reminder

  • I have read the above rules and searched the existing issues.

System Info

Environment

  • OS: Ubuntu 24.04
  • GPU: 1x RTX 4090 (sm_89) + 1x RTX 3090 (sm_86)
  • Python: 3.12
  • ktransformers: 0.6.2
  • FlashInfer: 0.6.3
  • Model: Qwen3.5-MoE (Hybrid model with Mamba/GDN + standard attention)

Reproduction

Description

When running ktransformers 0.6.2 with tensor parallelism on heterogeneous GPUs (RTX 3090 + RTX 4090), the server either crashes at runtime with "no kernel image is available for execution on the device" or fails during JIT compilation with missing headers (e.g., "fatal error: flashinfer/page.cuh: No such file or directory").

This issue is caused by the interaction of two separate JIT compilers (FlashInfer C++ and Triton) incorrectly handling mixed compute capabilities.

Root Cause Analysis

  1. FlashInfer C++ JIT Bug (Arch Mismatch in Cache):
    In FlashInfer 0.6.3, when compiling the sampling kernel in a multi-GPU environment, the JIT detects the highest architecture present (sm_89 from the 4090) and compiles for it, even though the cache directory is named 86/.
    As a result, ~/.cache/flashinfer/0.6.3/86/sampling.so contains sm_89 code. The 3090 (sm_86) cannot execute that code, so it crashes when it tries to run this kernel.
    (Note: with kt 0.5.3 and an older FlashInfer, the cache directory was 86_89/ and the kernel was correctly compiled for sm_86, so the crash did not occur.)

  2. Triton JIT Bug (Primary GPU Detection):
    SGLang/kt also uses Triton for MoE and Mamba/GDN linear attention kernels. Triton determines the compilation architecture based solely on the primary GPU (GPU 0).
    By default, GPU 0 is the 4090 (sm_89), so Triton compiles sm_89 kernels. When the 3090 (GPU 1) tries to execute these Triton kernels, it crashes with the same "no kernel image" error.
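Both failure modes depend on which compute capabilities are present and which GPU comes first in the device order. A minimal sketch for checking this is shown here. The helper names `summarize_capabilities` and `query_capabilities` are hypothetical (they are not part of kt, SGLang, or FlashInfer); the only library call used is PyTorch's `torch.cuda.get_device_capability`.

```python
from typing import List, Optional, Tuple, TypedDict


class CapabilitySummary(TypedDict):
    archs: List[str]
    mixed: bool
    primary: Optional[str]
    lowest: Optional[str]


def summarize_capabilities(caps: List[Tuple[int, int]]) -> CapabilitySummary:
    """Summarize compute capabilities given in device-index order.

    `caps` holds one (major, minor) pair per visible GPU, e.g. (8, 9) for sm_89.
    """
    archs = [f"sm_{major}{minor}" for major, minor in caps]
    lowest = min(caps) if caps else None
    return {
        "archs": archs,
        # True when at least two GPUs report different capabilities.
        "mixed": len(set(caps)) > 1,
        # The architecture of GPU 0, which the issue says Triton compiles for.
        "primary": archs[0] if archs else None,
        # The lowest architecture present, which every GPU must be able to run.
        "lowest": f"sm_{lowest[0]}{lowest[1]}" if lowest else None,
    }


def query_capabilities() -> List[Tuple[int, int]]:
    """Read the capability of every visible GPU (requires PyTorch with CUDA)."""
    import torch  # imported lazily so the summary helper works without PyTorch

    return [
        torch.cuda.get_device_capability(i)
        for i in range(torch.cuda.device_count())
    ]
```

On the system described in this issue, `summarize_capabilities(query_capabilities())` would be expected to report a mixed setup whose primary architecture (sm_89) is higher than the lowest one (sm_86), which is the condition under which both bugs appear.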

Reproduction Steps

  1. Set up a 3090 + 4090 system with kt 0.6.2 and FlashInfer 0.6.3.
  2. Clear the FlashInfer cache:
    rm -rf ~/.cache/flashinfer/0.6.3/
  3. Launch the server with the default GPU order:
    python -m sglang.launch_server \
      --tensor-parallel-size 2 \
      --attention-backend flashinfer \
      … other args
  4. The server crashes on startup or during the first inference request.

Workaround / Solution

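To work around this, both JIT compilers must be explicitly constrained to generate only sm_86 code. The 4090 (sm_89) runs these kernels without problems, because GPU binaries built for one minor revision also run on later minor revisions of the same major compute capability.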

Clear the corrupted cache first:

rm -rf ~/.cache/flashinfer/0.6.3/
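Before deleting the cache, the architecture mismatch can be confirmed by listing the GPU binaries embedded in the cached library. The following is a rough sketch, assuming the CUDA toolkit's `cuobjdump` is on the PATH; the function names are hypothetical, and the exact text that `cuobjdump --list-elf` prints may vary between toolkit versions, so the parser only looks for `sm_NN` tokens.

```python
import re
import subprocess
from typing import Set


def parse_embedded_archs(cuobjdump_output: str) -> Set[str]:
    """Collect every sm_NN token that appears in cuobjdump's listing."""
    return set(re.findall(r"sm_\d+[a-z]?", cuobjdump_output))


def embedded_archs(so_path: str) -> Set[str]:
    """List the architectures of the ELF images embedded in a shared library.

    Assumption: `cuobjdump --list-elf` names each embedded cubin with its
    target architecture (for example "...sm_89.cubin").
    """
    result = subprocess.run(
        ["cuobjdump", "--list-elf", so_path],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_embedded_archs(result.stdout)
```

If `embedded_archs` on `~/.cache/flashinfer/0.6.3/86/sampling.so` (with `~` expanded, for example via `os.path.expanduser`) reports only sm_89, that matches the mismatch described in the root cause analysis. Running the same check after relaunching with the workaround should then show sm_86.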

Launch with both environment variables:

CUDA_VISIBLE_DEVICES=1,0 \
TORCH_CUDA_ARCH_LIST="8.6" \
python -m sglang.launch_server \
  --host 0.0.0.0 --port 8080 \
  --tensor-parallel-size 2 \
  --attention-backend flashinfer \
  ... other args

  • CUDA_VISIBLE_DEVICES=1,0: Makes the 3090 the primary GPU (GPU 0), which forces Triton to compile sm_86 kernels for the MoE and Mamba/GDN layers.
  • TORCH_CUDA_ARCH_LIST="8.6": Forces the FlashInfer C++ JIT to compile strictly for sm_86, preventing it from generating sm_89 instructions that the 3090 cannot run.

With this configuration, the server starts successfully, and both GPUs operate normally using sm_86 kernels.
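For other GPU combinations, the two values can be derived from the per-GPU compute capabilities rather than hard-coded. The sketch below is an illustration only: `workaround_env` is a hypothetical helper, and it assumes that all GPUs share the same major compute capability (as sm_86 and sm_89 do), since that is the only case this issue covers.

```python
from typing import Dict, List, Tuple


def workaround_env(caps: List[Tuple[int, int]]) -> Dict[str, str]:
    """Derive the two environment variables used in the workaround.

    `caps` lists the (major, minor) capability of each GPU in its default
    device order, e.g. [(8, 9), (8, 6)] for a 4090 at index 0 and a 3090 at 1.
    """
    if not caps:
        raise ValueError("no GPUs given")
    if len({major for major, _ in caps}) != 1:
        # Kernels built for one major capability do not run on another one,
        # so the lowest-architecture trick does not apply.
        raise ValueError("GPUs must share the same major compute capability")

    # Put the lowest-capability GPU first so it becomes GPU 0.
    order = sorted(range(len(caps)), key=lambda i: (caps[i], i))
    major, minor = min(caps)
    return {
        "CUDA_VISIBLE_DEVICES": ",".join(str(i) for i in order),
        "TORCH_CUDA_ARCH_LIST": f"{major}.{minor}",
    }
```

For the 4090 + 3090 system in this issue, `workaround_env([(8, 9), (8, 6)])` yields the same values as the launch command above. The result can be merged into `os.environ` before starting the server process.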

Note on Performance: In this specific heterogeneous setup, the Triton attention backend yields slightly faster generation speeds than FlashInfer.
