[Feature Request] Multi-Head Latent Attention (DeepSeek) support on CPU/NPU #23925
Labels: feature request, platform:mobile
Describe the feature request
DeepSeek's models use Multi-Head Latent Attention (MLA), but the current ONNX release (https://huggingface.co/onnxruntime/DeepSeek-R1-Distill-ONNX) uses the GroupQueryAttention operator instead.

Is MLA on the roadmap for ONNX Runtime, in particular for the CPU/NPU execution providers?
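For reference, here is a minimal NumPy sketch of the MLA decode path. All weight names and dimensions are illustrative assumptions, not an ONNX Runtime API, and RoPE plus the decoupled key path are omitted for brevity. The point it shows: only a small latent vector per token needs to be cached, with keys/values reconstructed on the fly.

```python
# Minimal sketch of Multi-Head Latent Attention (MLA) decoding.
# Illustrative only: names/dims are assumptions, not an ONNX Runtime API.
import numpy as np

rng = np.random.default_rng(0)

d_model, n_heads, d_head, d_latent = 256, 4, 64, 32
W_dkv = rng.standard_normal((d_model, d_latent)) * 0.02          # KV down-projection (its output is cached)
W_uk  = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02 # key up-projection
W_uv  = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02 # value up-projection
W_q   = rng.standard_normal((d_model, n_heads * d_head)) * 0.02  # query projection

def mla_step(x_t, latent_cache):
    """One decode step: cache only the d_latent-sized compressed KV."""
    c_t = x_t @ W_dkv                              # (d_latent,) compressed KV latent
    latent_cache.append(c_t)
    C = np.stack(latent_cache)                     # (T, d_latent) -- the entire KV cache
    K = (C @ W_uk).reshape(-1, n_heads, d_head)    # reconstruct per-head keys
    V = (C @ W_uv).reshape(-1, n_heads, d_head)    # reconstruct per-head values
    q = (x_t @ W_q).reshape(n_heads, d_head)
    out = np.empty((n_heads, d_head))
    for h in range(n_heads):
        scores = K[:, h, :] @ q[h] / np.sqrt(d_head)
        w = np.exp(scores - scores.max()); w /= w.sum()  # softmax over T positions
        out[h] = w @ V[:, h, :]
    return out.reshape(-1)

cache = []
for _ in range(5):                                 # 5 decode steps
    y = mla_step(rng.standard_normal(d_model), cache)
print(len(cache), cache[0].shape)                  # cache: T vectors of size d_latent
```

In production the up-projections can be folded into the query/output projections so the reconstruction above is never materialized, but the cached state is the same latent.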
Describe scenario use case
MLA caches a single compressed latent vector per token instead of full per-head keys and values, substantially lowering the KV cache footprint and improving mobile and edge inference; a back-of-envelope comparison follows below.
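As an illustration of the footprint difference, here is a rough per-token, per-layer cache-size comparison using DeepSeek-V3-like dimensions (the numbers are assumptions for illustration, not measurements from the ONNX release):

```python
# Back-of-envelope KV-cache footprint per token per layer (fp16 = 2 bytes).
# Dimensions are DeepSeek-V3-like assumptions, for illustration only.
n_heads, d_head = 128, 128
d_latent, d_rope = 512, 64            # MLA compressed dim + decoupled RoPE key dim
n_kv_heads = 8                        # example GQA configuration

mha_bytes = 2 * n_heads * d_head * 2      # full K and V per head
gqa_bytes = 2 * n_kv_heads * d_head * 2   # K and V for shared KV heads only
mla_bytes = (d_latent + d_rope) * 2       # only the latent (+ RoPE key) is cached

print(mha_bytes, gqa_bytes, mla_bytes)    # 65536, 4096, 1152 bytes/token/layer
```

At these dimensions the MLA latent cache is roughly 3.5x smaller than even an 8-KV-head GQA cache, which is where the claimed mobile/edge benefit comes from.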