[Feature Request] Multi-Head Latent Attention (DeepSeek) support on CPU/NPU #23925
Labels: feature request, platform:mobile
Describe the feature request
DeepSeek's models use Multi-Head Latent Attention (MLA), but the current ONNX release (https://huggingface.co/onnxruntime/DeepSeek-R1-Distill-ONNX) uses the GroupQueryAttention operator instead.

Is MLA on the roadmap for ONNX Runtime, in particular for the CPU/NPU execution providers?
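For reference, here is a minimal NumPy sketch of the MLA decode path. All weight names and dimensions are illustrative assumptions, not an ONNX Runtime API, and RoPE plus the decoupled key path are omitted for brevity. The point it shows: only a small latent vector per token needs to be cached, with keys/values reconstructed on the fly.

```python
# Minimal sketch of Multi-Head Latent Attention (MLA) decoding.
# Illustrative only: names/dims are assumptions, not an ONNX Runtime API.
import numpy as np

rng = np.random.default_rng(0)

d_model, n_heads, d_head, d_latent = 256, 4, 64, 32
W_dkv = rng.standard_normal((d_model, d_latent)) * 0.02          # KV down-projection (its output is cached)
W_uk  = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02 # key up-projection
W_uv  = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02 # value up-projection
W_q   = rng.standard_normal((d_model, n_heads * d_head)) * 0.02  # query projection

def mla_step(x_t, latent_cache):
    """One decode step: cache only the d_latent-sized compressed KV."""
    c_t = x_t @ W_dkv                              # (d_latent,) compressed KV latent
    latent_cache.append(c_t)
    C = np.stack(latent_cache)                     # (T, d_latent) -- the entire KV cache
    K = (C @ W_uk).reshape(-1, n_heads, d_head)    # reconstruct per-head keys
    V = (C @ W_uv).reshape(-1, n_heads, d_head)    # reconstruct per-head values
    q = (x_t @ W_q).reshape(n_heads, d_head)
    out = np.empty((n_heads, d_head))
    for h in range(n_heads):
        scores = K[:, h, :] @ q[h] / np.sqrt(d_head)
        w = np.exp(scores - scores.max()); w /= w.sum()  # softmax over T positions
        out[h] = w @ V[:, h, :]
    return out.reshape(-1)

cache = []
for _ in range(5):                                 # 5 decode steps
    y = mla_step(rng.standard_normal(d_model), cache)
print(len(cache), cache[0].shape)                  # cache: T vectors of size d_latent
```

In production the up-projections can be folded into the query/output projections so the reconstruction above is never materialized, but the cached state is the same latent.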
Describe scenario use case
MLA caches a single compressed latent vector per token instead of full per-head keys and values, substantially lowering the KV cache footprint and improving mobile and edge inference; a back-of-envelope comparison follows below.
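As an illustration of the footprint difference, here is a rough per-token, per-layer cache-size comparison using DeepSeek-V3-like dimensions (the numbers are assumptions for illustration, not measurements from the ONNX release):

```python
# Back-of-envelope KV-cache footprint per token per layer (fp16 = 2 bytes).
# Dimensions are DeepSeek-V3-like assumptions, for illustration only.
n_heads, d_head = 128, 128
d_latent, d_rope = 512, 64            # MLA compressed dim + decoupled RoPE key dim
n_kv_heads = 8                        # example GQA configuration

mha_bytes = 2 * n_heads * d_head * 2      # full K and V per head
gqa_bytes = 2 * n_kv_heads * d_head * 2   # K and V for shared KV heads only
mla_bytes = (d_latent + d_rope) * 2       # only the latent (+ RoPE key) is cached

print(mha_bytes, gqa_bytes, mla_bytes)    # 65536, 4096, 1152 bytes/token/layer
```

At these dimensions the MLA latent cache is roughly 3.5x smaller than even an 8-KV-head GQA cache, which is where the claimed mobile/edge benefit comes from.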