Intel® Extension for PyTorch* v2.7.0+cpu Release Notes
2.7.0
We are excited to announce the release of Intel® Extension for PyTorch* 2.7.0+cpu, which accompanies PyTorch 2.7. This release mainly brings new LLM model optimizations, including DeepSeek-R1-671B and Phi-4, along with new APIs for LLM serving frameworks: sliding window and softcap support in the PagedAttention APIs, a MambaMixer API for the Jamba and Mamba models, and an API for multi-LoRA inference kernels. It also includes a set of bug fixes and small optimizations. We sincerely thank our dedicated community for your contributions. As always, we encourage you to try this release and provide feedback to help us improve the product further.
Highlights
- DeepSeek-R1 support
Intel® Extension for PyTorch* provides optimizations for the popular DeepSeek-R1-671B model. Several optimizations, including Multi-Head Latent Attention (MLA), fused MoE, fused shared expert, and MoEGate, bring a well-performing experience with INT8 precision on Intel® Xeon®.
- Phi-4 support
Microsoft recently released Phi-4, including Phi-4-mini (a 3.8B dense decoder-only transformer model) and Phi-4-multimodal (a 5.6B multimodal model). Intel® Extension for PyTorch* has supported Phi-4 since its launch via an early release, and the related optimizations are included in this official release.
- General Large Language Model (LLM) optimizations
Intel® Extension for PyTorch* provides sliding window and softcap support in the PagedAttention APIs, a MambaMixer API for the Jamba and Mamba models, and an API for multi-LoRA inference kernels for LLM serving frameworks. To improve the user experience, Intel® Extension for PyTorch* supports running INT4 workloads with only the INT4 weights, removing the need to download the original high-precision weights. A full list of optimized models can be found at LLM optimization.
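For context on the fused-MoE and MoEGate optimizations above: an MoE gate routes each token to a small subset of experts by scoring all experts and keeping the top-k, renormalizing their weights. The sketch below is a pure-Python illustration of that routing idea only; the function name and shapes are hypothetical and do not reflect the actual fused kernels in Intel® Extension for PyTorch*.

```python
import math

def moe_gate(logits, top_k):
    # Top-k expert routing: pick the top_k highest-scoring experts for a
    # token and renormalize their softmax weights so they sum to 1.
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    exps = {i: math.exp(logits[i]) for i in chosen}
    total = sum(exps.values())
    return {i: exps[i] / total for i in chosen}

# A token scored against 4 experts; route it to the best 2.
weights = moe_gate([0.1, 2.0, -1.0, 1.5], top_k=2)
# experts 1 and 3 are selected; their weights sum to 1
```

A fused kernel performs this gating, the expert GEMMs, and the weighted combine in one pass to avoid dispatch overhead, which is the point of the fused-MoE work called out above.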
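To unpack the two new PagedAttention options: softcap bounds attention logits by squashing them through a scaled tanh, and a sliding window restricts each query to the most recent W key positions. This is a minimal pure-Python sketch of both ideas; the function names are hypothetical and are not the actual PagedAttention API surface.

```python
import math

def softcap(logit, cap):
    # Soft-capping: squash a raw attention logit into (-cap, cap) with tanh.
    # Near zero, tanh(x) ~= x, so small logits pass through almost unchanged.
    return cap * math.tanh(logit / cap)

def sliding_window_visible(query_pos, key_pos, window):
    # A key is attended to only if it is causal (not in the future) and
    # within the last `window` positions relative to the query.
    return 0 <= query_pos - key_pos < window

# Small logits are nearly unchanged; huge logits are bounded by the cap.
small = softcap(5.0, 30.0)     # close to 5.0
large = softcap(1000.0, 30.0)  # bounded by 30.0

# With a window of 4, a query at position 10 sees keys 7..10 only.
visible = [k for k in range(12) if sliding_window_visible(10, k, 4)]
```

In a paged KV cache these checks are applied per block during the attention kernel, but the numerical behavior is the same as this sketch.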
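On the INT4-weights-only point above: weight-only INT4 quantization stores each weight as a 4-bit integer plus a shared per-group floating-point scale, which is why the original high-precision weights are no longer needed at load time. The following is an illustrative pure-Python sketch of symmetric per-group INT4 quantization, with hypothetical function names; it is not the packed on-disk format used by Intel® Extension for PyTorch*.

```python
def quantize_int4_group(weights, group_size=4):
    # Symmetric per-group INT4 quantization: each group of weights shares
    # one fp scale, and every weight becomes an integer in [-8, 7].
    qweights, scales = [], []
    for g in range(0, len(weights), group_size):
        group = weights[g:g + group_size]
        scale = max(abs(w) for w in group) / 7 or 1.0  # avoid zero scale
        scales.append(scale)
        qweights.extend(max(-8, min(7, round(w / scale))) for w in group)
    return qweights, scales

def dequantize_int4_group(qweights, scales, group_size=4):
    # Recover approximate fp weights from integers and per-group scales.
    return [q * scales[i // group_size] for i, q in enumerate(qweights)]

w = [0.5, -1.0, 0.25, 0.75]
q, s = quantize_int4_group(w)
w_hat = dequantize_int4_group(q, s)
# per-weight reconstruction error is bounded by roughly scale / 2
```

In practice two 4-bit values are packed per byte and the dequantize step is fused into the GEMM kernel, so inference runs directly from the INT4 weights.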
Bug fixing and other optimizations
- Optimized LLM performance #3537 #3611 #3549
- Handled new linear modules in DeepSpeed v0.16.5 #3622 #3638
- Fixed the PagedAttention kernel to avoid graph breaks when using torch.compile #3641
- Added user guides for running DeepSeek-R1 #3660 and multimodal models #3649
- Upgraded oneDNN to v3.7.2 #3582
Full Changelog: v2.6.0+cpu...v2.7.0+cpu