@@ -28,7 +28,8 @@ What's new
* More GenAI coverage and framework integrations to minimize code changes.
- * New models supported: Qwen 2.5.
+ * New models supported: Qwen 2.5, DeepSeek-R1-Distill-Llama-8B, DeepSeek-R1-Distill-Qwen-7B,
+ DeepSeek-R1-Distill-Qwen-1.5B, FLUX.1 Schnell, and FLUX.1 Dev.
* Whisper Model: Improved performance on CPUs, built-in GPUs, and discrete GPUs with GenAI API.
* Preview: Introducing NPU support for torch.compile, giving developers the ability to use the
OpenVINO backend to run the PyTorch API on NPUs. 300+ deep learning models enabled from the
@@ -38,30 +39,34 @@ What's new
* Preview: Addition of Prompt Lookup to GenAI API improves 2nd token latency for LLMs by
effectively utilizing predefined prompts that match the intended use case (sketched after this list).
+ * Preview: The GenAI API now offers image-to-image inpainting functionality. This feature
+ enables models to generate realistic content by inpainting specified modifications and
+ seamlessly integrating them with the original image (also sketched after this list).
* Asymmetric KV Cache compression is now enabled for INT8 on CPUs, resulting in lower
memory consumption and improved 2nd token latency, especially when dealing with long prompts
that require significant memory. The option must be explicitly enabled by the user (see the sketches after this list).
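
A minimal sketch of enabling the Prompt Lookup feature above, assuming the `prompt_lookup`
pipeline flag and the `num_assistant_tokens`/`max_ngram_size` generation-config fields shown
in the GenAI samples; the model folder name is a placeholder:

.. code-block:: python

    import openvino_genai

    # Prompt lookup is switched on when the pipeline is constructed.
    pipe = openvino_genai.LLMPipeline("TinyLlama-1.1B-Chat-ov", "CPU", prompt_lookup=True)

    config = openvino_genai.GenerationConfig()
    config.max_new_tokens = 100
    config.num_assistant_tokens = 5  # candidate tokens proposed per step
    config.max_ngram_size = 3        # n-gram window matched against the prompt

    print(pipe.generate("Summarize the previous answer.", config))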
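A sketch of the image-to-image inpainting flow, assuming the `InpaintingPipeline` class
exposed by the GenAI API; the model folder, image files, and helper are placeholders:

.. code-block:: python

    import numpy as np
    import openvino as ov
    import openvino_genai
    from PIL import Image

    def to_tensor(path):
        # Placeholder helper: RGB image as an NHWC uint8 tensor.
        return ov.Tensor(np.array(Image.open(path).convert("RGB"))[None])

    pipe = openvino_genai.InpaintingPipeline("stable-diffusion-2-inpainting-ov", "CPU")
    # White mask pixels mark the region to be regenerated and blended back in.
    image = pipe.generate("a red brick fireplace",
                          to_tensor("room.png"), to_tensor("mask.png"))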
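For the asymmetric KV cache option, a sketch of explicitly requesting INT8 KV cache
precision; treating `KV_CACHE_PRECISION` as the relevant switch is an assumption here, so
check the documentation for the authoritative property name:

.. code-block:: python

    import openvino as ov

    core = ov.Core()
    # Assumption: u8 KV cache precision selects the 8-bit compressed cache.
    compiled = core.compile_model("model.xml", "CPU", {"KV_CACHE_PRECISION": "u8"})
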
* More portability and performance to run AI at the edge, in the cloud, or locally.
- * Support for the latest Intel® Core™ Ultra 200H series processors (formerly codenamed Arrow
- Lake-H)
- * Preview: The GenAI API now offers image-to-image inpainting functionality. This feature
- enables models to generate realistic content by inpainting specified modifications and
- seamlessly integrating them with the original image.
- * Integration of the OpenVINO backend with the Triton Inference Server allows developers to
+ * Support for the latest Intel® Core™ Ultra 200H series processors (formerly codenamed
+ Arrow Lake-H)
+ * Integration of the OpenVINO™ backend with the Triton Inference Server allows developers to
utilize the Triton server for enhanced model serving performance when deploying on Intel
CPUs (see the configuration sketch after this list).
- * Preview: A new OpenVINO backend integration allows developers to leverage OpenVINO
- performance optimizations directly within Keras 3 workflows for faster AI inference on
- Intel® CPUs, built-in GPUs, discrete GPUs, and NPUs. This feature is available with the
- latest Keras 3.8 release.
+ * Preview: A new OpenVINO™ backend integration allows developers to leverage OpenVINO
+ performance optimizations directly within Keras 3 workflows for faster AI inference on CPUs,
+ built-in GPUs, discrete GPUs, and NPUs. This feature is available with the latest Keras 3.8
+ release (sketched after this list).
+ * The OpenVINO Model Server now supports native Windows Server deployments, allowing
+ developers to achieve better performance by eliminating container overhead and simplifying
+ GPU deployment (see the launch sketch after this list).
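
A sketch of the Triton side of the integration above, assuming the standard `config.pbtxt`
layout of Triton's OpenVINO backend; model name, tensor names, and shapes are placeholders:

.. code-block::

    name: "resnet50_ov"
    backend: "openvino"
    max_batch_size: 8
    input [
      { name: "input", data_type: TYPE_FP32, dims: [ 3, 224, 224 ] }
    ]
    output [
      { name: "output", data_type: TYPE_FP32, dims: [ 1000 ] }
    ]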
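The Keras 3 integration follows the usual Keras backend-selection mechanism: set
`KERAS_BACKEND` before the first `import keras`. A minimal inference sketch:

.. code-block:: python

    import os
    os.environ["KERAS_BACKEND"] = "openvino"  # must be set before keras is imported

    import keras
    import numpy as np

    model = keras.Sequential([keras.Input(shape=(32,)),
                              keras.layers.Dense(10, activation="softmax")])
    # The OpenVINO backend in Keras 3.8 targets inference.
    preds = model.predict(np.random.rand(4, 32).astype("float32"))
    print(preds.shape)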
+
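A sketch of a native Windows launch of the Model Server mentioned above, assuming the
standard `ovms` CLI flags; the model name and path are placeholders:

.. code-block:: bat

    ovms.exe --model_name resnet --model_path c:\models\resnet ^
             --port 9000 --rest_port 8000 --target_device GPU
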
Now Deprecated
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
- * Legacy prefixes ( l _,w_,m_) have been removed from OpenVINO archive names.
+ * Legacy prefixes `l_`, `w_`, and `m_` have been removed from OpenVINO archive names.
* The `runtime` namespace for the Python API has been marked as deprecated and designated to be
removed for 2026.0. The new namespace structure has been delivered, and migration is possible
immediately. Details will be communicated through warnings and via documentation.
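
Migration away from the deprecated namespace is mechanical; a sketch of the before and after:

.. code-block:: python

    # Deprecated: scheduled for removal in 2026.0.
    from openvino.runtime import Core

    # Preferred: import from the top-level package instead.
    import openvino as ov

    core = ov.Core()
    model = core.read_model("model.xml")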
@@ -91,9 +96,9 @@ CPU Device Plugin
-----------------------------------------------------------------------------------------------
* Intel® Core™ Ultra 200H processors (formerly code named Arrow Lake-H) are now fully supported.
- * Asymmetric 8bit key-value cache compression is now enabled on CPU by default, reducing memory
+ * Asymmetric 8-bit KV cache compression is now enabled on CPU by default, reducing memory
usage and memory bandwidth consumption for large language models and improving performance
- for 2nd token generation. Asymmetric 4bit key-value cache compression on CPU is now supported
+ for 2nd token generation. Asymmetric 4-bit KV cache compression on CPU is now supported
as an option to further reduce memory consumption.
* Performance of models running in FP16 on 6th-generation Intel® Xeon® processors with P-cores
has been enhanced by improving utilization of the underlying AMX FP16 capabilities.
@@ -112,18 +117,19 @@ GPU Device Plugin
OpenVINO GenAI APIs with continuous batching and SDPA-based LLMs with long prompts (>4k).
* Stateful models are now enabled, significantly improving performance of Whisper models on all
GPU platforms.
- * Stable Diffusion 3 and Flux .1 performance has been improved.
+ * Stable Diffusion 3 and FLUX.1 performance has been improved.
* The issue of a black image output for image generation models, including SDXL, SD3, and
- Flux .1, with FP16 precision has been solved.
+ FLUX.1, with FP16 precision has been resolved.
NPU Device Plugin
-----------------------------------------------------------------------------------------------
- * Performance has been improved for Channel-Wise symmetrically quantized LLMs, including
- Llama2-7B-chat, Llama3-8B-instruct, qwen-2-7B, Mistral-0.2-7B-instruct, phi-3-mini-4K-instruct,
- miniCPM-1B models. The best performance is achieved using fp16-in4 quantized models.
- * Preview: Introducing NPU support for torch.compile, giving developers the ability to use the
+ * Performance has been improved for channel-wise (CW) symmetrically quantized LLMs,
+ including Llama2-7B-chat, Llama3-8B-instruct, Qwen-2-7B, Mistral-0.2-7B-Instruct,
+ Phi-3-Mini-4K-Instruct, and MiniCPM-1B models. The best performance is achieved using
+ symmetrically quantized 4-bit (INT4) models.
+ * Preview: Introducing NPU support for torch.compile, giving developers the ability to use the
OpenVINO backend to run the PyTorch API on NPUs. 300+ deep learning models enabled from
the TorchVision, Timm, and TorchBench repositories.
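
A sketch of the torch.compile flow on NPU, assuming the `options={"device": "NPU"}` form
accepted by the OpenVINO backend:

.. code-block:: python

    import torch
    import torchvision.models as models

    model = models.resnet50(weights="DEFAULT").eval()

    # Route the compiled graph through the OpenVINO NPU plugin.
    compiled = torch.compile(model, backend="openvino", options={"device": "NPU"})

    with torch.no_grad():
        out = compiled(torch.randn(1, 3, 224, 224))
    print(out.shape)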
@@ -187,9 +193,6 @@ ONNX Framework Support
-----------------------------------------------------------------------------------------------
* Runtime memory consumption for models with quantized weights has been reduced.
- * Models from the com.microsoft domain that use the following operations are now enabled:
- SkipSimplifiedLayerNormalization, SimplifiedLayerNormalization, FusedMatMul, QLinearSigmoid,
- QLinearLeakyRelu, QLinearAdd, QLinearMul, Range, DynamicQuantizeMatMul, MatMulIntegerToFloat.
* A workflow issue that affected reading of 2-byte data types has been fixed.
@@ -205,7 +208,7 @@ OpenVINO Model Server
* Generative endpoints are fully supported, including text generation and embeddings based on
the OpenAI API, and reranking based on the Cohere API (see the example below).
* Functional parity with the Linux version is available with minor differences.
- * The feature is targeted at client machines with Windows 11 and Data Center environment
+ * The feature is targeted at client machines with Windows 11 and data center environments
running Windows Server 2022.
* Demos have been updated to work on both Linux and Windows. Check the
`installation guide <https://docs.openvino.ai/2025/openvino-workflow/model-server/ovms_docs_deploying_server_baremetal.html>`__
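
The OpenAI-compatible text-generation endpoint can be exercised with plain HTTP; a sketch
assuming the `/v3/chat/completions` route, the default REST port, and a served model
named `llm`:

.. code-block:: python

    import requests

    resp = requests.post(
        "http://localhost:8000/v3/chat/completions",
        json={"model": "llm",
              "messages": [{"role": "user", "content": "Say hello"}]},
    )
    print(resp.json()["choices"][0]["message"]["content"])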
@@ -284,7 +287,7 @@ The following has been added:
* Stateful decoder for WhisperPipeline. Whisper decoder models with past are deprecated.
* Export a model with the new optimum-intel to obtain a stateful version.
* Performance metrics for WhisperPipeline.
- * initial_prompt and hotwords parameters for whisper pipeline allowing to guide generation.
+ * initial_prompt and hotwords parameters for the Whisper pipeline, allowing generation to be guided.
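
A sketch of the new parameters, assuming they are accepted as keyword arguments by
`WhisperPipeline.generate` like other generation-config fields; the model folder is a
placeholder and audio loading is stubbed out:

.. code-block:: python

    import openvino_genai

    pipe = openvino_genai.WhisperPipeline("whisper-base-ov", "CPU")

    raw_speech = [0.0] * 16000  # stand-in for one second of 16 kHz samples

    result = pipe.generate(
        raw_speech,
        initial_prompt="Notes on OpenVINO GenAI.",  # biases the decoding context
        hotwords="OpenVINO GenAI",                  # boosts the listed words
    )
    print(result)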
* LLMPipeline
@@ -297,10 +300,9 @@ The following has been added:
* rng_seed parameter to ImageGenerationConfig.
* Callback for image generation pipelines, allowing generation progress to be tracked and
intermediate results to be obtained (sketched after this list).
- * EulerAncestralDiscreteScheduler - SDXL turbo.
- * PNDMScheduler – Stable Diffusion 1.x and 2.x.
- * Models: black-forest-labs/FLUX.1-schnell, Freepik/flux.1-lite-8B-alpha,
- black-forest-labs/FLUX.1-dev shuttleai/shuttle-3-diffusion.
+ * EulerAncestralDiscreteScheduler for SDXL turbo.
+ * PNDMScheduler for Stable Diffusion 1.x and 2.x.
+ * Models: FLUX.1-Schnell, FLUX.1-Lite-8B-Alpha, FLUX.1-Dev, and Shuttle-3-Diffusion.
* T5 encoder for SD3 Pipeline.
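
A sketch combining the new rng_seed parameter with the progress callback, assuming the
(step, num_steps, latent) signature where returning True interrupts generation; the model
folder is a placeholder:

.. code-block:: python

    import openvino_genai

    pipe = openvino_genai.Text2ImagePipeline("FLUX.1-schnell-ov", "CPU")

    def on_step(step, num_steps, latent):
        print(f"step {step + 1}/{num_steps}")
        return False  # True would interrupt generation

    image = pipe.generate("a watercolor fox", rng_seed=42, callback=on_step)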
* VLMPipeline
@@ -351,18 +353,21 @@ Known Issues
| ID: 161336
| Description:
| Compilation of an OpenVINO model performing weight quantization fails with Segmentation
- Fault on LNL. The following workaround can be applied to make it work with existing OV
- versions (including 25.0 RCs) before application run: export DNNL_MAX_CPU_ISA=AVX2_VNNI.
+ Fault on Intel® Core™ Ultra 200V processors. The following workaround can be applied to
+ make it work with existing OV versions (including 25.0 RCs) before running the application:
+ export DNNL_MAX_CPU_ISA=AVX2_VNNI.
| **Component: GPU Plugin**
| ID: 160802
| Description:
- | mllama model crashes on LNL. Please use OpenVINO 2024.6 or earlier to run the model.
+ | mllama model crashes on Intel® Core™ Ultra 200V processors. Please use OpenVINO 2024.6 or
+ earlier to run the model.
| **Component: GPU Plugin**
| ID: 160948
| Description:
- | Several models have accuracy degradation on LNL, ACM, and BMG. Please use OpenVINO 2024.6
+ | Several models have accuracy degradation on Intel® Core™ Ultra 200V processors,
+ Intel® Arc™ A-Series Graphics, and Intel® Arc™ B-Series Graphics. Please use OpenVINO 2024.6
to run the models. Model list: Denoise, Sharpen-Sharpen, fastseg-small, hbonet-0.5,
modnet_photographic_portrait_matting, modnet_webcam_portrait_matting,
mobilenet-v3-small-1.0-224, nasnet-a-mobile-224, yolo_v4, yolo_v5m, yolo_v5s, yolo_v8n,