
Conversation


@Ashitpatel001 Ashitpatel001 commented Jan 5, 2026

Solves issue #3244

This PR adds a new notebook demonstrating the latency reduction achieved by quantizing Qwen 2.5 (0.5B) to INT4 using OpenVINO.

Design Decisions

  • Model: Chose Qwen/Qwen2.5-0.5B-Instruct for its speed and relevance to edge devices.
  • Tools: Uses optimum-intel for seamless conversion (see the export sketch after this list).
  • Time: Execution takes 5-10 minutes (depending on internet speed for model download).
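
For context, the conversion boils down to roughly the following (a sketch only, not the notebook's exact cells; the output directory name is a placeholder and arguments may differ slightly between optimum-intel versions):

```python
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig
from transformers import AutoTokenizer

model_id = "Qwen/Qwen2.5-0.5B-Instruct"

# Export the PyTorch checkpoint to OpenVINO IR with INT4 weight compression.
ov_model = OVModelForCausalLM.from_pretrained(
    model_id,
    export=True,
    quantization_config=OVWeightQuantizationConfig(bits=4),
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Save the compressed model for benchmarking (placeholder directory).
ov_model.save_pretrained("qwen2.5-0.5b-int4")
tokenizer.save_pretrained("qwen2.5-0.5b-int4")
```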

Checklist

  • Notebook follows the notebooks/<title>/<title>.ipynb structure.
  • README.md included with Colab/Binder badges.
  • Telemetry and Scarf Pixel added.
  • Black formatted (line width 160).

This notebook provides a benchmark for the Qwen 2.5 model, comparing the performance of the PyTorch FP32 baseline and the OpenVINO INT4 quantized version on CPU.
@review-notebook-app

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB


brmarkus commented Jan 5, 2026

Would you mind renaming the notebook "Normal-cpu-vs-intel-openvino (1).ipynb"? It looks like a copied file (it contains "(1)").

Measuring latency and throughput can be tricky - just imagine a background spike caused by your operating system, or caching effects. You might need to run hundreds of iterations and report an average. Always feeding the same data can also benefit from underlying caching effects, which is why many different inputs - or even random data - are typically used.
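
To illustrate the idea, a hand-rolled measurement would need something along these lines (just a sketch; infer_fn and make_input are placeholders for your inference call and input generator):

```python
import time
import numpy as np

def measure_latency(infer_fn, make_input, warmup=10, runs=200):
    # Warm up caches and lazy initialization before measuring.
    for _ in range(warmup):
        infer_fn(make_input())
    # Generate a fresh (e.g. random) input per run so repeated identical data
    # does not flatter the results via caching effects.
    times = []
    for _ in range(runs):
        data = make_input()
        start = time.perf_counter()
        infer_fn(data)
        times.append(time.perf_counter() - start)
    return {
        "mean_s": float(np.mean(times)),
        "median_s": float(np.median(times)),
        "p95_s": float(np.percentile(times, 95)),
    }
```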

Have a look at the OpenVINO tool benchmark_app:

- Renamed notebook to remove duplicate artifacts (removed "(1)").
- Replaced manual Python timing with OpenVINO `benchmark_app` for scientifically accurate latency measurements.
- Added missing Markdown context (Introduction, Utils, Frontend Check).
- Fixed dynamic input shape errors by defining explicit shapes for the benchmark tool.
@Ashitpatel001
Author

@brmarkus Thank you so much for your patience and the detailed feedback. I learned a lot about how to properly measure latency using benchmark_app from your comments.

I have updated the PR with the following changes:

Fixed Filename: I apologize for the oversight with the duplicate filename; I have deleted the old file and renamed the notebook to benchmark-qwen-2.5-int4.ipynb.

Improved Benchmarking: I replaced the manual Python loop with the official benchmark_app tool (using -hint latency) to ensure reliable, reproducible results, as you suggested.

Stability: I defined explicit input shapes (including beam_idx) to prevent dynamic shape errors during the run.
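
For reference, the invocation is roughly the following (model path is a placeholder; shown here as a Python subprocess call for illustration):

```python
import subprocess

# Roughly the call described above: latency hint plus explicit static shapes
# for every input, including beam_idx.
subprocess.run(
    [
        "benchmark_app",
        "-m", "qwen2.5-0.5b-int4/openvino_model.xml",
        "-d", "CPU",
        "-hint", "latency",
        "-shape", "input_ids[1,10],attention_mask[1,10],position_ids[1,10],beam_idx[1]",
    ],
    check=True,
)
```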

Documentation: I added the standard Context, Utility, and Frontend Verification sections to match the repository's quality standards.

The notebook is now running smoothly. I look forward to your re-review and any further changes you may suggest!

@github-actions
Contributor

This PR will be closed in a week because of 2 weeks of no activity.

@github-actions github-actions bot added the Stale label Jan 21, 2026
@Ashitpatel001
Author

Hi @jgespino, just checking in on this! I believe I've addressed the previous feedback regarding stability and documentation. The benchmark is running smoothly on my end.

Please let me know if there are any further changes needed or if we can proceed with the review. Thanks!

@jgespino

@brmarkus Could you please help review?

@brmarkus

Using the latest version of this repo under MS-Win11-Pro, with Python v3.12.4, on a laptop with an "Intel Core Ultra 7 155H" CPU and 64GB system memory, I can almost run the notebook.

I'm getting these results on my system:

 Running Inference on PyTorch...
 PyTorch Time: 2.7203 seconds

and

 OpenVINO Python Time: 1.9287 seconds

========================================
RESULT: OpenVINO is 1.41x FASTER than PyTorch on CPU!
========================================

Calling benchmark_app fails in my case:

[Step 1/11] Parsing and validating input arguments
[ INFO ] Parsing input parameters
[Step 2/11] Loading OpenVINO Runtime
[ INFO ] OpenVINO:
[ INFO ] Build ................................. 2025.4.1-20426-82bbf0292c5-releases/2025/4
[ INFO ] 
[ INFO ] Device info:
[ INFO ] CPU
[ INFO ] Build ................................. 2025.4.1-20426-82bbf0292c5-releases/2025/4
[ INFO ] 
[ INFO ] 
[Step 3/11] Setting device configuration
[Step 4/11] Reading model files
[ INFO ] Loading model files
[ INFO ] Read model took 314.13 ms
[ INFO ] Original model I/O parameters:
[ INFO ] Model inputs:
[ INFO ]     input_ids (node: input_ids) : i64 / [...] / [?,?]
[ INFO ]     attention_mask (node: attention_mask) : i64 / [...] / [?,?]
[ INFO ]     position_ids (node: position_ids) : i64 / [...] / [?,?]
[ INFO ]     beam_idx (node: beam_idx) : i32 / [...] / [?]
[ INFO ] Model outputs:
[ INFO ]     logits (node: __module.lm_head/ov_ext::linear/MatMul) : f32 / [...] / [?,?,151936]
[Step 5/11] Resizing model to match image sizes and given batch
[ INFO ] Model batch size: 1
[ INFO ] Reshaping model: 'input_ids': [1,10], 'attention_mask': [1,10], 'position_ids': [1,10], 'beam_idx': [1]
[ INFO ] Reshape model took 133.44 ms
[Step 6/11] Configuring input of the model
[ INFO ] Model inputs:
[ INFO ]     input_ids (node: input_ids) : i64 / [...] / [1,10]
[ INFO ]     attention_mask (node: attention_mask) : i64 / [...] / [1,10]
[ INFO ]     position_ids (node: position_ids) : i64 / [...] / [1,10]
[ INFO ]     beam_idx (node: beam_idx) : i32 / [...] / [1]
[ INFO ] Model outputs:
[ INFO ]     logits (node: __module.lm_head/ov_ext::linear/MatMul) : f32 / [...] / [1,10,151936]
[Step 7/11] Loading the model to the device
[ INFO ] Compile model took 2670.99 ms
[Step 8/11] Querying optimal runtime parameters
[ INFO ] Model:
[ INFO ]   NETWORK_NAME: Model0
[ INFO ]   OPTIMAL_NUMBER_OF_INFER_REQUESTS: 1
[ INFO ]   NUM_STREAMS: 1
[ INFO ]   INFERENCE_NUM_THREADS: 6
[ INFO ]   PERF_COUNT: NO
[ INFO ]   INFERENCE_PRECISION_HINT: <Type: 'float32'>
[ INFO ]   PERFORMANCE_HINT: LATENCY
[ INFO ]   EXECUTION_MODE_HINT: ExecutionMode.PERFORMANCE
[ INFO ]   PERFORMANCE_HINT_NUM_REQUESTS: 0
[ INFO ]   ENABLE_CPU_PINNING: False
[ INFO ]   ENABLE_CPU_RESERVATION: False
[ INFO ]   SCHEDULING_CORE_TYPE: SchedulingCoreType.ANY_CORE
[ INFO ]   MODEL_DISTRIBUTION_POLICY: set()
[ INFO ]   ENABLE_HYPER_THREADING: False
[ INFO ]   EXECUTION_DEVICES: ['CPU']
[ INFO ]   CPU_DENORMALS_OPTIMIZATION: False
[ INFO ]   LOG_LEVEL: Level.NO
[ INFO ]   CPU_SPARSE_WEIGHTS_DECOMPRESSION_RATE: 1.0
[ INFO ]   ENABLE_TENSOR_PARALLEL: False
[ INFO ]   DYNAMIC_QUANTIZATION_GROUP_SIZE: 32
[ INFO ]   KV_CACHE_PRECISION: <Type: 'uint8_t'>
[ INFO ]   KEY_CACHE_PRECISION: <Type: 'uint8_t'>
[ INFO ]   VALUE_CACHE_PRECISION: <Type: 'uint8_t'>
[ INFO ]   KEY_CACHE_GROUP_SIZE: 0
[ INFO ]   VALUE_CACHE_GROUP_SIZE: 0
[Step 9/11] Creating infer requests and preparing input tensors
[ WARNING ] No input files were given for input 'input_ids'!. This input will be filled with random values!
[ WARNING ] No input files were given for input 'attention_mask'!. This input will be filled with random values!
[ WARNING ] No input files were given for input 'position_ids'!. This input will be filled with random values!
[ WARNING ] No input files were given for input 'beam_idx'!. This input will be filled with random values!
[ INFO ] Fill input 'input_ids' with random values 
[ INFO ] Fill input 'attention_mask' with random values 
[ INFO ] Fill input 'position_ids' with random values 
[ INFO ] Fill input 'beam_idx' with random values 
[Step 10/11] Measuring performance (Start inference synchronously, limits: 15000 ms duration)
[ INFO ] Benchmarking in inference only mode (inputs filling are not included in measurement loop).
[ INFO ] First inference took 38.12 ms
[ ERROR ] Exception from src\inference\src\cpp\infer_request.cpp:224:
Exception from src\plugins\intel_cpu\src\graph.cpp:1619:
Node __module.model.layers.0.self_attn/aten::scaled_dot_product_attention/ScaledDotProductAttention of type ScaledDotProductAttentionWithKVCache
Check 'index >= 0 && index < static_cast<int32_t>(B)' failed at src\plugins\intel_cpu\src\nodes\scaled_attn.cpp:1891:
ScaledDotProductAttentionWithKVCache node with name '__module.model.layers.0.self_attn/aten::scaled_dot_product_attention/ScaledDotProductAttention' beam_idx 122 is outside of the allowed interval [0,  1)


Traceback (most recent call last):
  File "c:\localdisk\OpenVINO-MSWin\openvino_env\Lib\site-packages\openvino\tools\benchmark\main.py", line 624, in main
    fps, median_latency_ms, avg_latency_ms, min_latency_ms, max_latency_ms, total_duration_sec, iteration = benchmark.main_loop(requests, data_queue, batch_size, args.latency_percentile, pcseq)
                                                                                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\localdisk\OpenVINO-MSWin\openvino_env\Lib\site-packages\openvino\tools\benchmark\benchmark.py", line 181, in main_loop
    times, total_duration_sec, iteration = self.sync_inference(requests[0], data_queue)
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\localdisk\OpenVINO-MSWin\openvino_env\Lib\site-packages\openvino\tools\benchmark\benchmark.py", line 106, in sync_inference
    request.infer()
  File "c:\localdisk\OpenVINO-MSWin\openvino_env\Lib\site-packages\openvino\_ov_api.py", line 184, in infer
    return OVDict(super().infer(_data_dispatch(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Exception from src\inference\src\cpp\infer_request.cpp:224:
Exception from src\plugins\intel_cpu\src\graph.cpp:1619:
Node __module.model.layers.0.self_attn/aten::scaled_dot_product_attention/ScaledDotProductAttention of type ScaledDotProductAttentionWithKVCache
Check 'index >= 0 && index < static_cast<int32_t>(B)' failed at src\plugins\intel_cpu\src\nodes\scaled_attn.cpp:1891:
ScaledDotProductAttentionWithKVCache node with name '__module.model.layers.0.self_attn/aten::scaled_dot_product_attention/ScaledDotProductAttention' beam_idx 122 is outside of the allowed interval [0,  1)

Have you tested in a different environment, e.g. under Linux/macOS, or with different versions of OpenVINO or Python? The notebook's self-contained package installs are upgraded/installed without pinned versions; over time this could result in broken builds or different timings.

Why is there a
pip install -q --upgrade optimum-intel[openvino] nncf transformers torch onnx
and a few cells later again a
pip install -q --upgrade optimum-intel[openvino] nncf transformers torch
(but with onnx missing)?

Please split your monster cells for better usability and faster feedback.

My main concern with this Jupyter notebook is that it's "just" a benchmarking notebook, using one specific model, one specific configuration, one specific (INT4) conversion.
What are the specifics and the differentiator of this notebook compared to using benchmark_app or one of the other demo notebooks?
Will users expect more such benchmark notebooks, one for each model?
Could you imagine adding more metrics to show the highlights and benefits of using quantized models - not only latency, but also storage, memory footprint, CPU utilization (embedded/edge devices often also have an embedded/integrated GPU or NPU/VPU), and throughput?
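
For example (sketch only; psutil would be an extra dependency the notebook does not use today), disk footprint and process memory are cheap to report alongside latency and throughput:

```python
import os
from pathlib import Path

import psutil  # assumption: extra dependency, not currently used by the notebook

def dir_size_mb(path):
    # Disk footprint of an exported model directory (IR .xml/.bin, tokenizer files, ...).
    return sum(f.stat().st_size for f in Path(path).rglob("*") if f.is_file()) / 1024**2

def process_rss_mb():
    # Resident memory of the current process, e.g. sampled before/after loading a model.
    return psutil.Process(os.getpid()).memory_info().rss / 1024**2
```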

@github-actions github-actions bot removed the Stale label Jan 22, 2026
@Ashitpatel001
Author

Hi @brmarkus and @jgespino , thank you so much for the detailed review and guidance! Your feedback was incredibly helpful in shaping this into a proper contribution.

I have addressed all the comments and completed the refactoring. Here is a summary of the changes:

Modular Framework: Moved all heavy logic into a structured bench/ package (Model Loader, KV Cache, Metrics) to keep the notebook clean.

Tutorial Narrative: Renamed the notebook to benchmark-transformer.ipynb and rewrote it as a guided tutorial. It now includes clear explanations, a "Heavy vs. Light" visualization, and a structured final report.

Robust Config: Updated the benchmark to use 3 warmups and 15 measure iterations as discussed to ensure statistically significant results on consumer hardware.

Cleanup: Removed venv and temporary cache folders, and added a clear README.md and requirements.txt.

Results (max observed): With these changes, I am seeing a 9.41x speedup (1.6s vs 15s latency) and ~80% storage reduction (308MB vs 1500MB) using the INT4 optimization!

The PR is now clean and ready for a final look. Thanks again for pushing me to improve the quality of this work!

A sample result image is shown below:

[Screenshot 2026-01-23 050521: sample benchmark results]

@Ashitpatel001 Ashitpatel001 changed the title [Benchmark] Latency Comparison: PyTorch vs OpenVINO INT4 (Qwen 2.5-0.5B) [Benchmark] Latency Comparison: PyTorch vs OpenVINO INT4 Jan 23, 2026


2. Run the Benchmark: Open the Jupyter Notebook and execute all cells.
jupyter lab benchmark-transformer.ipynb


The file is called benchmark-transformer-notebook.ipynb, not benchmark-transformer.ipynb.

@brmarkus

I can confirm that the notebook runs in my environment.

Find my comments below:

The initial check prints "No GPU found, using CPU only"; in my case an iGPU is detected. The comment below, however, says "This represents the standard "out-of-the-box" performance on a CPU."
Would you mind adding a device selector (like in the other notebooks) to allow benchmarking on other accelerators where supported? (Even embedded/edge devices sometimes have an iGPU/eGPU or a VPU/NPU.)

Would it make sense to add a drop-down selection for the conversion/compression precision (FP32, INT4 and INT8), similar to the device-selection drop-down? You do provide a config file, yes, but a drop-down would be a better user experience, wouldn't it?

Requiring a manual pip install -r requirements.txt for a notebook is unusual; usually a notebook is self-contained. The whole repo requires installing dependencies, and if a notebook needs something specific, the notebook installs those dependencies itself.

The step "Benchmarking PyTorch (Baseline)" prints a progress bar and, in my case, prints a single number "9". What is it supposed to say, "9"?

The final benchmark results in my environment look like this:
[screenshot: final benchmark results table]

The field "N/A (Variable)" of the original model is hard-coded - you can't access the original model, from the Huggingface cache, or via AutoModelForCausalLM (e.g. using HfApi().model_info()) (in my case the file model.safetensors has a file-size of 942MB, and the quantized&compressed model (BIN and XML file) have a reported size of 308MB)?

@Ashitpatel001
Author

Thanks @brmarkus for the detailed review! I completely agree with your points regarding usability and robustness. I have updated the notebook and README.md to address all of them.

Here is a summary of the changes:

1. Device & Precision Selectors (UX Improvement)

The Problem: As you noted, the notebook was previously hardcoded for CPU and required editing a YAML config file to change quantization settings, which isn't a great user experience.

The Solution: I have added Interactive Widgets (ipywidgets) at the start of the notebook.

Device Selector: Now scans for available accelerators (CPU, GPU, NPU) and lets the user choose.

Precision Selector: Added a dropdown for INT4, INT8, and FP16 so users can easily compare different compression levels without touching config files.
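
For reference, the widgets are wired up roughly like this (widget names are illustrative, not necessarily the exact notebook code):

```python
import ipywidgets as widgets
import openvino as ov
from IPython.display import display

core = ov.Core()

# Device dropdown populated from whatever OpenVINO actually detects on this machine.
device_widget = widgets.Dropdown(
    options=core.available_devices + ["AUTO"],
    value="CPU",
    description="Device:",
)

# Precision dropdown for the weight format used at export time.
precision_widget = widgets.Dropdown(
    options=["INT4", "INT8", "FP16"],
    value="INT4",
    description="Precision:",
)

display(device_widget, precision_widget)
```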

2. Self-Contained Installation

The Problem: Requiring a manual pip install -r requirements.txt prior to running the notebook breaks the "one-click" flow standard for notebooks.

The Solution: I added a "Smart Installation" cell at the very top.

It automatically detects if dependencies (openvino, optimum-intel, torch) are missing and installs them.

It includes a check to skip installation if libraries are already loaded in memory, preventing WinError 32 file-lock crashes on Windows during re-runs.
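
The check is roughly the following (a sketch; the real cell covers the full dependency list):

```python
import importlib.util
import subprocess
import sys

def ensure(module_name, pip_spec):
    # Install only when the module is not importable yet, so a re-run does not try to
    # replace files of libraries already loaded in memory (the WinError 32 case on Windows).
    if importlib.util.find_spec(module_name) is None:
        subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", pip_spec])

for module_name, pip_spec in [
    ("openvino", "openvino"),
    ("optimum", "optimum-intel[openvino]"),
    ("torch", "torch"),
]:
    ensure(module_name, pip_spec)
```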

3. The "9" Output Bug

The Problem: The "9" was an unsuppressed return value (likely the iteration count or a metric object) from the final line of the benchmark cell, which Jupyter prints by default.

The Solution: I suppressed the raw output using a semicolon ; and added explicit logging (e.g., PyTorch Benchmark Complete) to ensure the progress bar finishes cleanly without confusing numbers.

4. Dynamic "Disk Size" Calculation (Fixing the "N/A")

The Problem: The "N/A (Variable)" field was hardcoded because fetching model size metadata from the Hugging Face Hub API was unreliable (often returning None for specific file shards or failing offline).

The Solution: I now compute the size directly from the model's parameters during the PyTorch phase.

Before deleting the baseline model to save RAM, the code now iterates through the loaded model's parameters in memory: sum(p.nelement() * p.element_size() for p in model.parameters()).

This gives an exact in-memory weight size (in MB) for any model the user loads, completely offline, allowing a fair compression comparison in the final table.
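
Concretely, the calculation is along these lines (assuming model is the loaded PyTorch baseline, measured just before it is deleted):

```python
# In-memory size of the FP32 baseline weights, computed before the model is freed.
param_bytes = sum(p.nelement() * p.element_size() for p in model.parameters())
baseline_size_mb = param_bytes / 1024**2
print(f"PyTorch baseline weights: {baseline_size_mb:.1f} MB")
```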

The notebook is now fully interactive and self-contained. Ready for re-review!

[Screenshot 2026-01-23 233049: updated benchmark results]

@brmarkus

This looks great! Thank you very much for your prompt feedback!!

I still recommend splitting the bigger cells into smaller ones (e.g. to get results faster, or to redo individual cells with changed configs).
Can you add the chosen configuration (besides the model, also the device and precision) to the benchmark results, please?
There are a few hardcodings left, like

  • "OpenVINO Optimization (INT4)": precission is configurable
  • "print("\n Benchmarking OpenVINO (INT4)")": precission is configurable

Wrong precision used:

  • ""Backend": f"OpenVINO ({config['export']['precision'].upper()})",": this is supposed to use the setting from the widget, and not from the config YAML-file

When selecting GPU, "Benchmarking PyTorch (Baseline) on GPU" gets printed, but CPU is used (in my case no NVIDIA GPU is available in my laptop environment).

Minor note: during the individual benchmarks you could already print some of the values (latency and throughput, which need no extra calculation) alongside the progress bar as a preview.

Is NPU supposed to be supported (dynamic shapes?)? You might need to exclude it from the device-selection widget otherwise.

(On my Meteor Lake laptop I get an amazing 50x latency speedup when using GPU and INT4, with 15 tokens per second instead of 0.4 tokens per second with PyTorch on CPU & FP32.)

@Ashitpatel001
Author

Thanks @brmarkus for the review! I’ve pushed fixes for everything you mentioned:

Split the "Monster" Cell: I separated the Export/Load step from the Benchmark step. Much cleaner now, and you can re-run the benchmark loop without reloading the model.

Dynamic Results Table: The final table now has dedicated Device and Precision columns. It pulls the actual values from the widgets (e.g., GPU, INT4) instead of the config file, so it's always accurate.

Graph Labels: The X-axis on the chart is now dynamic too; it explicitly says things like "OpenVINO (INT4)" so the viewer knows exactly what ran.

Device Detection: I fixed the "lying" PyTorch print. It now checks for CUDA availability and correctly reports "CPU" if no NVIDIA GPU is found, even if "GPU" was requested.
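
The fallback logic is roughly (variable names illustrative; device_widget is the selector from the setup cell):

```python
import torch

# Only benchmark PyTorch on CUDA when it is actually available; otherwise report
# and use CPU even if "GPU" was selected in the device widget.
requested = device_widget.value
torch_device = "cuda" if requested == "GPU" and torch.cuda.is_available() else "cpu"
print(f"Benchmarking PyTorch (Baseline) on {torch_device.upper()}")
```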

NPU Safety: I filtered NPU out of the widget list for now to keep things stable until we add static reshaping.

Immediate Feedback: Added print statements right after the progress bars so you see the ms/token speed instantly.

Let me know if any cell still needs modifications. The benchmark is ready for re-review.

@brmarkus

Looks great!
@jgespino what do you think?

@Ashitpatel001
Author

Thanks @brmarkus for your feedback about the notebook!

@jgespino, to save you some scrolling, here is a quick recap of the recent updates based on the feedback:

Code Structure: Split the OpenVINO cell into Export and Benchmark phases for better modularity.

Reporting: The results table and graph now use dynamic labels (e.g., INT4, GPU) based on the widget selection rather than the static config.

Bug Fixes: Fixed PyTorch device detection (now falls back to CPU if no CUDA is found) and suppressed the GC output.

Let me know if you need anything else!
