[Benchmark] Latency Comparison: PyTorch vs OpenVINO INT4 #3245
Conversation
This notebook provides a benchmark for the Qwen 2.5 model, comparing the performance of the PyTorch FP32 baseline against the OpenVINO INT4 quantized version on CPU.
Would you mind renaming the notebook "Normal-cpu-vs-intel-openvino (1).ipynb"? It looks like a copied file (it contains "(1)" in the name). Measuring latency and throughput correctly can be difficult: just imagine a spike in background activity caused by your operating system, or caching effects. You might need to run a few hundred iterations and determine an average. But always using the same data might benefit from underlying caching effects too, which is why many different inputs, or even random data, are used. Have a look at the OpenVINO benchmark_app tool.
- Renamed notebook to remove duplicate artifacts (removed "(1)").
- Replaced manual Python timing with OpenVINO `benchmark_app` for scientifically accurate latency measurements.
- Added missing Markdown context (Introduction, Utils, Frontend Check).
- Fixed dynamic input shape errors by defining explicit shapes for the benchmark tool.
@brmarkus Thank you so much for your patience and the detailed feedback. I learned a lot about how to properly measure latency using benchmark_app from your comments. I have updated the PR with the following changes:
- Fixed filename: I apologize for the oversight with the duplicate filename; I have deleted the old file and renamed the notebook to benchmark-qwen-2.5-int4.ipynb.
- Improved benchmarking: I replaced the manual Python loop with the official benchmark_app tool (using `-hint latency`) to ensure scientifically accurate results, as you suggested.
- Stability: I defined explicit input shapes (including beam_idx) to prevent dynamic shape errors during the run.
- Documentation: I added the standard Context, Utility, and Frontend Verification sections to match the repository's quality standards.
The notebook is now running smoothly. I look forward to your re-review and any further changes you may suggest!
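For illustration, here is a minimal sketch of how `benchmark_app` could be invoked from a notebook cell with a latency hint and explicit shapes. The model path and the exact input names/shapes are assumptions (they depend on how the model was exported), not the PR's actual code:

```python
# Sketch only: invoke OpenVINO's benchmark_app CLI from Python.
# The model path and input names/shapes below are illustrative assumptions;
# check the exported IR for the real input names of your model.
import subprocess

model_xml = "qwen2.5-0.5b-int4/openvino_model.xml"  # hypothetical export location

cmd = [
    "benchmark_app",
    "-m", model_xml,
    "-d", "CPU",
    "-hint", "latency",  # optimize scheduling for lowest latency
    "-shape", "input_ids[1,32],attention_mask[1,32],position_ids[1,32],beam_idx[1]",
    "-niter", "100",     # average over many iterations to smooth out OS noise
]
print(subprocess.run(cmd, capture_output=True, text=True).stdout)
```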
This PR will be closed in a week because of 2 weeks of no activity.
Hi @jgespino, just checking in on this! I believe I've addressed the previous feedback regarding stability and documentation. The benchmark is running smoothly on my end. Please let me know if there are any further changes needed or if we can proceed with the review. Thanks!
@brmarkus Could you please help review?
Using the latest version of this repo under MS-Win11-Pro, with Python v3.12.4, on a laptop with an "Intel Core Ultra 7 155H" CPU and 64 GB of system memory, I can almost run the notebook. I'm getting these results on my system. Have you tested in a different environment, e.g. under Linux/macOS, or with a different version of OpenVINO or Python? The self-contained packages are installed/upgraded without pinned versions; over time this could result in broken builds or different timings. Why is there a stray "9" printed as output? Please do split your monster cells for better usability and faster feedback. My main concern with this Jupyter notebook is that it's "just" a benchmarking notebook, using one specific model, one specific configuration, one specific (INT4) conversion.
Hi @brmarkus and @jgespino, thank you so much for the detailed review and guidance! Your feedback was incredibly helpful in shaping this into a proper contribution. I have addressed all the comments and completed the refactoring. Here is a summary of the changes:
- Modular framework: moved all heavy logic into a structured bench/ package (model loader, KV cache, metrics) to keep the notebook clean.
- Tutorial narrative: renamed the notebook to benchmark-transformer.ipynb and rewrote it as a guided tutorial. It now includes clear explanations, a "Heavy vs. Light" visualization, and a structured final report.
- Robust config: updated the benchmark to use 3 warmup and 15 measure iterations, as discussed, to ensure statistically significant results on consumer hardware.
- Cleanup: removed the venv and temporary cache folders, and added a clear README.md and requirements.txt.
- Results (max observed): with these changes, I am seeing a 9.41x speedup (1.6 s vs 15 s latency) and ~80% storage reduction (308 MB vs 1500 MB) using the INT4 optimization!
The PR is now clean and ready for a final look. Thanks again for pushing me to improve the quality of this work! A sample result image is shown below.
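For readers following along, a minimal sketch of the warmup-plus-measurement pattern described in that comment. Only the 3 warmup / 15 measured iterations come from the discussion; `generate_fn` and the returned statistics are illustrative placeholders, not the bench/ package's actual API:

```python
# Sketch of a warmup + measurement loop for latency benchmarking.
import time
import statistics

def benchmark(generate_fn, warmup_iters=3, measure_iters=15):
    # Warmup runs: let caches, thread pools, and lazy initialization settle.
    for _ in range(warmup_iters):
        generate_fn()
    # Measured runs: record per-iteration latency in seconds.
    latencies = []
    for _ in range(measure_iters):
        start = time.perf_counter()
        generate_fn()
        latencies.append(time.perf_counter() - start)
    return {
        "mean_s": statistics.mean(latencies),
        "median_s": statistics.median(latencies),
        "stdev_s": statistics.stdev(latencies),
    }
```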
> 2. Run the Benchmark: open the Jupyter Notebook and execute all cells.
>    `jupyter lab benchmark-transformer.ipynb`
The file is called benchmark-transformer-notebook.ipynb, not benchmark-transformer.ipynb.
Thanks @brmarkus for the detailed review! I completely agree with your points regarding usability and robustness. I have updated the notebook and README.md to address all of them. Here is a summary of the changes:

1. Device & precision selectors (UX improvement)
The problem: as you noted, the notebook was previously hardcoded for CPU and required editing a YAML config file to change quantization settings, which isn't a great user experience.
The solution: I added interactive widgets (ipywidgets) at the start of the notebook. The device selector now scans for available accelerators (CPU, GPU, NPU) and lets the user choose; the precision selector adds a dropdown for INT4, INT8, and FP16 so users can easily compare different compression levels without touching config files.

2. Self-contained installation
The problem: requiring a manual `pip install -r requirements.txt` before running the notebook breaks the "one-click" flow standard for notebooks.
The solution: I added a "Smart Installation" cell at the very top. It automatically detects whether dependencies (openvino, optimum-intel, torch) are missing and installs them. It includes a check to skip installation if the libraries are already loaded in memory, preventing WinError 32 file-lock crashes on Windows during re-runs.

3. The "9" output bug
The problem: the "9" was an unsuppressed return value (likely the iteration count or a metric object) from the final line of the benchmark cell, which Jupyter prints by default.
The solution: I suppressed the raw output with a semicolon (`;`) and added explicit logging (e.g., "PyTorch Benchmark Complete") so the progress bar finishes cleanly without confusing numbers.

4. Dynamic "disk size" calculation (fixing the "N/A")
The problem: the "N/A (Variable)" field was hardcoded because fetching model size metadata from the Hugging Face Hub API was unreliable (it often returned None for specific file shards or failed offline).
The solution: I implemented a physics-based calculation directly in the PyTorch phase. Before deleting the baseline model to save RAM, the code now iterates through the loaded model's parameters in memory: `sum(p.nelement() * p.element_size() for p in model.parameters())`. This guarantees an exact, accurate size (in MB) for any model the user loads, completely offline, allowing a fair compression comparison in the final table (see the sketch below).

The notebook is now fully interactive and self-contained. Ready for re-review!
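A minimal sketch of the two ideas above (device scanning and in-memory parameter sizing). The widget variable and printed message are illustrative, not the notebook's actual code; only the size formula and model name come from the thread:

```python
# Sketch: list available OpenVINO devices and compute baseline model size offline.
import openvino as ov
import ipywidgets as widgets
from transformers import AutoModelForCausalLM

# Device selector: only offer accelerators OpenVINO can actually see on this machine.
core = ov.Core()
device_selector = widgets.Dropdown(options=core.available_devices, description="Device:")

# "Physics-based" size: sum the in-memory bytes of every parameter tensor,
# so no Hugging Face Hub metadata is needed and it works offline.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
size_mb = sum(p.nelement() * p.element_size() for p in model.parameters()) / (1024 ** 2)
print(f"Baseline model weights: {size_mb:.0f} MB")
```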
This looks great! Thank you very much for your prompt feedback!! I still recommend splitting the bigger cells into smaller ones (e.g., to get results faster, or to redo some cells with changed configs).
Wrong precision used:
When selecting GPU, "Benchmarking PyTorch (Baseline) on GPU" gets printed, but CPU is used (in my case no NVIDIA GPU is available in my laptop environment). Minor note: when doing the individual benchmarks, you might already print some of the values (latency and throughput, which don't require any calculations) in addition to the progress bar, as a preview. Is NPU supposed to be supported (dynamic shapes?)? You might need to exclude it from the device selection widget otherwise. (On my Meteor Lake laptop I get an amazing 50x latency speedup when using GPU and INT4, with 15 tokens per second instead of 0.4 tokens per second using PyTorch on CPU and FP32.)
Thanks @brmarkus for the review! I've pushed fixes for everything you mentioned:
- Split the "monster" cell: I separated the export/load step from the benchmark step. Much cleaner now, and you can re-run the benchmark loop without reloading the model.
- Dynamic results table: the final table now has dedicated Device and Precision columns. It pulls the actual values from the widgets (e.g., GPU, INT4) instead of the config file, so it's always accurate.
- Graph labels: the X-axis on the chart is now dynamic too; it explicitly says things like "OpenVINO (INT4)" so the viewer knows exactly what ran.
- Device detection: I fixed the "lying" PyTorch print. It now checks for CUDA availability and correctly reports "CPU" if no NVIDIA GPU is found, even if "GPU" was requested (see the sketch below).
- NPU safety: I filtered NPU out of the widget list for now, to keep things stable until we add static reshaping.
- Immediate feedback: added print statements right after the progress bars so you see the ms/token speed instantly.
Let me know if any cell still needs modifications. The benchmark is ready for a re-review.
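A short sketch of the device-detection fix described above; the variable names and printed message are illustrative assumptions:

```python
# Sketch: report the device PyTorch will actually use, not the one requested.
import torch

requested_device = "GPU"  # e.g. value taken from the ipywidgets selector
torch_device = "cuda" if (requested_device == "GPU" and torch.cuda.is_available()) else "cpu"
print(f"Benchmarking PyTorch (baseline) on {torch_device.upper()}")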
Looks great!
Thanks @brmarkus for your feedback about the notebook! @jgespino, to save you some scrolling, here is a quick recap of the recent updates based on the feedback:
- Code structure: split the OpenVINO cell into export and benchmark phases for better modularity.
- Reporting: the results table and graph now use dynamic labels (e.g., INT4, GPU) based on the widget selection rather than the static config.
- Bug fixes: fixed PyTorch device detection (it now falls back to CPU if no CUDA is found) and suppressed the stray GC output.
Let me know if you need anything else!



Solves issue #3244
This PR adds a new notebook demonstrating the latency reduction achieved by quantizing Qwen 2.5 (0.5B) to INT4 using OpenVINO.
Design Decisions
- Model: `Qwen/Qwen2.5-0.5B-Instruct` for its speed and relevance to edge devices.
- Conversion: `optimum-intel` for seamless conversion (see the sketch after this list).
Checklist
- Follows the `notebooks/<title>/<title>.ipynb` structure.
- `README.md` included with Colab/Binder badges.
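For context, a hedged sketch of an optimum-intel based INT4 export. The exact arguments can differ between optimum-intel versions, and the output directory name is just an example; this is not necessarily the notebook's own export code:

```python
# Sketch: export Qwen 2.5 to OpenVINO IR with 4-bit weight compression via optimum-intel.
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

model_id = "Qwen/Qwen2.5-0.5B-Instruct"
ov_model = OVModelForCausalLM.from_pretrained(
    model_id,
    export=True,  # convert the PyTorch checkpoint to OpenVINO IR on the fly
    quantization_config=OVWeightQuantizationConfig(bits=4),  # INT4 weight compression
)
ov_model.save_pretrained("qwen2.5-0.5b-int4")  # example output directory
```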