[OpenVINO] Introduce extended quantization dataset options like 'wikitext2:seq_len=128' #1564
base: main
Conversation
- Add dataset_kwargs attribute and parsing logic to OVQuantizationConfigBase
- Parse "dataset:seq_len=value" syntax with validation
- Rename seqlen parameter to seq_len in _prepare_causal_lm_calibration_data
- Thread dataset_kwargs through all calibration helpers
- Add comprehensive unit and integration tests

Co-authored-by: nikita-savelyevv <[email protected]>

Apply black and ruff formatting to changes

Revert unrelated formatting change to minimize diff

Address PR feedback: merge tests, update docstrings, use explicit seq_len parameter
- Merge TestDatasetIntegration into TestDatasetParsing and move to test_quantization.py
- Remove standalone test files (test_dataset_parsing.py, test_dataset_integration.py)
- Update dataset docstrings in all OVQuantizationConfigBase child classes
- Update CLI --dataset help text in openvino.py
- Update dataset documentation in export.mdx
- Change from **dataset_kwargs unpacking to explicit seq_len= parameter
- Remove unnecessary comment about integrating seq_len

Add test cases for new dataset format with seq_len option
- Add test case to test_exporters_cli.py using wikitext2:seq_len=64
- Add test case to test_quantization.py using c4:seq_len=64

Address final PR feedback: remove empty line, adjust test samples, remove redundant tests
- Remove empty line between seq_len assignment and tokenizer initialization
- Change test to use num_samples=1 instead of 100 for faster execution
- Add num_samples=1 to test configuration in test_quantization.py
- Remove redundant integration test methods (causal_lm, gsm8k, text_to_text, text_encoder)
- Remove redundant list_dataset_backward_compatibility test

Fix serialization and seq_len parameter passing issues
- Handle dataset_kwargs deserialization for backward compatibility by extracting it from kwargs
- Only pass seq_len to helper functions when it exists in dataset_kwargs to avoid overriding defaults with None
- Use conditional kwargs construction for all calibration helpers
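The last point, only forwarding seq_len when the user actually provided it, can be sketched roughly as follows. This is a hypothetical helper for illustration, not the PR's actual code; the function name is invented:

```python
def build_calibration_kwargs(dataset_kwargs):
    """Sketch of conditional kwargs construction: forward seq_len to a
    calibration helper only when it was explicitly provided, so the helper's
    own default is not overridden with None."""
    kwargs = {}
    if dataset_kwargs and "seq_len" in dataset_kwargs:
        kwargs["seq_len"] = dataset_kwargs["seq_len"]
    return kwargs
```

A helper called as `_prepare_causal_lm_calibration_data(config, **build_calibration_kwargs(config.dataset_kwargs))` then keeps its default sequence length unless the dataset string overrode it.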
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
echarlaix left a comment:
Thanks for the addition!
      dataset (`str or List[str]`, *optional*):
-         The dataset used for data-aware optimization with NNCF.
+         The dataset used for data-aware optimization with NNCF. Can be a dataset name (e.g., 'wikitext')
+         or a string with options (e.g., 'wikitext2:seq_len=128'). Currently supported option: seq_len.
Wouldn't that be redundant with num_samples? Also, what if both are provided?
There is a distinction between the two: num_samples is the number of sequences in the dataset, while seq_len is the length of each sequence. Let me update the description to clarify what seq_len stands for.
      dataset (`str or List[str]`, *optional*):
-         The dataset used for data-aware optimization with NNCF.
+         The dataset used for data-aware optimization with NNCF. Can be a dataset name (e.g., 'wikitext2')
+         or a string with options (e.g., 'wikitext2:seq_len=128'). The only currently supported option is `seq_len`.
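A minimal sketch of how such a `name:key=value` dataset string could be parsed. This is a hypothetical standalone helper for illustration only; the PR's actual parsing logic lives in `OVQuantizationConfigBase`:

```python
def parse_dataset_string(dataset):
    """Split a spec like 'wikitext2:seq_len=128' into (name, kwargs).

    Hypothetical sketch: only the seq_len option is accepted, matching the
    docstring above; anything else raises a validation error.
    """
    name, _, options = dataset.partition(":")
    kwargs = {}
    if options:
        for option in options.split(","):
            key, sep, value = option.partition("=")
            if not sep:
                raise ValueError(f"Malformed dataset option: {option!r}")
            if key != "seq_len":
                raise ValueError(f"Unsupported dataset option: {key!r}")
            kwargs[key] = int(value)
    return name, kwargs
```

A plain dataset name without a colon parses to an empty kwargs dict, which preserves backward compatibility with the existing `dataset="wikitext2"` usage.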
There are some assumptions about seqlen=32 in the gptq utils:
https://github.com/huggingface/optimum/blob/main/optimum/gptq/data.py#L128-L129
Please check whether this code remains valid for other seq_len values. It probably makes sense to copy this helper code into optimum-intel and extend it for other seq_len values:

    limit = num_samples * seqlen // 4  # ~1k for 128 samples with seqlen=32 to be aligned with optimum
    text = "".join([" \n" if s == "" else s for s in traindata["text"][:limit]])
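To make the reviewer's concern concrete, the quoted limit grows linearly with seqlen, so a larger seq_len pulls in proportionally more raw text rows before tokenization. A quick check of the formula (values assumed for illustration, not taken from the PR):

```python
def sample_limit(num_samples, seqlen):
    # Mirrors the quoted formula from optimum/gptq/data.py: the number of raw
    # text rows kept, proportional to the total tokens requested.
    return num_samples * seqlen // 4

# With the values assumed in optimum (128 samples, seqlen=32) the limit is
# ~1k rows; seq_len=128 quadruples it.
```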
Good catch! Added
What does this PR do?
Sequence length during quantization calibration data collection can now be provided as part of the dataset string, e.g. "wikitext2:seq_len=128". The motivation is that it sometimes makes sense to adjust the default sequence length value, and this way it can be configured in those particular cases, for example for configs inside _DEFAULT_4BIT_WQ_CONFIGS.

Based on PR by Copilot: nikita-savelyevv#4
Ticket: CVS-176623
Before submitting