
Conversation


@nikita-savelyevv nikita-savelyevv commented Dec 17, 2025

What does this PR do?

The sequence length used when collecting quantization calibration data can now be provided as part of the dataset string, e.g. "wikitext2:seq_len=128". The motivation is that it sometimes makes sense to override the default sequence length, and this syntax makes it configurable in those particular cases, for example for the configs inside _DEFAULT_4BIT_WQ_CONFIGS.

Based on PR by Copilot: nikita-savelyevv#4

Ticket: CVS-176623
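The "dataset:seq_len=value" parsing described above might look roughly like this (an illustrative sketch; the function name and error messages are hypothetical, not the actual optimum-intel implementation):

```python
# Hypothetical sketch of parsing "wikitext2:seq_len=128"-style dataset strings.
# Not the actual optimum-intel code; names are illustrative.
def parse_dataset_string(dataset: str):
    """Split 'name:key=value,...' into a dataset name and an options dict."""
    name, _, options_str = dataset.partition(":")
    options = {}
    if options_str:
        for option in options_str.split(","):
            key, sep, value = option.partition("=")
            if not sep:
                raise ValueError(f"Malformed dataset option: {option!r}")
            if key != "seq_len":  # seq_len is the only supported option so far
                raise ValueError(f"Unsupported dataset option: {key!r}")
            options[key] = int(value)
    return name, options
```

A plain dataset name without a ":" remains valid, preserving backward compatibility with existing configs.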

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Copilot AI and others added 2 commits December 17, 2025 14:58
- Add dataset_kwargs attribute and parsing logic to OVQuantizationConfigBase
- Parse "dataset:seq_len=value" syntax with validation
- Rename seqlen parameter to seq_len in _prepare_causal_lm_calibration_data
- Thread dataset_kwargs through all calibration helpers
- Add comprehensive unit and integration tests

Co-authored-by: nikita-savelyevv <[email protected]>

Apply black and ruff formatting to changes

Co-authored-by: nikita-savelyevv <[email protected]>

Revert unrelated formatting change to minimize diff

Co-authored-by: nikita-savelyevv <[email protected]>

Address PR feedback: merge tests, update docstrings, use explicit seq_len parameter

- Merge TestDatasetIntegration into TestDatasetParsing and move to test_quantization.py
- Remove standalone test files (test_dataset_parsing.py, test_dataset_integration.py)
- Update dataset docstrings in all OVQuantizationConfigBase child classes
- Update CLI --dataset help text in openvino.py
- Update dataset documentation in export.mdx
- Change from **dataset_kwargs unpacking to explicit seq_len= parameter
- Remove unnecessary comment about integrating seq_len

Co-authored-by: nikita-savelyevv <[email protected]>

Add test cases for new dataset format with seq_len option

- Add test case to test_exporters_cli.py using wikitext2:seq_len=64
- Add test case to test_quantization.py using c4:seq_len=64

Co-authored-by: nikita-savelyevv <[email protected]>

Address final PR feedback: remove empty line, adjust test samples, remove redundant tests

- Remove empty line between seq_len assignment and tokenizer initialization
- Change test to use num_samples=1 instead of 100 for faster execution
- Add num_samples=1 to test configuration in test_quantization.py
- Remove redundant integration test methods (causal_lm, gsm8k, text_to_text, text_encoder)
- Remove redundant list_dataset_backward_compatibility test

Co-authored-by: nikita-savelyevv <[email protected]>

Fix serialization and seq_len parameter passing issues

- Handle dataset_kwargs deserialization for backward compatibility by extracting it from kwargs
- Only pass seq_len to helper functions when it exists in dataset_kwargs to avoid overriding defaults with None
- Use conditional kwargs construction for all calibration helpers

Co-authored-by: nikita-savelyevv <[email protected]>
@nikita-savelyevv nikita-savelyevv changed the title [OpenVINO] Introduce extended dataset options like 'wikitext:seq_len=128' [OpenVINO] Introduce extended quantization dataset options like 'wikitext:seq_len=128' Dec 17, 2025
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@nikita-savelyevv nikita-savelyevv changed the title [OpenVINO] Introduce extended quantization dataset options like 'wikitext:seq_len=128' [OpenVINO] Introduce extended quantization dataset options like 'wikitext2:seq_len=128' Dec 17, 2025
@nikita-savelyevv nikita-savelyevv marked this pull request as ready for review December 18, 2025 13:11

@echarlaix echarlaix left a comment


Thanks for the addition!

    dataset (`str or List[str]`, *optional*):
    -    The dataset used for data-aware optimization with NNCF.
    +    The dataset used for data-aware optimization with NNCF. Can be a dataset name (e.g., 'wikitext')
    +    or a string with options (e.g., 'wikitext2:seq_len=128'). Currently supported option: seq_len.
Collaborator

Wouldn't that be redundant with num_samples? Also, what if both are provided?

Collaborator Author

There is a distinction between the two: num_samples is the number of sequences in the calibration dataset, while seq_len is the length of each sequence. Let me update the description to clarify what seq_len stands for.

    dataset (`str or List[str]`, *optional*):
    -    The dataset used for data-aware optimization with NNCF.
    +    The dataset used for data-aware optimization with NNCF. Can be a dataset name (e.g., 'wikitext2')
    +    or a string with options (e.g., 'wikitext2:seq_len=128'). The only currently supported option is `seq_len`
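The num_samples/seq_len distinction discussed above can be sketched as follows (a hypothetical helper for illustration, not the actual calibration code): num_samples selects how many sequences are collected, while seq_len caps the token length of each one.

```python
# Illustrative only: shows the separate roles of num_samples (how many
# sequences) and seq_len (tokens per sequence) during calibration.
def collect_calibration_samples(token_stream, num_samples, seq_len):
    """Cut a flat token stream into up to num_samples chunks of seq_len tokens."""
    samples = []
    for i in range(num_samples):
        chunk = token_stream[i * seq_len : (i + 1) * seq_len]
        if len(chunk) < seq_len:
            break  # not enough tokens left for a full sequence
        samples.append(chunk)
    return samples
```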
Contributor

@ljaljushkin ljaljushkin Dec 23, 2025

There are some assumptions about seqlen=32 in the gptq utils:
https://github.com/huggingface/optimum/blob/main/optimum/gptq/data.py#L128-L129
Please check whether this code will remain valid for other seq_len values.

It probably makes sense to copy this helper code into optimum-intel and extend it for other seq_len values:

    limit = num_samples * seqlen // 4  # ~1k for 128 samples with seqlen=32 to be aligned with optimum
    text = "".join([" \n" if s == "" else s for s in traindata["text"][:limit]])
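The quoted snippet could be generalized into a small helper parameterized by seq_len, along these lines (an illustrative sketch; build_calibration_text is a hypothetical name, and the limit heuristic is copied from the quoted optimum code):

```python
# Hypothetical generalization of the quoted optimum gptq data snippet.
def build_calibration_text(rows, num_samples, seq_len):
    """Join enough dataset rows to cover num_samples sequences of seq_len tokens."""
    # ~1k rows for 128 samples with seq_len=32, to stay aligned with optimum
    limit = num_samples * seq_len // 4
    # Empty rows become " \n" to preserve paragraph breaks, as in optimum
    return "".join(" \n" if s == "" else s for s in rows[:limit])
```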

Collaborator Author

Good catch! Added
