[OpenVINO] Introduce extended quantization dataset options like 'wikitext2:seq_len=128' #1564
base: main
Conversation
- Add dataset_kwargs attribute and parsing logic to OVQuantizationConfigBase
- Parse "dataset:seq_len=value" syntax with validation
- Rename seqlen parameter to seq_len in _prepare_causal_lm_calibration_data
- Thread dataset_kwargs through all calibration helpers
- Add comprehensive unit and integration tests

Co-authored-by: nikita-savelyevv <[email protected]>

Apply black and ruff formatting to changes

Revert unrelated formatting change to minimize diff

Address PR feedback: merge tests, update docstrings, use explicit seq_len parameter
- Merge TestDatasetIntegration into TestDatasetParsing and move to test_quantization.py
- Remove standalone test files (test_dataset_parsing.py, test_dataset_integration.py)
- Update dataset docstrings in all OVQuantizationConfigBase child classes
- Update CLI --dataset help text in openvino.py
- Update dataset documentation in export.mdx
- Change from **dataset_kwargs unpacking to explicit seq_len= parameter
- Remove unnecessary comment about integrating seq_len

Add test cases for new dataset format with seq_len option
- Add test case to test_exporters_cli.py using wikitext2:seq_len=64
- Add test case to test_quantization.py using c4:seq_len=64

Address final PR feedback: remove empty line, adjust test samples, remove redundant tests
- Remove empty line between seq_len assignment and tokenizer initialization
- Change test to use num_samples=1 instead of 100 for faster execution
- Add num_samples=1 to test configuration in test_quantization.py
- Remove redundant integration test methods (causal_lm, gsm8k, text_to_text, text_encoder)
- Remove redundant list_dataset_backward_compatibility test

Fix serialization and seq_len parameter passing issues
- Handle dataset_kwargs deserialization for backward compatibility by extracting it from kwargs
- Only pass seq_len to helper functions when it exists in dataset_kwargs to avoid overriding defaults with None
- Use conditional kwargs construction for all calibration helpers
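The last point, only forwarding seq_len when the user actually provided it, can be sketched roughly as follows. This is a hypothetical helper for illustration, not the PR's actual code; the function name is invented:

```python
def build_calibration_kwargs(dataset_kwargs):
    """Sketch of conditional kwargs construction: forward seq_len to a
    calibration helper only when it was explicitly provided, so the helper's
    own default is not overridden with None."""
    kwargs = {}
    if dataset_kwargs and "seq_len" in dataset_kwargs:
        kwargs["seq_len"] = dataset_kwargs["seq_len"]
    return kwargs
```

A helper called as `_prepare_causal_lm_calibration_data(config, **build_calibration_kwargs(config.dataset_kwargs))` then keeps its default sequence length unless the dataset string overrode it.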
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
echarlaix left a comment:
Thanks for the addition!
      dataset (`str or List[str]`, *optional*):
-         The dataset used for data-aware optimization with NNCF.
+         The dataset used for data-aware optimization with NNCF. Can be a dataset name (e.g., 'wikitext')
+         or a string with options (e.g., 'wikitext2:seq_len=128'). Currently supported option: seq_len.
Wouldn't that be redundant with num_samples? Also, what if both are provided?
There is a distinction between the two: num_samples is the number of sequences in the dataset, while seq_len is the length of each sequence. Let me update the description to clarify what seq_len stands for.
      dataset (`str or List[str]`, *optional*):
-         The dataset used for data-aware optimization with NNCF.
+         The dataset used for data-aware optimization with NNCF. Can be a dataset name (e.g., 'wikitext2')
+         or a string with options (e.g., 'wikitext2:seq_len=128'). The only currently supported option is `seq_len`.
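A minimal sketch of how such a `name:key=value` dataset string could be parsed. This is a hypothetical standalone helper for illustration only; the PR's actual parsing logic lives in `OVQuantizationConfigBase`:

```python
def parse_dataset_string(dataset):
    """Split a spec like 'wikitext2:seq_len=128' into (name, kwargs).

    Hypothetical sketch: only the seq_len option is accepted, matching the
    docstring above; anything else raises a validation error.
    """
    name, _, options = dataset.partition(":")
    kwargs = {}
    if options:
        for option in options.split(","):
            key, sep, value = option.partition("=")
            if not sep:
                raise ValueError(f"Malformed dataset option: {option!r}")
            if key != "seq_len":
                raise ValueError(f"Unsupported dataset option: {key!r}")
            kwargs[key] = int(value)
    return name, kwargs
```

A plain dataset name without a colon parses to an empty kwargs dict, which preserves backward compatibility with the existing `dataset="wikitext2"` usage.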
There are some assumptions about seqlen=32 in the gptq utils:
https://github.com/huggingface/optimum/blob/main/optimum/gptq/data.py#L128-L129
Please check whether this code remains valid for other seq_len values. It probably makes sense to copy this helper code into optimum-intel and extend it for other seq_len values:

    limit = num_samples * seqlen // 4  # ~1k for 128 samples with seqlen=32 to be aligned with optimum
    text = "".join([" \n" if s == "" else s for s in traindata["text"][:limit]])
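To make the reviewer's concern concrete, the quoted limit grows linearly with seqlen, so a larger seq_len pulls in proportionally more raw text rows before tokenization. A quick check of the formula (values assumed for illustration, not taken from the PR):

```python
def sample_limit(num_samples, seqlen):
    # Mirrors the quoted formula from optimum/gptq/data.py: the number of raw
    # text rows kept, proportional to the total tokens requested.
    return num_samples * seqlen // 4

# With the values assumed in optimum (128 samples, seqlen=32) the limit is
# ~1k rows; seq_len=128 quadruples it.
```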
Good catch! Added
What does this PR do?
Sequence length during quantization calibration data collection can now be provided as part of the dataset string, e.g. "wikitext2:seq_len=128". The motivation is that it sometimes makes sense to adjust the default sequence length value, and this way it can be configured in those particular cases, for example for configs inside _DEFAULT_4BIT_WQ_CONFIGS.

Based on PR by Copilot: nikita-savelyevv#4
Ticket: CVS-176623
Before submitting