-
Notifications
You must be signed in to change notification settings - Fork 309
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avoid using skip() in hf_datasets #835
Conversation
if isinstance(self._data, Dataset) and self._sample_idx == len(self._data): | ||
return iter([]) | ||
|
||
return iter(self._data.skip(self._sample_idx)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
two questions:
- The wrong behavior is not captured by unit test https://github.com/pytorch/torchtitan/blob/main/tests/unit_tests/test_dataset_checkpointing.py. Is it because the error only happens in distributed environment, on some rank but not others?
- Does skip fail for both map-style datasets (
c4_test
) and IterableDatasets (c4
)? Asking because from my PR uniformly use skip for both (map-style) Dataset and IterableDataset #521 I recallskip
for IterableDataset is a "new" feature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am working on testing loss issue(whether still happens) and whether difference between datasets c4/c4_test right now.
unit_test would be the next PR after fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- It only happens in distributed environment and on certain ranks only.
- I could not reproduce this with
c4_test
.
I don't know the detail of the root cause though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would not land this PR but I verified all ranks output the expected sample even after resuming from checkpoints, @mori360 would work on loss parity with unittest verifcation.
We found out that skip() may not behave as expected. This is a workaround solution while we are investiging the root cause. ghstack-source-id: 412810bc4a83de39df3fdfe62bc249d80c7deb19 Pull Request resolved: #835
Stack from ghstack (oldest at bottom):
We found out that skip() may not behave as expected. This is a workaround solution while we are investiging the root cause.