Hello,
I am trying to download the openwebtext dataset from Hugging Face, but I keep getting the following error:
Downloading data: 100%|________________________________________________________________________________________________________________| 12.9G/12.9G [25:43<00:00, 8.35MB/s]
/home/nlp/sloboda1/anaconda3/envs/ssdlm/lib/python3.8/site-packages/datasets/download/download_manager.py:527: FutureWarning: 'num_proc' was deprecated in version 2.6.2 and will be removed in 3.0.0. Pass `DownloadConfig(num_proc=<num_proc>)` to the initializer instead.
warnings.warn(
Extracting data files: 100%|________________________________________________________________________________________________________| 20610/20610 [9:43:42<00:00, 1.70s/it]
Traceback (most recent call last):
  File "ssd_process_data.py", line 485, in <module>
    main()
  File "ssd_process_data.py", line 369, in main
    raw_datasets["train"] = load_dataset(
  File "/home/nlp/sloboda1/anaconda3/envs/ssdlm/lib/python3.8/site-packages/datasets/load.py", line 1782, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/nlp/sloboda1/anaconda3/envs/ssdlm/lib/python3.8/site-packages/datasets/builder.py", line 872, in download_and_prepare
    self._download_and_prepare(
  File "/home/nlp/sloboda1/anaconda3/envs/ssdlm/lib/python3.8/site-packages/datasets/builder.py", line 1649, in _download_and_prepare
    super()._download_and_prepare(
  File "/home/nlp/sloboda1/anaconda3/envs/ssdlm/lib/python3.8/site-packages/datasets/builder.py", line 985, in _download_and_prepare
    verify_splits(self.info.splits, split_dict)
  File "/home/nlp/sloboda1/anaconda3/envs/ssdlm/lib/python3.8/site-packages/datasets/utils/info_utils.py", line 100, in verify_splits
    raise NonMatchingSplitsSizesError(str(bad_splits))
datasets.utils.info_utils.NonMatchingSplitsSizesError: [{'expected': SplitInfo(name='train', num_bytes=39769494896, num_examples=8013769, shard_lengths=None, dataset_name=None), 'recorded': SplitInfo(name='train', num_bytes=39769065791, num_examples=8013740, shard_lengths=[101000, 100000, 101000, 101000, 102000, 102000, 101000, 102000, 101000, 101000, 101000, 101000, 101000, 102000, 101000, 101000, 101000, 101000, 102000, 102000, 100000, 101000, 100000, 101000, 102000, 101000, 102000, 101000, 102000, 102000, 102000, 101000, 101000, 101000, 101000, 102000, 101000, 102000, 101000, 101000, 100000, 101000, 101000, 101000, 101000, 101000, 101000, 101000, 101000, 101000, 101000, 100000, 101000, 102000, 101000, 101000, 101000, 101000, 101000, 102000, 102000, 101000, 102000, 101000, 102000, 102000, 101000, 101000, 102000, 102000, 102000, 101000, 102000, 102000, 102000, 101000, 101000, 102000, 101000, 13740], dataset_name='openwebtext')}]
(ssdlm) sloboda1@dgx02:~/controlled_reduction/decoding_approaches/ssd-lm$ less /home/nlp/sloboda1/.cache/huggingface/datasets/openwebtext/plain_text/1.0.0/85b3ae7051d2d72e7c5fdf6dfb462603aaa26e9ed506202bf3a24d261c6c40a1_builder.lock
I have tried forcing a re-download of the dataset by passing the download_mode="force_redownload" parameter, but it yields the same error.
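For context, the retry I attempted can be sketched as below; the helper function and its exact arguments are illustrative, not the actual code from ssd_process_data.py:

```python
# Minimal sketch of forcing a re-download of openwebtext, assuming the
# load_dataset call shown in the traceback above. The real script may
# pass additional arguments (cache_dir, num_proc, etc.).
def build_load_kwargs(force_redownload: bool = False) -> dict:
    """Assemble keyword arguments for datasets.load_dataset."""
    kwargs = {"path": "openwebtext", "split": "train"}
    if force_redownload:
        # Re-fetch the archives instead of reusing the (possibly stale) cache.
        kwargs["download_mode"] = "force_redownload"
    return kwargs


if __name__ == "__main__":
    from datasets import load_dataset  # requires the `datasets` package

    raw_train = load_dataset(**build_load_kwargs(force_redownload=True))
```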
I have also tried passing the ignore_verifications=True parameter, but this in turn yielded the following error:
    raw_datasets["train"] = load_dataset(
  File "/home/nlp/sloboda1/anaconda3/envs/ssdlm/lib/python3.8/site-packages/datasets/load.py", line 1754, in load_dataset
    verification_mode = VerificationMode(
  File "/home/nlp/sloboda1/anaconda3/envs/ssdlm/lib/python3.8/enum.py", line 339, in __call__
    return cls.__new__(cls, value)
  File "/home/nlp/sloboda1/anaconda3/envs/ssdlm/lib/python3.8/enum.py", line 663, in __new__
    raise ve_exc
ValueError: 'none' is not a valid VerificationMode
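One hedged workaround sketch: newer datasets releases replace ignore_verifications with a verification_mode argument (spelled "no_checks" to skip verification), so the kwarg could be chosen based on the installed version. The 2.9 cutoff below is an assumption inferred from the deprecation messages, not verified against every release:

```python
# Pick the "skip verification" kwarg for datasets.load_dataset by version.
# Assumption: `verification_mode="no_checks"` is accepted from datasets 2.9
# onward, while older releases expect `ignore_verifications=True`.
def verification_kwargs(datasets_version: str) -> dict:
    """Return the kwarg that disables split-size verification."""
    major, minor, *_ = (int(p) for p in datasets_version.split(".")[:2])
    if (major, minor) >= (2, 9):
        return {"verification_mode": "no_checks"}
    return {"ignore_verifications": True}
```

Usage would be something like `load_dataset("openwebtext", **verification_kwargs(datasets.__version__))`, keeping the call compatible across versions.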
I am at a loss here and would really appreciate some guidance on how to address this problem.
Thanks.