
getting a datasets.utils.info_utils.NonMatchingSplitsSizesError when downloading the dataset from huggingface #42

@lovodkin93

Description

Hello,
I am trying to download the openwebtext dataset from Hugging Face, but I keep getting the following error:

Downloading data: 100%|________________________________________________________________________________________________________________| 12.9G/12.9G [25:43<00:00, 8.35MB/s]
/home/nlp/sloboda1/anaconda3/envs/ssdlm/lib/python3.8/site-packages/datasets/download/download_manager.py:527: FutureWarning: 'num_proc' was deprecated in version 2.6.2 and will be removed in 3.0.0. Pass `DownloadConfig(num_proc=<num_proc>)` to the initializer instead.
  warnings.warn(
Extracting data files: 100%|________________________________________________________________________________________________________| 20610/20610 [9:43:42<00:00,  1.70s/it]
Traceback (most recent call last):
  File "ssd_process_data.py", line 485, in <module>
    main()
  File "ssd_process_data.py", line 369, in main
    raw_datasets["train"] = load_dataset(
  File "/home/nlp/sloboda1/anaconda3/envs/ssdlm/lib/python3.8/site-packages/datasets/load.py", line 1782, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/nlp/sloboda1/anaconda3/envs/ssdlm/lib/python3.8/site-packages/datasets/builder.py", line 872, in download_and_prepare
    self._download_and_prepare(
  File "/home/nlp/sloboda1/anaconda3/envs/ssdlm/lib/python3.8/site-packages/datasets/builder.py", line 1649, in _download_and_prepare
    super()._download_and_prepare(
  File "/home/nlp/sloboda1/anaconda3/envs/ssdlm/lib/python3.8/site-packages/datasets/builder.py", line 985, in _download_and_prepare
    verify_splits(self.info.splits, split_dict)
  File "/home/nlp/sloboda1/anaconda3/envs/ssdlm/lib/python3.8/site-packages/datasets/utils/info_utils.py", line 100, in verify_splits
    raise NonMatchingSplitsSizesError(str(bad_splits))
datasets.utils.info_utils.NonMatchingSplitsSizesError: [{'expected': SplitInfo(name='train', num_bytes=39769494896, num_examples=8013769, shard_lengths=None, dataset_name=None), 'recorded': SplitInfo(name='train', num_bytes=39769065791, num_examples=8013740, shard_lengths=[101000, 100000, 101000, 101000, 102000, 102000, 101000, 102000, 101000, 101000, 101000, 101000, 101000, 102000, 101000, 101000, 101000, 101000, 102000, 102000, 100000, 101000, 100000, 101000, 102000, 101000, 102000, 101000, 102000, 102000, 102000, 101000, 101000, 101000, 101000, 102000, 101000, 102000, 101000, 101000, 100000, 101000, 101000, 101000, 101000, 101000, 101000, 101000, 101000, 101000, 101000, 100000, 101000, 102000, 101000, 101000, 101000, 101000, 101000, 102000, 102000, 101000, 102000, 101000, 102000, 102000, 101000, 101000, 102000, 102000, 102000, 101000, 102000, 102000, 102000, 101000, 101000, 102000, 101000, 13740], dataset_name='openwebtext')}]
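For reference, the check that raises this error can be sketched roughly as follows. This is a simplified mock of what `datasets.utils.info_utils.verify_splits` appears to do, not the actual library source; the numbers are the ones from the error message above:

```python
# Simplified mock of the split verification that fails (illustrative only,
# not copied from the datasets source).
def verify_splits(expected, recorded):
    bad = {name: (expected[name], recorded.get(name))
           for name in expected if expected[name] != recorded.get(name)}
    if bad:
        raise ValueError(f"non-matching split sizes: {bad}")

# (num_examples, num_bytes) taken from the error message above
expected = {"train": (8013769, 39769494896)}
recorded = {"train": (8013740, 39769065791)}  # 29 examples short of expected

try:
    verify_splits(expected, recorded)
except ValueError as err:
    print(err)
```

So the recorded train split is 29 examples (and about 429 KB) smaller than the sizes baked into the dataset's metadata, which is what trips the verification.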

I have tried forcing a re-download of the dataset by passing the `download_mode="force_redownload"` parameter, but it yielded the same error.

I have also tried passing the `ignore_verifications=True` parameter, but that in turn yielded the following error:

    raw_datasets["train"] = load_dataset(
  File "/home/nlp/sloboda1/anaconda3/envs/ssdlm/lib/python3.8/site-packages/datasets/load.py", line 1754, in load_dataset
    verification_mode = VerificationMode(
  File "/home/nlp/sloboda1/anaconda3/envs/ssdlm/lib/python3.8/enum.py", line 339, in __call__
    return cls.__new__(cls, value)
  File "/home/nlp/sloboda1/anaconda3/envs/ssdlm/lib/python3.8/enum.py", line 663, in __new__
    raise ve_exc
ValueError: 'none' is not a valid VerificationMode
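This second failure looks like the deprecated `ignore_verifications` flag being translated into a string that the newer `VerificationMode` enum does not accept. A minimal re-creation of the lookup failure (the enum members below are my assumption, not copied from the `datasets` source; the failure mechanics are standard `enum` behavior):

```python
from enum import Enum

# Hypothetical stand-in for datasets' VerificationMode enum; member values
# are assumptions, used only to reproduce the ValueError above.
class VerificationMode(Enum):
    ALL_CHECKS = "all_checks"
    BASIC_CHECKS = "basic_checks"
    NO_CHECKS = "no_checks"

try:
    VerificationMode("none")  # what the deprecated flag seems to map to
except ValueError as err:
    print(err)  # 'none' is not a valid VerificationMode
```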

I am at a loss here and would really appreciate some guidance on how to address this problem.
Thanks.
