Auto-download MMLU data, add small datasets (#4)
* Remove unnecessary files
* Added lost datasets
* Remove GSM8K
* Automated download script
* Autodownload in eval script

---------
Co-authored-by: Dmitrii Khizbullin <[email protected]>
mczhuge authored Feb 27, 2024
1 parent 796b891 commit 931cb1e
Showing 24 changed files with 7,487 additions and 642 deletions.
3 changes: 0 additions & 3 deletions .gitmodules

This file was deleted.

5 changes: 5 additions & 0 deletions datasets/MMLU/.gitignore
@@ -0,0 +1,5 @@
data.tar
data/
data/README.txt
data/auxiliary_train/
data/possibly_contaminated_urls.txt
27 changes: 27 additions & 0 deletions datasets/MMLU/download.py
@@ -0,0 +1,27 @@
import os
import requests
import tarfile


def download():

    this_file_path = os.path.split(__file__)[0]
    tar_path = os.path.join(this_file_path, "data.tar")
    if not os.path.exists(tar_path):
        url = "https://people.eecs.berkeley.edu/~hendrycks/data.tar"
        print(f"Downloading {url}")
        r = requests.get(url, allow_redirects=True)
        with open(tar_path, 'wb') as f:
            f.write(r.content)
        print(f"Saved to {tar_path}")

    data_path = os.path.join(this_file_path, "data")
    if not os.path.exists(data_path):
        tar = tarfile.open(tar_path)
        tar.extractall(this_file_path)
        tar.close()
        print(f"Saved to {data_path}")


if __name__ == "__main__":
    download()
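
Note on download.py: the script above buffers the whole archive in memory (`r.content`), does not check the HTTP status, and writes straight to `data.tar`, so an interrupted or failed download could leave a file that later runs treat as complete. A minimal sketch of a more defensive variant follows. It is not part of this commit. The helper names `fetch` and `extract_once` and the `.part` suffix are hypothetical, and the standard-library `urllib.request` is swapped in for `requests` so the sketch has no third-party dependency.

```python
import os
import shutil
import tarfile
import urllib.request


def fetch(url: str, dest_path: str) -> bool:
    """Stream url to dest_path unless dest_path already exists.

    Returns True if a download happened, False if it was skipped.
    urlopen raises an exception on HTTP error statuses, so a failed
    request never produces dest_path.
    """
    if os.path.exists(dest_path):
        return False
    tmp_path = dest_path + ".part"  # hypothetical temp-file suffix
    with urllib.request.urlopen(url) as response, open(tmp_path, "wb") as f:
        # Copy in chunks instead of holding the whole archive in memory.
        shutil.copyfileobj(response, f)
    # Rename only after the body was fully written.
    os.replace(tmp_path, dest_path)
    return True


def extract_once(tar_path: str, dest_dir: str, marker: str = "data") -> bool:
    """Extract tar_path into dest_dir unless dest_dir/marker already exists.

    Returns True if extraction happened, False if it was skipped.
    """
    if os.path.exists(os.path.join(dest_dir, marker)):
        return False
    # The context manager closes the archive even if extraction fails.
    with tarfile.open(tar_path) as tar:
        tar.extractall(dest_dir)
    return True
```

Under these assumptions, `download()` would reduce to calling `fetch(url, tar_path)` followed by `extract_once(tar_path, this_file_path)`, keeping the same skip-if-present behaviour as the committed script.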
11 changes: 11 additions & 0 deletions datasets/README.md
@@ -0,0 +1,11 @@
**We include the following datasets linked to our best understanding of their original source.**

[GSM8K](https://github.com/openai/grade-school-math/tree/3101c7d5072418e28b9008a6636bde82a006892c)

[MMLU](https://github.com/hendrycks/test)

[Mini Crosswords](https://www.goobix.com/crosswords/0505/)

[GAIA](https://huggingface.co/datasets/gaia-benchmark/GAIA)

[HumanEval](https://github.com/openai/human-eval)