This is the official codebase of AutoSDT: Scaling Data-Driven Discovery Tasks Toward Open Co-Scientists.
[Website] • [Paper] • [Dataset] • [Twitter]
AutoSDT collects data-driven discovery tasks in three steps: (1) AutoSDT-Search generates a list of keywords for each discipline and searches for relevant repositories. (2) AutoSDT-Select identifies programs that represent data-driven discovery tasks and extracts their execution dependency folders. (3) AutoSDT-Adapt modifies the selected programs to be independently executable and generates their corresponding task instructions.
We fine-tune Qwen2.5-Coder-32B on AutoSDT-5K to get AutoSDT-Coder-32B, which surpasses the performance of GPT-4o (2024-05-13) on ScienceAgentBench.
- 📌 Overview
- ⚙️ Installation
- 🚀 AutoSDT-Pipeline
- 🛠️ Training and Inference
- 📧 Contact
- 📄 Disclaimer
- 📜 License
- 📖 Citation
Despite long-standing efforts in accelerating scientific discovery with AI, building reliable AI co-scientists remains challenging due to the lack of high-quality data for training and evaluation. To address this data scarcity problem, we introduce AutoSDT—an automatic pipeline that collects high-quality coding tasks from real-world data-driven discovery workflows.
AutoSDT leverages the coding capabilities and parametric knowledge of large language models (LLMs) to search from diverse sources, identify ecologically valid scientific tasks, and synthesize both task instructions and code solutions automatically. Using this pipeline, we construct AutoSDT-5K, a dataset of 5,404 scientific coding tasks spanning four scientific disciplines (bioinformatics, computational chemistry, geographical information science, and psychology and cognitive neuroscience) and using 756 unique Python packages.
- To the best of our knowledge, AutoSDT-5K is the largest open dataset for data-driven scientific discovery to date, and the only one collected fully automatically.
- After fine-tuning Qwen2.5-Coder-32B-Instruct on AutoSDT-5K, the model reaches GPT-4o-level performance on ScienceAgentBench with a success rate of 7.8%, doubling the performance of its base model.
- It also achieves a 17.4% relative improvement in hypothesis matching score on DiscoveryBench, narrowing the gap between open-weight models and proprietary ones.
Clone this repository and install the required packages:

```shell
git clone https://github.com/OSU-NLP-Group/AutoSDT
cd AutoSDT
pip install -r requirements.txt
```
Set your Azure OpenAI credentials as environment variables, e.g., by adding the following to `~/.bashrc`:

```shell
export AZURE_OPENAI_KEY=YOUR_AZURE_API_KEY
export AZURE_ENDPOINT=YOUR_AZURE_ENDPOINT
export AZURE_API_VERSION=YOUR_AZURE_API_VERSION
```

Then reload your shell configuration with `source ~/.bashrc`.
AutoSDT-Search: generate discipline keywords and search for relevant repositories. Specify the discipline keywords via the `base_keywords` argument:

```shell
cd autosdt/scripts
bash run_search.sh
```
AutoSDT-Select: crawl Python files, verify that they represent data-driven scientific tasks, and prepare their workspaces:

```shell
bash run_crawl_files.sh
bash run_scientific_task_verify.sh
bash run_locate_dependencies.sh
bash run_prepare_env.sh
```

AutoSDT-Adapt: modify the selected programs to be independently executable and generate their task instructions:

```shell
bash run_adapt_code.sh
bash run_generate_instruction.sh
```
After the above steps, you should obtain a `final_combined_training_data.jsonl` file containing the generated instructions and code. Then convert the data into the Alpaca training format:

```shell
python convert_data_to_alpaca_format.py
```
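The conversion can be sketched as follows. This is a minimal illustration only: the input field names `instruction` and `code` and the function name `to_alpaca` are assumptions, not the repo's actual schema.

```python
import json

def to_alpaca(in_path: str, out_path: str) -> None:
    """Convert a JSONL file of {instruction, code} records into the
    Alpaca format: a JSON list of {instruction, input, output} dicts.
    The field names 'instruction' and 'code' are assumptions about
    final_combined_training_data.jsonl, not taken from the repo."""
    records = []
    with open(in_path) as f:
        for line in f:
            row = json.loads(line)
            records.append({
                "instruction": row["instruction"],
                "input": "",            # no separate input field
                "output": row["code"],  # the gold code solution
            })
    with open(out_path, "w") as f:
        json.dump(records, f, indent=2)
```

The resulting JSON list can be registered directly as an Alpaca-style dataset in most SFT frameworks.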
We use the LLaMA-Factory library to conduct SFT experiments. We provide the `.yaml` configuration files in the `models/` folder of this repo, e.g.:

- qwen2.5-coder-7b-instruct_full_sft.yaml

Please refer to LLaMA-Factory for more details.
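As a sketch, a training run can be launched with LLaMA-Factory's standard `llamafactory-cli` entry point (this assumes LLaMA-Factory is installed and the Alpaca-format data is registered in its `dataset_info.json`; adjust the config path to your setup):

```shell
# Launch full-parameter SFT from the provided config file
llamafactory-cli train models/qwen2.5-coder-7b-instruct_full_sft.yaml
```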
For ScienceAgentBench, we directly follow the original repo for running inference and evaluation. Please refer to ScienceAgentBench/README.md for more information.
For DiscoveryBench, first start an LLM engine at localhost using vLLM, then run:

```shell
python evaluate_with_llm_engine.py
```

to generate all the evaluation results, and:

```shell
python cal_eval_avg.py
```

to compute the final results.
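For example, the local engine can be started with vLLM's OpenAI-compatible server (the model path and port below are illustrative placeholders, not values from this repo):

```shell
# Serve a model behind an OpenAI-compatible API at localhost:8000
python -m vllm.entrypoints.openai.api_server \
    --model /path/to/your/model \
    --port 8000
```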
Yifei Li, Hanane Nour Moussa, Huan Sun, The Ohio State University
AutoSDT creates tasks based on open-source code and data, and we respect the creators' ownership and intellectual property. We have made our best effort to ensure that the repositories included in AutoSDT-5K have permissive licenses allowing for academic use. We provide more details in Appendix G in the paper. We welcome requests from the original authors to modify or remove relevant tasks related to their repositories if needed.
We ensure that all 1325 repositories composing the final tasks in AutoSDT-5K allow for academic use. We list the licenses and the number of corresponding repositories in the following table:
| License | Repositories |
|---|---|
| MIT | 449 |
| GNU | 247 |
| Apache | 145 |
| BSD | 84 |
| CC | 57 |
| Boost | 4 |
| Public Domain | 3 |
| ISC | 1 |
| Eclipse | 1 |
| PolyForm | 1 |
| Mulan | 1 |
| Other (Custom) | 15 |
We manually checked the remaining 15 repositories with custom licenses and ensured that they all allow academic and non-commercial use:
| Repositories with Custom Licenses |
|---|
| GabrieleLozupone/AXIAL |
| fhalab/MLDE |
| snacktavish/TreeToReads |
| usnistgov/SDNist |
| ruppinlab/CSI-Microbes-identification |
| fenchri/edge-oriented-graph |
| SNU-LIST/QSMnet |
| Ramprasad-Group/polygnn |
| gdalessi/OpenMORe |
| svalkiers/clusTCR |
| AI-sandbox/SALAI-Net |
| pixelite1201/agora_evaluation |
| jsunn-y/PolymerGasMembraneML |
| spectrochempy/spectrochempy |
| usnistgov/atomgpt |
There are also 317 repositories without any license information. We assume that these repositories are permissive for academic purposes.
Code in this repo is licensed under the MIT License.
Please cite our paper (and star our repo!) if you use our data, models, or code.
@misc{li2025autosdtscalingdatadrivendiscovery,
title={AutoSDT: Scaling Data-Driven Discovery Tasks Toward Open Co-Scientists},
author={Yifei Li and Hanane Nour Moussa and Ziru Chen and Shijie Chen and Botao Yu and Mingyi Xue and Benjamin Burns and Tzu-Yao Chiu and Vishal Dey and Zitong Lu and Chen Wei and Qianheng Zhang and Tianyu Zhang and Song Gao and Xuhui Huang and Xia Ning and Nesreen K. Ahmed and Ali Payani and Huan Sun},
year={2025},
eprint={2506.08140},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2506.08140},
}