AutoSDT is a fully automatic pipeline to collect data-driven scientific coding tasks to train co-scientist models.
AutoSDT

This is the official codebase of AutoSDT: Scaling Data-Driven Discovery Tasks Toward Open Co-Scientists.

[Website] • [Paper] • [Dataset] • [Twitter]


AutoSDT collects data-driven discovery tasks in three steps: (1) AutoSDT-Search generates a list of keywords for each discipline and searches for relevant repositories; (2) AutoSDT-Select identifies programs that represent data-driven discovery tasks and extracts their execution dependency folders; (3) AutoSDT-Adapt modifies the selected programs to be independently executable and generates their corresponding task instructions.

We fine-tune Qwen2.5-Coder-32B on AutoSDT-5K to get AutoSDT-Coder-32B, which surpasses the performance of GPT-4o (2024-05-13) on ScienceAgentBench.


Overview

Despite long-standing efforts in accelerating scientific discovery with AI, building reliable AI co-scientists remains challenging due to the lack of high-quality data for training and evaluation. To address this data scarcity problem, we introduce AutoSDT—an automatic pipeline that collects high-quality coding tasks from real-world data-driven discovery workflows.

AutoSDT leverages the coding capabilities and parametric knowledge of large language models (LLMs) to search from diverse sources, identify ecologically valid scientific tasks, and synthesize both task instructions and code solutions automatically. Using this pipeline, we construct AutoSDT-5K, a dataset of 5,404 scientific coding tasks spanning four scientific disciplines (bioinformatics, computational chemistry, geographical information science, and psychology and cognitive neuroscience) and using 756 unique Python packages.

  • To the best of our knowledge, AutoSDT-5K is the largest open dataset for data-driven scientific discovery to date, and the only one collected fully automatically.
  • After fine-tuning on AutoSDT-5K, Qwen2.5-Coder-32B-Instruct reaches GPT-4o-level performance on ScienceAgentBench with a success rate of 7.8%, double that of its base model.
  • It also improves the hypothesis matching score on DiscoveryBench by a relative 17.4%, narrowing the gap between open-weight models and proprietary ones.

Installation

Clone this repository and install the required packages:

git clone https://github.com/OSU-NLP-Group/AutoSDT
cd AutoSDT
pip install -r requirements.txt

AutoSDT Pipeline

Configure Azure endpoint and API key

vim ~/.bashrc
export AZURE_OPENAI_KEY=YOUR_AZURE_API_KEY
export AZURE_ENDPOINT=YOUR_AZURE_ENDPOINT
export AZURE_API_VERSION=YOUR_AZURE_API_VERSION
source ~/.bashrc
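The pipeline scripts read these settings from the environment. As a minimal sanity check before launching a long run, you can verify that all three variables are set; this helper is a sketch (the variable names match the exports above, but the function itself is not part of the repo):

```python
import os

REQUIRED_VARS = ("AZURE_OPENAI_KEY", "AZURE_ENDPOINT", "AZURE_API_VERSION")

def load_azure_config():
    """Collect the Azure OpenAI settings exported in ~/.bashrc.

    Raises a RuntimeError naming every unset variable, so a misconfigured
    shell fails fast instead of partway through the pipeline.
    """
    config = {name: os.environ.get(name, "") for name in REQUIRED_VARS}
    missing = [name for name, value in config.items() if not value]
    if missing:
        raise RuntimeError(f"Missing Azure settings: {', '.join(missing)}")
    return config
```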

AutoSDT-Search: Search for research related repositories

cd autosdt/scripts
bash run_search.sh

Specify the discipline keywords via the base_keywords argument.

AutoSDT-Select: Crawl Python files, verify that they represent data-driven scientific tasks, and prepare their workspaces

bash run_crawl_files.sh
bash run_scientific_task_verify.sh
bash run_locate_dependencies.sh
bash run_prepare_env.sh

AutoSDT-Adapt: Adapt programs for standalone executability and generate task instructions

bash run_adapt_code.sh
bash run_generate_instruction.sh

After the above steps, you should obtain a final_combined_training_data.jsonl file containing the generated instructions and code. Then run

python convert_data_to_alpaca_format.py

to convert the data into the Alpaca training format.
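For reference, the Alpaca format expected by LLaMA-Factory stores each example as an instruction/input/output record. A sketch of what such a conversion does is below; the field names of the intermediate JSONL ("instruction" and "code") are assumptions for illustration, not necessarily the keys the repo's script uses:

```python
import json

def to_alpaca(jsonl_path, out_path):
    """Convert a JSONL file of {instruction, code} records into one JSON
    list of Alpaca-style {instruction, input, output} examples."""
    examples = []
    with open(jsonl_path) as f:
        for line in f:
            record = json.loads(line)
            examples.append({
                "instruction": record["instruction"],  # assumed key
                "input": "",                           # no extra context
                "output": record["code"],              # assumed key
            })
    with open(out_path, "w") as f:
        json.dump(examples, f, indent=2)
```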

Training and Inference

Supervised Fine-tuning

We use the LLaMA-Factory library to conduct SFT experiments. We provide the .yaml files in the models/ folder in this repo:

-- qwen2.5-coder-7b-instruct_full_sft.yaml

Please refer to LLaMA-Factory for more details.

Inference and Evaluation

For ScienceAgentBench, we directly follow the original repo for running inference and evaluation. Please refer to ScienceAgentBench/README.md for more information.

For DiscoveryBench, first start an LLM engine on localhost using vLLM, then run

python evaluate_with_llm_engine.py

to generate all the evaluation results, and run

python cal_eval_avg.py

to compute the final results.
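The averaging step amounts to aggregating a per-record score over the evaluation outputs. A sketch of that aggregation, assuming JSONL output and a hypothetical "hypothesis_matching_score" field name (not necessarily what cal_eval_avg.py reads):

```python
import json

def average_scores(jsonl_path, key="hypothesis_matching_score"):
    """Average a numeric score field over all evaluation records,
    skipping records that lack the field."""
    scores = []
    with open(jsonl_path) as f:
        for line in f:
            record = json.loads(line)
            if key in record:
                scores.append(float(record[key]))
    return sum(scores) / len(scores) if scores else 0.0
```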

Contact

Yifei Li, Hanane Nour Moussa, Huan Sun, The Ohio State University

Disclaimer

AutoSDT creates tasks based on open-source code and data, and we respect the creators' ownership and intellectual property. We have made our best effort to ensure that the repositories included in AutoSDT-5K have permissive licenses allowing for academic use. We provide more details in Appendix G in the paper. We welcome requests from the original authors to modify or remove relevant tasks related to their repositories if needed.

We ensure that all 1325 repositories composing the final tasks in AutoSDT-5K allow for academic use. We list the licenses and the number of corresponding repositories in the following table:

License          Repositories
MIT              449
GNU              247
Apache           145
BSD              84
CC               57
Boost            4
Public Domain    3
ISC              1
Eclipse          1
PolyForm         1
Mulan            1
Other (Custom)   15

We manually checked the remaining 15 repositories with custom licenses and ensured that they all allow academic and non-commercial use:

Repositories with Custom Licenses
GabrieleLozupone/AXIAL
fhalab/MLDE
snacktavish/TreeToReads
usnistgov/SDNist
ruppinlab/CSI-Microbes-identification
fenchri/edge-oriented-graph
SNU-LIST/QSMnet
Ramprasad-Group/polygnn
gdalessi/OpenMORe
svalkiers/clusTCR
AI-sandbox/SALAI-Net
pixelite1201/agora_evaluation
jsunn-y/PolymerGasMembraneML
spectrochempy/spectrochempy
usnistgov/atomgpt

There are also 317 repositories without any license information. We assume that these repositories are permissive for academic purposes.

License

The code in this repo is licensed under the MIT License.

Citation

Please cite our paper (and star our repo!) if you use our data, models, or code.

@misc{li2025autosdtscalingdatadrivendiscovery,
      title={AutoSDT: Scaling Data-Driven Discovery Tasks Toward Open Co-Scientists}, 
      author={Yifei Li and Hanane Nour Moussa and Ziru Chen and Shijie Chen and Botao Yu and Mingyi Xue and Benjamin Burns and Tzu-Yao Chiu and Vishal Dey and Zitong Lu and Chen Wei and Qianheng Zhang and Tianyu Zhang and Song Gao and Xuhui Huang and Xia Ning and Nesreen K. Ahmed and Ali Payani and Huan Sun},
      year={2025},
      eprint={2506.08140},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2506.08140}, 
}
