AutoSDT is a fully automatic pipeline to collect data-driven scientific coding tasks to train co-scientist models.
AutoSDT

This is the official codebase of AutoSDT: Scaling Data-Driven Discovery Tasks Toward Open Co-Scientists.

[Website] • [Paper] • [Dataset] • [Twitter]


AutoSDT collects data-driven discovery tasks in three steps: (1) AutoSDT-Search generates a list of keywords for each discipline and searches for relevant repositories; (2) AutoSDT-Select identifies programs that represent data-driven discovery tasks and extracts their execution dependency folders; (3) AutoSDT-Adapt modifies the selected programs to be independently executable and generates their corresponding task instructions.

We fine-tune Qwen2.5-Coder-32B on AutoSDT-5K to get AutoSDT-Coder-32B, which surpasses the performance of GPT-4o (2024-05-13) on ScienceAgentBench.


Overview

Despite long-standing efforts in accelerating scientific discovery with AI, building reliable AI co-scientists remains challenging due to the lack of high-quality data for training and evaluation. To address this data scarcity problem, we introduce AutoSDT—an automatic pipeline that collects high-quality coding tasks from real-world data-driven discovery workflows.

AutoSDT leverages the coding capabilities and parametric knowledge of large language models (LLMs) to search from diverse sources, identify ecologically valid scientific tasks, and synthesize both task instructions and code solutions automatically. Using this pipeline, we construct AutoSDT-5K, a dataset of 5,404 scientific coding tasks spanning four scientific disciplines (bioinformatics, computational chemistry, geographical information science, and psychology and cognitive neuroscience) and using 756 unique Python packages.

  • To the best of our knowledge, AutoSDT-5K is the largest open dataset for data-driven scientific discovery to date, and the only one collected fully automatically.
  • After fine-tuning on AutoSDT-5K, Qwen2.5-Coder-32B-Instruct reaches GPT-4o-level performance on ScienceAgentBench with a success rate of 7.8%, double that of its base model.
  • It also improves the hypothesis matching score on DiscoveryBench by a relative 17.4%, narrowing the gap between open-weight models and proprietary ones.

Installation

Clone this repository and install the required packages:

git clone https://github.com/OSU-NLP-Group/AutoSDT
cd AutoSDT
pip install -r requirements.txt

AutoSDT Pipeline

Configure Azure endpoint and API key

vim ~/.bashrc
export AZURE_OPENAI_KEY=YOUR_AZURE_API_KEY
export AZURE_ENDPOINT=YOUR_AZURE_ENDPOINT
export AZURE_API_VERSION=YOUR_AZURE_API_VERSION
source ~/.bashrc
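The pipeline scripts read these settings from the environment. As a minimal sanity check before launching a long run, you can verify that all three variables are set; this helper is a sketch (the variable names match the exports above, but the function itself is not part of the repo):

```python
import os

REQUIRED_VARS = ("AZURE_OPENAI_KEY", "AZURE_ENDPOINT", "AZURE_API_VERSION")

def load_azure_config():
    """Collect the Azure OpenAI settings exported in ~/.bashrc.

    Raises a RuntimeError naming every unset variable, so a misconfigured
    shell fails fast instead of partway through the pipeline.
    """
    config = {name: os.environ.get(name, "") for name in REQUIRED_VARS}
    missing = [name for name, value in config.items() if not value]
    if missing:
        raise RuntimeError(f"Missing Azure settings: {', '.join(missing)}")
    return config
```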

AutoSDT-Search: Search for research related repositories

cd autosdt/scripts
bash run_search.sh

Specify the discipline keywords via the base_keywords argument.

AutoSDT-Select: Crawl Python files, verify that they represent data-driven scientific tasks, and prepare their workspaces

bash run_crawl_files.sh
bash run_scientific_task_verify.sh
bash run_locate_dependencies.sh
bash run_prepare_env.sh

AutoSDT-Adapt: Adapt programs for standalone executability and generate task instructions

bash run_adapt_code.sh
bash run_generate_instruction.sh

After the above steps, you should obtain a final_combined_training_data.jsonl file containing the generated instructions and code. Then run

python convert_data_to_alpaca_format.py

to convert the data into the Alpaca training format.
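For reference, the Alpaca format expected by LLaMA-Factory stores each example as an instruction/input/output record. A sketch of what such a conversion does is below; the field names of the intermediate JSONL ("instruction" and "code") are assumptions for illustration, not necessarily the keys the repo's script uses:

```python
import json

def to_alpaca(jsonl_path, out_path):
    """Convert a JSONL file of {instruction, code} records into one JSON
    list of Alpaca-style {instruction, input, output} examples."""
    examples = []
    with open(jsonl_path) as f:
        for line in f:
            record = json.loads(line)
            examples.append({
                "instruction": record["instruction"],  # assumed key
                "input": "",                           # no extra context
                "output": record["code"],              # assumed key
            })
    with open(out_path, "w") as f:
        json.dump(examples, f, indent=2)
```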

Training and Inference

Supervised Fine-tuning

We use the LLaMA-Factory library to conduct SFT experiments. We provide the .yaml files in the models/ folder in this repo:

-- qwen2.5-coder-7b-instruct_full_sft.yaml

Please refer to LLaMA-Factory for more details.

Inference and Evaluation

For ScienceAgentBench, we directly follow the original repo for running inference and evaluation. Please refer to ScienceAgentBench/README.md for more information.

For DiscoveryBench, first start an LLM engine on localhost using vLLM, then run

python evaluate_with_llm_engine.py

to generate all the evaluation results, and run

python cal_eval_avg.py

to compute the final results.
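The averaging step amounts to aggregating a per-record score over the evaluation outputs. A sketch of that aggregation, assuming JSONL output and a hypothetical "hypothesis_matching_score" field name (not necessarily what cal_eval_avg.py reads):

```python
import json

def average_scores(jsonl_path, key="hypothesis_matching_score"):
    """Average a numeric score field over all evaluation records,
    skipping records that lack the field."""
    scores = []
    with open(jsonl_path) as f:
        for line in f:
            record = json.loads(line)
            if key in record:
                scores.append(float(record[key]))
    return sum(scores) / len(scores) if scores else 0.0
```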

Contact

Yifei Li, Hanane Nour Moussa, Huan Sun, The Ohio State University

Disclaimer

AutoSDT creates tasks based on open-source code and data, and we respect the creators' ownership and intellectual property. We have made our best effort to ensure that the repositories included in AutoSDT-5K have permissive licenses allowing for academic use. We provide more details in Appendix G in the paper. We welcome requests from the original authors to modify or remove relevant tasks related to their repositories if needed.

We ensure that all 1325 repositories composing the final tasks in AutoSDT-5K allow for academic use. We list the licenses and the number of corresponding repositories in the following table:

License          Repositories
MIT              449
GNU              247
Apache           145
BSD              84
CC               57
Boost            4
Public Domain    3
ISC              1
Eclipse          1
PolyForm         1
Mulan            1
Other (Custom)   15

We manually checked the remaining 15 repositories with custom licenses and ensured that they all allow academic and non-commercial use:

Repositories with Custom Licenses
GabrieleLozupone/AXIAL
fhalab/MLDE
snacktavish/TreeToReads
usnistgov/SDNist
ruppinlab/CSI-Microbes-identification
fenchri/edge-oriented-graph
SNU-LIST/QSMnet
Ramprasad-Group/polygnn
gdalessi/OpenMORe
svalkiers/clusTCR
AI-sandbox/SALAI-Net
pixelite1201/agora_evaluation
jsunn-y/PolymerGasMembraneML
spectrochempy/spectrochempy
usnistgov/atomgpt

There are also 317 repositories without any license information. We assume that these repositories are permissive for academic purposes.

License

The code in this repo is licensed under the MIT License.

Citation

Please cite our paper (and star our repo!) if you use our data, models, or code.

@misc{li2025autosdtscalingdatadrivendiscovery,
      title={AutoSDT: Scaling Data-Driven Discovery Tasks Toward Open Co-Scientists}, 
      author={Yifei Li and Hanane Nour Moussa and Ziru Chen and Shijie Chen and Botao Yu and Mingyi Xue and Benjamin Burns and Tzu-Yao Chiu and Vishal Dey and Zitong Lu and Chen Wei and Qianheng Zhang and Tianyu Zhang and Song Gao and Xuhui Huang and Xia Ning and Nesreen K. Ahmed and Ali Payani and Huan Sun},
      year={2025},
      eprint={2506.08140},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2506.08140}, 
}
