Official Code for Chasing Random: Instruction Selection Strategies Fail To Generalize.
A large body of work has trained competitive models cost-effectively using only a fraction of the high-quality instructions in existing datasets. In this work, we analyze popular selection strategies across different source datasets, selection budgets, and evaluation benchmarks to demonstrate that gains from data selection generalize poorly, often failing to consistently outperform even random baselines. Through an analysis of the cost expended on selection, we also conclude that data selection strategies can often exceed the cost of fine-tuning on the full dataset while yielding only marginal, and sometimes no, gains compared to tuning on the full dataset or a random subset.
![finetuning-cost](/ippolito-cmu/ChasingRandom/raw/main/assets/teaser.png)
Figure 1: Random baselines are competitive with minimal cost. The best strategy varies significantly by setup (* indicates the best strategy).
- Environment
- Data Format and Selected Subsets
- Finetuning Instructions
- Selection Strategies
- Evaluation
- Citations
requirements.txt contains the installation setup for finetuning and the data selection strategies. vllm_requirements.txt contains the vLLM installation requisites. An OpenAI API key has to be set for AlpacaEval computation.
pip install -r requirements.txt
pip install -r vllm_requirements.txt
export OPENAI_API_KEY=<YourKeyHere>
- You can find all the data (including the specific random/strictrandom run data for reproducibility and overlap assessment) in the data.zip hosted here.
- For efficiency, we do not include the full datasets for EVOL, ALPACA, and DOLLY, as we source them from Hugging Face.
- We include our uniformly subsampled FLAN (our version of the full dataset) for reproducibility in this folder. To set up the environment for FLAN, use the instructions provided here, then run seqio_creator.py (for example, with a different per-task weighting). We recommend exporting the following variables to avoid exceeding your disk quota and running into cloud authentication errors.
export CURL_CA_BUNDLE=/etc/ssl/certs/ca-bundle.crt
export TFDS_DATA_DIR=<REASONABLY LARGE PERSISTENT STORAGE>
The main finetuning script train.py expects the data to be formatted in the following JSONL format. You can find a sample data file in src/sample_data/.
{"input":<instruction (including any input context)>, "target":<target>}
Run accelerate config and set the number of available GPUs, then edit the --num_processes parameter in src/scripts/train.sh to the same value. The hyperparameters in train.sh correspond to an 8-GPU setup, so adjust them accordingly if --num_processes != 8.
accelerate config
bash train.sh
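If you are unsure what value to use, a quick check (assuming torch is installed from requirements.txt) is:

```python
import torch

# Use this count for both `accelerate config` and --num_processes in train.sh.
print(torch.cuda.device_count())
```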
- Alpacasus: Run sampler.py to create the initial pool of samples to be scored. Then run scorer.py to score the pool of samples; finally, run scorer.py with --pruning_budget to prune down to the subsampled budget (a sketch of the pruning step is shown after the commands below).
python sampler.py --root ../sample_data/temp/ --budget 10 --identifier temp
python scorer.py --budget 5
python scorer.py --prune --pruning_budget 10
Files of interest: sampler.py, scorer.py
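As a rough illustration of the pruning step, the sketch below keeps the highest-scored examples up to the budget. It assumes the scoring step has written a JSONL file with a numeric score per example; the file and field names here are placeholders, not the repository's actual ones.

```python
import json

def prune_by_score(scored_path, out_path, pruning_budget):
    # Load scored examples; the "score" field name is illustrative.
    with open(scored_path) as f:
        examples = [json.loads(line) for line in f]
    # Keep the top `pruning_budget` examples by score.
    examples.sort(key=lambda ex: ex["score"], reverse=True)
    with open(out_path, "w") as f:
        for ex in examples[:pruning_budget]:
            f.write(json.dumps(ex) + "\n")

prune_by_score("scored_pool.jsonl", "pruned_subset.jsonl", pruning_budget=10)
```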
- Longest: Run sampler.py and pass --length_sorted (a sketch of length-based selection follows the command below).
python sampler.py --root ../sample_data/temp/ --dolly --length_sorted --budget 10 --identifier temp
Files of interest: sampler.py
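For the Longest strategy, the sketch below shows one way length-sorted selection could look on the {"input", "target"} format; whether the repository's --length_sorted option measures length over the input, the target, or both is an assumption of this sketch.

```python
import json

def select_longest(data_path, out_path, budget):
    with open(data_path) as f:
        examples = [json.loads(line) for line in f]
    # Length is measured over the target here; this choice is illustrative.
    examples.sort(key=lambda ex: len(ex["target"]), reverse=True)
    with open(out_path, "w") as f:
        for ex in examples[:budget]:
            f.write(json.dumps(ex) + "\n")

select_longest("flan_full.jsonl", "longest_subset.jsonl", budget=10)
```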
- Cherry: We use the code open-sourced by the authors, listed here. You can use cherry_data_converter.py to convert our data into the format required by the Cherry_LLM training scripts (you will have to supply the path to your preprocessed FLAN dataset, as we load it locally; set this to flan_full.jsonl from data.zip if you are not using a custom FLAN dataset). Use cherry.sh to run all the steps (training the pre-experienced model, clustering, scoring the samples for IFD, and finally selecting samples); an illustrative IFD sketch follows at the end of this section.
python cherry_data_converter.py --write_root ../sample_data/temp/
cd ..
git clone https://github.com/tianyi-lab/Cherry_LLM.git
cd scripts
bash cherry.sh
Files of interest: cherry_data_converter.py, cherry.sh
Please refer to the description of each argument here to override temporary paths in the script.
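For intuition, Cherry ranks samples by Instruction-Following Difficulty (IFD): the ratio of the model's loss on the target conditioned on the instruction to its loss on the target alone (Li et al., 2024). The sketch below shows how such a score could be computed with Hugging Face transformers; it is not the authors' implementation, and the model name is a placeholder for the pre-experienced checkpoint produced by cherry.sh.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; the Cherry pipeline scores with its own pre-experienced checkpoint.
MODEL = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16, device_map="auto")
model.eval()

@torch.no_grad()
def mean_token_loss(prompt, completion):
    """Average cross-entropy over the completion tokens, conditioned on `prompt` (may be empty)."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    completion_ids = tokenizer(completion, return_tensors="pt", add_special_tokens=False).input_ids
    input_ids = torch.cat([prompt_ids, completion_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # mask the prompt so only the completion is scored
    out = model(input_ids.to(model.device), labels=labels.to(model.device))
    return out.loss.item()

def ifd_score(instruction, target):
    # IFD = loss(target | instruction) / loss(target); higher means the instruction helps less.
    return mean_token_loss(instruction, target) / mean_token_loss("", target)
```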
- DEITA: We directly use the code open-sourced by the authors, listed here. To convert data into the ShareGPT format, you can use preprocessor/deita_data_converter.py (a sketch of this conversion is shown at the end of this section). This file has two flags:
- --preprocessing: converts the original datasets into the ShareGPT format.
- --training: filters the DEITA-scored data according to your sampling budget (default is 10000).
The converter will also warn you if there are not enough DEITA-scored samples to subsample for your chosen budget. If you encounter such an underflow, we recommend trying a lower value of --threshold in deita_selection.py (our default is 0.9). deita_selection.py shows sample usage of the entire scoring pipeline (embedding generation and scoring). Please refer to the description of each argument here to override temporary paths in the script.
cd preprocessor
python deita_data_converter.py --preprocessing \
--max_budget 100 --root ../sample_data/ \
--write_root ../sample_data/temp
git clone https://github.com/hkust-nlp/deita.git
cd deita
pip install -e .
python deita_selection.py --threshold 0.9
Files of interest: deita_selection.py, deita_data_converter.py
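As a reference for the --preprocessing conversion mentioned above, the sketch below maps the repository's {"input", "target"} records onto ShareGPT-style conversations. It follows the common ShareGPT convention of a "conversations" list with "human"/"gpt" turns; the exact keys produced by deita_data_converter.py are an assumption.

```python
import json

def to_sharegpt(in_path, out_path):
    converted = []
    with open(in_path) as f:
        for i, line in enumerate(f):
            ex = json.loads(line)
            converted.append({
                "id": f"sample_{i}",  # identifier scheme is illustrative
                "conversations": [
                    {"from": "human", "value": ex["input"]},
                    {"from": "gpt", "value": ex["target"]},
                ],
            })
    with open(out_path, "w") as f:
        json.dump(converted, f, indent=2)

to_sharegpt("flan_full.jsonl", "flan_sharegpt.json")
```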
inference.py loads all the evaluation benchmarks used in the paper. By default, we use vLLM for faster inference; you can choose whichever variant suits you based on the descriptions below:
We provide a separate vllm_requirements.txt should you choose to run inference using vLLM; the current requirements.txt includes its installation, and inference.sh shows example usage.
If you prefer not to use vLLM, you can refer to the batch_process() function in utils.py to run inference using only accelerate. For Eval Harness evaluations, refer to openllm_simple_evaluate.py, which uses 4-bit quantized model inference with simple_evaluate.
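For reference, a minimal sketch of 4-bit quantized scoring through the Eval Harness Python API is shown below; the task list mirrors the CLI example further down, but the exact arguments used in openllm_simple_evaluate.py may differ.

```python
import json
import lm_eval

# Score a checkpoint with lm-eval's Python API, loading the model in 4-bit via bitsandbytes.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=<path to llama checkpoint>,load_in_4bit=True,max_length=2048",
    tasks=["arc_challenge", "arc_easy", "hellaswag", "truthfulqa_mc1",
           "truthfulqa_mc2", "winogrande", "mmlu"],
    batch_size="auto",
)

with open("eval_harness_results.json", "w") as f:
    json.dump(results["results"], f, indent=2, default=str)
```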
- AlpacaEval: You can use /evaluation/alpacaeval.py to create the random baselines for AlpacaEval computation. This will also write the exact command to run for generating the AlpacaEval leaderboards for each comparison.
- IFEval: You can use /evaluation/ifeval.py to compute IFEval scores using the inferences generated for IFEval.
- Eval Harness: You can use eval_harness_summary_compiler() in ./utils.py to parse all the results from Eval Harness inference into CSVs for analysis (an illustrative sketch follows below).
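The sketch below illustrates the kind of flattening eval_harness_summary_compiler() performs; the actual function in utils.py may organize the CSV differently, and the results directory is an assumption.

```python
import csv
import glob
import json

rows = []
# Walk the Eval Harness output directory (path is illustrative) and flatten numeric metrics.
for path in glob.glob("../results/**/*.json", recursive=True):
    with open(path) as f:
        report = json.load(f)
    for task, metrics in report.get("results", {}).items():
        for metric, value in metrics.items():
            if isinstance(value, (int, float)):
                rows.append({"file": path, "task": task, "metric": metric, "value": value})

with open("eval_harness_summary.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["file", "task", "metric", "value"])
    writer.writeheader()
    writer.writerows(rows)
```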
bash inference.sh
accelerate launch -m lm_eval --model hf \
--model_args pretrained=<path to llama checkpoint>,max_length=2048,dtype=auto \
--tasks arc_challenge,hellaswag,arc_easy,truthfulqa_mc1,truthfulqa_mc2,winogrande,mmlu \
--batch_size auto \
--output_path ../results
- If you found our code and/or paper useful, please consider citing our work:
@misc{diddee2024chasingrandominstructionselection,
title={Chasing Random: Instruction Selection Strategies Fail to Generalize},
author={Harshita Diddee and Daphne Ippolito},
year={2024},
eprint={2410.15225},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2410.15225},
}
- Code for cherry data selection was sourced from the official open source repository.
@inproceedings{li-etal-2024-quantity,
title = "From Quantity to Quality: Boosting {LLM} Performance with Self-Guided Data Selection for Instruction Tuning",
author = "Li, Ming and
Zhang, Yong and
Li, Zhitao and
Chen, Jiuhai and
Chen, Lichang and
Cheng, Ning and
Wang, Jianzong and
Zhou, Tianyi and
Xiao, Jing",
editor = "Duh, Kevin and
Gomez, Helena and
Bethard, Steven",
booktitle = "Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)",
month = jun,
year = "2024",
address = "Mexico City, Mexico",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.naacl-long.421",
pages = "7595--7628",
}
- Code for deita data selection was sourced from the official open source repository.
@inproceedings{liu2024what,
title={What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning},
author={Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024},
url={https://openreview.net/forum?id=BTKAeLqLMw}
}