Skip to content

A pre-trained representation from a transformers model for inference on insect DNA barcoding data.

License

Notifications You must be signed in to change notification settings

bioscan-ml/BarcodeBERT

Repository files navigation

BarcodeBERT

A pre-trained transformer model for inference on insect DNA barcoding data.

drawing

Check out our paper

Using the model

from transformers import AutoTokenizer, AutoModel

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "bioscan-ml/BarcodeBERT", trust_remote_code=True
)

# Load the model
model = AutoModel.from_pretrained("bioscan-ml/BarcodeBERT", trust_remote_code=True)

# Sample sequence
dna_seq = "ACGCGCTGACGCATCAGCATACGA"

# Tokenize
input_seq = tokenizer(dna_seq, return_tensors="pt")["input_ids"]

# Pass through the model
output = model(input_seq)["hidden_states"][-1]

# Compute Global Average Pooling
features = output.mean(1)

Reproducing the results from the paper

  1. Clone this repository and install the required libraries by running
pip install -e .
  1. Download the data
wget https://vault.cs.uwaterloo.ca/s/x7gXQKnmRX3GAZm/download -O data.zip
unzip data.zip
mv new_data/* data/
rm -r new_data
rm data.zip
  1. Pretrain BarcodeBERT
python barcodebert/pretraining.py --dataset=CANADA-1.5M --k_mer=4 --n_layers=4 --n_heads=4 --data_dir=data/ --checkpoint=model_checkpoints/CANADA-1.5M/4_4_4/checkpoint_pretraining.pt
  1. Baseline model pipelines: The desired backbone can be selected using one of the following keywords:
    BarcodeBERT, NT, Hyena_DNA, DNABERT, DNABERT-2, DNABERT-S
python baselines/knn_probing.py --backbone=<DESIRED-BACKBONE>  --data-dir=data/
python baselines/linear_probing.py --backbone=<DESIRED-BACKBONE>  --data-dir=data/
python baselines/finetuning.py --backbone=<DESIRED-BACKBONE> --data-dir=data/ --batch_size=32

Note: HyenaDNA has to be downloaded using git-lfs. If that is not available to you, you may download the /hyenadna-tiny-1k-seqlen/ checkpoint directly from Hugging face. The keyword BarcodeBERT is also available as a baseline but this will download the publicly available model as presented in our workshop paper.

  1. BLAST
cd data/
python to_fasta.py --input_file=supervised_train.csv &&
python to_fasta.py --input_file=supervised_test.csv &&
python to_fasta.py --input_file=unseen.csv

makeblastdb -in supervised_train.fas -title train -dbtype nucl -out train.fas
blastn -query supervised_test.fas -db train.fas -out results_supervised_test.tsv -outfmt 6 -num_threads 16
blastn -query unseen.fas -db train.fas -out results_unseen.tsv -outfmt 6 -num_threads 16

Citation

If you find BarcodeBERT useful in your research please consider citing:

@article{arias2023barcodebert,
  title={{BarcodeBERT}: Transformers for Biodiversity Analysis},
  author={Pablo Millan Arias
    and Niousha Sadjadi
    and Monireh Safari
    and ZeMing Gong
    and Austin T. Wang
    and Joakim Bruslund Haurum
    and Iuliia Zarubiieva
    and Dirk Steinke
    and Lila Kari
    and Angel X. Chang
    and Scott C. Lowe
    and Graham W. Taylor
  },
  journal={arXiv preprint arXiv:2311.02401}
  year={2023},
  eprint={2311.02401},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  doi={10.48550/arxiv.2311.02401},
}

About

A pre-trained representation from a transformers model for inference on insect DNA barcoding data.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published