Commit

Update readme

gonzalobenegas committed Sep 2, 2023
1 parent 1a979b2 commit 473a412

Showing 2 changed files with 64 additions and 10 deletions.
57 changes: 52 additions & 5 deletions README.md
@@ -7,11 +7,58 @@
pip install git+https://github.com/songlab-cal/gpn.git
```

## Application to *Arabidopsis thaliana*
* Quick example to play with the model (see the sketch after this list): `basic_example.ipynb` [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/songlab-cal/gpn/blob/main/basic_example.ipynb)
* [Training, inference and analysis](analysis/arabidopsis)
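The notebook boils down to a few lines with the `transformers` Auto classes. A minimal sketch — the checkpoint name below is illustrative (an assumption); see `basic_example.ipynb` for the identifier actually used:
```python
# Minimal sketch of masked-nucleotide prediction with a pretrained GPN model.
import torch
import gpn.model  # registers the GPN architectures with the transformers Auto* classes
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_path = "songlab/gpn-brassicales"  # assumption: replace with the real checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForMaskedLM.from_pretrained(model_path)
model.eval()

seq = "ACGT" * 128  # toy 512-bp input sequence
inputs = tokenizer(seq, return_tensors="pt")
center = inputs["input_ids"].shape[1] // 2
inputs["input_ids"][0, center] = tokenizer.mask_token_id  # mask one position

with torch.no_grad():
    logits = model(**inputs).logits
probs = logits[0, center].softmax(dim=-1)  # probabilities over the token vocabulary
print(probs)
```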

## Training on your own data
1. [Snakemake workflow to create a dataset](workflow/make_dataset)
- Can automatically download data from NCBI given a list of accessions, or use your own FASTA files.
2. Training
- Will automatically detect all available GPUs.
- Tracks metrics on [Weights & Biases](https://wandb.ai/).
- Implemented models: `ConvNet`, `GPNRoFormer` (Transformer).
- Config overrides can be specified on the command line, e.g. `--config_overrides n_layers=30`.
- Example:
```bash
WANDB_PROJECT=your_project python -m gpn.run_mlm --do_train --do_eval \
--fp16 --report_to wandb --prediction_loss_only True --remove_unused_columns False \
--dataset_name results/dataset --tokenizer_name gonzalobenegas/tokenizer-dna-mlm \
--soft_masked_loss_weight_train 0.1 --soft_masked_loss_weight_evaluation 0.0 \
--weight_decay 0.01 --optim adamw_torch \
--dataloader_num_workers 16 --seed 42 \
--save_strategy steps --save_steps 10000 --evaluation_strategy steps \
--eval_steps 10000 --logging_steps 10000 --max_steps 120000 --warmup_steps 1000 \
--learning_rate 1e-3 --lr_scheduler_type constant_with_warmup \
--run_name your_run --output_dir your_output_dir --model_type ConvNet \
--per_device_train_batch_size 512 --per_device_eval_batch_size 512 --gradient_accumulation_steps 1
```
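After training, the checkpoint written to `your_output_dir` can be loaded back with the standard `transformers` API. A quick sanity-check sketch:
```python
# Sanity-check sketch: load the checkpoint saved by the command above.
import gpn.model  # registers ConvNet / GPNRoFormer with the Auto* classes
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("your_output_dir")
print(model.config)  # should reflect model_type and any --config_overrides
print(sum(p.numel() for p in model.parameters()), "parameters")
```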
3. Extract embeddings
- Input file requires the columns `chrom`, `start` and `end` (a preparation sketch follows the example below)
- Example:
```bash
python -m gpn.get_embeddings windows.parquet genome.fa.gz 100 your_output_dir \
results.parquet --per-device-batch-size 4000 --is-file --dataloader-num-workers 16
```
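One way to build `windows.parquet` with pandas — a sketch, with made-up chromosome names and coordinates (match these to your genome):
```python
# Sketch: create a windows.parquet with the required chrom/start/end columns.
import pandas as pd

windows = pd.DataFrame({
    "chrom": ["1", "1", "1"],   # illustrative chromosome name
    "start": [0, 512, 1024],    # assumption: 0-based, half-open intervals
    "end":   [512, 1024, 1536],
})
windows.to_parquet("windows.parquet", index=False)
```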
4. Variant effect prediction
- Input file requires the columns `chrom`, `pos`, `ref` and `alt` (a preparation sketch follows the example below)
- Example:
```bash
python -m gpn.run_vep variants.parquet genome.fa.gz 512 your_output_dir results.parquet \
--per-device-batch-size 4000 --is-file --dataloader-num-workers 16
```
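Similarly, `variants.parquet` can be created with pandas and the scores read back afterwards. A sketch with made-up positions (1-based coordinates assumed, as in VCF):
```python
# Sketch: create a variants.parquet with the required chrom/pos/ref/alt columns.
import pandas as pd

variants = pd.DataFrame({
    "chrom": ["1", "2"],
    "pos": [1000, 2000],  # assumption: 1-based positions, as in VCF
    "ref": ["A", "C"],
    "alt": ["G", "T"],
})
variants.to_parquet("variants.parquet", index=False)

# After running the command above, the scores land in results.parquet:
scores = pd.read_parquet("results.parquet")
print(scores.head())
```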

## Citation
Gonzalo Benegas, Sanjit Singh Batra and Yun S. Song, "DNA language models are powerful predictors of genome-wide variant effects," bioRxiv (2023).
DOI: [10.1101/2022.08.22.504706](https://doi.org/10.1101/2022.08.22.504706)
```
@article{benegas2023dna,
author = {Gonzalo Benegas and Sanjit Singh Batra and Yun S. Song},
title = {DNA language models are powerful predictors of genome-wide variant effects},
elocation-id = {2022.08.22.504706},
year = {2023},
doi = {10.1101/2022.08.22.504706},
publisher = {Cold Spring Harbor Laboratory},
URL = {https://www.biorxiv.org/content/early/2023/08/04/2022.08.22.504706},
eprint = {https://www.biorxiv.org/content/early/2023/08/04/2022.08.22.504706.full.pdf},
journal = {bioRxiv}
}
```
17 changes: 12 additions & 5 deletions workflow/make_dataset/README.md

@@ -1,21 +1,28 @@
# Workflow to create a training dataset
[Example dataset](https://huggingface.co/datasets/gonzalobenegas/example_dataset) (with the default config, this should take about 5 minutes)
1. Download data from NCBI given a list of accessions or, alternatively, use your own FASTA files
2. Define a set of training intervals, e.g. full chromosomes or only exons (requires annotation)
3. Shard the dataset for efficient loading with Hugging Face libraries (see the loading sketch after this list)
4. Optional: upload to the Hugging Face Hub
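The sharded output is a standard Hugging Face dataset, so it (or the example dataset linked above) loads directly with the `datasets` library. A sketch, with the split name assumed:
```python
# Sketch: load the sharded dataset with the Hugging Face datasets library.
from datasets import load_dataset

ds = load_dataset("gonzalobenegas/example_dataset")  # or a local results/dataset
print(ds)
print(ds["train"][0])  # assumption: a "train" split with one sequence per record
```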

## Requirements:
- [GPN](https://github.com/songlab-cal/gpn)
- [Snakemake](https://snakemake.github.io/)
- If you want to automatically download data from NCBI, install [NCBI Datasets](https://www.ncbi.nlm.nih.gov/datasets/docs/v2/download-and-install/) (e.g. `conda install -c conda-forge ncbi-datasets-cli`)

## Choosing species/assemblies (ignore if using your own set of FASTA files):
- Manually download assembly metadata from [NCBI Genome](https://www.ncbi.nlm.nih.gov/data-hub/genome)
- You can choose a set of taxa (e.g. mammals, plants) and apply filters such as annotation level and assembly level
- Check out the script `gpn/filter_assemblies.py` for more details, such as how to subsample or how to keep only one assembly per genus

## Configuration:
- See `config/config.yaml` and `config/assemblies.tsv`
- Check the notes in `workflow/Snakefile` for running with your own set of FASTA files

## Running:
- `snakemake --cores all`
- The dataset will be created at `results/dataset`

## Uploading to Hugging Face Hub:
For easy distribution and deployment, the dataset can be uploaded to the HF Hub (optionally, as a private dataset).
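A sketch of one way to do the upload with the `huggingface_hub` client — the repo id is a placeholder, and `huggingface-cli login` is assumed to have been run:
```python
# Sketch: push the generated shards to an (optionally private) dataset repo.
from huggingface_hub import HfApi

api = HfApi()
api.create_repo("your-username/your-dataset", repo_type="dataset",
                private=True, exist_ok=True)
api.upload_folder(
    folder_path="results/dataset",  # output of `snakemake --cores all`
    repo_id="your-username/your-dataset",
    repo_type="dataset",
)
```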