diff --git a/README.md b/README.md
index 5ad6b56..a96de53 100644
--- a/README.md
+++ b/README.md
@@ -7,11 +7,58 @@
 pip install git+https://github.com/songlab-cal/gpn.git
 ```

-## Usage
+## Application to *Arabidopsis thaliana*
 * Quick example to play with the model: `basic_example.ipynb` [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/songlab-cal/gpn/blob/main/basic_example.ipynb)
-* [Application to *Arabidopsis thaliana*, including training, inference and analysis](analysis/arabidopsis)
-* [General workflow to create a training dataset given a list of NCBI accessions](workflow/make_dataset_from_ncbi)
+* [Training, inference and analysis](analysis/arabidopsis)
+
+## Training on your own data
+1. [Snakemake workflow to create a dataset](workflow/make_dataset)
+   Can automatically download data from NCBI given a list of accessions, or use your own fasta files.
+2. Training
+- Automatically detects all available GPUs
+- Tracks metrics on [Weights & Biases](https://wandb.ai/)
+- Implemented models: `ConvNet`, `GPNRoFormer` (Transformer)
+- Config overrides can be specified, e.g. `--config_overrides n_layers=30`
+- Example:
+```bash
+WANDB_PROJECT=your_project python -m gpn.run_mlm --do_train --do_eval \
+    --fp16 --report_to wandb --prediction_loss_only True --remove_unused_columns False \
+    --dataset_name results/dataset --tokenizer_name gonzalobenegas/tokenizer-dna-mlm \
+    --soft_masked_loss_weight_train 0.1 --soft_masked_loss_weight_evaluation 0.0 \
+    --weight_decay 0.01 --optim adamw_torch \
+    --dataloader_num_workers 16 --seed 42 \
+    --save_strategy steps --save_steps 10000 --evaluation_strategy steps \
+    --eval_steps 10000 --logging_steps 10000 --max_steps 120000 --warmup_steps 1000 \
+    --learning_rate 1e-3 --lr_scheduler_type constant_with_warmup \
+    --run_name your_run --output_dir your_output_dir --model_type ConvNet \
+    --per_device_train_batch_size 512 --per_device_eval_batch_size 512 --gradient_accumulation_steps 1
+```
+3. Extract embeddings
+- Input file requires columns `chrom`, `start`, `end` (a sketch for preparing it follows step 4)
+- Example:
+```bash
+python -m gpn.get_embeddings windows.parquet genome.fa.gz 100 your_output_dir \
+    results.parquet --per-device-batch-size 4000 --is-file --dataloader-num-workers 16
+```
+4. Variant effect prediction
+- Input file requires columns `chrom`, `pos`, `ref`, `alt`
+- Example:
+```bash
+python -m gpn.run_vep variants.parquet genome.fa.gz 512 your_output_dir results.parquet \
+    --per-device-batch-size 4000 --is-file --dataloader-num-workers 16
+```
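+
+For reference, a minimal sketch of how the input files for steps 3 and 4 might be prepared with pandas. The column names are the ones required above and the file names match the example commands; the chromosomes, coordinates and alleles are just placeholders:
+```python
+# Sketch only: column and file names follow the README above; all values are illustrative.
+import pandas as pd
+
+# Step 3 input: genomic windows with the required `chrom`, `start`, `end` columns.
+windows = pd.DataFrame({
+    "chrom": ["1", "1", "2"],   # placeholder chromosome names
+    "start": [0, 100, 0],       # placeholder window starts
+    "end": [100, 200, 100],     # placeholder window ends
+})
+windows.to_parquet("windows.parquet", index=False)
+
+# Step 4 input: variants with the required `chrom`, `pos`, `ref`, `alt` columns.
+variants = pd.DataFrame({
+    "chrom": ["1", "2"],        # placeholder chromosome names
+    "pos": [1500, 2500],        # placeholder positions
+    "ref": ["A", "C"],          # placeholder reference alleles
+    "alt": ["G", "T"],          # placeholder alternate alleles
+})
+variants.to_parquet("variants.parquet", index=False)
+```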

 ## Citation
-Gonzalo Benegas, Sanjit Singh Batra and Yun S. Song "DNA language models are powerful zero-shot predictors of non-coding variant effects" bioRxiv (2022)
-DOI: [10.1101/2022.08.22.504706](https://doi.org/10.1101/2022.08.22.504706)
+```
+@article{benegas2023dna,
+  author = {Gonzalo Benegas and Sanjit Singh Batra and Yun S. Song},
+  title = {DNA language models are powerful predictors of genome-wide variant effects},
+  elocation-id = {2022.08.22.504706},
+  year = {2023},
+  doi = {10.1101/2022.08.22.504706},
+  publisher = {Cold Spring Harbor Laboratory},
+  URL = {https://www.biorxiv.org/content/early/2023/08/04/2022.08.22.504706},
+  eprint = {https://www.biorxiv.org/content/early/2023/08/04/2022.08.22.504706.full.pdf},
+  journal = {bioRxiv}
+}
+```
diff --git a/workflow/make_dataset/README.md b/workflow/make_dataset/README.md
index 8bd0e63..7a5fb60 100644
--- a/workflow/make_dataset/README.md
+++ b/workflow/make_dataset/README.md
@@ -1,21 +1,28 @@
 # Workflow to create a training dataset
 [Example dataset](https://huggingface.co/datasets/gonzalobenegas/example_dataset) (with default config, should take 5 minutes)
-1. Download data from ncbi given a list of accessions, or alternatively, use your own fasta files.
-2. Define a set of training intervals, e.g. full chromosomes, only exons, etc.
+1. Download data from NCBI given a list of accessions, or alternatively, use your own fasta files
+2. Define a set of training intervals, e.g. full chromosomes, only exons (requires annotation), etc.
 3. Shard the dataset for efficient loading with Hugging Face libraries
 4. Optional: upload to Hugging Face Hub

 ## Requirements:
 - [GPN](https://github.com/songlab-cal/gpn)
 - [Snakemake](https://snakemake.github.io/)
-- If you want to automatically download data from NCBI, install [NCBI Datasets](https://www.ncbi.nlm.nih.gov/datasets/docs/v2/download-and-install/) (e.g. `conda install -c conda-forge ncbi-datasets-cli`).
+- If you want to automatically download data from NCBI, install [NCBI Datasets](https://www.ncbi.nlm.nih.gov/datasets/docs/v2/download-and-install/) (e.g. `conda install -c conda-forge ncbi-datasets-cli`)
+
+## Choosing species/assemblies (ignore if using your own set of fasta files):
+- Manually download assembly metadata from [NCBI Genome](https://www.ncbi.nlm.nih.gov/data-hub/genome)
+- You can choose a set of taxa (e.g. mammals, plants) and apply filters such as annotation level and assembly level
+- Check out the script `gpn/filter_assemblies.py` for more details, such as how to subsample or how to keep only one assembly per genus

 ## Configuration:
-- See `config\config.yaml` and `config\assemblies.tsv`
-- Check notes in `workflow/Snakefile` for running with your own set of fasta files.
+- See `config/config.yaml` and `config/assemblies.tsv`
+- Check the notes in `workflow/Snakefile` for running with your own set of fasta files

 ## Running:
 - `snakemake --cores all`
+- The dataset will be created at `results/dataset`

 ## Uploading to Hugging Face Hub:
 For easy distribution and deployment, the dataset can be uploaded to HF Hub (optionally, as a private dataset).
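+
+As a quick sanity check once the workflow has finished, the resulting dataset can be inspected with the Hugging Face `datasets` library. This is only a sketch: it loads the example dataset linked at the top of this README and assumes it loads with `load_dataset`'s default settings; a locally created `results/dataset` directory can be pointed to in the same way if its layout is supported.
+```python
+from datasets import load_dataset
+
+# Load the example dataset from the Hub (assumption: default settings suffice).
+ds = load_dataset("gonzalobenegas/example_dataset")
+print(ds)  # available splits and number of examples
+
+# Peek at one raw example from whichever split comes first (e.g. "train").
+first_split = next(iter(ds))
+print(ds[first_split][0])
+```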