Commit

Update readme

gonzalobenegas committed Sep 2, 2023
1 parent 1a979b2 commit 473a412

Showing 2 changed files with 64 additions and 10 deletions.
57 changes: 52 additions & 5 deletions README.md
@@ -7,11 +7,58 @@
pip install git+https://github.com/songlab-cal/gpn.git
```

## Application to *Arabidopsis thaliana*
* Quick example to play with the model (see the sketch after this list): `basic_example.ipynb` [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/songlab-cal/gpn/blob/main/basic_example.ipynb)
* [Training, inference and analysis](analysis/arabidopsis)
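The notebook boils down to a few lines with the `transformers` Auto classes. A minimal sketch — the checkpoint name below is illustrative (an assumption); see `basic_example.ipynb` for the identifier actually used:
```python
# Minimal sketch of masked-nucleotide prediction with a pretrained GPN model.
import torch
import gpn.model  # registers the GPN architectures with the transformers Auto* classes
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_path = "songlab/gpn-brassicales"  # assumption: replace with the real checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForMaskedLM.from_pretrained(model_path)
model.eval()

seq = "ACGT" * 128  # toy 512-bp input sequence
inputs = tokenizer(seq, return_tensors="pt")
center = inputs["input_ids"].shape[1] // 2
inputs["input_ids"][0, center] = tokenizer.mask_token_id  # mask one position

with torch.no_grad():
    logits = model(**inputs).logits
probs = logits[0, center].softmax(dim=-1)  # probabilities over the token vocabulary
print(probs)
```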

## Training on your own data
1. [Snakemake workflow to create a dataset](workflow/make_dataset)
- Can automatically download data from NCBI given a list of accessions, or use your own FASTA files.
2. Training
- Will automatically detect all available GPUs.
- Tracks metrics on [Weights & Biases](https://wandb.ai/).
- Implemented models: `ConvNet`, `GPNRoFormer` (Transformer).
- Config overrides can be specified on the command line, e.g. `--config_overrides n_layers=30`.
- Example:
```bash
WANDB_PROJECT=your_project python -m gpn.run_mlm --do_train --do_eval \
--fp16 --report_to wandb --prediction_loss_only True --remove_unused_columns False \
--dataset_name results/dataset --tokenizer_name gonzalobenegas/tokenizer-dna-mlm \
--soft_masked_loss_weight_train 0.1 --soft_masked_loss_weight_evaluation 0.0 \
--weight_decay 0.01 --optim adamw_torch \
--dataloader_num_workers 16 --seed 42 \
--save_strategy steps --save_steps 10000 --evaluation_strategy steps \
--eval_steps 10000 --logging_steps 10000 --max_steps 120000 --warmup_steps 1000 \
--learning_rate 1e-3 --lr_scheduler_type constant_with_warmup \
--run_name your_run --output_dir your_output_dir --model_type ConvNet \
--per_device_train_batch_size 512 --per_device_eval_batch_size 512 --gradient_accumulation_steps 1
```
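After training, the checkpoint written to `your_output_dir` can be loaded back with the standard `transformers` API. A quick sanity-check sketch:
```python
# Sanity-check sketch: load the checkpoint saved by the command above.
import gpn.model  # registers ConvNet / GPNRoFormer with the Auto* classes
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("your_output_dir")
print(model.config)  # should reflect model_type and any --config_overrides
print(sum(p.numel() for p in model.parameters()), "parameters")
```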
3. Extract embeddings
- Input file requires the columns `chrom`, `start` and `end` (a preparation sketch follows the example below)
- Example:
```bash
python -m gpn.get_embeddings windows.parquet genome.fa.gz 100 your_output_dir \
results.parquet --per-device-batch-size 4000 --is-file --dataloader-num-workers 16
```
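One way to build `windows.parquet` with pandas — a sketch, with made-up chromosome names and coordinates (match these to your genome):
```python
# Sketch: create a windows.parquet with the required chrom/start/end columns.
import pandas as pd

windows = pd.DataFrame({
    "chrom": ["1", "1", "1"],   # illustrative chromosome name
    "start": [0, 512, 1024],    # assumption: 0-based, half-open intervals
    "end":   [512, 1024, 1536],
})
windows.to_parquet("windows.parquet", index=False)
```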
4. Variant effect prediction
- Input file requires the columns `chrom`, `pos`, `ref` and `alt` (a preparation sketch follows the example below)
- Example:
```bash
python -m gpn.run_vep variants.parquet genome.fa.gz 512 your_output_dir results.parquet \
--per-device-batch-size 4000 --is-file --dataloader-num-workers 16
```
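Similarly, `variants.parquet` can be created with pandas and the scores read back afterwards. A sketch with made-up positions (1-based coordinates assumed, as in VCF):
```python
# Sketch: create a variants.parquet with the required chrom/pos/ref/alt columns.
import pandas as pd

variants = pd.DataFrame({
    "chrom": ["1", "2"],
    "pos": [1000, 2000],  # assumption: 1-based positions, as in VCF
    "ref": ["A", "C"],
    "alt": ["G", "T"],
})
variants.to_parquet("variants.parquet", index=False)

# After running the command above, the scores land in results.parquet:
scores = pd.read_parquet("results.parquet")
print(scores.head())
```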

## Citation
Gonzalo Benegas, Sanjit Singh Batra and Yun S. Song, "DNA language models are powerful predictors of genome-wide variant effects," bioRxiv (2023).
DOI: [10.1101/2022.08.22.504706](https://doi.org/10.1101/2022.08.22.504706)
```
@article{benegas2023dna,
author = {Gonzalo Benegas and Sanjit Singh Batra and Yun S. Song},
title = {DNA language models are powerful predictors of genome-wide variant effects},
elocation-id = {2022.08.22.504706},
year = {2023},
doi = {10.1101/2022.08.22.504706},
publisher = {Cold Spring Harbor Laboratory},
URL = {https://www.biorxiv.org/content/early/2023/08/04/2022.08.22.504706},
eprint = {https://www.biorxiv.org/content/early/2023/08/04/2022.08.22.504706.full.pdf},
journal = {bioRxiv}
}
```
17 changes: 12 additions & 5 deletions workflow/make_dataset/README.md

@@ -1,21 +1,28 @@
# Workflow to create a training dataset
[Example dataset](https://huggingface.co/datasets/gonzalobenegas/example_dataset) (with the default config, this should take about 5 minutes)
1. Download data from NCBI given a list of accessions or, alternatively, use your own FASTA files
2. Define a set of training intervals, e.g. full chromosomes or only exons (requires annotation)
3. Shard the dataset for efficient loading with Hugging Face libraries (see the loading sketch after this list)
4. Optional: upload to the Hugging Face Hub
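The sharded output is a standard Hugging Face dataset, so it (or the example dataset linked above) loads directly with the `datasets` library. A sketch, with the split name assumed:
```python
# Sketch: load the sharded dataset with the Hugging Face datasets library.
from datasets import load_dataset

ds = load_dataset("gonzalobenegas/example_dataset")  # or a local results/dataset
print(ds)
print(ds["train"][0])  # assumption: a "train" split with one sequence per record
```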

## Requirements:
- [GPN](https://github.com/songlab-cal/gpn)
- [Snakemake](https://snakemake.github.io/)
- If you want to automatically download data from NCBI, install [NCBI Datasets](https://www.ncbi.nlm.nih.gov/datasets/docs/v2/download-and-install/) (e.g. `conda install -c conda-forge ncbi-datasets-cli`)

## Choosing species/assemblies (ignore if using your own set of FASTA files):
- Manually download assembly metadata from [NCBI Genome](https://www.ncbi.nlm.nih.gov/data-hub/genome)
- You can choose a set of taxa (e.g. mammals, plants) and apply filters such as annotation level and assembly level
- Check out the script `gpn/filter_assemblies.py` for more details, such as how to subsample or how to keep only one assembly per genus

## Configuration:
- See `config/config.yaml` and `config/assemblies.tsv`
- Check the notes in `workflow/Snakefile` for running with your own set of FASTA files

## Running:
- `snakemake --cores all`
- The dataset will be created at `results/dataset`

## Uploading to Hugging Face Hub:
For easy distribution and deployment, the dataset can be uploaded to the HF Hub (optionally, as a private dataset).
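A sketch of one way to do the upload with the `huggingface_hub` client — the repo id is a placeholder, and `huggingface-cli login` is assumed to have been run:
```python
# Sketch: push the generated shards to an (optionally private) dataset repo.
from huggingface_hub import HfApi

api = HfApi()
api.create_repo("your-username/your-dataset", repo_type="dataset",
                private=True, exist_ok=True)
api.upload_folder(
    folder_path="results/dataset",  # output of `snakemake --cores all`
    repo_id="your-username/your-dataset",
    repo_type="dataset",
)
```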