Update readme
gonzalobenegas committed Sep 2, 2023
1 parent 473a412 commit 8815280
Showing 1 changed file with 11 additions and 11 deletions.
README.md
@@ -13,13 +13,13 @@ pip install git+https://github.com/songlab-cal/gpn.git

## Training on your own data
1. [Snakemake workflow to create a dataset](workflow/make_dataset)
-Can automatically download data from NCBI given a list of accessions, or use your own fasta files.
+- Can automatically download data from NCBI given a list of accessions, or use your own fasta files.
2. Training
-- Will automatically detect all available GPUs.
-- Track metrics on [Weights & Biases](https://wandb.ai/)
-- Implemented models: `ConvNet`, `GPNRoFormer` (Transformer)
-- Specify config overrides: e.g. `--config_overrides n_layers=30`
-- Example:
+- Will automatically detect all available GPUs.
+- Track metrics on [Weights & Biases](https://wandb.ai/)
+- Implemented models: `ConvNet`, `GPNRoFormer` (Transformer)
+- Specify config overrides: e.g. `--config_overrides n_layers=30`
+- Example:
```bash
WANDB_PROJECT=your_project python -m gpn.run_mlm --do_train --do_eval \
--fp16 --report_to wandb --prediction_loss_only True --remove_unused_columns False \
@@ -34,23 +34,23 @@ WANDB_PROJECT=your_project python -m gpn.run_mlm --do_train --do_eval \
--per_device_train_batch_size 512 --per_device_eval_batch_size 512 --gradient_accumulation_steps 1
```
3. Extract embeddings
-- Input file requires `chrom`, `start`, `end`
-- Example:
+- Input file requires `chrom`, `start`, `end`
+- Example:
```bash
python -m gpn.get_embeddings windows.parquet genome.fa.gz 100 your_output_dir \
results.parquet --per-device-batch-size 4000 --is-file --dataloader-num-workers 16
```
4. Variant effect prediction
-- Input file requires `chrom`, `pos`, `ref`, `alt`
-- Example:
+- Input file requires `chrom`, `pos`, `ref`, `alt`
+- Example:
```bash
python -m gpn.run_vep variants.parquet genome.fa.gz 512 your_output_dir results.parquet \
--per-device-batch-size 4000 --is-file --dataloader-num-workers 16
```
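
As a rough sketch of preparing the input tables for steps 3 and 4, the snippet below checks for the required columns listed above and writes rows to Parquet. The helper names (`missing_columns`, `write_parquet`), the example rows, and the use of pandas with a Parquet engine such as pyarrow are assumptions for illustration and are not part of the `gpn` package. Coordinate conventions (0-based or 1-based) are not stated here, so check them against the repository before relying on this.

```python
# Required columns, as listed in the README steps above.
WINDOW_COLUMNS = ("chrom", "start", "end")        # step 3: extract embeddings
VARIANT_COLUMNS = ("chrom", "pos", "ref", "alt")  # step 4: variant effect prediction


def missing_columns(columns, required):
    """Return the required column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in required if name not in present]


def write_parquet(rows, required, path):
    """Write a list of dict rows to Parquet after checking the columns.

    Hypothetical helper: assumes pandas and a Parquet engine (e.g. pyarrow)
    are installed.
    """
    import pandas as pd  # imported lazily so the column check needs no extras

    df = pd.DataFrame(rows)
    missing = missing_columns(df.columns, required)
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    df.to_parquet(path, index=False)


# Hypothetical example rows; the values are placeholders, not real data.
windows = [{"chrom": "1", "start": 1000, "end": 1100}]
variants = [{"chrom": "1", "pos": 1050, "ref": "A", "alt": "G"}]

# Usage (requires pandas and pyarrow):
# write_parquet(windows, WINDOW_COLUMNS, "windows.parquet")
# write_parquet(variants, VARIANT_COLUMNS, "variants.parquet")
```

The resulting `windows.parquet` and `variants.parquet` files would then be passed as the first argument to the `gpn.get_embeddings` and `gpn.run_vep` commands shown above.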

## Citation
```
-@article {benegas2023dna,
+@article{benegas2023dna,
author = {Gonzalo Benegas and Sanjit Singh Batra and Yun S. Song},
title = {DNA language models are powerful predictors of genome-wide variant effects},
elocation-id = {2022.08.22.504706},