Commit

Merge pull request #220 from Australian-Structural-Biology-Computing/…
…add-rosettafold-all-atom

Add RoseTTAFold-All-Atom
JoseEspinosa authored Feb 19, 2025
2 parents 1af71b4 + ee87982 commit dd7a880
Showing 20 changed files with 626 additions and 47 deletions.
6 changes: 3 additions & 3 deletions .github/CONTRIBUTING.md
Original file line number Diff line number Diff line change
@@ -29,7 +29,7 @@ If you're not used to this workflow with git, you can start with some [docs from
You have the option to test your changes locally by running the pipeline. For receiving warnings about process selectors and other `debug` information, it is recommended to use the debug profile. Execute all the tests with the following command:

```bash
- nextflow run . --profile debug,test,docker --outdir <OUTDIR>
+ nextflow run . -profile debug,test,docker --outdir <OUTDIR>
```

When you create a pull request with changes, [GitHub Actions](https://github.com/features/actions) will run automatic tests.
@@ -78,8 +78,8 @@ If you wish to contribute a new step, please use the following coding standards:
5. Add any new parameters to `nextflow_schema.json` with help text (via the `nf-core pipelines schema build` tool).
6. Add sanity checks and validation for all relevant parameters.
7. Perform local tests to validate that the new code works as expected.
- 8. If applicable, add a new test command in `.github/workflow/ci.yml`.
- 9. Update MultiQC config `assets/multiqc_config.yml` so relevant suffixes, file name clean up and module plots are in the appropriate order. If applicable, add a [MultiQC](https://https://multiqc.info/) module.
+ 8. If applicable, add a new test command in `.github/workflows/ci.yml`.
+ 9. Update MultiQC config `assets/multiqc_config.yml` so relevant suffixes, file name clean up and module plots are in the appropriate order. If applicable, add a [MultiQC](https://multiqc.info/) module.
10. Add a description of the output files and if relevant any appropriate images from the MultiQC report to `docs/output.md`.

### Default values
1 change: 1 addition & 0 deletions .github/workflows/ci.yml
@@ -44,6 +44,7 @@ jobs:
- "test_colabfold_download"
- "test_esmfold"
- "test_split_fasta"
- "test_rosettafold_all_atom"
isMaster:
- ${{ github.base_ref == 'master' }}
# Exclude conda and singularity on dev
6 changes: 4 additions & 2 deletions CHANGELOG.md
@@ -13,8 +13,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- [[#180](https://github.com/nf-core/proteinfold/issues/180)] - Implement Fooldseek.
- [[#188](https://github.com/nf-core/proteinfold/issues/188)] - Fix colabfold image to run in gpus.
- [[PR ##205](https://github.com/nf-core/proteinfold/pull/205)] - Change input schema from `sequence,fasta` to `id,fasta`.
- - [[PR #210](https://github.com/nf-core/proteinfold/pull/210)] - Moving post-processing logic to a subworkflow, change wave images pointing to oras to point to https and refactor module to match nf-core folder structure.
- - [[#214](https://github.com/nf-core/proteinfold/issues/214)] - Fix colabfold image to run in cpus after [#188](https://github.com/nf-core/proteinfold/issues/188) fix.
+ - [[PR #210](https://github.com/nf-core/proteinfold/pull/210)]- Moving post-processing logic to a subworkflow, change wave images pointing to oras to point to https and refactor module to match nf-core folder structure.
+ - [[#214](https://github.com/nf-core/proteinfold/issues/214)]- Fix colabfold image to run in cpus after [#188](https://github.com/nf-core/proteinfold/issues/188) fix.
+ - [[PR ##220](https://github.com/nf-core/proteinfold/pull/220)] - Add RoseTTAFold-All-Atom module.
- [[#235](https://github.com/nf-core/proteinfold/issues/235)] - Update samplesheet to new version (switch from `sequence` column to `id`).
- [[#240](https://github.com/nf-core/proteinfold/issues/240)] - Separate download and input of pdb `mmcif` files and `obsolete` database.

@@ -117,6 +118,7 @@ Thank you to everyone else that has contributed by reporting bugs, enhancements
| | `--esm2_t36_3B_UR50D_contact_regression` |
| | `--esmfold_params_path` |
| | `--skip_multiqc` |
| | `--rosettafold_all_atom_db` |

> **NB:** Parameter has been **updated** if both old and new parameter information is present.
> **NB:** Parameter has been **added** if just the new parameter information is present.
16 changes: 15 additions & 1 deletion README.md
@@ -39,6 +39,8 @@ On release, automated continuous integration tests run the pipeline on a full-si

v. [ESMFold](https://github.com/facebookresearch/esm) - Regular ESM

vi. [RoseTTAFold-All-Atom](https://github.com/baker-laboratory/RoseTTAFold-All-Atom/) - Regular RFAA

## Usage

> [!NOTE]
@@ -53,7 +55,7 @@ nextflow run nf-core/proteinfold \
--outdir <OUTDIR>
```

- The pipeline takes care of downloading the databases and parameters required by AlphaFold2, Colabfold or ESMFold. In case you have already downloaded the required files, you can skip this step by providing the path to the databases using the corresponding parameter [`--alphafold2_db`], [`--colabfold_db`] or [`--esmfold_db`]. Please refer to the [usage documentation](https://nf-co.re/proteinfold/usage) to check the directory structure you need to provide for each of the databases.
+ The pipeline takes care of downloading the databases and parameters required by AlphaFold2, Colabfold, ESMFold or RoseTTAFold-All-Atom. In case you have already downloaded the required files, you can skip this step by providing the path to the databases using the corresponding parameter [`--alphafold2_db`], [`--colabfold_db`], [`--esmfold_db`] or ['--rosettafold_all_atom_db']. Please refer to the [usage documentation](https://nf-co.re/proteinfold/usage) to check the directory structure you must provide for each database.

- The typical command to run AlphaFold2 mode is shown below:

@@ -136,6 +138,18 @@ The pipeline takes care of downloading the databases and parameters required by
-profile <docker/singularity/podman/shifter/charliecloud/conda/institute>
```

- The rosettafold_all_atom mode can be run using the command below:

```console
nextflow run nf-core/proteinfold \
--input samplesheet.csv \
--outdir <OUTDIR> \
--mode rosettafold_all_atom \
--rosettafold_all_atom_db <null (default) | PATH> \
--use_gpu <true/false> \
-profile <docker/singularity/podman/shifter/charliecloud/conda/institute>
```

> [!WARNING]
> Please provide pipeline parameters via the CLI or Nextflow `-params-file` option. Custom config files including those provided by the `-c` Nextflow option can be used to provide any configuration _**except for parameters**_; see [docs](https://nf-co.re/docs/usage/getting_started/configuration#custom-configuration-files).
13 changes: 10 additions & 3 deletions assets/schema_input.json
@@ -7,6 +7,12 @@
"items": {
"type": "object",
"properties": {
"sequence": {
"type": "string",
"pattern": "^\\S+$",
"errorMessage": "Sequence name must be provided and cannot contain spaces",
"meta": ["sequence"]
},
"id": {
"type": "string",
"pattern": "^\\S+$",
@@ -17,10 +23,11 @@
"type": "string",
"format": "file-path",
"exists": true,
- "pattern": "^\\S+\\.fa(sta)?$",
- "errorMessage": "Fasta file must be provided, cannot contain spaces and must have extension '.fa' or '.fasta'"
+ "pattern": "^\\S+\\.(fa(sta)?|yaml|yml|json)$",
+ "errorMessage": "Fasta, yaml or json file must be provided, cannot contain spaces and must have extension '.fa', '.fasta', '.yaml', '.yml', or '.json'"
}
},
- "required": ["id", "fasta"]
+ "required": ["fasta"],
+ "anyOf": [{ "required": ["sequence"] }, { "required": ["id"] }]
}
}
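
The relaxed rules in this hunk can be sketched in Python as a rough illustration. This assumes a samplesheet row is a plain dict; `row_is_valid` is a hypothetical helper, not part of the pipeline, and the pipeline itself validates through the JSON schema rather than code like this. The two patterns are copied from the schema as changed above.

```python
import re

# Patterns copied from assets/schema_input.json as changed in this commit.
NAME_PATTERN = re.compile(r"^\S+$")
FILE_PATTERN = re.compile(r"^\S+\.(fa(sta)?|yaml|yml|json)$")


def row_is_valid(row: dict) -> bool:
    """Hypothetical helper mirroring the schema's 'required' and 'anyOf' rules."""
    # "required": ["fasta"], plus the extension pattern on the file name.
    if not FILE_PATTERN.fullmatch(row.get("fasta", "")):
        return False
    # "anyOf": at least one of "sequence" or "id" must be present,
    # and every name that is present must be free of whitespace.
    names = [row[key] for key in ("sequence", "id") if key in row]
    return bool(names) and all(NAME_PATTERN.fullmatch(name) for name in names)
```

Under these assumptions, a row such as `{"id": "T1024", "fasta": "T1024.yaml"}` passes, while a row with no `id` or `sequence` column, or with a `.pdb` file, is rejected.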
1 change: 1 addition & 0 deletions bin/generate_report.py
@@ -307,6 +307,7 @@ def pdb_to_lddt(pdb_files, generate_tsv):
"esmfold": "ESMFold",
"alphafold2": "AlphaFold2",
"colabfold": "ColabFold",
"rosettafold_all_atom": "Rosettafold_All_Atom",
}

parser = argparse.ArgumentParser()
12 changes: 12 additions & 0 deletions conf/dbs.config
@@ -49,6 +49,18 @@ params {
"alphafold2_ptm" : "alphafold_params_2021-07-14"
]

// RoseTTAFold_All_Atom links
uniref30_rosettafold_all_atom_link = 'http://wwwuser.gwdg.de/~compbiol/uniclust/2020_06/UniRef30_2020_06_hhsuite.tar.gz'
pdb100_rosettafold_all_atom_link = 'https://files.ipd.uw.edu/pub/RoseTTAFold/pdb100_2021Mar03.tar.gz'
bfd_rosettafold_all_atom_link = 'https://bfd.mmseqs.com/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt.tar.gz'
rfaa_paper_weights_link = 'http://files.ipd.uw.edu/pub/RF-All-Atom/weights/RFAA_paper_weights.pt'

// RoseTTAFold_All_Atom paths
uniref30_rosettafold_all_atom_path = "${params.rosettafold_all_atom_db}/uniref30/UniRef30_2020_06/*"
pdb100_rosettafold_all_atom_path = "${params.rosettafold_all_atom_db}/pdb100_2021Mar03/*"
bfd_rosettafold_all_atom_path = "${params.rosettafold_all_atom_db}/bfd/*"
rfaa_paper_weights_path = "${params.rosettafold_all_atom_db}/RFAA_paper_weights.pt"

// Esmfold links
esmfold_3B_v1 = 'https://dl.fbaipublicfiles.com/fair-esm/models/esmfold_3B_v1.pt'
esm2_t36_3B_UR50D = 'https://dl.fbaipublicfiles.com/fair-esm/models/esm2_t36_3B_UR50D.pt'
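
The "RoseTTAFold_All_Atom paths" entries in this hunk imply a directory layout under `--rosettafold_all_atom_db`. The Python sketch below shows that layout for checking a pre-downloaded database by hand. `rfaa_db_paths` and its dictionary keys are hypothetical names chosen for illustration; only the sub-paths come from the config, with the trailing `/*` globs dropped.

```python
from pathlib import PurePosixPath


def rfaa_db_paths(db_root: str) -> dict:
    """Hypothetical helper: locations conf/dbs.config expects under the database root."""
    root = PurePosixPath(db_root)
    return {
        # Directories (the config appends "/*" to glob their contents).
        "uniref30": str(root / "uniref30" / "UniRef30_2020_06"),
        "pdb100": str(root / "pdb100_2021Mar03"),
        "bfd": str(root / "bfd"),
        # Single weights file.
        "weights": str(root / "RFAA_paper_weights.pt"),
    }
```

For example, `rfaa_db_paths("/data/rfaa")["weights"]` gives `/data/rfaa/RFAA_paper_weights.pt`.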
39 changes: 39 additions & 0 deletions conf/modules_rosettafold_all_atom.config
@@ -0,0 +1,39 @@
/*
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Config file for defining DSL2 per module options and publishing paths
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Available keys to override module options:
ext.args = Additional arguments appended to command in module.
ext.args2 = Second set of arguments appended to command in module (multi-tool modules).
ext.args3 = Third set of arguments appended to command in module (multi-tool modules).
ext.prefix = File name prefix for output files.
----------------------------------------------------------------------------------------
*/

process {
withName: 'GUNZIP|ARIA2_PDB_SEQRES' {
publishDir = [
path: {"${params.outdir}/DBs/rosettafold_all_atom/"},
mode: 'symlink',
saveAs: { filename -> filename.equals('versions.yml') ? null : filename },
]
}

withName: 'RUN_ROSETTAFOLD_ALL_ATOM' {
if(params.use_gpu) { accelerator = 1 }
publishDir = [
path: { "${params.outdir}/rosettafold_all_atom/" },
mode: 'copy',
saveAs: { filename -> filename.equals('versions.yml') ? null : filename },
pattern: '*.*'
]
}

withName: 'NFCORE_PROTEINFOLD:ROSETTAFOLD_ALL_ATOM:MULTIQC' {
publishDir = [
path: { "${params.outdir}/multiqc" },
mode: 'copy',
saveAs: { filename -> filename.equals('versions.yml') ? null : "rosettafold_all_atom_$filename" }
]
}
}
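
The `saveAs` closures in this config all follow one rule: returning `null` tells Nextflow not to publish the file, so `versions.yml` is skipped and every other file keeps its name or gains a prefix. A Python analogue of that rule, for illustration only, is sketched below. `save_as` and its `prefix` argument are hypothetical, and the real logic is the Groovy closure above.

```python
from typing import Optional


def save_as(filename: str, prefix: str = "") -> Optional[str]:
    """Hypothetical analogue of the Groovy saveAs closures in this config."""
    # None plays the role of Groovy's null: the file is not published.
    if filename == "versions.yml":
        return None
    # The MULTIQC selector prepends "rosettafold_all_atom_"; the others use no prefix.
    return f"{prefix}{filename}"
```

With `prefix="rosettafold_all_atom_"`, a file named `multiqc_report.html` (an example name) would be published as `rosettafold_all_atom_multiqc_report.html`.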
36 changes: 36 additions & 0 deletions conf/test_rosettafold_all_atom.config
@@ -0,0 +1,36 @@
/*
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Nextflow config file for running minimal tests
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Defines input files and everything required to run a fast and simple pipeline test.
Use as follows:
nextflow run nf-core/proteinfold -profile test_rosettafold_all_atom,<docker/singularity> --outdir <OUTDIR>
----------------------------------------------------------------------------------------
*/

stubRun = true

// Limit resources so that this can run on GitHub Actions
process {
resourceLimits = [
cpus: 4,
memory: '15.GB',
time: '1.h'
]
}

params {
config_profile_name = 'Test profile'
config_profile_description = 'Minimal test dataset to check pipeline function'

// Input data to test rosettafold_all_atom
mode = 'rosettafold_all_atom'
rosettafold_all_atom_db = "${projectDir}/assets/dummy_db_dir"
input = params.pipelines_testdata_base_path + 'proteinfold/testdata/samplesheet/v1.2/samplesheet.csv'
}

process {
withName: 'RUN_ROSETTAFOLD_ALL_ATOM' {
container = 'biocontainers/gawk:5.1.0'
}
}
35 changes: 35 additions & 0 deletions dockerfiles/Dockerfile_nfcore-proteinfold_rosettafold_all_atom
@@ -0,0 +1,35 @@
FROM nvidia/cuda:12.6.0-cudnn-devel-ubuntu24.04

LABEL Author="[email protected]" \
title="nfcore/proteinfold_rosettafold_all_atom" \
Version="1.2.0dev" \
description="Docker image containing all software requirements to run the RUN_ROSETTAFOLD_ALL_ATOM module using the nf-core/proteinfold pipeline"

ENV PYTHONPATH="/app/RoseTTAFold-All-Atom" \
PATH="/conda/bin:/app/RoseTTAFold-All-Atom:$PATH" \
DGLBACKEND="pytorch" \
LD_LIBRARY_PATH="/conda/lib:/usr/local/cuda-12.6/lib64:$LD_LIBRARY_PATH"

RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install --no-install-recommends -y wget git && \
wget -q -P /tmp "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh" && \
bash /tmp/Miniforge3-$(uname)-$(uname -m).sh -b -p /conda && \
rm -rf /tmp/Miniforge3-$(uname)-$(uname -m).sh /var/lib/apt/lists/* && \
apt-get autoremove -y && apt-get clean -y

RUN git clone --single-branch --depth 1 https://github.com/Australian-Structural-Biology-Computing/RoseTTAFold-All-Atom.git /app/RoseTTAFold-All-Atom && \
cd /app/RoseTTAFold-All-Atom && \
/conda/bin/mamba env create --file=environment.yaml && \
/conda/bin/mamba run -n RFAA bash -c \
"python /app/RoseTTAFold-All-Atom/rf2aa/SE3Transformer/setup.py install && \
bash /app/RoseTTAFold-All-Atom/install_dependencies.sh" && \
/conda/bin/mamba clean --all --force-pkgs-dirs -y

RUN cd /app/RoseTTAFold-All-Atom && \
wget https://ftp.ncbi.nlm.nih.gov/blast/executables/legacy.NOTSUPPORTED/2.2.26/blast-2.2.26-x64-linux.tar.gz && \
mkdir -p blast-2.2.26 && \
tar -xf blast-2.2.26-x64-linux.tar.gz -C blast-2.2.26 && \
cp -r blast-2.2.26/blast-2.2.26/ blast-2.2.26_bk && \
rm -r blast-2.2.26 && \
mv blast-2.2.26_bk/ blast-2.2.26 && \
rm -rf /root/.cache *.tar.gz && \
apt-get autoremove -y && apt-get remove --purge -y wget git && apt-get clean -y
14 changes: 14 additions & 0 deletions docs/output.md
@@ -13,6 +13,7 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and predicts pr
- [AlphaFold2](https://github.com/deepmind/alphafold)
- [ColabFold](https://github.com/sokrypton/ColabFold) - MMseqs2 (API server or local search) followed by ColabFold
- [ESMFold](https://github.com/facebookresearch/esm)
- [RoseTTAFold-All-Atom](https://github.com/baker-laboratory/RoseTTAFold-All-Atom/)

See main [README.md](https://github.com/nf-core/proteinfold/blob/master/README.md) for a condensed overview of the steps in the pipeline, and the bioinformatics tools used at each step.

@@ -176,6 +177,19 @@ Below you can find an indicative example of the TSV file with the pLDDT scores p
| 49 | CB | VAL | 7 | 52.74 |
| 50 | O | VAL | 7 | 56.46 |

### RoseTTAFold-All-Atom

<details markdown="1">
<summary>Output files</summary>

- `run/`
- `<SEQUENCE NAME>_rosettafold_all_atom.pdb` that is the structure with the highest pLDDT score (ranked first)
- `<SEQUENCE NAME>_plddt_mqc.tsv` that presents the pLDDT scores per residue for the predicted model
- `<SEQUENCE NAME>_aux.pt` pytorch file with confidence metrics stored (can load with torch.load(file, map_location="cpu"))
- `<SEQUENCE NAME>/` that contains the computed MSAs, prediction metadata

</details>

### MultiQC report

<details markdown="1">
14 changes: 13 additions & 1 deletion docs/usage.md
@@ -39,7 +39,7 @@ Each FASTA file should contain a single protein sequence unless using multimer m

## Running the pipeline

- The typical commands for running the pipeline on AlphaFold2, Colabfold and ESMFold modes are shown below.
+ The typical commands for running the pipeline on AlphaFold2, Colabfold, ESMFold and RoseTTAFold-All-Atom modes are shown below.

> You can run any combination of the models by providing them to the `--mode` parameter separated by a comma. For example: `--mode alphafold2,esmfold,colabfold` will run the three models in parallel.
@@ -428,6 +428,18 @@ If you specify the `--esmfold_db <PATH>` parameter, the directory structure of y

This will launch the pipeline with the `docker` configuration profile. See below for more information about profiles.

RoseTTAFold All-Atom can be run using this command:

```bash
nextflow run nf-core/proteinfold \
--input samplesheet.csv \
--outdir <OUTDIR> \
--mode rosettafold_all_atom \
--rosettafold_all_atom_db <null (default) | DB_PATH> \
--use_gpu <true/false> \
-profile <docker/singularity/.../institute>
```

Note that the pipeline will create the following files in your working directory:

```bash