Commit

Merge pull request #223 from Australian-Structural-Biology-Computing/…
…add-helixfold3

Add HelixFold3
JoseEspinosa authored Feb 21, 2025
2 parents dd7a880 + e3e8fab commit 7b14116
Showing 20 changed files with 910 additions and 4 deletions.
1 change: 1 addition & 0 deletions .github/workflows/ci.yml
@@ -45,6 +45,7 @@ jobs:
- "test_esmfold"
- "test_split_fasta"
- "test_rosettafold_all_atom"
- "test_helixfold3"
isMaster:
- ${{ github.base_ref == 'master' }}
# Exclude conda and singularity on dev
6 changes: 4 additions & 2 deletions CHANGELOG.md
@@ -13,9 +13,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- [[#180](https://github.com/nf-core/proteinfold/issues/180)] - Implement Foldseek.
- [[#188](https://github.com/nf-core/proteinfold/issues/188)] - Fix colabfold image to run in gpus.
- [[PR #205](https://github.com/nf-core/proteinfold/pull/205)] - Change input schema from `sequence,fasta` to `id,fasta`.
- [[PR #210](https://github.com/nf-core/proteinfold/pull/210)]- Moving post-processing logic to a subworkflow, change wave images pointing to oras to point to https and refactor module to match nf-core folder structure.
- [[#214](https://github.com/nf-core/proteinfold/issues/214)]- Fix colabfold image to run in cpus after [#188](https://github.com/nf-core/proteinfold/issues/188) fix.
- [[PR #210](https://github.com/nf-core/proteinfold/pull/210)] - Moving post-processing logic to a subworkflow, change wave images pointing to oras to point to https and refactor module to match nf-core folder structure.
- [[#214](https://github.com/nf-core/proteinfold/issues/214)] - Fix colabfold image to run in cpus after [#188](https://github.com/nf-core/proteinfold/issues/188) fix.
- [[PR #220](https://github.com/nf-core/proteinfold/pull/220)] - Add RoseTTAFold-All-Atom module.
- [[PR #223](https://github.com/nf-core/proteinfold/pull/223)] - Add HelixFold3 module.
- [[#235](https://github.com/nf-core/proteinfold/issues/235)] - Update samplesheet to new version (switch from `sequence` column to `id`).
- [[#240](https://github.com/nf-core/proteinfold/issues/240)] - Separate download and input of pdb `mmcif` files and `obsolete` database.

@@ -119,6 +120,7 @@ Thank you to everyone else that has contributed by reporting bugs, enhancements
| | `--esmfold_params_path` |
| | `--skip_multiqc` |
| | `--rosettafold_all_atom_db` |
| | `--helixfold3_db` |

> **NB:** Parameter has been **updated** if both old and new parameter information is present.
> **NB:** Parameter has been **added** if just the new parameter information is present.
14 changes: 14 additions & 0 deletions README.md
@@ -41,6 +41,8 @@ On release, automated continuous integration tests run the pipeline on a full-si

vi. [RoseTTAFold-All-Atom](https://github.com/baker-laboratory/RoseTTAFold-All-Atom/) - Regular RFAA

vii. [HelixFold3](https://github.com/PaddlePaddle/PaddleHelix/tree/dev/apps/protein_folding/helixfold3) - Regular HF3

## Usage

> [!NOTE]
@@ -150,6 +152,18 @@ The pipeline takes care of downloading the databases and parameters required by
-profile <docker/singularity/podman/shifter/charliecloud/conda/institute>
```

- The helixfold3 mode can be run using the command below:

```console
nextflow run nf-core/proteinfold \
--input samplesheet.csv \
--outdir <OUTDIR> \
--mode helixfold3 \
--helixfold3_db <null (default) | PATH> \
--use_gpu <true/false> \
-profile <docker/singularity/podman/shifter/charliecloud/conda/institute>
```

> [!WARNING]
> Please provide pipeline parameters via the CLI or Nextflow `-params-file` option. Custom config files including those provided by the `-c` Nextflow option can be used to provide any configuration _**except for parameters**_; see [docs](https://nf-co.re/docs/usage/getting_started/configuration#custom-configuration-files).
1 change: 1 addition & 0 deletions bin/generate_report.py
@@ -308,6 +308,7 @@ def pdb_to_lddt(pdb_files, generate_tsv):
"alphafold2": "AlphaFold2",
"colabfold": "ColabFold",
"rosettafold_all_atom": "Rosettafold_All_Atom",
"helixfold3": "HelixFold3"
}

parser = argparse.ArgumentParser()
29 changes: 29 additions & 0 deletions conf/dbs.config
@@ -61,6 +61,35 @@ params {
bfd_rosettafold_all_atom_path = "${params.rosettafold_all_atom_db}/bfd/*"
rfaa_paper_weights_path = "${params.rosettafold_all_atom_db}/RFAA_paper_weights.pt"

// Helixfold3 links
helixfold3_uniclust30_link = 'https://storage.googleapis.com/alphafold-databases/casp14_versions/uniclust30_2018_08_hhsuite.tar.gz'
helixfold3_ccd_preprocessed_link = 'https://paddlehelix.bd.bcebos.com/HelixFold3/CCD/ccd_preprocessed_etkdg.pkl.gz'
helixfold3_rfam_link = 'https://paddlehelix.bd.bcebos.com/HelixFold3/MSA/Rfam-14.9_rep_seq.fasta'
helixfold3_init_models_link = 'https://paddlehelix.bd.bcebos.com/HelixFold3/params/HelixFold3-params-240814.zip'
helixfold3_bfd_link = 'https://storage.googleapis.com/alphafold-databases/casp14_versions/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt.tar.gz'
helixfold3_small_bfd_link = 'https://storage.googleapis.com/alphafold-databases/reduced_dbs/bfd-first_non_consensus_sequences.fasta.gz'
helixfold3_uniprot_sprot_link = 'ftp://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz'
helixfold3_uniprot_trembl_link = 'ftp://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_trembl.fasta.gz'
helixfold3_pdb_seqres_link = "${params.pdb_seqres_link}"
helixfold3_uniref90_link = 'ftp://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref90/uniref90.fasta.gz'
helixfold3_mgnify_link = 'https://storage.googleapis.com/alphafold-databases/casp14_versions/mgy_clusters_2018_12.fa.gz'
helixfold3_pdb_mmcif_link = 'rsync.rcsb.org::ftp_data/structures/divided/mmCIF/'
helixfold3_pdb_obsolete_link = 'ftp://ftp.wwpdb.org/pub/pdb/data/status/obsolete.dat'

// Helixfold3 paths
helixfold3_uniclust30_path = "${params.helixfold3_db}/uniclust30/*"
helixfold3_ccd_preprocessed_path = "${params.helixfold3_db}/ccd_preprocessed_etkdg.pkl.gz"
helixfold3_rfam_path = "${params.helixfold3_db}/Rfam-14.9_rep_seq.fasta"
helixfold3_init_models_path = "${params.helixfold3_db}/HelixFold3-240814.pdparams"
helixfold3_bfd_path = "${params.helixfold3_db}/bfd/*"
helixfold3_small_bfd_path = "${params.helixfold3_db}/small_bfd/*"
helixfold3_uniprot_path = "${params.helixfold3_db}/uniprot/*"
helixfold3_pdb_seqres_path = "${params.helixfold3_db}/pdb_seqres/*"
helixfold3_uniref90_path = "${params.helixfold3_db}/uniref90/*"
helixfold3_mgnify_path = "${params.helixfold3_db}/mgnify/*"
helixfold3_pdb_mmcif_path = "${params.helixfold3_db}/pdb_mmcif/*"
helixfold3_maxit_src_path = "${params.helixfold3_db}/maxit-v11.200-prod-src"

// Esmfold links
esmfold_3B_v1 = 'https://dl.fbaipublicfiles.com/fair-esm/models/esmfold_3B_v1.pt'
esm2_t36_3B_UR50D = 'https://dl.fbaipublicfiles.com/fair-esm/models/esm2_t36_3B_UR50D.pt'
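
As a rough illustration of how the `helixfold3_*_path` settings above resolve against `--helixfold3_db`, the Python sketch below reports which expected entries are absent from a local database directory. The helper name `missing_helixfold3_entries` is hypothetical and not part of the pipeline; the relative paths are copied from the config above, with entries that end in `/*` reduced to their parent directory.

```python
from pathlib import Path

# Relative locations copied from the helixfold3_*_path params in conf/dbs.config.
# Config entries ending in "/*" are listed here as plain directories.
EXPECTED_ENTRIES = [
    "uniclust30",
    "ccd_preprocessed_etkdg.pkl.gz",
    "Rfam-14.9_rep_seq.fasta",
    "HelixFold3-240814.pdparams",
    "bfd",
    "small_bfd",
    "uniprot",
    "pdb_seqres",
    "uniref90",
    "mgnify",
    "pdb_mmcif",
    "maxit-v11.200-prod-src",
]


def missing_helixfold3_entries(db_dir):
    """Return the expected entries that do not exist under db_dir (hypothetical helper)."""
    root = Path(db_dir)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]
```

Calling `missing_helixfold3_entries("/path/to/helixfold3_db")` before launching a run with a pre-downloaded database would return an empty list when every entry is present. The check only tests for existence; it does not validate the contents of any file or directory.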
39 changes: 39 additions & 0 deletions conf/modules_helixfold3.config
@@ -0,0 +1,39 @@
/*
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    Config file for defining DSL2 per module options and publishing paths
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    Available keys to override module options:
        ext.args   = Additional arguments appended to command in module.
        ext.args2  = Second set of arguments appended to command in module (multi-tool modules).
        ext.args3  = Third set of arguments appended to command in module (multi-tool modules).
        ext.prefix = File name prefix for output files.
----------------------------------------------------------------------------------------
*/

process {
    withName: 'GUNZIP|COMBINE_UNIPROT|DOWNLOAD_PDBMMCIF|ARIA2_PDB_SEQRES' {
        publishDir = [
            path: {"${params.outdir}/DBs/helixfold3/"},
            mode: 'symlink',
            saveAs: { filename -> filename.equals('versions.yml') ? null : filename },
        ]
    }

    withName: 'RUN_HELIXFOLD3' {
        if(params.use_gpu) { accelerator = 1 }
        publishDir = [
            path: { "${params.outdir}/helixfold3/" },
            mode: 'copy',
            saveAs: { filename -> filename.equals('versions.yml') ? null : filename },
            pattern: '*.*'
        ]
    }

    withName: 'NFCORE_PROTEINFOLD:HELIXFOLD3:MULTIQC' {
        publishDir = [
            path: { "${params.outdir}/multiqc" },
            mode: 'copy',
            saveAs: { filename -> filename.equals('versions.yml') ? null : "helixfold3_$filename" }
        ]
    }
}
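
The `saveAs` closures in the config above share one rule: `versions.yml` is never published, and every other file keeps its name (with a `helixfold3_` prefix in the MultiQC case). A minimal Python sketch of that rule follows; the function name `publish_name` and its `prefix` argument are illustrative only and do not exist in the pipeline.

```python
def publish_name(filename, prefix=""):
    """Mimic the saveAs closures: skip versions.yml, otherwise return the (optionally prefixed) name."""
    if filename == "versions.yml":
        # In Nextflow, a null result from saveAs means the file is not published.
        return None
    return f"{prefix}{filename}"
```

For example, `publish_name("multiqc_report.html", prefix="helixfold3_")` gives `helixfold3_multiqc_report.html`, while `publish_name("versions.yml")` gives `None`.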
37 changes: 37 additions & 0 deletions conf/test_helixfold3.config
@@ -0,0 +1,37 @@
/*
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    Nextflow config file for running minimal tests
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    Defines input files and everything required to run a fast and simple pipeline test.
    Use as follows:
        nextflow run nf-core/proteinfold -profile test_helixfold3,<docker/singularity> --outdir <OUTDIR>
----------------------------------------------------------------------------------------
*/

stubRun = true

// Limit resources so that this can run on GitHub Actions
process {
    resourceLimits = [
        cpus: 4,
        memory: '15.GB',
        time: '1.h'
    ]
}

params {
    config_profile_name        = 'Test profile'
    config_profile_description = 'Minimal test dataset to check pipeline function'

    // Input data to test helixfold3
    mode          = 'helixfold3'
    helixfold3_db = "${projectDir}/assets/dummy_db_dir"
    input         = params.pipelines_testdata_base_path + 'proteinfold/testdata/samplesheet/v1.2/samplesheet.csv'
}

process {
    withName: 'RUN_HELIXFOLD3' {
        container = 'biocontainers/gawk:5.1.0'
    }
}

34 changes: 34 additions & 0 deletions dockerfiles/Dockerfile_nfcore-proteinfold_helixfold3
@@ -0,0 +1,34 @@
FROM nvidia/cuda:12.1.1-cudnn8-devel-ubuntu22.04

LABEL Author="[email protected]" \
    title="nfcore/proteinfold_helixfold3" \
    Version="0.9.0" \
    description="Docker image containing all software requirements to run the RUN_HELIXFOLD3 module using the nf-core/proteinfold pipeline"

ENV PYTHONPATH="/app/helixfold3:$PYTHONPATH" \
    PATH="/conda/bin:/app/helixfold3:$PATH" \
    PYTHON_BIN="/conda/envs/helixfold/bin/python3.9" \
    ENV_BIN="/conda/envs/helixfold/bin" \
    OBABEL_BIN="/conda/envs/helixfold/bin"

RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install --no-install-recommends -y wget git && \
    wget -q -P /tmp "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh" && \
    bash /tmp/Miniforge3-$(uname)-$(uname -m).sh -b -p /conda && \
    rm -rf /tmp/Miniforge3-$(uname)-$(uname -m).sh /var/lib/apt/lists/* && \
    apt-get autoremove -y && apt-get clean -y

RUN git clone --single-branch --branch dev --depth 1 --no-checkout https://github.com/PaddlePaddle/PaddleHelix.git /app/helixfold3 && \
    cd /app/helixfold3 && \
    git sparse-checkout init --cone && \
    git sparse-checkout set apps/protein_folding/helixfold3 && \
    git checkout dev && \
    mv apps/protein_folding/helixfold3/* . && \
    rm -rf apps

COPY environment_nfcore-proteinfold_helixfold3.yaml /app/helixfold3/
RUN /conda/bin/mamba env create --file=/app/helixfold3/environment_nfcore-proteinfold_helixfold3.yaml && \
    /conda/bin/mamba install -y -c bioconda aria2 hmmer==3.3.2 kalign2==2.04 hhsuite==3.3.0 -n helixfold && \
    /conda/bin/mamba install -y -c conda-forge openbabel -n helixfold && \
    /conda/bin/mamba clean --all --force-pkgs-dirs -y && \
    rm -rf /root/.cache && \
    apt-get autoremove -y && apt-get remove --purge -y wget git && apt-get clean -y
35 changes: 35 additions & 0 deletions dockerfiles/environment_nfcore-proteinfold_helixfold3.yaml
@@ -0,0 +1,35 @@
name: helixfold
channels:
  - conda-forge
  - bioconda
  - nvidia
  - biocore

dependencies:
  - python=3.9
  - cuda-toolkit=12.0
  - cudnn=8.4.0
  - nccl=2.14
  - libgcc
  - libgomp
  - pip
  - aria2
  - hmmer==3.4
  - kalign2==2.04
  - hhsuite==3.3.0
  - openbabel
  - pip:
      - paddlepaddle-gpu==2.6.1 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html
      - absl-py==0.13.0
      - biopython==1.79
      - chex==0.0.7
      - dm-haiku==0.0.4
      - dm-tree==0.1.6
      - docker==5.0.0
      - immutabledict==2.0.0
      - jax==0.2.14
      - ml-collections==0.1.0
      - pandas==1.3.4
      - scipy==1.9.0
      - rdkit-pypi==2022.9.5
      - posebusters
13 changes: 13 additions & 0 deletions docs/output.md
@@ -14,6 +14,7 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and predicts pr
- [ColabFold](https://github.com/sokrypton/ColabFold) - MMseqs2 (API server or local search) followed by ColabFold
- [ESMFold](https://github.com/facebookresearch/esm)
- [RoseTTAFold-All-Atom](https://github.com/baker-laboratory/RoseTTAFold-All-Atom/)
- [HelixFold3](https://github.com/PaddlePaddle/PaddleHelix/tree/dev/apps/protein_folding/helixfold3)

See main [README.md](https://github.com/nf-core/proteinfold/blob/master/README.md) for a condensed overview of the steps in the pipeline, and the bioinformatics tools used at each step.

@@ -190,6 +191,18 @@ Below you can find an indicative example of the TSV file with the pLDDT scores p

</details>

### HelixFold3

<details markdown="1">
<summary>Output files</summary>

- `run/`
- `<SEQUENCE NAME>_helixfold3.pdb`: the structure with the highest pLDDT score (ranked first)
- `<SEQUENCE NAME>_plddt_mqc.tsv`: the per-residue pLDDT scores for the predicted model
- `<SEQUENCE NAME>/`: the computed MSAs, prediction metadata, ranked structures, raw model outputs etc.

</details>

### MultiQC report

<details markdown="1">
12 changes: 12 additions & 0 deletions docs/usage.md
@@ -426,6 +426,18 @@ If you specify the `--esmfold_db <PATH>` parameter, the directory structure of y
└── esmfold_3B_v1.pt
```

HelixFold3 can be run using this command (note that HelixFold3 takes `.json` input files rather than `.fasta` files):

```console
nextflow run nf-core/proteinfold \
--input samplesheet.csv \
--outdir <OUTDIR> \
--mode helixfold3 \
--helixfold3_db <null (default) | DB_PATH> \
--use_gpu <true/false> \
-profile <docker>
```

This will launch the pipeline with the `docker` configuration profile. See below for more information about profiles.
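
Because HelixFold3 expects `.json` input rather than `.fasta`, a small converter can help when starting from plain sequences. The Python sketch below is only an assumption-laden illustration: the `entities` / `type` / `sequence` / `count` layout is taken to match the upstream HelixFold3 input format and should be checked against the HelixFold3 README before use, and the function name `build_hf3_input` is hypothetical.

```python
import json


def build_hf3_input(sequences):
    """Build a HelixFold3-style JSON string with one protein entity per sequence.

    The key names used here are an assumption about the upstream input schema.
    """
    entities = [
        {"type": "protein", "sequence": seq, "count": 1}
        for seq in sequences
    ]
    return json.dumps({"entities": entities}, indent=2)
```

Writing the returned string to a file such as `T1024.json` (an arbitrary example name) gives something that can be referenced from the samplesheet in place of a FASTA file, assuming the pipeline accepts it in that column.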

RoseTTAFold All-Atom can be run using this command:
70 changes: 69 additions & 1 deletion main.nf
@@ -31,6 +31,10 @@ if (params.mode.toLowerCase().split(",").contains("rosettafold_all_atom")) {
include { PREPARE_ROSETTAFOLD_ALL_ATOM_DBS } from './subworkflows/local/prepare_rosettafold_all_atom_dbs'
include { ROSETTAFOLD_ALL_ATOM } from './workflows/rosettafold_all_atom'
}
if (params.mode.toLowerCase().split(",").contains("helixfold3")) {
include { PREPARE_HELIXFOLD3_DBS } from './subworkflows/local/prepare_helixfold3_dbs'
include { HELIXFOLD3 } from './workflows/helixfold3'
}

include { PIPELINE_INITIALISATION } from './subworkflows/local/utils_nfcore_proteinfold_pipeline'
include { PIPELINE_COMPLETION } from './subworkflows/local/utils_nfcore_proteinfold_pipeline'
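
The conditional includes above select a tool by lowercasing `params.mode`, splitting it on commas, and testing for an exact element match. The same logic, sketched in Python for illustration (the function names `requested_modes` and `wants` are made up for this example and are not pipeline code):

```python
def requested_modes(mode_param):
    """Lowercase a comma-separated --mode value and split it on ','."""
    return mode_param.lower().split(",")


def wants(mode_param, tool):
    """Return True when tool is one of the requested modes (exact match)."""
    return tool in requested_modes(mode_param)
```

One consequence of the exact match is that the entries are not trimmed: `wants("alphafold2,helixfold3", "helixfold3")` is true, whereas a value written with a space after the comma, such as `"alphafold2, helixfold3"`, would not match in this sketch.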
@@ -70,6 +74,7 @@ workflow NFCORE_PROTEINFOLD {
ch_colabfold_top_ranked_pdb = Channel.empty()
ch_esmfold_top_ranked_pdb = Channel.empty()
ch_rosettafold_all_atom_top_ranked_pdb = Channel.empty()
ch_helixfold3_top_ranked_pdb = Channel.empty()
ch_multiqc = Channel.empty()
ch_versions = Channel.empty()
ch_report_input = Channel.empty()
@@ -250,6 +255,68 @@ workflow NFCORE_PROTEINFOLD {
ch_report_input = ch_report_input.mix(ROSETTAFOLD_ALL_ATOM.out.pdb_msa)
}

//
// WORKFLOW: Run helixfold3
//
if(requested_modes.contains("helixfold3")) {
//
// SUBWORKFLOW: Prepare helixfold3 DBs
//
PREPARE_HELIXFOLD3_DBS (
params.helixfold3_db,
params.helixfold3_uniclust30_link,
params.helixfold3_ccd_preprocessed_link,
params.helixfold3_rfam_link,
params.helixfold3_init_models_link,
params.helixfold3_bfd_link,
params.helixfold3_small_bfd_link,
params.helixfold3_uniprot_sprot_link,
params.helixfold3_uniprot_trembl_link,
params.helixfold3_pdb_seqres_link,
params.helixfold3_uniref90_link,
params.helixfold3_mgnify_link,
params.helixfold3_pdb_mmcif_link,
params.helixfold3_pdb_obsolete_link,
params.helixfold3_uniclust30_path,
params.helixfold3_ccd_preprocessed_path,
params.helixfold3_rfam_path,
params.helixfold3_init_models_path,
params.helixfold3_bfd_path,
params.helixfold3_small_bfd_path,
params.helixfold3_uniprot_path,
params.helixfold3_pdb_seqres_path,
params.helixfold3_uniref90_path,
params.helixfold3_mgnify_path,
params.helixfold3_pdb_mmcif_path,
params.helixfold3_maxit_src_path
)
ch_versions = ch_versions.mix(PREPARE_HELIXFOLD3_DBS.out.versions)

//
// WORKFLOW: Run nf-core/helixfold3 workflow
//
HELIXFOLD3 (
ch_samplesheet,
ch_versions,
PREPARE_HELIXFOLD3_DBS.out.helixfold3_uniclust30,
PREPARE_HELIXFOLD3_DBS.out.helixfold3_ccd_preprocessed,
PREPARE_HELIXFOLD3_DBS.out.helixfold3_rfam,
PREPARE_HELIXFOLD3_DBS.out.helixfold3_bfd,
PREPARE_HELIXFOLD3_DBS.out.helixfold3_small_bfd,
PREPARE_HELIXFOLD3_DBS.out.helixfold3_uniprot,
PREPARE_HELIXFOLD3_DBS.out.helixfold3_pdb_seqres,
PREPARE_HELIXFOLD3_DBS.out.helixfold3_uniref90,
PREPARE_HELIXFOLD3_DBS.out.helixfold3_mgnify,
PREPARE_HELIXFOLD3_DBS.out.helixfold3_pdb_mmcif,
PREPARE_HELIXFOLD3_DBS.out.helixfold3_init_models,
PREPARE_HELIXFOLD3_DBS.out.helixfold3_maxit_src
)
ch_helixfold3_top_ranked_pdb = HELIXFOLD3.out.top_ranked_pdb
ch_multiqc = ch_multiqc.mix(HELIXFOLD3.out.multiqc_report.collect())
ch_versions = ch_versions.mix(HELIXFOLD3.out.versions)
ch_report_input = ch_report_input.mix(HELIXFOLD3.out.pdb_msa)
}

//
// POST PROCESSING: generate visualisation reports
//
@@ -293,7 +360,8 @@ workflow NFCORE_PROTEINFOLD {
ch_alphafold_top_ranked_pdb,
ch_colabfold_top_ranked_pdb,
ch_esmfold_top_ranked_pdb,
ch_rosettafold_all_atom_top_ranked_pdb
ch_rosettafold_all_atom_top_ranked_pdb,
ch_helixfold3_top_ranked_pdb
)

emit: