Commit 7b14116

Merge pull request #223 from Australian-Structural-Biology-Computing/add-helixfold3
Add HelixFold3
2 parents dd7a880 + e3e8fab commit 7b14116

20 files changed, +910 -4 lines changed

.github/workflows/ci.yml

Lines changed: 1 addition & 0 deletions
@@ -45,6 +45,7 @@ jobs:
           - "test_esmfold"
           - "test_split_fasta"
           - "test_rosettafold_all_atom"
+          - "test_helixfold3"
         isMaster:
           - ${{ github.base_ref == 'master' }}
         # Exclude conda and singularity on dev

CHANGELOG.md

Lines changed: 4 additions & 2 deletions
@@ -13,9 +13,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - [[#180](https://github.com/nf-core/proteinfold/issues/180)] - Implement Fooldseek.
 - [[#188](https://github.com/nf-core/proteinfold/issues/188)] - Fix colabfold image to run in gpus.
 - [[PR ##205](https://github.com/nf-core/proteinfold/pull/205)] - Change input schema from `sequence,fasta` to `id,fasta`.
-- [[PR #210](https://github.com/nf-core/proteinfold/pull/210)]- Moving post-processing logic to a subworkflow, change wave images pointing to oras to point to https and refactor module to match nf-core folder structure.
-- [[#214](https://github.com/nf-core/proteinfold/issues/214)]- Fix colabfold image to run in cpus after [#188](https://github.com/nf-core/proteinfold/issues/188) fix.
+- [[PR #210](https://github.com/nf-core/proteinfold/pull/210)] - Moving post-processing logic to a subworkflow, change wave images pointing to oras to point to https and refactor module to match nf-core folder structure.
+- [[#214](https://github.com/nf-core/proteinfold/issues/214)] - Fix colabfold image to run in cpus after [#188](https://github.com/nf-core/proteinfold/issues/188) fix.
 - [[PR ##220](https://github.com/nf-core/proteinfold/pull/220)] - Add RoseTTAFold-All-Atom module.
+- [[PR ##223](https://github.com/nf-core/proteinfold/pull/223)] - Add HelixFold3 module.
 - [[#235](https://github.com/nf-core/proteinfold/issues/235)] - Update samplesheet to new version (switch from `sequence` column to `id`).
 - [[#240](https://github.com/nf-core/proteinfold/issues/240)] - Separate download and input of pdb `mmcif` files and `obsolete` database.
 
@@ -119,6 +120,7 @@ Thank you to everyone else that has contributed by reporting bugs, enhancements
 | | `--esmfold_params_path` |
 | | `--skip_multiqc` |
 | | `--rosettafold_all_atom_db` |
+| | `--helixfold3_db` |
 
 > **NB:** Parameter has been **updated** if both old and new parameter information is present.
 > **NB:** Parameter has been **added** if just the new parameter information is present.

README.md

Lines changed: 14 additions & 0 deletions
@@ -41,6 +41,8 @@ On release, automated continuous integration tests run the pipeline on a full-si
 
    vi. [RoseTTAFold-All-Atom](https://github.com/baker-laboratory/RoseTTAFold-All-Atom/) - Regular RFAA
 
+   vii. [HelixFold3](https://github.com/PaddlePaddle/PaddleHelix/tree/dev/apps/protein_folding/helixfold3) - Regular HF3
+
 ## Usage
 
 > [!NOTE]
@@ -150,6 +152,18 @@ The pipeline takes care of downloading the databases and parameters required by
       -profile <docker/singularity/podman/shifter/charliecloud/conda/institute>
   ```
 
+- The helixfold3 mode can be run using the command below:
+
+  ```console
+  nextflow run nf-core/proteinfold \
+      --input samplesheet.csv \
+      --outdir <OUTDIR> \
+      --mode helixfold3 \
+      --helixfold3_db <null (default) | PATH> \
+      --use_gpu <true/false> \
+      -profile <docker/singularity/podman/shifter/charliecloud/conda/institute>
+  ```
+
 > [!WARNING]
 > Please provide pipeline parameters via the CLI or Nextflow `-params-file` option. Custom config files including those provided by the `-c` Nextflow option can be used to provide any configuration _**except for parameters**_; see [docs](https://nf-co.re/docs/usage/getting_started/configuration#custom-configuration-files).
 
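
Reviewer note (illustrative, not part of the commit): the helixfold3 invocation documented in the README hunk can be assembled programmatically, for example when launching many runs from a script. The sketch below uses only the flags shown in the README; the function name `build_helixfold3_command` and its defaults are hypothetical, and it assumes that omitting `--helixfold3_db` is equivalent to the documented `null` default.

```python
import shlex


def build_helixfold3_command(samplesheet, outdir, helixfold3_db=None,
                             use_gpu=True, profile="docker"):
    """Assemble the README's helixfold3 invocation as an argument list."""
    cmd = [
        "nextflow", "run", "nf-core/proteinfold",
        "--input", samplesheet,
        "--outdir", outdir,
        "--mode", "helixfold3",
    ]
    # The README lists null as the default for --helixfold3_db,
    # so the flag is left out when no path is given (assumption).
    if helixfold3_db is not None:
        cmd += ["--helixfold3_db", helixfold3_db]
    cmd += ["--use_gpu", "true" if use_gpu else "false", "-profile", profile]
    return cmd


if __name__ == "__main__":
    # Print a shell-quoted command line that can be pasted into a terminal.
    print(shlex.join(build_helixfold3_command("samplesheet.csv", "results")))
```

Returning a list rather than a string keeps the arguments safe to pass to `subprocess.run` without shell quoting concerns.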

bin/generate_report.py

Lines changed: 1 addition & 0 deletions
@@ -308,6 +308,7 @@ def pdb_to_lddt(pdb_files, generate_tsv):
     "alphafold2": "AlphaFold2",
     "colabfold": "ColabFold",
     "rosettafold_all_atom": "Rosettafold_All_Atom",
+    "helixfold3": "HelixFold3"
 }
 
 parser = argparse.ArgumentParser()

conf/dbs.config

Lines changed: 29 additions & 0 deletions
@@ -61,6 +61,35 @@ params {
     bfd_rosettafold_all_atom_path = "${params.rosettafold_all_atom_db}/bfd/*"
     rfaa_paper_weights_path = "${params.rosettafold_all_atom_db}/RFAA_paper_weights.pt"
 
+    // Helixfold3 links
+    helixfold3_uniclust30_link = 'https://storage.googleapis.com/alphafold-databases/casp14_versions/uniclust30_2018_08_hhsuite.tar.gz'
+    helixfold3_ccd_preprocessed_link = 'https://paddlehelix.bd.bcebos.com/HelixFold3/CCD/ccd_preprocessed_etkdg.pkl.gz'
+    helixfold3_rfam_link = 'https://paddlehelix.bd.bcebos.com/HelixFold3/MSA/Rfam-14.9_rep_seq.fasta'
+    helixfold3_init_models_link = 'https://paddlehelix.bd.bcebos.com/HelixFold3/params/HelixFold3-params-240814.zip'
+    helixfold3_bfd_link = 'https://storage.googleapis.com/alphafold-databases/casp14_versions/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt.tar.gz'
+    helixfold3_small_bfd_link = 'https://storage.googleapis.com/alphafold-databases/reduced_dbs/bfd-first_non_consensus_sequences.fasta.gz'
+    helixfold3_uniprot_sprot_link = 'ftp://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz'
+    helixfold3_uniprot_trembl_link = 'ftp://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_trembl.fasta.gz'
+    helixfold3_pdb_seqres_link = "${params.pdb_seqres_link}"
+    helixfold3_uniref90_link = 'ftp://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref90/uniref90.fasta.gz'
+    helixfold3_mgnify_link = 'https://storage.googleapis.com/alphafold-databases/casp14_versions/mgy_clusters_2018_12.fa.gz'
+    helixfold3_pdb_mmcif_link = 'rsync.rcsb.org::ftp_data/structures/divided/mmCIF/'
+    helixfold3_pdb_obsolete_link = 'ftp://ftp.wwpdb.org/pub/pdb/data/status/obsolete.dat'
+
+    // Helixfold3 paths
+    helixfold3_uniclust30_path = "${params.helixfold3_db}/uniclust30/*"
+    helixfold3_ccd_preprocessed_path = "${params.helixfold3_db}/ccd_preprocessed_etkdg.pkl.gz"
+    helixfold3_rfam_path = "${params.helixfold3_db}/Rfam-14.9_rep_seq.fasta"
+    helixfold3_init_models_path = "${params.helixfold3_db}/HelixFold3-240814.pdparams"
+    helixfold3_bfd_path = "${params.helixfold3_db}/bfd/*"
+    helixfold3_small_bfd_path = "${params.helixfold3_db}/small_bfd/*"
+    helixfold3_uniprot_path = "${params.helixfold3_db}/uniprot/*"
+    helixfold3_pdb_seqres_path = "${params.helixfold3_db}/pdb_seqres/*"
+    helixfold3_uniref90_path = "${params.helixfold3_db}/uniref90/*"
+    helixfold3_mgnify_path = "${params.helixfold3_db}/mgnify/*"
+    helixfold3_pdb_mmcif_path = "${params.helixfold3_db}/pdb_mmcif/*"
+    helixfold3_maxit_src_path = "${params.helixfold3_db}/maxit-v11.200-prod-src"
+
     // Esmfold links
     esmfold_3B_v1 = 'https://dl.fbaipublicfiles.com/fair-esm/models/esmfold_3B_v1.pt'
     esm2_t36_3B_UR50D = 'https://dl.fbaipublicfiles.com/fair-esm/models/esm2_t36_3B_UR50D.pt'
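
Reviewer note (illustrative, not part of the commit): every `helixfold3_*_path` entry in this hunk is derived from `params.helixfold3_db`, so a pre-downloaded database directory has to follow that layout. The sketch below is one possible way to check a local directory before a run. The path suffixes are copied from the hunk; the dictionary name, the function name `missing_entries`, and the idea of checking with `glob` are assumptions, not something the pipeline is shown to do.

```python
from glob import glob
from pathlib import Path

# Suffixes copied from the helixfold3_*_path entries in conf/dbs.config.
HELIXFOLD3_DB_LAYOUT = {
    "uniclust30": "uniclust30/*",
    "ccd_preprocessed": "ccd_preprocessed_etkdg.pkl.gz",
    "rfam": "Rfam-14.9_rep_seq.fasta",
    "init_models": "HelixFold3-240814.pdparams",
    "bfd": "bfd/*",
    "small_bfd": "small_bfd/*",
    "uniprot": "uniprot/*",
    "pdb_seqres": "pdb_seqres/*",
    "uniref90": "uniref90/*",
    "mgnify": "mgnify/*",
    "pdb_mmcif": "pdb_mmcif/*",
    "maxit_src": "maxit-v11.200-prod-src",
}


def missing_entries(db_root):
    """Return the layout keys whose pattern matches nothing under db_root."""
    root = Path(db_root)
    return sorted(
        key
        for key, pattern in HELIXFOLD3_DB_LAYOUT.items()
        if not glob(str(root / pattern))
    )


if __name__ == "__main__":
    # Hypothetical location; replace with the directory passed to --helixfold3_db.
    print(missing_entries("/path/to/helixfold3_db"))
```

An empty list means every expected file or directory is present; it says nothing about whether the contents are complete or uncorrupted.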

conf/modules_helixfold3.config

Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
+/*
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+    Config file for defining DSL2 per module options and publishing paths
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+    Available keys to override module options:
+        ext.args = Additional arguments appended to command in module.
+        ext.args2 = Second set of arguments appended to command in module (multi-tool modules).
+        ext.args3 = Third set of arguments appended to command in module (multi-tool modules).
+        ext.prefix = File name prefix for output files.
+----------------------------------------------------------------------------------------
+*/
+
+process {
+    withName: 'GUNZIP|COMBINE_UNIPROT|DOWNLOAD_PDBMMCIF|ARIA2_PDB_SEQRES' {
+        publishDir = [
+            path: {"${params.outdir}/DBs/helixfold3/"},
+            mode: 'symlink',
+            saveAs: { filename -> filename.equals('versions.yml') ? null : filename },
+        ]
+    }
+
+    withName: 'RUN_HELIXFOLD3' {
+        if(params.use_gpu) { accelerator = 1 }
+        publishDir = [
+            path: { "${params.outdir}/helixfold3/" },
+            mode: 'copy',
+            saveAs: { filename -> filename.equals('versions.yml') ? null : filename },
+            pattern: '*.*'
+        ]
+    }
+
+    withName: 'NFCORE_PROTEINFOLD:HELIXFOLD3:MULTIQC' {
+        publishDir = [
+            path: { "${params.outdir}/multiqc" },
+            mode: 'copy',
+            saveAs: { filename -> filename.equals('versions.yml') ? null : "helixfold3_$filename" }
+        ]
+    }
+}
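
Reviewer note (illustrative, not part of the commit): all three `saveAs` closures in this config follow the same rule. A file named `versions.yml` maps to `null`, which Nextflow treats as "do not publish", and every other file keeps its name, with the MultiQC block adding a `helixfold3_` prefix. The Python function below only mirrors that decision logic for readers less familiar with Groovy closures; the name `save_as` and the `prefix` argument are hypothetical.

```python
def save_as(filename, prefix=""):
    """Mirror the saveAs closures from conf/modules_helixfold3.config."""
    if filename == "versions.yml":
        # Equivalent to returning null from the Groovy closure: file is not published.
        return None
    # All other files are published, optionally under a prefixed name.
    return f"{prefix}{filename}"


if __name__ == "__main__":
    print(save_as("versions.yml"))                        # skipped
    print(save_as("T1024_helixfold3.pdb"))                # published unchanged
    print(save_as("multiqc_report.html", "helixfold3_"))  # published with prefix
```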

conf/test_helixfold3.config

Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
+/*
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+    Nextflow config file for running minimal tests
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+    Defines input files and everything required to run a fast and simple pipeline test.
+    Use as follows:
+        nextflow run nf-core/proteinfold -profile test_helixfold3,<docker/singularity> --outdir <OUTDIR>
+----------------------------------------------------------------------------------------
+*/
+
+stubRun = true
+
+// Limit resources so that this can run on GitHub Actions
+process {
+    resourceLimits = [
+        cpus: 4,
+        memory: '15.GB',
+        time: '1.h'
+    ]
+}
+
+params {
+    config_profile_name = 'Test profile'
+    config_profile_description = 'Minimal test dataset to check pipeline function'
+
+    // Input data to test helixfold3
+    mode = 'helixfold3'
+    helixfold3_db = "${projectDir}/assets/dummy_db_dir"
+    input = params.pipelines_testdata_base_path + 'proteinfold/testdata/samplesheet/v1.2/samplesheet.csv'
+}
+
+process {
+    withName: 'RUN_HELIXFOLD3' {
+        container = 'biocontainers/gawk:5.1.0'
+    }
+}
+
Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
+FROM nvidia/cuda:12.1.1-cudnn8-devel-ubuntu22.04
+
+LABEL Author="[email protected]" \
+    title="nfcore/proteinfold_helixfold3" \
+    Version="0.9.0" \
+    description="Docker image containing all software requirements to run the RUN_HELIXFOLD3 module using the nf-core/proteinfold pipeline"
+
+ENV PYTHONPATH="/app/helixfold3:$PYTHONPATH" \
+    PATH="/conda/bin:/app/helixfold3:$PATH" \
+    PYTHON_BIN="/conda/envs/helixfold/bin/python3.9" \
+    ENV_BIN="/conda/envs/helixfold/bin" \
+    OBABEL_BIN="/conda/envs/helixfold/bin"
+
+RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install --no-install-recommends -y wget git && \
+    wget -q -P /tmp "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh" && \
+    bash /tmp/Miniforge3-$(uname)-$(uname -m).sh -b -p /conda && \
+    rm -rf /tmp/Miniforge3-$(uname)-$(uname -m).sh /var/lib/apt/lists/* && \
+    apt-get autoremove -y && apt-get clean -y
+
+RUN git clone --single-branch --branch dev --depth 1 --no-checkout https://github.com/PaddlePaddle/PaddleHelix.git /app/helixfold3 && \
+    cd /app/helixfold3 && \
+    git sparse-checkout init --cone && \
+    git sparse-checkout set apps/protein_folding/helixfold3 && \
+    git checkout dev && \
+    mv apps/protein_folding/helixfold3/* . && \
+    rm -rf apps
+
+COPY environment_nfcore-proteinfold_helixfold3.yaml /app/helixfold3/
+RUN /conda/bin/mamba env create --file=/app/helixfold3/environment_nfcore-proteinfold_helixfold3.yaml && \
+    /conda/bin/mamba install -y -c bioconda aria2 hmmer==3.3.2 kalign2==2.04 hhsuite==3.3.0 -n helixfold && \
+    /conda/bin/mamba install -y -c conda-forge openbabel -n helixfold && \
+    /conda/bin/mamba clean --all --force-pkgs-dirs -y && \
+    rm -rf /root/.cache && \
+    apt-get autoremove -y && apt-get remove --purge -y wget git && apt-get clean -y
Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
+name: helixfold
+channels:
+  - conda-forge
+  - bioconda
+  - nvidia
+  - biocore
+
+dependencies:
+  - python=3.9
+  - cuda-toolkit=12.0
+  - cudnn=8.4.0
+  - nccl=2.14
+  - libgcc
+  - libgomp
+  - pip
+  - aria2
+  - hmmer==3.4
+  - kalign2==2.04
+  - hhsuite==3.3.0
+  - openbabel
+  - pip:
+      - paddlepaddle-gpu==2.6.1 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html
+      - absl-py==0.13.0
+      - biopython==1.79
+      - chex==0.0.7
+      - dm-haiku==0.0.4
+      - dm-tree==0.1.6
+      - docker==5.0.0
+      - immutabledict==2.0.0
+      - jax==0.2.14
+      - ml-collections==0.1.0
+      - pandas==1.3.4
+      - scipy==1.9.0
+      - rdkit-pypi==2022.9.5
+      - posebusters

docs/output.md

Lines changed: 13 additions & 0 deletions
@@ -14,6 +14,7 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and predicts pr
 - [ColabFold](https://github.com/sokrypton/ColabFold) - MMseqs2 (API server or local search) followed by ColabFold
 - [ESMFold](https://github.com/facebookresearch/esm)
 - [RoseTTAFold-All-Atom](https://github.com/baker-laboratory/RoseTTAFold-All-Atom/)
+- [HelixFold3](https://github.com/PaddlePaddle/PaddleHelix/tree/dev/apps/protein_folding/helixfold3)
 
 See main [README.md](https://github.com/nf-core/proteinfold/blob/master/README.md) for a condensed overview of the steps in the pipeline, and the bioinformatics tools used at each step.
 
@@ -190,6 +191,18 @@ Below you can find an indicative example of the TSV file with the pLDDT scores p
 
 </details>
 
+### HelixFold3
+
+<details markdown="1">
+<summary>Output files</summary>
+
+- `run/`
+  - `<SEQUENCE NAME>_helixfold3.pdb` that is the structure with the highest pLDDT score (ranked first)
+  - `<SEQUENCE NAME>_plddt_mqc.tsv` that presents the pLDDT scores per residue for the predicted model
+  - `<SEQUENCE NAME>/` that contains the computed MSAs, prediction metadata, ranked structures, raw model outputs etc.
+
+</details>
+
 ### MultiQC report
 
 <details markdown="1">
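
Reviewer note (illustrative, not part of the commit): the new docs section describes a `<SEQUENCE NAME>_plddt_mqc.tsv` file holding per-residue pLDDT scores, but the diff does not show how `pdb_to_lddt` in `bin/generate_report.py` computes them. The sketch below shows one common approach under two stated assumptions: that the predicted PDB stores pLDDT in the B-factor column, as AlphaFold-style outputs conventionally do, and that one score per residue is taken from the CA atom. The function names and the TSV header are hypothetical.

```python
def plddt_per_residue(pdb_text):
    """Return (residue_number, b_factor) for every CA atom in PDB-format text."""
    scores = []
    for line in pdb_text.splitlines():
        # Fixed-width PDB columns: atom name 13-16, residue number 23-26, B-factor 61-66.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores.append((int(line[22:26]), float(line[60:66])))
    return scores


def to_tsv(scores):
    """Render the scores as a two-column TSV; the header names are hypothetical."""
    rows = ["Position\tpLDDT"]
    rows += [f"{position}\t{score:.2f}" for position, score in scores]
    return "\n".join(rows) + "\n"


if __name__ == "__main__":
    import sys

    # Usage: python plddt_sketch.py <SEQUENCE NAME>_helixfold3.pdb
    with open(sys.argv[1]) as handle:
        sys.stdout.write(to_tsv(plddt_per_residue(handle.read())))
```

Restricting the scan to CA atoms only covers protein residues; ligands or nucleic acids in a HelixFold3 prediction would need a different selection rule.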
