@@ -26,7 +26,7 @@ The Enhanced Database of Interacting Protein Structures for Interface Prediction
* Benchmark results included in our paper were run after this issue was resolved
* However, if you ran experiments using DB5-Plus' filename list for its test complexes, please re-run them using the latest list

- ## How to run creation tools
+ ## How to set up

First, download Mamba (if not already downloaded):

```bash
@@ -51,66 +51,135 @@ conda activate DIPS-Plus # Note: One still needs to use `conda` to (de)activate
pip3 install -e .
```

- ## Default DIPS-Plus directory structure
+ To install PSAIA for feature generation, first install GCC 10, which is required to compile PSAIA:
+
+ ```bash
+ # Install GCC 10 for Ubuntu 20.04:
+ sudo apt install software-properties-common
+ sudo add-apt-repository ppa:ubuntu-toolchain-r/ppa
+ sudo apt update
+ sudo apt install gcc-10 g++-10
+
+ # Or install GCC 10 for Arch Linux/Manjaro:
+ yay -S gcc10
+ ```
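+
+ To confirm the new compilers are on your `PATH`, a quick sanity check (binary names may vary by distribution):
+
+ ```bash
+ gcc-10 --version
+ g++-10 --version
+ ```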
+
+ Then install QT4 for PSAIA:
+
+ ```bash
+ # Install QT4 for Ubuntu 20.04:
+ sudo add-apt-repository ppa:rock-core/qt4
+ sudo apt update
+ sudo apt install libqt4* libqtcore4 libqtgui4 libqtwebkit4 qt4* libxext-dev
+
+ # Or install QT4 for Arch Linux/Manjaro:
+ yay -S qt4
+ ```
+
+ Conclude by compiling PSAIA from source:
+
+ ```bash
+ # Select the location to install the software:
+ MY_LOCAL=~/Programs
+
+ # Download and extract PSAIA's source code:
+ mkdir "$MY_LOCAL"
+ cd "$MY_LOCAL"
+ wget http://complex.zesoi.fer.hr/data/PSAIA-1.0-source.tar.gz
+ tar -xvzf PSAIA-1.0-source.tar.gz
+
+ # Compile PSAIA (i.e., a GUI for PSA):
+ cd PSAIA_1.0_source/make/linux/psaia/
+ qmake-qt4 psaia.pro
+ make
+
+ # Compile PSA (i.e., the protein structure analysis (PSA) program):
+ cd ../psa/
+ qmake-qt4 psa.pro
+ make
+
+ # Compile PIA (i.e., the protein interaction analysis (PIA) program):
+ cd ../pia/
+ qmake-qt4 pia.pro
+ make
+
+ # Test run any of the above-compiled programs:
+ cd "$MY_LOCAL"/PSAIA_1.0_source/bin/linux
+ # Test run PSAIA inside a GUI:
+ ./psaia/psaia
+ # Test run PIA through a terminal:
+ ./pia/pia
+ # Test run PSA through a terminal:
+ ./psa/psa
+ ```
+
+ Lastly, install Docker by following the instructions at https://docs.docker.com/engine/install/
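+
+ To confirm Docker is installed and the daemon is running, one can use a minimal sanity check such as:
+
+ ```bash
+ docker --version
+ docker run --rm hello-world  # Pulls and runs Docker's official test image
+ ```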
+
+ ## How to generate protein feature inputs
+ In our [feature generation notebook](notebooks/feature_generation.ipynb), we provide examples of how users can generate the protein features described in our [accompanying manuscript](https://arxiv.org/abs/2106.04362) for individual protein inputs.
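+
+ For example, one way to execute this notebook non-interactively (assuming Jupyter is available in the `DIPS-Plus` environment) is:
+
+ ```bash
+ jupyter nbconvert --to notebook --execute notebooks/feature_generation.ipynb --output feature_generation_executed.ipynb
+ ```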
+
+ ## How to use data
+ In our [data usage notebook](notebooks/data_usage.ipynb), we provide examples of how users might use DIPS-Plus (or DB5-Plus) for downstream analysis or prediction tasks. For example, to train a new NeiA model with DB5-Plus as its cross-validation dataset, first download DB5-Plus' raw files and process them via the `data_usage` notebook:
+
+ ```bash
+ mkdir -p project/datasets/DB5/final
+ wget https://zenodo.org/record/5134732/files/final_raw_db5.tar.gz -O project/datasets/DB5/final/final_raw_db5.tar.gz
+ tar -xzf project/datasets/DB5/final/final_raw_db5.tar.gz -C project/datasets/DB5/final/
+
+ # To process these raw files for training and subsequently train a model:
+ python3 notebooks/data_usage.py
+ ```
+
+ ## Standard DIPS-Plus directory structure

```
DIPS-Plus
│
└───project
- │    │
- │    └───datasets
- │    │    │
- │    │    └───builder
- │    │    │
- │    │    └───DB5
- │    │    │    │
- │    │    │    └───final
- │    │    │    │    │
- │    │    │    │    └───raw
- │    │    │    │
- │    │    │    └───interim
- │    │    │    │    │
- │    │    │    │    └───complexes
- │    │    │    │    │
- │    │    │    │    └───external_feats
- │    │    │    │    │
- │    │    │    │    └───pairs
- │    │    │    │
- │    │    │    └───raw
- │    │    │    │
- │    │    │    README
- │    │    │
- │    │    └───DIPS
- │    │         │
- │    │         └───filters
- │    │         │
- │    │         └───final
- │    │         │    │
- │    │         │    └───raw
- │    │         │
- │    │         └───interim
- │    │         │    │
- │    │         │    └───complexes
- │    │         │    │
- │    │         │    └───external_feats
- │    │         │    │
- │    │         │    └───pairs-pruned
- │    │         │
- │    │         └───raw
- │    │              │
- │    │              └───pdb
- │    │
- │    └───utils
- │         constants.py
- │         utils.py
- │
- .gitignore
- environment.yml
- LICENSE
- README.md
- requirements.txt
- setup.cfg
- setup.py
+     │
+     └───datasets
+         │
+         └───DB5
+         │    │
+         │    └───final
+         │    │    │
+         │    │    └───processed  # task-ready features for each dataset example
+         │    │    │
+         │    │    └───raw  # generic features for each dataset example
+         │    │
+         │    └───interim
+         │    │    │
+         │    │    └───complexes  # metadata for each dataset example
+         │    │    │
+         │    │    └───external_feats  # features curated for each dataset example using external tools
+         │    │    │
+         │    │    └───pairs  # pair-wise features for each dataset example
+         │    │
+         │    └───raw  # raw PDB data downloads for each dataset example
+         │
+         └───DIPS
+              │
+              └───filters  # filters to apply to each (un-pruned) dataset example
+              │
+              └───final
+              │    │
+              │    └───processed  # task-ready features for each dataset example
+              │    │
+              │    └───raw  # generic features for each dataset example
+              │
+              └───interim
+              │    │
+              │    └───complexes  # metadata for each dataset example
+              │    │
+              │    └───external_feats  # features curated for each dataset example using external tools
+              │    │
+              │    └───pairs-pruned  # filtered pair-wise features for each dataset example
+              │    │
+              │    └───parsed  # pair-wise features for each dataset example after initial parsing
+              │
+              └───raw
+                   │
+                   └───pdb  # raw PDB data downloads for each dataset example
```
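
To recreate this skeleton manually, one can use a convenience sketch such as the following (derived from the tree above; the build scripts below also create most of these directories):

```bash
mkdir -p project/datasets/DB5/{final/{raw,processed},interim/{complexes,external_feats,pairs},raw}
mkdir -p project/datasets/DIPS/{filters,final/{raw,processed},interim/{complexes,external_feats,pairs-pruned,parsed},raw/pdb}
```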
## How to compile DIPS-Plus from scratch
@@ -122,7 +191,7 @@ Retrieve protein complexes from the RCSB PDB and build out directory structure:
rm project/datasets/DIPS/final/raw/pairs-postprocessed.txt project/datasets/DIPS/final/raw/pairs-postprocessed-train.txt project/datasets/DIPS/final/raw/pairs-postprocessed-val.txt project/datasets/DIPS/final/raw/pairs-postprocessed-test.txt
# Create data directories (if not already created):
- mkdir project/datasets/DIPS/raw project/datasets/DIPS/raw/pdb project/datasets/DIPS/interim project/datasets/DIPS/interim/external_feats project/datasets/DIPS/final project/datasets/DIPS/final/raw project/datasets/DIPS/final/processed
+ mkdir project/datasets/DIPS/raw project/datasets/DIPS/raw/pdb project/datasets/DIPS/interim project/datasets/DIPS/interim/pairs-pruned project/datasets/DIPS/interim/external_feats project/datasets/DIPS/final project/datasets/DIPS/final/raw project/datasets/DIPS/final/processed
# Download the raw PDB files:
rsync -rlpt -v -z --delete --port=33444 --include='*.gz' --include='*.xz' --include='*/' --exclude '*' \
@@ -139,7 +208,17 @@ python3 project/datasets/builder/prune_pairs.py project/datasets/DIPS/interim/pa
# Generate externally-sourced features:
python3 project/datasets/builder/generate_psaia_features.py "$PSAIADIR" "$PROJDIR"/project/datasets/builder/psaia_config_file_dips.txt "$PROJDIR"/project/datasets/DIPS/raw/pdb "$PROJDIR"/project/datasets/DIPS/interim/parsed "$PROJDIR"/project/datasets/DIPS/interim/pairs-pruned "$PROJDIR"/project/datasets/DIPS/interim/external_feats --source_type rcsb
- python3 project/datasets/builder/generate_hhsuite_features.py "$PROJDIR"/project/datasets/DIPS/interim/parsed "$PROJDIR"/project/datasets/DIPS/interim/pairs-pruned "$HHSUITE_DB" "$PROJDIR"/project/datasets/DIPS/interim/external_feats --num_cpu_jobs 4 --num_cpus_per_job 8 --num_iter 2 --source_type rcsb --write_file
+ python3 project/datasets/builder/generate_hhsuite_features.py "$PROJDIR"/project/datasets/DIPS/interim/parsed "$PROJDIR"/project/datasets/DIPS/interim/pairs-pruned "$HHSUITE_DB" "$PROJDIR"/project/datasets/DIPS/interim/external_feats --num_cpu_jobs 4 --num_cpus_per_job 8 --num_iter 2 --source_type rcsb --write_file  # Note: After this, one needs to re-run this command with `--read_file` instead
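+ # For example, the follow-up pass that consumes the cached outputs repeats the command above with `--read_file` swapped in for `--write_file`:
+ python3 project/datasets/builder/generate_hhsuite_features.py "$PROJDIR"/project/datasets/DIPS/interim/parsed "$PROJDIR"/project/datasets/DIPS/interim/pairs-pruned "$HHSUITE_DB" "$PROJDIR"/project/datasets/DIPS/interim/external_feats --num_cpu_jobs 4 --num_cpus_per_job 8 --num_iter 2 --source_type rcsb --read_file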
+
+ # Generate multiple sequence alignments (MSAs) using a smaller sequence database (if not already created using the standard BFD):
+ DOWNLOAD_DIR="$HHSUITE_DB_DIR" && ROOT_DIR="${DOWNLOAD_DIR}/small_bfd" && SOURCE_URL="https://storage.googleapis.com/alphafold-databases/reduced_dbs/bfd-first_non_consensus_sequences.fasta.gz" && BASENAME=$(basename "${SOURCE_URL}") && mkdir --parents "${ROOT_DIR}" && aria2c "${SOURCE_URL}" --dir="${ROOT_DIR}" && pushd "${ROOT_DIR}" && gunzip "${ROOT_DIR}/${BASENAME}" && popd  # e.g., download the small BFD
+ python3 project/datasets/builder/generate_hhsuite_features.py "$PROJDIR"/project/datasets/DIPS/interim/parsed "$PROJDIR"/project/datasets/DIPS/interim/pairs-pruned "$HHSUITE_DB_DIR"/small_bfd "$PROJDIR"/project/datasets/DIPS/interim/external_feats --num_cpu_jobs 4 --num_cpus_per_job 8 --num_iter 2 --source_type rcsb --generate_msa_only --write_file  # Note: After this, one needs to re-run this command with `--read_file` instead
+
+ # Identify interfaces within intrinsically disordered regions (IDRs):
+ # (1) Pull down the Docker image for `flDPnn`
+ docker pull docker.io/sinaghadermarzi/fldpnn
+ # (2) For all sequences in the dataset, predict which interface residues reside within IDRs
+ python3 project/datasets/builder/annotate_idr_interfaces.py "$PROJDIR"/project/datasets/DIPS/final/raw
# Add new features to the filtered pairs, ensuring that the pruned pairs' original PDB files are stored locally for DSSP:
python3 project/datasets/builder/download_missing_pruned_pair_pdbs.py "$PROJDIR"/project/datasets/DIPS/raw/pdb "$PROJDIR"/project/datasets/DIPS/interim/pairs-pruned --num_cpus 32 --rank "$1" --size "$2"
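# `--rank` and `--size` above are read from the enclosing script's positional arguments, so a single-machine
# run might look like the following (a hypothetical invocation, assuming this snippet is saved as `compile_dips.sh`):
#   bash compile_dips.sh 0 1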