Commit fb59ee1

Add initial files for refactor and upgrade
1 parent 696ffdf commit fb59ee1

18 files changed (+2850 -78 lines changed)

.gitignore (+12 -1)
@@ -111,6 +111,12 @@ venv.tar.gz
.idea
.vscode

+# TensorBoard
+tb_logs/
+
+# Feature Processing
+*work_filenames*.csv
+
# DIPS
project/datasets/DIPS/complexes/**
project/datasets/DIPS/interim/**
@@ -119,13 +125,15 @@ project/datasets/DIPS/parsed/**
project/datasets/DIPS/raw/**
project/datasets/DIPS/final/raw/**
project/datasets/DIPS/final/final_raw_dips.tar.gz*
+project/datasets/DIPS/final/processed/**

# DB5
project/datasets/DB5/processed/**
project/datasets/DB5/raw/**
project/datasets/DB5/interim/**
project/datasets/DB5/final/raw/**
project/datasets/DB5/final/final_raw_db5.tar.gz*
+project/datasets/DB5/final/processed/**

# EVCoupling
project/datasets/EVCoupling/raw/**
@@ -137,4 +145,7 @@ project/datasets/EVCoupling/final/processed/**
project/datasets/CASP-CAPRI/raw/**
project/datasets/CASP-CAPRI/interim/**
project/datasets/CASP-CAPRI/final/raw/**
-project/datasets/CASP-CAPRI/final/processed/**
+project/datasets/CASP-CAPRI/final/processed/**
+
+# Input
+project/datasets/Input/**
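
To verify ignore rules like those added above, `git check-ignore -v` reports the rule that matches a given path; a minimal sketch (the file paths here are illustrative, not from this commit):

```bash
# -v prints the matching rule as <source>:<line>:<pattern> for each path
git check-ignore -v tb_logs/version_0/events.out
git check-ignore -v project/datasets/Input/example.pdb
```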

README.md (+137 -58)
@@ -26,7 +26,7 @@ The Enhanced Database of Interacting Protein Structures for Interface Prediction
* Benchmark results included in our paper were run after this issue was resolved
* However, if you ran experiments using DB5-Plus' filename list for its test complexes, please re-run them using the latest list

-## How to run creation tools
+## How to set up

First, download Mamba (if not already downloaded):
```bash
@@ -51,66 +51,135 @@ conda activate DIPS-Plus # Note: One still needs to use `conda` to (de)activate
pip3 install -e .
```

-## Default DIPS-Plus directory structure
+To install PSAIA for feature generation, first install GCC 10:
+
+```bash
+# Install GCC 10 for Ubuntu 20.04:
+sudo apt install software-properties-common
+sudo add-apt-repository ppa:ubuntu-toolchain-r/ppa
+sudo apt update
+sudo apt install gcc-10 g++-10
+
+# Or install GCC 10 for Arch Linux/Manjaro:
+yay -S gcc10
+```
+
+Then install Qt 4 for PSAIA:
+
+```bash
+# Install Qt 4 for Ubuntu 20.04:
+sudo add-apt-repository ppa:rock-core/qt4
+sudo apt update
+sudo apt install libqt4* libqtcore4 libqtgui4 libqtwebkit4 qt4* libxext-dev
+
+# Or install Qt 4 for Arch Linux/Manjaro:
+yay -S qt4
+```
+
+Conclude by compiling PSAIA from source:
+
+```bash
+# Select the location to install the software:
+MY_LOCAL=~/Programs
+
+# Download and extract PSAIA's source code:
+mkdir "$MY_LOCAL"
+cd "$MY_LOCAL"
+wget http://complex.zesoi.fer.hr/data/PSAIA-1.0-source.tar.gz
+tar -xvzf PSAIA-1.0-source.tar.gz
+
+# Compile PSAIA (i.e., a GUI for PSA):
+cd PSAIA_1.0_source/make/linux/psaia/
+qmake-qt4 psaia.pro
+make
+
+# Compile PSA (i.e., the protein structure analysis (PSA) program):
+cd ../psa/
+qmake-qt4 psa.pro
+make
+
+# Compile PIA (i.e., the protein interaction analysis (PIA) program):
+cd ../pia/
+qmake-qt4 pia.pro
+make
+
+# Test run any of the above-compiled programs:
+cd "$MY_LOCAL"/PSAIA_1.0_source/bin/linux
+# Test run PSA inside a GUI:
+./psaia/psaia
+# Test run PIA through a terminal:
+./pia/pia
+# Test run PSA through a terminal:
+./psa/psa
+```
+
+Lastly, install Docker by following the instructions at https://docs.docker.com/engine/install/.
+
+## How to generate protein feature inputs
+In our [feature generation notebook](notebooks/feature_generation.ipynb), we provide examples of how users can generate the protein features described in our [accompanying manuscript](https://arxiv.org/abs/2106.04362) for individual protein inputs.
+
+## How to use data
+In our [data usage notebook](notebooks/data_usage.ipynb), we provide examples of how users might use DIPS-Plus (or DB5-Plus) for downstream analysis or prediction tasks. For example, to train a new NeiA model with DB5-Plus as its cross-validation dataset, first download DB5-Plus' raw files and process them via the `data_usage` notebook:
+
+```bash
+mkdir -p project/datasets/DB5/final
+wget https://zenodo.org/record/5134732/files/final_raw_db5.tar.gz -O project/datasets/DB5/final/final_raw_db5.tar.gz
+tar -xzf project/datasets/DB5/final/final_raw_db5.tar.gz -C project/datasets/DB5/final/
+
+# To process these raw files for training and subsequently train a model:
+python3 notebooks/data_usage.py
+```
+
+## Standard DIPS-Plus directory structure

```
DIPS-Plus

└───project
-│ │
-│ └───datasets
-│ │ │
-│ │ └───builder
-│ │ │
-│ │ └───DB5
-│ │ │ │
-│ │ │ └───final
-│ │ │ │ │
-│ │ │ │ └───raw
-│ │ │ │
-│ │ │ └───interim
-│ │ │ │ │
-│ │ │ │ └───complexes
-│ │ │ │ │
-│ │ │ │ └───external_feats
-│ │ │ │ │
-│ │ │ │ └───pairs
-│ │ │ │
-│ │ │ └───raw
-│ │ │ │
-│ │ │ README
-│ │ │
-│ │ └───DIPS
-│ │ │
-│ │ └───filters
-│ │ │
-│ │ └───final
-│ │ │ │
-│ │ │ └───raw
-│ │ │
-│ │ └───interim
-│ │ │ │
-│ │ │ └───complexes
-│ │ │ │
-│ │ │ └───external_feats
-│ │ │ │
-│ │ │ └───pairs-pruned
-│ │ │
-│ │ └───raw
-│ │ │
-│ │ └───pdb
-│ │
-│ └───utils
-│ constants.py
-│ utils.py
-
-.gitignore
-environment.yml
-LICENSE
-README.md
-requirements.txt
-setup.cfg
-setup.py
+
+└───datasets
+
+└───DB5
+│ │
+│ └───final
+│ │ │
+│ │ └───processed # task-ready features for each dataset example
+│ │ │
+│ │ └───raw # generic features for each dataset example
+│ │
+│ └───interim
+│ │ │
+│ │ └───complexes # metadata for each dataset example
+│ │ │
+│ │ └───external_feats # features curated for each dataset example using external tools
+│ │ │
+│ │ └───pairs # pair-wise features for each dataset example
+│ │
+│ └───raw # raw PDB data downloads for each dataset example
+
+└───DIPS
+
+└───filters # filters to apply to each (un-pruned) dataset example
+
+└───final
+│ │
+│ └───processed # task-ready features for each dataset example
+│ │
+│ └───raw # generic features for each dataset example
+
+└───interim
+│ │
+│ └───complexes # metadata for each dataset example
+│ │
+│ └───external_feats # features curated for each dataset example using external tools
+│ │
+│ └───pairs-pruned # filtered pair-wise features for each dataset example
+│ │
+│ └───parsed # pair-wise features for each dataset example after initial parsing
+
+└───raw
+
+└───pdb # raw PDB data downloads for each dataset example
```

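Two quick sanity checks for the setup and layout above, as a minimal sketch: the `hello-world` image is Docker's standard test image, and the loop merely probes a few directories named in the tree (the loop itself is illustrative, not from this commit):

```bash
# Smoke-test the Docker installation required above:
docker run --rm hello-world

# Confirm that key directories from the tree above exist locally:
for d in DB5/final/processed DB5/final/raw DIPS/interim/pairs-pruned DIPS/final/processed; do
  [ -d "project/datasets/$d" ] && echo "OK: project/datasets/$d" || echo "missing: project/datasets/$d"
done
```
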
## How to compile DIPS-Plus from scratch
@@ -122,7 +191,7 @@ Retrieve protein complexes from the RCSB PDB and build out directory structure:
rm project/datasets/DIPS/final/raw/pairs-postprocessed.txt project/datasets/DIPS/final/raw/pairs-postprocessed-train.txt project/datasets/DIPS/final/raw/pairs-postprocessed-val.txt project/datasets/DIPS/final/raw/pairs-postprocessed-test.txt

# Create data directories (if not already created):
-mkdir project/datasets/DIPS/raw project/datasets/DIPS/raw/pdb project/datasets/DIPS/interim project/datasets/DIPS/interim/external_feats project/datasets/DIPS/final project/datasets/DIPS/final/raw project/datasets/DIPS/final/processed
+mkdir project/datasets/DIPS/raw project/datasets/DIPS/raw/pdb project/datasets/DIPS/interim project/datasets/DIPS/interim/pairs-pruned project/datasets/DIPS/interim/external_feats project/datasets/DIPS/final project/datasets/DIPS/final/raw project/datasets/DIPS/final/processed

# Download the raw PDB files:
rsync -rlpt -v -z --delete --port=33444 --include='*.gz' --include='*.xz' --include='*/' --exclude '*' \
@@ -139,7 +208,17 @@ python3 project/datasets/builder/prune_pairs.py project/datasets/DIPS/interim/pa

# Generate externally-sourced features:
python3 project/datasets/builder/generate_psaia_features.py "$PSAIADIR" "$PROJDIR"/project/datasets/builder/psaia_config_file_dips.txt "$PROJDIR"/project/datasets/DIPS/raw/pdb "$PROJDIR"/project/datasets/DIPS/interim/parsed "$PROJDIR"/project/datasets/DIPS/interim/pairs-pruned "$PROJDIR"/project/datasets/DIPS/interim/external_feats --source_type rcsb
-python3 project/datasets/builder/generate_hhsuite_features.py "$PROJDIR"/project/datasets/DIPS/interim/parsed "$PROJDIR"/project/datasets/DIPS/interim/pairs-pruned "$HHSUITE_DB" "$PROJDIR"/project/datasets/DIPS/interim/external_feats --num_cpu_jobs 4 --num_cpus_per_job 8 --num_iter 2 --source_type rcsb --write_file
+python3 project/datasets/builder/generate_hhsuite_features.py "$PROJDIR"/project/datasets/DIPS/interim/parsed "$PROJDIR"/project/datasets/DIPS/interim/pairs-pruned "$HHSUITE_DB" "$PROJDIR"/project/datasets/DIPS/interim/external_feats --num_cpu_jobs 4 --num_cpus_per_job 8 --num_iter 2 --source_type rcsb --write_file # Note: After this, one needs to re-run this command with `--read_file` instead
+
+# Generate multiple sequence alignments (MSAs) using a smaller sequence database (if not already created using the standard BFD):
+DOWNLOAD_DIR="$HHSUITE_DB_DIR" && ROOT_DIR="${DOWNLOAD_DIR}/small_bfd" && SOURCE_URL="https://storage.googleapis.com/alphafold-databases/reduced_dbs/bfd-first_non_consensus_sequences.fasta.gz" && BASENAME=$(basename "${SOURCE_URL}") && mkdir --parents "${ROOT_DIR}" && aria2c "${SOURCE_URL}" --dir="${ROOT_DIR}" && pushd "${ROOT_DIR}" && gunzip "${ROOT_DIR}/${BASENAME}" && popd # e.g., Download the small BFD
+python3 project/datasets/builder/generate_hhsuite_features.py "$PROJDIR"/project/datasets/DIPS/interim/parsed "$PROJDIR"/project/datasets/DIPS/interim/pairs-pruned "$HHSUITE_DB_DIR"/small_bfd "$PROJDIR"/project/datasets/DIPS/interim/external_feats --num_cpu_jobs 4 --num_cpus_per_job 8 --num_iter 2 --source_type rcsb --generate_msa_only --write_file # Note: After this, one needs to re-run this command with `--read_file` instead
+
+# Identify interfaces within intrinsically disordered regions (IDRs):
+# (1) Pull down the Docker image for `flDPnn`
+docker pull docker.io/sinaghadermarzi/fldpnn
+# (2) For all sequences in the dataset, predict which interface residues reside within IDRs
+python3 project/datasets/builder/annotate_idr_interfaces.py "$PROJDIR"/project/datasets/DIPS/final/raw

# Add new features to the filtered pairs, ensuring that the pruned pairs' original PDB files are stored locally for DSSP:
python3 project/datasets/builder/download_missing_pruned_pair_pdbs.py "$PROJDIR"/project/datasets/DIPS/raw/pdb "$PROJDIR"/project/datasets/DIPS/interim/pairs-pruned --num_cpus 32 --rank "$1" --size "$2"
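
Per the notes above, each `generate_hhsuite_features.py` command is meant to be run twice; a sketch of the two-pass sequence, assuming `--read_file` simply replaces `--write_file` with every other argument unchanged (the first pass writes the `*work_filenames*.csv` files that `.gitignore` now excludes):

```bash
# Pass 1: enumerate the remaining work and write it to work-filename CSVs:
python3 project/datasets/builder/generate_hhsuite_features.py "$PROJDIR"/project/datasets/DIPS/interim/parsed "$PROJDIR"/project/datasets/DIPS/interim/pairs-pruned "$HHSUITE_DB" "$PROJDIR"/project/datasets/DIPS/interim/external_feats --num_cpu_jobs 4 --num_cpus_per_job 8 --num_iter 2 --source_type rcsb --write_file

# Pass 2: re-run identically, reading the previously-written work filenames:
python3 project/datasets/builder/generate_hhsuite_features.py "$PROJDIR"/project/datasets/DIPS/interim/parsed "$PROJDIR"/project/datasets/DIPS/interim/pairs-pruned "$HHSUITE_DB" "$PROJDIR"/project/datasets/DIPS/interim/external_feats --num_cpu_jobs 4 --num_cpus_per_job 8 --num_iter 2 --source_type rcsb --read_file
```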
