Predicting and explaining the impact of genetic disruptions and interactions on cell and organismal viability
This repository contains all the source code necessary to reproduce the results of our paper, "Predicting and explaining the impact of genetic disruptions and interactions on cell and organismal viability".
The following files are responsible for extracting features and tasks from raw bioinformatic data.
create_ppc.py creates the protein-protein interaction networks for budding yeast, fission yeast, human, and fruit fly.
create_tasks.py generates the single-, double-, and triple-mutant tasks studied in the paper. It assumes that create_ppc.py has already been executed.
create_features.py creates the single and pairwise gene features for all four organisms. This requires the owltool application from geneontology.org to be present in ../tools, and requires an NCBI-Blast+ installation (on Ubuntu, this can be installed via sudo apt install ncbi-blast+).
create_datasets.py combines features and tasks into one CSV file for each task. The GI and Triple GI task files include only the pairwise features, because also including the features of the individual genes would be too much; the GI and Triple GI models therefore require the single gene feature files as well.
create_pseudo_triplets_task.py creates randomly sampled pseudo triplets within and across complexes.
explore_hybrid_costanzo.ipynb examines the overlap between the Costanzo and BioGRID datasets.
The above-mentioned files require the original third-party data files (e.g., BioGRID, UniProt, etc.) to be present in the ../data-sources directory. Because many original data files are required and they are difficult to download and organize correctly, we provide a zip file containing all the processed data necessary to replicate the analyses and modeling experiments, so the user does not need to deal with the original third-party data. The file can be downloaded here.
After downloading, unzip the contents of the file into the ../generated-data directory.
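As a minimal sketch of the extraction step, the snippet below unpacks an archive into ../generated-data using only the Python standard library. The function name extract_processed_data and the archive name processed-data.zip are placeholders introduced here for illustration; substitute the name of the file you actually downloaded.

```python
import zipfile
from pathlib import Path


def extract_processed_data(archive_path, target_dir="../generated-data"):
    """Extract the downloaded archive into target_dir and return the entry names."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)  # create the directory if it is missing
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target)
        return archive.namelist()


if __name__ == "__main__":
    # "processed-data.zip" is a placeholder, not the real archive name.
    archive = Path("processed-data.zip")
    if archive.exists():
        names = extract_processed_data(archive)
        print(f"Extracted {len(names)} entries into ../generated-data")
```

The relative path assumes the snippet is run from the directory that holds the scripts, so that ../generated-data sits next to it.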
After downloading and extracting the processed data files, the following scripts should be executed.
create_mn_datasets.py creates datasets for the MN models, based on those generated by create_datasets.py.
create_splits.py creates all the cross-validation splits for all tasks studied in the paper. This includes the development/test splits for yeast.
figures.ipynb produces all the non-modeling figures in the paper.
The following files reproduce the results of the modeling experiments of the paper. They can be run directly; they take no arguments or parameters.
exp_optimize_hyperparams.py runs the hyperparameter optimization experiments on the single- and double-gene neural network models. Produces Supplementary Table 1.
exp_feature_selection.py runs feature selection experiments on the development portion of the budding yeast datasets. Produces Supplementary Tables 2, 4, and 7.
exp_yeast_smf.py evaluates the S-Full, S-Refined, S-MN, and null models on the development (CV) and test portions of the yeast SMF dataset. Produces Figure 1A.
exp_yeast_gi_hybrid.py evaluates the D-Full, D-Refined, D-MN, and null models on the development (CV) and test portions of the yeast hybrid GI dataset. Produces Figure 2A.
exp_yeast_tgi.py evaluates the T-Full, T-Refined, T-MN, and null models on the development (CV) and test portions of the yeast triple mutant GI dataset. Produces Figure 3A.
exp_smf_binary.py evaluates the S-Refined, S-MN, and null models on the SMF datasets of all four organisms, as well as the multi-organismal lethal (MO) vs. viable (V) dataset of humans and fruit flies. Training and evaluation are done using CV. Produces Figure 4.
exp_gi_binary.py evaluates the D-Refined, D-MN (with and without slim GO terms), and null models on the GI datasets of all four organisms. Produces Figure 5.
exp_gi_costanzo_pombe.py evaluates the 4-way D-Full, D-Refined, D-MN, and null models on the yeast Costanzo GI dataset, and the D-Refined, D-MN, and null models on the pombe GI dataset. Produces Supplementary Figure 5.
exp_smf_other_orgs.py evaluates the 3-way S-Refined, S-MN, and null models on the pombe, human, and fruit fly SMF datasets. Produces Supplementary Figure 7.
exp_smf_ca_mo_v.py evaluates the S-Refined, S-MN, and null models on the task of predicting cellular autonomous lethality (CA) vs. multi-organismal lethality (MO) vs. viability (V) in humans and fruit flies. Produces Supplementary Figure 8.
exp_lit.py compares the binary S-MN and D-MN models to other single-mutant fitness models from the literature on the yeast SMF and hybrid GI datasets. Produces Supplementary Figure 9.
exp_cross_prediction.py trains the D-MN model on the GI prediction task on the yeast hybrid GI dataset, and evaluates it on the tasks of predicting GI, coprecipitation, phosphorylation, and transcription. Produces Supplementary Figure 10.
exp_mn_feature_contribution.py computes the drop in balanced accuracy when each feature of the S-MN, D-MN, and T-MN models is removed. Produces Supplementary Figure 11.
exp_generalization.py trains S-MN and D-MN models on the yeast SMF and hybrid GI datasets, and evaluates them on the other organisms' datasets. Produces Supplementary Figure 12.
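To make the quantity reported by exp_mn_feature_contribution.py concrete, here is a minimal sketch of a feature-ablation loop. Balanced accuracy is the mean of the per-class recalls, and a feature's contribution is the drop in that score when the feature is left out. The names balanced_accuracy, ablation_drops, and score_fn are hypothetical and are not taken from the repository; how the actual script removes a feature and re-evaluates the model may differ.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls over the classes present in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


def ablation_drops(features, score_fn):
    """Return {feature: baseline score - score without that feature}.

    score_fn is a hypothetical callable: it receives a list of feature
    names, evaluates a model restricted to those features, and returns
    its balanced accuracy.
    """
    baseline = score_fn(features)
    drops = {}
    for feature in features:
        kept = [f for f in features if f != feature]
        drops[feature] = baseline - score_fn(kept)
    return drops
```

A larger drop means the model relied more on that feature. Because balanced accuracy averages recall over classes, it is not inflated by simply predicting the majority class.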
The configuration files reside under cfgs/ and specify the configuration of the NN and MN models used in the paper. Note that these configurations specify the models in their "full" form; the refined variants are created dynamically from those files in the experiment scripts above.
TensorFlow model classes reside under models/. In addition to the neural network and MN models, a helper module, train_and_evaluate.py, is provided to carry out CV training and evaluation. The module takes advantage of multiple cores to run several splits at the same time.
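The idea of running several splits at the same time can be sketched with the standard library alone. This is not the code in train_and_evaluate.py: evaluate_split is a hypothetical stand-in for training and scoring a model on one split and only returns a dummy score, and run_splits is likewise a name introduced here for illustration.

```python
from concurrent.futures import ProcessPoolExecutor


def evaluate_split(split):
    """Hypothetical stand-in for training on split["train"] and scoring on split["test"].

    Returns a dummy score (the fraction of test ids that are even) so the
    sketch runs without any model.
    """
    test_ids = split["test"]
    return sum(1 for i in test_ids if i % 2 == 0) / len(test_ids)


def run_splits(splits, evaluate=evaluate_split, max_workers=None,
               executor_cls=ProcessPoolExecutor):
    """Evaluate the splits concurrently and return the scores in split order."""
    with executor_cls(max_workers=max_workers) as pool:
        # pool.map preserves the order of the input splits.
        return list(pool.map(evaluate, splits))
```

With the default process pool, each split is evaluated in its own worker process, so independent splits can occupy separate cores. When calling run_splits from a script, place the call under an if __name__ == "__main__": guard so that worker processes do not re-execute it on platforms that start workers by re-importing the main module.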
The code was tested with the following modules:
sklearn 1.0.2
dcor 0.5.3
Bio 1.79
igraph 0.9.10
matplotlib 3.5.1
networkx 2.8
numpy 1.22.3
obonet 0.3.0
pandas 1.4.2
scipy 1.8.0
seaborn 0.11.2
statsmodels 0.13.2
tensorflow 2.8.0
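To compare a local environment against the list above, a small check along the following lines may help. It is a sketch that assumes each module exposes a __version__ attribute; modules that are missing or lack that attribute are reported as None. The names TESTED_VERSIONS, installed_version, and version_report are introduced here for illustration.

```python
import importlib

# Import names and versions exactly as listed above.
TESTED_VERSIONS = {
    "sklearn": "1.0.2",
    "dcor": "0.5.3",
    "Bio": "1.79",
    "igraph": "0.9.10",
    "matplotlib": "3.5.1",
    "networkx": "2.8",
    "numpy": "1.22.3",
    "obonet": "0.3.0",
    "pandas": "1.4.2",
    "scipy": "1.8.0",
    "seaborn": "0.11.2",
    "statsmodels": "0.13.2",
    "tensorflow": "2.8.0",
}


def installed_version(module_name):
    """Return the module's __version__ string, or None if it cannot be determined."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, "__version__", None)


def version_report(expected=TESTED_VERSIONS):
    """Map each module name to (tested version, installed version or None)."""
    return {name: (tested, installed_version(name))
            for name, tested in expected.items()}
```

Calling version_report() and printing the entries whose two values differ gives a quick view of where the local environment departs from the tested one.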