| 1 | +# AlphaFold |
| 2 | + |
| 3 | +This package provides an implementation of the contact prediction network, |
| 4 | +associated model weights and CASP13 dataset as published in Nature. |
| 5 | + |
| 6 | +Any publication that discloses findings arising from using this source code must |
| 7 | +cite *AlphaFold: Protein structure prediction using potentials from deep |
| 8 | +learning* by Andrew W. Senior, Richard Evans, John Jumper, James Kirkpatrick, |
| 9 | +Laurent Sifre, Tim Green, Chongli Qin, Augustin Žídek, Alexander W. R. Nelson, |
| 10 | +Alex Bridgland, Hugo Penedones, Stig Petersen, Karen Simonyan, Steve Crossan, |
| 11 | +Pushmeet Kohli, David T. Jones, David Silver, Koray Kavukcuoglu, Demis Hassabis. |
| 12 | + |
| 13 | +## Setup |
| 14 | + |
| 15 | +### Dependencies |
| 16 | + |
| 17 | +* Python 3.6+ |
| 18 | +* [Abseil 0.8.0+](https://github.com/abseil/abseil-py) |
| 19 | +* [Numpy 1.16+](https://numpy.org) |
| 20 | +* [Six 1.12+](https://pypi.org/project/six/) |
| 21 | +* [Sonnet 1.35+](https://github.com/deepmind/sonnet) |
| 22 | +* [TensorFlow 1.14](https://tensorflow.org). Not compatible with TensorFlow |
| 23 | + 2.0+. |
| 24 | +* [TensorFlow Probability 0.7.0](https://www.tensorflow.org/probability) |
| 25 | + |
| 26 | +You can set up a Python virtual environment with these dependencies inside the |
| 27 | +forked `deepmind_research` repository using: |
| 28 | + |
| 29 | +```shell |
| 30 | +python3 -m venv alphafold_venv |
| 31 | +source alphafold_venv/bin/activate |
| 32 | +pip install -r alphafold_casp13/requirements.txt |
| 33 | +``` |
| 34 | + |
| 35 | +### Input data |
| 36 | + |
| 37 | +The dataset can be downloaded from |
| 38 | +[Google Cloud Storage](https://console.cloud.google.com/storage/browser/alphafold_casp13_data). |
| 39 | + |
| 40 | +Download it, for example with `wget`: |
| 41 | + |
| 42 | +```shell |
| 43 | +wget https://storage.googleapis.com/alphafold_casp13_data/casp13_data.zip |
| 44 | +``` |
| 45 | + |
| 46 | +The zip file contains 1 directory for each CASP13 target and a `LICENSE.md` |
| 47 | +file. Each target directory contains the following files: |
| 48 | + |
| 49 | +1. `TARGET.tfrec` file. This is a |
| 50 | + [TFRecord](https://www.tensorflow.org/tutorials/load_data/tfrecord) file |
| 51 | + with serialized tf.train.Example protocol buffers that contain the features |
| 52 | + needed to run the model. |
| 53 | +1. `contacts/TARGET.pickle` file(s) with the predicted distogram. |
| 54 | +1. `contacts/TARGET.rr` file(s) with the contact map derived from the predicted |
| 55 | + distogram. The RR format is described on the |
| 56 | + [CASP website](http://predictioncenter.org/casp13/index.cgi?page=format#RR). |
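
As an illustration of how a `.rr` file might be consumed, here is a minimal sketch of a parser. It assumes the RR layout described on the CASP page linked above, in which each contact is a line of five whitespace-separated fields (`i j d_min d_max probability`) and every other line (header records, sequence lines, `END`) can be skipped. The names `parse_rr` and `Contact` are hypothetical and not part of this package.

```python
from typing import List, NamedTuple


class Contact(NamedTuple):
    i: int              # 1-based index of the first residue
    j: int              # 1-based index of the second residue
    d_min: float        # lower distance bound
    d_max: float        # upper distance bound
    probability: float  # predicted contact probability


def parse_rr(text: str) -> List[Contact]:
    """Extracts contact records from RR-formatted text (assumed layout)."""
    contacts = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 5:
            continue  # Header, sequence, or END line.
        try:
            contacts.append(Contact(int(fields[0]), int(fields[1]),
                                    float(fields[2]), float(fields[3]),
                                    float(fields[4])))
        except ValueError:
            continue  # Five tokens, but not a numeric contact record.
    return contacts
```

Calling `parse_rr(open("contacts/TARGET.rr").read())` would then give a list that can be filtered or sorted by `probability`.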
| 57 | + |
| 58 | +Note that for **T0999** the target was manually split based on hits in HHSearch |
| 59 | +into 5 sub-targets, hence there are 5 distograms |
| 60 | +(`contacts/T0999s{1,2,3,4,5}.pickle`) and 5 RR files |
| 61 | +(`contacts/T0999s{1,2,3,4,5}.rr`). |
| 62 | + |
| 63 | +The `contacts/` folder is not needed to run the model. These files are included |
| 64 | +only for convenience, so that you don't need to run inference on the CASP13 |
| 65 | +targets to get the contact maps. |
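
To look at one of the precomputed distograms without running inference, something like the sketch below may help. The internal layout of the pickle is not documented here, so the code deliberately assumes nothing about it and only reports what it finds. `load_distogram_pickle` and `describe` are hypothetical helper names. As with any pickle, only load files from a source you trust.

```python
import pickle


def load_distogram_pickle(path: str):
    """Loads a pickle file and returns the unpickled object unchanged."""
    with open(path, "rb") as f:
        # encoding="latin1" lets Python 3 read pickles that may have been
        # written by Python 2; it has no effect on Python 3 pickles.
        return pickle.load(f, encoding="latin1")


def describe(obj) -> str:
    """Summarizes an object without assuming a particular layout."""
    if isinstance(obj, dict):
        return "dict with keys: " + ", ".join(sorted(str(k) for k in obj))
    shape = getattr(obj, "shape", None)
    if shape is not None:
        return "{} with shape {}".format(type(obj).__name__, tuple(shape))
    return type(obj).__name__
```

For example, `print(describe(load_distogram_pickle("contacts/TARGET.pickle")))` prints a one-line summary that shows which fields or array shapes are available.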
| 66 | + |
| 67 | +### Model checkpoints |
| 68 | + |
| 69 | +The model checkpoints can be downloaded from |
| 70 | +[Google Cloud Storage](https://console.cloud.google.com/storage/browser/alphafold_casp13_data). |
| 71 | + |
| 72 | +Download them, for example with `wget`: |
| 73 | + |
| 74 | +```shell |
| 75 | +wget https://storage.googleapis.com/alphafold_casp13_data/alphafold_casp13_weights.zip |
| 76 | +``` |
| 77 | + |
| 78 | +The zip file contains: |
| 79 | + |
| 80 | +1. A directory `873731`. This contains the weights for the distogram model. |
| 81 | +1. A directory `916425`. This contains the weights for the background distogram |
| 82 | + model. |
| 83 | +1. A directory `941521`. This contains the weights for the torsion model. |
| 84 | +1. `LICENSE.md`. The model checkpoints have a non-commercial license which is |
| 85 | + defined in this file. |
| 86 | + |
| 87 | +Each directory with model weights contains a number of different model |
| 88 | +configurations. Each model has a config file and associated weights. There is |
| 89 | +only one torsion model. Each model directory also contains a stats file that is |
| 90 | +used for feature normalization specific to that model. |
| 91 | + |
| 92 | +## Distogram prediction |
| 93 | + |
| 94 | +### Running the system |
| 95 | + |
| 96 | +You can use the `run_eval.sh` script to run the entire distogram prediction |
| 97 | +system. Before running it, complete the following steps: |
| 98 | + |
| 99 | +1. Download the input data as described above. Unpack the data in the |
| 100 | + directory with the code. |
| 101 | +1. Download the model checkpoints as described above. Unpack the data. |
| 102 | +1. In `run_eval.sh` set the following: |
| 103 | + * `DISTOGRAM_MODEL` to the path to the directory with the distogram model. |
| 104 | + * `BACKGROUND_MODEL` to the path to the directory with the background |
| 105 | + model. |
| 106 | + * `TORSION_MODEL` to the path to the directory with the torsion model. |
| 107 | + * `TARGET` to the path to the directory with the target input data. |
| 108 | + |
| 109 | +Then run `alphafold_casp13/run_eval.sh` from the `deepmind_research` parent |
| 110 | +directory (you will get errors if you try running `run_eval.sh` directly from |
| 111 | +the `alphafold_casp13` directory). |
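
Because a wrong working directory or a mistyped path only shows up as an error once the script is running, a small pre-flight check can be useful. The sketch below is not part of this package. It assumes you mirror the four values set in `run_eval.sh` in a Python dict, and the function name `preflight` is hypothetical.

```python
import os
from typing import Dict, List

# Path of the script relative to the deepmind_research directory.
SCRIPT = os.path.join("alphafold_casp13", "run_eval.sh")


def preflight(paths: Dict[str, str], repo_root: str = ".") -> List[str]:
    """Returns a list of problems; an empty list means the checks passed."""
    problems = []
    if not os.path.isfile(os.path.join(repo_root, SCRIPT)):
        problems.append("{} not found under {!r}; run from the "
                        "deepmind_research directory".format(SCRIPT, repo_root))
    for name, path in sorted(paths.items()):
        if not os.path.isdir(path):
            problems.append("{} is not a directory: {}".format(name, path))
    return problems
```

A call such as `preflight({"DISTOGRAM_MODEL": "...", "BACKGROUND_MODEL": "...", "TORSION_MODEL": "...", "TARGET": "..."})`, with your own paths filled in, returns an empty list when everything is in place.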
| 112 | + |
| 113 | +The contact prediction works in the following way: |
| 114 | + |
| 115 | +1. 4 replicas (by *replica* we mean a configuration file describing the network |
| 116 | + architecture and a snapshot with the network weights), each with a slightly |
| 117 | + different model configuration, are launched to predict the distogram. |
| 118 | +1. 4 replicas, each with a slightly different model configuration, are launched to |
| 119 | + predict the background distogram. |
| 120 | +1. 1 replica is launched to predict the torsions. |
| 121 | +1. The predictions from the different replicas are averaged together using |
| 122 | + `ensemble_contact_maps.py`. |
| 123 | +1. The predictions for the 64 × 64 distogram crops are pasted together using |
| 124 | + `paste_contact_maps.py`. |
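
The last two steps can be pictured with the toy sketch below. It is not the code in `ensemble_contact_maps.py` or `paste_contact_maps.py`. It assumes that ensembling is a plain element-wise mean and that overlapping crops are averaged when pasted, and it treats each cell as a single number rather than a distribution over distance bins. Plain lists are used only to keep the example dependency-free, and the names `ensemble` and `paste` are hypothetical.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def ensemble(replicas: Sequence[Matrix]) -> Matrix:
    """Element-wise mean of equally shaped matrices, one per replica."""
    n = len(replicas)
    rows, cols = len(replicas[0]), len(replicas[0][0])
    return [[sum(r[i][j] for r in replicas) / n for j in range(cols)]
            for i in range(rows)]


def paste(crops: Sequence[Tuple[int, int, Matrix]], length: int) -> Matrix:
    """Pastes (row_offset, col_offset, crop) tuples into a length x length map.

    Cells covered by several crops get the mean of the overlapping values;
    cells covered by no crop stay at 0.0.
    """
    total = [[0.0] * length for _ in range(length)]
    count = [[0] * length for _ in range(length)]
    for row0, col0, crop in crops:
        for di, row in enumerate(crop):
            for dj, value in enumerate(row):
                i, j = row0 + di, col0 + dj
                if i < length and j < length:  # Clip crops at the border.
                    total[i][j] += value
                    count[i][j] += 1
    return [[total[i][j] / count[i][j] if count[i][j] else 0.0
             for j in range(length)] for i in range(length)]
```

In this picture, each crop position would first be ensembled across the replicas, and the ensembled crops would then be pasted into one map covering the full sequence length.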
| 125 | + |
| 126 | +When running `run_eval.sh` the output has the following directory structure: |
| 127 | + |
| 128 | +* **distogram/**: Contains 4 subfolders, one for each replica. Each of these |
| 129 | + contains the predicted ASA, secondary structure and a pickle file with the |
| 130 | + distogram for each crop. It also contains an `ensemble` directory with the |
| 131 | + ensembled distograms. |
| 132 | +* **background_distogram/**: Contains 4 subfolders, one for each replica. Each |
| 133 | + of these contains a pickle file with the background distogram for each crop. |
| 134 | + It also contains an `ensemble` directory with the ensembled background |
| 135 | + distograms. |
| 136 | +* **torsion/**: Contains 1 subfolder as there was only a single replica. This |
| 137 | + folder contains the predicted ASA, secondary structure, backbone |
| 138 | + torsions and a pickle file with the distogram for each crop. It also |
| 139 | + contains an `ensemble` directory with the ensembled torsions. |
| 140 | +* **pasted/**: Contains distograms obtained from the ensembled distograms by |
| 141 | + pasting. An RR contact map file is computed from this pasted distogram. |
| 142 | + **This is the final distogram that was used in the subsequent AlphaFold |
| 143 | + folding pipeline in CASP13.** |
| 144 | + |
| 145 | +## Data splits |
| 146 | + |
| 147 | +We used a version of [PDB](https://www.rcsb.org/) downloaded on 2018-03-15. The |
| 148 | +train/test split can be found in the `train_domains.txt` and `test_domains.txt` |
| 149 | +files. |
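
If you build your own data pipeline on top of these files, a quick check that the two splits do not overlap can be sketched as follows. The code assumes each file lists one domain identifier per line, which is not stated in this document. `read_domains` and `overlap` are hypothetical names.

```python
from typing import Set


def read_domains(path: str) -> Set[str]:
    """Reads one identifier per non-empty line (assumed file layout)."""
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}


def overlap(train_path: str, test_path: str) -> Set[str]:
    """Returns identifiers that appear in both split files."""
    return read_domains(train_path) & read_domains(test_path)
```

An empty result from `overlap("train_domains.txt", "test_domains.txt")` means no identifier is shared between the two files.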
| 150 | + |
| 151 | +Disclaimer: This is not an official Google product. |