DeepGraphLearning
diff --git a/‎10mh_A.pdb
Lines changed: 2608 additions & 0 deletions b/‎10mh_A.pdb
diff --git a/‎LICENSE
Lines changed: 21 additions & 0 deletions b/‎LICENSE
diff --git a/‎README.md
Lines changed: 72 additions & 1 deletion b/‎README.md
diff --git a/‎asset/diffpack.png
1.38 MB b/‎asset/diffpack.png
diff --git a/‎asset/result.png
9.31 MB b/‎asset/result.png
diff --git a/‎config/inference.yaml
Lines changed: 71 additions & 0 deletions b/‎config/inference.yaml
diff --git a/‎config/inference_confidence.yaml
Lines changed: 84 additions & 0 deletions b/‎config/inference_confidence.yaml
diff --git a/‎diffpack/__init__.py b/‎diffpack/__init__.py
diff --git a/‎diffpack/dataset.py
Lines changed: 143 additions & 0 deletions b/‎diffpack/dataset.py
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2023 Yangtian Zhang, Zuobai Zhang
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
@@ -1 +1,72 @@
-# DiffPack
+# DiffPack: A Torsional Diffusion Model for Autoregressive Protein Side-Chain Packing
+**DiffPack** is a torsional diffusion model that predicts protein side-chain conformations from backbone structures, as introduced in [our paper](https://arxiv.org/abs/2306.01794). By learning the joint distribution of side-chain torsional angles through diffusion and denoising on the torsional space, DiffPack improves angle accuracy across several protein side-chain packing benchmarks.
+
+
+## Installation
+The following commands create a conda environment and install DiffPack's dependencies.
+```shell
+conda create -n diffpack python=3.8
+conda activate diffpack
+```
+
+```shell
+conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
+conda install pyg -c pyg
+conda install torchdrug -c milagraph -c conda-forge -c pytorch -c pyg
+```
+
+```shell
+pip install biopython==1.77
+pip install pyyaml
+pip install easydict
+```
+![framework](asset/diffpack.png)
+
+## Model Checkpoints
+We provide several versions of DiffPack, each with its own configuration and checkpoint:
+
+| Model                                 | Config                                     | Checkpoint            |
+|---------------------------------------|--------------------------------------------|-----------------------|
+| DiffPack (Vanilla)                    | [Config](config/inference.yaml)            | [Google Drive Link](https://drive.google.com/file/d/1tZ9ZOjIxq9SxrkdvbLJyLUBbt2P-mksO/view?usp=sharing) |
+| DiffPack (with Confidence Prediction) | [Config](config/inference_confidence.yaml) | [Google Drive Link](https://drive.google.com/file/d/1tZ9ZOjIxq9SxrkdvbLJyLUBbt2P-mksO/view?usp=sharing) |
+
+The Vanilla version of DiffPack is the base model, 
+while the version with Confidence Prediction includes an additional feature that estimates the confidence score of the predicted side-chain conformation.
+
+Most settings are specified in the configuration file. The most important hyperparameters are:
+- `mode`: Sampling mode for the reverse (backward) diffusion process. DiffPack uses `ode` or `sde`.
+- `annealed_temp`: Annealing temperature for the diffusion process. DiffPack uses `3`. A higher value should correspond to a lower temperature.
+- `num_sample`: Number of samples drawn in the diffusion process. The confidence model decides which sample to use.
+
+## Running DiffPack
+To run DiffPack on new proteins on your local machine, use the configuration files provided in the `config/` folder.
+For instance, given two PDB files `1a3a.pdb` and `1a3b.pdb`,
+the following command runs inference on them and saves the results in your chosen output folder:
+```shell
+python script/inference.py -c config/inference_confidence.yaml \
+                           --seed 2023 \
+                           --output_dir path/to/output \
+                           --pdb_files 1a3a.pdb 1a3b.pdb ...
+```
+This command will generate and save the predicted side-chain conformations for the given proteins. 
+
+## Retraining DiffPack
+For those interested in training DiffPack on their own datasets, we will soon release the code and instructions for this process. 
+Stay tuned for updates!
+
+## Visualization of Results
+![Visualization](asset/result.png)
+
+## License
+This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
+
+## Citation
+If you find DiffPack useful in your research or project, please cite our paper:
+```
+@article{zhang2023diffpack,
+  title={DiffPack: A Torsional Diffusion Model for Autoregressive Protein Side-Chain Packing},
+  author={Zhang, Yangtian and Zhang, Zuobai and Zhong, Bozitao and Misra, Sanchit and Tang, Jian},
+  journal={arXiv preprint arXiv:2306.01794},
+  year={2023}
+}
+```
+
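The inference command shown in the README's "Running DiffPack" section can also be assembled programmatically when a whole folder of structures needs to be packed. The following is a minimal sketch, not part of this commit: `build_inference_command` is a hypothetical helper, and it assumes `script/inference.py` accepts exactly the flags shown in the README (`-c`, `--seed`, `--output_dir`, `--pdb_files`).

```python
import glob
import os
import shlex


def build_inference_command(pdb_dir, output_dir,
                            config="config/inference_confidence.yaml",
                            seed=2023):
    """Build the argument list for script/inference.py (hypothetical helper).

    The flags mirror the README example; nothing here is executed.
    """
    # Collect every .pdb file in the folder, in a stable (sorted) order.
    pattern = os.path.join(os.path.expanduser(pdb_dir), "*.pdb")
    pdb_files = sorted(glob.glob(pattern))
    if not pdb_files:
        raise FileNotFoundError("no .pdb files found in %s" % pdb_dir)
    return [
        "python", "script/inference.py",
        "-c", config,
        "--seed", str(seed),
        "--output_dir", output_dir,
        "--pdb_files", *pdb_files,
    ]


def format_command(args):
    """Render the argument list as a single shell-quoted string."""
    return shlex.join(args)
```

The returned list can be passed to `subprocess.run(args, check=True)` from the repository root, or printed with `format_command` to inspect it first.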
@@ -0,0 +1,71 @@
+test_set:
+  class: SideChainDataset
+  path: null
+  atom_feature: residue_symbol
+  bond_feature: null
+  residue_feature: null
+  mol_feature: null
+  sanitize: true
+  removeHs: true
+  transform:
+    class: Compose
+    transforms: []
+
+
+task:
+  class: TorsionalDiffusion
+  train_chi_id: null
+  schedule_1pi_periodic:
+    class: SO2VESchedule
+    pi_periodic: true
+    annealed_temp: 3
+    cache_folder:  ~/scratch/output/diffpack
+    mode: ode
+  schedule_2pi_periodic:
+    class: SO2VESchedule
+    pi_periodic: false
+    annealed_temp: 3
+    cache_folder:  ~/scratch/output/diffpack
+    mode: ode
+  sigma_embedding:
+    class: SigmaEmbeddingLayer
+    input_dim: 39
+    hidden_dims: [ 64, 128 ]
+    sigma_dim: 64
+  model:
+    class: GearNet
+    input_dim: 128
+    hidden_dims: [128, 128, 128, 128, 128, 128]
+    batch_norm: True
+    concat_hidden: True
+    short_cut: True
+    readout: 'sum'
+    num_relation: 6
+    edge_input_dim: 58
+    num_angle_bin: 8
+  torsion_mlp_hidden_dims: [ 64, 128 ]
+  graph_construction_model:
+    class: GraphConstruction
+    edge_layers:
+      - class: BondEdge
+      - class: SpatialEdge
+        radius: 4.5
+        min_distance: 2
+      - class: KNNEdge
+        k: 10
+        min_distance: 0
+    edge_feature: gearnet
+
+optimizer:
+  class: Adam
+  lr: 1.0e-4
+
+engine:
+  gpus: [0] #, 1, 2, 3]
+  batch_size: 32
+  log_interval: 1000
+
+model_checkpoint: ~/scratch/trained_model/diffpack/gearnet_edge_confidence_converted.pth
+
+train:
+  num_epoch: 0
@@ -0,0 +1,84 @@
+test_set:
+  class: SideChainDataset
+  path: null
+  atom_feature: residue_symbol
+  bond_feature: null
+  residue_feature: null
+  mol_feature: null
+  sanitize: true
+  removeHs: true
+  transform:
+    class: Compose
+    transforms: []
+
+
+task:
+  class: ConfidencePrediction
+  num_sample: 4
+  num_mlp_layer: 3
+  train_chi_id: null
+  schedule_1pi_periodic:
+    class: SO2VESchedule
+    pi_periodic: true
+    annealed_temp: 3
+    cache_folder:  ~/scratch/output/diffpack
+    mode: ode
+  schedule_2pi_periodic:
+    class: SO2VESchedule
+    pi_periodic: false
+    annealed_temp: 3
+    cache_folder:  ~/scratch/output/diffpack
+    mode: ode
+  confidence_model:
+    class: GearNet
+    input_dim: 39
+    hidden_dims: [ 128, 128, 128, 128, 128, 128 ]
+    batch_norm: True
+    concat_hidden: True
+    short_cut: True
+    readout: 'sum'
+    num_relation: 6
+    edge_input_dim: 58
+    num_angle_bin: 8
+  sigma_embedding:
+    class: SigmaEmbeddingLayer
+    input_dim: 39
+    hidden_dims: [ 64, 128 ]
+    sigma_dim: 64
+  model:
+    class: GearNet
+    input_dim: 128
+    hidden_dims: [128, 128, 128, 128, 128, 128]
+    batch_norm: True
+    concat_hidden: True
+    short_cut: True
+    readout: 'sum'
+    num_relation: 6
+    edge_input_dim: 58
+    num_angle_bin: 8
+  torsion_mlp_hidden_dims: [ 64, 128 ]
+  graph_construction_model:
+    class: GraphConstruction
+    edge_layers:
+      - class: BondEdge
+      - class: SpatialEdge
+        radius: 4.5
+        min_distance: 2
+      - class: KNNEdge
+        k: 10
+        min_distance: 0
+    edge_feature: gearnet
+
+optimizer:
+  class: Adam
+  lr: 1.0e-4
+
+engine:
+  gpus: [0] #, 1, 2, 3]
+  batch_size: 32
+  log_interval: 1000
+
+model_checkpoint: ~/scratch/trained_model/diffpack/gearnet_edge_confidence_converted.pth
+
+train:
+  num_epoch: 0
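Both configs repeat `mode` and `annealed_temp` under `schedule_1pi_periodic` and `schedule_2pi_periodic`, so changing the sampling behavior means editing two places. The sketch below shows one way to apply such an override consistently. It is an illustration only: `BASE_TASK` is a hand-copied excerpt of the `task` section as a YAML loader would return it, and `override_sampling` is a hypothetical helper, not a function in this repository.

```python
import copy

# Excerpt of the `task` section of config/inference_confidence.yaml,
# written as the nested dict a YAML loader would produce.
BASE_TASK = {
    "class": "ConfidencePrediction",
    "num_sample": 4,
    "schedule_1pi_periodic": {"class": "SO2VESchedule", "pi_periodic": True,
                              "annealed_temp": 3, "mode": "ode"},
    "schedule_2pi_periodic": {"class": "SO2VESchedule", "pi_periodic": False,
                              "annealed_temp": 3, "mode": "ode"},
}

SCHEDULE_KEYS = ("schedule_1pi_periodic", "schedule_2pi_periodic")


def override_sampling(task, mode=None, annealed_temp=None, num_sample=None):
    """Return a copy of `task` with sampling settings changed in both schedules."""
    # The README lists `ode` and `sde` as the two modes.
    if mode is not None and mode not in ("ode", "sde"):
        raise ValueError("mode must be 'ode' or 'sde', got %r" % (mode,))
    task = copy.deepcopy(task)  # leave the caller's config untouched
    for key in SCHEDULE_KEYS:
        if mode is not None:
            task[key]["mode"] = mode
        if annealed_temp is not None:
            task[key]["annealed_temp"] = annealed_temp
    if num_sample is not None:
        task["num_sample"] = num_sample
    return task
```

For example, `override_sampling(BASE_TASK, mode="sde", num_sample=8)` switches both schedules to `sde` and raises the sample count, while `BASE_TASK` itself stays unchanged.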
@@ -0,0 +1,143 @@
+import glob
+import logging
+import os
+
+import torch
+from rdkit import Chem
+from torchdrug import data
+from torchdrug.core import Registry as R
+from torchdrug.layers import functional
+from tqdm import tqdm
+
+from diffpack import rotamer
+from diffpack.rotamer import get_chi_mask, atom_name_vocab, bb_atom_name
+
+logging.basicConfig(level=logging.DEBUG)
+
+logger = logging.getLogger(__name__)
+
+
+@R.register("datasets.SideChainDataset")
+class SideChainDataset(data.ProteinDataset):
+    processed_file = None
+    exclude_pdb_files = []
+
+    def __init__(self, path=None, pdb_files=None, verbose=1, **kwargs):
+        if path is not None:
+            logger.info("Loading dataset from folder %s" % path)
+            path = os.path.expanduser(path)
+            if not os.path.exists(path):
+                os.makedirs(path)
+            self.path = path
+            pkl_file = os.path.join(path, self.processed_file)
+
+            if os.path.exists(pkl_file):
+                logger.info("Found existing pickle file %s" % pkl_file
+                            + ". Loading from pickle file (this may take a while)")
+                self.load_pickle(pkl_file, verbose=verbose, **kwargs)
+            else:
+                logger.info("No pickle file found. Loading from pdb files (this may take a while)"
+                            + " and save to pickle file %s" % pkl_file)
+                pdb_files = sorted(glob.glob(os.path.join(path, "*.pdb")))
+                self.load_pdbs(pdb_files, verbose=verbose, **kwargs)
+                self.save_pickle(pkl_file, verbose=verbose)
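+        elif pdb_files is not None:
+            logger.info("Loading dataset from pdb files")
+            pdb_files = [os.path.expanduser(pdb_file) for pdb_file in pdb_files]
+            pdb_files = [pdb_file for pdb_file in pdb_files if pdb_file.endswith(".pdb")]
+            self.load_pdbs(pdb_files, verbose=verbose, **kwargs)
+        else:
+            # Without either argument, self.data is never created and the filter below would fail
+            raise ValueError("Either `path` or `pdb_files` must be provided")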
+
+        # Filter out proteins with no residues and files listed in exclude_pdb_files
+        indexes = [i for i, (protein, pdb_file) in enumerate(zip(self.data, self.pdb_files))
+                   if (protein.num_residue > 0).all() and os.path.basename(pdb_file) not in self.exclude_pdb_files]
+        self.data = [self.data[i] for i in indexes]
+        self.sequences = [self.sequences[i] for i in indexes]
+        self.pdb_files = [self.pdb_files[i] for i in indexes]
+
+    def load_pdbs(self, pdb_files, transform=None, lazy=False, verbose=0, sanitize=True, removeHs=True, **kwargs):
+        """
+        Load the dataset from pdb files.
+
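+        Parameters:
+            pdb_files (list of str): pdb file names
+            transform (Callable, optional): protein sequence transformation function
+            lazy (bool, optional): if lazy mode is used, the proteins are processed in the dataloader.
+                This may slow down the data loading process, but save a lot of CPU memory and dataset loading time.
+            verbose (int, optional): output verbose level
+            sanitize (bool, optional): whether RDKit sanitizes the molecule parsed from each pdb file
+            removeHs (bool, optional): whether RDKit removes hydrogen atoms when parsing each pdb file
+            **kwargs: extra arguments passed on when constructing each protein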
+        """
+        num_sample = len(pdb_files)
+
+        self.transform = transform
+        self.lazy = lazy
+        self.kwargs = kwargs
+        self.data = []
+        self.pdb_files = []
+        self.sequences = []
+
+        if verbose:
+            pdb_files = tqdm(pdb_files, "Constructing proteins from pdbs")
+        for i, pdb_file in enumerate(pdb_files):
+            if not lazy or i == 0:
+                mol = Chem.MolFromPDBFile(pdb_file, sanitize=sanitize, removeHs=removeHs)
+                if not mol:
+                    logger.debug("Can't construct molecule from pdb file `%s`. Ignore this sample." % pdb_file)
+                    continue
+                protein = data.Protein.from_molecule(mol, **kwargs)
+                if not protein:
+                    logger.debug("Can't construct protein from pdb file `%s`. Ignore this sample." % pdb_file)
+                    continue
+            else:
+                protein = None
+            if hasattr(protein, "residue_feature"):
+                with protein.residue():
+                    protein.residue_feature = protein.residue_feature.to_sparse()
+            self.data.append(protein)
+            self.pdb_files.append(pdb_file)
+            self.sequences.append(protein.to_sequence() if protein else None)
+
+    def get_item(self, index):
+        if getattr(self, "lazy", False):
+            protein = data.Protein.from_pdb(self.pdb_files[index], **self.kwargs)
+        else:
+            protein = self.data[index].clone()
+        protein = protein.subgraph(protein.atom_name < 37)
+
+        with protein.atom():
+            # Init atom14 index map
+            protein.atom14index = rotamer.restype_atom14_index_map[
+                protein.residue_type[protein.atom2residue], protein.atom_name
+            ]  # [num_atom, 14]
+
+        with protein.residue():
+            # Init residue features
+            protein.residue_feature = functional.one_hot(protein.residue_type, 21)  # [num_residue, 21]
+
+            # Init residue masks
+            chi_mask = get_chi_mask(protein)
+            chi_1pi_periodic_mask = torch.tensor(rotamer.chi_pi_periodic)[protein.residue_type]
+            chi_2pi_periodic_mask = ~chi_1pi_periodic_mask
+            protein.chi_mask = chi_mask
+            protein.chi_1pi_periodic_mask = torch.logical_and(chi_mask, chi_1pi_periodic_mask)  # [num_residue, 4]
+            protein.chi_2pi_periodic_mask = torch.logical_and(chi_mask, chi_2pi_periodic_mask)  # [num_residue, 4]
+
+            # Init atom37 features
+            protein.atom37_mask = torch.zeros(protein.num_residue, len(atom_name_vocab), device=protein.device,
+                                              dtype=torch.bool)  # [num_residue, 37]
+            protein.atom37_mask[protein.atom2residue, protein.atom_name] = True
+            protein.sidechain37_mask = protein.atom37_mask.clone()  # [num_residue, 37]
+            protein.sidechain37_mask[:, bb_atom_name] = False
+        item = {"graph": protein}
+
+        if self.transform:
+            item = self.transform(item)
+        return item
+
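+    @staticmethod
+    def from_pdb_files(pdb_files, verbose=1, **kwargs):
+        # Pass by keyword: the first positional argument of __init__ is `path`, not `pdb_files`
+        return SideChainDataset(pdb_files=pdb_files, verbose=verbose, **kwargs)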
+
+    def __repr__(self):
+        lines = ["#sample: %d" % len(self)]
+        return "%s(  %s)" % (self.__class__.__name__, "\n  ".join(lines))
+
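`get_item` splits the per-residue chi mask into a π-periodic part and a 2π-periodic part, which the two `SO2VESchedule` entries in the configs then handle separately. The sketch below reproduces that mask logic in plain Python for a handful of residues. It is an illustration under stated assumptions: the two small tables are written out by hand for this example (the real values live in `diffpack.rotamer`, which is not shown in this diff), and `chi_masks` is a hypothetical name.

```python
# Number of side-chain chi angles for a few residue types (illustrative subset).
NUM_CHI = {"ALA": 0, "SER": 1, "ASP": 2, "GLU": 3, "PHE": 2, "LYS": 4}
# 1-based index of the chi angle that is pi-periodic because the terminal
# group is symmetric (illustrative subset).
PI_PERIODIC_CHI = {"ASP": 2, "GLU": 3, "PHE": 2}
MAX_CHI = 4


def chi_masks(residues):
    """Return (chi_mask, mask_1pi, mask_2pi), each a [num_residue][4] list of bools."""
    chi_mask, mask_1pi, mask_2pi = [], [], []
    for name in residues:
        # True where the residue actually has that chi angle.
        exists = [i < NUM_CHI[name] for i in range(MAX_CHI)]
        # True where that chi angle is pi-periodic.
        pi_periodic = [PI_PERIODIC_CHI.get(name) == i + 1 for i in range(MAX_CHI)]
        chi_mask.append(exists)
        # Same combination as in get_item: AND each periodicity mask with chi_mask.
        mask_1pi.append([e and p for e, p in zip(exists, pi_periodic)])
        mask_2pi.append([e and not p for e, p in zip(exists, pi_periodic)])
    return chi_mask, mask_1pi, mask_2pi
```

By construction the two periodicity masks never overlap, and together they cover exactly the angles marked in `chi_mask`, so every existing chi angle is assigned to one of the two schedules.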