
Commit ab49a0e: initial commit (parent c638b03)


48 files changed, +9407 −2 lines

LICENSE (+359 lines; large diffs are not rendered by default)

README.md (+242 −2)

@@ -1,2 +1,242 @@
- # InductiveQE
- Official implementation of Inductive Logical Query Answering in Knowledge Graphs (NeurIPS 2022)
# Inductive Logical Query Answering in Knowledge Graphs (NeurIPS 2022) #

<p align="center">
<a href="https://arxiv.org/pdf/2210.08008.pdf"><img src="http://img.shields.io/badge/Paper-PDF-red.svg" alt="NeurIPS paper"></a>
<a href="https://doi.org/10.5281/zenodo.7306046"><img src="https://zenodo.org/badge/DOI/10.5281/zenodo.7306046.svg" alt="InductiveQE dataset"></a>
</p>

![InductiveQE animation](asset/inductive_qe.gif)

This is the official code base of the paper

[Inductive Logical Query Answering in Knowledge Graphs][paper]

[Mikhail Galkin](https://migalkin.github.io),
[Zhaocheng Zhu](https://kiddozhu.github.io),
[Hongyu Ren](http://hyren.me/),
[Jian Tang](https://jian-tang.com)

[paper]: https://arxiv.org/abs/2210.08008

## Overview ##

**Important: the camera-ready NeurIPS'22 version was identified to have datasets with possible test set leakages.
The new version, comprising this repository and the updated arXiv submission, has new datasets and experiments where this
issue has been fixed. We recommend using the latest version of the datasets ([2.0 on Zenodo][dataset]) and experiments ([v2 on arXiv][paper]) for further
comparisons.**

[dataset]: https://zenodo.org/record/7306046

Inductive query answering is the setup where, at inference time, the underlying graph can contain new, unseen entities.
In this paper, we study a practical inductive setup where the training graph **is extended** with more nodes and edges
at inference time; that is, the inference graph is always a superset of the training graph.
Note that the inference graph always shares the same set of relation types with the training graph.

The inductive setup has two big implications:
* test queries involve new, unseen nodes, and answers can be both seen and unseen nodes;
* training queries might now have more answers among the new nodes.

Two inductive approaches are implemented in this repo:
1. **NodePiece-QE** (inductive node representations): based on [NodePiece](https://github.com/migalkin/NodePiece) and [CQD](https://github.com/pminervini/KGReasoning/).
Train on 1p link prediction, then run **inference-only** zero-shot logical query answering over unseen entities.
The NodePiece encoder can be extended with an additional GNN encoder (CompGCN), denoted as **NodePiece-QE w/ GNN** in the paper.
2. **Inductive GNN-QE** (inductive relational structure representations): based on [GNN-QE](https://github.com/DeepGraphLearning/GNN-QE).
Trainable on complex queries; achieves higher performance than NodePiece-QE but is more expensive to train.

We additionally provide a dummy Edge-type Heuristic (`model.HeuristicBaseline`) that only considers possible tails of the last relation projection step of any query pattern.
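For intuition, here is a minimal sketch of what such a heuristic does (illustrative only; the helper names below are hypothetical and this is not the repo's actual `HeuristicBaseline` code):

```python
from collections import defaultdict

def build_tail_index(triples):
    """Map each relation type to the set of tail entities observed with it."""
    tails_of = defaultdict(set)
    for head, rel, tail in triples:
        tails_of[rel].add(tail)
    return tails_of

def heuristic_candidates(last_relation, tails_of):
    """Answer candidates: every observed tail of the query's final projection step."""
    return tails_of.get(last_relation, set())

# Toy inference graph as (head, relation, tail) triples
graph = [(0, "lives_in", 5), (1, "lives_in", 6), (5, "part_of", 7)]
index = build_tail_index(graph)
print(heuristic_candidates("lives_in", index))  # {5, 6}
```

Since such an index can be built over the inference graph itself, new entities are covered without any training.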
## Data ##

We created 10 new inductive query answering datasets where the validation/test graphs extend the training graph and contain new entities:
* Small-scale: 9 datasets based on FB15k-237 where the ratio of *inference-to-train* nodes varies from 106% to 550%, with 15k nodes in total across the various splits.
* Large-scale: 1 dataset based on OGB WikiKG2 with a fixed ratio of 133%: 1.5M training nodes plus 500K new nodes and 5M new edges at inference.

<details>
<summary>Datasets Description</summary>

Each dataset is a zip archive containing 17 files:

* `train_graph.txt` (`.pt` for WikiKG) - the original training graph
* `val_inference.txt` (`.pt`) - the inference graph (validation split); new nodes in validation are disjoint with the test inference graph
* `val_predict.txt` (`.pt`) - missing edges in the validation inference graph to be predicted
* `test_inference.txt` (`.pt`) - the inference graph (test split); new nodes in test are disjoint with the validation inference graph
* `test_predict.txt` (`.pt`) - missing edges in the test inference graph to be predicted
* `train/valid/test_queries.pkl` - queries of the respective split; 14 query types for the FB15k-237-derived datasets, 9 types for WikiKG (EPFO-only, i.e., queries without negation)
* `*_answers_easy.pkl` - easy answers to the respective queries that require only edge traversal, not predicting missing links
* `*_answers_hard.pkl` - hard answers to the respective queries that DO require predicting missing links and against which the final metrics are computed
* `train_answers_valid.pkl` - the extended set of answers for training queries on the bigger validation graph; most training queries have at least one new answer. This is intended as an inference-only set to measure the faithfulness of trained models
* `train_answers_test.pkl` - the extended set of answers for training queries on the bigger test graph; most training queries have at least one new answer. This is intended as an inference-only set to measure the faithfulness of trained models
* `og_mappings.pkl` - entity2id / relation2id dictionaries mapping local node/relation IDs of a dataset to the original FB15k-237 / WikiKG2 IDs
* `stats.txt` - a small file with dataset stats
</details>
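For a quick look at the files, the pickles can be opened directly (a minimal sketch; the paths and the internal layout of the pickled objects are assumptions, not guaranteed by the repo):

```python
import pickle

data_dir = "data/175"  # hypothetical location of an unzipped 175% split

with open(f"{data_dir}/test_queries.pkl", "rb") as f:
    queries = pickle.load(f)       # queries, grouped by query type
with open(f"{data_dir}/test_answers_hard.pkl", "rb") as f:
    hard_answers = pickle.load(f)  # answers that require predicting missing links
with open(f"{data_dir}/og_mappings.pkl", "rb") as f:
    mappings = pickle.load(f)      # entity2id / relation2id dicts to the original IDs

print(type(queries), type(hard_answers), type(mappings))
```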
All datasets are available on [Zenodo](https://zenodo.org/record/7306046); please refer to v2.0 of the datasets.
The datasets will be downloaded automatically upon the first run.

Additionally, we provide lightweight dumps ([Zenodo](https://zenodo.org/record/7306061)) of just those graphs (without queries and answers) for training simple link prediction and KG completion models.
Please refer to v2.0 of the datasets.

## Installation ##

The dependencies can be installed via either conda or pip. NodePiece-QE and GNN-QE are compatible
with Python 3.7/3.8/3.9 and PyTorch >= 1.8.0.

### From Conda ###

```bash
conda install torchdrug pytorch cudatoolkit -c milagraph -c pytorch -c pyg
conda install pytorch-sparse pytorch-scatter -c pyg
conda install easydict pyyaml -c conda-forge
```

### From Pip ###

```bash
pip install torchdrug torch
pip install easydict pyyaml
pip install wandb tensorboardx
```

Then install `torch-scatter` and `torch-sparse` following the instructions in the [GitHub repo](https://github.com/rusty1s/pytorch_sparse).
For example, for PyTorch 1.10 and CUDA 10.2:

```bash
pip install torch-scatter torch-sparse -f https://data.pyg.org/whl/torch-1.10.0+cu102.html
```
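A quick way to sanity-check the environment after either installation route (our suggestion, not part of the original instructions):

```python
# Confirms the core dependencies import and reports the CUDA status
import torch
import torchdrug
import torch_scatter
import torch_sparse

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
```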
## Usage ##

### NodePiece-QE ###

Conceptually, running NodePiece-QE consists of two parts:
1. Training a neural link predictor using NodePiece (+ an optional GNN) and saving the materialized embeddings of the test graph.
2. Running CQD over the saved embeddings.

**Step 1: Training a Link Predictor**

Use the `NodePiece` model with the task `InductiveKnowledgeGraphCompletion` applied to the dataset of choice.

We prepared 5 configs for the FB15k-237-derived datasets in the `config/lp_pretraining` directory, 2 for NodePiece w/o GNN
and 3 for NodePiece w/ GNN, following the hyperparameters reported in the paper.
The `_550` configs have a higher `input_dim`, so we decided to keep a dedicated file for them to pass fewer params to the
training script.

We also provide 2 configs for the WikiKG graph and recommend running pre-training in multi-GPU mode due to the size
of the graph.

Example of training a vanilla NodePiece on the 175% dataset:

```bash
python script/run.py -c config/lp_pretraining/nodepiece_nognn.yaml --ratio 175 --temp 0.5 --epochs 2000 --gpus [0] --logger console
```

NodePiece + GNN on the 175% dataset:
```bash
python script/run.py -c config/lp_pretraining/nodepiece_gnn.yaml --ratio 175 --temp 1.0 --epochs 1000 --gpus [0] --logger console
```

For datasets with ratios 106-150, use the 5-layer GNN config `config/lp_pretraining/nodepiece_gnn.yaml`.

* Use `--gpus null` to run the scripts on a CPU.
* Use `--logger wandb` to send training logs to wandb; don't forget to prepend the env variable `WANDB_ENTITY=(your_entity)`
before executing the python script.

After training, the materialized entity and relation embeddings of the test graph will be stored in the `output_dir` folder.

WikiKG training requires a vocabulary of mined NodePiece anchors; we ship a precomputed
vocab `20000_anchors_d0.4_p0.4_r0.2_25sp_bfs.pkl` together with the `wikikg.zip` archive.
You can mine your own vocab by playing around with the `NodePieceTokenizer`; mining is implemented on a GPU and
should be much faster than in the original NodePiece implementation.
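The shipped vocabulary is a regular pickle, so it can be inspected directly (a hedged sketch; the object's internal layout is an assumption). The filename likely encodes the mining setup: 20000 anchors picked 40% by degree, 40% by PageRank, and 20% randomly, with 25 anchors per node mined via BFS.

```python
import pickle

# Path to the precomputed vocab shipped with wikikg.zip
vocab_path = "20000_anchors_d0.4_p0.4_r0.2_25sp_bfs.pkl"

with open(vocab_path, "rb") as f:
    vocab = pickle.load(f)

print(type(vocab))  # inspect the top-level structure before digging deeper
```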
An example WikiKG link prediction pre-training command should include the `--vocab` param pointing to the mined vocab, e.g.,
```bash
python script/run.py -c config/lp_pretraining/wikikg_nodepiece_nognn.yaml --gpus [0] --vocab /path/to/pickle/vocab.pkl
```

We highly recommend training both the no-GNN and GNN versions of NodePiece on WikiKG using several GPUs, for example:
```bash
python -m torch.distributed.launch --nproc_per_node=2 script/run.py -c config/lp_pretraining/wikikg_nodepiece_nognn.yaml --gpus [0,1] --vocab /path/to/pickle/vocab.pkl
```

**Step 2: CQD Inference**

Use the pre-trained link predictor to run CQD inference over EPFO queries (negation is not supported in this version of CQD).

Example of running CQD over the 175% dataset with the pre-trained 200d NodePiece w/ GNN model:
* Note that we need to specify an embedding dimension 2x smaller than that of the trained model, since by default we train a
ComplEx model with two parts, real and imaginary;
* Use the full path to the **embeddings** of the pre-trained models; they are named something like `/path/epoch_1000_ents`
and `/path/epoch_1000_rels`, so just use the common prefix `/path/epoch_1000`.

```bash
python cqd/main.py --cuda --do_test --data_path ./data/175 -d 100 -cpu 6 --log_steps 10000 --test_log_steps 10000 --geo cqd --print_on_screen --cqd-k 32 --cqd-sigmoid --tasks "1p.2p.3p.2i.3i.ip.pi.2u.up" --inductive --checkpoint_path /path/epoch_1000 --skip_tr
```
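To double-check the dimension convention before launching CQD, you can inspect the saved embeddings (a sketch assuming they are `torch.save`d tensors; the actual serialization may differ):

```python
import torch

# Common prefix of the saved embeddings, e.g. /path/epoch_1000
ents = torch.load("/path/epoch_1000_ents", map_location="cpu")
rels = torch.load("/path/epoch_1000_rels", map_location="cpu")

# A 200d ComplEx model stores 100 real + 100 imaginary components per entity,
# hence `-d 100` in the CQD command above
print(ents.shape, rels.shape)
```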
To evaluate training queries on the bigger test graphs, use the argument `--eval_train`:

```bash
python cqd/main.py --cuda --do_test --data_path ./data/175 -d 100 -cpu 6 --log_steps 10000 --test_log_steps 10000 --geo cqd --print_on_screen --cqd-k 32 --cqd-sigmoid --tasks "1p.2p.3p.2i.3i.ip.pi.2u.up" --inductive --checkpoint_path /path/epoch_1000 --eval_train
```

### GNN-QE ###

To train GNN-QE and evaluate it on the valid/test queries of the desired dataset ratio, use the `gnnqe_main.yaml` config.
Example on the 175% dataset:

```bash
python script/run.py -c config/complex_query/gnnqe_main.yaml --ratio 175 --gpus [0]
```

Alternatively, you may specify `--gpus null` to run GNN-QE on a CPU.

The hyperparameters are designed for 32GB GPUs, but you may adjust the batch size in the config file
to fit a smaller GPU memory.

To run GNN-QE with multiple GPUs or multiple machines, use the following commands (note that the `--gpus` list specifies a device for every process across all machines):

```bash
python -m torch.distributed.launch --nproc_per_node=2 script/run.py -c config/complex_query/gnnqe_main.yaml --gpus [0,1]
```

```bash
python -m torch.distributed.launch --nnodes=4 --nproc_per_node=4 script/run.py -c config/complex_query/gnnqe_main.yaml --gpus [0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3]
```

To evaluate training queries on the bigger test graphs, use the config `gnnqe_eval_train.yaml` and specify the
checkpoint of the trained model via `--checkpoint`. The best performing checkpoint is written in the log files after training the `main` config.
For example, if the best performing 175% model is `model_epoch_1.pth`, then the command will be:
```bash
python script/run.py -c config/complex_query/gnnqe_eval_train.yaml --ratio 175 --gpus [0] --checkpoint /path/to/model/model_epoch_1.pth
```

### Heuristic Baseline ###

Finally, we provide configs for the inference-only rule-based heuristic baseline
that only considers possible tails of the last relation projection step of any query pattern.
The two configs are `config/complex_query/heuristic_main.yaml` and `config/complex_query/heuristic_eval_train.yaml`.

To run the baseline on test queries (for example, on the 175% dataset):

```bash
python script/run.py -c config/complex_query/heuristic_main.yaml --ratio 175 --gpus [0]
```

To run the baseline on train queries over the bigger test graphs:
```bash
python script/run.py -c config/complex_query/heuristic_eval_train.yaml --ratio 175 --gpus [0]
```

## Citation ##

If you find this project useful in your research, please cite the following paper:

```bibtex
@inproceedings{galkin2022inductive,
  title={Inductive Logical Query Answering in Knowledge Graphs},
  author={Mikhail Galkin and Zhaocheng Zhu and Hongyu Ren and Jian Tang},
  booktitle={Advances in Neural Information Processing Systems},
  year={2022},
  url={https://openreview.net/forum?id=-vXEN5rIABY}
}
```

asset/inductive_qe.gif (3.27 MB)

config/complex_query/gnnqe_eval_train.yaml (+42)

@@ -0,0 +1,42 @@
output_dir: ~/git/InductiveQE/experiments

dataset:
  class: InductiveFB15k237CompExtendedEval  # special task to evaluate train queries on the bigger test graph
  path: ~/git/InductiveQE/data
  ratio: {{ ratio }}  # specify dataset ratio here or when running the script

task:
  class: InductiveLogicalQuery
  model:
    class: GNN-QE
    model:
      class: NBFNet
      input_dim: 32  # hyperparams from Table 14 (Appendix D of the paper)
      hidden_dims: [32, 32, 32, 32]
      message_func: distmult
      aggregate_func: pna
      short_cut: yes
      layer_norm: yes
      dependent: yes
    logic: product
    dropout_ratio: 0.5
  criterion: bce
  sample_weight: no
  adversarial_temperature: 0.1

optimizer:
  class: Adam
  lr: 5.0e-3

engine:
  gpus: {{ gpus }}
  batch_size: 64  # reduce if it doesn't fit on a GPU

train:
  num_epoch: 0  # no training, just run the inference
  batch_per_epoch: 1000  # number of batches to be considered as "one epoch"

checkpoint: {{ checkpoint }}  # path to the checkpoint of the trained model
metric: mrr
skip_eval_on_train: False  # here, we do run evaluation of train queries on training and test graphs
save_embs: False  # doesn't work for GNN-QE as it saves its checkpoints automatically

config/complex_query/gnnqe_main.yaml (+42)

@@ -0,0 +1,42 @@
output_dir: ~/git/InductiveQE/experiments

dataset:
  class: InductiveFB15k237Comp
  path: ~/git/InductiveQE/data
  ratio: {{ ratio }}  # specify dataset ratio here or when running the script

task:
  class: InductiveLogicalQuery
  model:
    class: GNN-QE
    model:
      class: NBFNet
      input_dim: 32  # hyperparams from Table 14 (Appendix D of the paper)
      hidden_dims: [32, 32, 32, 32]
      message_func: distmult
      aggregate_func: pna
      short_cut: yes
      layer_norm: yes
      dependent: yes
    logic: product
    dropout_ratio: 0.5
  criterion: bce
  sample_weight: no
  adversarial_temperature: 0.1

optimizer:
  class: Adam
  lr: 5.0e-3

engine:
  gpus: {{ gpus }}
  batch_size: 64  # reduce if it doesn't fit on a GPU

train:
  num_epoch: 10  # total number of optimization steps will be num_epoch * batch_per_epoch
  batch_per_epoch: 1000  # number of batches to be considered as "one epoch"

metric: mrr
fast_test: 1000  # GNN-QE is slow in inference; use this option for a random subsample of valid data
skip_eval_on_train: True  # do not run evaluation on the training data; use the gnnqe_eval_train config instead
save_embs: False  # doesn't work for GNN-QE as it saves its checkpoints automatically
config/complex_query/heuristic_eval_train.yaml (+31)

@@ -0,0 +1,31 @@
output_dir: ~/git/InductiveQE/experiments

dataset:
  class: InductiveFB15k237CompExtendedEval
  path: ~/git/InductiveQE/data
  ratio: {{ ratio }}

task:
  class: InductiveLogicalQuery
  model:
    class: HeuristicBaseline
    inverse_mode: "2x"
  criterion: bce
  sample_weight: no
  adversarial_temperature: 0.1

optimizer:
  class: Adam
  lr: 5.0e-3

engine:
  gpus: {{ gpus }}
  batch_size: 512

train:
  num_epoch: 0  # the heuristic baseline doesn't need any training
  batch_per_epoch: 0

metric: mrr
skip_eval_on_train: True  # we run inference directly on the bigger test graph
save_embs: False
config/complex_query/heuristic_main.yaml (+31)

@@ -0,0 +1,31 @@
output_dir: ~/git/InductiveQE/experiments

dataset:
  class: InductiveFB15k237Comp
  path: ~/git/InductiveQE/data
  ratio: {{ ratio }}

task:
  class: InductiveLogicalQuery
  model:
    class: HeuristicBaseline
    inverse_mode: "2x"
  criterion: bce
  sample_weight: no
  adversarial_temperature: 0.1

optimizer:
  class: Adam
  lr: 5.0e-3

engine:
  gpus: {{ gpus }}
  batch_size: 512

train:
  num_epoch: 0  # the heuristic baseline doesn't need any training
  batch_per_epoch: 0

metric: mrr
skip_eval_on_train: True  # no eval on train queries needed here
save_embs: False
