Node2Vec is a vertex embedding algorithm proposed in the paper "node2vec: Scalable Feature Learning for Networks". TigerGraph splits the computation into two parts: the random walk process and the embedding training process. If you are using TigerGraph 3.6 or later, you can skip the UDF installation instructions.
There are two random walk processes to choose from. The first is the regular (unweighted) random walk, implemented in tg_random_walk.gsql. This is equivalent to setting the Node2Vec parameters p and q both to 1, which also matches the walk used in the DeepWalk paper. This version is faster than tg_weighted_random_walk.gsql because it requires less computation. If the graph is large, you may want to batch the random walk process to reduce memory consumption; use tg_random_walk_batch.gsql for this.
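To illustrate why batching helps, here is a minimal Python sketch of the idea. It is not the logic of tg_random_walk_batch.gsql; the adjacency-list graph, the function names, and the `batch_size` parameter are all hypothetical. Walks are produced for one slice of start vertices at a time, so only one batch needs to be held in memory before it is written out.

```python
import random
from typing import Dict, Iterator, List

Graph = Dict[str, List[str]]  # hypothetical adjacency-list representation


def uniform_walk(graph: Graph, start: str, length: int, rng: random.Random) -> List[str]:
    """Visit up to `length` vertices, picking each next hop uniformly at random."""
    walk = [start]
    while len(walk) < length:
        neighbors = graph[walk[-1]]
        if not neighbors:  # dead end: stop the walk early
            break
        walk.append(rng.choice(neighbors))
    return walk


def walks_in_batches(
    graph: Graph, batch_size: int, length: int, rng: random.Random
) -> Iterator[List[List[str]]]:
    """Yield the walks for one batch of start vertices at a time."""
    starts = sorted(graph)
    for i in range(0, len(starts), batch_size):
        yield [uniform_walk(graph, s, length, rng) for s in starts[i:i + batch_size]]


graph = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"], "d": ["a"], "e": []}
# Materialized into a list only so the result can be inspected here;
# in practice each batch would be written out and then discarded.
batches = list(walks_in_batches(graph, batch_size=2, length=4, rng=random.Random(0)))
print([len(batch) for batch in batches])
```

With five start vertices and a batch size of two, the sketch yields three batches (two, two, and one walk).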
The second option is the weighted random walk, as described in the Node2Vec paper. It is implemented in tg_weighted_random_walk_sub.gsql and tg_weighted_random_walk.gsql. If your TigerGraph database version is below 3.6, see the UDF installation instructions below. If the graph is large, you may want to batch the random walk process to reduce memory consumption; use tg_weighted_random_walk_batch.gsql together with tg_weighted_random_walk_sub.gsql for this.
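The role of p and q can be sketched in a few lines of Python. This is an illustration of the second-order walk from the Node2Vec paper under simplifying assumptions (an undirected graph with unit edge weights, stored as a hypothetical dictionary of neighbor sets), not a transcription of the GSQL queries. Given the previous vertex and the current vertex, each candidate next vertex gets an unnormalized weight of 1/p if it returns to the previous vertex, 1 if it is also a neighbor of the previous vertex, and 1/q otherwise.

```python
import random
from typing import Dict, List, Set

Graph = Dict[str, Set[str]]  # hypothetical: vertex -> set of neighbors


def transition_weights(graph: Graph, prev: str, curr: str, p: float, q: float) -> Dict[str, float]:
    """Unnormalized Node2Vec weights for stepping from `curr`, having come from `prev`."""
    weights = {}
    for nxt in sorted(graph[curr]):
        if nxt == prev:             # distance 0 from prev: return step
            weights[nxt] = 1.0 / p
        elif nxt in graph[prev]:    # distance 1 from prev: stay close
            weights[nxt] = 1.0
        else:                       # distance 2 from prev: move outward
            weights[nxt] = 1.0 / q
    return weights


def biased_walk(graph: Graph, start: str, length: int, p: float, q: float,
                rng: random.Random) -> List[str]:
    """Generate one walk of up to `length` vertices using the p/q bias."""
    walk = [start]
    while len(walk) < length:
        curr = walk[-1]
        neighbors = sorted(graph[curr])
        if not neighbors:
            break
        if len(walk) == 1:  # no previous vertex yet: first step is uniform
            walk.append(rng.choice(neighbors))
            continue
        w = transition_weights(graph, walk[-2], curr, p, q)
        walk.append(rng.choices(neighbors, weights=[w[n] for n in neighbors])[0])
    return walk


graph = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
print(transition_weights(graph, "a", "c", p=1, q=1))    # all weights equal
print(transition_weights(graph, "a", "c", p=2, q=0.5))  # favors moving outward to "d"
print(biased_walk(graph, "a", 6, p=2, q=0.5, rng=random.Random(7)))
```

With p = q = 1 every neighbor receives the same weight, which is why the unweighted walk is the special case described earlier.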
To install the unweighted random walk, copy the algorithm from tg_random_walk.gsql and install it on the database using the standard query install process.
To install the weighted random walk, first copy and install tg_weighted_random_walk_sub.gsql, then copy and install tg_weighted_random_walk.gsql.
Once the random walks have been generated, we can use their output to train the Node2Vec model. Before installing the training query, make sure the required UDFs are installed. If you are using TigerGraph 3.6 or later, the UDFs are pre-installed.
To install the Node2Vec query, copy the query from tg_node2vec.gsql and install it on the database.
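For intuition about what the training step consumes: the Node2Vec paper builds on the skip-gram objective from word2vec, treating each walk like a sentence and each vertex like a word. The Python sketch below shows how (center, context) pairs could be drawn from a single walk. The function name and the `window` parameter are hypothetical, and this README does not state how the UDF forms its training pairs, so read it as an illustration of the general technique only.

```python
from typing import Iterator, List, Tuple


def skipgram_pairs(walk: List[str], window: int) -> Iterator[Tuple[str, str]]:
    """Yield (center, context) pairs for vertices within `window` steps of each other."""
    for i, center in enumerate(walk):
        lo = max(0, i - window)
        hi = min(len(walk), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a vertex is not its own context
                yield center, walk[j]


pairs = list(skipgram_pairs(["a", "b", "c", "d"], window=1))
print(pairs)
# [('a', 'b'), ('b', 'a'), ('b', 'c'), ('c', 'b'), ('c', 'd'), ('d', 'c')]
```

Vertices that co-occur often within the window across many walks end up with similar embeddings, which is how the walk strategy shapes the result.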
Vim is the text editor used in this README; any other text editor, such as Emacs or Nano, will work in the commands listed below.
Replace <TGversion> with your current TigerGraph version number.
If you are using tg_weighted_random_walk_sub.gsql, you will need to install the UDF in tg_random_udf.cpp. **The code defined in tg_random_udf.cpp should be pasted inside the UDIMPL namespace inside of ExprFunctions.hpp.**
# open file and paste code
$ vim ~/tigergraph/app/<TGversion>/dev/gdk/gsql/src/QueryUdf/ExprFunctions.hpp
tg_node2vec_sub() is a UDF that is called in tg_node2vec.gsql.
**The code defined in tg_node2vec_sub.cpp should be pasted inside the UDIMPL namespace inside of ExprFunctions.hpp.**
# open file and paste code
$ vim ~/tigergraph/app/<TGversion>/dev/gdk/gsql/src/QueryUdf/ExprFunctions.hpp
There are two options for getting word2vec.h:
- Download or copy the word2vec.h file into the ~/tigergraph/app/<TGversion>/dev/gdk/gsdk/include directory.
- Create the file, then copy the code from word2vec.h and paste it into the newly created file (steps shown below).
# Go to correct directory
$ cd ~/tigergraph/app/<TGversion>/dev/gdk/gsdk/include/
# create file and paste code
$ vim word2vec.h
The newly created word2vec.h needs to be included in the ExprUtil.hpp file:
$ vim ~/tigergraph/app/<TGversion>/dev/gdk/gsql/src/QueryUdf/ExprUtil.hpp
Once inside the text editor, paste the following line of C++ code below the other include statements:
#include "/home/tigergraph/tigergraph/app/<TGversion>/dev/gdk/gsdk/include/word2vec.h"
If you are working on a single machine, remove the DISTRIBUTED keyword from the header of the random_walk query:
# Change first header to the second header
CREATE DISTRIBUTED QUERY random_walk(...) {...}
CREATE QUERY random_walk(...) {...}
After doing this, proceed to the next section (Running Queries).
If you are on multiple machines or a cluster:
# For every machine in cluster
$ gssh <machine>
$ PUT ExprFunctions from "/home/tigergraph/tigergraph/app/<TGversion>/dev/gdk/gsql/src/QueryUdf/ExprFunctions.hpp"
**The following instructions can be carried out in GraphStudio or the GSQL terminal.**
- Install the random_walk query.
- Run the random_walk query with the desired parameters. Visit https://docs.tigergraph.com/graph-ml/current/node-embeddings/node2vec for a description of the random walk query parameters. Make sure that TigerGraph has permission to write to the output directory you specify.
- (optional) Inspect the output of the random_walk query:
# For the default filepath parameter
$ cat ~/path.csv
- Run node2vec_query with the desired parameters. Dimension denotes the embedding dimension size.
- (optional) Inspect the embeddings:
# For the default filepath parameter
$ cat ~/embedding.csv