
Titan Format

okram edited this page Jan 6, 2013 · 62 revisions

  • InputFormat: com.thinkaurelius.faunus.formats.titan.cassandra.TitanCassandraInputFormat
  • InputFormat: com.thinkaurelius.faunus.formats.titan.hbase.TitanHBaseInputFormat
  • OutputFormat: com.thinkaurelius.faunus.formats.titan.cassandra.TitanCassandraOutputFormat
  • OutputFormat: com.thinkaurelius.faunus.formats.titan.hbase.TitanHBaseOutputFormat

Titan is a distributed graph database developed by Aurelius and provided under the liberal Apache 2 license. Titan is backend agnostic and is currently deployed with support for Apache Cassandra and Apache HBase (see The Benefits of Titan and Storage Backend Overview).

Titan InputFormat Support

An InputFormat specifies how to turn a data source into a stream of Hadoop <KEY,VALUE> pairs (see blog post). For Faunus, this means turning the source data into a stream of <NullWritable, FaunusVertex> pairs. The following TitanXXXInputFormat classes stream Titan-encoded data stored in Cassandra and HBase into Faunus/Hadoop.

TitanCassandraInputFormat

In order to read graph data from Titan/Cassandra, a graph needs to exist. For the sake of an example, The Graph of the Gods dataset deployed with Titan can be loaded using Gremlin (see diagram at Getting Started).

gremlin> g = TitanFactory.open('bin/cassandra.local')
==>titangraph[cassandra:127.0.0.1]
gremlin> g.loadGraphML('data/graph-of-the-gods.xml')
==>null
gremlin> g.stopTransaction(SUCCESS)

In Faunus, a bin/titan-cassandra-input.properties file is provided with the following properties, which tell Faunus the location and features of the Titan/Cassandra cluster.

faunus.graph.input.format=com.thinkaurelius.faunus.formats.titan.cassandra.TitanCassandraInputFormat
titan.graph.input.storage.backend=cassandra
titan.graph.input.storage.hostname=localhost
titan.graph.input.storage.port=9160
titan.graph.input.storage.keyspace=titan
cassandra.input.partitioner.class=org.apache.cassandra.dht.RandomPartitioner

gremlin> g = FaunusFactory.open('bin/titan-cassandra-input.properties')
==>faunusgraph[titancassandrainputformat]
gremlin> g.V.count()
13/01/04 12:53:24 INFO mapreduce.FaunusCompiler: Compiled to 1 MapReduce job(s)
13/01/04 12:53:24 INFO mapreduce.FaunusCompiler: Executing job 1 out of 1: MapSequence[com.thinkaurelius.faunus.mapreduce.transform.VerticesMap.Map, com.thinkaurelius.faunus.mapreduce.util.CountMapReduce.Map, com.thinkaurelius.faunus.mapreduce.util.CountMapReduce.Reduce]
...
==>12

TitanHBaseInputFormat

The Graph of the Gods dataset deployed with Titan can be loaded into Titan/HBase using Gremlin (see diagram at Getting Started).

gremlin> g = TitanFactory.open('bin/hbase.local')
==>titangraph[hbase:127.0.0.1]
gremlin> g.loadGraphML('data/graph-of-the-gods.xml')
==>null
gremlin> g.stopTransaction(SUCCESS)

In Faunus, a bin/titan-hbase-input.properties file is provided with the following properties. This creates a FaunusGraph that is fed from Titan/HBase. Note that in multi-machine environments, titan.graph.input.storage.hostname should be set to the cluster-internal IP address of the machine running ZooKeeper, even if that machine is in fact localhost.

faunus.graph.input.format=com.thinkaurelius.faunus.formats.titan.hbase.TitanHBaseInputFormat
titan.graph.input.storage.backend=hbase
titan.graph.input.storage.hostname=localhost
titan.graph.input.storage.port=2181
titan.graph.input.storage.tablename=titan

gremlin> g = FaunusFactory.open('bin/titan-hbase-input.properties')
==>faunusgraph[titanhbaseinputformat]
gremlin> g.V.count()
13/01/04 15:40:56 INFO mapreduce.FaunusCompiler: Compiled to 1 MapReduce job(s)
13/01/04 15:40:56 INFO mapreduce.FaunusCompiler: Executing job 1 out of 1: MapSequence[com.thinkaurelius.faunus.mapreduce.transform.VerticesMap.Map, com.thinkaurelius.faunus.mapreduce.util.CountMapReduce.Map, com.thinkaurelius.faunus.mapreduce.util.CountMapReduce.Reduce]
...
==>12
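
For the multi-machine case noted above, one option is to load the shipped properties, override the hostname, and check that the backend-specific keys are present before opening a FaunusGraph. The following is a minimal sketch that uses only java.util.Properties. The class InputConfigCheck, its method names, and the address 10.0.0.5 are hypothetical and are not part of Faunus or Titan; the key names are the ones shown in the two input properties files above. Whether Faunus itself treats each of these keys as mandatory is an assumption of this sketch.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical pre-flight helper; not part of Faunus or Titan.
final class InputConfigCheck {
    static final String PREFIX = "titan.graph.input.storage.";

    // Parse properties-file text (same syntax as bin/titan-hbase-input.properties).
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            // Reading from an in-memory string is not expected to fail.
            throw new IllegalStateException(e);
        }
        return props;
    }

    // Return a copy of the properties with the storage hostname replaced.
    static Properties withHostname(Properties base, String hostname) {
        Properties copy = new Properties();
        copy.putAll(base);
        copy.setProperty(PREFIX + "hostname", hostname);
        return copy;
    }

    // List the keys used in the examples above that are absent for the configured backend.
    static List<String> missingKeys(Properties props) {
        List<String> required = new ArrayList<>();
        required.add("faunus.graph.input.format");
        required.add(PREFIX + "backend");
        required.add(PREFIX + "hostname");
        required.add(PREFIX + "port");
        String backend = props.getProperty(PREFIX + "backend", "");
        if (backend.equals("cassandra")) {
            required.add(PREFIX + "keyspace");
        } else if (backend.equals("hbase")) {
            required.add(PREFIX + "tablename");
        }
        List<String> missing = new ArrayList<>();
        for (String key : required) {
            if (props.getProperty(key) == null) {
                missing.add(key);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        String hbase = String.join("\n",
            "faunus.graph.input.format=com.thinkaurelius.faunus.formats.titan.hbase.TitanHBaseInputFormat",
            "titan.graph.input.storage.backend=hbase",
            "titan.graph.input.storage.hostname=localhost",
            "titan.graph.input.storage.port=2181",
            "titan.graph.input.storage.tablename=titan");
        // 10.0.0.5 stands in for the cluster-internal address of the ZooKeeper machine.
        Properties props = withHostname(parse(hbase), "10.0.0.5");
        System.out.println(props.getProperty(PREFIX + "hostname")); // prints 10.0.0.5
        System.out.println(missingKeys(props));                     // prints []
    }
}
```

The override returns a copy so that the shipped defaults stay untouched, which makes it easy to keep one properties file per environment.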

Please follow the links below for more information on streaming data out of HBase.

Titan OutputFormat Support

Faunus can be used to bulk load data into Titan. Given a stream of <NullWritable, FaunusVertex> pairs, a TitanXXXOutputFormat writes the stream faithfully to Titan. For all the examples that follow, it is assumed that data/graph-of-the-gods.json is in HDFS. Finally, note that it is typically a good idea to have the graph (keyspace/table) already initialized (e.g. g = TitanFactory.open(...)) before bulk writing, as the creation process takes time and a heavy write load during graph creation can yield exceptions.

TitanCassandraOutputFormat

faunus.graph.output.format=com.thinkaurelius.faunus.formats.titan.cassandra.TitanCassandraOutputFormat
titan.graph.output.storage.backend=cassandra
titan.graph.output.storage.hostname=localhost
titan.graph.output.storage.port=9160
titan.graph.output.storage.keyspace=titan
titan.graph.output.storage.batch-loading=true
titan.graph.output.ids.block-size=100000
titan.graph.output.infer-schema=true
blueprints.graph.output.tx-commit=5000

Here are some notes for the above properties.

  • storage.batch-loading: Setting this to true circumvents certain checks in Titan, which speeds up the writing process.
  • ids.block-size: When this value is small and the clients are writing lots of data, the clients communicate with Titan repeatedly to get new ids, which can cause exceptions as the id system stalls trying to serve all the clients.
  • infer-schema: When a new edge label or property key is provided to Titan, Titan updates its schema metadata. By inferring the schema prior to writing, exceptions can be circumvented.
  • tx-commit: It is possible to determine how many vertices/edges should be written before committing a transaction. It is important to batch so that not every write is a commit.
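
To get a feel for tx-commit and ids.block-size, the following back-of-the-envelope sketch counts how many commits and id-block requests a single writer would make. It assumes a deliberately simplified model that is not taken from Titan's implementation: each written element consumes one id, and a client requests one new block each time its current block is exhausted. The class BulkLoadEstimate and its method names are hypothetical.

```java
// Hypothetical back-of-the-envelope model; not part of Faunus or Titan.
final class BulkLoadEstimate {

    // Ceiling division: how many batches of the given size cover count items.
    static long ceilDiv(long count, long batch) {
        if (count < 0 || batch <= 0) {
            throw new IllegalArgumentException("count must be >= 0 and batch must be > 0");
        }
        return (count + batch - 1) / batch;
    }

    // Commits issued by one writer when it commits every txCommit elements.
    static long commits(long elements, long txCommit) {
        return ceilDiv(elements, txCommit);
    }

    // Id-block requests made by one writer, assuming one id per element (simplification).
    static long idBlockRequests(long elements, long blockSize) {
        return ceilDiv(elements, blockSize);
    }

    public static void main(String[] args) {
        long elements = 1_000_000L; // illustrative workload per writer
        System.out.println(commits(elements, 5000));           // prints 200
        System.out.println(commits(elements, 1));              // prints 1000000
        System.out.println(idBlockRequests(elements, 100000)); // prints 10
        System.out.println(idBlockRequests(elements, 1000));   // prints 1000
    }
}
```

Under this model, one million elements with the values shown above (tx-commit=5000, ids.block-size=100000) give 200 commits and 10 id-block requests per writer, whereas a block size of 1,000 would give 1,000 requests per writer. The real numbers depend on Titan's id allocation and on how many writers run concurrently.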

gremlin> g = FaunusFactory.open('bin/titan-cassandra-output.properties')
==>faunusgraph[graphsoninputformat]
gremlin> g.V.sideEffect('{it.blah = 42}') 
13/01/04 15:44:42 INFO mapreduce.FaunusCompiler: Compiled to 1 MapReduce job(s)
13/01/04 15:44:42 INFO mapreduce.FaunusCompiler: Executing job 1 out of 1: MapSequence[com.thinkaurelius.faunus.mapreduce.transform.VerticesMap.Map, com.thinkaurelius.faunus.mapreduce.sideeffect.SideEffectMap.Map, com.thinkaurelius.faunus.formats.BlueprintsGraphOutputMapReduce.Map, com.thinkaurelius.faunus.formats.BlueprintsGraphOutputMapReduce.Reduce]
...

In the above job, the Graph of the Gods GraphSON file is streamed from HDFS and each vertex has a new property added (blah=42). The output graph is pushed into Titan/Cassandra, where it can be inspected from the Titan/Gremlin console.

titan$ bin/gremlin.sh 

         \,,,/
         (o o)
-----oOOo-(_)-oOOo-----
gremlin> g = TitanFactory.open('bin/cassandra.local')
==>titangraph[cassandrathrift:127.0.0.1]
gremlin> g.v(4).map
==>{name=saturn, type=titan, blah=42}
gremlin>

TitanHBaseOutputFormat

faunus.graph.output.format=com.thinkaurelius.faunus.formats.titan.hbase.TitanHBaseOutputFormat
titan.graph.output.storage.backend=hbase
titan.graph.output.storage.hostname=localhost
titan.graph.output.storage.port=2181
titan.graph.output.storage.tablename=titan
titan.graph.output.storage.batch-loading=true
titan.graph.output.ids.block-size=100000
titan.graph.output.infer-schema=true
blueprints.graph.output.tx-commit=5000

NOTE: Please see the TitanCassandraOutputFormat section for information on the meaning of these properties.

The properties above are used to construct a FaunusGraph. The Gremlin traversal’s resultant graph is then written to Titan/HBase and the output process is complete.

gremlin> g = FaunusFactory.open('bin/titan-hbase-output.properties')    
==>faunusgraph[graphsoninputformat]
gremlin> g.V.sideEffect('{it.blah = 42}')                           
13/01/04 15:48:32 INFO mapreduce.FaunusCompiler: Compiled to 1 MapReduce job(s)
13/01/04 15:48:32 INFO mapreduce.FaunusCompiler: Executing job 1 out of 1: MapSequence[com.thinkaurelius.faunus.mapreduce.transform.VerticesMap.Map, com.thinkaurelius.faunus.mapreduce.sideeffect.SideEffectMap.Map, com.thinkaurelius.faunus.formats.BlueprintsGraphOutputMapReduce.Map, com.thinkaurelius.faunus.formats.BlueprintsGraphOutputMapReduce.Reduce]
...