README.html

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  <meta http-equiv="Content-Style-Type" content="text/css" />
  <meta name="generator" content="pandoc" />
  <title></title>
  <style type="text/css">code{white-space: pre;}</style>
</head>
<body>
<h1 id="xg">xg</h1>
<h2 id="succinct-labeled-graphs-with-collections-and-paths"><em>succinct labeled graphs with collections and paths</em></h2>
<p><a href="https://travis-ci.org/vgteam/xg"><img src="https://travis-ci.org/vgteam/xg.svg" alt="Build Status" /></a></p>
<h2 id="what">what?</h2>
<p>This library provides an interface to the construction and query of succinct/compressed labeled graphs, labeled collections of nodes and edges, and paths through the graph. The primary motivation is to index <a href="https://github.com/ekg/vg#vg">variation graphs</a> of the type produced by <a href="https://github.com/ekg/vg"><code>vg</code></a>, but in principle the library could be used for any labeled, directed graph.</p>
<p>Graphs indexed with <code>xg</code> can be loaded and queried using variety of high-performance interfaces backed by efficient, succinct data structures from <a href="https://github.com/simongog/sdsl-lite">sdsl-lite</a>.</p>
<h2 id="usage">usage</h2>
<p>First build:</p>
<pre class="shell"><code>make
make test</code></pre>
<p>Now you can index a graph constructed with vg, then obtain a subgraph starting at a particular node and extending up to 5 steps away:</p>
<pre class="shell"><code>./xg -v test/data/z.vg -o z.xg
./xg -i z.xg -n 235 -c 5 | vg view -</code></pre>
<p><code>vg view</code> allows us to see the graph in GFA format.</p>
<p>See <a href="https://github.com/ekg/xg/blob/master/main.cpp"><code>main.cpp</code></a> for example usage.</p>
<h2 id="challenge">challenge</h2>
<p><code>vg</code>'s current graph memory model is weak and extremely bloated. It relies on fixed-width 64-bit integer ids and large hash tables mapping these to other entities. This makes it difficult to store in memory, and a general-purpose key-value store (rocksdb) is used to allow low-memory access to the entire graph. Although this design has some advantages, querying the graph requires costly IO operations, and thus use must be managed carefully when developing high-performance applications.</p>
<p>Fully-indexed graphs should be cheap to store and hold in memory, but it doesn't seem there is a standard approach that can be used just for high-performance access to the sequence and identifier space of the graph. Most work has gone into improving performance for querying the text of such a graph (<a href="https://github.com/jltsiren/gcsa2">GCSA</a>) or generating one out of sequencing reads (assemblers such as <a href="https://github.com/jts/sga">SGA</a> or <a href="https://github.com/lh3/fermi2">fermi2</a>).</p>
<p>The basic requirement is a system that a minimal amount of memory to store the sequence of the graph, its edges, and paths in the graph, but still allows constant-time access to the essential features of the graph. The system should support accessing:</p>
<ul>
<li>the node's label (a DNA sequence, for instance, or URL)</li>
<li>the node's neighbors (inbound and outbound edges)</li>
<li>the node's region in the graph (ranges of node id space that are within some distance of the node)</li>
<li>node locations relative to stored paths in the graph</li>
<li>node and edge path membership</li>
</ul>
<h2 id="sketch">sketch</h2>
</body>
</html>