<html lang="en">
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<meta charset="utf-8">
<meta name="generator" content="pandoc" />
<meta name="author" content="Ryan Hafen" />
<title>datadr</title>
<script src="assets/jquery-1.11.3/jquery.min.js"></script>
<link href="assets/bootstrap-3.3.2/css/bootstrap.min.css" rel="stylesheet" />
<script src="assets/bootstrap-3.3.2/js/bootstrap.min.js"></script>
<script src="assets/bootstrap-3.3.2/shim/html5shiv.min.js"></script>
<script src="assets/bootstrap-3.3.2/shim/respond.min.js"></script>
<link href="assets/highlight-8.4/tomorrow.css" rel="stylesheet" />
<script src="assets/highlight-8.4/highlight.pack.js"></script>
<link href="assets/fontawesome-4.3.0/css/font-awesome.min.css" rel="stylesheet" />
<script src="assets/stickykit-1.1.1/sticky-kit.min.js"></script>
<script src="assets/jqueryeasing-1.3/jquery.easing.min.js"></script>
<link href="assets/packagedocs-0.0.1/pd.css" rel="stylesheet" />
<script src="assets/packagedocs-0.0.1/pd.js"></script>
<script src="assets/packagedocs-0.0.1/pd-collapse-toc.js"></script>
<meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1">
</head>
<body>
<header class="navbar navbar-white navbar-fixed-top" role="banner" id="header">
<div class="container">
<div class="navbar-header">
<button class="navbar-toggle" type="button" data-toggle="collapse" data-target=".navbar-collapse">
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<span class="navbar-brand">
<a href="http://deltarho.org"> <img src='figures/icon.png' alt='deltarho icon' width='30px' height='30px' style='margin-top: -3px;'> </a>
</span>
<a href="index.html" class="navbar-brand page-scroll">
datadr - Divide and Recombine in R
</a>
</div>
<nav class="collapse navbar-collapse" role="navigation">
<ul class="nav nav-pills pull-right">
<li class="active">
<a href='index.html'>Docs</a>
</li>
<li>
<a href='rd.html'>Package Ref</a>
</li>
<li>
<a href='https://github.com/delta-rho/datadr'>Github <i class='fa fa-github'></i></a>
</li>
</ul>
</nav>
</div>
</header>
<!-- Begin Body -->
<div class="container">
<div class="row">
<div class="col-md-3" id="sidebar-col">
<div id="toc">
<ul>
<li><a href="#introduction">Introduction</a><ul>
<li><a href="#background">Background</a></li>
<li><a href="#package-overview">Package Overview</a></li>
<li><a href="#quickstart">Quickstart</a></li>
<li><a href="#for-plyr-dplyr-users">For plyr / dplyr Users</a></li>
<li><a href="#outline">Outline</a></li>
</ul></li>
<li><a href="#dealing-with-data-in-dr">Dealing with Data in D&R</a><ul>
<li><a href="#key-value-pairs">Key-Value Pairs</a></li>
<li><a href="#distributed-data-objects">Distributed Data Objects</a></li>
<li><a href="#distributed-data-frames">Distributed Data Frames</a></li>
<li><a href="#ddoddf-transformations">ddo/ddf Transformations</a></li>
<li><a href="#common-data-operations">Common Data Operations</a></li>
</ul></li>
<li><a href="#division-and-recombination">Division and Recombination</a><ul>
<li><a href="#high-level-interface">High-Level Interface</a></li>
<li><a href="#division">Division</a></li>
<li><a href="#recombination">Recombination</a></li>
<li><a href="#dr-examples">D&R Examples</a></li>
</ul></li>
<li><a href="#mapreduce">MapReduce</a><ul>
<li><a href="#introduction-to-mapreduce">Introduction to MapReduce</a></li>
<li><a href="#mapreduce-with-datadr">MapReduce with datadr</a></li>
<li><a href="#mapreduce-examples">MapReduce Examples</a></li>
<li><a href="#other-options">Other Options</a></li>
</ul></li>
<li><a href="#division-independent-methods">Division-Independent Methods</a><ul>
<li><a href="#all-data-computation">All-Data Computation</a></li>
<li><a href="#quantiles">Quantiles</a></li>
<li><a href="#aggregation">Aggregation</a></li>
<li><a href="#hexagonal-binning">Hexagonal Binning</a></li>
</ul></li>
<li><a href="#storecompute-backends">Store/Compute Backends</a><ul>
<li><a href="#backend-choices">Backend Choices</a></li>
<li><a href="#small-memory-cpu">Small: Memory / CPU</a></li>
<li><a href="#medium-disk-multicore">Medium: Disk / Multicore</a></li>
<li><a href="#large-hdfs-rhipe">Large: HDFS / RHIPE</a></li>
<li><a href="#conversion">Conversion</a></li>
<li><a href="#reading-in-data">Reading in Data</a></li>
</ul></li>
<li><a href="#misc">Misc</a><ul>
<li><a href="#debugging">Debugging</a></li>
<li><a href="#faq">FAQ</a></li>
<li><a href="#r-code">R Code</a></li>
</ul></li>
</ul>
</div>
</div>
<div class="col-md-9" id="content-col">
<div id="content-top"></div>
<div id="introduction" class="section level1">
<h1>Introduction</h1>
<div id="background" class="section level2">
<h2>Background</h2>
<p>This tutorial covers an implementation of Divide and Recombine (D&R) in the R statistical programming environment, via an R package called <code>datadr</code>. This is one component of the <a href="http://deltarho.org">DeltaRho</a> environment for the analysis of large complex data.</p>
<p>The goal of D&R is to provide an environment for data analysts to carry out deep statistical analysis of large, complex data with as much ease and flexibility as is possible with small datasets.</p>
<p>D&R is accomplished by dividing data into meaningful subsets, applying analytical methods to those subsets, and recombining the results. Recombinations can be numerical or visual. For visualization in the D&R framework, see <a href="http://github.com/delta-rho/trelliscope">Trelliscope</a>.</p>
<p>The diagram below is a visual representation of the D&R process.</p>
<p><img src="image/drdiagram.svg" width="650px" alt="drdiagram" style="display:block; margin:auto"/> <!--  --></p>
<p>For a given data set, which may be a collection of large csv files, an R data frame, etc., we apply a division method that partitions the data in some way that is meaningful for the analysis we plan to perform. Often the partitioning is a logical choice based on the subject matter. After dividing the data, we attack the resulting partitioning with several visual and numerical methods, applying each method independently to each subset and combining the results. There are many forms of divisions and recombinations, many of which will be covered in this tutorial.</p>
<div id="reference" class="section level3">
<h3>Reference</h3>
<p>References:</p>
<ul>
<li><a href="http://deltarho.org">deltarho.org</a></li>
<li><a href="http://onlinelibrary.wiley.com/doi/10.1002/sta4.7/full">Large complex data: divide and recombine (D&R) with RHIPE. <em>Stat</em>, 1(1), 53-67</a></li>
</ul>
<p>Related projects:</p>
<ul>
<li><a href="http://github.com/delta-rho/RHIPE">RHIPE</a>: the engine that makes D&R work for large datasets</li>
<li><a href="http://github.com/delta-rho/trelliscope">Trelliscope</a>: the visualization companion to datadr</li>
</ul>
</div>
</div>
<div id="package-overview" class="section level2">
<h2>Package Overview</h2>
<p>We’ll first lay out some of the major data types and functions in <code>datadr</code> to provide a feel for what is available in the package.</p>
<div id="data-types" class="section level3">
<h3>Data types</h3>
<p>The two major data types in <code>datadr</code> are distributed data frames and distributed data objects. A <em>distributed data frame (ddf)</em> can be thought of as a data frame that is split into chunks – each chunk is a subset of rows of the data frame – which may reside across nodes of a cluster (hence “distributed”). A <em>distributed data object (ddo)</em> is a similar notion except that each subset can be an object with arbitrary structure. Every distributed data frame is also a distributed data object.</p>
<p>The data structure we use to store ddo/ddf objects is the <em>key-value pair</em>. For our purposes, the key is typically a label that uniquely identifies a subset, and the value is the subset of the data corresponding to the key. Thus, a ddo/ddf is essentially a list in which each element contains a key-value pair.</p>
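<p>As a small concrete sketch of this structure (assuming <code>datadr</code> is installed and loaded), we can wrap R’s built-in <code>iris</code> data frame as an in-memory ddf and look at its first key-value pair:</p>
<pre class="r"><code>library(datadr)
# wrap a plain data frame as an in-memory distributed data frame
irisDdf <- ddf(iris)
# a ddf is essentially a list of key-value pairs; the first pair
# has a key (a label) and a value (a subset of rows of iris)
irisDdf[[1]]$key
head(irisDdf[[1]]$value)</code></pre>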
</div>
<div id="functions" class="section level3">
<h3>Functions</h3>
<p>Functions in <code>datadr</code> can be categorized according to the mechanisms they provide for working with data: distributed data types and backend connections, data operations, division-independent operations, and data ingest operations.</p>
<div id="distributed-data-types-backend-connections" class="section level4">
<h4>Distributed data types / backend connections</h4>
<p>Currently, there are three ways to store data using <code>datadr</code>: in memory, on a standard file system (e.g. a hard drive), and on the Hadoop Distributed File System (HDFS). Distributed data objects stored in memory do not require a connection to a backend. However, datasets that exceed available memory must be stored on disk or on HDFS via a backend connection:</p>
<ul>
<li><code><a target='_blank' href='rd.html#localdiskconn'>localDiskConn()</a></code>: instantiate backend connections to ddo / ddf objects that are persisted (i.e. ‘permanently’ stored) to local disk</li>
<li><code><a target='_blank' href='rd.html#hdfsconn'>hdfsConn()</a></code>: instantiate backend connections to ddo / ddf objects that are persisted to HDFS</li>
<li><code><a target='_blank' href='rd.html#ddf'>ddf()</a></code>: instantiate a ddf from a backend connection</li>
<li><code><a target='_blank' href='rd.html#ddo'>ddo()</a></code>: instantiate a ddo from a backend connection</li>
</ul>
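<p>To sketch how these fit together: a division can be written to local disk instead of memory by supplying a connection as the output of a data operation. The path below is an arbitrary example (this assumes the <code>housing</code> data from the <code>housingData</code> package and write access to the directory, which <code>autoYes = TRUE</code> lets <code>datadr</code> create):</p>
<pre class="r"><code># connect to (and, with autoYes = TRUE, create) a local disk location
diskConn <- localDiskConn("housing_divided", autoYes = TRUE)
# direct the result of a division to that connection instead of memory
byCountyDisk <- divide(housing, by = c("county", "state"),
  output = diskConn, update = TRUE)</code></pre>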
</div>
<div id="data-operations" class="section level4">
<h4>Data operations</h4>
<ul>
<li><code><a target='_blank' href='rd.html#divide'>divide()</a></code>: divide a ddf, either by conditioning variables or by randomly chosen subsets</li>
<li><code><a target='_blank' href='rd.html#recombine'>recombine()</a></code>: take the results of a computation applied to a ddo/ddf and combine them in a number of ways</li>
<li><code><a target='_blank' href='rd.html#drlapply'>drLapply()</a></code>: apply a function to each subset of a ddo/ddf and obtain a new ddo/ddf</li>
<li><code><a target='_blank' href='rd.html#drjoin'>drJoin()</a></code>: join multiple ddo/ddf objects by key</li>
<li><code><a target='_blank' href='rd.html#drsample'>drSample()</a></code>: take a random sample of subsets of a ddo/ddf</li>
<li><code><a target='_blank' href='rd.html#drfilter'>drFilter()</a></code>: filter out subsets of a ddo/ddf that do not meet a specified criterion</li>
<li><code><a target='_blank' href='rd.html#drsubset'>drSubset()</a></code>: return a subset data frame of a ddf</li>
<li><code><a target='_blank' href='rd.html#mrexec'>mrExec()</a></code>: run a traditional MapReduce job on a ddo/ddf</li>
</ul>
<p>All of these operations kick off MapReduce jobs to perform the desired computation. In <code>datadr</code>, we almost always want a new data set result right away, so there is not a prevailing notion of <em>deferred evaluation</em> as in other distributed computing frameworks. The only exception is a function that can be applied prior to or after any of these data operations that adds a transformation to be applied to each subset at the time of the next data operation. This function is <code><a target='_blank' href='rd.html#addtransform'>addTransform()</a></code> and will be discussed in greater detail later in the tutorial.</p>
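<p>As a self-contained sketch of how these operations compose (using R’s built-in <code>iris</code> data as a stand-in for a large dataset — the filter threshold here is arbitrary):</p>
<pre class="r"><code>library(datadr)
# divide iris into one subset per species
bySpecies <- divide(iris, by = "Species")
# keep only subsets whose mean sepal length exceeds 5
bigSepals <- drFilter(bySpecies,
  function(v) mean(v$Sepal.Length) > 5)
# add a deferred per-subset transformation, then recombine
# the per-subset means into a single data frame
meanBySpecies <- recombine(
  addTransform(bigSepals, function(v) mean(v$Sepal.Length)),
  combRbind)</code></pre>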
</div>
<div id="division-independent-operations" class="section level4">
<h4>Division-independent operations</h4>
<ul>
<li><code><a target='_blank' href='rd.html#drquantile'>drQuantile()</a></code>: estimate all-data quantiles, optionally by a grouping variable</li>
<li><code><a target='_blank' href='rd.html#draggregate'>drAggregate()</a></code>: all-data tabulation, similar to R’s <code><a target='_blank' href='rd.html#aggregate'>aggregate()</a></code> command</li>
<li><code><a target='_blank' href='rd.html#drhexbin'>drHexbin()</a></code>: all-data hexagonal binning aggregation</li>
</ul>
<p>Note that every data operation works in a backend-agnostic manner, meaning that whether you have data in memory, on your hard drive, or HDFS, you can run the same commands virtually unchanged.</p>
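<p>As a brief illustration (a sketch on R’s built-in <code>iris</code> data, divided by species), all-data quantiles can be estimated across every subset regardless of how the data were divided:</p>
<pre class="r"><code>bySpecies <- divide(iris, by = "Species")
# estimate all-data quantiles of sepal length over all subsets
slq <- drQuantile(bySpecies, var = "Sepal.Length")
head(slq)</code></pre>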
</div>
<div id="data-ingest" class="section level4">
<h4>Data ingest</h4>
<p>One of the most difficult aspects of dealing with very large data is getting the data into R. In <code>datadr</code>, we have extended the <code>read.table</code> family of functions. They are available as <code><a target='_blank' href='rd.html#drread_csv'>drRead.csv()</a></code>, <code><a target='_blank' href='rd.html#drread_delim'>drRead.delim()</a></code>, etc. See <code><a target='_blank' href='rd.html#drread_table'>drRead.table</a></code> for additional methods. These are particularly useful for backends like local disk and HDFS. Usage of these methods is discussed in the <a href="#reading-in-data">Reading in Data</a> section.</p>
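<p>A sketch of what ingest might look like in practice (the input file name and output path below are hypothetical; <code>rowsPerBlock</code> controls how many rows of the file end up in each subset):</p>
<pre class="r"><code># read a large csv into a ddf persisted to local disk,
# parsing the file in blocks of 50,000 rows
bigData <- drRead.csv("bigfile.csv",
  output = localDiskConn("bigfile_parsed", autoYes = TRUE),
  rowsPerBlock = 50000)</code></pre>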
</div>
</div>
</div>
<div id="quickstart" class="section level2">
<h2>Quickstart</h2>
<p>Before going into some of the details of <code>datadr</code>, let’s first run through some quick examples to get acquainted with some of the functionality of the package.</p>
<div id="package-installation" class="section level3">
<h3>Package installation</h3>
<p>First, we need to install the necessary components, <code>datadr</code> and <code>trelliscope</code>. These are R packages that we install from CRAN.</p>
<pre class="r"><code>install.packages(c("datadr", "trelliscope"))</code></pre>
<p>The example we go through uses a small dataset that we can handle in a local R session, so we only need these two packages installed. For other installation options when dealing with larger data sets, see the <a href="http://deltarho.org/#quickstart">quickstart</a> on our website.</p>
<p>We will use as an example a data set consisting of the median list and sold price of homes in the United States, aggregated by county and month from 2008 to early 2014. These data are available in a package called <code>housingData</code>. To install this package:</p>
<pre class="r"><code>install.packages("housingData")</code></pre>
</div>
<div id="environment-setup" class="section level3">
<h3>Environment setup</h3>
<p>Now we load the packages and look at the housing data:</p>
<pre class="r"><code>library(housingData)
library(datadr)
library(trelliscope)
head(housing)</code></pre>
<pre><code> fips county state time nSold medListPriceSqft
1 06037 Los Angeles County CA 2008-01-31 505900 NA
2 06037 Los Angeles County CA 2008-02-29 497100 NA
3 06037 Los Angeles County CA 2008-03-31 487300 NA
4 06037 Los Angeles County CA 2008-04-30 476400 NA
5 06037 Los Angeles County CA 2008-05-31 465900 NA
6 06037 Los Angeles County CA 2008-06-30 456000 NA
medSoldPriceSqft
1 360.1645
2 353.9788
3 349.7633
4 348.5246
5 343.8849
6 342.1065</code></pre>
<p>We see that we have a data frame with the information we discussed, in addition to the number of units sold.</p>
</div>
<div id="division-by-county-and-state" class="section level3">
<h3>Division by county and state</h3>
<p>One way we want to divide the data is by county name and state to be able to study how home prices have evolved over time within county. We can do this with a call to <code><a target='_blank' href='rd.html#divide'>divide()</a></code>:</p>
<pre class="r"><code>byCounty <- divide(housing,
by = c("county", "state"), update = TRUE)</code></pre>
<p>Our <code>byCounty</code> object is now a distributed data frame (ddf) that is stored in memory. We can see some of its attributes by printing the object:</p>
<pre class="r"><code>byCounty</code></pre>
<pre><code>
Distributed data frame backed by 'kvMemory' connection
attribute | value
----------------+-----------------------------------------------------------
names | fips(cha), time(Dat), nSold(num), and 2 more
nrow | 224369
size (stored) | 15.73 MB
size (object) | 15.73 MB
# subsets | 2883
* Other attributes: getKeys(), splitSizeDistn(), splitRowDistn(), summary()
* Conditioning variables: county, state</code></pre>
<p>We see there are 2883 counties, and we can access various attributes by calling methods such as <code>summary()</code>. The <code>update = TRUE</code> that we added to <code><a target='_blank' href='rd.html#divide'>divide()</a></code> provided some of these attributes. Let’s look at the summary:</p>
<pre class="r"><code>summary(byCounty)</code></pre>
<pre><code> fips time nSold
------------------- ------------------ -------------------
levels : 2883 missing : 0 missing : 164370
missing : 0 min : 08-10-01 min : 11
> freqTable head < max : 14-03-01 max : 35619
26077 : 140 mean : 274.6582
51069 : 140 std dev : 732.2429
08019 : 139 skewness : 10.338
13311 : 139 kurtosis : 222.8995
------------------- ------------------ -------------------
medListPriceSqft medSoldPriceSqft
-------------------- -------------------
missing : 48399 missing : 162770
min : 0.5482456 min : 17.40891
max : 1544.944 max : 1249.494
mean : 96.72912 mean : 105.5659
std dev : 56.12035 std dev : 69.40658
skewness : 6.816523 skewness : 5.610013
kurtosis : 94.06555 kurtosis : 60.48337
-------------------- ------------------- </code></pre>
<p>Since <code>datadr</code> knows that <code>byCounty</code> is a ddf, and because we set <code>update = TRUE</code>, after the division operation global summary statistics were computed for each of the variables.</p>
<p>Suppose we want a more meaningful global summary, such as computing quantiles. <code>datadr</code> can do this in a division-independent way with <code><a target='_blank' href='rd.html#drquantile'>drQuantile()</a></code>. For example, let’s look at quantiles for the median list price and plot them using <code>xyplot()</code> from the <code>lattice</code> package:</p>
<pre class="r"><code>library(lattice)
priceQ <- drQuantile(byCounty, var = "medListPriceSqft")
xyplot(q ~ fval, data = priceQ, scales = list(y = list(log = 10)))</code></pre>
<p><img src="index_files/figure-html/quickstart_quantile-1.png" title="" alt="" width="624" /></p>
<p>By the way, what does a subset of <code>byCounty</code> look like? <code>byCounty</code> is a list of <em>key-value pairs</em>, which we will learn more about later. Essentially, the collection of subsets can be thought of as a large list, where each list element has a key and a value. To look at the first key-value pair:</p>
<pre class="r"><code>byCounty[[1]]</code></pre>
<pre><code>$key
[1] "county=Abbeville County|state=SC"
$value
fips time nSold medListPriceSqft medSoldPriceSqft
1 45001 2008-10-01 NA 73.06226 NA
2 45001 2008-11-01 NA 70.71429 NA
3 45001 2008-12-01 NA 70.71429 NA
4 45001 2009-01-01 NA 73.43750 NA
5 45001 2009-02-01 NA 78.69565 NA
...</code></pre>
</div>
<div id="applying-an-analytic-method-and-recombination" class="section level3">
<h3>Applying an analytic method and recombination</h3>
<p>Now, suppose we wish to apply an analytic method to each subset of our data and recombine the result. A simple thing we may want to look at is the slope coefficient of a linear model applied to list prices vs. time for each county.</p>
<p>We can create a function that operates on an input data frame <code>x</code> that does this:</p>
<pre class="r"><code>lmCoef <- function(x)
coef(lm(medListPriceSqft ~ time, data = x))[2]</code></pre>
<p>We can apply this transformation to each subset in our data with <code><a target='_blank' href='rd.html#addtransform'>addTransform()</a></code>:</p>
<pre class="r"><code>byCountySlope <- addTransform(byCounty, lmCoef)</code></pre>
<p>This applies <code>lmCoef()</code> to each subset in a deferred fashion, meaning that for all intents and purposes we can think of <code>byCountySlope</code> as a distributed data object that contains the result of <code>lmCoef()</code> being applied to each subset. But computation is deferred until another data operation is applied to <code>byCountySlope</code>, such as a recombination, which we will do next.</p>
<p>When we look at a subset of <code>byCountySlope</code>, we see what the result will look like:</p>
<pre class="r"><code>byCountySlope[[1]]</code></pre>
<pre><code>$key
[1] "county=Abbeville County|state=SC"
$value
time
-0.0002323686 </code></pre>
<p>Now let’s recombine the slopes into a single data frame. This can be done with the <code><a target='_blank' href='rd.html#recombine'>recombine()</a></code> function, using the <code><a target='_blank' href='rd.html#combrbind'>combRbind</a></code> combiner, which is analogous to <code><a target='_blank' href='rd.html#rbind'>rbind()</a></code>:</p>
<pre class="r"><code>countySlopes <- recombine(byCountySlope, combRbind)</code></pre>
<pre class="r"><code>head(countySlopes)</code></pre>
<pre><code> county state val
time Abbeville County SC -0.0002323686
time1 Acadia Parish LA 0.0019518441
time2 Accomack County VA -0.0092717711
time3 Ada County ID -0.0030197554
time4 Adair County IA -0.0308381951
time5 Adair County KY 0.0034399585</code></pre>
</div>
<div id="joining-other-data-sets" class="section level3">
<h3>Joining other data sets</h3>
<p>There are several data operations beyond <code><a target='_blank' href='rd.html#divide'>divide()</a></code> and <code><a target='_blank' href='rd.html#recombine'>recombine()</a></code>. Let’s look at a quick example of one of these, <code><a target='_blank' href='rd.html#drjoin'>drJoin()</a></code>. Suppose we have multiple related data sources. For example, we have geolocation data for the county centroids. <code><a target='_blank' href='rd.html#drjoin'>drJoin()</a></code> will allow us to join multiple data sets by key.</p>
<p>We have a data set, <code>geoCounty</code>, also part of the <code>housingData</code> package, that we want to divide in the same way as we divided the <code>housing</code> data:</p>
<pre class="r"><code>head(geoCounty)</code></pre>
<pre><code> fips county state lon lat rMapState rMapCounty
1 01001 Autauga County AL -86.64565 32.54009 alabama autauga
2 01003 Baldwin County AL -87.72627 30.73831 alabama baldwin
3 01005 Barbour County AL -85.39733 31.87403 alabama barbour
4 01007 Bibb County AL -87.12526 32.99902 alabama bibb
5 01009 Blount County AL -86.56271 33.99044 alabama blount
6 01011 Bullock County AL -85.71680 32.10634 alabama bullock</code></pre>
<pre class="r"><code>geo <- divide(geoCounty, by = c("county", "state"))</code></pre>
<pre class="r"><code>geo[[1]]</code></pre>
<pre><code>$key
[1] "county=Abbeville County|state=SC"
$value
fips lon lat rMapState rMapCounty
1 45001 -82.45851 34.23021 south carolina abbeville</code></pre>
<p>We see that this division gives us a divided data set with the same keys as <code>byCounty</code>. So we can join it with <code>byCounty</code>:</p>
<pre class="r"><code>byCountyGeo <- drJoin(housing = byCounty, geo = geo)</code></pre>
<p>This provides us with a new ddo (no longer a distributed data frame) where, for each key, the value is a list containing a data frame <code>housing</code> holding the time series data and a data frame <code>geo</code> holding the geographic data. We can see the structure of this for a subset with:</p>
<pre class="r"><code>str(byCountyGeo[[1]])</code></pre>
<pre><code>List of 2
$ key : chr "county=Abbeville County|state=SC"
$ value:List of 2
..$ housing:'data.frame': 66 obs. of 5 variables:
.. ..$ fips : chr [1:66] "45001" "45001" "45001" "45001" ...
.. ..$ time : Date[1:66], format: "2008-10-01" ...
.. ..$ nSold : num [1:66] NA NA NA NA NA NA NA NA NA NA ...
.. ..$ medListPriceSqft: num [1:66] 73.1 70.7 70.7 73.4 78.7 ...
.. ..$ medSoldPriceSqft: num [1:66] NA NA NA NA NA NA NA NA NA NA ...
..$ geo :'data.frame': 1 obs. of 5 variables:
.. ..$ fips : chr "45001"
.. ..$ lon : num -82.5
.. ..$ lat : num 34.2
.. ..$ rMapState : chr "south carolina"
.. ..$ rMapCounty: chr "abbeville"
..- attr(*, "split")='data.frame': 1 obs. of 2 variables:
.. ..$ county: chr "Abbeville County"
.. ..$ state : chr "SC"
- attr(*, "class")= chr [1:2] "kvPair" "list"</code></pre>
</div>
<div id="trelliscope-display" class="section level3">
<h3>Trelliscope display</h3>
<p>We have a more comprehensive tutorial for using <a href="http://deltarho.org/docs-trelliscope/">Trelliscope</a>, but for completeness here and for some motivation to get through this tutorial and move on to the Trelliscope tutorial, we provide a simple example of taking a ddf and creating a Trelliscope display from it.</p>
<p>In short, a Trelliscope display is like a Trellis display, a ggplot with faceting, or a small-multiples plot: we break a set of data into pieces, apply a plot to each piece, and arrange the resulting plots in a grid for viewing. With Trelliscope, we can create such displays on data with a very large number of subsets and view them in an interactive and meaningful way.</p>
</div>
<div id="setting-up-a-visualization-database" class="section level3">
<h3>Setting up a visualization database</h3>
<p>For a Trelliscope display, we must connect to a “visualization database” (VDB), which is a directory on our computer where we are going to organize all of the information about our displays (we create many over the course of an analysis). Typically, we will set up a single VDB for each project we are working on. We can do this with the <code>vdbConn()</code> function:</p>
<pre class="r"><code>vdbConn("vdb", name = "deltarhoTutorial")</code></pre>
<p>This connects to a directory called <code>"vdb"</code> relative to our current working directory. R holds this connection in its global options so that subsequent calls will know where to put things without explicitly specifying the connection each time.</p>
</div>
<div id="creating-a-panel-function" class="section level3">
<h3>Creating a panel function</h3>
<p>To create a Trelliscope display, we need to first specify a <em>panel</em> function, which specifies what to plot for each subset. It takes as input either a key-value pair or just a value, depending on whether the function has two arguments or one.</p>
<p>For example, here is a panel function that takes a value and creates a lattice <code>xyplot</code> of list and sold price over time:</p>
<pre class="r"><code>timePanel <- function(x)
xyplot(medListPriceSqft + medSoldPriceSqft ~ time,
data = x, auto.key = TRUE, ylab = "Price / Sq. Ft.")</code></pre>
<p>Let’s test it on a subset:</p>
<pre class="r"><code>timePanel(byCounty[[20]]$value)</code></pre>
<p><img src="index_files/figure-html/quickstart_panel_test-1.png" title="" alt="" width="624" /></p>
<p>Great!</p>
</div>
<div id="creating-a-cognostics-function" class="section level3">
<h3>Creating a cognostics function</h3>
<p>Another optional thing we can do is specify a <em>cognostics</em> function that is applied to each subset. A cognostic is a metric that tells us an interesting attribute about a subset of data, and we can use cognostics to have more worthwhile interactions with all of the panels in the display. A cognostic function needs to return a list of metrics:</p>
<pre class="r"><code>priceCog <- function(x) { list(
slope = cog(lmCoef(x), desc = "list price slope"),
meanList = cogMean(x$medListPriceSqft),
listRange = cogRange(x$medListPriceSqft),
nObs = cog(length(which(!is.na(x$medListPriceSqft))),
desc = "number of non-NA list prices")
)}</code></pre>
<p>We use the <code>cog()</code> function to wrap our metrics so that we can provide a description for the cognostic. We may also employ special cognostics functions like <code>cogMean()</code> and <code>cogRange()</code> to compute mean and range with a default description.</p>
<p>We should test the cognostics function on a subset:</p>
<pre class="r"><code>priceCog(byCounty[[1]]$value)</code></pre>
<pre><code>$slope
time
-0.0002323686
$meanList
[1] 72.76927
$listRange
[1] 23.08482
$nObs
[1] 66</code></pre>
</div>
<div id="making-the-display" class="section level3">
<h3>Making the display</h3>
<p>Now we can create a Trelliscope display by sending our data, our panel function, and our cognostics function to <code>makeDisplay()</code>:</p>
<pre class="r"><code>makeDisplay(byCounty,
name = "list_sold_vs_time_datadr_tut",
desc = "List and sold price over time",
panelFn = timePanel,
cogFn = priceCog,
width = 400, height = 400,
lims = list(x = "same"))</code></pre>
<p>If you have been dutifully following along with this example in your own R console, you can now view the display with the following:</p>
<pre class="r"><code>view()</code></pre>
<p>If you have not been following along but are wondering what that <code>view()</code> command did, you can visit <a href="http://hafen.shinyapps.io/deltarhoTutorial/" target="_blank">here</a> for an online version. You will find a list of displays to choose from, of which the one with the name <code>list_sold_vs_time_datadr_tut</code> is the one we just created. This brings up the point that you can share your Trelliscope displays online – more about that as well as how to use the viewer will be covered in the Trelliscope tutorial – but feel free to play around with the viewer.</p>
<p>This covers the basics of <code>datadr</code> and a bit of <code>trelliscope</code>. Hopefully you now feel comfortable enough to dive in and try some things out. The remainder of this tutorial and the <a href="http://deltarho.org/docs-trelliscope/">Trelliscope</a> tutorial will provide greater detail.</p>
</div>
</div>
<div id="for-plyr-dplyr-users" class="section level2">
<h2>For plyr / dplyr Users</h2>
<p>Now that we have seen some examples and have a good feel for what <code>datadr</code> can do, if you have used the <code>plyr</code> or <code>dplyr</code> packages, you may be noticing a few similarities.</p>
<p>If you have not used these packages before, you can skip this section, but if you have, we will go over a quick simple example of how to do the same thing in the three packages to help the <code>plyr</code> user have a better understanding of how to map their knowledge of those packages to <code>datadr</code>.</p>
<p>It is also worth discussing some of the similarities and differences to help understand when <code>datadr</code> is useful. We expand on this in the <a href="#faq">FAQ</a>. In a nutshell, <code>datadr</code> and <code>dplyr</code> are very different and are actually complementary. We often use the amazing features of <code>dplyr</code> for within-subset computations, but we need <code>datadr</code> to deal with complex data structures and potentially very large data.</p>
<div id="code-comparison" class="section level3">
<h3>Code Comparison</h3>
<p>For a simple example, we turn to the famous iris data. Suppose we want to compute the mean sepal length by species:</p>
<div id="with-plyr" class="section level4">
<h4>With <code>plyr</code>:</h4>
<pre class="r"><code>library(plyr)
ddply(iris, .(Species), function(x)
data.frame(val = mean(x$Sepal.Length)))</code></pre>
<p>With <code>plyr</code>, we are performing the split, apply, and combine all in the same step.</p>
</div>
<div id="with-dplyr" class="section level4">
<h4>With <code>dplyr</code>:</h4>
<pre class="r"><code>library(dplyr)
iris %>%
group_by(Species) %>%
summarise(val = mean(Sepal.Length))</code></pre>
<p>Here, we call <code>group_by()</code> to create a grouped version of <code>iris</code>, which is the same data but with additional information about the indices of the rows belonging to each species. Then we call <code>summarise()</code>, which computes the mean sepal length for each group and returns the result as a data frame.</p>
</div>
<div id="with-datadr" class="section level4">
<h4>With <code>datadr</code>:</h4>
<pre class="r"><code>library(datadr)
divide(iris, by = "Species") %>%
addTransform(function(x) mean(x$Sepal.Length)) %>%
recombine(combRbind)</code></pre>
<p>Here, we call <code><a target='_blank' href='rd.html#divide'>divide()</a></code> to partition the iris data by species, resulting in a “distributed data frame”. Note that this result is a new data object - an important and deliberate distinction. Then we call <code><a target='_blank' href='rd.html#addtransform'>addTransform()</a></code> to apply a function that computes the mean sepal length to each partition. Finally, we call <code><a target='_blank' href='rd.html#recombine'>recombine()</a></code> to bind all the results into a single data frame.</p>
</div>
</div>
</div>
<div id="outline" class="section level2">
<h2>Outline</h2>
<p>The outline for the remainder of this tutorial is as follows:</p>
<ul>
<li>First, we cover the foundational D&R data structure, key-value pairs, and how they are used to build distributed data objects and distributed data frames.</li>
<li>Next, we provide an introduction to the high-level division and recombination methods in <code>datadr</code>.</li>
<li>Then we discuss MapReduce - the lower-level language for accomplishing D&R tasks - which is the engine for the higher-level D&R methods. It is anticipated that the high-level language will be sufficient for most analysis tasks, but the lower-level approach is also exposed for special cases.</li>
<li>We then cover some division-independent methods that do various computations across the entire data set, regardless of how it is divided, such as all-data quantiles.</li>
<li>For all of these discussions, we use small data sets that fit in memory for illustrative purposes. This way everyone can follow along without having a large-scale backend like Hadoop running and configured. However, the true power of D&R is with large data sets, and after introducing all of this material, we cover different backends for computation and storage that are currently supported for D&R. The interface always remains the same regardless of the backend, but there are various things to discuss for each case. The backends discussed are:
<ul>
<li><strong>in-memory / single core R:</strong> ideal for small data</li>
<li><strong>local disk / multicore R:</strong> ideal for medium-sized data (too big for memory, small enough for local disk)</li>
<li><strong>Hadoop Distributed File System (HDFS) / RHIPE / Hadoop MapReduce:</strong> ideal for very large data sets</li>
</ul></li>
<li>We also provide R source files for all of the examples throughout the documentation.</li>
</ul>
<div class="alert alert-warning">
<strong>Note:</strong> Throughout the tutorial, the examples cover very small, simple datasets. This is by design, as the focus is on getting familiar with the available commands. Keep in mind that the same interface works for very large datasets, and that design choices have been made with scalability in mind.
</div>
</div>
</div>
<div id="dealing-with-data-in-dr" class="section level1">
<h1>Dealing with Data in D&R</h1>
<div id="key-value-pairs" class="section level2">
<h2>Key-Value Pairs</h2>
<p>In D&R, data is partitioned into subsets. Each subset is represented as a <em>key-value pair</em>. Collections of key-value pairs are <em>distributed data objects (ddo)</em>, or in the case of the value being a data frame, <em>distributed data frames (ddf)</em>, and form the basic input and output types for all D&R operations. This section introduces these concepts and illustrates how they are used in datadr.</p>
<div id="key-value-pairs-in-datadr" class="section level3">
<h3>Key-value pairs in datadr</h3>
<p>In datadr, key-value pairs are R lists with two elements, one for the key and one for the value. For example,</p>
<pre class="r"><code># simple key-value pair example
list(1:5, rnorm(10))</code></pre>
<pre><code>[[1]]
[1] 1 2 3 4 5
[[2]]
[1] -1.2070657 0.2774292 1.0844412 -2.3456977 0.4291247 0.5060559
[7] -0.5747400 -0.5466319 -0.5644520 -0.8900378</code></pre>
<p>is a key-value pair with integers 1-5 as the key and 10 random normals as the value. Typically, a key is used as a unique identifier for the value. For datadr it is recommended to make the key a simple string when possible.</p>
<p>There is a convenience function <code><a target='_blank' href='rd.html#kvpair'>kvPair()</a></code> for specifying a key-value pair:</p>
<pre class="r"><code># using kvPair
kvPair(1:5, rnorm(10))</code></pre>
<pre><code>$key
[1] 1 2 3 4 5
$value
[1] -0.47719270 -0.99838644 -0.77625389 0.06445882 0.95949406
[6] -0.11028549 -0.51100951 -0.91119542 -0.83717168 2.41583518</code></pre>
<p>This provides names for the list elements and is a useful function when an operation must explicitly know that it is dealing with a key-value pair and not just a list.</p>
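<p>Conceptually, <code>kvPair()</code> amounts to little more than a named two-element list. Here is a base R sketch of the structure (an illustration, not the actual implementation):</p>
<pre class="r"><code># a plausible base R sketch of the kvPair() structure
kvPairSketch <- function(k, v) list(key = k, value = v)
kvPairSketch("setosa", head(iris, 2))</code></pre>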
</div>
<div id="key-value-pair-collections" class="section level3">
<h3>Key-value pair collections</h3>
<p>D&R data objects are made up of collections of key-value pairs. In datadr, these are represented as lists of key-value pair lists. As an example, consider the iris data set, which consists of measurements of 4 aspects for 50 flowers from each of 3 species of iris. Suppose we would like to split the data into key-value pairs by species. We can do this by passing key-value pairs to a function <code><a target='_blank' href='rd.html#kvpairs'>kvPairs()</a></code>:</p>
<pre class="r"><code># create by-species key-value pairs
irisKV <- kvPairs(
kvPair("setosa", subset(iris, Species == "setosa")),
kvPair("versicolor", subset(iris, Species == "versicolor")),
kvPair("virginica", subset(iris, Species == "virginica"))
)
irisKV</code></pre>
<pre><code>[[1]]
$key
[1] "setosa"
$value
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
...
[[2]]
$key
[1] "versicolor"
$value
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
51 7.0 3.2 4.7 1.4 versicolor
52 6.4 3.2 4.5 1.5 versicolor
53 6.9 3.1 4.9 1.5 versicolor
54 5.5 2.3 4.0 1.3 versicolor
55 6.5 2.8 4.6 1.5 versicolor
...
[[3]]
$key
[1] "virginica"
$value
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
101 6.3 3.3 6.0 2.5 virginica
102 5.8 2.7 5.1 1.9 virginica
103 7.1 3.0 5.9 2.1 virginica
104 6.3 2.9 5.6 1.8 virginica
105 6.5 3.0 5.8 2.2 virginica
...</code></pre>
<p>The result is a list of 3 key-value pairs. We chose the species to be the key and the corresponding data frame to be the value for each pair.</p>
<p><code><a target='_blank' href='rd.html#kvpairs'>kvPairs()</a></code> is basically a wrapper for <code>list()</code>. It checks to make sure key-value pairs are valid and makes sure they are printed nicely. In practice we actually very rarely need to specify key-value pairs like this, but this is useful for illustration.</p>
<p>This example shows how we can partition our data into key-value pairs that have meaning – each subset represents measurements for one species. The ability to divide the data up into pieces allows us to distribute datasets that might be too large for a single disk across multiple machines, and also allows us to distribute computation, because in D&R we apply methods independently to each subset.</p>
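<p>The core idea can be sketched in base R alone: <code>split()</code> builds the per-species subsets, and <code>sapply()</code> applies a method to each subset independently. This is a toy illustration of the D&R pattern, not datadr code:</p>
<pre class="r"><code># base R sketch of the D&R pattern: split into key-value pairs,
# then apply a method independently to each subset
irisKVlist <- lapply(split(iris, iris$Species), function(d)
  list(key = as.character(d$Species[1]), value = d))
sapply(irisKVlist, function(kv) mean(kv$value$Sepal.Length))
#   setosa versicolor  virginica
#    5.006      5.936      6.588</code></pre>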
<p>Here, we manually created the partition by species, but datadr provides simple mechanisms for specifying divisions, which we will cover <a href="#division">later in the tutorial</a>. Prior to doing that, however, we need to discuss how collections of key-value pairs are represented in datadr as distributed data objects.</p>
</div>
</div>
<div id="distributed-data-objects" class="section level2">
<h2>Distributed Data Objects</h2>
<p>In datadr, a collection of key-value pairs along with attributes about the collection constitute a distributed data object (ddo). Most datadr operations require a ddo, and hence it is important to represent key-value pair collections as such.</p>
<p>We will continue to use our collection of key-value pairs we created previously <code>irisKV</code>:</p>
<pre class="r"><code>irisKV <- kvPairs(
kvPair("setosa", subset(iris, Species == "setosa")),
kvPair("versicolor", subset(iris, Species == "versicolor")),
kvPair("virginica", subset(iris, Species == "virginica"))
)</code></pre>
<div id="initializing-a-ddo" class="section level3">
<h3>Initializing a ddo</h3>
<p>To initialize a collection of key-value pairs as a distributed data object, we use the <code><a target='_blank' href='rd.html#ddo'>ddo()</a></code> function:</p>
<pre class="r"><code># create ddo object from irisKV
irisDdo <- ddo(irisKV)</code></pre>
<p><code><a target='_blank' href='rd.html#ddo'>ddo()</a></code> simply takes the collection of key-value pairs and attaches additional attributes to the resulting ddo object. Note that in this example, since the data is in memory, we are supplying the data directly as the argument to <code><a target='_blank' href='rd.html#ddo'>ddo()</a></code>. For larger datasets stored in more scalable backends, instead of passing the data directly, a connection that points to where the key-value pairs are stored is provided. This is discussed in more detail in the <a href="#backend-choices">Store/Compute Backends</a> sections.</p>
<p>Objects of class “ddo” have several methods that can be invoked on them. The most simple of these is a print method:</p>
<pre class="r"><code>irisDdo</code></pre>
<pre><code>
Distributed data object backed by 'kvMemory' connection
attribute | value
----------------+-----------------------------------------------------------
size (stored) | 12.67 KB
size (object) | 12.67 KB
# subsets | 3
* Other attributes: getKeys()
* Missing attributes: splitSizeDistn</code></pre>
<p>The print method shows several attributes that have been computed for the data.</p>
</div>
<div id="ddo-attributes" class="section level3">
<h3>ddo attributes</h3>
<p>From the printout of <code>irisDdo</code>, we see that a ddo has several attributes. The most basic ones:</p>
<ul>
<li><code>size (object)</code>: The total size of all of the data as represented in memory in R is 12.67 KB (that’s some big data!)</li>
<li><code>size (stored)</code>: With backends other than in-memory, the size of data serialized and possibly compressed to disk can be very different from object size, which is useful to know. In this case, it’s the same since the object is in memory.</li>
<li><code># subsets</code>: There are 3 subsets (one for each species)</li>
</ul>
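<p>The distinction between object size and stored size is easy to illustrate in base R: serializing an object (and optionally compressing it) generally yields a different byte count than its in-memory representation. This is just an illustration of the concept, not how datadr stores data:</p>
<pre class="r"><code># in-memory size vs. serialized (and gzip-compressed) size of iris
objBytes <- as.numeric(object.size(iris))
serBytes <- length(serialize(iris, NULL))
gzBytes <- length(memCompress(serialize(iris, NULL), "gzip"))
c(object = objBytes, serialized = serBytes, gzipped = gzBytes)</code></pre>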
<p>We can look at the keys with:</p>
<pre class="r"><code># look at irisDdo keys
getKeys(irisDdo)</code></pre>
<pre><code>[[1]]
[1] "setosa"
[[2]]
[1] "versicolor"
[[3]]
[1] "virginica"</code></pre>
<p>We can also get an example key-value pair:</p>
<pre class="r"><code># look at an example key-value pair of irisDdo
kvExample(irisDdo)</code></pre>
<pre><code>$key
[1] "setosa"
$value
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
...</code></pre>
<p><code>kvExample</code> is useful for obtaining a subset key-value pair against which we can test out different analytical methods before applying them across the entire data set.</p>
<p>Another attribute, <code>splitSizeDistn</code> is empty. This attribute provides information about the quantiles of the distribution of the size of each division. With very large data sets with a large number of subsets, this can be useful for getting a feel for how uniform the subset sizes are.</p>
<p>The <code>splitSizeDistn</code> attribute and more that we will see in the future are not computed by default when <code><a target='_blank' href='rd.html#ddo'>ddo()</a></code> is called. This is because it requires a computation over the data set, which can take some time with very large datasets, and may not always be desired or necessary.</p>
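<p>Conceptually, the split-size distribution is just a set of quantiles of the per-subset object sizes, which we can mimic in base R (an illustration of the idea only):</p>
<pre class="r"><code># quantiles of per-subset sizes, sketched with base R
subsetSizes <- vapply(split(iris, iris$Species),
  function(d) as.numeric(object.size(d)), numeric(1))
quantile(subsetSizes)</code></pre>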
</div>
<div id="updating-attributes" class="section level3">
<h3>Updating attributes</h3>
<p>If you decide at any point that you would like to update the attributes of your ddo, you can call:</p>
<pre class="r"><code># update irisDdo attributes
irisDdo <- updateAttributes(irisDdo)
irisDdo</code></pre>
<pre><code>
Distributed data object backed by 'kvMemory' connection
attribute | value
----------------+-----------------------------------------------------------
size (stored) | 12.67 KB
size (object) | 12.67 KB
# subsets | 3
* Other attributes: getKeys(), splitSizeDistn()</code></pre>
<p>The <code>splitSizeDistn</code> attribute is now available. We can look at it with the accessor <code>splitSizeDistn()</code>:</p>
<pre class="r"><code>par(mar = c(4.1, 4.1, 1, 0.2))
# plot distribution of the size of the key-value pairs</code></pre>
<p><img src="index_files/figure-html/plot_iris_split_size-1.png" title="" alt="" width="624" /></p>
<p>Another way to get updated attributes is at the time the ddo is created, by setting <code>update = TRUE</code>:</p>
<pre class="r"><code># update at the time ddo() is called
irisDdo <- ddo(irisKV, update = TRUE)</code></pre>
</div>
<div id="note-about-storage-and-computation" class="section level3">
<h3>Note about storage and computation</h3>
<p>Notice the first line of output from the <code>irisDdo</code> object printout. It states that the object is backed by a “kvMemory” (key-value pairs in memory) connection.</p>
<p>We will talk about other backends for storing and processing larger data sets that don’t fit in memory or even on your workstation’s disk. The key here is that the interface always stays the same, regardless of whether we are working with terabytes or kilobytes of data.</p>
</div>
<div id="accessing-subsets" class="section level3">
<h3>Accessing subsets</h3>
<p>We can access subsets of the data by key or by index:</p>
<pre class="r"><code>irisDdo[["setosa"]]</code></pre>
<pre><code>$key
[1] "setosa"
$value
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
...</code></pre>
<pre class="r"><code>irisDdo[[1]]</code></pre>
<pre><code>$key
[1] "setosa"
$value
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
...</code></pre>
<pre class="r"><code>irisDdo[c("setosa", "virginica")]</code></pre>
<pre><code>[[1]]
$key
[1] "setosa"
$value
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
...
[[2]]
$key
[1] "virginica"
$value
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
101 6.3 3.3 6.0 2.5 virginica
102 5.8 2.7 5.1 1.9 virginica
103 7.1 3.0 5.9 2.1 virginica
104 6.3 2.9 5.6 1.8 virginica
105 6.5 3.0 5.8 2.2 virginica
...</code></pre>
<pre class="r"><code>irisDdo[1:2]</code></pre>
<pre><code>[[1]]
$key
[1] "setosa"
$value
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
...
[[2]]
$key
[1] "versicolor"
$value
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
51 7.0 3.2 4.7 1.4 versicolor
52 6.4 3.2 4.5 1.5 versicolor
53 6.9 3.1 4.9 1.5 versicolor
54 5.5 2.3 4.0 1.3 versicolor
55 6.5 2.8 4.6 1.5 versicolor
...</code></pre>
<p>Accessing by key is much simpler when the key is a character string, but subsetting works even when passing a list of non-string keys, or even a <code>digest()</code> of the desired key object (if you don’t know what that means, don’t worry!).</p>
</div>
</div>
<div id="distributed-data-frames" class="section level2">
<h2>Distributed Data Frames</h2>
<p>Key-value pairs in distributed data objects can have any structure. If we constrain the values to be data frames or readily transformable into data frames, we can represent the object as a distributed data frame (ddf). A ddf is a ddo with additional attributes. Having a uniform data frame structure for the values provides several benefits and data frames are required for specifying division methods.</p>
<div id="initializing-a-ddf" class="section level3">
<h3>Initializing a ddf</h3>
<p>Our <code>irisKV</code> data we created earlier has values that are data frames, so we can cast it as a distributed data frame like this:</p>
<pre class="r"><code># create ddf object from irisKV
irisDdf <- ddf(irisKV, update = TRUE)
irisDdf</code></pre>
<pre><code>
Distributed data frame backed by 'kvMemory' connection
attribute | value
----------------+-----------------------------------------------------------
names | Sepal.Length(num), Sepal.Width(num), and 3 more
nrow | 150
size (stored) | 12.67 KB
size (object) | 12.67 KB
# subsets | 3
* Other attributes: getKeys(), splitSizeDistn(), splitRowDistn(), summary()</code></pre>
</div>
<div id="ddf-attributes" class="section level3">
<h3>ddf attributes</h3>
<p>The printout of <code>irisDdf</code> above shows the ddo attributes we saw previously (because every ddf is also a ddo), but we also see some new data-frame-related attributes (which were automatically updated because we specified <code>update = TRUE</code>). These include:</p>
<ul>
<li><code>names</code>: a list of the variables</li>
<li><code>nrow</code>: the total number of rows in the data set</li>
</ul>
<p>Also there are additional “other” attributes listed at the bottom. The <code>summary</code> attribute can be useful for getting an initial look at the variables in your ddf, and is sometimes required for later computations, such as quantile estimation with <code><a target='_blank' href='rd.html#drquantile'>drQuantile()</a></code>, where the range of a variable is required to get a good quantile approximation. Summary statistics are all computed simultaneously in one MapReduce job with a call to <code><a target='_blank' href='rd.html#updateattributes'>updateAttributes()</a></code>.</p>
<p>The numerical summary statistics are computed using a <a href="http://janinebennett.org/index_files/ParallelStatisticsAlgorithms.pdf">numerically stable algorithm</a>.</p>
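<p>The idea behind such algorithms is that per-subset moments can be merged without revisiting the raw rows. Here is a hedged base R sketch of the merge step, following the pairwise update of Chan et al.; this is an illustration of the technique, not datadr’s code:</p>
<pre class="r"><code># merge per-subset (n, mean, M2) triples into all-data moments
combineMoments <- function(a, b) {
  n <- a$n + b$n
  delta <- b$mean - a$mean
  list(n = n,
       mean = a$mean + delta * b$n / n,
       M2 = a$M2 + b$M2 + delta^2 * a$n * b$n / n)
}
moments <- function(x)
  list(n = length(x), mean = mean(x), M2 = sum((x - mean(x))^2))

parts <- lapply(split(iris$Sepal.Length, iris$Species), moments)
total <- Reduce(combineMoments, parts)
# variance from the merged M2 matches the all-data variance
all.equal(total$M2 / (total$n - 1), var(iris$Sepal.Length))</code></pre>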
<p>Summary statistics include:</p>
<p>For each numeric variable:</p>
<ul>
<li><code>nna</code>: number of missing values</li>
<li><code>stats</code>: list of mean, variance, skewness, kurtosis</li>
<li><code>range</code>: min, max</li>
</ul>
<p>For each categorical variable:</p>
<ul>
<li><code>nobs</code>: number of observations</li>
<li><code>nna</code>: number of missing values</li>
<li><code>freqTable</code>: a data frame containing a frequency table</li>
</ul>
<p>Summaries can be accessed by:</p>
<pre class="r"><code># look at irisDdf summary stats
summary(irisDdf)</code></pre>
<pre><code> Sepal.Length Sepal.Width Petal.Length
-------------------- -------------------- ---------------------
missing : 0 missing : 0 missing : 0
min : 4.3 min : 2 min : 1
max : 7.9 max : 4.4 max : 6.9
mean : 5.843333 mean : 3.057333 mean : 3.758
std dev : 0.8280661 std dev : 0.4358663 std dev : 1.765298
skewness : 0.3117531 skewness : 0.3157671 skewness : -0.2721277
kurtosis : 2.426432 kurtosis : 3.180976 kurtosis : 1.604464
-------------------- -------------------- ---------------------
Petal.Width Species
--------------------- ------------------
missing : 0 levels : 3
min : 0.1 missing : 0
max : 2.5 > freqTable head <
mean : 1.199333 setosa : 50
std dev : 0.7622377 versicolor : 50
skewness : -0.1019342 virginica : 50
kurtosis : 1.663933
--------------------- ------------------ </code></pre>
<p>For categorical variables, the top four values and their frequencies are printed. To access the values themselves, we can do, for example:</p>
<pre class="r"><code>summary(irisDdf)$Sepal.Length$stats</code></pre>
<pre><code>$mean
[1] 5.843333
$var
[1] 0.6856935
$skewness
[1] 0.3117531
$kurtosis
[1] 2.426432</code></pre>
<p>or:</p>
<pre class="r"><code>summary(irisDdf)$Species$freqTable</code></pre>
<pre><code> value Freq
1 setosa 50
2 versicolor 50
3 virginica 50</code></pre>
</div>
<div id="data-frame-like-ddf-methods" class="section level3">
<h3>Data frame-like “ddf” methods</h3>
<p>Note that with an object of class “ddf”, you can use some of the methods that apply to regular data frames:</p>
<pre class="r"><code>nrow(irisDdf)</code></pre>
<pre><code>150</code></pre>
<pre class="r"><code>ncol(irisDdf)</code></pre>
<pre><code>5</code></pre>
<pre class="r"><code>names(irisDdf)</code></pre>
<pre><code>[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
[5] "Species" </code></pre>
<p>However, datadr does not go too far beyond this in terms of making a ddf feel or behave exactly like a regular R data frame.</p>
</div>
<div id="passing-a-data-frame-to-ddo-and-ddf" class="section level3">
<h3>Passing a data frame to <code>ddo()</code> and <code>ddf()</code></h3>
<p>It is worth noting that it is possible to pass a single data frame to <code><a target='_blank' href='rd.html#ddo'>ddo()</a></code> or <code><a target='_blank' href='rd.html#ddf'>ddf()</a></code>. The result is a single key-value pair with the data frame as the value, and <code>""</code> as the key. This is an option strictly for convenience and with the idea that further down the line operations will be applied that split the data up into a more useful set of key-value pairs. Here is an example:</p>
<pre class="r"><code># initialize ddf from a data frame
irisDf <- ddf(iris, update = TRUE)</code></pre>
<p>This of course only makes sense for data small enough to fit in memory in the first place. In the <a href="#small-memory--cpu">backends</a> section, we will discuss other backends for larger data and how data can be added to objects or read in from a raw source in these cases.</p>
</div>
</div>
<div id="ddoddf-transformations" class="section level2">
<h2>ddo/ddf Transformations</h2>
<p>A very common thing to want to do to a ddo or ddf is apply a transformation to each of the subsets. For example, we may want to apply a transformation that:</p>
<ul>
<li>adds a new derived variable to a subset of a ddf</li>
<li>applies a statistical method or summarization to each subset</li>
<li>coerces each subset into a data frame</li>
<li>etc.</li>
</ul>
<p>This will be a routine thing to do when we start talking about D&R operations.</p>
<p>We can add transformations to a ddo/ddf using <code><a target='_blank' href='rd.html#addtransform'>addTransform()</a></code>. Let’s look at an example. Recall the iris data split by species:</p>
<pre class="r"><code># iris ddf by Species
irisKV <- kvPairs(
kvPair("setosa", subset(iris, Species == "setosa")),
kvPair("versicolor", subset(iris, Species == "versicolor")),
kvPair("virginica", subset(iris, Species == "virginica"))
)
irisDdf <- ddf(irisKV)</code></pre>
<p>Suppose we want to add a simple transformation that computes the mean sepal width for each subset. I can do this with the following:</p>
<pre class="r"><code>irisSL <- addTransform(irisDdf, function(x) mean(x$Sepal.Width))</code></pre>
<p>I simply provide my input ddo/ddf <code>irisDdf</code> and specify the function I want to apply to each subset.</p>
<p>If the function I provide has two arguments, it will pass both the key and value of the current subset as arguments to the function. If it has one argument, it will pass just the value. In this case, it has one argument, so I can expect <code>x</code> inside my function to hold the data frame value for a subset of <code>irisDdf</code>. Note that I can pre-define this function:</p>
<pre class="r"><code>meanSL <- function(x) mean(x$Sepal.Width)
irisSL <- addTransform(irisDdf, meanSL)</code></pre>
<p>The output of a transformation function specified in <code><a target='_blank' href='rd.html#addtransform'>addTransform()</a></code> will always be treated as a value unless the function returns a key-value pair via <code><a target='_blank' href='rd.html#kvpair'>kvPair()</a></code>.</p>
<p>Let’s now look at the result:</p>
<pre class="r"><code>irisSL</code></pre>
<pre><code>
Transformed distributed data object backed by 'kvMemory' connection
attribute | value
----------------+-----------------------------------------------------------
size (stored) | 12.67 KB (before transformation)
size (object) | 12.67 KB (before transformation)
# subsets | 3
* Other attributes: getKeys()</code></pre>
<p>Our input data was a ddf, but the output is a ddo! What is in the output?</p>
<pre class="r"><code>irisSL[[1]]</code></pre>
<pre><code>$key
[1] "setosa"
$value
[1] 3.428</code></pre>
<p>We see that <code>irisSL</code> now holds the data that we would expect – the result of our transformation, the mean sepal width. This value is not a data frame, so <code>irisSL</code> is a ddo.</p>
<p>But notice in the printout of <code>irisSL</code> above that it says that the object size is still the same as our input data, <code>irisDdf</code>. This is because when you add a transformation to a ddo/ddf, the transformation is not applied immediately, but is deferred until a data operation is applied. Data operations include <code><a target='_blank' href='rd.html#divide'>divide()</a></code>, <code><a target='_blank' href='rd.html#recombine'>recombine()</a></code>, <code><a target='_blank' href='rd.html#drjoin'>drJoin()</a></code>, <code><a target='_blank' href='rd.html#drlapply'>drLapply()</a></code>, <code><a target='_blank' href='rd.html#drfilter'>drFilter()</a></code>, <code><a target='_blank' href='rd.html#drsample'>drSample()</a></code>, and <code><a target='_blank' href='rd.html#drsubset'>drSubset()</a></code>. When any of these are invoked on an object with a transformation attached to it, the transformation will be applied prior to any other computation. The transformation will also be applied any time a subset of the data is requested. Thus although the data has not been physically transformed after a call of <code><a target='_blank' href='rd.html#addtransform'>addTransform()</a></code>, we can think of it conceptually as already being transformed.</p>
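<p>The deferred behavior can be mimicked in base R with a small S3 class that stores the function and applies it only when a subset is requested. This is a conceptual sketch, not datadr’s implementation:</p>
<pre class="r"><code># a deferred-transform sketch: fn is stored, and runs only on access
deferredKV <- function(kvList, fn)
  structure(list(data = kvList, fn = fn), class = "deferredKV")
"[[.deferredKV" <- function(x, i) {
  kv <- x$data[[i]]
  list(key = kv$key, value = x$fn(kv$value))
}

kvList <- lapply(split(iris, iris$Species), function(d)
  list(key = as.character(d$Species[1]), value = d))
obj <- deferredKV(kvList, function(v) mean(v$Sepal.Width))
obj[[1]] # transform applied just now: key "setosa", value 3.428</code></pre>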
<p>When <code><a target='_blank' href='rd.html#addtransform'>addTransform()</a></code> is called, it is tested on a subset of the data to make sure all of the global variables and packages necessary to portably perform the transformation are available. If there are any package dependencies, it makes a note and stores this information with the object. Likewise, any global object dependencies are stored with the object. So whatever objects exist at the time the transformation is added, any subsequent changes to those objects or their removal will not affect the transformation.</p>
<p>For example, consider the following:</p>
<pre class="r"><code># set a global variable
globalVar <- 7
# define a function that depends on this global variable
meanSLplus7 <- function(x) mean(x$Sepal.Width) + globalVar
# add this transformation to irisDdf
irisSLplus7 <- addTransform(irisDdf, meanSLplus7)
# look at the first key-value pair (invokes transformation)
irisSLplus7[[1]]</code></pre>
<pre><code>$key
[1] "setosa"
$value
[1] 10.428</code></pre>
<pre class="r"><code># remove globalVar
rm(globalVar)
# look at the first key-value pair (invokes transformation)
irisSLplus7[[1]]</code></pre>
<pre><code>$key
[1] "setosa"
$value
[1] 10.428</code></pre>
<p>We still get a result even though the global dependency of <code>meanSLplus7()</code> has been removed.</p>
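<p>In base R terms, this behavior resembles snapshotting the global into an environment attached to the function, so that removing the original has no effect. This is a sketch of the concept, not datadr’s actual mechanism:</p>
<pre class="r"><code>globalVar <- 7
meanSWplus7 <- function(x) mean(x$Sepal.Width) + globalVar
# snapshot the current value of the global into the function's environment
snap <- new.env(parent = environment(meanSWplus7))
snap$globalVar <- globalVar
environment(meanSWplus7) <- snap
rm(globalVar)
meanSWplus7(subset(iris, Species == "setosa")) # still 10.428</code></pre>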
<p>A final note about <code><a target='_blank' href='rd.html#addtransform'>addTransform()</a></code>: it is possible to add multiple transformations to a distributed data object, in which case they are applied in the order supplied, although typically a single transform should suffice.</p>
<p>For example, suppose we want to further modify <code>irisSL</code> to append some text to the keys:</p>
<pre class="r"><code>irisSL2 <- addTransform(irisSL, function(k, v) kvPair(paste0(k, "_mod"), v))</code></pre>
<pre><code>*** finding global variables used in 'fn'...</code></pre>
<pre><code> [none]</code></pre>
<pre><code> package dependencies: datadr</code></pre>
<pre><code>*** testing 'fn' on a subset...</code></pre>
<pre><code> ok</code></pre>
<pre class="r"><code>irisSL2[[1]]</code></pre>
<pre><code>$key
[1] "setosa_mod"
$value
[1] 3.428</code></pre>
<p>This is also an example of using a transformation function to modify the key.</p>
</div>
<div id="common-data-operations" class="section level2">
<h2>Common Data Operations</h2>
<p>The majority of this documentation will cover division and recombination, but here, we present some methods that are available for common data operations that come in handy for manipulating data in various ways.</p>
<div id="drlapply" class="section level3">
<h3>drLapply</h3>
<p>It is convenient to be able to use the familiar <code>lapply()</code> approach to apply a function to each key-value pair. An <code>lapply()</code> method, called <code><a target='_blank' href='rd.html#drlapply'>drLapply()</a></code>, is available for ddo/ddf objects. The function you specify follows the same convention as described earlier: if it has one argument, it is applied to the value only; if it has two arguments, it is applied to the key and value. A ddo is returned.</p>
<p>Here is an example of using <code><a target='_blank' href='rd.html#drlapply'>drLapply()</a></code> to the <code>irisDdf</code> data:</p>
<pre class="r"><code># get the mean Sepal.Width for each key-value pair in irisDdf
means <- drLapply(irisDdf, function(x) mean(x$Sepal.Width))
# turn the resulting ddo into a list
as.list(means)</code></pre>
<pre><code>[[1]]
[[1]][[1]]
[1] "setosa"
[[1]][[2]]
[1] 3.428
[[2]]
[[2]][[1]]
[1] "versicolor"
[[2]][[2]]
[1] 2.77
[[3]]
[[3]][[1]]
[1] "virginica"
[[3]][[2]]
[1] 2.974</code></pre>
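<p>To illustrate the two-argument convention mentioned above, the function passed to <code><a target='_blank' href='rd.html#drlapply'>drLapply()</a></code> below receives both the key and the value, so the key can be used in the computation. This is a sketch; the function and variable names here are illustrative, not part of the original example:</p>
<pre class="r"><code># two-argument function: k is the key, x is the value
keyedMeans <- drLapply(irisDdf, function(k, x)
   paste0(k, " mean sepal width: ", mean(x$Sepal.Width)))
# turn the resulting ddo into a list
as.list(keyedMeans)</code></pre>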
</div>
<div id="drfilter" class="section level3">
<h3>drFilter</h3>
<p>A <code><a target='_blank' href='rd.html#drfilter'>drFilter()</a></code> function is available which takes a function that is applied to each key-value pair. If the function returns <code>TRUE</code>, that key-value pair will be included in the resulting ddo/ddf; if <code>FALSE</code>, it will not.</p>
<p>Here is an example that keeps all subsets with mean sepal width less than 3:</p>
<pre class="r"><code># keep subsets with mean sepal width less than 3
drFilter(irisDdf, function(v) mean(v$Sepal.Width) < 3)</code></pre>
<pre><code>
Distributed data frame backed by 'kvMemory' connection
attribute | value
----------------+-----------------------------------------------------------
names | Sepal.Length(num), Sepal.Width(num), and 3 more
nrow | 100
size (stored) | 7.55 KB
size (object) | 7.55 KB
# subsets | 2
* Other attributes: getKeys()
* Missing attributes: splitSizeDistn, splitRowDistn, summary</code></pre>
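<p>The summary above reports 2 subsets, but does not say which keys survived the filter. A quick way to check is to store the result and inspect its keys with <code>getKeys()</code> (listed among the object's attributes above). Based on the per-species means computed earlier (setosa 3.428, versicolor 2.77, virginica 2.974), the remaining keys should be "versicolor" and "virginica":</p>
<pre class="r"><code># store the filtered result and check which keys remain
irisFiltered <- drFilter(irisDdf, function(v) mean(v$Sepal.Width) < 3)
getKeys(irisFiltered)</code></pre>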