<html lang="en">
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<meta charset="utf-8">
<meta name="generator" content="pandoc" />
<meta name="author" content="Ryan Hafen" />
<title>datadr</title>
<script src="assets/jquery-1.11.3/jquery.min.js"></script>
<link href="assets/bootstrap-3.3.2/css/bootstrap.min.css" rel="stylesheet" />
<script src="assets/bootstrap-3.3.2/js/bootstrap.min.js"></script>
<script src="assets/bootstrap-3.3.2/shim/html5shiv.min.js"></script>
<script src="assets/bootstrap-3.3.2/shim/respond.min.js"></script>
<link href="assets/highlight-8.4/tomorrow.css" rel="stylesheet" />
<script src="assets/highlight-8.4/highlight.pack.js"></script>
<link href="assets/fontawesome-4.3.0/css/font-awesome.min.css" rel="stylesheet" />
<script src="assets/stickykit-1.1.1/sticky-kit.min.js"></script>
<script src="assets/jqueryeasing-1.3/jquery.easing.min.js"></script>
<link href="assets/packagedocs-0.0.1/pd.css" rel="stylesheet" />
<script src="assets/packagedocs-0.0.1/pd.js"></script>
<script src="assets/packagedocs-0.0.1/pd-collapse-toc.js"></script>
<meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1">
</head>
<body>
<header class="navbar navbar-white navbar-fixed-top" role="banner" id="header">
<div class="container">
<div class="navbar-header">
<button class="navbar-toggle" type="button" data-toggle="collapse" data-target=".navbar-collapse">
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<span class="navbar-brand">
<a href="http://deltarho.org"> <img src='figures/icon.png' alt='deltarho icon' width='30px' height='30px' style='margin-top: -3px;'> </a>
</span>
<a href="index.html" class="navbar-brand page-scroll">
datadr - Divide and Recombine in R
</a>
</div>
<nav class="collapse navbar-collapse" role="navigation">
<ul class="nav nav-pills pull-right">
<li class="active">
<a href='index.html'>Docs</a>
</li>
<li>
<a href='rd.html'>Package Ref</a>
</li>
<li>
<a href='https://github.com/delta-rho/datadr'>Github <i class='fa fa-github'></i></a>
</li>
</ul>
</nav>
</div>
</header>
<!-- Begin Body -->
<div class="container">
<div class="row">
<div class="col-md-3" id="sidebar-col">
<div id="toc">
<ul>
<li><a href="#introduction">Introduction</a><ul>
<li><a href="#background">Background</a></li>
<li><a href="#package-overview">Package Overview</a></li>
<li><a href="#quickstart">Quickstart</a></li>
<li><a href="#for-plyr-dplyr-users">For plyr / dplyr Users</a></li>
<li><a href="#outline">Outline</a></li>
</ul></li>
<li><a href="#dealing-with-data-in-dr">Dealing with Data in D&R</a><ul>
<li><a href="#key-value-pairs">Key-Value Pairs</a></li>
<li><a href="#distributed-data-objects">Distributed Data Objects</a></li>
<li><a href="#distributed-data-frames">Distributed Data Frames</a></li>
<li><a href="#ddoddf-transformations">ddo/ddf Transformations</a></li>
<li><a href="#common-data-operations">Common Data Operations</a></li>
</ul></li>
<li><a href="#division-and-recombination">Division and Recombination</a><ul>
<li><a href="#high-level-interface">High-Level Interface</a></li>
<li><a href="#division">Division</a></li>
<li><a href="#recombination">Recombination</a></li>
<li><a href="#dr-examples">D&R Examples</a></li>
</ul></li>
<li><a href="#mapreduce">MapReduce</a><ul>
<li><a href="#introduction-to-mapreduce">Introduction to MapReduce</a></li>
<li><a href="#mapreduce-with-datadr">MapReduce with datadr</a></li>
<li><a href="#mapreduce-examples">MapReduce Examples</a></li>
<li><a href="#other-options">Other Options</a></li>
</ul></li>
<li><a href="#division-independent-methods">Division-Independent Methods</a><ul>
<li><a href="#all-data-computation">All-Data Computation</a></li>
<li><a href="#quantiles">Quantiles</a></li>
<li><a href="#aggregation">Aggregation</a></li>
<li><a href="#hexagonal-binning">Hexagonal Binning</a></li>
</ul></li>
<li><a href="#storecompute-backends">Store/Compute Backends</a><ul>
<li><a href="#backend-choices">Backend Choices</a></li>
<li><a href="#small-memory-cpu">Small: Memory / CPU</a></li>
<li><a href="#medium-disk-multicore">Medium: Disk / Multicore</a></li>
<li><a href="#large-hdfs-rhipe">Large: HDFS / RHIPE</a></li>
<li><a href="#conversion">Conversion</a></li>
<li><a href="#reading-in-data">Reading in Data</a></li>
</ul></li>
<li><a href="#misc">Misc</a><ul>
<li><a href="#debugging">Debugging</a></li>
<li><a href="#faq">FAQ</a></li>
<li><a href="#r-code">R Code</a></li>
</ul></li>
</ul>
</div>
</div>
<div class="col-md-9" id="content-col">
<div id="content-top"></div>
<div id="introduction" class="section level1">
<h1>Introduction</h1>
<div id="background" class="section level2">
<h2>Background</h2>
<p>This tutorial covers an implementation of Divide and Recombine (D&R) in the R statistical programming environment, via an R package called <code>datadr</code>. This is one component of the <a href="http://deltarho.org">DeltaRho</a> environment for the analysis of large complex data.</p>
<p>The goal of D&R is to provide an environment for data analysts to carry out deep statistical analysis of large, complex data with as much ease and flexibility as is possible with small datasets.</p>
<p>D&R is accomplished by dividing data into meaningful subsets, applying analytical methods to those subsets, and recombining the results. Recombinations can be numerical or visual. For visualization in the D&R framework, see <a href="http://github.com/delta-rho/trelliscope">Trelliscope</a>.</p>
<p>The diagram below is a visual representation of the D&R process.</p>
<p><img src="image/drdiagram.svg" width="650px" alt="drdiagram" style="display:block; margin:auto"/> <!--  --></p>
<p>For a given data set, which may be a collection of large csv files, an R data frame, etc., we apply a division method that partitions the data in some way that is meaningful for the analysis we plan to perform. Often the partitioning is a logical choice based on the subject matter. After dividing the data, we attack the resulting partitioning with several visual and numerical methods, applying each method independently to each subset and combining the results. There are many forms of divisions and recombinations, many of which will be covered in this tutorial.</p>
<div id="reference" class="section level3">
<h3>Reference</h3>
<p>References:</p>
<ul>
<li><a href="http://deltarho.org">deltarho.org</a></li>
<li><a href="http://onlinelibrary.wiley.com/doi/10.1002/sta4.7/full">Large complex data: divide and recombine (D&R) with RHIPE. <em>Stat</em>, 1(1), 53-67</a></li>
</ul>
<p>Related projects:</p>
<ul>
<li><a href="http://github.com/delta-rho/RHIPE">RHIPE</a>: the engine that makes D&R work for large datasets</li>
<li><a href="http://github.com/delta-rho/trelliscope">Trelliscope</a>: the visualization companion to datadr</li>
</ul>
</div>
</div>
<div id="package-overview" class="section level2">
<h2>Package Overview</h2>
<p>We’ll first lay out some of the major data types and functions in <code>datadr</code> to provide a feel for what is available in the package.</p>
<div id="data-types" class="section level3">
<h3>Data types</h3>
<p>The two major data types in <code>datadr</code> are distributed data frames and distributed data objects. A <em>distributed data frame (ddf)</em> can be thought of as a data frame that is split into chunks – each chunk is a subset of rows of the data frame – which may reside across nodes of a cluster (hence “distributed”). A <em>distributed data object (ddo)</em> is a similar notion except that each subset can be an object with arbitrary structure. Every distributed data frame is also a distributed data object.</p>
<p>The data structure we use to store ddo/ddf objects is the <em>key-value pair</em>. For our purposes, the key is typically a label that uniquely identifies a subset, and the value is the subset of the data corresponding to the key. Thus, a ddo/ddf is essentially a list in which each element contains a key-value pair.</p>
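<p>As a small concrete sketch of this structure (assuming <code>datadr</code> is installed and loaded), we can wrap R’s built-in <code>iris</code> data frame as an in-memory ddf and look at its first key-value pair:</p>
<pre class="r"><code>library(datadr)
# wrap a plain data frame as an in-memory distributed data frame
irisDdf <- ddf(iris)
# a ddf is essentially a list of key-value pairs; the first pair
# has a key (a label) and a value (a subset of rows of iris)
irisDdf[[1]]$key
head(irisDdf[[1]]$value)</code></pre>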
</div>
<div id="functions" class="section level3">
<h3>Functions</h3>
<p>Functions in <code>datadr</code> can be categorized according to the mechanisms they provide for working with data: distributed data types and backend connections, data operations, division-independent operations, and data ingest operations.</p>
<div id="distributed-data-types-backend-connections" class="section level4">
<h4>Distributed data types / backend connections</h4>
<p>Currently, there are three ways to store data using <code>datadr</code>: in memory, on a standard file system (e.g. a hard drive), and on the Hadoop Distributed File System (HDFS). Distributed data objects stored in memory do not require a connection to a backend. However, datasets that exceed available memory must be stored on disk or on HDFS via a backend connection:</p>
<ul>
<li><code><a target='_blank' href='rd.html#localdiskconn'>localDiskConn()</a></code>: instantiate backend connections to ddo / ddf objects that are persisted (i.e. ‘permanently’ stored) to local disk</li>
<li><code><a target='_blank' href='rd.html#hdfsconn'>hdfsConn()</a></code>: instantiate backend connections to ddo / ddf objects that are persisted to HDFS</li>
<li><code><a target='_blank' href='rd.html#ddf'>ddf()</a></code>: instantiate a ddf from a backend connection</li>
<li><code><a target='_blank' href='rd.html#ddo'>ddo()</a></code>: instantiate a ddo from a backend connection</li>
</ul>
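<p>To sketch how these fit together: a division can be written to local disk instead of memory by supplying a connection as the output of a data operation. The path below is an arbitrary example (this assumes the <code>housing</code> data from the <code>housingData</code> package and write access to the directory, which <code>autoYes = TRUE</code> lets <code>datadr</code> create):</p>
<pre class="r"><code># connect to (and, with autoYes = TRUE, create) a local disk location
diskConn <- localDiskConn("housing_divided", autoYes = TRUE)
# direct the result of a division to that connection instead of memory
byCountyDisk <- divide(housing, by = c("county", "state"),
  output = diskConn, update = TRUE)</code></pre>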
</div>
<div id="data-operations" class="section level4">
<h4>Data operations</h4>
<ul>
<li><code><a target='_blank' href='rd.html#divide'>divide()</a></code>: divide a ddf, either by conditioning variables or by randomly chosen subsets</li>
<li><code><a target='_blank' href='rd.html#recombine'>recombine()</a></code>: take the results of a computation applied to a ddo/ddf and combine them in a number of ways</li>
<li><code><a target='_blank' href='rd.html#drlapply'>drLapply()</a></code>: apply a function to each subset of a ddo/ddf and obtain a new ddo/ddf</li>
<li><code><a target='_blank' href='rd.html#drjoin'>drJoin()</a></code>: join multiple ddo/ddf objects by key</li>
<li><code><a target='_blank' href='rd.html#drsample'>drSample()</a></code>: take a random sample of subsets of a ddo/ddf</li>
<li><code><a target='_blank' href='rd.html#drfilter'>drFilter()</a></code>: filter out subsets of a ddo/ddf that do not meet a specified criterion</li>
<li><code><a target='_blank' href='rd.html#drsubset'>drSubset()</a></code>: return a subset data frame of a ddf</li>
<li><code><a target='_blank' href='rd.html#mrexec'>mrExec()</a></code>: run a traditional MapReduce job on a ddo/ddf</li>
</ul>
<p>All of these operations kick off MapReduce jobs to perform the desired computation. In <code>datadr</code>, we almost always want a new data set result right away, so there is not a prevailing notion of <em>deferred evaluation</em> as in other distributed computing frameworks. The only exception is a function that can be applied prior to or after any of these data operations that adds a transformation to be applied to each subset at the time of the next data operation. This function is <code><a target='_blank' href='rd.html#addtransform'>addTransform()</a></code> and will be discussed in greater detail later in the tutorial.</p>
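<p>As a self-contained sketch of how these operations compose (using R’s built-in <code>iris</code> data as a stand-in for a large dataset — the filter threshold here is arbitrary):</p>
<pre class="r"><code>library(datadr)
# divide iris into one subset per species
bySpecies <- divide(iris, by = "Species")
# keep only subsets whose mean sepal length exceeds 5
bigSepals <- drFilter(bySpecies,
  function(v) mean(v$Sepal.Length) > 5)
# add a deferred per-subset transformation, then recombine
# the per-subset means into a single data frame
meanBySpecies <- recombine(
  addTransform(bigSepals, function(v) mean(v$Sepal.Length)),
  combRbind)</code></pre>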
</div>
<div id="division-independent-operations" class="section level4">
<h4>Division-independent operations</h4>
<ul>
<li><code><a target='_blank' href='rd.html#drquantile'>drQuantile()</a></code>: estimate all-data quantiles, optionally by a grouping variable</li>
<li><code><a target='_blank' href='rd.html#draggregate'>drAggregate()</a></code>: all-data tabulation, similar to R’s <code><a target='_blank' href='rd.html#aggregate'>aggregate()</a></code> command</li>
<li><code><a target='_blank' href='rd.html#drhexbin'>drHexbin()</a></code>: all-data hexagonal binning aggregation</li>
</ul>
<p>Note that every data operation works in a backend-agnostic manner, meaning that whether you have data in memory, on your hard drive, or HDFS, you can run the same commands virtually unchanged.</p>
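<p>As a brief illustration (a sketch on R’s built-in <code>iris</code> data, divided by species), all-data quantiles can be estimated across every subset regardless of how the data were divided:</p>
<pre class="r"><code>bySpecies <- divide(iris, by = "Species")
# estimate all-data quantiles of sepal length over all subsets
slq <- drQuantile(bySpecies, var = "Sepal.Length")
head(slq)</code></pre>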
</div>
<div id="data-ingest" class="section level4">
<h4>Data ingest</h4>
<p>One of the most difficult aspects of dealing with very large data is getting the data into R. In <code>datadr</code>, we have extended the <code>read.table</code> family of functions. They are available as <code><a target='_blank' href='rd.html#drread_csv'>drRead.csv()</a></code>, <code><a target='_blank' href='rd.html#drread_delim'>drRead.delim()</a></code>, etc. See <code><a target='_blank' href='rd.html#drread_table'>drRead.table</a></code> for additional methods. These are particularly useful for backends like local disk and HDFS. Usage of these methods is discussed in the <a href="#reading-in-data">Reading in Data</a> section.</p>
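<p>A sketch of what ingest might look like in practice (the input file name and output path below are hypothetical; <code>rowsPerBlock</code> controls how many rows of the file end up in each subset):</p>
<pre class="r"><code># read a large csv into a ddf persisted to local disk,
# parsing the file in blocks of 50,000 rows
bigData <- drRead.csv("bigfile.csv",
  output = localDiskConn("bigfile_parsed", autoYes = TRUE),
  rowsPerBlock = 50000)</code></pre>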
</div>
</div>
</div>
<div id="quickstart" class="section level2">
<h2>Quickstart</h2>
<p>Before going into some of the details of <code>datadr</code>, let’s first run through some quick examples to get acquainted with some of the functionality of the package.</p>
<div id="package-installation" class="section level3">
<h3>Package installation</h3>
<p>First, we need to install the necessary components, <code>datadr</code> and <code>trelliscope</code>. These are R packages that we install from CRAN.</p>
<pre class="r"><code>install.packages(c("datadr", "trelliscope"))</code></pre>
<p>The example we go through uses a small dataset that we can handle in a local R session, so we only need these two packages installed. For other installation options when dealing with larger data sets, see the <a href="http://deltarho.org/#quickstart">quickstart</a> on our website.</p>
<p>We will use as an example a data set consisting of the median list and sold price of homes in the United States, aggregated by county and month from 2008 to early 2014. These data are available in a package called <code>housingData</code>. To install this package:</p>
<pre class="r"><code>install.packages("housingData")</code></pre>
</div>
<div id="environment-setup" class="section level3">
<h3>Environment setup</h3>
<p>Now we load the packages and look at the housing data:</p>
<pre class="r"><code>library(housingData)
library(datadr)
library(trelliscope)
head(housing)</code></pre>
<pre><code> fips county state time nSold medListPriceSqft
1 06037 Los Angeles County CA 2008-01-31 505900 NA
2 06037 Los Angeles County CA 2008-02-29 497100 NA
3 06037 Los Angeles County CA 2008-03-31 487300 NA
4 06037 Los Angeles County CA 2008-04-30 476400 NA
5 06037 Los Angeles County CA 2008-05-31 465900 NA
6 06037 Los Angeles County CA 2008-06-30 456000 NA
medSoldPriceSqft
1 360.1645
2 353.9788
3 349.7633
4 348.5246
5 343.8849
6 342.1065</code></pre>
<p>We see that we have a data frame with the information we discussed, in addition to the number of units sold.</p>
</div>
<div id="division-by-county-and-state" class="section level3">
<h3>Division by county and state</h3>
<p>One way we want to divide the data is by county name and state to be able to study how home prices have evolved over time within county. We can do this with a call to <code><a target='_blank' href='rd.html#divide'>divide()</a></code>:</p>
<pre class="r"><code>byCounty <- divide(housing,
by = c("county", "state"), update = TRUE)</code></pre>
<p>Our <code>byCounty</code> object is now a distributed data frame (ddf) that is stored in memory. We can see some of its attributes by printing the object:</p>
<pre class="r"><code>byCounty</code></pre>
<pre><code>
Distributed data frame backed by 'kvMemory' connection
attribute | value
----------------+-----------------------------------------------------------
names | fips(cha), time(Dat), nSold(num), and 2 more
nrow | 224369
size (stored) | 15.73 MB
size (object) | 15.73 MB
# subsets | 2883
* Other attributes: getKeys(), splitSizeDistn(), splitRowDistn(), summary()
* Conditioning variables: county, state</code></pre>
<p>We see there are 2883 counties, and we can access various attributes by calling methods such as <code>summary()</code>. The <code>update = TRUE</code> that we added to <code><a target='_blank' href='rd.html#divide'>divide()</a></code> provided some of these attributes. Let’s look at the summary:</p>
<pre class="r"><code>summary(byCounty)</code></pre>
<pre><code> fips time nSold
------------------- ------------------ -------------------
levels : 2883 missing : 0 missing : 164370
missing : 0 min : 08-10-01 min : 11
> freqTable head < max : 14-03-01 max : 35619
26077 : 140 mean : 274.6582
51069 : 140 std dev : 732.2429
08019 : 139 skewness : 10.338
13311 : 139 kurtosis : 222.8995
------------------- ------------------ -------------------
medListPriceSqft medSoldPriceSqft
-------------------- -------------------
missing : 48399 missing : 162770
min : 0.5482456 min : 17.40891
max : 1544.944 max : 1249.494
mean : 96.72912 mean : 105.5659
std dev : 56.12035 std dev : 69.40658
skewness : 6.816523 skewness : 5.610013
kurtosis : 94.06555 kurtosis : 60.48337
-------------------- ------------------- </code></pre>
<p>Since <code>datadr</code> knows that <code>byCounty</code> is a ddf, and because we set <code>update = TRUE</code>, after the division operation global summary statistics were computed for each of the variables.</p>
<p>Suppose we want a more meaningful global summary, such as computing quantiles. <code>datadr</code> can do this in a division-independent way with <code><a target='_blank' href='rd.html#drquantile'>drQuantile()</a></code>. For example, let’s look at quantiles for the median list price and plot them using <code>xyplot()</code> from the <code>lattice</code> package:</p>
<pre class="r"><code>library(lattice)
priceQ <- drQuantile(byCounty, var = "medListPriceSqft")
xyplot(q ~ fval, data = priceQ, scales = list(y = list(log = 10)))</code></pre>
<p><img src="index_files/figure-html/quickstart_quantile-1.png" title="" alt="" width="624" /></p>
<p>By the way, what does a subset of <code>byCounty</code> look like? <code>byCounty</code> is a list of <em>key-value pairs</em>, which we will learn more about later. Essentially, the collection of subsets can be thought of as a large list, where each list element has a key and a value. To look at the first key-value pair:</p>
<pre class="r"><code>byCounty[[1]]</code></pre>
<pre><code>$key
[1] "county=Abbeville County|state=SC"
$value
fips time nSold medListPriceSqft medSoldPriceSqft
1 45001 2008-10-01 NA 73.06226 NA
2 45001 2008-11-01 NA 70.71429 NA
3 45001 2008-12-01 NA 70.71429 NA
4 45001 2009-01-01 NA 73.43750 NA
5 45001 2009-02-01 NA 78.69565 NA
...</code></pre>
</div>
<div id="applying-an-analytic-method-and-recombination" class="section level3">
<h3>Applying an analytic method and recombination</h3>
<p>Now, suppose we wish to apply an analytic method to each subset of our data and recombine the result. A simple thing we may want to look at is the slope coefficient of a linear model applied to list prices vs. time for each county.</p>
<p>We can create a function that operates on an input data frame <code>x</code> that does this:</p>
<pre class="r"><code>lmCoef <- function(x)
coef(lm(medListPriceSqft ~ time, data = x))[2]</code></pre>
<p>We can apply this transformation to each subset in our data with <code><a target='_blank' href='rd.html#addtransform'>addTransform()</a></code>:</p>
<pre class="r"><code>byCountySlope <- addTransform(byCounty, lmCoef)</code></pre>
<p>This applies <code>lmCoef()</code> to each subset in a deferred fashion, meaning that for all intents and purposes we can think of <code>byCountySlope</code> as a distributed data object that contains the result of <code>lmCoef()</code> being applied to each subset. But computation is deferred until another data operation is applied to <code>byCountySlope</code>, such as a recombination, which we will do next.</p>
<p>When we look at a subset of <code>byCountySlope</code>, we see what the result will look like:</p>
<pre class="r"><code>byCountySlope[[1]]</code></pre>
<pre><code>$key
[1] "county=Abbeville County|state=SC"
$value
time
-0.0002323686 </code></pre>
<p>Now let’s recombine the slopes into a single data frame. This can be done with the <code><a target='_blank' href='rd.html#recombine'>recombine()</a></code> function, using the <code><a target='_blank' href='rd.html#combrbind'>combRbind</a></code> combiner, which is analogous to <code><a target='_blank' href='rd.html#rbind'>rbind()</a></code>:</p>
<pre class="r"><code>countySlopes <- recombine(byCountySlope, combRbind)</code></pre>
<pre class="r"><code>head(countySlopes)</code></pre>
<pre><code> county state val
time Abbeville County SC -0.0002323686
time1 Acadia Parish LA 0.0019518441
time2 Accomack County VA -0.0092717711
time3 Ada County ID -0.0030197554
time4 Adair County IA -0.0308381951
time5 Adair County KY 0.0034399585</code></pre>
</div>
<div id="joining-other-data-sets" class="section level3">
<h3>Joining other data sets</h3>
<p>There are several data operations beyond <code><a target='_blank' href='rd.html#divide'>divide()</a></code> and <code><a target='_blank' href='rd.html#recombine'>recombine()</a></code>. Let’s look at a quick example of one of these, <code><a target='_blank' href='rd.html#drjoin'>drJoin()</a></code>. Suppose we have multiple related data sources. For example, we have geolocation data for the county centroids. <code><a target='_blank' href='rd.html#drjoin'>drJoin()</a></code> will allow us to join multiple data sets by key.</p>
<p>We have a data set, <code>geoCounty</code>, also part of the <code>housingData</code> package, that we want to divide in the same way as we divided the <code>housing</code> data:</p>
<pre class="r"><code>head(geoCounty)</code></pre>
<pre><code> fips county state lon lat rMapState rMapCounty
1 01001 Autauga County AL -86.64565 32.54009 alabama autauga
2 01003 Baldwin County AL -87.72627 30.73831 alabama baldwin
3 01005 Barbour County AL -85.39733 31.87403 alabama barbour
4 01007 Bibb County AL -87.12526 32.99902 alabama bibb
5 01009 Blount County AL -86.56271 33.99044 alabama blount
6 01011 Bullock County AL -85.71680 32.10634 alabama bullock</code></pre>
<pre class="r"><code>geo <- divide(geoCounty, by = c("county", "state"))</code></pre>
<pre class="r"><code>geo[[1]]</code></pre>
<pre><code>$key
[1] "county=Abbeville County|state=SC"
$value
fips lon lat rMapState rMapCounty
1 45001 -82.45851 34.23021 south carolina abbeville</code></pre>
<p>We see that this division gives us a divided data set with the same keys as <code>byCounty</code>. So we can join it with <code>byCounty</code>:</p>
<pre class="r"><code>byCountyGeo <- drJoin(housing = byCounty, geo = geo)</code></pre>
<p>This provides us with a new ddo (no longer a distributed data frame) where, for each key, the value is a list containing a data frame <code>housing</code> holding the time series data and a data frame <code>geo</code> holding the geographic data. We can see the structure of this for a subset with:</p>
<pre class="r"><code>str(byCountyGeo[[1]])</code></pre>
<pre><code>List of 2
$ key : chr "county=Abbeville County|state=SC"
$ value:List of 2
..$ housing:'data.frame': 66 obs. of 5 variables:
.. ..$ fips : chr [1:66] "45001" "45001" "45001" "45001" ...
.. ..$ time : Date[1:66], format: "2008-10-01" ...
.. ..$ nSold : num [1:66] NA NA NA NA NA NA NA NA NA NA ...
.. ..$ medListPriceSqft: num [1:66] 73.1 70.7 70.7 73.4 78.7 ...
.. ..$ medSoldPriceSqft: num [1:66] NA NA NA NA NA NA NA NA NA NA ...
..$ geo :'data.frame': 1 obs. of 5 variables:
.. ..$ fips : chr "45001"
.. ..$ lon : num -82.5
.. ..$ lat : num 34.2
.. ..$ rMapState : chr "south carolina"
.. ..$ rMapCounty: chr "abbeville"
..- attr(*, "split")='data.frame': 1 obs. of 2 variables:
.. ..$ county: chr "Abbeville County"
.. ..$ state : chr "SC"
- attr(*, "class")= chr [1:2] "kvPair" "list"</code></pre>
</div>
<div id="trelliscope-display" class="section level3">
<h3>Trelliscope display</h3>
<p>We have a more comprehensive tutorial for using <a href="http://deltarho.org/docs-trelliscope/">Trelliscope</a>, but for completeness here and for some motivation to get through this tutorial and move on to the Trelliscope tutorial, we provide a simple example of taking a ddf and creating a Trelliscope display from it.</p>
<p>In short, a Trelliscope display is like a Trellis display, a ggplot with faceting, or a small-multiples plot: we break a set of data into pieces, apply a plot to each piece, and arrange the resulting plots in a grid for viewing. With Trelliscope, we can create such displays on data with a very large number of subsets and view them in an interactive and meaningful way.</p>
</div>
<div id="setting-up-a-visualization-database" class="section level3">
<h3>Setting up a visualization database</h3>
<p>For a Trelliscope display, we must connect to a “visualization database” (VDB), which is a directory on our computer where we are going to organize all of the information about our displays (we create many over the course of an analysis). Typically, we will set up a single VDB for each project we are working on. We can do this with the <code>vdbConn()</code> function:</p>
<pre class="r"><code>vdbConn("vdb", name = "deltarhoTutorial")</code></pre>
<p>This connects to a directory called <code>"vdb"</code> relative to our current working directory. R holds this connection in its global options so that subsequent calls will know where to put things without explicitly specifying the connection each time.</p>
</div>
<div id="creating-a-panel-function" class="section level3">
<h3>Creating a panel function</h3>
<p>To create a Trelliscope display, we need to first specify a <em>panel</em> function, which specifies what to plot for each subset. It takes as input either a key-value pair or just a value, depending on whether the function has two arguments or one.</p>
<p>For example, here is a panel function that takes a value and creates a lattice <code>xyplot</code> of list and sold price over time:</p>
<pre class="r"><code>timePanel <- function(x)
xyplot(medListPriceSqft + medSoldPriceSqft ~ time,
data = x, auto.key = TRUE, ylab = "Price / Sq. Ft.")</code></pre>
<p>Let’s test it on a subset:</p>
<pre class="r"><code>timePanel(byCounty[[20]]$value)</code></pre>
<p><img src="index_files/figure-html/quickstart_panel_test-1.png" title="" alt="" width="624" /></p>
<p>Great!</p>
</div>
<div id="creating-a-cognostics-function" class="section level3">
<h3>Creating a cognostics function</h3>
<p>Another optional thing we can do is specify a <em>cognostics</em> function that is applied to each subset. A cognostic is a metric that tells us an interesting attribute about a subset of data, and we can use cognostics to have more worthwhile interactions with all of the panels in the display. A cognostic function needs to return a list of metrics:</p>
<pre class="r"><code>priceCog <- function(x) { list(
slope = cog(lmCoef(x), desc = "list price slope"),
meanList = cogMean(x$medListPriceSqft),
listRange = cogRange(x$medListPriceSqft),
nObs = cog(length(which(!is.na(x$medListPriceSqft))),
desc = "number of non-NA list prices")
)}</code></pre>
<p>We use the <code>cog()</code> function to wrap our metrics so that we can provide a description for the cognostic. We may also employ special cognostics functions like <code>cogMean()</code> and <code>cogRange()</code> to compute mean and range with a default description.</p>
<p>We should test the cognostics function on a subset:</p>
<pre class="r"><code>priceCog(byCounty[[1]]$value)</code></pre>
<pre><code>$slope
time
-0.0002323686
$meanList
[1] 72.76927
$listRange
[1] 23.08482
$nObs
[1] 66</code></pre>
</div>
<div id="making-the-display" class="section level3">
<h3>Making the display</h3>
<p>Now we can create a Trelliscope display by sending our data, our panel function, and our cognostics function to <code>makeDisplay()</code>:</p>
<pre class="r"><code>makeDisplay(byCounty,
name = "list_sold_vs_time_datadr_tut",
desc = "List and sold price over time",
panelFn = timePanel,
cogFn = priceCog,
width = 400, height = 400,
lims = list(x = "same"))</code></pre>
<p>If you have been dutifully following along with this example in your own R console, you can now view the display with the following:</p>
<pre class="r"><code>view()</code></pre>
<p>If you have not been following along but are wondering what that <code>view()</code> command did, you can visit <a href="http://hafen.shinyapps.io/deltarhoTutorial/" target="_blank">here</a> for an online version. You will find a list of displays to choose from, of which the one with the name <code>list_sold_vs_time_datadr_tut</code> is the one we just created. This brings up the point that you can share your Trelliscope displays online – more about that as well as how to use the viewer will be covered in the Trelliscope tutorial – but feel free to play around with the viewer.</p>
<p>This covers the basics of <code>datadr</code> and a bit of <code>trelliscope</code>. Hopefully you now feel comfortable enough to dive in and try some things out. The remainder of this tutorial and the <a href="http://deltarho.org/docs-trelliscope/">Trelliscope</a> tutorial will provide greater detail.</p>
</div>
</div>
<div id="for-plyr-dplyr-users" class="section level2">
<h2>For plyr / dplyr Users</h2>
<p>Now that we have seen some examples and have a good feel for what <code>datadr</code> can do, if you have used the <code>plyr</code> or <code>dplyr</code> packages, you may be noticing a few similarities.</p>
<p>If you have not used these packages before, you can skip this section, but if you have, we will go over a quick simple example of how to do the same thing in the three packages to help the <code>plyr</code> user have a better understanding of how to map their knowledge of those packages to <code>datadr</code>.</p>
<p>It is also worth discussing some of the similarities and differences to help understand when <code>datadr</code> is useful. We expand on this in the <a href="#faq">FAQ</a>. In a nutshell, <code>datadr</code> and <code>dplyr</code> are very different and are actually complementary. We often use the amazing features of <code>dplyr</code> for within-subset computations, but we need <code>datadr</code> to deal with complex data structures and potentially very large data.</p>
<div id="code-comparison" class="section level3">
<h3>Code Comparison</h3>
<p>For a simple example, we turn to the famous iris data. Suppose we want to compute the mean sepal length by species:</p>
<div id="with-plyr" class="section level4">
<h4>With <code>plyr</code>:</h4>
<pre class="r"><code>library(plyr)
ddply(iris, .(Species), function(x)
data.frame(val = mean(x$Sepal.Length)))</code></pre>
<p>With <code>plyr</code>, we are performing the split, apply, and combine all in the same step.</p>
</div>
<div id="with-dplyr" class="section level4">
<h4>With <code>dplyr</code>:</h4>
<pre class="r"><code>library(dplyr)
iris %>%
group_by(Species) %>%
summarise(val = mean(Sepal.Length))</code></pre>
<p>Here, we call <code>group_by()</code> to create a grouped version of <code>iris</code>, which is the same data but with additional information about the indices of the rows belonging to each species. Then we call <code>summarise()</code>, which computes the mean sepal length for each group and returns the result as a data frame.</p>
</div>
<div id="with-datadr" class="section level4">
<h4>With <code>datadr</code>:</h4>
<pre class="r"><code>library(datadr)
divide(iris, by = "Species") %>%
addTransform(function(x) mean(x$Sepal.Length)) %>%
recombine(combRbind)</code></pre>
<p>Here, we call <code><a target='_blank' href='rd.html#divide'>divide()</a></code> to partition the iris data by species, resulting in a “distributed data frame”. Note that this result is a new data object - an important and deliberate distinction. Then we call <code><a target='_blank' href='rd.html#addtransform'>addTransform()</a></code> to apply a function that computes the mean sepal length to each partition. Finally, we call <code><a target='_blank' href='rd.html#recombine'>recombine()</a></code> to bind all the results into a single data frame.</p>
</div>
</div>
</div>
<div id="outline" class="section level2">
<h2>Outline</h2>
<p>The outline for the remainder of this tutorial is as follows:</p>
<ul>
<li>First, we cover the foundational D&R data structure, key-value pairs, and how they are used to build distributed data objects and distributed data frames.</li>
<li>Next, we provide an introduction to the high-level division and recombination methods in <code>datadr</code>.</li>
<li>Then we discuss MapReduce - the lower-level language for accomplishing D&R tasks - which is the engine for the higher-level D&R methods. It is anticipated that the high-level language will be sufficient for most analysis tasks, but the lower-level approach is also exposed for special cases.</li>
<li>We then cover some division-independent methods that do various computations across the entire data set, regardless of how it is divided, such as all-data quantiles.</li>
<li>For all of these discussions, we use small data sets that fit in memory for illustrative purposes. This way everyone can follow along without having a large-scale backend like Hadoop running and configured. However, the true power of D&R is with large data sets, and after introducing all of this material, we cover different backends for computation and storage that are currently supported for D&R. The interface always remains the same regardless of the backend, but there are various things to discuss for each case. The backends discussed are:
<ul>
<li><strong>in-memory / single core R:</strong> ideal for small data</li>
<li><strong>local disk / multicore R:</strong> ideal for medium-sized data (too big for memory, small enough for local disk)</li>
<li><strong>Hadoop Distributed File System (HDFS) / RHIPE / Hadoop MapReduce:</strong> ideal for very large data sets</li>
</ul></li>
<li>We also provide R source files for all of the examples throughout the documentation.</li>
</ul>
<div class="alert alert-warning">
<strong>Note:</strong> Throughout the tutorial, the examples cover very small, simple datasets. This is by design, as the focus is on getting familiar with the available commands. Keep in mind that the same interface works for very large datasets, and that design choices have been made with scalability in mind.
</div>
</div>
</div>
<div id="dealing-with-data-in-dr" class="section level1">
<h1>Dealing with Data in D&R</h1>
<div id="key-value-pairs" class="section level2">
<h2>Key-Value Pairs</h2>
<p>In D&R, data is partitioned into subsets. Each subset is represented as a <em>key-value pair</em>. Collections of key-value pairs are <em>distributed data objects (ddo)</em>, or in the case of the value being a data frame, <em>distributed data frames (ddf)</em>, and form the basic input and output types for all D&R operations. This section introduces these concepts and illustrates how they are used in datadr.</p>
<div id="key-value-pairs-in-datadr" class="section level3">
<h3>Key-value pairs in datadr</h3>
<p>In datadr, key-value pairs are R lists with two elements, one for the key and one for the value. For example,</p>
<pre class="r"><code># simple key-value pair example
list(1:5, rnorm(10))</code></pre>
<pre><code>[[1]]
[1] 1 2 3 4 5
[[2]]
[1] -1.2070657 0.2774292 1.0844412 -2.3456977 0.4291247 0.5060559
[7] -0.5747400 -0.5466319 -0.5644520 -0.8900378</code></pre>
<p>is a key-value pair with integers 1-5 as the key and 10 random normals as the value. Typically, a key is used as a unique identifier for the value. For datadr it is recommended to make the key a simple string when possible.</p>
<p>There is a convenience function <code><a target='_blank' href='rd.html#kvpair'>kvPair()</a></code> for specifying a key-value pair:</p>
<pre class="r"><code># using kvPair
kvPair(1:5, rnorm(10))</code></pre>
<pre><code>$key
[1] 1 2 3 4 5
$value
[1] -0.47719270 -0.99838644 -0.77625389 0.06445882 0.95949406
[6] -0.11028549 -0.51100951 -0.91119542 -0.83717168 2.41583518</code></pre>
<p>This provides names for the list elements and is a useful function when an operation must explicitly know that it is dealing with a key-value pair and not just a list.</p>
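<p>Conceptually, <code>kvPair()</code> amounts to little more than a named two-element list. Here is a base R sketch of the structure (an illustration, not the actual implementation):</p>
<pre class="r"><code># a plausible base R sketch of the kvPair() structure
kvPairSketch <- function(k, v) list(key = k, value = v)
kvPairSketch("setosa", head(iris, 2))</code></pre>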
</div>
<div id="key-value-pair-collections" class="section level3">
<h3>Key-value pair collections</h3>
<p>D&R data objects are made up of collections of key-value pairs. In datadr, these are represented as lists of key-value pair lists. As an example, consider the iris data set, which consists of measurements of 4 aspects for 50 flowers from each of 3 species of iris. Suppose we would like to split the data into key-value pairs by species. We can do this by passing key-value pairs to a function <code><a target='_blank' href='rd.html#kvpairs'>kvPairs()</a></code>:</p>
<pre class="r"><code># create by-species key-value pairs
irisKV <- kvPairs(
kvPair("setosa", subset(iris, Species == "setosa")),
kvPair("versicolor", subset(iris, Species == "versicolor")),
kvPair("virginica", subset(iris, Species == "virginica"))
)
irisKV</code></pre>
<pre><code>[[1]]
$key
[1] "setosa"
$value
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
...
[[2]]
$key
[1] "versicolor"
$value
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
51 7.0 3.2 4.7 1.4 versicolor
52 6.4 3.2 4.5 1.5 versicolor
53 6.9 3.1 4.9 1.5 versicolor
54 5.5 2.3 4.0 1.3 versicolor
55 6.5 2.8 4.6 1.5 versicolor
...
[[3]]
$key
[1] "virginica"
$value
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
101 6.3 3.3 6.0 2.5 virginica
102 5.8 2.7 5.1 1.9 virginica
103 7.1 3.0 5.9 2.1 virginica
104 6.3 2.9 5.6 1.8 virginica
105 6.5 3.0 5.8 2.2 virginica
...</code></pre>
<p>The result is a list of 3 key-value pairs. We chose the species to be the key and the corresponding data frame to be the value for each pair.</p>
<p><code><a target='_blank' href='rd.html#kvpairs'>kvPairs()</a></code> is basically a wrapper for <code>list()</code>. It checks to make sure key-value pairs are valid and makes sure they are printed nicely. In practice we actually very rarely need to specify key-value pairs like this, but this is useful for illustration.</p>
<p>This example shows how we can partition our data into key-value pairs that have meaning – each subset represents measurements for one species. The ability to divide the data up into pieces allows us to distribute datasets that might be too large for a single disk across multiple machines, and also allows us to distribute computation, because in D&R we apply methods independently to each subset.</p>
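<p>The core idea can be sketched in base R alone: <code>split()</code> builds the per-species subsets, and <code>sapply()</code> applies a method to each subset independently. This is a toy illustration of the D&R pattern, not datadr code:</p>
<pre class="r"><code># base R sketch of the D&R pattern: split into key-value pairs,
# then apply a method independently to each subset
irisKVlist <- lapply(split(iris, iris$Species), function(d)
  list(key = as.character(d$Species[1]), value = d))
sapply(irisKVlist, function(kv) mean(kv$value$Sepal.Length))
#   setosa versicolor  virginica
#    5.006      5.936      6.588</code></pre>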
<p>Here, we manually created the partition by species, but datadr provides simple mechanisms for specifying divisions, which we will cover <a href="#division">later in the tutorial</a>. Prior to doing that, however, we need to discuss how collections of key-value pairs are represented in datadr as distributed data objects.</p>
</div>
</div>
<div id="distributed-data-objects" class="section level2">
<h2>Distributed Data Objects</h2>
<p>In datadr, a collection of key-value pairs along with attributes about the collection constitute a distributed data object (ddo). Most datadr operations require a ddo, and hence it is important to represent key-value pair collections as such.</p>
<p>We will continue to use our collection of key-value pairs we created previously <code>irisKV</code>:</p>
<pre class="r"><code>irisKV <- kvPairs(
kvPair("setosa", subset(iris, Species == "setosa")),
kvPair("versicolor", subset(iris, Species == "versicolor")),
kvPair("virginica", subset(iris, Species == "virginica"))
)</code></pre>
<div id="initializing-a-ddo" class="section level3">
<h3>Initializing a ddo</h3>
<p>To initialize a collection of key-value pairs as a distributed data object, we use the <code><a target='_blank' href='rd.html#ddo'>ddo()</a></code> function:</p>
<pre class="r"><code># create ddo object from irisKV
irisDdo <- ddo(irisKV)</code></pre>
<p><code><a target='_blank' href='rd.html#ddo'>ddo()</a></code> simply takes the collection of key-value pairs and attaches additional attributes to the resulting ddo object. Note that in this example, since the data is in memory, we are supplying the data directly as the argument to <code><a target='_blank' href='rd.html#ddo'>ddo()</a></code>. For larger datasets stored in more scalable backends, instead of passing the data directly, a connection that points to where the key-value pairs are stored is provided. This is discussed in more detail in the <a href="#backend-choices">Store/Compute Backends</a> sections.</p>
<p>Objects of class “ddo” have several methods that can be invoked on them. The most simple of these is a print method:</p>
<pre class="r"><code>irisDdo</code></pre>
<pre><code>
Distributed data object backed by 'kvMemory' connection
attribute | value
----------------+-----------------------------------------------------------
size (stored) | 12.67 KB
size (object) | 12.67 KB
# subsets | 3
* Other attributes: getKeys()
* Missing attributes: splitSizeDistn</code></pre>
<p>The print method shows several attributes that have been computed for the data.</p>
</div>
<div id="ddo-attributes" class="section level3">
<h3>ddo attributes</h3>
<p>From the printout of <code>irisDdo</code>, we see that a ddo has several attributes. The most basic ones:</p>
<ul>
<li><code>size (object)</code>: The total size of all of the data as represented in memory in R is 12.67 KB (that’s some big data!)</li>
<li><code>size (stored)</code>: With backends other than in-memory, the size of data serialized and possibly compressed to disk can be very different from object size, which is useful to know. In this case, it’s the same since the object is in memory.</li>
<li><code># subsets</code>: There are 3 subsets (one for each species)</li>
</ul>
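<p>The distinction between object size and stored size is easy to illustrate in base R: serializing an object (and optionally compressing it) generally yields a different byte count than its in-memory representation. This is just an illustration of the concept, not how datadr stores data:</p>
<pre class="r"><code># in-memory size vs. serialized (and gzip-compressed) size of iris
objBytes <- as.numeric(object.size(iris))
serBytes <- length(serialize(iris, NULL))
gzBytes <- length(memCompress(serialize(iris, NULL), "gzip"))
c(object = objBytes, serialized = serBytes, gzipped = gzBytes)</code></pre>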
<p>We can look at the keys with:</p>
<pre class="r"><code># look at irisDdo keys
getKeys(irisDdo)</code></pre>
<pre><code>[[1]]
[1] "setosa"
[[2]]
[1] "versicolor"
[[3]]
[1] "virginica"</code></pre>
<p>We can also get an example key-value pair:</p>
<pre class="r"><code># look at an example key-value pair of irisDdo
kvExample(irisDdo)</code></pre>
<pre><code>$key
[1] "setosa"
$value
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
...</code></pre>
<p><code>kvExample</code> is useful for obtaining a subset key-value pair against which we can test out different analytical methods before applying them across the entire data set.</p>
<p>Another attribute, <code>splitSizeDistn</code> is empty. This attribute provides information about the quantiles of the distribution of the size of each division. With very large data sets with a large number of subsets, this can be useful for getting a feel for how uniform the subset sizes are.</p>
<p>The <code>splitSizeDistn</code> attribute and more that we will see in the future are not computed by default when <code><a target='_blank' href='rd.html#ddo'>ddo()</a></code> is called. This is because it requires a computation over the data set, which can take some time with very large datasets, and may not always be desired or necessary.</p>
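<p>Conceptually, the split-size distribution is just a set of quantiles of the per-subset object sizes, which we can mimic in base R (an illustration of the idea only):</p>
<pre class="r"><code># quantiles of per-subset sizes, sketched with base R
subsetSizes <- vapply(split(iris, iris$Species),
  function(d) as.numeric(object.size(d)), numeric(1))
quantile(subsetSizes)</code></pre>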
</div>
<div id="updating-attributes" class="section level3">
<h3>Updating attributes</h3>
<p>If you decide at any point that you would like to update the attributes of your ddo, you can call:</p>
<pre class="r"><code># update irisDdo attributes
irisDdo <- updateAttributes(irisDdo)
irisDdo</code></pre>
<pre><code>
Distributed data object backed by 'kvMemory' connection
attribute | value
----------------+-----------------------------------------------------------
size (stored) | 12.67 KB
size (object) | 12.67 KB
# subsets | 3
* Other attributes: getKeys(), splitSizeDistn()</code></pre>
<p>The <code>splitSizeDistn</code> attribute is now available. We can look at it with the accessor <code>splitSizeDistn()</code>:</p>
<pre class="r"><code>par(mar = c(4.1, 4.1, 1, 0.2))
# plot distribution of the size of the key-value pairs</code></pre>
<p><img src="index_files/figure-html/plot_iris_split_size-1.png" title="" alt="" width="624" /></p>
<p>Another way to get updated attributes is at the time the ddo is created, by setting <code>update = TRUE</code>:</p>
<pre class="r"><code># update at the time ddo() is called
irisDdo <- ddo(irisKV, update = TRUE)</code></pre>
</div>
<div id="note-about-storage-and-computation" class="section level3">
<h3>Note about storage and computation</h3>
<p>Notice the first line of output from the <code>irisDdo</code> object printout. It states that the object is backed by a “kvMemory” (key-value pairs in memory) connection.</p>
<p>We will talk about other backends for storing and processing larger data sets that don’t fit in memory or even on your workstation’s disk. The key here is that the interface always stays the same, regardless of whether we are working with terabytes or kilobytes of data.</p>
</div>
<div id="accessing-subsets" class="section level3">
<h3>Accessing subsets</h3>
<p>We can access subsets of the data by key or by index:</p>
<pre class="r"><code>irisDdo[["setosa"]]</code></pre>
<pre><code>$key
[1] "setosa"
$value
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
...</code></pre>
<pre class="r"><code>irisDdo[[1]]</code></pre>
<pre><code>$key
[1] "setosa"
$value
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
...</code></pre>
<pre class="r"><code>irisDdo[c("setosa", "virginica")]</code></pre>
<pre><code>[[1]]
$key
[1] "setosa"
$value
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
...
[[2]]
$key
[1] "virginica"
$value
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
101 6.3 3.3 6.0 2.5 virginica
102 5.8 2.7 5.1 1.9 virginica
103 7.1 3.0 5.9 2.1 virginica
104 6.3 2.9 5.6 1.8 virginica
105 6.5 3.0 5.8 2.2 virginica
...</code></pre>
<pre class="r"><code>irisDdo[1:2]</code></pre>
<pre><code>[[1]]
$key
[1] "setosa"
$value
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
...
[[2]]
$key
[1] "versicolor"
$value
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
51 7.0 3.2 4.7 1.4 versicolor
52 6.4 3.2 4.5 1.5 versicolor
53 6.9 3.1 4.9 1.5 versicolor
54 5.5 2.3 4.0 1.3 versicolor
55 6.5 2.8 4.6 1.5 versicolor
...</code></pre>
<p>Accessing by key is much simpler when the key is a character string, but subsetting works even when passing a list of non-string keys, or even a <code>digest()</code> of the desired key object (if you don’t know what that means, don’t worry!).</p>
</div>
</div>
<div id="distributed-data-frames" class="section level2">
<h2>Distributed Data Frames</h2>
<p>Key-value pairs in distributed data objects can have any structure. If we constrain the values to be data frames or readily transformable into data frames, we can represent the object as a distributed data frame (ddf). A ddf is a ddo with additional attributes. Having a uniform data frame structure for the values provides several benefits and data frames are required for specifying division methods.</p>
<div id="initializing-a-ddf" class="section level3">
<h3>Initializing a ddf</h3>
<p>Our <code>irisKV</code> data we created earlier has values that are data frames, so we can cast it as a distributed data frame like this:</p>
<pre class="r"><code># create ddf object from irisKV
irisDdf <- ddf(irisKV, update = TRUE)
irisDdf</code></pre>
<pre><code>
Distributed data frame backed by 'kvMemory' connection
attribute | value
----------------+-----------------------------------------------------------
names | Sepal.Length(num), Sepal.Width(num), and 3 more
nrow | 150
size (stored) | 12.67 KB
size (object) | 12.67 KB
# subsets | 3
* Other attributes: getKeys(), splitSizeDistn(), splitRowDistn(), summary()</code></pre>
</div>
<div id="ddf-attributes" class="section level3">
<h3>ddf attributes</h3>
<p>The printout of <code>irisDdf</code> above shows the ddo attributes we saw previously (because every ddf is also a ddo), but we also see some new data-frame-related attributes (which were automatically updated because we specified <code>update = TRUE</code>). These include:</p>
<ul>
<li><code>names</code>: a list of the variables</li>
<li><code>nrow</code>: the total number of rows in the data set</li>
</ul>
<p>Also there are additional “other” attributes listed at the bottom. The <code>summary</code> attribute can be useful for getting an initial look at the variables in your ddf, and is sometimes required for later computations, such as quantile estimation with <code><a target='_blank' href='rd.html#drquantile'>drQuantile()</a></code>, where the range of a variable is required to get a good quantile approximation. Summary statistics are all computed simultaneously in one MapReduce job with a call to <code><a target='_blank' href='rd.html#updateattributes'>updateAttributes()</a></code>.</p>
<p>The numerical summary statistics are computed using a <a href="http://janinebennett.org/index_files/ParallelStatisticsAlgorithms.pdf">numerically stable algorithm</a>.</p>
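<p>The idea behind such algorithms is that per-subset moments can be merged without revisiting the raw rows. Here is a hedged base R sketch of the merge step, following the pairwise update of Chan et al.; this is an illustration of the technique, not datadr’s code:</p>
<pre class="r"><code># merge per-subset (n, mean, M2) triples into all-data moments
combineMoments <- function(a, b) {
  n <- a$n + b$n
  delta <- b$mean - a$mean
  list(n = n,
       mean = a$mean + delta * b$n / n,
       M2 = a$M2 + b$M2 + delta^2 * a$n * b$n / n)
}
moments <- function(x)
  list(n = length(x), mean = mean(x), M2 = sum((x - mean(x))^2))

parts <- lapply(split(iris$Sepal.Length, iris$Species), moments)
total <- Reduce(combineMoments, parts)
# variance from the merged M2 matches the all-data variance
all.equal(total$M2 / (total$n - 1), var(iris$Sepal.Length))</code></pre>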
<p>Summary statistics include:</p>
<p>For each numeric variable:</p>
<ul>
<li><code>nna</code>: number of missing values</li>
<li><code>stats</code>: list of mean, variance, skewness, kurtosis</li>
<li><code>range</code>: min, max</li>
</ul>
<p>For each categorical variable:</p>
<ul>
<li><code>nobs</code>: number of observations</li>
<li><code>nna</code>: number of missing values</li>
<li><code>freqTable</code>: a data frame containing a frequency table</li>
</ul>
<p>Summaries can be accessed by:</p>
<pre class="r"><code># look at irisDdf summary stats
summary(irisDdf)</code></pre>
<pre><code> Sepal.Length Sepal.Width Petal.Length
-------------------- -------------------- ---------------------
missing : 0 missing : 0 missing : 0
min : 4.3 min : 2 min : 1
max : 7.9 max : 4.4 max : 6.9
mean : 5.843333 mean : 3.057333 mean : 3.758
std dev : 0.8280661 std dev : 0.4358663 std dev : 1.765298
skewness : 0.3117531 skewness : 0.3157671 skewness : -0.2721277
kurtosis : 2.426432 kurtosis : 3.180976 kurtosis : 1.604464
-------------------- -------------------- ---------------------
Petal.Width Species
--------------------- ------------------
missing : 0 levels : 3
min : 0.1 missing : 0
max : 2.5 > freqTable head <
mean : 1.199333 setosa : 50
std dev : 0.7622377 versicolor : 50
skewness : -0.1019342 virginica : 50
kurtosis : 1.663933
--------------------- ------------------ </code></pre>
<p>For categorical variables, the top four values and their frequencies are printed. To access the values themselves, we can do, for example:</p>
<pre class="r"><code>summary(irisDdf)$Sepal.Length$stats</code></pre>
<pre><code>$mean
[1] 5.843333
$var
[1] 0.6856935
$skewness
[1] 0.3117531
$kurtosis
[1] 2.426432</code></pre>
<p>or:</p>
<pre class="r"><code>summary(irisDdf)$Species$freqTable</code></pre>
<pre><code> value Freq
1 setosa 50
2 versicolor 50
3 virginica 50</code></pre>
</div>
<div id="data-frame-like-ddf-methods" class="section level3">
<h3>Data frame-like “ddf” methods</h3>
<p>Note that with an object of class “ddf”, you can use some of the methods that apply to regular data frames:</p>
<pre class="r"><code>nrow(irisDdf)</code></pre>
<pre><code>150</code></pre>
<pre class="r"><code>ncol(irisDdf)</code></pre>
<pre><code>5</code></pre>
<pre class="r"><code>names(irisDdf)</code></pre>
<pre><code>[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
[5] "Species" </code></pre>
<p>However, datadr does not go too far beyond this in terms of making a ddf feel or behave exactly like a regular R data frame.</p>
</div>
<div id="passing-a-data-frame-to-ddo-and-ddf" class="section level3">
<h3>Passing a data frame to <code>ddo()</code> and <code>ddf()</code></h3>
<p>It is worth noting that it is possible to pass a single data frame to <code><a target='_blank' href='rd.html#ddo'>ddo()</a></code> or <code><a target='_blank' href='rd.html#ddf'>ddf()</a></code>. The result is a single key-value pair with the data frame as the value, and <code>""</code> as the key. This is an option strictly for convenience and with the idea that further down the line operations will be applied that split the data up into a more useful set of key-value pairs. Here is an example:</p>
<pre class="r"><code># initialize ddf from a data frame
irisDf <- ddf(iris, update = TRUE)</code></pre>
<p>This of course only makes sense for data small enough to fit in memory in the first place. In the <a href="#small-memory--cpu">backends</a> section, we will discuss other backends for larger data and how data can be added to objects or read in from a raw source in these cases.</p>
</div>
</div>
<div id="ddoddf-transformations" class="section level2">
<h2>ddo/ddf Transformations</h2>
<p>A very common thing to want to do to a ddo or ddf is apply a transformation to each of the subsets. For example, we may want to apply a transformation that:</p>
<ul>
<li>adds a new derived variable to a subset of a ddf</li>
<li>applies a statistical method or summarization to each subset</li>
<li>coerces each subset into a data frame</li>
<li>etc.</li>
</ul>
<p>This will be a routine thing to do when we start talking about D&R operations.</p>
<p>We can add transformations to a ddo/ddf using <code><a target='_blank' href='rd.html#addtransform'>addTransform()</a></code>. Let’s look at an example. Recall the iris data split by species:</p>
<pre class="r"><code># iris ddf by Species
irisKV <- kvPairs(
kvPair("setosa", subset(iris, Species == "setosa")),
kvPair("versicolor", subset(iris, Species == "versicolor")),
kvPair("virginica", subset(iris, Species == "virginica"))
)
irisDdf <- ddf(irisKV)</code></pre>
<p>Suppose we want to add a simple transformation that computes the mean sepal width for each subset. I can do this with the following:</p>
<pre class="r"><code>irisSL <- addTransform(irisDdf, function(x) mean(x$Sepal.Width))</code></pre>
<p>I simply provide my input ddo/ddf <code>irisDdf</code> and specify the function I want to apply to each subset.</p>
<p>If the function I provide has two arguments, it will pass both the key and value of the current subset as arguments to the function. If it has one argument, it will pass just the value. In this case, it has one argument, so I can expect <code>x</code> inside my function to hold the data frame value for a subset of <code>irisDdf</code>. Note that I can pre-define this function:</p>
<pre class="r"><code>meanSL <- function(x) mean(x$Sepal.Width)
irisSL <- addTransform(irisDdf, meanSL)</code></pre>
<p>The output of a transformation function specified in <code><a target='_blank' href='rd.html#addtransform'>addTransform()</a></code> will always be treated as a value unless the function returns a key-value pair via <code><a target='_blank' href='rd.html#kvpair'>kvPair()</a></code>.</p>
<p>Let’s now look at the result:</p>
<pre class="r"><code>irisSL</code></pre>
<pre><code>
Transformed distributed data object backed by 'kvMemory' connection
attribute | value
----------------+-----------------------------------------------------------
size (stored) | 12.67 KB (before transformation)
size (object) | 12.67 KB (before transformation)
# subsets | 3
* Other attributes: getKeys()</code></pre>
<p>Our input data was a ddf, but the output is a ddo! What is in the output?</p>
<pre class="r"><code>irisSL[[1]]</code></pre>
<pre><code>$key
[1] "setosa"
$value
[1] 3.428</code></pre>
<p>We see that <code>irisSL</code> now holds the data that we would expect – the result of our transformation, the mean sepal width. This value is not a data frame, so <code>irisSL</code> is a ddo.</p>
<p>But notice in the printout of <code>irisSL</code> above that it says that the object size is still the same as our input data, <code>irisDdf</code>. This is because when you add a transformation to a ddo/ddf, the transformation is not applied immediately, but is deferred until a data operation is applied. Data operations include <code><a target='_blank' href='rd.html#divide'>divide()</a></code>, <code><a target='_blank' href='rd.html#recombine'>recombine()</a></code>, <code><a target='_blank' href='rd.html#drjoin'>drJoin()</a></code>, <code><a target='_blank' href='rd.html#drlapply'>drLapply()</a></code>, <code><a target='_blank' href='rd.html#drfilter'>drFilter()</a></code>, <code><a target='_blank' href='rd.html#drsample'>drSample()</a></code>, and <code><a target='_blank' href='rd.html#drsubset'>drSubset()</a></code>. When any of these are invoked on an object with a transformation attached to it, the transformation will be applied prior to any other computation. The transformation will also be applied any time a subset of the data is requested. Thus although the data has not been physically transformed after a call of <code><a target='_blank' href='rd.html#addtransform'>addTransform()</a></code>, we can think of it conceptually as already being transformed.</p>
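<p>The deferred behavior can be mimicked in base R with a small S3 class that stores the function and applies it only when a subset is requested. This is a conceptual sketch, not datadr’s implementation:</p>
<pre class="r"><code># a deferred-transform sketch: fn is stored, and runs only on access
deferredKV <- function(kvList, fn)
  structure(list(data = kvList, fn = fn), class = "deferredKV")
"[[.deferredKV" <- function(x, i) {
  kv <- x$data[[i]]
  list(key = kv$key, value = x$fn(kv$value))
}

kvList <- lapply(split(iris, iris$Species), function(d)
  list(key = as.character(d$Species[1]), value = d))
obj <- deferredKV(kvList, function(v) mean(v$Sepal.Width))
obj[[1]] # transform applied just now: key "setosa", value 3.428</code></pre>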
<p>When <code><a target='_blank' href='rd.html#addtransform'>addTransform()</a></code> is called, it is tested on a subset of the data to make sure all of the global variables and packages necessary to portably perform the transformation are available. If there are any package dependencies, it makes a note and stores this information with the object. Likewise, any global object dependencies are stored with the object. So whatever objects exist at the time the transformation is added, any subsequent changes to those objects or their removal will not affect the transformation.</p>
<p>For example, consider the following:</p>
<pre class="r"><code># set a global variable
globalVar <- 7
# define a function that depends on this global variable
meanSLplus7 <- function(x) mean(x$Sepal.Width) + globalVar
# add this transformation to irisDdf
irisSLplus7 <- addTransform(irisDdf, meanSLplus7)
# look at the first key-value pair (invokes transformation)
irisSLplus7[[1]]</code></pre>
<pre><code>$key
[1] "setosa"
$value
[1] 10.428</code></pre>
<pre class="r"><code># remove globalVar
rm(globalVar)
# look at the first key-value pair (invokes transformation)
irisSLplus7[[1]]</code></pre>
<pre><code>$key
[1] "setosa"
$value
[1] 10.428</code></pre>
<p>We still get a result even though the global dependency of <code>meanSLplus7()</code> has been removed.</p>
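<p>In base R terms, this behavior resembles snapshotting the global into an environment attached to the function, so that removing the original has no effect. This is a sketch of the concept, not datadr’s actual mechanism:</p>
<pre class="r"><code>globalVar <- 7
meanSWplus7 <- function(x) mean(x$Sepal.Width) + globalVar
# snapshot the current value of the global into the function's environment
snap <- new.env(parent = environment(meanSWplus7))
snap$globalVar <- globalVar
environment(meanSWplus7) <- snap
rm(globalVar)
meanSWplus7(subset(iris, Species == "setosa")) # still 10.428</code></pre>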
<p>A final note about <code><a target='_blank' href='rd.html#addtransform'>addTransform()</a></code>: it is possible to add multiple transformations to a distributed data object, in which case they are applied in the order supplied, although typically a single transform should suffice.</p>
<p>For example, suppose we want to further modify <code>irisSL</code> to append some text to the keys:</p>
<pre class="r"><code>irisSL2 <- addTransform(irisSL, function(k, v) kvPair(paste0(k, "_mod"), v))</code></pre>
<pre><code>*** finding global variables used in 'fn'...</code></pre>
<pre><code> [none]</code></pre>
<pre><code> package dependencies: datadr</code></pre>
<pre><code>*** testing 'fn' on a subset...</code></pre>
<pre><code> ok</code></pre>
<pre class="r"><code>irisSL2[[1]]</code></pre>
<pre><code>$key
[1] "setosa_mod"
$value
[1] 3.428</code></pre>
<p>This is also an example of using a transformation function to modify the key.</p>
</div>
<div id="common-data-operations" class="section level2">
<h2>Common Data Operations</h2>
<p>The majority of this documentation will cover division and recombination, but here, we present some methods that are available for common data operations that come in handy for manipulating data in various ways.</p>
<div id="drlapply" class="section level3">
<h3>drLapply</h3>
<p>It is convenient to be able to use the familiar <code>lapply()</code> approach to apply a function to each key-value pair. An <code>lapply()</code> method, called <code><a target='_blank' href='rd.html#drlapply'>drLapply()</a></code>, is available for ddo/ddf objects. The function you specify follows the same convention as described earlier: if it has one argument, it is applied to the value only; if it has two arguments, it is applied to the key and value. A ddo is returned.</p>
<p>Here is an example of using <code><a target='_blank' href='rd.html#drlapply'>drLapply()</a></code> to the <code>irisDdf</code> data:</p>
<pre class="r"><code># get the mean Sepal.Width for each key-value pair in irisDdf
means <- drLapply(irisDdf, function(x) mean(x$Sepal.Width))
# turn the resulting ddo into a list
as.list(means)</code></pre>
<pre><code>[[1]]
[[1]][[1]]
[1] "setosa"
[[1]][[2]]
[1] 3.428
[[2]]
[[2]][[1]]
[1] "versicolor"
[[2]][[2]]
[1] 2.77
[[3]]
[[3]][[1]]
[1] "virginica"
[[3]][[2]]
[1] 2.974</code></pre>
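<p>To illustrate the two-argument convention mentioned above, the function passed to <code><a target='_blank' href='rd.html#drlapply'>drLapply()</a></code> below receives both the key and the value, so the key can be used in the computation. This is a sketch; the function and variable names here are illustrative, not part of the original example:</p>
<pre class="r"><code># two-argument function: k is the key, x is the value
keyedMeans <- drLapply(irisDdf, function(k, x)
   paste0(k, " mean sepal width: ", mean(x$Sepal.Width)))
# turn the resulting ddo into a list
as.list(keyedMeans)</code></pre>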
</div>
<div id="drfilter" class="section level3">
<h3>drFilter</h3>
<p>A <code><a target='_blank' href='rd.html#drfilter'>drFilter()</a></code> function is available which takes a function that is applied to each key-value pair. If the function returns <code>TRUE</code>, that key-value pair will be included in the resulting ddo/ddf; if <code>FALSE</code>, it will not.</p>
<p>Here is an example that keeps all subsets with mean sepal width less than 3:</p>
<pre class="r"><code># keep subsets with mean sepal width less than 3
drFilter(irisDdf, function(v) mean(v$Sepal.Width) < 3)</code></pre>
<pre><code>
Distributed data frame backed by 'kvMemory' connection
attribute | value
----------------+-----------------------------------------------------------
names | Sepal.Length(num), Sepal.Width(num), and 3 more
nrow | 100
size (stored) | 7.55 KB
size (object) | 7.55 KB
# subsets | 2
* Other attributes: getKeys()
* Missing attributes: splitSizeDistn, splitRowDistn, summary</code></pre>
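<p>The summary above reports 2 subsets, but does not say which keys survived the filter. A quick way to check is to store the result and inspect its keys with <code>getKeys()</code> (listed among the object's attributes above). Based on the per-species means computed earlier (setosa 3.428, versicolor 2.77, virginica 2.974), the remaining keys should be "versicolor" and "virginica":</p>
<pre class="r"><code># store the filtered result and check which keys remain
irisFiltered <- drFilter(irisDdf, function(v) mean(v$Sepal.Width) < 3)
getKeys(irisFiltered)</code></pre>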