Commit 3d33e97

Initial commit.
0 parents  commit 3d33e97

9 files changed (+5977 -0 lines changed)

MapReduce-book-final.pdf

1.71 MB
Binary file not shown.

assets/css/bootstrap-responsive.css

+686
Large diffs are not rendered by default.

assets/css/bootstrap.css

+3,990
Large diffs are not rendered by default.

assets/css/docs.css

+843
Large diffs are not rendered by default.

assets/img/grid-18px-masked.png

405 Bytes
@@ -0,0 +1,30 @@
+.com { color: #93a1a1; }
+.lit { color: #195f91; }
+.pun, .opn, .clo { color: #93a1a1; }
+.fun { color: #dc322f; }
+.str, .atv { color: #D14; }
+.kwd, .linenums .tag { color: #1e347b; }
+.typ, .atn, .dec, .var { color: teal; }
+.pln { color: #48484c; }
+
+.prettyprint {
+  padding: 8px;
+  background-color: #f7f7f9;
+  border: 1px solid #e1e1e8;
+}
+.prettyprint.linenums {
+  -webkit-box-shadow: inset 40px 0 0 #fbfbfc, inset 41px 0 0 #ececf0;
+  -moz-box-shadow: inset 40px 0 0 #fbfbfc, inset 41px 0 0 #ececf0;
+  box-shadow: inset 40px 0 0 #fbfbfc, inset 41px 0 0 #ececf0;
+}
+
+/* Specify class=linenums on a pre to get line numbering */
+ol.linenums {
+  margin: 0 0 0 33px; /* IE indents via margin-left */
+}
+ol.linenums li {
+  padding-left: 12px;
+  color: #bebec5;
+  line-height: 18px;
+  text-shadow: 0 1px 0 #fff;
+}
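
The stylesheet comment in this hunk says line numbering is enabled by putting `class=linenums` on a `pre`. The following is a minimal sketch of markup that would pick up these rules, assuming the google-code-prettify script is loaded and its global `prettyPrint()` function is called on page load. The local script path and the sample snippet inside the `pre` are assumptions for illustration only (the pages in this commit load the script from a remote URL).

```html
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Prettify line-number sketch</title>
<!-- Stylesheet path as linked from ed1.html and ed2.html in this commit -->
<link href="assets/js/google-code-prettify/prettify.css" rel="stylesheet">
</head>
<!-- prettyPrint() is the global entry point defined by prettify.js -->
<body onload="prettyPrint()">
<!-- "prettyprint" marks the block for highlighting; adding "linenums"
     asks prettify to emit the lines as an ordered list with class
     "linenums", which the ol.linenums rules then style -->
<pre class="prettyprint linenums">
function map(docId, text) {
  text.split(/\s+/).forEach(function (word) { emit(word, 1); });
}
</pre>
<!-- Assumed local path; hypothetical, not taken from this commit -->
<script src="assets/js/google-code-prettify/prettify.js"></script>
</body>
</html>
```

Without the `linenums` class the block is still highlighted, but the inset box-shadow gutter and the `ol.linenums` spacing do not apply.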

ed1.html

+231
@@ -0,0 +1,231 @@
+<!DOCTYPE html>
+<html lang="en">
+<head>
+  <meta charset="utf-8">
+  <title>Data-Intensive Text Processing with MapReduce</title>
+  <meta name="viewport" content="width=device-width, initial-scale=1.0">
+  <meta name="description" content="">
+  <meta name="author" content="">
+
+  <!-- Le styles -->
+  <link href="assets/css/bootstrap.css" rel="stylesheet">
+  <link href="assets/css/bootstrap-responsive.css" rel="stylesheet">
+  <link href="assets/css/docs.css" rel="stylesheet">
+  <link href="assets/js/google-code-prettify/prettify.css" rel="stylesheet">
+
+  <!-- Le HTML5 shim, for IE6-8 support of HTML5 elements -->
+  <!--[if lt IE 9]>
+    <script src="http://html5shim.googlecode.com/svn/trunk/html5.js"></script>
+  <![endif]-->
+
+</head>
+
+<body>
+
+<div class="navbar navbar-fixed-top">
+  <div class="navbar-inner">
+    <div class="container">
+      <a class="btn btn-navbar" data-toggle="collapse" data-target=".nav-collapse">
+        <span class="icon-bar"></span>
+        <span class="icon-bar"></span>
+        <span class="icon-bar"></span>
+      </a>
+      <div class="nav-collapse">
+        <ul class="nav">
+          <li class="">
+            <a href="index.html">Home</a>
+          </li>
+          <li class="active">
+            <a href="ed1.html">1st Edition</a>
+          </li>
+          <li class="">
+            <a href="ed2.html">2nd Edition</a>
+          </li>
+        </ul>
+      </div>
+    </div>
+  </div>
+</div>
+
+<div class="container">
+
+<div class="page-header">
+  <h1>Data-Intensive Text Processing with MapReduce <small>(First Edition)</small></h1>
+  <p class="lead">Jimmy Lin and Chris Dyer.<br/>
+  Morgan &amp; Claypool Publishers, 2010.</p>
+</div>
+
+<h2>Abstract</h2>
+
+<p>Our world is being revolutionized by data-driven methods: access to
+large amounts of data has generated new insights and opened exciting
+new opportunities in commerce, science, and computing
+applications. Processing the enormous quantities of data necessary for
+these advances requires large clusters, making distributed computing
+paradigms more crucial than ever. MapReduce is a programming model for
+expressing distributed computations on massive datasets and an
+execution framework for large-scale data processing on clusters of
+commodity servers. The programming model provides an
+easy-to-understand abstraction for designing scalable algorithms,
+while the execution framework transparently handles many system-level
+details, ranging from scheduling to synchronization to fault
+tolerance. This book focuses on MapReduce algorithm design, with an
+emphasis on text processing algorithms common in natural language
+processing, information retrieval, and machine learning. We introduce
+the notion of MapReduce design patterns, which represent general
+reusable solutions to commonly occurring problems across a variety of
+problem domains. This book not only intends to help the reader "think
+in MapReduce", but also discusses limitations of the programming
+model.</p>
+
+<p>Quite explicitly, this book focuses on MapReduce algorithm design, not <a href="http://hadoop.apache.org/">Hadoop</a> programming. Tom White's <a href="http://www.amazon.com/gp/product/0596521979?ie=UTF8&amp;tag=dataintetextp-20&amp;linkCode=as2&amp;camp=1789&amp;creative=9325&amp;creativeASIN=0596521979">Hadoop: The Definitive Guide</a><img src="http://www.assoc-amazon.com/e/ir?t=dataintetextp-20&amp;l=as2&amp;o=1&amp;a=0596521979" width="1" height="1" alt="" style="border:none !important; margin:0px !important;" /> is a great resource for learning Hadoop.</p>
+
+<h2 style="padding-top:15px">Getting the Book</h2>
+
+<p>This book is part of the Morgan &amp; Claypool <a
+href="http://www.morganclaypool.com/toc/hlt/1/1">Synthesis Lectures on
+Human Language Technologies</a>. If you're at a university, your
+institution may already subscribe to the series, in which case you can
+access the <a
+href="http://dx.doi.org/10.2200/S00274ED1V01Y201006HLT007">electronic
+version</a> directly without cost (see <a
+href="http://www.morganclaypool.com/page/licensed">this page</a> for a
+list of institutional subscribers). Otherwise, to purchase:</p>
+
+<ul>
+
+<li>Electronic and print copies from <a href="http://dx.doi.org/10.2200/S00274ED1V01Y201006HLT007">Morgan &amp; Claypool</a> (publisher's site)</li>
+
+<li>Print copies from <a href="http://www.amazon.com/gp/product/1608453421?ie=UTF8&amp;tag=dataintetextp-20&amp;linkCode=as2&amp;camp=1789&amp;creative=9325&amp;creativeASIN=1608453421">Amazon.com</a><img src="http://www.assoc-amazon.com/e/ir?t=dataintetextp-20&amp;l=as2&amp;o=1&amp;a=1608453421" width="1" height="1" alt="" style="border:none !important; margin:0px !important;" /></li>
+
+</ul>
+
+<p>We are pleased to provide
+the <a href="MapReduce-book-final.pdf">final pre-production
+manuscript</a> (April 11, 2010) as a preview. If you find this
+resource helpful, please consider purchasing an actual copy to support
+our work!</p>
+
+<h2 style="padding-top:15px">Table of Contents</h2>
+
+<ol>
+
+<li>Introduction</li>
+
+<li>MapReduce Basics</li>
+
+<li>MapReduce Algorithm Design</li>
+
+<li>Inverted Indexing for Text Retrieval</li>
+
+<li>Graph Algorithms</li>
+
+<li>EM Algorithms for Text Processing</li>
+
+<li>Closing Remarks</li>
+
+</ol>
+
+<h2 style="padding-top:15px">Design Patterns &amp; Algorithms</h2>
+
+<p><a href="http://www.umiacs.umd.edu/~jimmylin/Cloud9/docs/index.html">Cloud<sup><small>9</small></sup></a>
+is a MapReduce library for Hadoop designed both to serve as a teaching
+tool and to support research in data-intensive text processing. It
+also serves as a repository of
+<a href="http://www.umiacs.umd.edu/~jimmylin/Cloud9/docs/content/patterns.html">many
+examples</a> discussed in the book. Reference implementations of
+design patterns and other algorithms discussed in the book are being
+added gradually, so please come back periodically. Thus far, the
+repository contains:</p>
+
+<ul>
+
+<li><a href="http://www.umiacs.umd.edu/~jimmylin/Cloud9/docs/content/order-inversion.html">Order inversion</a> from Chapter 3 for computing bigram
+relative frequencies.</li>
+
+<li><a href="http://www.umiacs.umd.edu/~jimmylin/Cloud9/docs/content/pairs-stripes.html">"Pairs"
+and "stripes"</a> from Chapter 3 for computing the word
+co-occurrence matrix of a large text collection.</li>
+
+<li><a href="http://www.umiacs.umd.edu/~jimmylin/Cloud9/docs/content/pagerank.html">PageRank</a> from Chapter 5
+as well as some more design patterns for graph algorithms not discussed in the book.</li>
+
+</ul>
+
+<h2 style="padding-top:15px">What People Are Saying</h2>
+
+<ul>
+
+<li>Book cited in a special report on managing information in <a href="http://www.economist.com/specialreports/displaystory.cfm?story_id=15557413">The Economist</a> (February 25, 2010)</li>
+
+<li>Design patterns mentioned by <a href="http://mir-in-action.blogspot.com/2010/04/mapreduce-algorithm-design.html">Mark Levy</a> at Last.fm (April 6, 2010)</li>
+
+<li>Google Research <a href="http://googleresearch.blogspot.com/2010/05/recent-accomplishments-by-research.html">plugs the book</a> (May 19, 2010)</li>
+
+<li>Mentioned in a blog post by <a href="http://www.ctctlabs.com/index.php/blog/detail/applying_data_mining_techniques_to_mapreduce/">Constant Contact Labs</a> (May 27, 2010)</li>
+
+<li>Deepak Singh from Amazon <a href="http://mndoci.com/2010/07/02/recommendation-data-intensive-text-processing-with-mapreduce/">recommends the book</a> (July 2, 2010)</li>
+
+<li>Used in <a href="http://www.csee.ogi.edu/~zak/cs506-pslc/">CS 506/606: Special Topics: Problem Solving with Large Clusters</a> by Izhak Shafran and Richard Sproat at Oregon Health &amp; Science University (Spring 2010)</li>
+
+<li>Used in a Google-supported <a href="http://net.pku.edu.cn/~course/cs402/2010/index.html">Peking University course</a> on cloud computing by Hongfei Yan and Bo Peng (Summer, 2010)</li>
+
+<li>Used in <a href="http://www.andyli.ece.ufl.edu/teaching/eel6935/">EEL 6935: Special Topics in Cloud Computing and Storage</a> by Andy Li at the University of Florida (Fall, 2010 and Fall, 2011)</li>
+
+<li>Used in <a href="http://courses.cs.tamu.edu/caverlee/csce689/">CSCE 689: Internet-Scale Data Management</a> by James Caverlee at Texas A&amp;M (Fall, 2010)</li>
+
+<li>Used in <a href="http://courses.cse.tamu.edu/caverlee/csce670/">CSCE 670: Information Storage and Retrieval</a> by James Caverlee at Texas A&amp;M (Spring, 2011 and Spring, 2012)</li>
+
+<li>Used in <a href="http://cs.ua.edu/691Vrbsky/">CS 691-001: Cloud Computing</a> by Susan Vrbsky at University of Alabama (Spring, 2011)</li>
+
+<li>Used in <a href="http://courses.washington.edu/css534/syllabi/s11.html">CSS 534: Parallel Programming in Grid and Cloud</a> by Munehiro Fukuda at University of Washington (Spring, 2011)</li>
+
+<li>Used in <a href="http://snap.stanford.edu/class/cs341-2011/">CS341: Advanced Topics in Data Mining</a> by Jure Leskovec, Anand Rajaraman, and Jeff Ullman at Stanford (Spring, 2011)</li>
+
+<li>Used in <a href="http://www.eurecom.fr/~michiard/CCSS.html">Summer School on Cloud Computing: Challenges and opportunities</a> by Pietro Michiardi (Summer, 2011)</li>
+
+<li>Used in a <a href="http://net.pku.edu.cn/~course/cs402/2011/index.html">Peking University course</a> on mass data processing/cloud computing by Hongfei Yan and Bo Peng (Summer, 2011)</li>
+
+<li>Used in <a href="http://dicta-f11.utcompling.com/">CS395T / INF385T / LIN386M: Data-Intensive Computing for Text Analysis</a> by Jason Baldridge and Matt Lease at the University of Texas, Austin (Fall, 2011)</li>
+
+<li>Used in <a href="http://www.ccs.neu.edu/home/mirek/classes/2011-F-CS6240/index.htm">CS 6240: Parallel Data Processing in MapReduce</a> by Mirek Riedewald at Northeastern University (Fall 2011)</li>
+
+<li>Used in <a href="http://www.cs.gmu.edu/syllabus/syllabi-fall11/CS795BarbaraD.html">CS 795 Mining Massive Datasets</a> by Daniel Barbara at George Mason University (Fall, 2011)</li>
+
+<li>Used in <a href="http://www.cs.kent.edu/~jin/Cloud12Spring/Cloud.html">CS 4/5/6/79995: Advanced Computing Platforms for Data Processing</a> by Ruoming Jin at Kent State University (Spring 2012)</li>
+
+<li>Used in <a href="http://beowulf.lcs.mit.edu/18.337/">18.337/6.338: Parallel Computing</a> by Alan Edelman at MIT (Fall, 2011)</li>
+
+<li>Used in <a href="http://www.cse.buffalo.edu/~bina/cse487/fall2011/">CSE487/587 Data-Intensive Computing</a> by Bina Ramamurthy at SUNY Buffalo (Fall, 2011)</li>
+
+<li>Used in <a href="http://www.csc.lsu.edu/~wuyj/Teaching/7481/fa11/">CSC7481/LIS 7610 - Information Retrieval Systems</a> by Yejun Wu at LSU (Fall, 2011)</li>
+
+<li>Used in <a href="http://www.cs.brown.edu/courses/csci2950-u/f11/index.html">CSCI-2950u: Data-Intensive Scalable Computing</a> by Rodrigo Fonseca at Brown (Fall, 2011)</li>
+
+<li>Used in <a href="http://www.cs.sunysb.edu/~rezaul/CSE590-S12.html">CSE 590 (#50569): Topics in Computer Science (Supercomputing)</a> by Rezaul A. Chowdhury at Stony Brook University (Spring 2012)</li>
+
+<li>Used in <a href="http://www-scf.usc.edu/~csci572/">Course 572: Information Retrieval and Web Search Engines</a> by Ellis Horowitz at USC (Spring 2012)</li>
+
+</ul>
+
+</div> <!-- /container -->
+
+<script type="text/javascript" src="http://platform.twitter.com/widgets.js"></script>
+<script src="http://twitter.github.com/bootstrap/assets/js/jquery.js"></script>
+<script src="http://twitter.github.com/bootstrap/assets/js/google-code-prettify/prettify.js"></script>
+<script src="http://twitter.github.com/bootstrap/assets/js/bootstrap-transition.js"></script>
+<script src="http://twitter.github.com/bootstrap/assets/js/bootstrap-alert.js"></script>
+<script src="http://twitter.github.com/bootstrap/assets/js/bootstrap-modal.js"></script>
+<script src="http://twitter.github.com/bootstrap/assets/js/bootstrap-dropdown.js"></script>
+<script src="http://twitter.github.com/bootstrap/assets/js/bootstrap-scrollspy.js"></script>
+<script src="http://twitter.github.com/bootstrap/assets/js/bootstrap-tab.js"></script>
+<script src="http://twitter.github.com/bootstrap/assets/js/bootstrap-tooltip.js"></script>
+<script src="http://twitter.github.com/bootstrap/assets/js/bootstrap-popover.js"></script>
+<script src="http://twitter.github.com/bootstrap/assets/js/bootstrap-button.js"></script>
+<script src="http://twitter.github.com/bootstrap/assets/js/bootstrap-collapse.js"></script>
+<script src="http://twitter.github.com/bootstrap/assets/js/bootstrap-carousel.js"></script>
+<script src="http://twitter.github.com/bootstrap/assets/js/bootstrap-typeahead.js"></script>
+<script src="http://twitter.github.com/bootstrap/assets/js/application.js"></script>
+
+</body>
+</html>

ed2.html

+82
@@ -0,0 +1,82 @@
+<!DOCTYPE html>
+<html lang="en">
+<head>
+  <meta charset="utf-8">
+  <title>MapReduce Algorithm Design</title>
+  <meta name="viewport" content="width=device-width, initial-scale=1.0">
+  <meta name="description" content="">
+  <meta name="author" content="">
+
+  <!-- Le styles -->
+  <link href="assets/css/bootstrap.css" rel="stylesheet">
+  <link href="assets/css/bootstrap-responsive.css" rel="stylesheet">
+  <link href="assets/css/docs.css" rel="stylesheet">
+  <link href="assets/js/google-code-prettify/prettify.css" rel="stylesheet">
+
+  <!-- Le HTML5 shim, for IE6-8 support of HTML5 elements -->
+  <!--[if lt IE 9]>
+    <script src="http://html5shim.googlecode.com/svn/trunk/html5.js"></script>
+  <![endif]-->
+
+</head>
+
+<body>
+
+<div class="navbar navbar-fixed-top">
+  <div class="navbar-inner">
+    <div class="container">
+      <a class="btn btn-navbar" data-toggle="collapse" data-target=".nav-collapse">
+        <span class="icon-bar"></span>
+        <span class="icon-bar"></span>
+        <span class="icon-bar"></span>
+      </a>
+      <div class="nav-collapse">
+        <ul class="nav">
+          <li class="">
+            <a href="index.html">Home</a>
+          </li>
+          <li class="">
+            <a href="ed1.html">1st Edition</a>
+          </li>
+          <li class="active">
+            <a href="ed2.html">2nd Edition</a>
+          </li>
+        </ul>
+      </div>
+    </div>
+  </div>
+</div>
+
+<div class="container">
+
+<div class="page-header">
+  <h1>MapReduce Algorithm Design</h1>
+  <p class="lead">Jimmy Lin</p>
+</div>
+
+<p>I'm currently planning a 2nd edition of "Data-Intensive Text
+Processing with MapReduce" with an expanded scope. The working title
+is simply "MapReduce Algorithm Design". Stay tuned
+for details!</p>
+
+</div> <!-- /container -->
+
+<script type="text/javascript" src="http://platform.twitter.com/widgets.js"></script>
+<script src="http://twitter.github.com/bootstrap/assets/js/jquery.js"></script>
+<script src="http://twitter.github.com/bootstrap/assets/js/google-code-prettify/prettify.js"></script>
+<script src="http://twitter.github.com/bootstrap/assets/js/bootstrap-transition.js"></script>
+<script src="http://twitter.github.com/bootstrap/assets/js/bootstrap-alert.js"></script>
+<script src="http://twitter.github.com/bootstrap/assets/js/bootstrap-modal.js"></script>
+<script src="http://twitter.github.com/bootstrap/assets/js/bootstrap-dropdown.js"></script>
+<script src="http://twitter.github.com/bootstrap/assets/js/bootstrap-scrollspy.js"></script>
+<script src="http://twitter.github.com/bootstrap/assets/js/bootstrap-tab.js"></script>
+<script src="http://twitter.github.com/bootstrap/assets/js/bootstrap-tooltip.js"></script>
+<script src="http://twitter.github.com/bootstrap/assets/js/bootstrap-popover.js"></script>
+<script src="http://twitter.github.com/bootstrap/assets/js/bootstrap-button.js"></script>
+<script src="http://twitter.github.com/bootstrap/assets/js/bootstrap-collapse.js"></script>
+<script src="http://twitter.github.com/bootstrap/assets/js/bootstrap-carousel.js"></script>
+<script src="http://twitter.github.com/bootstrap/assets/js/bootstrap-typeahead.js"></script>
+<script src="http://twitter.github.com/bootstrap/assets/js/application.js"></script>
+
+</body>
+</html>
