| 1 | +<!DOCTYPE html> |
| 2 | +<html lang="en"> |
| 3 | + <head> |
| 4 | + <meta charset="utf-8"> |
| 5 | + <title>Data-Intensive Text Processing with MapReduce</title> |
| 6 | + <meta name="viewport" content="width=device-width, initial-scale=1.0"> |
| 7 | + <meta name="description" content=""> |
| 8 | + <meta name="author" content=""> |
| 9 | + |
| 10 | + <!-- Le styles --> |
| 11 | + <link href="assets/css/bootstrap.css" rel="stylesheet"> |
| 12 | + <link href="assets/css/bootstrap-responsive.css" rel="stylesheet"> |
| 13 | + <link href="assets/css/docs.css" rel="stylesheet"> |
| 14 | + <link href="assets/js/google-code-prettify/prettify.css" rel="stylesheet"> |
| 15 | + |
| 16 | + <!-- Le HTML5 shim, for IE6-8 support of HTML5 elements --> |
| 17 | + <!--[if lt IE 9]> |
| 18 | + <script src="http://html5shim.googlecode.com/svn/trunk/html5.js"></script> |
| 19 | + <![endif]--> |
| 20 | + |
| 21 | + </head> |
| 22 | + |
| 23 | + <body> |
| 24 | + |
| 25 | + <div class="navbar navbar-fixed-top"> |
| 26 | + <div class="navbar-inner"> |
| 27 | + <div class="container"> |
| 28 | + <a class="btn btn-navbar" data-toggle="collapse" data-target=".nav-collapse"> |
| 29 | + <span class="icon-bar"></span> |
| 30 | + <span class="icon-bar"></span> |
| 31 | + <span class="icon-bar"></span> |
| 32 | + </a> |
| 33 | + <div class="nav-collapse"> |
| 34 | + <ul class="nav"> |
| 35 | + <li class=""> |
| 36 | + <a href="index.html">Home</a> |
| 37 | + </li> |
| 38 | + <li class="active"> |
| 39 | + <a href="ed1.html">1st Edition</a> |
| 40 | + </li> |
| 41 | + <li class=""> |
| 42 | + <a href="ed2.html">2nd Edition</a> |
| 43 | + </li> |
| 44 | + </ul> |
| 45 | + </div> |
| 46 | + </div> |
| 47 | + </div> |
| 48 | + </div> |
| 49 | + |
| 50 | + <div class="container"> |
| 51 | + |
| 52 | + <div class="page-header"> |
| 53 | + <h1>Data-Intensive Text Processing with MapReduce <small>(First Edition)</small></h1> |
| 54 | +<p class="lead">Jimmy Lin and Chris Dyer.<br/> |
| 55 | +Morgan & Claypool Publishers, 2010.</p> |
| 56 | + </div> |
| 57 | + |
| 58 | +<h2>Abstract</h2> |
| 59 | + |
| 60 | +<p>Our world is being revolutionized by data-driven methods: access to |
| 61 | +large amounts of data has generated new insights and opened exciting |
| 62 | +new opportunities in commerce, science, and computing |
| 63 | +applications. Processing the enormous quantities of data necessary for |
| 64 | +these advances requires large clusters, making distributed computing |
| 65 | +paradigms more crucial than ever. MapReduce is a programming model for |
| 66 | +expressing distributed computations on massive datasets and an |
| 67 | +execution framework for large-scale data processing on clusters of |
| 68 | +commodity servers. The programming model provides an |
| 69 | +easy-to-understand abstraction for designing scalable algorithms, |
| 70 | +while the execution framework transparently handles many system-level |
| 71 | +details, ranging from scheduling to synchronization to fault |
| 72 | +tolerance. This book focuses on MapReduce algorithm design, with an |
| 73 | +emphasis on text processing algorithms common in natural language |
| 74 | +processing, information retrieval, and machine learning. We introduce |
| 75 | +the notion of MapReduce design patterns, which represent general |
| 76 | +reusable solutions to commonly occurring problems across a variety of |
| 77 | +problem domains. This book not only intends to help the reader "think |
| 78 | +in MapReduce", but also discusses limitations of the programming |
| 79 | +model.</p> |
| 80 | + |
| 81 | +<p>Quite explicitly, this book focuses on MapReduce algorithm design, not <a href="http://hadoop.apache.org/">Hadoop</a> programming. Tom White's <a href="http://www.amazon.com/gp/product/0596521979?ie=UTF8&tag=dataintetextp-20&linkCode=as2&camp=1789&creative=9325&creativeASIN=0596521979">Hadoop: The Definitive Guide</a><img src="http://www.assoc-amazon.com/e/ir?t=dataintetextp-20&l=as2&o=1&a=0596521979" width="1" height="1" alt="" style="border:none !important; margin:0px !important;" /> is a great resource for learning Hadoop.</p> |
| 82 | + |
| 83 | +<h2 style="padding-top:15px">Getting the Book</h2> |
| 84 | + |
| 85 | +<p>This book is part of the Morgan & Claypool <a |
| 86 | +href="http://www.morganclaypool.com/toc/hlt/1/1">Synthesis Lectures on |
| 87 | +Human Language Technologies</a>. If you're at a university, your |
| 88 | +institution may already subscribe to the series, in which case you can |
| 89 | +access the <a |
| 90 | +href="http://dx.doi.org/10.2200/S00274ED1V01Y201006HLT007">electronic |
| 91 | +version</a> directly without cost (see <a |
| 92 | +href="http://www.morganclaypool.com/page/licensed">this page</a> for a |
| 93 | +list of institutional subscribers). Otherwise, to purchase:</p> |
| 94 | + |
| 95 | +<ul> |
| 96 | + |
| 97 | + <li>Electronic and print copies from <a href="http://dx.doi.org/10.2200/S00274ED1V01Y201006HLT007">Morgan & Claypool</a> (publisher's site)</li> |
| 98 | + |
| 99 | + <li>Print copies from <a href="http://www.amazon.com/gp/product/1608453421?ie=UTF8&tag=dataintetextp-20&linkCode=as2&camp=1789&creative=9325&creativeASIN=1608453421">Amazon.com</a><img src="http://www.assoc-amazon.com/e/ir?t=dataintetextp-20&l=as2&o=1&a=1608453421" width="1" height="1" alt="" style="border:none !important; margin:0px !important;" /></li> |
| 100 | + |
| 101 | +</ul> |
| 102 | + |
| 103 | +<p>We are pleased to provide |
| 104 | +the <a href="MapReduce-book-final.pdf">final pre-production |
| 105 | +manuscript</a> (April 11, 2010) as a preview. If you find this |
| 106 | +resource helpful, please consider purchasing an actual copy to support |
| 107 | +our work!</p> |
| 108 | + |
| 109 | +<h2 style="padding-top:15px">Table of Contents</h2> |
| 110 | + |
| 111 | +<ol> |
| 112 | + |
| 113 | + <li>Introduction</li> |
| 114 | + |
| 115 | + <li>MapReduce Basics</li> |
| 116 | + |
| 117 | + <li>MapReduce Algorithm Design</li> |
| 118 | + |
| 119 | + <li>Inverted Indexing for Text Retrieval</li> |
| 120 | + |
| 121 | + <li>Graph Algorithms</li> |
| 122 | + |
| 123 | + <li>EM Algorithms for Text Processing</li> |
| 124 | + |
| 125 | + <li>Closing Remarks</li> |
| 126 | + |
| 127 | +</ol> |
| 128 | + |
| 129 | +<h2 style="padding-top:15px">Design Patterns & Algorithms</h2> |
| 130 | + |
| 131 | +<p><a href="http://www.umiacs.umd.edu/~jimmylin/Cloud9/docs/index.html">Cloud<sup><small>9</small></sup></a> |
| 132 | +is a MapReduce library for Hadoop designed both to serve as a teaching |
| 133 | +tool and to support research in data-intensive text processing. It |
| 134 | +also serves as a repository of |
| 135 | +<a href="http://www.umiacs.umd.edu/~jimmylin/Cloud9/docs/content/patterns.html">many |
| 136 | +examples</a> discussed in the book. Reference implementations of |
| 137 | +design patterns and other algorithms discussed in the book are being |
| 138 | +added gradually, so please come back periodically. Thus far, the |
| 139 | +repository contains:</p> |
| 140 | + |
| 141 | +<ul> |
| 142 | + |
| 143 | + <li><a href="http://www.umiacs.umd.edu/~jimmylin/Cloud9/docs/content/order-inversion.html">Order inversion</a> from Chapter 3 for computing bigram |
| 144 | + relative frequencies.</li> |
| 145 | + |
| 146 | + <li><a href="http://www.umiacs.umd.edu/~jimmylin/Cloud9/docs/content/pairs-stripes.html">"Pairs" |
| 147 | + and "stripes"</a> from Chapter 3 for computing the word |
| 148 | + co-occurrence matrix of a large text collection.</li> |
| 149 | + |
| 150 | + <li><a href="http://www.umiacs.umd.edu/~jimmylin/Cloud9/docs/content/pagerank.html">PageRank</a> from Chapter 5 |
| 151 | + as well as some more design patterns for graph algorithms not discussed in the book.</li> |
| 152 | + |
| 153 | +</ul> |
| 154 | + |
| 155 | +<h2 style="padding-top:15px">What People Are Saying</h2> |
| 156 | + |
| 157 | +<ul> |
| 158 | + |
| 159 | + <li>Book cited in a special report on managing information in <a href="http://www.economist.com/specialreports/displaystory.cfm?story_id=15557413">The Economist</a> (February 25, 2010)</li> |
| 160 | + |
| 161 | + <li>Design patterns mentioned by <a href="http://mir-in-action.blogspot.com/2010/04/mapreduce-algorithm-design.html">Mark Levy</a> at Last.fm (April 6, 2010)</li> |
| 162 | + |
| 163 | + <li>Google Research <a href="http://googleresearch.blogspot.com/2010/05/recent-accomplishments-by-research.html">plugs the book</a> (May 19, 2010)</li> |
| 164 | + |
| 165 | + <li>Mentioned in a blog post by <a href="http://www.ctctlabs.com/index.php/blog/detail/applying_data_mining_techniques_to_mapreduce/">Constant Contact Labs</a> (May 27, 2010)</li> |
| 166 | + |
| 167 | + <li>Deepak Singh from Amazon <a href="http://mndoci.com/2010/07/02/recommendation-data-intensive-text-processing-with-mapreduce/">recommends the book</a> (July 2, 2010)</li> |
| 168 | + |
| 169 | + <li>Used in <a href="http://www.csee.ogi.edu/~zak/cs506-pslc/">CS 506/606: Special Topics: Problem Solving with Large Clusters</a> by Izhak Shafran and Richard Sproat at Oregon Health & Science University (Spring 2010)</li> |
| 170 | + |
| 171 | + <li>Used in a Google-supported <a href="http://net.pku.edu.cn/~course/cs402/2010/index.html">Peking University course</a> on cloud computing by Hongfei Yan and Bo Peng (Summer, 2010)</li> |
| 172 | + |
| 173 | + <li>Used in <a href="http://www.andyli.ece.ufl.edu/teaching/eel6935/">EEL 6935: Special Topics in Cloud Computing and Storage</a> by Andy Li at the University of Florida (Fall, 2010 and Fall, 2011)</li> |
| 174 | + |
| 175 | + <li>Used in <a href="http://courses.cs.tamu.edu/caverlee/csce689/">CSCE 689: Internet-Scale Data Management</a> by James Caverlee at Texas A&M (Fall, 2010)</li> |
| 176 | + |
| 177 | + <li>Used in <a href="http://courses.cse.tamu.edu/caverlee/csce670/">CSCE 670: Information Storage and Retrieval</a> by James Caverlee at Texas A&M (Spring, 2011 and Spring, 2012)</li> |
| 178 | + |
| 179 | + <li>Used in <a href="http://cs.ua.edu/691Vrbsky/">CS 691-001: Cloud Computing</a> by Susan Vrbsky at University of Alabama (Spring, 2011)</li> |
| 180 | + |
| 181 | + <li>Used in <a href="http://courses.washington.edu/css534/syllabi/s11.html">CSS 534: Parallel Programming in Grid and Cloud</a> by Munehiro Fukuda at University of Washington (Spring, 2011)</li> |
| 182 | + |
| 183 | + <li>Used in <a href="http://snap.stanford.edu/class/cs341-2011/">CS341: Advanced Topics in Data Mining</a> by Jure Leskovec, Anand Rajaraman, and Jeff Ullman at Stanford (Spring, 2011)</li> |
| 184 | + |
| 185 | + <li>Used in <a href="http://www.eurecom.fr/~michiard/CCSS.html">Summer School on Cloud Computing: Challenges and opportunities</a> by Pietro Michiardi (Summer, 2011)</li> |
| 186 | + |
| 187 | + <li>Used in a <a href="http://net.pku.edu.cn/~course/cs402/2011/index.html">Peking University course</a> on mass data processing/cloud computing by Hongfei Yan and Bo Peng (Summer, 2011)</li> |
| 188 | + |
| 189 | + <li>Used in <a href="http://dicta-f11.utcompling.com/">CS395T / INF385T / LIN386M: Data-Intensive Computing for Text Analysis</a> by Jason Baldridge and Matt Lease at the University of Texas, Austin (Fall, 2011)</li> |
| 190 | + |
| 191 | + <li>Used in <a href="http://www.ccs.neu.edu/home/mirek/classes/2011-F-CS6240/index.htm">CS 6240: Parallel Data Processing in MapReduce</a> by Mirek Riedewald at Northeastern University (Fall 2011)</li> |
| 192 | + |
| 193 | + <li>Used in <a href="http://www.cs.gmu.edu/syllabus/syllabi-fall11/CS795BarbaraD.html">CS 795 Mining Massive Datasets</a> by Daniel Barbara at George Mason University (Fall, 2011)</li> |
| 194 | + |
| 195 | + <li>Used in <a href="http://www.cs.kent.edu/~jin/Cloud12Spring/Cloud.html">CS 4/5/6/79995: Advanced Computing Platforms for Data Processing</a> by Ruoming Jin at Kent State University (Spring 2012)</li> |
| 196 | + |
| 197 | + <li>Used in <a href="http://beowulf.lcs.mit.edu/18.337/">18.337/6.338: Parallel Computing</a> by Alan Edelman at MIT (Fall, 2011)</li> |
| 198 | + |
| 199 | + <li>Used in <a href="http://www.cse.buffalo.edu/~bina/cse487/fall2011/">CSE487/587 Data-Intensive Computing</a> by Bina Ramamurthy at SUNY Buffalo (Fall, 2011)</li> |
| 200 | + |
| 201 | + <li>Used in <a href="http://www.csc.lsu.edu/~wuyj/Teaching/7481/fa11/">CSC7481/LIS 7610 - Information Retrieval Systems</a> by Yejun Wu at LSU (Fall, 2011)</li> |
| 202 | + |
| 203 | + <li>Used in <a href="http://www.cs.brown.edu/courses/csci2950-u/f11/index.html">CSCI-2950u: Data-Intensive Scalable Computing</a> by Rodrigo Fonseca at Brown (Fall, 2011)</li> |
| 204 | + |
| 205 | + <li>Used in <a href="http://www.cs.sunysb.edu/~rezaul/CSE590-S12.html">CSE 590 (#50569): Topics in Computer Science (Supercomputing)</a> by Rezaul A. Chowdhury at Stony Brook University (Spring 2012)</li> |
| 206 | + |
| 207 | + <li>Used in <a href="http://www-scf.usc.edu/~csci572/">Course 572: Information Retrieval and Web Search Engines</a> by Ellis Horowitz at USC (Spring 2012)</li> |
| 208 | + |
| 209 | +</ul> |
| 210 | + |
| 211 | + </div> <!-- /container --> |
| 212 | + |
| 213 | + <script type="text/javascript" src="http://platform.twitter.com/widgets.js"></script> |
| 214 | + <script src="http://twitter.github.com/bootstrap/assets/js/jquery.js"></script> |
| 215 | + <script src="http://twitter.github.com/bootstrap/assets/js/google-code-prettify/prettify.js"></script> |
| 216 | + <script src="http://twitter.github.com/bootstrap/assets/js/bootstrap-transition.js"></script> |
| 217 | + <script src="http://twitter.github.com/bootstrap/assets/js/bootstrap-alert.js"></script> |
| 218 | + <script src="http://twitter.github.com/bootstrap/assets/js/bootstrap-modal.js"></script> |
| 219 | + <script src="http://twitter.github.com/bootstrap/assets/js/bootstrap-dropdown.js"></script> |
| 220 | + <script src="http://twitter.github.com/bootstrap/assets/js/bootstrap-scrollspy.js"></script> |
| 221 | + <script src="http://twitter.github.com/bootstrap/assets/js/bootstrap-tab.js"></script> |
| 222 | + <script src="http://twitter.github.com/bootstrap/assets/js/bootstrap-tooltip.js"></script> |
| 223 | + <script src="http://twitter.github.com/bootstrap/assets/js/bootstrap-popover.js"></script> |
| 224 | + <script src="http://twitter.github.com/bootstrap/assets/js/bootstrap-button.js"></script> |
| 225 | + <script src="http://twitter.github.com/bootstrap/assets/js/bootstrap-collapse.js"></script> |
| 226 | + <script src="http://twitter.github.com/bootstrap/assets/js/bootstrap-carousel.js"></script> |
| 227 | + <script src="http://twitter.github.com/bootstrap/assets/js/bootstrap-typeahead.js"></script> |
| 228 | + <script src="http://twitter.github.com/bootstrap/assets/js/application.js"></script> |
| 229 | + |
| 230 | + </body> |
| 231 | +</html> |