Revamped EMR scripts using pre-built binaries #4

Open
wants to merge 6 commits into
base: master
31 changes: 18 additions & 13 deletions README.md
Original file line number Diff line number Diff line change
@@ -23,27 +23,28 @@
* Run the command from the command line (or DOS Prompt) on your local machine where you installed elastic-mapreduce as outlined in the install guide above
* Linux/Mac
````
export S3BUCKET=sguhamozillaemr
./elastic-mapreduce --create --alive --name "RhipeCluster" --enable-debugging \
--num-instances 2 --slave-instance-type m1.large --master-instance-type m3.xlarge --ami-version "2.4.2" \
--num-instances 3 --slave-instance-type m3.xlarge --master-instance-type m3.xlarge --ami-version "2.4.2" \
--with-termination-protection \
--key-pair <Your Key Pair> \
--log-uri s3://<bucket>/logs \
--key-pair KEYPAIRNAME \
--log-uri s3://$S3BUCKET/logs/ \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
--args "-m,mapred.reduce.tasks.speculative.execution=false" \
--args "-m,mapred.map.tasks.speculative.execution=false" \
--args "-m,mapred.map.child.java.opts=-Xmx1024m" \
--args "-m,mapred.reduce.child.java.opts=-Xmx1024m" \
--args "-m,mapred.job.reuse.jvm.num.tasks=1" \
--bootstrap-action "s3://<bucket>/install-preconfigure" \
--bootstrap-action "s3://<bucket>/install-r" \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/run-if --args "instance.isMaster=true,s3://<bucket>/install-rstudio" \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/run-if --args "instance.isMaster=true,s3://<bucket>/install-shiny-server" \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/run-if --args "instance.isMaster=true,s3://<bucket>/install-post-hadoop" \
--bootstrap-action "s3://<bucket>/install-protobuf" \
--bootstrap-action "s3://<bucket>/install-rhipe" \
--bootstrap-action "s3://<bucket>/install-additional-pkgs" \
--bootstrap-action "s3://<bucket>/install-post-configure"
--bootstrap-action "s3://$S3BUCKET/install-preconfigure" \
--bootstrap-action "s3://$S3BUCKET/install-r" \
--bootstrap-action "s3://$S3BUCKET/install-all-software" \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/run-if --args "instance.isMaster=true,s3://$S3BUCKET/install-master-r" \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/run-if --args "instance.isMaster=true,s3://$S3BUCKET/install-post-hadoop" \
--bootstrap-action "s3://$S3BUCKET/install-additional-pkgs" \
--bootstrap-action "s3://$S3BUCKET/install-rhipe" \
--bootstrap-action "s3://$S3BUCKET/install-post-configure"
````


* Windows Users:
* Run the following command from the DOS Prompt
@@ -73,7 +74,11 @@ From the AWS EC2 web site, find the master node in the EC2 instance list and sel
* "port range" = 8787
* "source" = your IP address OR Anywhere

Repeat for ports (check that the port are not already available first): 22, 9100, 9103
Repeat for ports 22, 9100, 9103 and 3838 (for Shiny Server), first checking that each port is not already open.
This need only be done once for the security group of the master. All subsequent
clusters will have their master inside the same security group, so the same
permissions will apply to them too.
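
The port rules above can also be added from the command line. The following is a minimal sketch, not part of this PR. It assumes the AWS CLI (`aws`) is installed and configured, which this README does not otherwise require, and that the master's security group is named `ElasticMapReduce-master` (check the name in your EC2 console; `--group-name` only resolves groups in EC2-Classic or the default VPC). `print_ingress_cmds` is a hypothetical helper, and the CIDR is a placeholder address. The helper only prints the commands so they can be reviewed before being run.

```shell
# Hypothetical helper: print one AWS CLI command per port to open on a
# security group. Nothing is executed here; review the output first.
# Assumptions: the group name and CIDR passed in are placeholders to adapt.
print_ingress_cmds() {
  local group="$1" cidr="$2"
  shift 2
  local port
  for port in "$@"; do
    echo "aws ec2 authorize-security-group-ingress --group-name $group --protocol tcp --port $port --cidr $cidr"
  done
}

# Ports named in this README: RStudio (8787), ssh (22), 9100, 9103, Shiny (3838).
print_ingress_cmds "ElasticMapReduce-master" "203.0.113.7/32" 8787 22 9100 9103 3838
```

If the printed commands look right, they can be piped to `sh`. A port that is already open will make the corresponding command fail, which is harmless.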


## Accessing RStudio ##
*****
27 changes: 27 additions & 0 deletions emr-2.4.2/README.md
@@ -0,0 +1,27 @@
## Why prefix-emr.R ?

Though I made every effort to include useful R packages in the binary tar archives,
I still forgot some. Once the cluster is up and running, how do you install an
R package on all the nodes? I do not know of a way on Elastic MapReduce.

One approach is to install R packages on the master node (the node you ssh into)
and then create an R bundle: the entire R distribution, the packages, the
binaries and every shared library in their dependency graph. This is what the
function `buildingR` in prefix-emr.R does.

When you source this file, it checks for the presence of Remr.tar.gz on the
HDFS. If it exists, it passes some options to RHIPE to use this archive to
execute R on the nodes. If not, it creates the bundle and saves it to the HDFS.

With this approach, if you need to install an R package that will be required
on the nodes, install the package on the master and rerun the `buildingR`
function (in a fresh R session):

```
library(Rhipe)
rhinit()
buildingR(nameof="Remr", dest="/",verbose=100)
```

And then source prefix-emr.R and carry on as usual.
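
The check that prefix-emr.R performs when it is sourced can also be done from the shell on the master before starting R. This is a rough sketch, assuming the `hadoop` command is on the `PATH` and that the bundle lives at `/Remr.tar.gz` as in the snippet above. `bundle_status` is a hypothetical helper name, not something this PR provides.

```shell
# Hypothetical helper: report whether the R bundle is already on HDFS.
# `hadoop fs -test -e PATH` exits with status 0 when PATH exists.
bundle_status() {
  if hadoop fs -test -e "$1"; then
    echo "present"
  else
    echo "missing"
  fi
}

# Guarded so the sketch does nothing on a machine without Hadoop.
if command -v hadoop >/dev/null 2>&1; then
  echo "Remr bundle is $(bundle_status /Remr.tar.gz)"
fi
```

When the bundle is reported as missing, sourcing prefix-emr.R will build it. When it is present but you have since installed new packages on the master, rerun `buildingR` as described above.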

14 changes: 6 additions & 8 deletions emr-2.4.2/install-additional-pkgs
@@ -1,10 +1,8 @@
#!/bin/bash
sudo apt-get --yes --force-yes install libcurl4-openssl-dev screen

## devtools
sudo su - -c "R -e \"install.packages('devtools', repos='http://cran.rstudio.com/')\""
## datadr
sudo su - -c "R -e \"options(repos = 'http://cran.rstudio.com/'); library(devtools); install_github('datadr', 'hafen')\""
## trelliscope
sudo su - -c "R -e \"options(repos = 'http://cran.rstudio.com/'); library(devtools); install_github('trelliscope', 'hafen')\""


## Add your packages here (examples of how to do that are in recreate-binaries/install-additional-packages)




10 changes: 10 additions & 0 deletions emr-2.4.2/install-all-software
@@ -0,0 +1,10 @@
#!/bin/bash
## Runs on All nodes
## install-all-software
wget http://ml.stat.purdue.edu/rpackages/emr-binaries/forAll-binaries.tar.gz
tar zxvf forAll-binaries.tar.gz
mv forAll emr-binaries
cd emr-binaries
sh complete.protobuf
mv site-library/* /usr/local/lib/R/site-library/
ln -s /usr/local/lib/R/site-library/rterra/include /usr/local/lib/R/site-library/rterra/clangheaders
11 changes: 11 additions & 0 deletions emr-2.4.2/install-master-r
@@ -0,0 +1,11 @@
#!/bin/bash
## Runs on master
## install-server-r-components
wget http://ml.stat.purdue.edu/rpackages/emr-binaries/forMaster-binaries.tar.gz
tar zxvf forMaster-binaries.tar.gz
rm -rf emr-binaries
mv forMaster /home/hadoop/emr-binaries
cd /home/hadoop/emr-binaries
sh complete.rstudio.shiny


20 changes: 16 additions & 4 deletions emr-2.4.2/install-preconfigure
@@ -1,13 +1,20 @@
#!/bin/bash

# apt
sudo apt-get -y update
sudo apt-get -y install pkg-config
sudo apt-get -y install libcurl4-openssl-dev libtbb-dev



# java
echo '/usr/lib/jvm/java-7-oracle/jre/lib/amd64/server/' | sudo tee -a /etc/ld.so.conf.d/jre.conf
echo '/usr/lib/jvm/java-7-oracle/jre/lib/amd64/' | sudo tee -a /etc/ld.so.conf.d/jre.conf
echo '/home/hadoop/lib64' | sudo tee -a /etc/ld.so.conf.d/hadoop.conf
echo '/usr/lib/jvm/java-7-oracle/jre/lib/amd64/server/' | sudo tee -a /etc/ld.so.conf.d/jre.conf
echo '/usr/lib/jvm/java-7-oracle/jre/lib/amd64/' | sudo tee -a /etc/ld.so.conf.d/jre.conf
echo '/home/hadoop/lib64' | sudo tee -a /etc/ld.so.conf.d/hadoop.conf

## Link JavaH and JAR to /usr/bin/
## compiling rJava seems to fail without this
sudo ln -s /usr/lib/jvm/java-7-oracle/bin/javah /usr/bin/javah
sudo ln -s /usr/lib/jvm/java-7-oracle/bin/jar /usr/bin/jar
sudo ldconfig

# hadoop config
@@ -20,3 +27,8 @@ echo 'export HADOOP=/home/hadoop'| sudo tee -a /home/hadoop/.bash_profile
echo 'export HADOOP_HOME=/home/hadoop/' | sudo tee -a /home/hadoop/.bash_profile
echo 'export HADOOP_CONF_DIR=/home/hadoop/conf' | sudo tee -a /home/hadoop/.bash_profile
echo 'export HADOOP_LIBS=/home/hadoop:/home/hadoop/lib'| sudo tee -a /home/hadoop/.bash_profile

## Remove this troublesome file
rm -rf /home/hadoop/.versions/hive-0.11.0/hcatalog/share/webhcat/svr/lib/xercesImpl-2.6.1.jar
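
Because the `echo ... | sudo tee -a` lines in this script append unconditionally, running it a second time on the same node would add duplicate entries to the `ld.so.conf.d` files and to `.bash_profile`. One possible guard is sketched below. `append_once` is a hypothetical helper, not part of this PR, and the demo writes to a scratch file rather than `/etc` (on the cluster the write would need `sudo`, as in the script).

```shell
# Hypothetical helper: append a line to a file only if that exact line is
# not already present (grep -x: whole-line match, -F: fixed string, -q: quiet).
append_once() {
  local line="$1" file="$2"
  if ! grep -qxF -e "$line" "$file" 2>/dev/null; then
    printf '%s\n' "$line" >> "$file"
  fi
}

# Demo against a scratch file instead of /etc/ld.so.conf.d/jre.conf.
conf="$(mktemp)"
append_once '/usr/lib/jvm/java-7-oracle/jre/lib/amd64/server/' "$conf"
append_once '/usr/lib/jvm/java-7-oracle/jre/lib/amd64/server/' "$conf"
cat "$conf"
```

The second call is a no-op, so the scratch file ends up with a single copy of the path.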


32 changes: 23 additions & 9 deletions emr-2.4.2/install-r
@@ -6,24 +6,38 @@ dpkg --get-selections | grep r-base | awk '{print $1}' | xargs sudo apt-get --ye

# setup repo
echo "deb http://cran.rstudio.com/bin/linux/debian squeeze-cran3/" | sudo tee -a /etc/apt/sources.list
gpg --keyserver pgpkeys.mit.edu --recv-key 06F90DE5381BA480
gpg --keyserver pgpkeys.mit.edu --recv-key 06F90DE5381BA480
gpg -a --export 06F90DE5381BA480 | sudo apt-key add -

# install 3.1
sudo apt-get --yes update
sudo -E apt-get -t squeezecran3.0 --yes --force-yes install r-base-core=3.1.0-1~squeezecran3.0
sudo -E apt-get -t squeezecran3.0 --yes --force-yes install r-base-dev=3.1.0-1~squeezecran3.0
sudo chmod -R aou=rwx /usr/local/lib/R/site-library
sudo chmod -R aou=rwx /usr/local/lib/R/site-library

######################################################################
## You only need to run the commented-out code below to update emr-binaries.
## That is, if you wish to recreate the binaries that are downloaded
## in install-all-software and install-master-r, run the following code.
#######################################################################


# packages need updating
sudo su - -c "R -e \"install.packages('codetools', repos='http://cran.rstudio.com/')\""
sudo su - -c "R -e \"install.packages('lattice', repos='http://cran.rstudio.com/')\""
sudo su - -c "R -e \"install.packages('MASS', repos='http://cran.rstudio.com/')\""
sudo su - -c "R -e \"install.packages('boot', repos='http://cran.rstudio.com/')\""
# sudo su - -c "R -e \"install.packages('codetools', repos='http://cran.rstudio.com/')\""
# sudo su - -c "R -e \"install.packages('lattice', repos='http://cran.rstudio.com/')\""
# sudo su - -c "R -e \"install.packages('MASS', repos='http://cran.rstudio.com/')\""
# sudo su - -c "R -e \"install.packages('boot', repos='http://cran.rstudio.com/')\""
# sudo su - -c "R -e \"install.packages('nnet', repos='http://cran.rstudio.com/')\""
# sudo su - -c "R -e \"install.packages('cluster', repos='http://cran.rstudio.com/')\""

# some other required packages

## rjava ##
wget http://cran.r-project.org/src/contrib/rJava_0.9-6.tar.gz
sudo R CMD INSTALL rJava_0.9-6.tar.gz
# sudo R CMD javareconf
# wget http://cran.r-project.org/src/contrib/rJava_0.9-6.tar.gz
# sudo R CMD INSTALL rJava_0.9-6.tar.gz

## shiny package ##
sudo su - -c "R -e \"install.packages('shiny', repos='http://cran.rstudio.com/')\""
# sudo su - -c "R -e \"install.packages('shiny', repos='http://cran.rstudio.com/')\""

## took 2 minutes
11 changes: 8 additions & 3 deletions emr-2.4.2/install-rhipe
@@ -1,5 +1,10 @@
#!/bin/bash

export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH
wget http://people.mozilla.com/~sguha/Rhipe_0.74.0_hadoop-1.tar.gz
R CMD INSTALL Rhipe_0.74.0_hadoop-1.tar.gz
###################################################################################
## The binaries (which are downloaded in install-all-software and install-master-r)
## already include this version of RHIPE. Uncomment the lines below only if you
## want a newer version of RHIPE. This adds about 35 seconds to the build time.
##################################################################################
# export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH
# wget http://people.mozilla.com/~sguha/Rhipe_0.74.0_hadoop-1.tar.gz
# R CMD INSTALL Rhipe_0.74.0_hadoop-1.tar.gz
107 changes: 107 additions & 0 deletions emr-2.4.2/prefix-emr.R
@@ -0,0 +1,107 @@

buildingR <- function(excludeLibs=c(),exclude=NULL,iterate=TRUE,verbose=1,nameof="Rfolder-test",destpath=sprintf("/user/%s/",USER)){
  ## NOTE: USER is not defined in this file; define it beforehand or pass
  ## destpath explicitly.
  library(Rhipe)
  rhinit()
  local({
    tfolder <- sprintf("%s/Rdist",tempdir())
    ## delete folder if it exists!
    dir.create(tfolder)
    execu <- if ("package:Rhipe" %in% search()) rhoptions()$RhipeMapReduce else sprintf("/home/%s/software/R_LIBS/Rhipe/bin/RhipeMapReduce",USER)
    ## execu <- if ("package:Rhipe" %in% search()) rhoptions()$Rhipe else sprintf("/home/%s/software/R_LIBS/Rhipe/libs/Rhipe.so",USER)
    ## getLB: paths of the shared libraries that `ldd` reports for file n
    getLB <- function(n){
      a <- system(sprintf("ldd %s",n),intern=TRUE)
      b <- lapply(strsplit(a,"=>"), function(r) if (length(r)>=2) r[2] else NULL)
      b <- b[unlist(lapply(b, function(bs){!grepl("(not found)",bs)}))]
      if(length(b)==0) return()
      b <- strsplit(unlist(b)," ")
      b <- unlist(lapply(b,"[[",2))
      b <- unique(unlist( sapply(b, function(r) if(nchar(r)>1) r else NULL)))
      names(b) <- NULL
      if(verbose>=1){
        cat(sprintf("\n%s depends on:\n",n))
        cat(paste(b,sep=":",collapse=" "))
        cat(sprintf("\n---------------\n"))
        if(verbose>10) print(a)
      }
      b
    }
    b <- getLB(execu)
    ## b <- unique(b[!grepl("(libc.so)",b)])
    for(x in b) {
      cat(sprintf("Copying %s to %s\n",x,tfolder))
      file.copy(x,tfolder) ##copies the linked .so files
    }
    file.copy(execu,tfolder,overwrite=TRUE) ## copies the RHIPE C engine
    file.copy(R.home(),tfolder,recursive=TRUE)
    ## R_LIBS
    x <- .libPaths() ##Sys.getenv("R_LIBS")
    if(TRUE){
      for(y in list.files(x,full.names=TRUE)){
        if(all( sapply(excludeLibs,function(h) !grepl(h,y))))
          file.copy(y,sprintf("%s/R/library/",tfolder), recursive=TRUE)
      }
      allfiles <- list.files(x,full.names=TRUE,rec=TRUE)
      allsofiles <- allfiles[grepl("\\.so$",allfiles)]
      alldeps <- sort(unique(unlist(sapply(allsofiles, getLB))))
      id <- 1
      if(iterate){
        ## repeat until a pass over the known libraries finds nothing new
        while(TRUE){
          message(sprintf("iteration %s", id))
          alldeps2 <- sort(unique(unlist(sapply(alldeps, getLB))))
          newones <- sum(!(alldeps2 %in% alldeps))
          if(newones>0){
            message(sprintf("There were %s additions(total=%s), iterating till this becomes zero", newones, length(alldeps2)))
            id <- id+1
            ## keep the libraries found so far as well as the new ones
            alldeps <- sort(union(alldeps, alldeps2))
          }else break
        }
      }
      if(!is.null(exclude)) alldeps <- alldeps[!grepl(exclude,alldeps)]
      for(x in alldeps) {
        cat(sprintf("Copying %s to %s\n",x,tfolder))
        file.copy(x,tfolder) ##copies the linked .so files
      }
    }
  })
  cat(sprintf("Building a gzipped tar archive at %s/%s.tar.gz\n",tempdir(),nameof))
  system (sprintf("tar z --create --file=%s/%s.tar.gz -C %s/Rdist .",tempdir(),nameof, tempdir()))
  cat(sprintf("Copying gzipped tar archive to HDFS (see %s) in user folder\n",sprintf("%s.tar.gz",nameof)))
  if ("package:Rhipe" %in% search()) rhput(sprintf("%s/%s.tar.gz",tempdir(),nameof),destpath)
}



Sys.setenv("RHIPE_DEBUG_LEVEL"=2L)
library(Rhipe)
rhinit()
options(width=200)
if(!any(grepl("Remr", rhls("/")$file)))
  buildingR(nameof="Remr", dest="/",verbose=100)
RDIST <- "Remr"
m <- rhoptions()$mropts
m$R_ENABLE_JIT = 2
m$R_HOME = sprintf("%s/R",RDIST)
m$R_HOME_DIR = sprintf("./%s/R",RDIST)
m$R_SHARE_DIR = sprintf("./%s/R/share",RDIST)
m$R_INCLUDE_DIR = sprintf("./%s/R/include",RDIST)
m$R_DOC_DIR = sprintf("./%s/R/doc",RDIST)
m$PATH = sprintf("./%s/R/bin:./%s/:$PATH",RDIST,RDIST)
m$LD_LIBRARY_PATH = sprintf("./%s/:./%s/R/lib:/usr/lib64",RDIST,RDIST)

rhoptions(runner = sprintf("./%s/RhipeMapReduce --silent --vanilla",RDIST),
          zips = sprintf("/%s.tar.gz",RDIST),
          HADOOP.TMP.FOLDER = sprintf("/tmp/"),
          mropts = m,
          job.status.overprint =TRUE,
          write.job.info =TRUE)
rm(m);
summer <- Rhipe::rhoptions()$templates$scalarsummer

## I forgot to install R packages in the bootup script, so either I
## shut it all down or start R like this, i.e. install packages on the
## main node and create an R bundle as shown above
x <- rhwatch(map=function(a,b){
  library(rjson)
  suppressPackageStartupMessages(library(data.table))
  rhcollect(1, data.table(x=runif(10)))
}, reduce=0, input=c(10,10))