Hello World Hadoop project... Eclipse + Maven + unit tests
- /conf: config to be used for running MR on the cluster
- /conf-local: config to be used for running MR locally
- /hadoop-1.2.1: Hadoop binaries (recommended WITHOUT the conf/ dir, to avoid mistakes)
To set up the files in the /conf directory, you can use core-site.xml.sample and mapred-site.xml.sample as a reference.
Before running the MR jobs you need to build the JAR and generate some data.
$ mvn clean package
$ find /usr/share/doc > /tmp/MYDATA.txt
From Eclipse, run launch-local.xml as an Ant script. By default the launch-local target is used, which does not clean the output directory. Use the clean-and-launch target to clean the output directory before launching the MR job.
$ ./hadoop-1.2.1/bin/hadoop --config conf-local jar hello-hadoop-0.0.1-SNAPSHOT.jar \
ar.com.datatsunami.hellohadoop.Launcher file:///tmp/MYDATA.txt file:///tmp/OUTPUT
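When launching from the command line there is no clean-and-launch, and Hadoop refuses to start a job whose output directory already exists. Below is a minimal sketch of a command-line counterpart. The function name and the HADOOP_BIN variable are illustrative only, and it assumes that clean-and-launch does little more than delete the output directory first (launch-local.xml is not reproduced here).

```shell
#!/bin/sh
# Sketch only: HADOOP_BIN and clean_and_launch_local are hypothetical names,
# not part of this project.
HADOOP_BIN="${HADOOP_BIN:-./hadoop-1.2.1/bin/hadoop}"

clean_and_launch_local() {
    input="$1"    # absolute path of the local input file, e.g. /tmp/MYDATA.txt
    output="$2"   # absolute path of the local output directory, e.g. /tmp/OUTPUT

    if [ -z "$input" ] || [ -z "$output" ]; then
        echo "usage: clean_and_launch_local INPUT_FILE OUTPUT_DIR" >&2
        return 1
    fi

    # Remove the output of a previous run; the job fails if it is still there.
    rm -rf "$output"

    # Same command as above, with the paths turned into file:// URIs.
    "$HADOOP_BIN" --config conf-local jar hello-hadoop-0.0.1-SNAPSHOT.jar \
        ar.com.datatsunami.hellohadoop.Launcher "file://$input" "file://$output"
}
```

After loading the function into the shell, `clean_and_launch_local /tmp/MYDATA.txt /tmp/OUTPUT` would repeat the local run without a manual cleanup.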
$ mvn clean package
$ ./hadoop-1.2.1/bin/hadoop --config conf fs -copyFromLocal /tmp/MYDATA.txt /
$ ./hadoop-1.2.1/bin/hadoop --config conf jar hello-hadoop-0.0.1-SNAPSHOT.jar \
ar.com.datatsunami.hellohadoop.Launcher /MYDATA.txt /OUTPUT
$ ./hadoop-1.2.1/bin/hadoop --config conf fs -ls /OUTPUT
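Running the job on the cluster a second time fails while /OUTPUT still exists in HDFS. The sketch below removes it, relaunches the job and prints the result files. The function name and HADOOP_BIN are illustrative, and the part-* file naming is the usual Hadoop convention rather than something this project states.

```shell
#!/bin/sh
# Sketch only: HADOOP_BIN and rerun_on_cluster are hypothetical names.
HADOOP_BIN="${HADOOP_BIN:-./hadoop-1.2.1/bin/hadoop}"

rerun_on_cluster() {
    # Delete the previous output from HDFS. "|| true" lets the first run
    # continue, when /OUTPUT does not exist yet.
    "$HADOOP_BIN" --config conf fs -rmr /OUTPUT || true

    # Launch the job exactly as in the commands above.
    "$HADOOP_BIN" --config conf jar hello-hadoop-0.0.1-SNAPSHOT.jar \
        ar.com.datatsunami.hellohadoop.Launcher /MYDATA.txt /OUTPUT || return 1

    # Print the result files. The glob is quoted so that Hadoop, not the
    # local shell, expands it against HDFS.
    "$HADOOP_BIN" --config conf fs -cat '/OUTPUT/part-*'
}
```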
$ mvn eclipse:eclipse
Downloading the Hadoop sources or javadocs with Maven DOESN'T WORK, so you'll have to set them up in Eclipse yourself.
$ mvn dependency:sources -DincludeGroupIds=org.apache.hadoop
(...)
[INFO] The following files were skipped:
[INFO] org.apache.hadoop:hadoop-core:java-source:sources:1.2.0
[INFO] org.apache.hadoop:hadoop-test:java-source:sources:1.2.0
(...)
$ mvn dependency:resolve -DincludeGroupIds=org.apache.hadoop -Dclassifier=javadoc
(...)
[INFO] The following files have NOT been resolved:
[INFO] org.apache.hadoop:hadoop-core:java-source:javadoc:1.2.0
[INFO] org.apache.hadoop:hadoop-test:java-source:javadoc:1.2.0
(...)
First, you'll need to generate the jar with the sources, and then:
$ mvn org.apache.maven.plugins:maven-install-plugin:2.5:install-file \
-Dfile=hadoop-1.2.1-custom-sources.jar \
-DgroupId=org.apache.hadoop \
-DartifactId=hadoop-core \
-Dversion=1.2.1 \
-Dpackaging=jar \
-Dclassifier=sources
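The hadoop-1.2.1-custom-sources.jar passed to -Dfile is not shipped with Hadoop. One possible way to build it is sketched below. It assumes the Hadoop directory still contains its src/ tree with the core, hdfs and mapred modules, which may not hold for every download; the function name and JAR_BIN are illustrative.

```shell
#!/bin/sh
# Sketch only: JAR_BIN and build_sources_jar are hypothetical names, and the
# src/core, src/hdfs, src/mapred layout is an assumption about the tarball.
JAR_BIN="${JAR_BIN:-jar}"

build_sources_jar() {
    hadoop_home="$1"   # e.g. ./hadoop-1.2.1
    out_jar="$2"       # e.g. hadoop-1.2.1-custom-sources.jar

    staging="$(mktemp -d)" || return 1
    found=0

    # Merge the module source trees into one directory, so the package
    # directories they share end up in the jar only once.
    for module in core hdfs mapred; do
        if [ -d "$hadoop_home/src/$module" ]; then
            cp -R "$hadoop_home/src/$module/." "$staging/"
            found=1
        fi
    done

    if [ "$found" -eq 0 ]; then
        echo "no sources found under $hadoop_home/src" >&2
        rm -rf "$staging"
        return 1
    fi

    # Pack the merged tree, then clean up the staging directory.
    "$JAR_BIN" cf "$out_jar" -C "$staging" .
    status=$?
    rm -rf "$staging"
    return "$status"
}
```

Calling `build_sources_jar ./hadoop-1.2.1 hadoop-1.2.1-custom-sources.jar` would produce the file used by the install-file command.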
I needed to remove a file from my local repository:
$ rm ~/.m2/repository/org/apache/hadoop/hadoop-core/1.2.1/hadoop-core-1.2.1-sources.jar-not-available
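Other Hadoop artifacts may have left similar marker files behind. The sketch below only lists them, so you can decide what to delete. It assumes the default local repository location; M2_REPO and the function name are illustrative.

```shell
#!/bin/sh
# Sketch only: M2_REPO and list_not_available_markers are hypothetical names.
M2_REPO="${M2_REPO:-$HOME/.m2/repository}"

list_not_available_markers() {
    # Print every "-not-available" marker stored for org.apache.hadoop.
    find "$M2_REPO/org/apache/hadoop" -type f -name '*-not-available' 2>/dev/null
}
```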
The sources will be available to Eclipse after running:
$ mvn eclipse:eclipse
- Developing, testing and debugging Hadoop map/reduce jobs with Eclipse
- Open Source Big Data for the Impatient, Part 1: Hadoop tutorial: Hello World with Java, Pig, Hive, Flume, Fuse, Oozie, and Sqoop with Informix, DB2, and MySQL
- http://comments.gmane.org/gmane.comp.java.hadoop.mapreduce.user/10598
- Get source jar files attached to Eclipse for Maven-managed dependencies
- Eclipse + MRUnit + Maven
- Use Maven task for Ant: https://maven.apache.org/ant-tasks/installation.html
- Try other versions of Hadoop
- Try debug of MR job from Eclipse
- Try to find a way to attach Hadoop sources to Eclipse (workaround found using mvn install:install-file)
- https://issues.apache.org/jira/browse/HADOOP-8363
- https://issues.apache.org/jira/browse/HADOOP-8498
- https://issues.apache.org/jira/browse/MAPREDUCE-4035
- http://stackoverflow.com/questions/12551977/repository-for-hadoop-stable-release-with-sources
- apache/hadoop-common#8
- Possible workaround: https://maven.apache.org/plugins/maven-install-plugin/examples/installing-secondary-artifacts.html
Copyright 2013 (C) Horacio G. de Oro
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.