A collection of resources for building low-latency, large-scale web crawlers on Storm, available under the Apache License.
Available from Maven Central with:
<dependency>
    <groupId>com.digitalpebble</groupId>
    <artifactId>storm-crawler-core</artifactId>
    <version>0.5</version>
</dependency>
To get started with storm-crawler, we recommend running the CrawlTopology in local mode.
NOTE: These instructions assume that you have Maven installed.
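A quick way to confirm the prerequisites before going further is sketched below. The helper name `have_tool` is hypothetical, and checking for `git` alongside Maven is an assumption based on the clone step that follows.

```shell
# Hypothetical helper: succeeds if the named command is on the PATH.
have_tool() {
  command -v "$1" >/dev/null 2>&1
}

# Report which of the tools used in these instructions are available.
for tool in git mvn; do
  if have_tool "$tool"; then
    echo "$tool: found"
  else
    echo "$tool: not found on PATH"
  fi
done
```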
First, clone the project from GitHub:
git clone https://github.com/DigitalPebble/storm-crawler
Then:
cd storm-crawler/core
mvn clean compile exec:java -Dstorm.topology=com.digitalpebble.storm.crawler.CrawlTopology -Dexec.args="-conf crawler-conf.yaml -local"
to run the demo CrawlTopology in local mode.
Alternatively, generate an uberjar:
mvn clean package
and then submit the topology with storm jar:
storm jar target/storm-crawler-core-0.6-SNAPSHOT-jar-with-dependencies.jar com.digitalpebble.storm.crawler.CrawlTopology -conf crawler-conf.yaml
to run it in distributed mode.
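The version in the uberjar file name changes from build to build, so it can help to locate the jar rather than type the name by hand. The following is a minimal sketch, assuming the jar is written to the target directory with the jar-with-dependencies suffix shown above; the function name `find_uberjar` is hypothetical.

```shell
# Hypothetical helper: print the path of the first uberjar found in the
# given directory (default: target). Returns non-zero if none exists.
find_uberjar() {
  dir="${1:-target}"
  for jar in "$dir"/storm-crawler-core-*-jar-with-dependencies.jar; do
    # If the glob matched nothing, it stays literal and this test fails.
    if [ -f "$jar" ]; then
      printf '%s\n' "$jar"
      return 0
    fi
  done
  return 1
}
```

The result could then be passed to the same command as above, for example: storm jar "$(find_uberjar)" com.digitalpebble.storm.crawler.CrawlTopology -conf crawler-conf.yaml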
Mailing list: http://groups.google.com/group/digitalpebble
Or use the tag storm-crawler on Stack Overflow.