# Using EmrEtlRunner
## 1. Overview

There are two usage modes for EmrEtlRunner:
- Rolling mode where EmrEtlRunner processes whatever raw Snowplow event logs it finds in the In Bucket
- Timespan mode where EmrEtlRunner only processes those raw Snowplow event logs whose timestamp is within a timespan specified on the command-line
Timespan mode can be useful if you have a large backlog of raw Snowplow event logs and you want to start by processing just a small subset of those logs.
## 2. Command-line options

Invoke EmrEtlRunner using Bundler's `bundle exec` syntax:

```
$ bundle exec bin/snowplow-emr-etl-runner
```
Note the `bin/` sub-folder, and that the `bundle exec` command will only work when you are inside the `emr-etl-runner` folder.
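If it helps to see the whole sequence in one place, here is a minimal sketch, assuming your EmrEtlRunner checkout lives at the hypothetical path `~/snowplow/emr-etl-runner` (use whatever path you chose during installation in step 3.1):

```
# Hypothetical install location - substitute the path you used in step 3.1
cd ~/snowplow/emr-etl-runner

# Install the gem dependencies, if you have not already done so
bundle install

# Sanity check: this should print the EmrEtlRunner version number
bundle exec bin/snowplow-emr-etl-runner --version
```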
The command-line options for EmrEtlRunner look like this:
```
Usage: snowplow-emr-etl-runner [options]

Specific options:
    -c, --config CONFIG                configuration file
    -s, --start YYYY-MM-DD             optional start date *
    -e, --end YYYY-MM-DD               optional end date *
    -s, --skip staging,emr,archive     skip work step(s)
    -b, --process-bucket BUCKET        run emr only on specified bucket. Implies --skip staging,archive

    * filters the raw event logs processed by EmrEtlRunner by their timestamp

Common options:
    -h, --help                         Show this message
    -v, --version                      Show version
```
A note on the `--skip` option: this takes a comma-separated list of individual steps to skip. So for example you could run only the Hive job with the command-line option:

```
$ bundle exec bin/snowplow-emr-etl-runner --skip staging,archive --config config/config.yml
```
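Conversely, if your raw logs have already been staged (for example by an earlier run that failed after the staging step), a sketch of how you might skip just the staging step, reusing the same config file as above, is:

```
$ bundle exec bin/snowplow-emr-etl-runner --skip staging --config config/config.yml
```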
## 3. Running EmrEtlRunner

Invoking EmrEtlRunner with just the `--config` option puts it into rolling mode, processing all the raw Snowplow event logs it can find in your In Bucket:

```
$ bundle exec bin/snowplow-emr-etl-runner --config config/config.yml
```
To run EmrEtlRunner in timespan mode, you need to specify the `--start` and/or `--end` dates as well as the `--config` option, like so:

```
$ bundle exec bin/snowplow-emr-etl-runner \
    --config config.yml \
    --start 2012-06-20 \
    --end 2012-06-24
```
This will run EmrEtlRunner on log files which have timestamps in the period 20 June 2012 to 24 June 2012 inclusive.
Note that you do not have to specify both the start and end dates:

- Specify `--start` only and the timespan will run from your start date up to today, inclusive
- Specify `--end` only and the timespan will run from the beginning of time up to your end date, inclusive
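For example, reusing the dates from the run above, a start-only and an end-only invocation would look like this:

```
# Process logs from 20 June 2012 up to today, inclusive
$ bundle exec bin/snowplow-emr-etl-runner --config config.yml --start 2012-06-20

# Process logs from the beginning of time up to 24 June 2012, inclusive
$ bundle exec bin/snowplow-emr-etl-runner --config config.yml --end 2012-06-24
```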
If your raw Snowplow logs are generated by the Amazon CloudFront collector, please note that CloudFront records its log timestamps in UTC.
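If you want to double-check today's date in UTC before picking your `--start` and `--end` values, one quick way on most Unix-like systems is:

```
$ date -u +%Y-%m-%d
```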
## 4. Checking the results

Once you have run EmrEtlRunner, you should be able to manually inspect the folder in S3 specified by the `:out:` parameter in your `config.yml` file and see the new files generated. These contain the cleaned data, ready either for uploading into a storage target (e.g. Redshift or Infobright) or for analysing directly on EMR using Hive (or Pig, Mahout or some other Hadoop querying tool).
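One way to eyeball the output from the command line is shown below; this sketch assumes you have the AWS CLI installed and configured, and that your `:out:` parameter points at the hypothetical location `s3://my-out-bucket/events`:

```
# List the newly generated files in the out location (bucket name is hypothetical)
$ aws s3 ls s3://my-out-bucket/events/ --recursive | head
```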
Note: most Snowplow users run the 'hadoop' version of the ETL process, in which case the data generated is saved into sub-folders with names of the form `part-000...`. If, however, you are running the legacy 'hive' ETL (because e.g. you want to use Hive or Infobright as your storage target, rather than Redshift, which is the only storage target the 'hadoop' ETL currently supports), the sub-folder names will be of the format `dt=...`.
Comfortable using EmrEtlRunner? Then [schedule it](3-scheduling-EmrEtlRunner) so that it regularly takes new data generated by the collector, processes it, cleans it, enriches it, and writes it back to S3.
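As a preview of what scheduling might look like, here is a hypothetical crontab entry; the install path and the daily 04:00 UTC run time are both assumptions, and the scheduling guide linked above covers this properly:

```
# Hypothetical crontab entry: run EmrEtlRunner once a day at 04:00 UTC
0 4 * * * cd ~/snowplow/emr-etl-runner && bundle exec bin/snowplow-emr-etl-runner --config config/config.yml >> ~/emr-etl-runner.log 2>&1
```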