2 Using EmrEtlRunner

Alexander Dean edited this page Aug 23, 2013 · 10 revisions

  1. Overview
  2. Command-line options
  3. Running in each mode
  4. Checking the results
  5. Next steps
## 1. Overview

There are two usage modes for EmrEtlRunner:

  1. Rolling mode, where EmrEtlRunner processes whatever raw Snowplow event logs it finds in the In Bucket
  2. Timespan mode, where EmrEtlRunner processes only those raw Snowplow event logs whose timestamps fall within a timespan specified on the command line

Timespan mode can be useful if you have a large backlog of raw Snowplow event logs and you want to start by processing just a small subset of those logs.

## 2. Command-line options

Invoke EmrEtlRunner using Bundler's bundle exec syntax:

$ bundle exec bin/snowplow-emr-etl-runner

Note the bin/ sub-folder, and that the bundle exec command will only work when you are inside the emr-etl-runner folder.
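
Because bundle exec only works from inside the emr-etl-runner folder, a small wrapper can save you from changing directory by hand. The following is a minimal sketch and not part of EmrEtlRunner itself: the function name run_emr_etl_runner, the EMR_ETL_RUNNER_HOME variable and its default path are all assumptions made for illustration.

```shell
# Hypothetical install location; point this at your own emr-etl-runner folder.
EMR_ETL_RUNNER_HOME="${EMR_ETL_RUNNER_HOME:-$HOME/emr-etl-runner}"

# Run EmrEtlRunner from inside its folder, passing all arguments through.
# The function body is a subshell ( ... ), so the caller's working
# directory is left untouched and a failed cd aborts only this call.
run_emr_etl_runner() (
  cd "$EMR_ETL_RUNNER_HOME" || exit 1
  bundle exec bin/snowplow-emr-etl-runner "$@"
)

# Example (relative paths are resolved inside the emr-etl-runner folder):
#   run_emr_etl_runner --config config/config.yml
```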

The command-line options for EmrEtlRunner look like this:

Usage: snowplow-emr-etl-runner [options]

Specific options:
    -c, --config CONFIG              configuration file
    -s, --start YYYY-MM-DD           optional start date *
    -e, --end YYYY-MM-DD             optional end date *
    -s, --skip staging,emr,archive   skip work step(s)
    -b, --process-bucket BUCKET      run emr only on specified bucket. Implies --skip staging,archive

* filters the raw event logs processed by EmrEtlRunner by their timestamp

Common options:
    -h, --help                       Show this message
    -v, --version                    Show version

A note on the --skip option: it takes a comma-separated list of the individual steps to skip. So, for example, you could skip staging and archiving and run only the EMR step (the Hive job) with the command-line option:

$ bundle exec bin/snowplow-emr-etl-runner --skip staging,archive --config config/config.yml
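
Since --skip lists the steps to leave out, it is sometimes easier to think in terms of the steps you want to run and derive the skip list from that. The helper below is a sketch under that assumption; skip_list_for is a hypothetical name, and the three step names are the ones shown in the usage text above.

```shell
# Given the step(s) you want to RUN, print the comma-separated list
# of the remaining steps, suitable for passing to --skip.
skip_list_for() (
  skip=""
  for step in staging emr archive; do
    case " $* " in
      *" $step "*) ;;                     # requested step: do not skip it
      *) skip="${skip:+$skip,}$step" ;;   # anything else joins the skip list
    esac
  done
  printf '%s\n' "$skip"
)

# Example: skip_list_for emr   prints: staging,archive
#   bundle exec bin/snowplow-emr-etl-runner --skip "$(skip_list_for emr)" --config config/config.yml
```

If you ask for all three steps the helper prints an empty line, in which case the --skip option should simply be omitted.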
## 3. Running in each mode

### 3.1 Rolling mode

Invoking EmrEtlRunner with just the --config option puts it into rolling mode, processing all the raw Snowplow event logs it can find in your In Bucket:

$ bundle exec bin/snowplow-emr-etl-runner --config config/config.yml

### 3.2 Timespan mode

To run EmrEtlRunner in timespan mode, you need to specify the --start and/or --end dates as well as the --config option, like so:

$ bundle exec bin/snowplow-emr-etl-runner \
  --config config/config.yml \
  --start 2012-06-20 \
  --end 2012-06-24

This will run EmrEtlRunner on log files which have timestamps in the period 20 June 2012 to 24 June 2012 inclusive.

Note that you do not have to specify both the start and end dates:

  1. Specify --start only and the timespan will run from your start date up to today, inclusive
  2. Specify --end only and the timespan will run from the beginning of time up to your end date, inclusive
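
The inclusive and open-ended rules above can be sketched as a small predicate. This only illustrates the semantics described here and is not EmrEtlRunner's own filtering code: in_timespan and ymd are hypothetical names, and treating "today" as today's UTC date is an assumption.

```shell
# Succeed if DAY (YYYY-MM-DD) lies inside the inclusive timespan START..END.
# An empty START means "from the beginning of time";
# an empty END means "up to today" (assumed here to be today's date in UTC).
in_timespan() (
  ymd() { printf '%s' "$1" | tr -d '-'; }   # 2012-06-20 -> 20120620
  day=$(ymd "$1")
  start=$2
  end=${3:-$(date -u +%Y-%m-%d)}
  # Reject days before the start date, if one was given.
  if [ -n "$start" ] && [ "$day" -lt "$(ymd "$start")" ]; then
    exit 1
  fi
  # Accept days up to and including the end date.
  [ "$day" -le "$(ymd "$end")" ]
)

# Examples:
#   in_timespan 2012-06-24 2012-06-20 2012-06-24   succeeds (end date is inclusive)
#   in_timespan 2012-06-25 2012-06-20 2012-06-24   fails
#   in_timespan 2001-01-01 "" 2012-06-24           succeeds (no start date)
```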

If your raw Snowplow logs are generated by the Amazon CloudFront collector, please note that CloudFront records its timestamps in UTC.
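
Because of this, a --start or --end date derived from your machine's local clock can be a day out around midnight. One way to avoid that, sketched here with a hypothetical helper named utc_today, is to ask date for the UTC date explicitly:

```shell
# Print today's date in UTC, in the YYYY-MM-DD form used by --start and --end.
utc_today() {
  date -u +%Y-%m-%d
}

# Example: process only logs timestamped today (UTC):
#   bundle exec bin/snowplow-emr-etl-runner --config config/config.yml --start "$(utc_today)"
```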

## 4. Checking the results

Once you have run EmrEtlRunner, you should be able to manually inspect in S3 the folder specified in the :out: parameter of your config.yml file and see newly generated files. These contain the cleaned data, ready either for uploading into a storage target (e.g. Redshift or Infobright) or for analysing directly on EMR using Hive (or Pig, Mahout or some other Hadoop querying tool).

Note: most Snowplow users run the 'hadoop' version of the ETL process, in which case the data generated is saved into subfolders with names of the form part-000.... If, however, you are running the legacy 'hive' ETL (because e.g. you want to use Hive or Infobright as your storage target, rather than Redshift, which is the only storage target the 'hadoop' ETL currently supports), the subfolder names will be of the format dt=....
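
If you are unsure which ETL version produced a given output folder, the naming convention above can be checked mechanically. This is only a sketch: etl_flavour_of is a hypothetical helper, and it relies on nothing more than the two name prefixes described in the note.

```shell
# Guess which ETL version wrote an output subfolder, based only on its name.
etl_flavour_of() {
  case "$1" in
    part-*) echo hadoop ;;    # 'hadoop' ETL output
    dt=*)   echo hive ;;      # legacy 'hive' ETL output
    *)      echo unknown ;;   # anything else is not covered by the convention
  esac
}
```

Pass it a subfolder name copied from your S3 listing; an answer of unknown suggests you are looking at something other than the ETL output.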

## 5. Next steps

Comfortable using EmrEtlRunner? Then schedule it so that it regularly takes new data generated by the collector, processes it, cleans it, enriches it, and writes it back to S3.
