

The Snowplow Enrichment step takes the raw log files generated by the Snowplow collectors, then cleans and enriches the data so that it is:

  1. Ready in S3 to be analysed using EMR
  2. Ready to be loaded into Amazon Redshift, PostgreSQL or another storage target for analysis

The Enrichment process is written using Scalding, a Scala API for Cascading, an ETL framework that runs on top of Hadoop. Historically, we have referred to the Enrichment process as the Hadoop ETL, to distinguish it from the Hive-based ETL that preceded it. The Hive ETL has since been deprecated.
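To make that shape concrete, below is a minimal Scalding job sketch. This is not Snowplow's actual Enrichment code: the job name, field names and the `cleanLine` helper are all hypothetical, and the real Enrichment process does far more than trim lines. It simply illustrates the read/transform/write pattern that a Cascading/Hadoop job follows.

```scala
import com.twitter.scalding._

// A minimal Scalding job sketch (not Snowplow's actual code).
// It reads raw log lines, applies a trivial clean-up step and
// writes tab-separated output: the same read/transform/write
// shape that the Enrichment process follows on Hadoop.
class EnrichmentSketchJob(args: Args) extends Job(args) {

  // Hypothetical clean-up: trim whitespace and drop blank lines.
  def cleanLine(raw: String): List[String] = {
    val trimmed = raw.trim
    if (trimmed.isEmpty) Nil else List(trimmed)
  }

  TextLine(args("input"))                         // raw collector logs, e.g. in S3
    .flatMapTo('line -> 'event) { raw: String => cleanLine(raw) }
    .write(Tsv(args("output")))                   // TSV rows, ready to load
}
```

A job like this is typically packaged into a jar and launched on the cluster via Scalding's `com.twitter.scalding.Tool`, with `--input` and `--output` supplied as command-line arguments.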

Snowplow uses Amazon Elastic MapReduce (EMR) to run the Enrichment process. The regular running of the process (necessary to ensure that up-to-date Snowplow data is available for analysis) is managed by EmrEtlRunner, a Ruby application.
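As an illustration of that regular running, the sketch below shows a nightly crontab entry invoking EmrEtlRunner. The paths are placeholders and the exact invocation varies between EmrEtlRunner versions; `--config` points EmrEtlRunner at its YAML configuration file.

```
# Hypothetical crontab entry: kick off the Enrichment process at 04:00 each night.
# Paths are placeholders; adjust to wherever EmrEtlRunner is installed.
0 4 * * * /opt/snowplow/emr-etl-runner/bin/snowplow-emr-etl-runner --config /opt/snowplow/emr-etl-runner/config/config.yml
```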

In this guide, we cover:

  1. How EmrEtlRunner orchestrates the regular running of the Enrichment process
  2. [The Enrichment Process itself][The-enrichment-process]