

The Snowplow Enrichment step takes the raw log files generated by the Snowplow collectors, then cleans and enriches the data so that it is:

  1. Ready in S3 to be analysed using EMR
  2. Ready to be loaded into Amazon Redshift, PostgreSQL or another storage target for analysis

The Enrichment process is written using Scalding, a Scala API for Cascading, an ETL framework that runs on top of Hadoop. Historically, we have referred to the Enrichment process as the Hadoop ETL, to distinguish it from the Hive-based ETL that preceded it. The Hive ETL has since been deprecated.
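To make that shape concrete, below is a minimal Scalding job sketch. This is not Snowplow's actual Enrichment code: the job name, field names and the `cleanLine` helper are all hypothetical, and the real Enrichment process does far more than trim lines. It simply illustrates the read/transform/write pattern that a Cascading/Hadoop job follows.

```scala
import com.twitter.scalding._

// A minimal Scalding job sketch (not Snowplow's actual code).
// It reads raw log lines, applies a trivial clean-up step and
// writes tab-separated output: the same read/transform/write
// shape that the Enrichment process follows on Hadoop.
class EnrichmentSketchJob(args: Args) extends Job(args) {

  // Hypothetical clean-up: trim whitespace and drop blank lines.
  def cleanLine(raw: String): List[String] = {
    val trimmed = raw.trim
    if (trimmed.isEmpty) Nil else List(trimmed)
  }

  TextLine(args("input"))                         // raw collector logs, e.g. in S3
    .flatMapTo('line -> 'event) { raw: String => cleanLine(raw) }
    .write(Tsv(args("output")))                   // TSV rows, ready to load
}
```

A job like this is typically packaged into a jar and launched on the cluster via Scalding's `com.twitter.scalding.Tool`, with `--input` and `--output` supplied as command-line arguments.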

Snowplow uses Amazon Elastic MapReduce (EMR) to run the Enrichment process. The regular running of the process (necessary to ensure that up-to-date Snowplow data is available for analysis) is managed by EmrEtlRunner, a Ruby application.
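As an illustration of that regular running, the sketch below shows a nightly crontab entry invoking EmrEtlRunner. The paths are placeholders and the exact invocation varies between EmrEtlRunner versions; `--config` points EmrEtlRunner at its YAML configuration file.

```
# Hypothetical crontab entry: kick off the Enrichment process at 04:00 each night.
# Paths are placeholders; adjust to wherever EmrEtlRunner is installed.
0 4 * * * /opt/snowplow/emr-etl-runner/bin/snowplow-emr-etl-runner --config /opt/snowplow/emr-etl-runner/config/config.yml
```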

In this guide, we cover:

  1. How EmrEtlRunner orchestrates the regular running of the Enrichment process
  2. [The Enrichment Process itself][The-enrichment-process]