
4 Data pipelines (scheduled runs)


Overview

An AWS Data Pipeline lets us orchestrate the dump and load steps of a full ETL, the dump and update steps of a more selective ETL, or the validate and load --skip-copy steps of a validation ETL. The goal here is to merge information specific to the ETL environment (like the VPC, its subnets, and security groups), the timing of the ETL (like the start time), and the steps for creating an ETL cluster or for running arthur.py commands.
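For orientation, these are the step combinations such a pipeline stitches together, shown here as if run by hand. This is only a rough sketch; the exact options for each command depend on your environment and configuration.

```
# Full ETL: dump from upstream sources, then load everything
arthur.py dump
arthur.py load

# More selective ETL: dump, then update instead of a full load
arthur.py dump
arthur.py update

# Validation ETL: validate designs, then load while skipping the data copy
arthur.py validate
arthur.py load --skip-copy
```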

Let's say we want to coordinate an ETL like this:

ETL Pipeline diagram

Then we need a definition for the AWS Data Pipeline that describes

  • the schedule,
  • the EMR cluster (including security groups, instance types and counts, and applications),
  • the EC2 instance (including security groups and instance type),
  • the bootstrap action(s) for the EC2 instance,
  • the dump command for Arthur (configured for the environment),
  • the load command for Arthur (again, configured for the environment),
  • and we should probably add some alarms (via SNS) to let us know of the ETL's success or failure.

That's a lot of stuff, so we have a command for that.

arthur.py install_pipeline <options>
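The install_pipeline command creates the pipeline in your AWS account from the pieces listed above. If you want to double-check what was installed, or activate a pipeline that is not running yet, the regular AWS CLI works as expected. The pipeline ID below is a placeholder, and whether activation happens automatically depends on the options you chose.

```
# Find the pipeline that was just installed
aws datapipeline list-pipelines

# Inspect its definition and status (IDs are placeholders)
aws datapipeline get-pipeline-definition --pipeline-id df-0123456789EXAMPLE
aws datapipeline describe-pipelines --pipeline-ids df-0123456789EXAMPLE

# Activate it if it is not scheduled to run yet
aws datapipeline activate-pipeline --pipeline-id df-0123456789EXAMPLE
```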

Production pipeline

There are four possible "production" pipelines:

| Schedule? | Secondary? | Notes |
|-----------|------------|-------|
| Nightly | Yes | Repeated full ETL to load into production and also into a development environment |
| Nightly | No | Repeated full ETL to load into production but no other environment (here) |
| One-off | Yes | Run ETL just once to load production and a development environment |
| One-off | No | Full ETL to load into production that runs just once |

Each of these runs will

  • Dump all data
  • Load all data into primary environment
  • Load all data into secondary environment (if chosen)
  • Contact a dead man's switch at bootstrap and upon completion of the ETL (based on configuration)
  • Send a message to SNS on success or failure (based on configuration; see the sketch after this list)
  • Repeat the above every 24 hours if it's not a one-off ETL
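To actually see those SNS messages, someone needs to be subscribed to the topic that the pipeline publishes to. As a minimal sketch with the AWS CLI (the topic ARN and e-mail address are placeholders; your configuration determines the real topic):

```
# Subscribe an e-mail address to the (placeholder) ETL notification topic
aws sns subscribe \
    --topic-arn arn:aws:sns:us-east-1:123456789012:etl-notifications \
    --protocol email \
    --notification-endpoint data-team@example.com
```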

Validation

The purpose of a validation run is to

  • detect changes in upstream schemas that would make a dump fail (because the download query would error out)
  • detect changes in table designs that would fail to load (e.g. bad dependencies, bad SQL)