
4 Data pipelines (scheduled runs)


Overview

An AWS Data Pipeline lets us orchestrate the dump and load steps of a full ETL, the dump and update steps of a more selective ETL, or the validate and load --skip-copy steps of a validation ETL. The goal here is to merge information specific to the ETL environment (like the VPC, its subnets, and security groups), the timing of the ETL (like the start time), and the steps for creating an ETL cluster or for running arthur.py commands.
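For orientation, these are the step combinations such a pipeline stitches together, shown here as if run by hand. This is only a rough sketch; the exact options for each command depend on your environment and configuration.

```
# Full ETL: dump from upstream sources, then load everything
arthur.py dump
arthur.py load

# More selective ETL: dump, then update instead of a full load
arthur.py dump
arthur.py update

# Validation ETL: validate designs, then load while skipping the data copy
arthur.py validate
arthur.py load --skip-copy
```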

Let's say we want to coordinate an ETL like this:

ETL Pipeline diagram

Then we need a definition for the AWS Data Pipeline that describes

  • the schedule,
  • the EMR cluster (including security groups, instance types and counts, and applications),
  • the EC2 instance (including security groups and instance type),
  • the bootstrap action(s) for the EC2 instance,
  • the dump command for Arthur (configured for the environment),
  • the load command for Arthur (again, configured for the environment),
  • and we should probably add some alarms (via SNS) to let us know of the ETL's success or failure.

That's a lot of stuff, so we have a command for that.

arthur.py install_pipeline <options>
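The install_pipeline command creates the pipeline in your AWS account from the pieces listed above. If you want to double-check what was installed, or activate a pipeline that is not running yet, the regular AWS CLI works as expected. The pipeline ID below is a placeholder, and whether activation happens automatically depends on the options you chose.

```
# Find the pipeline that was just installed
aws datapipeline list-pipelines

# Inspect its definition and status (IDs are placeholders)
aws datapipeline get-pipeline-definition --pipeline-id df-0123456789EXAMPLE
aws datapipeline describe-pipelines --pipeline-ids df-0123456789EXAMPLE

# Activate it if it is not scheduled to run yet
aws datapipeline activate-pipeline --pipeline-id df-0123456789EXAMPLE
```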

Production pipeline

There are four possible "production" pipelines:

| Schedule? | Secondary? | Notes |
|-----------|------------|-------|
| Nightly | Yes | Repeated full ETL to load into production and also into a development environment |
| Nightly | No | Repeated full ETL to load into production but no other environment (here) |
| One-off | Yes | Run ETL just once to load production and a development environment |
| One-off | No | Full ETL to load into production that runs just once |

Each of these runs will

  • Dump all data
  • Load all data into primary environment
  • Load all data into secondary environment (if chosen)
  • Contact a dead man's switch at bootstrap and upon completion of the ETL (based on configuration)
  • Send a message to SNS on success or failure (based on configuration; see the sketch after this list)
  • Repeat the above every 24 hours if it's not a one-off ETL
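To actually see those SNS messages, someone needs to be subscribed to the topic that the pipeline publishes to. As a minimal sketch with the AWS CLI (the topic ARN and e-mail address are placeholders; your configuration determines the real topic):

```
# Subscribe an e-mail address to the (placeholder) ETL notification topic
aws sns subscribe \
    --topic-arn arn:aws:sns:us-east-1:123456789012:etl-notifications \
    --protocol email \
    --notification-endpoint data-team@example.com
```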

Validation

The purpose of a validation run is to

  • detect changes in upstream schemas that would make a dump fail (because the download query would error out)
  • detect changes in table designs that would fail to load (e.g. bad dependencies, bad SQL)