Data pipelines (scheduled runs)
An AWS Data Pipeline allows us to orchestrate the `dump` and `load` steps of a full ETL; or the `dump` and `update` steps of a more selective ETL; or the `validate` and `load --skip-copy` steps of a validation ETL. The goal here is to merge information specific to the ETL environment (like the VPC, its subnets, and security groups), the timing of the ETL (like the start time), and the steps for creating an ETL cluster or for running `arthur.py` commands.
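For orientation, here is what those command pairings look like when run by hand. This is only a minimal sketch; real invocations need additional options (such as selecting the target environment), which are omitted here.

```bash
# Full ETL: dump all the data, then load everything into the data warehouse.
arthur.py dump
arthur.py load

# More selective ETL: dump, then update instead of a full load.
arthur.py dump
arthur.py update

# Validation ETL: validate the table designs, then run the load steps
# without actually copying data.
arthur.py validate
arthur.py load --skip-copy
```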
Let's say we want to coordinate an ETL like this:
Then we need a definition for the AWS Data Pipeline that describes
- the schedule,
- the EMR cluster (including security groups, instance types and counts, and applications),
- the EC2 instance (including security groups and instance type),
- the bootstrap action(s) for the EC2 instance,
- the `dump` command for Arthur (configured for the environment),
- the `load` command for Arthur (again, configured for the environment),
- and we should probably add some alarms (via SNS) to let us know of the success or failure of the ETL.
That's a lot of stuff and so we have a command for that:

```bash
arthur.py install_pipeline <options>
```
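For context, installing a pipeline by hand would roughly amount to the following AWS CLI steps. This is only a sketch of the underlying AWS Data Pipeline workflow, not the actual implementation of `install_pipeline`; the pipeline name, the id `df-EXAMPLE1234567`, and the file `pipeline_definition.json` are hypothetical placeholders.

```bash
# Register a new (empty) pipeline and note the pipeline id that is returned.
aws datapipeline create-pipeline --name "Nightly ETL" --unique-id nightly-etl

# Upload the pipeline definition: schedule, EMR cluster, EC2 instance,
# bootstrap action(s), the arthur.py activities, and SNS alarms.
aws datapipeline put-pipeline-definition \
    --pipeline-id df-EXAMPLE1234567 \
    --pipeline-definition file://pipeline_definition.json

# Activate the pipeline so that it starts running on its schedule.
aws datapipeline activate-pipeline --pipeline-id df-EXAMPLE1234567
```

The `install_pipeline` command takes care of generating and registering such a definition with the environment-specific settings filled in.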
There are four possible "production" pipelines:

| Schedule? | Secondary? | Notes |
| --- | --- | --- |
| Nightly | Yes | Repeated full ETL to load into production and also into a development environment |
| Nightly | No | Repeated full ETL to load into production but no other environment (here) |
| One-off | Yes | Run ETL just once to load production and a development environment |
| One-off | No | Full ETL to load into production that runs just once |
Each of these runs will:
- Dump all data
- Load all data into the primary environment
- Load all data into the secondary environment (if chosen)
- Contact a dead man's switch upon bootstrap and upon completion of the ETL (based on configuration; see the sketch below)
- Send a message to SNS on success or failure (based on configuration)
- Repeat the above every 24 hours if it's not a one-off ETL
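The dead man's switch and the SNS notification both come from the configuration. As a rough illustration, with a made-up monitoring URL and SNS topic ARN, that side of a run boils down to something like this:

```bash
# Ping the dead man's switch (hypothetical monitoring URL) at bootstrap and
# again when the ETL completes, so that a missed run raises an alert.
curl -fsS https://monitoring.example.com/ping/etl-nightly

# Publish the success or failure message to the configured SNS topic
# (hypothetical topic ARN).
aws sns publish \
    --topic-arn arn:aws:sns:us-east-1:123456789012:etl-status \
    --subject "Nightly ETL finished" \
    --message "The ETL completed successfully"
```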
The purpose of a validation run is to
- detect changes in upstream schemas that might make a `dump` fail (since the download query will error out),
- detect changes in table designs that will not `load` successfully (e.g. bad dependencies, bad SQL); see the example below.
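Concretely, a validation run is just the pair of commands from the introduction, again with any environment-selecting options omitted:

```bash
# Validation ETL: check that upstream schemas still match our download queries
# and that the table designs would load cleanly, without copying any data.
arthur.py validate
arthur.py load --skip-copy
```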