Spike: Determine path forward for ETL Data Checks #3779
Comments
Testing against sample inputs
What it is
Checking committed snapshots stored in the repo against the output of the ETL step
Pros:
Cons:
Other notes
Runtime checks at contract between extract and load steps (GraphQL EQL return and db input)
Assumptions
This would run as a check in the AWS container
Pros
Cons
Post-Load data quality dashboards in Metabase
Pros
Cons
Snapshot with PL/pgSQL to validate end to end
What is it?
We could reuse the first approach, using snapshots to validate the output of the GraphQL ETL pipeline, and then write a PostgreSQL procedural language (PL/pgSQL) query that loops through the tables we care about and compares the data in the database against the file snapshot.
Pros
Cons
Tagging @DavidDudas-Intuitial and @widal001 to get their thoughts.
An additional update: I started experimenting with the first option, and writing to the file system should not be an issue. I think that is the way we should proceed unless others see something I have missed.
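For illustration, the first option (comparing ETL output against a committed snapshot file) could look roughly like the sketch below. This is a minimal sketch, not the project's actual code: `transform`, `check_snapshot`, the record fields, and the snapshot path are all hypothetical names chosen for the example.

```python
import json
from pathlib import Path


def transform(raw_issues):
    """Hypothetical transform step: flatten raw issue records into rows."""
    return [
        {
            "issue_id": item["id"],
            "title": item["title"].strip(),
            "status": item.get("status", "unknown"),
        }
        for item in raw_issues
    ]


def check_snapshot(actual, snapshot_path, update=False):
    """Compare ETL output against a snapshot file stored in the repo.

    Writes the snapshot when it does not exist yet (or when update=True)
    and returns True; otherwise returns whether the output matches it.
    """
    snapshot_path = Path(snapshot_path)
    # Serialize with sorted keys so key order never causes a false mismatch.
    serialized = json.dumps(actual, indent=2, sort_keys=True)
    if update or not snapshot_path.exists():
        snapshot_path.parent.mkdir(parents=True, exist_ok=True)
        snapshot_path.write_text(serialized + "\n")
        return True
    expected = json.loads(snapshot_path.read_text())
    return expected == json.loads(serialized)
```

In a test, the sample input would be run through the transform step and the result passed to `check_snapshot`; a `False` return means the output drifted from the committed snapshot and the change needs review.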
Now: We'll start by using sample data to implement snapshot testing for the ETL pipeline. The goal is to catch bugs before they are merged into main.
Next: The ideal next step would be to set up alerts in Metabase to catch downstream data quality issues that snapshot testing misses, for example changes to the underlying data coming out of GitHub. The goal is to help us catch issues that appear in dev or staging. This work will also require establishing an alerting or monitoring process.
Later: Explore options for other runtime checks that halt ETL execution, focusing on the contract between the extract step and the transform and load steps.
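As a rough idea of what a runtime check on the extract/load contract might look like, the sketch below validates extracted records before they are handed to the load step and raises to halt the run on any violation. The field names, the `REQUIRED_FIELDS` mapping, and `ContractError` are assumptions made for the example, not the pipeline's real schema.

```python
class ContractError(Exception):
    """Raised to halt the ETL run when extracted data breaks the contract."""


# Hypothetical contract: field name -> expected Python type.
REQUIRED_FIELDS = {"issue_id": int, "title": str, "status": str}


def validate_records(records):
    """Return the records unchanged if all satisfy the contract, else raise."""
    errors = []
    for index, record in enumerate(records):
        for field, expected_type in REQUIRED_FIELDS.items():
            if field not in record:
                errors.append(f"record {index}: missing field '{field}'")
            elif not isinstance(record[field], expected_type):
                errors.append(
                    f"record {index}: field '{field}' should be "
                    f"{expected_type.__name__}, got {type(record[field]).__name__}"
                )
    if errors:
        # Report every violation at once so one failed run surfaces them all.
        raise ContractError("; ".join(errors))
    return records
```

Calling `validate_records` between the extract step and the load step would stop bad data before it reaches the database, at the cost of failing the whole run when upstream data changes shape.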
Nice work @jcrichlake! Next steps captured here:
Summary
We currently don't have data checks in the ETL pipeline to test our changes against real data. We need to explore the following options and determine which of them should be implemented.
Acceptance criteria