
Spike: Determine path forward for ETL Data Checks #3779

Closed · 2 tasks done · jcrichlake opened this issue Feb 5, 2025 · 4 comments

jcrichlake (Collaborator) commented Feb 5, 2025

Summary

We currently don't have data checks in the ETL pipeline to test our changes against real data. We need to explore the following options and determine which of them should be implemented:

  • Testing against sample inputs
  • Run-time checks at contract between extract and load steps
  • Post-load data quality dashboards in Metabase
  • Snapshot and PL/pgSQL comparison?

Acceptance criteria

  • The options are explored and a path forward has been reviewed and approved by the team.
  • The card for Implement First Data Quality Check #3780 is updated with the specific implementation details that were decided on.
jcrichlake (Collaborator, Author) commented

Testing against sample inputs

What it is

Checking committed snapshots stored in the repo against the output of the ETL step

  • Test should use a specific date to prevent snapshots from constantly changing

Pros:

  • Allows for fine grained testing
  • Can be done as part of committing the code so errors shouldn't make it past local development. This minimizes the blast radius of bugs
  • Snapshots seem fairly simple to implement
  • There are libraries that facilitate this

Cons:

  • Wouldn't validate that the database code is working, leaving a gap in testing the pipeline
  • Relies upon writes to the local file system, which may be affected by future work if we change how data is passed between steps in the pipeline

Other notes

  • I would propose we put this check in one place in the code for the initial story, most likely directly after we've called the ETL GraphQL function and the files have been output to the local file system.
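As a rough sketch of what such a test could look like, assuming a pytest setup; the `run_extract` entry point, file names, and paths are hypothetical placeholders, not the project's actual API:

```python
# Hypothetical snapshot test; run_extract and the file paths are placeholders.
import json
from pathlib import Path

from etl.extract import run_extract  # hypothetical entry point

SNAPSHOT_DIR = Path("tests/snapshots")


def test_extract_matches_snapshot(tmp_path):
    # Pin the extraction date so the snapshot doesn't drift as new data lands.
    output_file = run_extract(as_of="2025-02-01", out_dir=tmp_path)

    actual = json.loads(Path(output_file).read_text())
    expected = json.loads((SNAPSHOT_DIR / "extract_2025-02-01.json").read_text())

    # A straight equality check keeps the test fine grained: any field-level
    # change in the ETL output shows up as a diff against the committed file.
    assert actual == expected
```

Libraries such as syrupy or pytest-snapshot can manage the committed snapshot files (including regenerating them on intentional changes), so this wouldn't have to be hand rolled.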

Runtime checks at the contract between the extract and load steps (GraphQL ETL return and DB input)

Assumptions

This would run as a check in the AWS container.

Pros

  • The check itself shouldn't be terribly difficult to implement, though not as easy as the snapshot approach

Cons

  • Wouldn't catch issues with changes to the SQL code
  • Would have to run in non-local environments
  • Would need alerts to be set up
  • Would only catch bugs once they are committed
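As an illustration of the kind of check this could be, the sketch below validates each extracted record against the shape the load step expects and fails fast on a violation; the field names and types are placeholders, not the real GraphQL ETL schema:

```python
# Illustrative runtime contract check between the extract and load steps.
# Field names and types are placeholders for the real ETL output schema.
REQUIRED_FIELDS = {
    "issue_id": int,
    "title": str,
    "closed_at": (str, type(None)),  # nullable field
}


class ContractError(Exception):
    """Raised when extracted data violates the extract -> load contract."""


def check_contract(records: list[dict]) -> None:
    for i, record in enumerate(records):
        for field, expected_type in REQUIRED_FIELDS.items():
            if field not in record:
                raise ContractError(f"record {i} is missing field {field!r}")
            if not isinstance(record[field], expected_type):
                raise ContractError(
                    f"record {i} field {field!r}: expected {expected_type}, "
                    f"got {type(record[field]).__name__}"
                )
```

Failing fast here would stop bad data before the load step; in the AWS container the failure would surface as a failed task, which is why the alerting called out above matters.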

Post-Load data quality dashboards in Metabase

Pros

  • Visualization may make errors easier to spot
  • Because bugs surface after being committed, errors could 'bubble up' and get more eyes on them

Cons

  • Errors have to be committed to be discovered using this method
    • Could cause delays for other devs if they are working in adjacent areas
    • Makes bugs more visible
    • Riskier since we don't deploy on an as-needed basis
  • Would need a process for checking the Metabase dashboard at an established frequency for any errors
    • Would also need to establish a rotation of people to check this so that errors aren't sitting on the dashboard undiscovered
    • Multiple changes could get merged into the repo before an error is discovered, making it harder to know which change caused the error and harder to roll back

Snapshot with PL/pgSQL to validate end to end

What is it?

We could leverage the first approach of using snapshots to validate the GraphQL ETL pipeline output, and then build a Postgres procedural language (PL/pgSQL) query that loops through the tables we are concerned about and compares the data in the database against what is in the file snapshot.
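The proposal above is a PL/pgSQL procedure; the same loop could also be driven from Python, as in the rough sketch below. The table names, snapshot layout, and use of psycopg are assumptions for illustration:

```python
# Sketch of the end-to-end validation: compare rows loaded into Postgres
# against the committed snapshot files. Table names and snapshot layout
# are hypothetical; a PL/pgSQL procedure would follow the same loop.
import json
from pathlib import Path

import psycopg  # assumes psycopg 3 is available

TABLES = ["issues", "sprints"]  # hypothetical tables of concern


def validate_tables(conn_str: str, snapshot_dir: Path) -> None:
    with psycopg.connect(conn_str) as conn:
        for table in TABLES:
            expected = json.loads((snapshot_dir / f"{table}.json").read_text())
            with conn.cursor() as cur:
                # Table names come from the fixed allowlist above, so the
                # f-string interpolation is safe here.
                cur.execute(f"SELECT COUNT(*) FROM {table}")
                (row_count,) = cur.fetchone()
            if row_count != len(expected):
                raise AssertionError(
                    f"{table}: {row_count} rows in DB, "
                    f"{len(expected)} rows in snapshot"
                )
```

A row-count comparison is just the cheapest starting point; the same loop could diff column values or per-table checksums once the snapshot format is settled.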

Pros

  • The only approach that would validate the database code is working properly
  • Would ensure our pipeline is working end to end

Cons

  • By far the most effort
  • May not be worth it depending on future work

Tagging @DavidDudas-Intuitial and @widal001 to get their thoughts.

jcrichlake (Collaborator, Author) commented

An additional update: I started experimenting with the first option, and writing to the file system should not be an issue. So I think that is the way we should proceed, unless others have insights into something I have missed.

jcrichlake (Collaborator, Author) commented

Now: We'll start by using sample data to implement snapshot testing for the ETL pipeline. The goal here is to catch bugs before they are merged into main.

Next: The next ideal step would be to set up alerts in Metabase to catch downstream data quality issues that aren't caught by snapshot testing, for example changes to the underlying data coming out of GitHub. The goal is to help us catch issues that appear in dev or staging. This work will also require establishing an alerting or monitoring process.

Later: Explore options for other runtime checks that halt ETL execution, focusing on the contract between the extract step and the transform and load steps.
