
Spike: Determine path forward for ETL Data Checks #3779

Closed · 2 tasks done · jcrichlake opened this issue Feb 5, 2025 · 4 comments

jcrichlake (Collaborator) commented Feb 5, 2025

Summary

We currently don't have data checks in the ETL pipeline to test our changes against real data. We need to explore the following options and determine which of them should be implemented:

  • Testing against sample inputs
  • Run-time checks at contract between extract and load steps
  • Post-load data quality dashboards in Metabase
  • Snapshot and PL/pgSQL comparison?

Acceptance criteria

  • The options are explored and a path forward has been reviewed and approved by the team.
  • The card for Implement First Data Quality Check #3780 is updated with the specific implementation details that were decided on.
jcrichlake (Collaborator, Author) commented

Testing against sample inputs

What it is

Checking committed snapshots stored in the repo against the output of the ETL step

  • Test should use a specific date to prevent snapshots from constantly changing

Pros:

  • Allows for fine grained testing
  • Can be done as part of committing the code so errors shouldn't make it past local development. This minimizes the blast radius of bugs
  • Snapshots seem fairly simple to implement
  • There are libraries that facilitate this

Cons:

  • Wouldn't validate that the database code is working, leaving a gap in testing the pipeline
  • Relies upon writes to the local file system, which may be affected by future work if we change how data is passed between steps in the pipeline

Other notes

  • I would propose we put this check in one place in the code for the initial story, most likely directly after we've called the ETL GraphQL function and the files have been output to the local file system.
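As a rough sketch of what such a test could look like, assuming a pytest setup; the `run_extract` entry point, file names, and paths are hypothetical placeholders, not the project's actual API:

```python
# Hypothetical snapshot test; run_extract and the file paths are placeholders.
import json
from pathlib import Path

from etl.extract import run_extract  # hypothetical entry point

SNAPSHOT_DIR = Path("tests/snapshots")


def test_extract_matches_snapshot(tmp_path):
    # Pin the extraction date so the snapshot doesn't drift as new data lands.
    output_file = run_extract(as_of="2025-02-01", out_dir=tmp_path)

    actual = json.loads(Path(output_file).read_text())
    expected = json.loads((SNAPSHOT_DIR / "extract_2025-02-01.json").read_text())

    # A straight equality check keeps the test fine grained: any field-level
    # change in the ETL output shows up as a diff against the committed file.
    assert actual == expected
```

Libraries such as syrupy or pytest-snapshot can manage the committed snapshot files (including regenerating them on intentional changes), so this wouldn't have to be hand rolled.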

Runtime checks at the contract between the extract and load steps (GraphQL ETL return and DB input)

Assumptions

This would run as a check in the AWS container.

Pros

  • The check itself shouldn't be terribly difficult to implement, though not as easy as the snapshot approach

Cons

  • Wouldn't catch issues with changes to the SQL code
  • Would have to run in non-local environments
  • Would need alerts to be set up
  • Would only catch bugs once they are committed
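As an illustration of the kind of check this could be, the sketch below validates each extracted record against the shape the load step expects and fails fast on a violation; the field names and types are placeholders, not the real GraphQL ETL schema:

```python
# Illustrative runtime contract check between the extract and load steps.
# Field names and types are placeholders for the real ETL output schema.
REQUIRED_FIELDS = {
    "issue_id": int,
    "title": str,
    "closed_at": (str, type(None)),  # nullable field
}


class ContractError(Exception):
    """Raised when extracted data violates the extract -> load contract."""


def check_contract(records: list[dict]) -> None:
    for i, record in enumerate(records):
        for field, expected_type in REQUIRED_FIELDS.items():
            if field not in record:
                raise ContractError(f"record {i} is missing field {field!r}")
            if not isinstance(record[field], expected_type):
                raise ContractError(
                    f"record {i} field {field!r}: expected {expected_type}, "
                    f"got {type(record[field]).__name__}"
                )
```

Failing fast here would stop bad data before the load step; in the AWS container the failure would surface as a failed task, which is why the alerting called out above matters.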

Post-Load data quality dashboards in Metabase

Pros

  • Visualization may make errors easier to spot
  • Because bugs surface after being committed, errors could 'bubble up' and get more eyes on them

Cons

  • Errors have to be committed to be discovered using this method
    • Could cause delays for other devs if they are working in adjacent areas
    • Makes bugs more visible
    • Riskier since we don't deploy on an as-needed basis
  • Would need a process for checking the Metabase dashboard at an established frequency for any errors
    • Would also need to establish a rotation of people to check this so that errors aren't sitting on the dashboard undiscovered
    • Multiple changes could get merged into the repo before an error is discovered, making it harder to know which change caused the error and harder to roll back

Snapshot with PL/pgSQL to validate end to end

What is it?

We could leverage the first approach of using snapshots to validate the GraphQL ETL pipeline output, and then build a Postgres procedural language (PL/pgSQL) query that loops through the tables we are concerned about and compares the data in the database against what is in the file snapshot.
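The proposal above is a PL/pgSQL procedure; the same loop could also be driven from Python, as in the rough sketch below. The table names, snapshot layout, and use of psycopg are assumptions for illustration:

```python
# Sketch of the end-to-end validation: compare rows loaded into Postgres
# against the committed snapshot files. Table names and snapshot layout
# are hypothetical; a PL/pgSQL procedure would follow the same loop.
import json
from pathlib import Path

import psycopg  # assumes psycopg 3 is available

TABLES = ["issues", "sprints"]  # hypothetical tables of concern


def validate_tables(conn_str: str, snapshot_dir: Path) -> None:
    with psycopg.connect(conn_str) as conn:
        for table in TABLES:
            expected = json.loads((snapshot_dir / f"{table}.json").read_text())
            with conn.cursor() as cur:
                # Table names come from the fixed allowlist above, so the
                # f-string interpolation is safe here.
                cur.execute(f"SELECT COUNT(*) FROM {table}")
                (row_count,) = cur.fetchone()
            if row_count != len(expected):
                raise AssertionError(
                    f"{table}: {row_count} rows in DB, "
                    f"{len(expected)} rows in snapshot"
                )
```

A row-count comparison is just the cheapest starting point; the same loop could diff column values or per-table checksums once the snapshot format is settled.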

Pros

  • The only approach that would validate the database code is working properly
  • Would ensure our pipeline is working end to end

Cons

  • By far the most effort
  • May not be worth it depending on future work

Tagging @DavidDudas-Intuitial and @widal001 to get their thoughts.

jcrichlake (Collaborator, Author) commented

An additional update: I started experimenting with the first option, and writing to the file system should not be an issue. So I think that is the way we should proceed, unless others have insights into something I have missed.

jcrichlake (Collaborator, Author) commented

Now: We'll start by using sample data to implement snapshot testing for the ETL pipeline. The goal here is to catch bugs before they are merged into main.

Next: The next ideal step would be to set up alerts in Metabase to catch downstream data quality issues that aren't caught by snapshot testing, for example changes to the underlying data coming out of GitHub. The goal is to help us catch issues that appear in dev or staging. This work will also require establishing an alerting or monitoring process.

Later: Explore options for other runtime checks that halt ETL execution, focusing on the contract between the extract step and the transform and load steps.
