
Data Quality - Setup Metabase monitoring for data quality metrics #3947

Closed · 3 tasks done
DavidDudas-Intuitial opened this issue Feb 19, 2025 · 10 comments

DavidDudas-Intuitial (Collaborator) commented Feb 19, 2025

Summary

Following work in #3945, in which a baseline value will be established for a specific data quality metric (e.g., the % of null values in a given column), set up and activate a Metabase monitor that will trigger when the data quality metric deviates from its baseline beyond tolerance.

Reference: https://www.metabase.com/docs/latest/questions/alerts#types-of-alerts
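
For illustration, a null-rate metric of the kind mentioned above can be expressed as a plain SQL question in Metabase. This is a minimal sketch only, assuming a Postgres backend and a hypothetical issue table with a nullable points column; the actual metric and table from #3945 may differ.

```sql
-- Sketch: % of null values in one column, suitable as a Metabase question.
-- Table name (issue) and column name (points) are hypothetical.
SELECT
  100.0 * COUNT(*) FILTER (WHERE points IS NULL)
        / NULLIF(COUNT(*), 0) AS pct_null_points
FROM issue;
```

A question like this can then carry an alert that fires when the value crosses the tolerance threshold.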

Acceptance criteria

  • At least one data quality metric is identified
  • A baseline value is established for the data quality metric(s)
  • A Metabase monitoring rule is created to compare the metric's current value against its baseline value
@DavidDudas-Intuitial DavidDudas-Intuitial self-assigned this Feb 19, 2025
@DavidDudas-Intuitial DavidDudas-Intuitial changed the title Data Quality - Setup Metabase monitoringfor data quality metrics Data Quality - Setup Metabase monitoring for data quality metrics Feb 19, 2025
DavidDudas-Intuitial (Collaborator, Author) commented Feb 21, 2025

I created two new questions in Metabase:

From these, we can monitor and alert.

[Screenshots: the two new Metabase questions]

DavidDudas-Intuitial (Collaborator, Author) commented:

Added an alert to the #analytics-alerts channel in Slack.

[Screenshot: Slack alert in #analytics-alerts]

DavidDudas-Intuitial (Collaborator, Author) commented:

I added both new questions as panels to the ETL Metrics dashboard.

When there is a validation failure, the end user can click through from the dashboard or the Slack alert to open the ETL Data Quality Anomaly Detector, which contains details on possible causes of the failure (spoofed in the screenshot below).

[Screenshot: ETL Data Quality Anomaly Detector, with spoofed failure details]

@DavidDudas-Intuitial DavidDudas-Intuitial moved this from In Progress to In Review in Simpler Grants Product & Delivery Feb 22, 2025
@DavidDudas-Intuitial DavidDudas-Intuitial moved this from In Review to In Progress in Simpler Grants Product & Delivery Feb 25, 2025
DavidDudas-Intuitial (Collaborator, Author) commented Feb 25, 2025

UPDATE

Established the following key metrics (and baseline values) to serve as health indicators for the automated ETL workflow (a SQL sketch of one such metric follows the list):

  • Count of deliverable history rows each day
  • Count of issue history rows each day
  • Sum of all issue points each day
  • Count of pointed vs. unpointed issues each day
  • Count of issues by status each day
  • Count of deliverables by status each day
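
As an illustration of how one of these daily metrics might be written (a sketch only; gh_issue_history and d_effective are assumed names, and the real warehouse schema may differ):

```sql
-- Sketch: count of issue history rows per day.
-- Table gh_issue_history and date column d_effective are assumptions.
SELECT
  d_effective AS snapshot_date,
  COUNT(*)    AS issue_history_rows
FROM gh_issue_history
GROUP BY d_effective
ORDER BY d_effective DESC;
```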

Implemented the following six (6) monitors/alerts to detect anomalies in key metrics. Alerts will be triggered when any of the following conditions becomes true (threshold values in parentheses; a sketch of one such check follows the list):

  • Age of current snapshot > max age (24hr)
  • Count of deliverables < previous day value
  • Count of deliverables < baseline value (20)
  • Count of issues < previous day value
  • Count of issues < baseline value (2000)
  • Sum of issue points < baseline value (3500)
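
As noted above, a "less than previous day value" check can be sketched with a window function, reusing the hypothetical gh_issue_history table; Metabase can then alert when a question like this returns any rows:

```sql
-- Sketch: flag any day whose issue count dropped below the previous day's.
WITH daily AS (
  SELECT d_effective AS snapshot_date, COUNT(*) AS issue_count
  FROM gh_issue_history
  GROUP BY d_effective
)
SELECT snapshot_date, issue_count, prev_count
FROM (
  SELECT
    snapshot_date,
    issue_count,
    LAG(issue_count) OVER (ORDER BY snapshot_date) AS prev_count
  FROM daily
) t
WHERE issue_count < prev_count;  -- any row returned signals an anomaly
```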

The monitors are all encapsulated within a wrapper monitor: ETL Data Quality - Anomaly Detector
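
One plausible shape for such a wrapper (a sketch, not necessarily how the production question is built) is a UNION of the individual checks, where each failing condition contributes one labeled row; a single "alert when results are returned" rule then covers every monitor at once:

```sql
-- Sketch: wrapper query returning one row per failed check.
-- Thresholds mirror the baselines above; table/column names are assumptions.
SELECT 'count of issues < baseline (2000)' AS failed_check
FROM gh_issue_history
WHERE d_effective = CURRENT_DATE
HAVING COUNT(*) < 2000
UNION ALL
SELECT 'sum of issue points < baseline (3500)'
FROM gh_issue_history
WHERE d_effective = CURRENT_DATE
HAVING COALESCE(SUM(points), 0) < 3500;
-- ...one SELECT per monitor; an empty result means all checks are healthy.
```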

DavidDudas-Intuitial (Collaborator, Author) commented:

I considered adding:

  • Sum of issue points < previous day value

But I'm not sure that would necessarily indicate a regression. For example, a swath of issues might get "re-pointed" by the scrum team during planning.

DavidDudas-Intuitial (Collaborator, Author) commented:

I added the following metrics to the dashboard today (a sketch of the pointed-percentage question follows the list):

  • Count of distinct sprints each day
  • % pointed vs unpointed issues current day
  • % of each issue type current day
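
A sketch of the current-day pointed-vs-unpointed percentage (again using the assumed gh_issue_history schema; treating NULL or zero points as unpointed is itself an assumption):

```sql
-- Sketch: % of pointed vs. unpointed issues for the current day.
SELECT
  100.0 * COUNT(*) FILTER (WHERE points > 0)
        / NULLIF(COUNT(*), 0) AS pct_pointed,
  100.0 * COUNT(*) FILTER (WHERE points IS NULL OR points = 0)
        / NULLIF(COUNT(*), 0) AS pct_unpointed
FROM gh_issue_history
WHERE d_effective = CURRENT_DATE;
```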

DavidDudas-Intuitial (Collaborator, Author) commented:

I added two new monitors to the existing wrapper monitor:

  • % of pointed issues < baseline value (60%)
  • % of issues with type = ‘Task’ < baseline value (50%)

DavidDudas-Intuitial (Collaborator, Author) commented:

Also: made copies of all ETL queries on the dev instance, as a backup for prod.

http://metabase-dev-710651776.us-east-1.elb.amazonaws.com/collection/100-etl

DavidDudas-Intuitial (Collaborator, Author) commented Feb 26, 2025

UPDATED SUMMARY

Established the following key metrics (and baseline values) to serve as health indicators for the automated ETL workflow:

  • Count of deliverable history rows each day
  • Count of issue history rows each day
  • Sum of all issue points each day
  • Count of pointed vs. unpointed issues each day
  • Count of issues by status each day
  • Count of deliverables by status each day
  • Count of distinct sprints each day
  • % pointed vs unpointed issues current day
  • % of each issue type current day

Implemented the following eight (8) monitors/alerts to detect anomalies in key metrics. Alerts will be triggered when any of the following conditions becomes true (threshold values in parentheses):

  • Age of current snapshot > max age (24hr)
  • Count of deliverables < previous day value
  • Count of deliverables < baseline value (20)
  • Count of issues < previous day value
  • Count of issues < baseline value (2000)
  • Sum of issue points < baseline value (3500)
  • % of pointed issues < baseline value (60%)
  • % of issues with type = ‘Task’ < baseline value (50%)

The monitors are all encapsulated within one wrapper monitor.

DavidDudas-Intuitial (Collaborator, Author) commented:

The ETL Metrics dashboard now looks like this:

[Screenshot: current ETL Metrics dashboard]
