AI supported analytics of recurring alerts #3864

Open
2 tasks
Rotfuks opened this issue Feb 4, 2025 · 5 comments


Rotfuks commented Feb 4, 2025

Motivation

In our alert review session we realised that we have a lot of recurring alerts that we don't put as much focus on as we probably should. Individually those alerts are not that important, but the toil they produce adds up over time. AI can help us recognise patterns like the "Top 5 most recurring alerts".

Todo

  • Create an AI analytics report that can help us with recurring alerts in the alert review
    • this can be fed by multiple sources, e.g. Slack, for alerts and incidents
    • a top X list of the most frequently triggered alerts might be interesting there
    • also make it analyse any other patterns within the context of triggered alerts and incidents
  • We talked about creating and receiving a weekly summary of alerts on test clusters, or of alerts with notify priority only

Outcome

  • We have a better understanding of our recurring alerts going into the alert review, because we have more helpful data to look at.
@github-project-automation github-project-automation bot moved this to Inbox 📥 in Roadmap Feb 4, 2025
@Rotfuks Rotfuks moved this from Inbox 📥 to Up Next ➡️ in Roadmap Feb 6, 2025
@TheoBrigitte TheoBrigitte self-assigned this Feb 25, 2025
@architectbot architectbot added the team/atlas Team Atlas label Feb 25, 2025
@QuentinBisson QuentinBisson moved this from Up Next ➡️ to In Progress ⛏️ in Roadmap Feb 26, 2025
@TheoBrigitte
Member

My main idea here is to produce some kind of digest with meaningful information that helps engineers focus on solving the alerts and problems which matter most.

Questions to answer

Here are some example questions to be answered (a rough sketch of how some of them could be computed follows the list):

  • Which alerts have the most occurrences?
  • Which customers, apps, and clusters generate the most alerts?
  • Do alerts cluster around certain times (e.g., business hours)?
  • Are certain alert types related to specific customers or clusters?
  • Does an increase in one alert predict another?
  • Find relevant patterns (e.g. Alert A has fired every second day for the past week, Alert B fires every Tuesday at noon)
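
The first few questions can already be answered with plain aggregation over an exported alert history. A minimal sketch, assuming a hypothetical CSV export with `alertname`, `customer`, `cluster` and `starts_at` columns (not something we have today):

```python
# Sketch only: answer the "most occurrences" / "who generates the most alerts" /
# "when do alerts fire" questions with simple pandas aggregation.
import pandas as pd

# Hypothetical export; column names are assumptions.
alerts = pd.read_csv("alerts_export.csv", parse_dates=["starts_at"])

# Which alerts have the most occurrences?
top_alerts = alerts["alertname"].value_counts().head(10)

# Which customers / clusters generate the most alerts?
by_customer = alerts.groupby("customer").size().sort_values(ascending=False)
by_cluster = alerts.groupby("cluster").size().sort_values(ascending=False)

# Do alerts cluster around certain times (e.g. business hours)?
by_hour = alerts["starts_at"].dt.hour.value_counts().sort_index()

print(top_alerts, by_customer.head(5), by_cluster.head(5), by_hour, sep="\n\n")
```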

Data sources

We have multiple data sources at hand, e.g. the #opsgenie Slack channel, Alertmanager, and OpsGenie.

Ideas

  • Scrape whatever is in the #opsgenie Slack channel and produce reports by crafting a detailed prompt
  • Use clustering algorithms (e.g. k-means, DBSCAN) to group similar alerts based on occurrence patterns (see the sketch below)
  • Use embeddings and a vector database to find similarities between alerts
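
As a rough illustration of the clustering idea (reusing the hypothetical export from the sketch above; column names and DBSCAN parameters are assumptions that would need tuning), one could build a per-alert firing profile and let DBSCAN group alerts with similar patterns:

```python
# Sketch only: group alerts by similar firing patterns with DBSCAN.
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

alerts = pd.read_csv("alerts_export.csv", parse_dates=["starts_at"])

# One row per alert name, one column per hour of day: how often it fires at that hour.
profile = pd.crosstab(alerts["alertname"], alerts["starts_at"].dt.hour)

# eps / min_samples are placeholders; -1 means "noise", i.e. no similar alerts found.
labels = DBSCAN(eps=1.5, min_samples=2).fit_predict(
    StandardScaler().fit_transform(profile)
)
for cluster_id in sorted(set(labels)):
    members = profile.index[labels == cluster_id].tolist()
    print(f"cluster {cluster_id}: {members}")
```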

Experiments

Grafana Cloud

I had a look at the tooling provided in Grafana Cloud: Metrics forecast, Outlier detection, Sift investigations. I haven't found a way to make any of them produce meaningful results, and the tweaking options there are very limited.

n8n

I played a bit with n8n on our operations cluster. I built a workflow which takes Alertmanager webhook data as input, stores it in a vector database, and then queries it via a webhook. I haven't managed to produce meaningful results; some additional data transformation might help there.
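
One direction for that data transformation (sketched here outside n8n, not part of the workflow below) could be to flatten each alert from the Alertmanager webhook payload into a single compact text line before it is embedded, so the vector entries only carry the fields that matter. The label names used here (e.g. `cluster_id`) are assumptions:

```python
# Sketch: flatten an Alertmanager webhook payload into one text line per alert.
def flatten_alert(alert: dict) -> str:
    labels = alert.get("labels", {})
    annotations = alert.get("annotations", {})
    return " | ".join(
        [
            f"alert={labels.get('alertname', 'unknown')}",
            f"cluster={labels.get('cluster_id', '')}",   # assumed label name
            f"severity={labels.get('severity', '')}",
            f"status={alert.get('status', '')}",
            f"startsAt={alert.get('startsAt', '')}",
            f"summary={annotations.get('summary', '')}",
        ]
    )

def flatten_webhook(payload: dict) -> list[str]:
    # Alertmanager sends {"alerts": [...], ...}; produce one line per alert.
    return [flatten_alert(a) for a in payload.get("alerts", [])]
```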

Details

Here are the workflow details:

[Image]
workflow: Alert_analytics (2).json

Here is where the "answer" fails to be generated, because the system input data does not make sense:

[Image]


teemow commented Mar 1, 2025

@TheoBrigitte I'm happy to go through the workflow with you and see how we can structure the data and make it accessible. We should probably start with one question to answer and tweak it until the results are consistently good enough. Another thing would be to connect it to swarmgeist so it becomes accessible for everyone (that's easy). I also thought about adding swarmgeist to incident channels to let it store alerts and their resolutions, so it can give hints to people in new incidents.


TheoBrigitte commented Mar 11, 2025

I managed to get some answers using n8n, but this solution does not seem to be a good fit for this use case. A vector database plus an LLM is not well suited to analyzing large chunks of data in a "scientific" way, and there is a strong limitation due to the k factor, which only allows fetching a limited number of entries from the vector database.

That said, n8n is nice for integrating with other systems: I did manage to get alerts from Alertmanager and/or OpsGenie.
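
For reference, pulling the same raw data outside n8n only needs the Alertmanager v2 API. A small sketch with a placeholder URL:

```python
# Sketch: list currently firing alerts from Alertmanager's v2 API.
import requests

ALERTMANAGER_URL = "http://alertmanager.example:9093"  # placeholder

resp = requests.get(f"{ALERTMANAGER_URL}/api/v2/alerts", timeout=10)
resp.raise_for_status()
for alert in resp.json():
    labels = alert.get("labels", {})
    print(labels.get("alertname"), alert.get("startsAt"))
```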

n8n workflows

Alert_analytics_opsgenie.json
Alert_analytics_alertmanager.json

Workflow details

Input sent to OpenAI, which contains 4 alert entries (due to k=4):
[Image]
and produces the following output:
[Image]

But it is also not able to answer all my questions; I believe the prompt could use some more improvement.

Here is the information regarding the alerts:

1. **Summary of the last alerts received**: 
   - The most recent alert indicates a failure of the "Internal Sanity Checks" for the service "KONG" in the "peu02-private" environment. This alert was recorded on March 7, 2025, at 09:16:37 UTC. Additional details include a trigger URL for further investigation, but many fields concerning incident homepage, cluster status, severity, and others are either empty or unspecified.
   
2. **Summary of most significant alerts counts by details**: 
- Unfortunately, I do not have information available for this request.

3. **Total count of the last alerts received**: 
- Similarly, I do not have information available for this request.

If you have any more specific questions or need further assistance, feel free to ask!	

my workbench: alert-analytics.tar.gz


teemow commented Mar 11, 2025

@TheoBrigitte did you check vector stores and assistants in the OpenAI API?

https://platform.openai.com/storage/vector_stores
https://platform.openai.com/assistants/

It is different from using the embeddings API and should lead to better results. The model is still meh (only GPT-4o works with the vector stores), but this is going to improve over time (GPT-5 should make this better, afaik).
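
As a hedged sketch of that flow with the openai Python SDK (the exact namespaces have shifted between SDK releases, e.g. `client.beta.vector_stores` vs. `client.vector_stores`, and the file name is a placeholder), it roughly looks like this:

```python
# Sketch: upload alert data to a vector store and query it through an assistant
# with the file_search tool. Treat this as the shape of the flow, not exact calls.
from openai import OpenAI

client = OpenAI()

# 1. Upload an alert export and attach it to a vector store.
vector_store = client.beta.vector_stores.create(name="alert-analytics")
alert_file = client.files.create(file=open("alerts_export.json", "rb"), purpose="assistants")
client.beta.vector_stores.files.create(vector_store_id=vector_store.id, file_id=alert_file.id)

# 2. Create an assistant that can search the store via file_search.
assistant = client.beta.assistants.create(
    model="gpt-4o",
    instructions="You analyse recurring alerts and summarise patterns.",
    tools=[{"type": "file_search"}],
    tool_resources={"file_search": {"vector_store_ids": [vector_store.id]}},
)

# 3. Ask a question in a thread and poll the run until it finishes.
thread = client.beta.threads.create(
    messages=[{"role": "user", "content": "Which alerts fired most often last week?"}]
)
run = client.beta.threads.runs.create_and_poll(thread_id=thread.id, assistant_id=assistant.id)
answer = client.beta.threads.messages.list(thread_id=thread.id).data[0]
print(answer.content[0].text.value)
```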

Here is an example workflow where I store the incident channel information in a vector store.

Incident_Bot_Example.json

You should also see examples in the OpenAI platform of how I did this for the meetings. You can query the assistants in the playground. The incidents stuff is not ready to use yet.

@TheoBrigitte TheoBrigitte moved this from In Progress ⛏️ to Up Next ➡️ in Roadmap Mar 18, 2025