AI supported analytics of recurring alerts #3864

Open
2 tasks
Rotfuks opened this issue Feb 4, 2025 · 5 comments


Rotfuks commented Feb 4, 2025

Motivation

In our alert review session we realised that we have a lot of recurring alerts that we don't put as much focus on as we probably should. Individually those alerts are not that important, but the toil they produce adds up over time. AI can help us recognise patterns like the "Top 5 most recurring alerts".

Todo

  • Create an AI analytics report that can help us with recurring alerts in the alert review
    • this can be fed by multiple sources, e.g. Slack, for alerts and incidents
    • a top X list of the most frequently triggered alerts might be interesting there
    • also make it analyse any other patterns within the context of triggered alerts and incidents
  • We talked about creating and receiving a weekly summary of alerts on test clusters, or of alerts with notify priority only

Outcome

  • We have a better understanding of our recurring alerts going into the alert review, because we have more helpful data to look at.
@github-project-automation github-project-automation bot moved this to Inbox 📥 in Roadmap Feb 4, 2025
@Rotfuks Rotfuks moved this from Inbox 📥 to Up Next ➡️ in Roadmap Feb 6, 2025
@TheoBrigitte TheoBrigitte self-assigned this Feb 25, 2025
@architectbot architectbot added the team/atlas Team Atlas label Feb 25, 2025
@QuentinBisson QuentinBisson moved this from Up Next ➡️ to In Progress ⛏️ in Roadmap Feb 26, 2025
@TheoBrigitte
Member

My main idea here is to produce some kind of digest with meaningful information that helps engineers focus on solving the alerts and problems which matter most.

Questions to answer

Here are some example questions to be answered (a rough sketch of how some of them could be computed follows the list):

  • Which alerts have the most occurrences?
  • Which customers, apps, and clusters generate the most alerts?
  • Do alerts cluster around certain times (e.g., business hours)?
  • Are certain alert types related to specific customers or clusters?
  • Does an increase in one alert predict another?
  • Find relevant patterns (e.g. Alert A has fired every second day for the past week, Alert B fires every Tuesday at noon)
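
The first few questions can already be answered with plain aggregation over an exported alert history. A minimal sketch, assuming a hypothetical CSV export with `alertname`, `customer`, `cluster` and `starts_at` columns (not something we have today):

```python
# Sketch only: answer the "most occurrences" / "who generates the most alerts" /
# "when do alerts fire" questions with simple pandas aggregation.
import pandas as pd

# Hypothetical export; column names are assumptions.
alerts = pd.read_csv("alerts_export.csv", parse_dates=["starts_at"])

# Which alerts have the most occurrences?
top_alerts = alerts["alertname"].value_counts().head(10)

# Which customers / clusters generate the most alerts?
by_customer = alerts.groupby("customer").size().sort_values(ascending=False)
by_cluster = alerts.groupby("cluster").size().sort_values(ascending=False)

# Do alerts cluster around certain times (e.g. business hours)?
by_hour = alerts["starts_at"].dt.hour.value_counts().sort_index()

print(top_alerts, by_customer.head(5), by_cluster.head(5), by_hour, sep="\n\n")
```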

Data sources

We have multiple data sources at hand, e.g. the #opsgenie Slack channel, Alertmanager, and OpsGenie.

Ideas

  • Scrape whatever is in the #opsgenie Slack channel and produce reports by crafting a detailed prompt
  • Use clustering algorithms (e.g. k-means, DBSCAN) to group similar alerts based on occurrence patterns (see the sketch below)
  • Use embeddings and a vector database to find similarities between alerts
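
As a rough illustration of the clustering idea (reusing the hypothetical export from the sketch above; column names and DBSCAN parameters are assumptions that would need tuning), one could build a per-alert firing profile and let DBSCAN group alerts with similar patterns:

```python
# Sketch only: group alerts by similar firing patterns with DBSCAN.
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

alerts = pd.read_csv("alerts_export.csv", parse_dates=["starts_at"])

# One row per alert name, one column per hour of day: how often it fires at that hour.
profile = pd.crosstab(alerts["alertname"], alerts["starts_at"].dt.hour)

# eps / min_samples are placeholders; -1 means "noise", i.e. no similar alerts found.
labels = DBSCAN(eps=1.5, min_samples=2).fit_predict(
    StandardScaler().fit_transform(profile)
)
for cluster_id in sorted(set(labels)):
    members = profile.index[labels == cluster_id].tolist()
    print(f"cluster {cluster_id}: {members}")
```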

Experiments

Grafana Cloud

I had a look at the tooling provided in Grafana Cloud: Metrics forecast, Outlier detection, Sift investigations. I haven't found a way to make any of them produce meaningful results, and the tweaking options there are very limited.

n8n

I played a bit with n8n on our operations cluster. I built a workflow which takes Alertmanager webhook data as input, stores it in a vector database, and then queries it via a webhook. I haven't managed to produce meaningful results; some additional data transformation might help there.
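
One direction for that data transformation (sketched here outside n8n, not part of the workflow below) could be to flatten each alert from the Alertmanager webhook payload into a single compact text line before it is embedded, so the vector entries only carry the fields that matter. The label names used here (e.g. `cluster_id`) are assumptions:

```python
# Sketch: flatten an Alertmanager webhook payload into one text line per alert.
def flatten_alert(alert: dict) -> str:
    labels = alert.get("labels", {})
    annotations = alert.get("annotations", {})
    return " | ".join(
        [
            f"alert={labels.get('alertname', 'unknown')}",
            f"cluster={labels.get('cluster_id', '')}",   # assumed label name
            f"severity={labels.get('severity', '')}",
            f"status={alert.get('status', '')}",
            f"startsAt={alert.get('startsAt', '')}",
            f"summary={annotations.get('summary', '')}",
        ]
    )

def flatten_webhook(payload: dict) -> list[str]:
    # Alertmanager sends {"alerts": [...], ...}; produce one line per alert.
    return [flatten_alert(a) for a in payload.get("alerts", [])]
```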

Details

Here are the workflow details:

[Image]
workflow: Alert_analytics (2).json

Here is where the "answer" fails to be generated, because the system input data does not make sense:

[Image]


teemow commented Mar 1, 2025

@TheoBrigitte I'm happy to go through the workflow with you and see how we can structure the data and make it accessible. We should probably start with one question to answer and tweak it until the results are consistently good enough. Another thing would be to connect it to swarmgeist so it becomes accessible for everyone (that's easy). I also thought about adding swarmgeist to incident channels to let it store alerts and their resolutions, so it can give hints to people in new incidents.


TheoBrigitte commented Mar 11, 2025

I managed to get some answers using n8n, but this solution does not seem to be a good fit for this use case. A vector database plus an LLM is not well suited to analyzing large chunks of data in a "scientific" way, and there is a strong limitation due to the k factor, which only allows fetching a limited number of entries from the vector database.

That said, n8n is nice for integrating with other systems: I did manage to get alerts from Alertmanager and/or OpsGenie.
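
For reference, pulling the same raw data outside n8n only needs the Alertmanager v2 API. A small sketch with a placeholder URL:

```python
# Sketch: list currently firing alerts from Alertmanager's v2 API.
import requests

ALERTMANAGER_URL = "http://alertmanager.example:9093"  # placeholder

resp = requests.get(f"{ALERTMANAGER_URL}/api/v2/alerts", timeout=10)
resp.raise_for_status()
for alert in resp.json():
    labels = alert.get("labels", {})
    print(labels.get("alertname"), alert.get("startsAt"))
```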

n8n workflows

Alert_analytics_opsgenie.json
Alert_analytics_alertmanager.json

Workflow details

Input sent to OpenAI, which contains 4 alert entries (due to k=4):
[Image]
and produces the following output:
[Image]

But it is also not able to answer all my questions; I believe the prompt could use some more improvement.

Here is the information regarding the alerts:

1. **Summary of the last alerts received**: 
   - The most recent alert indicates a failure of the "Internal Sanity Checks" for the service "KONG" in the "peu02-private" environment. This alert was recorded on March 7, 2025, at 09:16:37 UTC. Additional details include a trigger URL for further investigation, but many fields concerning incident homepage, cluster status, severity, and others are either empty or unspecified.
   
2. **Summary of most significant alerts counts by details**: 
- Unfortunately, I do not have information available for this request.

3. **Total count of the last alerts received**: 
- Similarly, I do not have information available for this request.

If you have any more specific questions or need further assistance, feel free to ask!	

my workbench: alert-analytics.tar.gz


teemow commented Mar 11, 2025

@TheoBrigitte did you check vector stores and assistants in the OpenAI API?

https://platform.openai.com/storage/vector_stores
https://platform.openai.com/assistants/

It is different from using the embeddings API and should lead to better results. The model is still meh (only GPT-4o works with the vector stores), but this is going to improve over time (GPT-5 should make this better, afaik).
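
As a hedged sketch of that flow with the openai Python SDK (the exact namespaces have shifted between SDK releases, e.g. `client.beta.vector_stores` vs. `client.vector_stores`, and the file name is a placeholder), it roughly looks like this:

```python
# Sketch: upload alert data to a vector store and query it through an assistant
# with the file_search tool. Treat this as the shape of the flow, not exact calls.
from openai import OpenAI

client = OpenAI()

# 1. Upload an alert export and attach it to a vector store.
vector_store = client.beta.vector_stores.create(name="alert-analytics")
alert_file = client.files.create(file=open("alerts_export.json", "rb"), purpose="assistants")
client.beta.vector_stores.files.create(vector_store_id=vector_store.id, file_id=alert_file.id)

# 2. Create an assistant that can search the store via file_search.
assistant = client.beta.assistants.create(
    model="gpt-4o",
    instructions="You analyse recurring alerts and summarise patterns.",
    tools=[{"type": "file_search"}],
    tool_resources={"file_search": {"vector_store_ids": [vector_store.id]}},
)

# 3. Ask a question in a thread and poll the run until it finishes.
thread = client.beta.threads.create(
    messages=[{"role": "user", "content": "Which alerts fired most often last week?"}]
)
run = client.beta.threads.runs.create_and_poll(thread_id=thread.id, assistant_id=assistant.id)
answer = client.beta.threads.messages.list(thread_id=thread.id).data[0]
print(answer.content[0].text.value)
```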

Here is an example workflow where I store the incident channel information in a vector store.

Incident_Bot_Example.json

You should also see examples in the OpenAI platform of how I did this for the meetings. You can query the assistants in the playground. The incidents stuff is not ready to use yet.

@TheoBrigitte TheoBrigitte moved this from In Progress ⛏️ to Up Next ➡️ in Roadmap Mar 18, 2025