AI supported analytics of recurring alerts #3864
My main idea here is to produce a digest of meaningful information that helps engineers focus on the alerts and problems that matter most.

Questions to answer
Here are some example questions to answer:
Data sources
We have multiple data sources at hand:
Ideas
Experiments

Grafana Cloud
I looked at the tooling provided in Grafana Cloud (Metrics forecast, Outlier detection, Sift investigations), but I haven't found a way to make any of it produce meaningful results, and the tweaking options are very limited.

n8n
I played a bit with n8n on our operations cluster. I built a workflow which takes Alertmanager webhook data as input, stores the result in a vector database, and then queries it through a webhook. I haven't managed to produce meaningful results; some additional data transformation might help.

Details
Here are the workflow details:
This is where the "answer" fails to be generated, because the system input data does not make sense.
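One possible shape for the "additional data transformation" mentioned above, as a minimal sketch only: flatten each alert in an Alertmanager webhook payload into one small record with a short text field before it is stored. The field names `alerts`, `labels`, `annotations`, `status`, `startsAt` and `fingerprint` follow the Alertmanager webhook format; the function names, the record layout, and the sample payload are assumptions made up for illustration and are not taken from the n8n workflow.

```python
from typing import Any


def flatten_alert(alert: dict[str, Any]) -> dict[str, str]:
    """Reduce one alert from an Alertmanager webhook payload to a flat record."""
    labels = alert.get("labels", {})
    annotations = alert.get("annotations", {})
    name = labels.get("alertname", "unknown")
    status = alert.get("status", "unknown")
    # Sort the labels so the same alert always produces the same text.
    label_text = ", ".join(
        f"{key}={value}" for key, value in sorted(labels.items()) if key != "alertname"
    )
    description = annotations.get("description", "")
    return {
        "alertname": name,
        "status": status,
        "starts_at": alert.get("startsAt", ""),
        "fingerprint": alert.get("fingerprint", ""),
        # Hypothetical text layout, meant as the input for an embedding step.
        "text": f"{name} [{status}] {label_text}. {description}".strip(),
    }


def flatten_payload(payload: dict[str, Any]) -> list[dict[str, str]]:
    """Return one flat record per alert contained in the webhook payload."""
    return [flatten_alert(alert) for alert in payload.get("alerts", [])]


# Made-up sample payload, for illustration only.
sample_payload = {
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "HighCPU", "cluster": "demo", "severity": "page"},
            "annotations": {"description": "CPU usage is high"},
            "startsAt": "2025-01-01T10:00:00Z",
            "fingerprint": "abc123",
        }
    ],
}

records = flatten_payload(sample_payload)
for record in records:
    print(record["text"])
```

Keeping `alertname`, `status` and `starts_at` as separate fields next to the text means they can still be filtered or counted directly, without going through the vector search.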
@TheoBrigitte I'm happy to go through the workflow with you and see how we can structure the data and make it accessible. We should probably start with one question to answer and tweak it until the answers are consistently good enough. Another thing would be to connect it to swarmgeist so it becomes accessible for everyone (it's easy). I also thought about adding swarmgeist to incident channels to let it store alerts and their resolutions, so it can give hints to people in new incidents.
I managed to get some answers using n8n, but this solution does not seem to be a good fit for this use case. Using a vector database and an LLM is not well suited to analyzing large chunks of data in a "scientific" way, and there seems to be a strong limitation due to the

It is nice for integrating with other systems, though: I did manage to get alerts from Alertmanager and/or OpsGenie.

n8n workflows
Alert_analytics_opsgenie.json

Workflow details
Input sent to OpenAI, which contains 4 alert entries (due to k=4)

But it's also not able to answer all my questions; I believe the prompt could use some more improvement.
My workbench: alert-analytics.tar.gz
@TheoBrigitte did you check vector stores and assistants in the OpenAI API? https://platform.openai.com/storage/vector_stores It is different from using the embeddings API and should lead to better results. The model is still meh (only GPT 4o works with the vector stores), but this is going to improve over time (GPT 5 should make this better afaik). Here is an example workflow where I store the incident channel information in a vector store. You should also see examples in the OpenAI platform of how I did this for the meetings. You can query the assistants in the playground. The incidents stuff is not yet ready to use.
Motivation
In our alert review session we realised that we have a lot of recurring alerts that we don't focus on as much as we potentially should. Individually, those alerts are not that important, but the toil they produce adds up over time. AI can help us recognize patterns such as "Top 5 most recurring alerts".
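As a baseline to compare any AI-generated digest against, a "Top 5 most recurring alerts" list can also be computed with a plain count. The snippet below is a minimal sketch that assumes the alerts are available as a list of dicts in the Alertmanager webhook shape, each carrying a `labels.alertname` field; the function name and the sample alert names are made up for illustration.

```python
from collections import Counter
from typing import Any


def top_recurring_alerts(alerts: list[dict[str, Any]], n: int = 5) -> list[tuple[str, int]]:
    """Count alerts by their alertname label and return the n most frequent."""
    counts = Counter(
        alert.get("labels", {}).get("alertname", "unknown") for alert in alerts
    )
    return counts.most_common(n)


# Made-up sample data, for illustration only.
sample_names = ["DiskFull", "DiskFull", "DiskFull", "PodRestart", "PodRestart", "CertExpiry"]
sample_alerts = [{"labels": {"alertname": name}} for name in sample_names]

for name, count in top_recurring_alerts(sample_alerts, n=5):
    print(f"{name}: {count}")
```

A count like this does not explain why an alert recurs, but it gives a fixed reference for checking whether a generated digest ranks the same alerts.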
Todo
Outcome