Site reliability engineering (SRE) and DevOps teams are exhausted. Sprawling IT estates, tool overload and the on-call nature of the job all contribute to an overarching issue: alert fatigue.
Alert fatigue (sometimes called alarm fatigue) refers to "a state of mental and operational exhaustion caused by an overwhelming number of alerts." It erodes the responsiveness and efficacy of DevOps, security operations center (SOC), site reliability engineering (SRE) and other teams responsible for IT performance and security, and it is a widespread, consequential problem.
While the Vectra report takes a security-specific focus, teams charged with monitoring application and infrastructure performance face a similar overload. For example, a single misconfiguration can cause hundreds or thousands of performance alerts, an "alert storm" that can distract or desensitize IT teams and cause delayed responses to critical alerts and real issues. Those real issues can be costly.
What's driving this burnout, and can agentic AI be part of a scalable solution?
Simply feeding these massive telemetry streams into a large language model (LLM) isn't a viable solution, either. For one, it's a waste of compute. It's also a great way to produce hallucinations.
A practical solution starts with a workflow that synthesizes raw data and aggregates this higher-quality, context-rich data in a centralized platform. There it can be used for enterprise-wide observability and for training local AI models.
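To make that enrich-then-aggregate step concrete, here is a minimal sketch in Python. The raw events, the service catalog and every field name are hypothetical stand-ins, not any particular product's schema:

```python
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical raw events pulled from separate monitoring tools.
raw_events = [
    {"source": "apm", "service": "checkout", "type": "latency_spike", "value_ms": 2400},
    {"source": "infra", "service": "checkout", "type": "cpu_high", "value_pct": 93},
]

# Hypothetical service catalog supplying the business context used for enrichment.
service_catalog = {
    "checkout": {"tier": "critical", "owner": "payments-team", "slo_latency_ms": 500},
}

def enrich(event: dict) -> dict:
    """Attach ownership, criticality and SLO context before central aggregation."""
    context = service_catalog.get(event["service"], {})
    return {**event, **context, "ingested_at": datetime.now(timezone.utc).isoformat()}

# Group context-rich events per service so the central platform (and any local
# model trained on it) sees one coherent picture instead of isolated data points.
aggregated = defaultdict(list)
for event in raw_events:
    aggregated[event["service"]].append(enrich(event))
```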
Each of these tools might be responsible for monitoring dozens or hundreds of applications, application programming interfaces (APIs) or servers, each feeding its own data pipeline. With such silos, separate tools can generate multiple alerts stemming from the same underlying issue. This lack of integration limits visibility, which hampers correlation and root cause analysis. SREs waste time chasing down each of these alerts before identifying the redundancies.
What's worse, this lack of integration hinders the efficacy of alert-management automation, such as the prioritization and correlation workflows set up to assist in detection and resolution and to reduce alert volume. Teams are left to connect the dots manually, an arduous and time-consuming (if not impossible) task.
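One way to picture the correlation workflow that integration makes possible: collapse alerts that share a fingerprint into a single incident. The sketch below assumes a naive key of service plus symptom; real platforms would also factor in topology and time windows:

```python
from collections import defaultdict

# Hypothetical alerts raised by separate tools for the same underlying issue.
alerts = [
    {"id": 1, "tool": "apm", "service": "checkout", "symptom": "latency_spike"},
    {"id": 2, "tool": "synthetics", "service": "checkout", "symptom": "latency_spike"},
    {"id": 3, "tool": "infra", "service": "search", "symptom": "disk_full"},
]

def fingerprint(alert: dict) -> tuple:
    """Correlation key: alerts about the same service and symptom are grouped."""
    return (alert["service"], alert["symptom"])

incidents = defaultdict(list)
for alert in alerts:
    incidents[fingerprint(alert)].append(alert)

# Three raw alerts collapse into two incidents for SREs to triage.
for key, grouped in incidents.items():
    print(key, "->", [a["id"] for a in grouped])
```

Without shared context across tools, there is nothing consistent to fingerprint on, which is why siloed estates leave this grouping to humans.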
A survey cited in Deloitte's "Adaptive Defense: Custom Alerts for Modern Threats" report found that a "lack of visibility or context from security tools resulted in 47% of attacks being missed in a 12-month period."
While individual agents don't necessarily require centralization, a centralized platform where data from agents is aggregated facilitates system-wide analysis, storage and visualization.
A recent MIT report ignited a firestorm with the claim that "95% of organizations are getting zero return" on their generative AI investments.
Setting aside the inflammatory stat, and the cascade of opinions the report elicited, the report highlights a valuable theme: many AI projects fail because of "brittle workflows, lack of contextual learning, and misalignment with day-to-day operations." As Marina Danilevsky, Senior Research Scientist at IBM, notes on a recent Mixture of Experts podcast, the most successful deployments are "focused, scoped and address a proper pain point."
The MIT report reinforces the fact that companies that view AI as a sort of panacea, or as something that can be haphazardly shoehorned into a process, aren't likely to see a return on their investment. Organizations that strategically implement AI tools in their workflows to solve specific problems, and reinforce those tools over time, are better positioned for success.
AI agents can improve traditional systems that rely on static rules and preset thresholds by bringing factors like asset importance, performance guarantees, risk profiles and historical trends to bear.
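A simple way to see the difference: a static rule fires on "CPU > 90%" everywhere, while a context-aware score weighs how unusual the reading is for that asset and how much business risk it carries. The scoring function and its inputs below are illustrative assumptions, not a documented algorithm:

```python
# Hypothetical priority score combining deviation from a learned baseline,
# asset criticality and SLO (error budget) pressure.
def priority_score(alert: dict, asset: dict, history: dict) -> float:
    deviation = max(alert["cpu_pct"] - history["typical_cpu_pct"], 0)
    criticality = {"critical": 3.0, "standard": 1.5, "low": 1.0}[asset["tier"]]
    slo_pressure = 2.0 if asset["error_budget_remaining_pct"] < 20 else 1.0
    return deviation * criticality * slo_pressure

# The same 91% CPU reading ranks far higher on a critical, SLO-constrained
# service than on a low-tier batch node that routinely runs hot.
score = priority_score(
    alert={"cpu_pct": 91},
    asset={"tier": "critical", "error_budget_remaining_pct": 12},
    history={"typical_cpu_pct": 55},
)
```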
For example, consider a post-incident detection and remediation workflow, and how an AI agent might assist an SRE team.
A notification hits the alert system flagging high CPU usage for a node in a Kubernetes cluster. In a traditional system, SREs might need to comb through MELT data (metrics, events, logs, traces) and dependencies to identify the root cause.
In this hypothetical agentic workflow, the agent uses the observability tool's knowledge graph and topology-aware correlation to pull only the telemetry related to the alert (such as logs for the services running on that node, recent deployments, and telemetry from the Kubernetes API server or from the load balancers that route traffic to the node or cluster). With this additional information, the agent can enrich raw alerts and provide context-rich telemetry to a local AI model trained on the enterprise's performance data and benchmarks.
The agent excludes irrelevant information, such as logs for unrelated services that happen to run on the same cluster. During this context gathering, the agent can also identify related signals, correlate alerts that likely stem from the same root cause, and group them to be investigated as a single incident.
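A rough sketch of that scoped context gathering, assuming a toy topology map and stubbed log and deployment sources (none of these names come from a real observability API):

```python
# Hypothetical topology: which services run on, or route traffic to, the node.
topology = {"node-17": ["checkout-svc", "cart-svc", "ingress-lb"]}

# Stub data sources standing in for real log and deployment backends.
def fetch_logs(service: str, window_minutes: int) -> list[str]:
    return [f"{service}: sample log line"]

def fetch_deployments(services: list[str]) -> list[dict]:
    return [{"service": s, "deployed_minutes_ago": 12} for s in services]

def scope_telemetry(alert: dict) -> dict:
    """Collect only telemetry related to the alerting node, rather than
    streaming the estate's entire MELT dataset to the model."""
    related = topology.get(alert["node"], [])
    return {
        "alert": alert,
        "logs": {svc: fetch_logs(svc, window_minutes=30) for svc in related},
        "recent_deployments": fetch_deployments(related),
    }

context = scope_telemetry({"node": "node-17", "signal": "cpu_high", "cpu_pct": 94})
```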
With this information, the model can propose a hypothesis. The agent can also request more information (perhaps checking container configurations or time-series data around the usage spike) to test and refine the model's hypothesis, adding further context before proposing a probable root cause.
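That propose-and-refine loop might look something like the following sketch, where the `model` and `gather_evidence` interfaces are hypothetical placeholders for a locally trained model and the agent's data-gathering tools:

```python
# Hypothetical investigation loop: the agent asks the model for a hypothesis,
# fetches only the extra evidence the model requests (container configs,
# time-series data around the spike), and repeats until confidence is high.
def investigate(incident: dict, model, gather_evidence, max_rounds: int = 3) -> dict:
    context = dict(incident)
    hypothesis = model.propose(context)  # e.g. {"cause": ..., "confidence": 0.5, "needs": [...]}
    for _ in range(max_rounds):
        if hypothesis["confidence"] >= 0.8 or not hypothesis.get("needs"):
            break
        context.update(gather_evidence(hypothesis["needs"]))  # targeted follow-up data
        hypothesis = model.propose(context)
    return hypothesis  # probable root cause, ready for SRE review or remediation
```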