IBM - International Business Machines Corporation

08/29/2025 | News release | Archived content

Reducing Alert Fatigue With AI Agents

Site reliability engineering (SRE) and DevOps teams are exhausted. Sprawling IT estates, tool overload and the on-call nature of the job all play a role in an overarching issue: alert fatigue.

Alert fatigue (sometimes called alarm fatigue) refers to "a state of mental and operational exhaustion caused by an overwhelming number of alerts." It erodes the responsiveness and efficacy of DevOps, security operations center (SOC), site reliability engineering (SRE) and other teams responsible for IT performance and security, and it is a widespread, consequential problem.

Vectra's "2023 State of Threat Detection" report(based on a survey of 2,000 IT security analysts at firms with 1,000 or more employees) found that SOC teams field an average of 4,484 alerts per day. Of these, 67% are ignored due to a high volume of false positives and alert fatigue. The report also found that 71% of analysts believed that their organization might already have been "compromised without their knowledge, due to lack of visibility and confidence in threat detection capabilities."

While the Vectra report takes a security-specific focus, teams charged with monitoring application and infrastructure performance face a similar overload. For example, a single misconfiguration can cause hundreds or thousands of performance alerts, an "alert storm" that can distract or desensitize IT teams and cause delayed responses to critical alerts and real issues. Those real issues can be costly.

What's driving this burnout, and can agentic AI be part of a scalable solution?

Primary causes of alert fatigue

There are several culprits. An overwhelming volume of telemetry is often cited as one of them, but a focus on data volume alone obscures a core issue: data quality and context.

Lack of context and alert noise

When teams are dealing with loads of low-quality, context-poor data flowing from dozens of different threat intelligence or performance feeds, they are bound to run into trouble. This is the sort of environment in which false positives and redundant alerts proliferate, and low-priority noise distracts from real threats and performance issues. These "false alarms" can grind the life out of IT, DevOps and security teams.

Simply feeding these massive telemetry streams into a large language model (LLM) isn't a viable solution, either. For one, it's a waste of compute. It's also a great way to produce hallucinations.

A practical solution starts with developing a workflow that synthesizes raw data and aggregates this higher-quality, context-rich data within a centralized platform. There it can be used for enterprise-wide observability and the training of local AI models.
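
To make this concrete, the sketch below (in Python) shows one way raw alerts from several feeds could be normalized, enriched with ownership and deployment context, and aggregated in a single store. The names used here (RawAlert, SERVICE_CATALOG, CENTRAL_STORE and so on) are hypothetical stand-ins, not any particular product's API.

```python
# Minimal sketch of a telemetry-synthesis step: raw alerts from several feeds
# are normalized, enriched with context and pushed to one central store.
# All names here are illustrative assumptions, not a specific product's API.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class RawAlert:
    source: str          # which monitoring tool emitted the alert
    resource: str        # e.g. "node-14" or "payments-api"
    signal: str          # e.g. "cpu_high", "error_rate"
    value: float
    timestamp: datetime

@dataclass
class EnrichedAlert:
    raw: RawAlert
    service_owner: str
    environment: str
    recent_deploys: list = field(default_factory=list)

# Hypothetical context lookups that a real pipeline would back with a CMDB,
# a service catalog or a deployment history store.
SERVICE_CATALOG = {"payments-api": ("payments-team", "prod")}
DEPLOY_HISTORY = {"payments-api": ["payments-api:2.4.1 @ 10:02 UTC"]}

CENTRAL_STORE: list[EnrichedAlert] = []   # stand-in for the centralized platform

def enrich(alert: RawAlert) -> EnrichedAlert:
    owner, env = SERVICE_CATALOG.get(alert.resource, ("unknown", "unknown"))
    return EnrichedAlert(
        raw=alert,
        service_owner=owner,
        environment=env,
        recent_deploys=DEPLOY_HISTORY.get(alert.resource, []),
    )

def ingest(alerts: list[RawAlert]) -> None:
    for alert in alerts:
        CENTRAL_STORE.append(enrich(alert))

ingest([RawAlert("apm-tool", "payments-api", "error_rate", 0.12,
                 datetime.now(timezone.utc))])
print(CENTRAL_STORE[0].service_owner)  # -> "payments-team"
```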

Fragmented tools

Enterprises often use many performance and security monitoring solutions: large enterprises have an average of 76 security tools. These tools can be team- or product-specific, or specific to a certain IT environment (on-premises solutions vs. cloud solutions, for example).

Each of these tools might be responsible for monitoring dozens or hundreds of applications, application programming interfaces (APIs) or servers, each feeding its own data pipeline. With such silos, separate tools can generate multiple alerts stemming from the same underlying issue. This lack of integration limits visibility, which hampers correlation and root cause analysis. SREs waste time chasing down each of these alerts before identifying the redundancies.
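
One illustrative way to collapse that redundancy is to group alerts by the affected resource within a short time window. The sketch below assumes simple dictionary-shaped alerts and a five-minute window; both are stand-ins rather than a prescribed schema.

```python
# Sketch of deduplicating alerts that separate tools raise for the same
# underlying issue, by grouping on the affected resource within a short time
# window. Field names and the 5-minute window are illustrative assumptions.
from collections import defaultdict
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)

def dedupe(alerts: list[dict]) -> list[dict]:
    """Collapse alerts from different tools into one record per resource/window."""
    groups: dict[tuple, list[dict]] = defaultdict(list)
    for alert in sorted(alerts, key=lambda a: a["timestamp"]):
        bucket = int(alert["timestamp"].timestamp() // WINDOW.total_seconds())
        groups[(alert["resource"], bucket)].append(alert)

    collapsed = []
    for (resource, _), members in groups.items():
        collapsed.append({
            "resource": resource,
            "first_seen": members[0]["timestamp"],
            "sources": sorted({m["source"] for m in members}),
            "count": len(members),
        })
    return collapsed

alerts = [
    {"source": "infra-monitor", "resource": "node-14",
     "timestamp": datetime(2025, 8, 29, 10, 0)},
    {"source": "apm-tool", "resource": "node-14",
     "timestamp": datetime(2025, 8, 29, 10, 2)},
]
print(dedupe(alerts))  # one collapsed record citing both tools
```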

Poor data integration and visibility

When data streams are not integrated into a comprehensive monitoring system, IT teams don't have the system-wide observability needed for efficient alert correlation, root cause analysis and remediation.

What's worse, this lack of integration hinders the efficacy of automation tools for alert management, such as the alert prioritization and correlation workflows set up to assist in detection and resolution and to reduce alert volume. Teams are left to manually connect the dots, an arduous and time-consuming (if not impossible) task.

A survey cited in Deloitte's "Adaptive Defense: Custom Alerts for Modern Threats" report found that a "lack of visibility or context from security tools resulted in 47% of attacks being missed in a 12-month period."

While individual agents don't necessarily require centralization, a centralized platform where data from agents is aggregated facilitates system-wide analysis, storage and visualization.

Can AI and agentic solutions deliver some relief?

Yes… with a focused strategy.

A recent MIT report ignited a firestorm with the claim that "95% of organizations are getting zero return" on their generative AI investments.

Setting aside the inflammatory stat, and the cascade of opinions the report elicited, it highlights a valuable theme: many AI projects fail because of "brittle workflows, lack of contextual learning, and misalignment with day-to-day operations." As Marina Danilevsky, Senior Research Scientist at IBM, notes on a recent Mixture of Experts podcast, the most successful deployments are "focused, scoped and address a proper pain point."

The MIT report reinforces that companies that view AI as a panacea, or as something that can be haphazardly shoehorned into a process, aren't likely to see a return on their investment. Organizations that strategically implement AI tools in their workflows to solve a specific problem, and reinforce those tools over time, are better positioned for success.

What might this specific implementation look like?

An observability or security solution that incorporates adaptive machine learning, contextual prioritization, explainable AI, AI-powered automation and real-time intelligence into an integrated strategy can enable teams to create stronger workflows that help correlate, prioritize and remediate performance or security alerts.

AI agents can improve traditional systems that rely on static rules and preset thresholds by bringing factors like asset importance, performance guarantees, risk profiles and historical trends to bear.
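
As a rough illustration, a contextual priority score might weigh those factors alongside the raw signal instead of firing on a static threshold alone. The weights and 0-to-1 scales below are purely illustrative assumptions.

```python
# Sketch of contextual prioritization: rather than a static CPU threshold
# alone, the score weighs asset importance, SLO risk and historical trend.
# The weights and the 0-1 factor scales are illustrative assumptions.
def priority_score(cpu_utilization: float,
                   asset_importance: float,        # 0.0 (low) .. 1.0 (business critical)
                   slo_burn_rate: float,           # 0.0 .. 1.0, how fast error budget burns
                   deviation_from_baseline: float  # 0.0 .. 1.0 vs. historical trend
                   ) -> float:
    static_signal = 1.0 if cpu_utilization > 0.9 else cpu_utilization
    return round(
        0.3 * static_signal
        + 0.3 * asset_importance
        + 0.25 * slo_burn_rate
        + 0.15 * deviation_from_baseline, 3)

# A noisy batch job at 95% CPU scores lower than a critical API at 80% CPU.
print(priority_score(0.95, asset_importance=0.2,
                     slo_burn_rate=0.1, deviation_from_baseline=0.3))  # 0.43
print(priority_score(0.80, asset_importance=1.0,
                     slo_burn_rate=0.7, deviation_from_baseline=0.8))  # 0.835
```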

For example, consider a post-incident detection and remediation workflow, and how an AI agent might assist an SRE team.

A notification hits the alert system flagging high CPU usage for a node in a Kubernetes cluster. In a traditional system, SREs might need to comb through MELT data (metrics, events, logs, traces) and dependencies to identify the root cause.

In this hypothetical agentic workflow, the agent uses the observability tool's knowledge graph and topology-aware correlation to pull only the telemetry related to the alert (such as logs for the services running on that node, recent deployments, or telemetry from the Kubernetes API server and the load balancers that route traffic to the node or cluster). With this additional information, the agent can enrich raw alerts and provide context-rich telemetry to a local AI model trained on the enterprise's performance data and benchmarks.
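
A minimal sketch of that topology-aware scoping might walk a small dependency graph outward from the alerting node and fetch telemetry only for entities within a couple of hops. The TOPOLOGY map and fetch_telemetry helper below are hypothetical stand-ins for an observability tool's knowledge graph and data stores.

```python
# Topology-aware scoping sketch: breadth-first walk from the alerting node,
# limited to a few hops, then fetch telemetry only for entities in scope.
# TOPOLOGY and fetch_telemetry are hypothetical stand-ins.
from collections import deque

TOPOLOGY = {  # entity -> entities it is connected to
    "node-14": ["payments-api", "checkout-api", "kube-apiserver"],
    "payments-api": ["payments-db", "ingress-lb"],
    "checkout-api": ["ingress-lb"],
}

def related_entities(start: str, max_hops: int = 2) -> set[str]:
    """Return the alerting entity plus everything within max_hops of it."""
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        entity, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for neighbor in TOPOLOGY.get(entity, []):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen

def fetch_telemetry(entity: str) -> dict:
    # Placeholder for calls into log, metric and deployment stores.
    return {"entity": entity, "logs": [], "recent_deploys": []}

scope = related_entities("node-14")
context = [fetch_telemetry(e) for e in scope]   # only telemetry tied to the alert
```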

The agent excludes irrelevant information, such as logs for unrelated services that happen to run on the same cluster. During this context gathering, the agent can also identify related signals, correlate alerts that likely stem from the same root cause and group them together to be investigated as one incident.
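
Continuing the sketch, alerts whose resources fall inside that scoped neighborhood could be folded into a single incident, while out-of-scope alerts are left alone. The scope set is hard-coded here for self-containment; in practice it could come from a topology walk like the one above.

```python
# Sketch of grouping alerts that likely share a root cause: alerts whose
# resources sit inside the scoped neighborhood become one incident, the rest
# are kept separate. The scope set below is a hard-coded illustration.
def group_into_incident(alerts: list[dict], scope: set[str]) -> dict:
    related = [a for a in alerts if a["resource"] in scope]
    unrelated = [a for a in alerts if a["resource"] not in scope]
    incident = {
        "title": "High CPU on node-14 and downstream services",
        "alerts": related,            # investigated as a single incident
        "suspected_shared_cause": True,
    }
    return {"incident": incident, "out_of_scope": unrelated}

alerts = [
    {"resource": "node-14", "signal": "cpu_high"},
    {"resource": "payments-api", "signal": "latency_p99"},
    {"resource": "batch-reporting", "signal": "job_retry"},   # unrelated service
]
print(group_into_incident(alerts, scope={"node-14", "payments-api", "ingress-lb"}))
```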

With this information, the model can propose a hypothesis. The agent can also request more information (perhaps checking container configurations or time series data around the usage spike) to test and refine the model's hypothesis, adding context before proposing a probable root cause.
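
The loop below sketches that refinement cycle under the same assumptions: propose_hypothesis stands in for a call to the locally trained model, fetch_evidence for configuration and time-series queries, and the agent iterates until the model's confidence clears a threshold.

```python
# Hypothesis-refinement loop sketch: the model proposes a cause, the agent
# fetches the extra evidence the model asks for, and the loop stops once
# confidence clears a threshold. Both helpers are hypothetical stand-ins.
def propose_hypothesis(context: dict) -> dict:
    # Placeholder for a call to a locally trained model.
    confident = "container_limits" in context
    return {
        "root_cause": ("memory limit too low causing CPU-heavy GC"
                       if confident else "unknown resource pressure"),
        "confidence": 0.85 if confident else 0.4,
        "needs": [] if confident else ["container_limits", "cpu_timeseries"],
    }

def fetch_evidence(item: str) -> str:
    return f"<{item} data>"   # placeholder for config / time-series queries

def investigate(context: dict, threshold: float = 0.8, max_rounds: int = 3) -> dict:
    for _ in range(max_rounds):
        hypothesis = propose_hypothesis(context)
        if hypothesis["confidence"] >= threshold or not hypothesis["needs"]:
            return hypothesis
        for item in hypothesis["needs"]:      # agent gathers what the model asked for
            context[item] = fetch_evidence(item)
    return hypothesis

print(investigate({"alert": "cpu_high on node-14"}))
```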
