
The key to production AI agents: Evaluations

Organizations are eager to deploy GenAI agents to do things like automate workflows, answer customer inquiries and improve productivity. But in practice, most agents hit a wall before they reach production.

According to a recent survey by Economist Impact and Databricks, 85 percent of organizations actively use GenAI in at least one business function, and 73 percent of companies say GenAI is critical to their long-term strategic goals. Innovations in agentic AI have added even more excitement and strategic importance to enterprise AI initiatives. Yet despite this widespread adoption, many organizations find that their GenAI projects stall out after the pilot.

Today's LLMs demonstrate remarkable capabilities across a broad range of general tasks. But it is not practical to rely on off-the-shelf models, no matter how sophisticated, for business-specific, accurate and well-governed outputs. This gap between general AI capabilities and specific business needs often prevents agents from moving beyond experimental deployments in an enterprise setting.

To trust and scale AI agents in production, organizations need an agent platform that connects to their enterprise data and continuously measures and improves their agents' accuracy. Success requires domain-specific agents that understand your business context, paired with thorough AI evaluations that ensure outputs remain accurate, relevant and compliant.

This blog will discuss why generic metrics often fail in enterprise environments, what effective evaluation systems require and how to create continuous optimization that builds user trust.

Move beyond one-size-fits-all evaluations

You cannot responsibly deploy an AI agent if you can't measure whether it produces high-quality, enterprise-specific responses at scale. Historically, most organizations have lacked a systematic way to evaluate agent quality, relying instead on informal "vibe checks": quick, impression-based assessments of whether the output feels right or aligns with brand tone. Relying solely on those gut checks is comparable to walking through only the obvious success scenario of a substantial software rollout before it goes live; no one would consider that sufficient validation for a mission-critical system.

The other common approach is to rely on general evaluation frameworks that were never designed for an enterprise's specific business, tasks and data. These off-the-shelf evaluations break down when AI agents tackle domain-specific problems. For example, generic benchmarks can't assess whether an agent correctly interprets internal documentation, provides accurate customer support based on proprietary policies or delivers sound financial analysis based on company-specific data and industry regulations.
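To make the contrast concrete, here is a minimal sketch of what enterprise-specific evaluation data might look like. The policy files, questions and expected behaviors below are hypothetical placeholders rather than anything from an actual deployment; the point is that every case is anchored to internal documents and business rules instead of a public benchmark.

```python
from dataclasses import dataclass, field


@dataclass
class EvalCase:
    """One domain-specific evaluation case tied to internal knowledge."""
    question: str                  # what a real user would ask
    grounding_docs: list[str]      # internal sources a correct answer must draw from
    must_mention: list[str] = field(default_factory=list)    # facts a correct answer includes
    must_not_claim: list[str] = field(default_factory=list)  # claims that would violate policy


# Hypothetical cases built from proprietary policies, not public benchmarks.
EVAL_SET = [
    EvalCase(
        question="Can a customer return a custom-built item after 45 days?",
        grounding_docs=["returns_policy_v7.md"],
        must_mention=["30-day window", "custom-built items are final sale"],
        must_not_claim=["full refund"],
    ),
    EvalCase(
        question="What discount applies to enterprise renewals signed before Q4?",
        grounding_docs=["pricing_playbook_2025.md"],
        must_mention=["renewal discounts require regional approval"],
    ),
]


def score_case(answer: str, case: EvalCase) -> float:
    """Crude keyword check; in practice an LLM judge or human rubric scores each case."""
    hits = sum(phrase.lower() in answer.lower() for phrase in case.must_mention)
    violations = sum(phrase.lower() in answer.lower() for phrase in case.must_not_claim)
    return max(0.0, hits / max(len(case.must_mention), 1) - violations)
```

A public benchmark could never tell you whether "custom-built items are final sale" is the right answer; only evaluation data derived from your own policies can.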

Trust in AI agents erodes through these critical failure points:

  • Organizations lack mechanisms to measure correctness within their unique knowledge base.
  • Business owners cannot trace how agents arrived at specific decisions or outputs.
  • Teams cannot quantify improvements across iterations, making it difficult to demonstrate progress or justify continued investment.

Ultimately, evaluation without context equals expensive guesswork and makes improving AI agents exceedingly difficult. Quality challenges can emerge from any component in the AI chain, from query parsing to information retrieval to response generation, creating a debugging nightmare where teams struggle to identify root causes and implement fixes quickly.
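One way to escape that debugging nightmare is to evaluate each stage of the chain separately instead of only the final answer. The sketch below assumes a simple retrieval-augmented agent with hypothetical parse_query, retrieve and generate functions; the specific names are illustrative, but the structure shows how per-stage signals let a bad answer be traced to a specific component.

```python
from typing import Callable


def evaluate_pipeline(
    question: str,
    expected_sources: set[str],                  # doc IDs a correct answer should rely on
    parse_query: Callable[[str], str],           # hypothetical stage: rewrite the user question
    retrieve: Callable[[str], list[dict]],       # hypothetical stage: fetch candidate documents
    generate: Callable[[str, list[dict]], str],  # hypothetical stage: draft the answer
) -> dict:
    """Run one request through the chain, scoring each stage independently."""
    report = {}

    parsed = parse_query(question)
    report["parse_ok"] = bool(parsed.strip())  # did query parsing produce anything usable?

    docs = retrieve(parsed)
    retrieved_ids = {d["id"] for d in docs}
    report["retrieval_recall"] = (
        len(retrieved_ids & expected_sources) / len(expected_sources)
        if expected_sources else 1.0
    )  # did the right documents come back?

    answer = generate(question, docs)
    # Groundedness is normally scored with an LLM judge or citation check;
    # a rough lexical-overlap proxy stands in for it here.
    doc_terms = {w for d in docs for w in d["text"].lower().split()}
    answer_terms = set(answer.lower().split())
    report["grounding_overlap"] = (
        len(answer_terms & doc_terms) / len(answer_terms) if answer_terms else 0.0
    )
    report["answer"] = answer
    return report
```

With per-stage signals like these, a drop in quality can be attributed to retrieval versus generation rather than debugged end to end.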

Build evaluation systems that actually work

Effective agent evaluation requires a systems-thinking approach built around three critical concepts, sketched in code after this list:

  • Task-level benchmarking: Assess whether agents can complete specific workflows, not just answer random questions. For example, can it process a customer refund from start to finish?
  • Grounded evaluation: Ensure responses draw from internal knowledge and enterprise context, not generic public information. Does your legal AI agent reference actual company contracts or generic legal principles?
  • Change tracking: Monitor how performance changes across model updates and system modifications. This prevents scenarios where minor system updates unexpectedly degrade agent performance in production.
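Below is a minimal sketch of how these three concepts might be wired together in practice. The agent interface, case format and file-based history are assumptions made for illustration, not a specific Databricks API: the harness runs task-level cases, records a groundedness score, and keys every result to an agent version so regressions introduced by updates are visible.

```python
import json
import statistics
from pathlib import Path
from typing import Callable

HISTORY_FILE = Path("eval_history.json")  # assumed location for tracked results


def run_evaluation(
    agent: Callable[[str], dict],  # assumed to return {"answer": str, "sources": [doc IDs]}
    cases: list[dict],             # task-level cases: prompt, required_steps, allowed_sources
    version: str,                  # label for the model/prompt/tool configuration under test
) -> dict:
    task_scores, grounding_scores = [], []

    for case in cases:
        result = agent(case["prompt"])
        answer, sources = result["answer"], set(result.get("sources", []))

        # Task-level benchmarking: did every required step of the workflow show up?
        steps_hit = sum(step.lower() in answer.lower() for step in case["required_steps"])
        task_scores.append(steps_hit / len(case["required_steps"]))

        # Grounded evaluation: did the answer cite only approved internal sources?
        allowed = set(case["allowed_sources"])
        grounding_scores.append(1.0 if sources and sources <= allowed else 0.0)

    summary = {
        "version": version,
        "task_completion": statistics.mean(task_scores),
        "groundedness": statistics.mean(grounding_scores),
    }

    # Change tracking: append to a running history so model or prompt updates
    # can be compared run over run instead of silently regressing.
    history = json.loads(HISTORY_FILE.read_text()) if HISTORY_FILE.exists() else []
    history.append(summary)
    HISTORY_FILE.write_text(json.dumps(history, indent=2))
    return summary
```

Comparing the latest summary against earlier entries in the history is usually enough to catch a model swap or prompt change that degrades task completion or groundedness before it reaches production.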

Enterprise agents are deeply tied to enterprise context and must navigate private data sources, proprietary business logic and task-specific workflows that define how real organizations operate. AI evaluations must be custom-built around each agent's specific purpose, which varies across use cases and organizations.

But building effective evaluation is only the first step. The real value comes from turning that evaluation data into continuous improvement. The most sophisticated organizations are moving toward platforms that enable auto-optimized agents: systems where high-quality, domain-specific agents can be built by simply describing the task and desired outcomes. These platforms handle evaluation, optimization and continuous improvement automatically, allowing teams to focus on business outcomes rather than technical details.
