Your Agent Works.
You Just Can’t Prove It.
57% of teams have agents in production.
Only 37% evaluate whether their outputs are correct.
By Saheb Singh · Enterprise AI, American Express. Ex-Google. CMU CS.
The Failure Spectrum
Four ways agents fail — and why you won’t see it coming.
LangChain 2025: Quality is the #1 barrier. 32% of teams cite it. Not cost. Not latency. Quality.
The Hallucination Cascade
Chatbots hallucinate and a human catches it. Agents hallucinate and then act on it. This is the fundamental difference: when an autonomous system generates false information at step 3 of a 12-step chain, every subsequent step builds on fiction.
Gemini's code agent generated 693 lines of fabricated code that looked syntactically perfect, passed its own internal checks, and would have shipped to production. The agent was confident. The code was fiction. No alarm fired.
This isn't a model problem — it's an architecture problem. Single-pass chatbot hallucinations are annoying. Multi-step agent hallucinations are compounding errors with real-world consequences.
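The compounding is easy to quantify. Under the simplifying assumption that each step in the chain succeeds independently with the same probability, chain reliability decays geometrically, which is why a step that looks solid in isolation produces a chain that fails almost half the time:

```python
# Chain reliability under independent per-step success.
# (A simplifying assumption: real agent steps are correlated, and errors
# that propagate downstream can make things worse than this model suggests.)
def chain_reliability(per_step: float, steps: int) -> float:
    """Probability that every step in the chain succeeds."""
    return per_step ** steps

print(f"{chain_reliability(0.95, 1):.0%}")   # one call: looks fine
print(f"{chain_reliability(0.95, 12):.1%}")  # a 12-step chain: barely a coin flip
```

A 95%-reliable step yields roughly a 54% reliable 12-step chain, and that is before accounting for errors that silently corrupt later steps rather than failing outright.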
The Tool Misfire
MCP (Model Context Protocol) hit 97 million monthly SDK downloads. It's the USB-C of AI — a universal standard for connecting agents to tools. But 43% of MCP implementations have command injection vulnerabilities.
Supabase's MCP server had a flaw that let agents read data they shouldn't have access to. Not a theoretical risk — a shipping product with a data leak built into the tool interface. The agent didn't hack anything. It used the tool exactly as designed. The design was broken.
Tool misuse isn't always a security flaw. Sometimes the agent calls the right tool with wrong parameters. DELETE instead of GET. Production database instead of staging. The tool works perfectly — on the wrong target.
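One mitigation is a pre-execution guard that validates method and target before any tool runs. This is a minimal sketch, not a real framework: the method names, environment labels, and allowlists here are hypothetical, and a production guard would be policy-driven rather than hard-coded.

```python
# A minimal pre-execution guard for agent tool calls (illustrative only;
# method names, environments, and allowlists are hypothetical).
ALLOWED_METHODS = {"GET", "HEAD"}   # read-only by default
ALLOWED_TARGETS = {"staging"}       # production requires explicit escalation

class ToolCallRejected(Exception):
    """Raised when a tool call fails validation before execution."""

def guard_tool_call(method: str, target_env: str) -> None:
    """Reject destructive methods and unapproved targets before the tool runs."""
    if method.upper() not in ALLOWED_METHODS:
        raise ToolCallRejected(f"method {method!r} requires human approval")
    if target_env not in ALLOWED_TARGETS:
        raise ToolCallRejected(f"target {target_env!r} is not an approved environment")

guard_tool_call("GET", "staging")           # passes silently
try:
    guard_tool_call("DELETE", "production")  # DELETE on prod: blocked
except ToolCallRejected as e:
    print(f"blocked: {e}")
```

The point is placement: the check runs between the agent's decision and the tool's execution, which is exactly where "right tool, wrong target" failures slip through.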
Goal Drift
Academic research identified 14 distinct failure modes in multi-agent systems. The subtlest: goal drift. Over 7-12 steps, the agent's working objective mutates away from what you asked for. Not wrong — adjacent. Close enough to look right, far enough to be useless.
In a real-world test, multi-agent systems failed 41-86.7% of the time depending on task complexity. The failures weren't dramatic crashes. They were quiet deviations — the agent solving a slightly different problem than the one it was given.
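Drift can be checked mechanically by comparing the agent's current working objective to the original task at each step. The sketch below uses word overlap as a stand-in similarity metric; a real system would use an embedding model, and the threshold is a hypothetical value you would tune per task. The point is the control point, not the metric.

```python
# Toy goal-drift check. Jaccard word overlap stands in for a real
# embedding-based similarity; the threshold is a hypothetical tuning value.
def jaccard(a: str, b: str) -> float:
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

DRIFT_THRESHOLD = 0.3  # hypothetical; tune per task

def has_drifted(original_goal: str, working_goal: str) -> bool:
    """True if the working objective has moved too far from the original task."""
    return jaccard(original_goal, working_goal) < DRIFT_THRESHOLD

original = "summarize the q3 revenue report for the board"
step_7   = "draft a marketing plan based on q3 revenue trends"
print(has_drifted(original, step_7))  # adjacent, not identical: flagged
```

Run on every step, a check like this turns a quiet deviation into a loggable event instead of a finished deliverable that solves the wrong problem.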
This is why Gartner predicts 40%+ of agentic AI projects will be canceled by 2027. Not because agents can't work — because the gap between 'works in demos' and 'works reliably at scale' is larger than most organizations expect.
The Silent Failure
No monitoring · No eval · No human check
Only 37% run online evaluations — LangChain 2025
89% of teams with agents in production have observability. Only 37% run online evaluations. Only 52% run offline evals. Translation: most teams can see their agent running but can't tell if it's running correctly.
The LangChain 2025 survey of 1,300 AI professionals found quality is the #1 barrier to production — cited by 32% of respondents. Not cost. Not latency. Not security. Quality. The agent works, ships output, and nobody knows if the output is right until a customer complains.
Air Canada learned this when their chatbot promised a bereavement discount that didn't exist. The system ran for months. No monitoring caught the fabricated policy. A customer sued. The airline lost.
When aggregate metrics hide the truth
War Story
Klarna’s 700-Agent Bet — and What It Actually Proved
The most-cited AI deployment success story of 2024 became the most instructive failure of 2025. Not because the AI was bad — because the autonomy boundary was wrong.
Klarna deploys an AI customer service agent powered by OpenAI. The ambition is massive: fully autonomous customer support at scale. The agent handles returns, refunds, FAQ, and complaint routing. No human-in-the-loop for routine queries. Leadership frames it as the future of customer service.
Next issue
Coming Soon
The next deep dive is in the works. Subscribe to get it the moment it drops.
Before you ship
Readiness Assessment
Is your agent actually ready for production?
57% of teams have agents in production. Most shipped without answering these questions. Walk through honestly — the right answer might be “not yet.”
Can you trace every agent decision back to its reasoning chain?
89% of teams have observability. Only 37% can actually evaluate whether outputs are correct. Monitoring uptime is not the same as monitoring quality.
Failure Taxonomy
6 ways your agent will break. And what to build before it does.
Microsoft’s research team cataloged these. Academic papers validated them. Your production system will encounter them. The question is whether you’ve built the controls before or after the incident.
The cross-silo insight: Aviation solved this 30 years ago. Crew Resource Management (CRM) reduced fatal accidents by 50% — not with better planes, but with better human-machine interaction protocols. Checklists. Structured handoffs. Graduated autonomy. Mandatory callouts before irreversible actions. The agentic AI industry is re-learning what aviation already knew: autonomy without structured oversight isn’t innovation. It’s negligence.
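An aviation-style "callout before irreversible action" translates directly into code: classify actions, and block the irreversible ones until an explicit human sign-off arrives. The action names and approval mechanism below are hypothetical; the structural idea is graduated autonomy, not this particular list.

```python
# Graduated autonomy sketch: reversible actions run freely, irreversible ones
# hold for human sign-off. Action names and the approval flag are hypothetical.
IRREVERSIBLE = {"delete_record", "issue_refund", "send_email"}

def execute(action: str, approved: bool = False) -> str:
    """Run a reversible action immediately; hold irreversible ones for approval."""
    if action in IRREVERSIBLE and not approved:
        return f"HOLD: {action} requires human sign-off"
    return f"EXECUTED: {action}"

print(execute("fetch_order_status"))           # reversible: runs freely
print(execute("issue_refund"))                 # blocked pending the callout
print(execute("issue_refund", approved=True))  # runs after sign-off
```

The gate is deliberately dumb. Like a pre-takeoff checklist, its value is not intelligence but the guarantee that the pause happens every single time.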