Your Agent Works.
You Just Can’t Prove It.
57% of teams have agents in production.
Only 37% evaluate whether their outputs are correct.
By Saheb Singh · Enterprise AI, American Express. Ex-Google. CMU CS.
The Failure Spectrum
Four ways agents fail — and why you won’t see it coming.
LangChain 2025: Quality is the #1 barrier. 32% of teams cite it. Not cost. Not latency. Quality.
The Hallucination Cascade
Chatbots hallucinate and a human catches it. Agents hallucinate and then act on it. This is the fundamental difference: when an autonomous system generates false information at step 3 of a 12-step chain, every subsequent step builds on fiction.
Gemini's code agent generated 693 lines of fabricated code that looked syntactically perfect, passed its own internal checks, and would have shipped to production. The agent was confident. The code was fiction. No alarm fired.
This isn't a model problem — it's an architecture problem. Single-pass chatbot hallucinations are annoying. Multi-step agent hallucinations are compounding errors with real-world consequences.
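The compounding is easy to quantify. Under the simplifying assumption that each step in the chain succeeds independently with the same probability, chain reliability decays geometrically, which is why a step that looks solid in isolation produces a chain that fails almost half the time:

```python
# Chain reliability under independent per-step success.
# (A simplifying assumption: real agent steps are correlated, and errors
# that propagate downstream can make things worse than this model suggests.)
def chain_reliability(per_step: float, steps: int) -> float:
    """Probability that every step in the chain succeeds."""
    return per_step ** steps

print(f"{chain_reliability(0.95, 1):.0%}")   # one call: looks fine
print(f"{chain_reliability(0.95, 12):.1%}")  # a 12-step chain: barely a coin flip
```

A 95%-reliable step yields roughly a 54% reliable 12-step chain, and that is before accounting for errors that silently corrupt later steps rather than failing outright.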
The Tool Misfire
MCP (Model Context Protocol) hit 97 million monthly SDK downloads. It's the USB-C of AI — a universal standard for connecting agents to tools. But 43% of MCP implementations have command injection vulnerabilities.
Supabase's MCP server had a flaw that let agents read data they shouldn't have access to. Not a theoretical risk — a shipping product with a data leak built into the tool interface. The agent didn't hack anything. It used the tool exactly as designed. The design was broken.
Tool misuse isn't always a security flaw. Sometimes the agent calls the right tool with wrong parameters. DELETE instead of GET. Production database instead of staging. The tool works perfectly — on the wrong target.
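One mitigation is a pre-execution guard that validates method and target before any tool runs. This is a minimal sketch, not a real framework: the method names, environment labels, and allowlists here are hypothetical, and a production guard would be policy-driven rather than hard-coded.

```python
# A minimal pre-execution guard for agent tool calls (illustrative only;
# method names, environments, and allowlists are hypothetical).
ALLOWED_METHODS = {"GET", "HEAD"}   # read-only by default
ALLOWED_TARGETS = {"staging"}       # production requires explicit escalation

class ToolCallRejected(Exception):
    """Raised when a tool call fails validation before execution."""

def guard_tool_call(method: str, target_env: str) -> None:
    """Reject destructive methods and unapproved targets before the tool runs."""
    if method.upper() not in ALLOWED_METHODS:
        raise ToolCallRejected(f"method {method!r} requires human approval")
    if target_env not in ALLOWED_TARGETS:
        raise ToolCallRejected(f"target {target_env!r} is not an approved environment")

guard_tool_call("GET", "staging")           # passes silently
try:
    guard_tool_call("DELETE", "production")  # DELETE on prod: blocked
except ToolCallRejected as e:
    print(f"blocked: {e}")
```

The point is placement: the check runs between the agent's decision and the tool's execution, which is exactly where "right tool, wrong target" failures slip through.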
Goal Drift
Academic research identified 14 distinct failure modes in multi-agent systems. The subtlest: goal drift. Over 7-12 steps, the agent's working objective mutates away from what you asked for. Not wrong — adjacent. Close enough to look right, far enough to be useless.
In a real-world test, multi-agent systems failed 41-86.7% of the time depending on task complexity. The failures weren't dramatic crashes. They were quiet deviations — the agent solving a slightly different problem than the one it was given.
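Drift can be checked mechanically by comparing the agent's current working objective to the original task at each step. The sketch below uses word overlap as a stand-in similarity metric; a real system would use an embedding model, and the threshold is a hypothetical value you would tune per task. The point is the control point, not the metric.

```python
# Toy goal-drift check. Jaccard word overlap stands in for a real
# embedding-based similarity; the threshold is a hypothetical tuning value.
def jaccard(a: str, b: str) -> float:
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

DRIFT_THRESHOLD = 0.3  # hypothetical; tune per task

def has_drifted(original_goal: str, working_goal: str) -> bool:
    """True if the working objective has moved too far from the original task."""
    return jaccard(original_goal, working_goal) < DRIFT_THRESHOLD

original = "summarize the q3 revenue report for the board"
step_7   = "draft a marketing plan based on q3 revenue trends"
print(has_drifted(original, step_7))  # adjacent, not identical: flagged
```

Run on every step, a check like this turns a quiet deviation into a loggable event instead of a finished deliverable that solves the wrong problem.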
This is why Gartner predicts 40%+ of agentic AI projects will be canceled by 2027. Not because agents can't work — because the gap between 'works in demos' and 'works reliably at scale' is larger than most organizations expect.
The Silent Failure
No monitoring · No eval · No human check
Only 37% run online evaluations — LangChain 2025
89% of teams with agents in production have observability. Only 37% run online evaluations. Only 52% run offline evals. Translation: most teams can see their agent running but can't tell if it's running correctly.
The LangChain 2025 survey of 1,300 AI professionals found quality is the #1 barrier to production — cited by 32% of respondents. Not cost. Not latency. Not security. Quality. The agent works, ships output, and nobody knows if the output is right until a customer complains.
Air Canada learned this when their chatbot promised a bereavement discount that didn't exist. The system ran for months. No monitoring caught the fabricated policy. A customer sued. The airline lost.
When aggregate metrics hide the truth
War Story
Klarna’s 700-Agent Bet — and What It Actually Proved
The most-cited AI deployment success story of 2024 became the most instructive failure of 2025. Not because the AI was bad — because the autonomy boundary was wrong.
Klarna deploys an AI customer service agent powered by OpenAI. The ambition is massive: fully autonomous customer support at scale. The agent handles returns, refunds, FAQ, and complaint routing. No human-in-the-loop for routine queries. Leadership frames it as the future of customer service.
Next issue
Coming Soon
The next deep dive is in the works. Subscribe to get it the moment it drops.
Before you ship
Readiness Assessment
Is your agent actually ready for production?
57% of teams have agents in production. Most shipped without answering these questions. Walk through honestly — the right answer might be “not yet.”
Can you trace every agent decision back to its reasoning chain?
89% of teams have observability. Only 37% can actually evaluate whether outputs are correct. Monitoring uptime is not the same as monitoring quality.
Failure Taxonomy
6 ways your agent will break. And what to build before it does.
Microsoft’s research team cataloged these. Academic papers validated them. Your production system will encounter them. The question is whether you’ve built the controls before or after the incident.
The cross-silo insight: Aviation solved this 30 years ago. Crew Resource Management (CRM) reduced fatal accidents by 50% — not with better planes, but with better human-machine interaction protocols. Checklists. Structured handoffs. Graduated autonomy. Mandatory callouts before irreversible actions. The agentic AI industry is re-learning what aviation already knew: autonomy without structured oversight isn’t innovation. It’s negligence.
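An aviation-style "callout before irreversible action" translates directly into code: classify actions, and block the irreversible ones until an explicit human sign-off arrives. The action names and approval mechanism below are hypothetical; the structural idea is graduated autonomy, not this particular list.

```python
# Graduated autonomy sketch: reversible actions run freely, irreversible ones
# hold for human sign-off. Action names and the approval flag are hypothetical.
IRREVERSIBLE = {"delete_record", "issue_refund", "send_email"}

def execute(action: str, approved: bool = False) -> str:
    """Run a reversible action immediately; hold irreversible ones for approval."""
    if action in IRREVERSIBLE and not approved:
        return f"HOLD: {action} requires human sign-off"
    return f"EXECUTED: {action}"

print(execute("fetch_order_status"))           # reversible: runs freely
print(execute("issue_refund"))                 # blocked pending the callout
print(execute("issue_refund", approved=True))  # runs after sign-off
```

The gate is deliberately dumb. Like a pre-takeoff checklist, its value is not intelligence but the guarantee that the pause happens every single time.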