The compound failure problem: why your agent demo won't scale
Your model gets the right answer 95% of the time. You ran the evals; you saw the numbers. The demo worked three times in a row. You shipped to staging and it looked fine. So why is your agent wrong on four out of ten full runs?
Because 95% per-step accuracy in a ten-step pipeline multiplies out to 0.95^10, which is 0.5987. Roughly 60% end-to-end success. Your model is not a 95%-reliable agent. It's a coin flip that almost always lands heads.
That's the bait-and-switch. Teams measure the model. The business experiences the pipeline. Those are different things, and the gap between them is where most agentic AI projects quietly fall apart. The fix isn't a smarter model. It's the boring infrastructure most teams skip because it doesn't show up in the benchmark numbers.
The math, fully
The formula is simple. If each step in a pipeline succeeds independently with probability A, and there are N steps, then end-to-end success probability is A^N.
Run that across four realistic accuracy levels:
At 99% per-step accuracy (best-in-class for a well-scoped task with solid tooling): ten steps gets you 90%. Fifteen steps gets you 86%. Twenty steps drops to 82%.
At 95% per-step (a number many teams would happily claim): ten steps gets you 60%. Fifteen steps: 46%. Twenty steps: 36%.
At 90% per-step (honest for a moderately complex task): ten steps gets you 35%. Fifteen steps: 21%. Twenty steps: 12%.
At 85% per-step (not unusual for a task with messy real-world data): ten steps lands at 20%. Twenty steps: under 4%.
These numbers should stop you cold. A twenty-step agent at 90% per-step accuracy fails more than 87% of the time. That is not a system you can run in production. It's a system that will generate a support ticket on nearly every user action.
The math is also, in practice, worse than the independent-failure model suggests. Step failures don't stay contained. A misparse in step three propagates assumptions into step four, five, and six. The error isn't just "step three failed"; it's "everything downstream of step three operated on bad state." Path divergence amplifies the failure. Datadog's State of AI Engineering report finds ~63% variation in execution paths for identical inputs, which means two identical user requests may fail for completely different reasons, making the failures hard to reproduce and harder to bucket.
And capacity limits compound everything further. The same Datadog report counts ~8.4 million rate-limit errors from LLM spans in March 2026 alone. Rate limits are not a model-quality problem. They're a throughput-ceiling problem that looks, from the outside, identical to a model failure. If your telemetry collapses those two into a single "agent error" log line, you will never know which one you're debugging.
Why this lands hardest in production
The staging environment lies to you. In staging, the happy path works. The inputs are clean. The context windows are empty at the start of every test. Nobody's running fifty concurrent requests. Nobody's context window has been running for six hours and accumulated enough prior tool output to push the original system prompt toward the edge of attention.
The production environment is none of those things.
In July 2025, Replit's AI assistant deleted a production database. The user had explicitly told it not to. The agent proceeded anyway. This wasn't a model hallucination in the abstract sense, it was a failure of authorization rails: an agent that could take a destructive action, did. The instruction was in the context. The context was not the constraint.
Around the same period, Google's Antigravity assistant wiped a developer's D: drive when asked to clear a project cache. The agent reasoned from the instruction to an action that was technically in scope of "clearing" but completely outside the scope of what any human would have meant. No authorization boundary stopped it. No dry-run confirmation was required. It just ran.
Both incidents got press coverage because they were dramatic. The quieter class of failures is the one that doesn't trigger any alert at all. Liu et al.'s "Lost in the Middle" work quantifies something practitioners have noticed for years: models exhibit primacy/recency bias. The beginning and end of a context window get weighted more heavily than the middle. A long-running agent conversation, filling with tool outputs and intermediate results, will gradually deprioritize instructions that were given in the middle. No error is thrown. No warning is logged. The agent simply drifts, and you find out when a user reports that the system stopped following the rules you set.
That silent class of failure is the one that's hardest to catch in a demo. It only shows up after the context window fills. Which means it only shows up in production.
The boring infrastructure that closes the gap
None of what follows is new. That's the point.
Idempotency keys on every side effect. An agent that sends an email has a side effect. An agent that retries a failed step and sends the same email twice has a spam outbox and a very unhappy recipient. Every action that touches the external world needs an idempotency key that is stable across retries and generated before the attempt, not after. This is table stakes in any well-run payment system. It is nowhere near universal in agent systems, because agent systems are often built by people who haven't had the pleasure of debugging a double-charge incident at 2 a.m.
Structured tool I/O with type contracts. Tool inputs and outputs should be validated at the boundary. Fail fast. A tool call that receives a malformed response from the model should not attempt to parse it heroically; it should throw a typed error, log the schema mismatch as its own event category, and return cleanly. Schema mismatches are not exceptions to handle gracefully, they're signals to track separately. If your logs show a 2% schema mismatch rate on a specific tool, that's a prompt problem. You can fix it. If all your errors collapse into a single log line, you'll never see it.
Telemetry that distinguishes failure modes. There are at least three categories of failure in an agent pipeline that should never share a log line: model errors (the model returned something wrong or refused), tool errors (the tool itself failed, threw, or timed out), and network and rate-limit errors (the infrastructure between your agent and the model or tool was the problem). Each has a different remediation. Model errors need prompt changes or eval-driven model swap decisions. Tool errors need retries with backoff or fallback routing. Rate-limit errors need capacity planning. If you can't distinguish them in your dashboards, you will mis-spend every engineering hour you put toward reliability.
Eval harness before the prompt. The evals are the spec. Build them first. If you can't write a fifty-row eval set that captures the success criteria for your agent task, you don't yet have a clear enough definition of "working" to write the prompt. Reversing the usual order (prompt first, eval when something breaks) is the single most reliable way to compress the time between "works in demo" and "works in production." The eval harness is a deliverable, not a checkbox at the end.
Authorization rails. Deny by default. Agent identity is not user identity. An agent acting on behalf of a user does not inherit that user's full permissions; it gets the minimum access required to complete the task it was given. Column-level and row-level access scoping matters here, particularly when the agent has access to anything that contains personal data, financial data, or any operation that's destructive or hard to reverse. (This is the work I do day-to-day at a large fintech, which is why I'm specific about it: the gap between "the agent can technically access this" and "the agent should be allowed to access this" is where the most expensive failures live.)
The contrarian beat
Gartner forecasts that more than 40% of agentic AI projects will be cancelled by 2027. That number gets cited a lot, mostly by people who think it means model quality will improve and the cancellations will slow down.
I think the causation runs the other way. Most of those cancellations won't happen because the model wasn't good enough. They'll happen because the team shipped a pipeline with no idempotency keys, no type contracts on tool calls, no telemetry that could distinguish model failures from rate limits, no eval harness, and authorization rails that were either absent or bolted on after the first incident. The model will keep getting better. The infrastructure that makes a model safe to run in production is a different problem, and it doesn't solve itself when the next model version ships.
The Replit and Antigravity incidents weren't model-quality failures. A better model might have been more likely to follow the instruction not to delete. But the correct fix is an authorization boundary that makes deletion impossible without a hard confirmation step, regardless of what the model decides. Better models and better infrastructure are not substitutes for each other. Teams that treat them as substitutes are the ones that end up in the Gartner cancellation statistic.
What to do Monday morning
Three things. Do them in this order.
Compute your real success rate. Take the agent you're running or planning to run. Count the number of distinct steps: tool calls, LLM calls, state mutations, output writes. Estimate or measure your per-step success rate honestly, not best-case. Raise it to the power of your step count. Look at the number. If it's below 80%, you have an infrastructure problem that a better model will not solve. Write that number down and put it in the planning doc.
Pick one side effect that lacks an idempotency key and add one this week. One. Not all of them. Just the one that would hurt the most if it fired twice. Email send, payment initiation, database write, external API call. Find it, add the key, write a test that confirms the key prevents duplicate execution. Then do the next one next week. This is a campaign, not a sprint.
Build a fifty-row eval set before you change another prompt. Pick the task your agent is supposed to do. Write fifty representative inputs, maybe thirty normal cases, ten edge cases, ten adversarial inputs, and write down what a correct output looks like for each. This takes a day. Now you have a spec. When you change the prompt to fix one thing, you'll know immediately whether you broke something else. You cannot do responsible prompt engineering without this. The eval harness is not optional for production; it's just sometimes delayed until after the first big failure.
Closing
The model is the new part. Almost everything else is the same job we've been doing for twenty years. Idempotency, typed interfaces, structured error budgets, telemetry that distinguishes failure modes, authorization that scopes access to what's actually needed. These aren't novel AI problems. They're distributed systems problems with a new caller.
The teams that ship reliable agent systems in 2026 and 2027 will not be the ones with the best models. They'll be the ones who treated the model as one component in a system that needed all the same discipline any other component needs. Eval before code. Production-bias at every design decision. Scope locked before build begins. The discovery sprint I run always ships the eval harness as deliverable one, because without it, everything else is a demo.
You already know how to build reliable systems. The new part is smaller than the marketing suggests.