Long-context vs RAG: a 2026 decision tree
"Should we still bother with RAG?" is a question I get every week now, and it's the wrong question. Frontier models with million-token context windows are not a roadmap item anymore. Claude reaches 1M tokens in production. Gemini reaches 2M. GPT-class models are at 1M+. As of mid-2026, these are the defaults, not the experiments. So teams see the marketing, conclude that retrieval-augmented generation is obsolete, and either go all-in on stuffing entire codebases or document corpora into every prompt, or they stick with RAG pipelines they built in 2023 and wonder whether they're leaving something on the table.
Both reactions are wrong. The real question is not "long-context or RAG?" It is: which of cost, latency, and accuracy are you willing to spend, and by how much? Those three levers don't all move in the same direction. A decision you optimize on one of them can wreck the other two. This post is a framework for making that tradeoff explicitly, with numbers, instead of by instinct or vendor recommendation.
The cost lens
Start with cost because it's the most legible variable and the one where the intuitions are most off.
A 1M-token input request runs roughly $3-5 in input cost per query depending on the model. RAG over the same corpus, where you retrieve a focused slice and feed 5-50K tokens to the model, averages around $0.02 per query. At those prices the naive gap is roughly 150-250x; the 2026 analyses of this decision (TianPan's long-context vs. RAG decision framework, April 2026; open-techstack's RAG vs Long Context 2026 comparison) put the headline ratio as high as roughly 1,250x, depending on model pricing and how small the retrieved slice is. Either way: hundreds to over a thousand times more expensive, per query, for the same corpus.
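A quick back-of-envelope check makes the ratio concrete. The sketch below assumes a flat $4 per million input tokens and a 5K-token retrieved slice; both numbers are illustrative, not any provider's actual price sheet.

```python
# Back-of-envelope cost comparison (illustrative prices, not a quote from any provider).
INPUT_PRICE_PER_MTOK = 4.00  # assumed $/1M input tokens for a frontier model

def input_cost(tokens: int, price_per_mtok: float = INPUT_PRICE_PER_MTOK) -> float:
    """Input cost in dollars for a single request."""
    return tokens / 1_000_000 * price_per_mtok

long_context = input_cost(1_000_000)  # full corpus in every prompt
rag = input_cost(5_000)               # focused retrieved slice

print(f"long-context: ${long_context:.2f}/query")
print(f"rag:          ${rag:.4f}/query")
print(f"ratio:        {long_context / rag:.0f}x")
```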
That's the naive comparison, and it's the number that drives people back to RAG without thinking harder. The modifier that changes the picture is prefix caching.
Prefix caching is now a standard feature across major inference providers. When a substantial portion of your input is stable across queries, the provider caches the processed key-value state for that prefix, and subsequent calls that hit the cache are billed at 10-25% of the normal input rate. If your system prompt plus corpus is 900K tokens and only the last 100K varies per query, you're paying full price on 100K tokens and reduced price on 900K. That collapses the naive gap to somewhere in the 5-50x range, depending on cache hit rate, corpus stability, and provider pricing.
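To see how the cache changes the math, here is the same back-of-envelope calculation with a cached prefix. The 10% cache rate, the 900K/100K split, and the $4-per-million price are assumptions for illustration; check your provider's actual cache pricing and eligibility rules.

```python
# Effect of prefix caching on the same request (illustrative numbers).
# Assumes the provider bills cached prefix tokens at a discount; the exact
# discount and cache mechanics vary by provider.
def cached_cost(stable_tokens: int, variable_tokens: int,
                price_per_mtok: float = 4.00, cache_discount: float = 0.10) -> float:
    """Input cost when the stable prefix hits the cache and only the tail is full price."""
    full = variable_tokens / 1_000_000 * price_per_mtok
    cached = stable_tokens / 1_000_000 * price_per_mtok * cache_discount
    return full + cached

uncached = cached_cost(0, 1_000_000)        # no cache hit: ~$4.00
with_cache = cached_cost(900_000, 100_000)  # 900K stable, 100K varies: ~$0.76
rag = 5_000 / 1_000_000 * 4.00              # ~$0.02

print(f"gap without caching: {uncached / rag:.0f}x")
print(f"gap with caching:    {with_cache / rag:.0f}x")
```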
Stacking optimizations narrows it further. Redis's analyses of LLM token optimization in production describe 60-80% cost reductions through stacked approaches: prompt caching at the provider level, semantic caching at the application level (returning cached results for queries that are similar enough to a prior query), and tighter retrieval to reduce the tokens fed per round. These are not theoretical numbers. They come from production deployments. The ceiling on this stacking is that it only works when your query distribution has high semantic similarity and your corpus is stable. For highly varied, exploratory queries against a rapidly updating corpus, the cache hit rates collapse and the cost reverts toward the naive number.
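Application-level semantic caching is worth a sketch because it sits above any provider feature. The minimal in-memory version below rests on assumptions of my own (the `embed_fn` callable, the 0.95 cosine threshold); a production deployment would back this with a vector store such as Redis and tune the threshold against real traffic.

```python
# Minimal sketch of an application-level semantic cache, as described above.
# Embedding model, threshold, and storage are all assumptions.
import numpy as np

class SemanticCache:
    def __init__(self, embed_fn, threshold: float = 0.95):
        self.embed_fn = embed_fn    # any text -> vector function
        self.threshold = threshold  # cosine similarity required for a hit
        self.entries: list[tuple[np.ndarray, str]] = []

    def get(self, query: str) -> str | None:
        q = self.embed_fn(query)
        for vec, answer in self.entries:
            sim = float(np.dot(q, vec) / (np.linalg.norm(q) * np.linalg.norm(vec)))
            if sim >= self.threshold:
                return answer  # close enough to a prior query: reuse its answer
        return None

    def put(self, query: str, answer: str) -> None:
        self.entries.append((self.embed_fn(query), answer))
```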
Practical heuristics by per-query budget, after caching but before optimistic cache hit assumptions: if your budget is under $0.10 per query, RAG is the right default and long-context is probably not feasible at any meaningful query volume. Under $1.00 per query, long-context with aggressive caching is feasible if the corpus is stable. Above $1.00 per query, either architecture can work, but at that point cache strategy matters more than retrieval strategy.
The latency lens
Cost is a number you can optimize around. Latency is often a hard constraint.
A 1M-token query, end to end on a real reasoning task, runs 30-60 seconds on current inference infrastructure. RAG over the same corpus, retrieving a focused slice and running the actual model call, runs around 1 second. That's a 30-60x difference. TianPan's 2026 decision framework documents similar numbers for production deployments.
30-60 seconds is not a user-facing latency. It's a batch-processing latency. Any product that requires a person to wait for a response, any chat interface, any document QA tool, any code assistant, cannot absorb 30-60 seconds per turn without destroying the experience. This isn't a matter of tuning; it's physics. The compute required to process 1M tokens takes time.
The caveat worth adding: streaming helps the perceived latency picture significantly for chat-style interfaces. Time-to-first-token (TTFT) on a long-context call can be 2-5 seconds even when total latency is 45 seconds. If your UI streams tokens as they arrive, the user sees a response begin quickly and experiences something closer to an interactive product, even if they're waiting a full minute for the complete output. This is real but bounded: streaming can hide latency for prose, but not for a table or a structured JSON response that the user needs in full before they can do anything with it.
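If you want to see where streaming does and doesn't help, measuring TTFT against total latency takes a few lines. In the sketch below, `stream_completion` is a placeholder for whatever streaming client you use, assumed to yield text chunks as the provider produces them.

```python
# Sketch: measuring time-to-first-token vs. total latency for a streaming call.
import time

def measure_stream(stream_completion, prompt: str) -> dict:
    start = time.monotonic()
    ttft = None
    chunks = []
    for chunk in stream_completion(prompt):  # yields text fragments as they arrive
        if ttft is None:
            ttft = time.monotonic() - start  # the user sees output begin here
        chunks.append(chunk)
    total = time.monotonic() - start         # the user has the full answer here
    return {"ttft_s": ttft, "total_s": total, "text": "".join(chunks)}
```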
Latency heuristics: if your latency budget is under 2 seconds, use RAG. If you have up to 10 seconds, either architecture is viable with tuning. If you can accept more than 10 seconds (batch jobs, overnight reports, async analysis pipelines), long-context with a large window is worth considering because it removes the retrieval infrastructure entirely and simplifies the system.
The accuracy lens, and lost-in-the-middle
Cost and latency are tractable. Accuracy is where the 2026 narrative about long-context has the most blind spots.
The assumption in most "long-context replaces RAG" arguments is that giving the model everything is strictly better than giving it a retrieved subset. More context, fewer missed facts. This is wrong in practice, and the reason has been known since Liu et al.'s "Lost in the Middle" paper: large language models exhibit a primacy/recency bias. Information at the very beginning and very end of a context window is retrieved with higher accuracy than information in the middle. The effect Liu et al. quantified was 10-20 percentage points lower retrieval accuracy for middle-of-window facts compared to the same facts at the start or end.
The magnitude of this effect has decreased with 2026 frontier models. Better training and attention mechanisms have reduced (not eliminated) the bias. But reduced is not gone, and the reduction matters less than the baseline. A 2025 empirical study published on arXiv (arXiv 2501.01880, "Long Context vs. RAG for LLMs: An Evaluation and Revisits") measured multi-fact recall in pure long-context configurations and found it hovering around 60%. That means a 40% silent miss rate. Not 40% errors that throw an exception or return a visible failure. Silent misses: facts your system doesn't surface, and doesn't know it didn't surface.
That number should land as hard as the compound failure math in agent pipelines. You're building a system that will silently fail to retrieve a relevant fact four out of ten times. If the answer to a user question depends on three facts that each have a 60% retrieval chance, your probability of getting all three is 0.6^3, roughly 22%. Most users will never know what the correct answer was; they'll just get a plausible-sounding incomplete answer.
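The compound arithmetic generalizes in the obvious way; the snippet below just restates the numbers from this paragraph.

```python
# Probability of surfacing every required fact when each fact is recalled
# independently with probability p (the ~60% figure from the arXiv study).
def compound_recall(p: float, facts_required: int) -> float:
    return p ** facts_required

print(compound_recall(0.6, 1))  # 0.60 -- one fact
print(compound_recall(0.6, 3))  # ~0.22 -- three facts, all needed
```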
RAG with a focused retrieval window often outperforms naive 1M-token long-context on retrieval-heavy tasks precisely because it reduces the window to the span where attention is most reliable. A well-tuned retrieval step that feeds 50K tokens of highly relevant content will beat a 1M-token dump of the full corpus on factual retrieval benchmarks, even when the full corpus contains the answer.
The hybrid pattern that most production systems converge on in 2026 follows from this: RAG retrieves top-K chunks (typically 50K-300K tokens of relevant material), and that slice is fed into a 1M-class context window for reasoning. The retrieval step ensures the relevant facts are near the attention focal point; the large window allows the model to reason across the retrieved context without artificial truncation. You get the recall benefits of retrieval and the reasoning capacity of a large window.
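A minimal sketch of that hybrid loop is below, with `retriever` and `llm` standing in for whatever vector store and model client you actually run; their method names are assumptions for illustration, not any specific library's API.

```python
# Hybrid pattern: retrieve a focused slice, then hand it to a large-window
# model for reasoning across the whole slice.
def hybrid_answer(question: str, retriever, llm,
                  top_k: int = 20, max_context_tokens: int = 200_000) -> str:
    # 1. Retrieval keeps the relevant facts near the attention focal point.
    chunks = retriever.search(question, top_k=top_k)

    # 2. Pack chunks until the context budget is hit (word count is a crude
    #    proxy for tokens here; use your real tokenizer in practice).
    context, used = [], 0
    for chunk in chunks:
        cost = len(chunk.text.split())
        if used + cost > max_context_tokens:
            break
        context.append(chunk.text)
        used += cost

    # 3. One large-window call reasons across the entire retrieved slice.
    prompt = ("Answer using only the context below.\n\n"
              + "\n\n---\n\n".join(context)
              + f"\n\nQuestion: {question}")
    return llm.complete(prompt)
```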
The 2026 decision tree
Given cost, latency, and accuracy constraints, here is the architecture decision as an explicit if-then tree by corpus size.
Under 200K tokens. Just paste it. Modern frontier models handle a 200K-token context window without significant degradation. Retrieval infrastructure adds operational overhead, latency, and potential retrieval errors. If your corpus fits comfortably under 200K tokens, the simplest correct answer is to include the full corpus in your system prompt with prefix caching and call it done.
200K to 1M tokens. This is the first real decision branch. If your queries hit similar context across most requests, and you can achieve meaningful cache hit rates, long-context with prefix caching is viable. If queries are highly varied or the corpus is frequently updated, cache efficiency collapses and RAG over the corpus is the better default. The heuristic: measure your cache hit rate for a week of real queries. Above 60% cache hits, long-context is cost-competitive. Below that, RAG.
1M to 10M tokens. Hybrid. This is the modal production architecture for mid-large knowledge bases in 2026 according to both the TianPan and open-techstack analyses. RAG retrieves top-K relevant chunks, typically resulting in 50K-300K tokens, and feeds that slice into a 1M-class context window for reasoning. You need the retrieval step because you can't fit the full corpus in a single call. You want the large window because it allows reasoning across multi-document context that narrow retrieval would fragment.
Over 10M tokens. RAG-only at the storage and retrieval layer. The corpus simply doesn't fit in any current context window, so retrieval is mandatory. Long-context can still be applied at the reasoning step over the retrieved top-K, but the primary architecture is retrieval-first.
Three orthogonal overrides apply regardless of corpus size.
First, latency ceiling. If you need sub-2-second responses, the latency constraints in the previous section override the corpus size heuristic. Even a 150K-token corpus should use RAG if the latency budget is tight.
Second, cost ceiling. If your per-query budget is under $0.10, RAG regardless of corpus size. The 1M-token window with prefix caching can still be expensive at moderate query volumes if cache hit rates aren't high.
Third, data locality and regulatory constraints. GDPR, HIPAA, and sector-specific data residency requirements can force retrieval to happen on-prem or within a specific region. Long-context via a cloud inference API may not be available for that data at all. This isn't a cost or accuracy question; it's a constraint that removes options from the tree before you get to evaluate them.
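Condensed into code, the branches and overrides above look like the function below. The thresholds are the heuristics from this post, not universal constants, and the on-prem flag is a crude stand-in for the full regulatory analysis.

```python
# The decision tree from this section as a single function. Tune the
# thresholds to your own data and constraints.
def choose_architecture(corpus_tokens: int,
                        latency_budget_s: float,
                        cost_budget_per_query: float,
                        cache_hit_rate: float,
                        data_must_stay_onprem: bool = False) -> str:
    # Overrides first: they remove options before corpus size matters.
    if data_must_stay_onprem:
        return "rag (on-prem retrieval; cloud long-context unavailable)"
    if latency_budget_s < 2:
        return "rag (latency ceiling)"
    if cost_budget_per_query < 0.10:
        return "rag (cost ceiling)"

    # Corpus-size branches.
    if corpus_tokens < 200_000:
        return "long-context (paste the corpus, use prefix caching)"
    if corpus_tokens < 1_000_000:
        return ("long-context with prefix caching" if cache_hit_rate >= 0.60
                else "rag")
    if corpus_tokens < 10_000_000:
        return "hybrid (rag retrieval into a 1M-class window)"
    return "rag-first (retrieval mandatory; long-context only over retrieved top-K)"
```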
What to evaluate, not assume
The decision tree above is a reasonable starting point. It is not a substitute for measurement on your actual data with your actual queries.
Vendor benchmarks are optimized for vendor interests. They use curated datasets, favorable hardware configurations, and task distributions that may have no overlap with your workload. A 70% recall number from a 2025 benchmark, run on a model that has since been retrained, is not the same as your recall number on your internal documents with your user query distribution.
The approach I use before committing to an architecture: build a 20-row golden set. Twenty representative queries against your corpus. Write down the correct answer for each. Three metrics: factual accuracy (did the answer include the correct facts?), latency (wall clock, not marketing), and cost per query (real API billing, not theoretical). Run both architectures against this set. The evaluation takes a day to build and an afternoon to run. That one day will save you two months of rebuilding an architecture you committed to based on a benchmark that didn't apply to you.
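Here is a sketch of what that golden-set harness can look like, with `run_rag` and `run_long_context` standing in for your two candidate architectures (each assumed to return the answer text, wall-clock latency, and real cost) and a deliberately crude string-containment check for accuracy; a judge model or human review works too.

```python
# Golden-set evaluation sketch: accuracy, latency, and cost per query
# for each candidate architecture.
import statistics

golden_set = [
    {"query": "What is our refund window for enterprise contracts?",  # illustrative row
     "must_contain": ["30 days"]},
    # ... 19 more rows drawn from real user queries
]

def evaluate(run_fn, rows):
    correct, latencies, costs = 0, [], []
    for row in rows:
        answer, latency_s, cost = run_fn(row["query"])
        if all(fact.lower() in answer.lower() for fact in row["must_contain"]):
            correct += 1  # crude containment check for factual accuracy
        latencies.append(latency_s)
        costs.append(cost)
    return {
        "accuracy": correct / len(rows),
        "p50_latency_s": statistics.median(latencies),
        "cost_per_query": statistics.mean(costs),
    }

# results = {name: evaluate(fn, golden_set)
#            for name, fn in [("rag", run_rag), ("long_context", run_long_context)]}
```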
This is the same principle behind building the eval harness before writing the prompt in any other AI system: the evaluation is the spec. Don't pick your retrieval architecture from a blog post comparison. Pick it from your data.
Closing
The boundary between "long-context handles this" and "you need retrieval" moved in 2025 and is still moving. It didn't disappear. RAG didn't become obsolete when context windows hit 1M tokens. It became less necessary at the small end of the corpus size distribution and increasingly a component (the retrieval step in a hybrid) rather than the full architecture at the large end.
The expensive mistake in 2026 is picking a side. Teams that went all-in on long-context are paying orders of magnitude more per query than they need to for large, stable corpora. Teams that reflexively kept their 2023 RAG pipelines intact are doing unnecessary engineering for corpora that would fit cleanly in a single 200K-token call.
The right question is always: what are your cost, latency, and accuracy constraints, and what does your corpus size say about the architecture? The answer is almost always "measure first, then decide." Tool-agnostic. Eval before architecture commitment.