RAG in 2026: why bother if long context works
Claude 4 and GPT-5 handle million-token contexts. A full book fits into a single request. Which raises an obvious question: why do we still need RAG, with its embeddings, vector databases, and chunking strategies?
Answering as someone who ships both.
Where long context decisively wins
Single-document tasks — dump the whole thing into context, ask your question, get your answer. You used to slice into chunks, compute embeddings, run hybrid search, rerank, assemble context. Now: just send the PDF.
This works for contracts, articles, books, codebases up to ~500K tokens. Quality beats RAG because the model sees everything at once — it doesn't lose connections between parts.
Where RAG is still irreplaceable
Three cases.
First: the corpus is bigger than the context window. A million tokens is roughly 750,000 words, on the order of 1,500 pages. A company knowledge base, a support archive, the full docs of a big API: all easily larger. You can't cram it all in, so you have to pick what's relevant.
Second: cost. A request with a million input tokens to Claude costs roughly $3. With thousands of users per day, each needing fresh context — the economics break. RAG cuts input by 100x.
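A back-of-the-envelope sketch of that economics. The per-token price and request volume here are illustrative assumptions, not quoted rates; plug in your provider's current numbers.

```python
# Back-of-the-envelope cost comparison: full-context vs. RAG.
# Prices and volumes are illustrative assumptions, not quoted rates.
PRICE_PER_MILLION_INPUT = 3.00  # USD per million input tokens (assumed)

def daily_cost(tokens_per_request: int, requests_per_day: int) -> float:
    """Daily input-token spend for a given request shape."""
    return tokens_per_request / 1_000_000 * PRICE_PER_MILLION_INPUT * requests_per_day

full_context = daily_cost(1_000_000, 5_000)  # whole corpus on every request
rag = daily_cost(10_000, 5_000)              # top-5 chunks + system prompt

print(f"full context: ${full_context:,.0f}/day")  # $15,000/day
print(f"rag:          ${rag:,.0f}/day")           # $150/day
```

Same model, same traffic: the only variable is how many input tokens each request carries, and that variable alone is the 100x.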
Third: freshness. With RAG, you only reindex changed documents. Rebuilding a million-token context on every request is slow and expensive.
The hybrid I actually ship
In production, it's almost always a pipeline.
- Embedding search pulls top-20 candidates.
- A reranker (usually bge-reranker or Cohere Rerank) narrows to top-5.
- Those fragments + system prompt + question go to Claude.
Result: high-precision retrieval, reasonable cost, the model sees enough context to answer well.
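The three steps above as a minimal sketch. `embed_search`, `rerank`, and `generate` are hypothetical stand-ins for your vector store, your reranker (bge-reranker, Cohere Rerank), and your LLM client; only the shape of the pipeline is the point.

```python
from typing import Callable, Sequence

def answer(question: str,
           embed_search: Callable[[str, int], list],
           rerank: Callable[[str, Sequence, int], list],
           generate: Callable[[str], str]) -> str:
    """Retrieve -> rerank -> generate, as described in the text.
    The three callables are hypothetical stand-ins for real components."""
    candidates = embed_search(question, 20)  # broad recall: top-20 by embedding
    top = rerank(question, candidates, 5)    # high precision: narrow to top-5
    context = "\n\n".join(top)
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    return generate(prompt)
```

Keeping the components behind plain callables makes it trivial to swap a reranker or vector store without touching the pipeline.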
What changed in chunking itself
The old way: 512-token slices with 50-token overlap. The new way: semantic splitters. SemanticChunker from langchain_experimental or markdown-structure slicing both work well.
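A toy illustration of the idea behind semantic splitting (not the langchain implementation): embed consecutive sentences and cut a chunk boundary wherever similarity drops. `embed` is a hypothetical sentence-to-vector function, and the input is assumed non-empty.

```python
import numpy as np

def semantic_chunks(sentences, embed, threshold=0.7):
    """Group consecutive sentences into chunks; start a new chunk when
    adjacent sentence embeddings fall below `threshold` cosine similarity.
    `embed` is a hypothetical sentence -> vector function; `sentences`
    is assumed to be a non-empty list."""
    vecs = [np.asarray(embed(s), dtype=float) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev, cur, sent in zip(vecs, vecs[1:], sentences[1:]):
        sim = prev @ cur / (np.linalg.norm(prev) * np.linalg.norm(cur))
        if sim < threshold:               # topic shift: close the chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks
```

The fixed-size splitter cuts mid-thought; this one cuts where the text itself changes subject, which is the whole point of the upgrade.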
Another trend — parent-child retrieval. Search tiny chunks (precision), but feed parent blocks into context (completeness). Often +15% quality on the same data.
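A minimal sketch of the parent-child lookup, assuming you already have scored child chunks from vector search, a child-to-parent mapping, and the parent texts; all names are illustrative.

```python
def parents_for_query(scored_children, child_to_parent, parent_text, k=3):
    """Search small chunks, return big blocks: rank child chunks by score,
    then emit their parent blocks, de-duplicated, in rank order.
    `scored_children` is [(child_id, score)] from vector search."""
    seen, out = set(), []
    for child_id, _ in sorted(scored_children, key=lambda p: p[1], reverse=True):
        parent_id = child_to_parent[child_id]
        if parent_id not in seen:         # several children may share a parent
            seen.add(parent_id)
            out.append(parent_text[parent_id])
        if len(out) == k:
            break
    return out
```

The de-duplication matters: when two high-scoring children share a parent, you want that parent once, not twice.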
When you need neither RAG nor long context
Pure reasoning, code, generation: the model was trained on the whole internet and already knows. Adding more context is noise. The beginner's mistake is stuffing documentation into the prompt when a precise task formulation is enough.
My rule: reach for RAG when the answer depends on specific information that isn't in the model's weights.
What got worse
Embedding infrastructure has aged. Pinecone, Weaviate, Qdrant are all fine, but deployment and ops complexity grew while the space of relevant use cases shrank. For small projects, Postgres with the pgvector extension, SQLite with sqlite-vec, or even a JSON file with numpy is enough. No need to drag in a whole platform.
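What the "JSON file with numpy" option looks like in practice, as a sketch. The index layout (a list of `{"text", "vec"}` records) is an assumption; brute-force cosine search like this is plenty for a few thousand documents.

```python
import json
import numpy as np

def search(index_path: str, query_vec, k: int = 5):
    """Brute-force cosine similarity over a JSON 'index' of the assumed
    form [{"text": ..., "vec": [...]}, ...]. No vector DB required."""
    with open(index_path) as f:
        items = json.load(f)
    mat = np.array([it["vec"] for it in items], dtype=float)
    q = np.asarray(query_vec, dtype=float)
    sims = mat @ q / (np.linalg.norm(mat, axis=1) * np.linalg.norm(q))
    top = np.argsort(-sims)[:k]           # indices of the k most similar
    return [items[i]["text"] for i in top]
```

When the corpus outgrows memory or you need filtering and updates, that's the moment to reach for a real store, not before.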
Bottom line
RAG isn't dead, but it's no longer the default. The default is "throw it in the context." RAG kicks in when the corpus is too big, cost matters, or freshness matters. And even then — not pure vector search, but a hybrid with a reranker.
If you have a document-based task and don't know where to start, start with long context. If you hit a limit or cost wall, add retrieval. Jumping straight to RAG is almost always overengineering.
Need help architecting document workflows? Drop a line.