RAG in Production: 7 Failure Modes We Fixed for Enterprise Clients (and How)
A two-week Retrieval-Augmented Generation (RAG) prototype is genuinely easy: chunk some documents, embed them, drop them in a vector store, bolt on an LLM, ship a demo. We've seen dozens of these prototypes. We've also been called in to fix many of them after they stalled out on the path to production. Below are the seven failure modes that show up most often, along with the architectural patterns that make them go away.
1. Stale embeddings (no re-indexing pipeline)
The most common production failure is also the most boring: nobody built the pipeline that keeps embeddings in sync with the source data. The demo was built off a one-time export of last quarter's documents. Three months later, sales reps are asking the assistant about a pricing tier that no longer exists, and engineering is surprised when it answers confidently with last year's API spec.
The fix
Treat the embedding index as a derived data store, just like a search index or a cache, and build the same kind of refresh discipline around it:
- Change-data-capture from the source of truth. Hook into the document store's webhooks (Confluence, SharePoint, Notion, Postgres LISTEN/NOTIFY, S3 event notifications). Don't poll, and don't re-embed the whole corpus every night unless it's tiny.
- Stable document IDs and content hashes. Hash the document body. If the hash didn't change, don't re-embed. This single change typically cuts re-indexing cost by 70–90%.
- Soft-delete instead of hard-delete. When a document is removed upstream, mark the vector as inactive and filter it out at retrieval. Hard deletes during traffic surges cause inconsistent results that are hellish to debug.
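The hash check and soft-delete described above can be sketched in a few lines. This is a minimal in-memory illustration, not a production pipeline: `IndexEntry`, `sync_document`, and the `embed` callable are hypothetical names, and the `index` dict stands in for whatever vector store you use.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class IndexEntry:
    content_hash: str
    vector: List[float]
    active: bool = True


def content_hash(body: str) -> str:
    """Stable fingerprint of the document body."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def sync_document(
    index: Dict[str, IndexEntry],
    doc_id: str,
    body: Optional[str],
    embed: Callable[[str], List[float]],
) -> str:
    """Apply one upstream change event; body=None means the doc was deleted."""
    entry = index.get(doc_id)
    if body is None:
        if entry is None:
            return "skipped"
        # Soft-delete: keep the vector, flag it so retrieval can filter it out.
        entry.active = False
        return "deactivated"
    digest = content_hash(body)
    if entry is not None and entry.content_hash == digest:
        # Unchanged content: no embedding call. Also revives a soft-deleted doc.
        entry.active = True
        return "skipped"
    index[doc_id] = IndexEntry(content_hash=digest, vector=embed(body))
    return "embedded"
```

Each webhook or change event would call `sync_document` once; only the `"embedded"` path costs an embedding call.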
2. Naive chunking that destroys the meaning of the text
A fixed 1000-token sliding window is the simplest chunking strategy, which is why most prototypes use it. It's also the strategy that most aggressively splits semantic units in half. A regulatory paragraph gets its conclusion in one chunk and its conditional clauses in the next; an answer-key Q&A pair gets split between the question and the answer. Retrieval then returns one half of the meaning, and the LLM cheerfully completes the rest from its own training data — which is exactly the hallucination pattern you're trying to avoid.
The fix
- Structure-aware chunking first. Use the document's own boundaries — headings, list items, table rows, slide breaks — before falling back to token-based splitting. For HTML, use the DOM tree. For PDFs, run a layout-aware extractor like Unstructured, Azure Document Intelligence, or pdfplumber+heuristics, not a raw `pdftotext` dump.
- Recursive token splitting with semantic anchors. When you must split a long paragraph, prefer sentence boundaries over hard token cuts. LangChain's `RecursiveCharacterTextSplitter` is fine for this; so is LlamaIndex's `SentenceSplitter`.
- Small chunk with neighbour context at query time. Embed at 200–400 tokens for retrieval precision, but on a hit, include the previous and next chunk in the LLM context. This "small-to-big" pattern preserves precise retrieval without starving the LLM of surrounding meaning.
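The "small-to-big" expansion is simple to sketch. The function below is a hypothetical helper (the name `expand_hits` and the `window` parameter are ours), and it assumes the chunks are stored in document order for a single document:

```python
from typing import List


def expand_hits(chunks: List[str], hit_indices: List[int], window: int = 1) -> List[str]:
    """Return each retrieved chunk plus `window` neighbours on each side.

    Overlapping neighbourhoods are merged, and the result stays in document order.
    """
    wanted = set()
    for i in hit_indices:
        lo = max(0, i - window)
        hi = min(len(chunks) - 1, i + window)
        wanted.update(range(lo, hi + 1))
    return [chunks[i] for i in sorted(wanted)]
```

For example, a hit on chunk 2 of a five-chunk document yields chunks 1 through 3 for the LLM context, while retrieval itself still ran against the small chunks.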
3. Vector-only search (which silently fails on names, IDs, and codes)
Pure semantic search is great at "find me something about Q3 inventory shortfalls" and terrible at "find me ticket INC-48213." Embeddings smear named entities, product SKUs, error codes, and version numbers into a fuzzy representational soup. The user thinks the system is broken; really, the retrieval is just doing exactly what dense vectors were never designed to do.
The fix: hybrid retrieval
Run BM25 (or any sparse lexical search) and dense vector search in parallel, then fuse the results. Reciprocal Rank Fusion (RRF) is the simplest fusion algorithm and works embarrassingly well:
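```python
def rrf(rank_lists, k=60):
    """Fuse several ranked lists of doc IDs with Reciprocal Rank Fusion."""
    scores = {}
    for ranked in rank_lists:
        # RRF ranks are 1-based: the top hit in each list contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)
```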
Most modern vector stores (Weaviate, Qdrant, Azure AI Search, Elastic) ship hybrid retrieval out of the box, and pgvector plus a `tsvector` column gets you there with a little SQL. Turn it on by default for any enterprise corpus where identifiers, names, or jargon matter, which is almost all of them.
4. No reranker (so the top-k is mostly noise)
Embedding similarity gives you a coarse ordering. The top result is usually relevant. The 2nd through 10th are a mix of relevant and "vaguely on-topic but unhelpful." Feed all ten into the LLM and you've now diluted the prompt with noise — the model either averages everything into a vague answer or, worse, latches onto a tangentially-related chunk and confidently expands on it.
The fix: cross-encoder rerank between retrieval and generation
A small cross-encoder reranker (Cohere Rerank, BGE Reranker, Jina Reranker, or a fine-tuned MiniLM) takes your top-50 candidates and reorders them by true query-document relevance. You then feed the top-5 to the LLM. The latency cost is modest (50–150ms on most rerankers) and the answer-quality lift is consistently the single largest you'll get from a retrieval tweak — typically 10–25 points of nDCG@5 in our evals.
Rule of thumb: retrieve 50, rerank to 5, generate from 5. Skip the rerank step, and you're paying LLM tokens for noise.
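The retrieve-then-rerank step can be sketched with a pluggable scoring function. Everything here is illustrative: `rerank` is a hypothetical helper, and `toy_overlap_score` is a deliberately crude stand-in for a real cross-encoder so the example runs without any model. In practice `score` would call your reranker of choice.

```python
from typing import Callable, List


def rerank(
    query: str,
    candidates: List[str],
    score: Callable[[str, str], float],
    top_n: int = 5,
) -> List[str]:
    """Re-order first-stage candidates by score(query, doc) and keep the best top_n.

    sorted() is stable, so ties keep their first-stage retrieval order.
    """
    ordered = sorted(candidates, key=lambda doc: score(query, doc), reverse=True)
    return ordered[:top_n]


def toy_overlap_score(query: str, doc: str) -> float:
    """Stand-in scorer: fraction of query words that also appear in the doc."""
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0
```

Because the scorer is an argument, swapping rerankers is a one-line change, and the candidate count and `top_n` can be tuned independently.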
5. Access control bolted on at the API layer (which leaks PII across users)
The most dangerous bug we've fixed in enterprise RAG isn't a hallucination — it's data leakage. The classic shape: documents from all of Finance, HR, and Legal end up in the same vector index, with access enforced only at the chat API layer ("if the user is in Sales, hide the response"). One race condition, one misrouted request, one prompt-injection trick, and a Sales rep sees a salary spreadsheet.
The fix: filter at the retrieval layer, not the response layer
- Index ACLs as first-class metadata. Every vector record carries an array of allowed roles, allowed user IDs, or an OPA-style policy reference. The vector store applies these as a pre-filter on the search, so unauthorized chunks are never retrieved in the first place.
- Pull from the source of truth at query time. For high-sensitivity sources, store only the embedding + the pointer in the vector store. Fetch the actual chunk text at query time from the source system, which re-enforces its own ACLs. Slower, but no readable chunk text sits in the index to leak if it is compromised.
- Audit log every retrieval. Log `(user, query, doc_ids_returned, timestamp)` for every request. You'll need this for incident response, SOC2, and to debug "why did the bot say X to user Y."
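A minimal sketch of pre-filtering plus audit logging, using an in-memory list in place of a vector store. The `Chunk` class, the `search` function, and the role-based `allowed_roles` field are assumptions for illustration; a real store would apply the same condition as a metadata filter inside its own search call.

```python
import time
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Set


@dataclass
class Chunk:
    doc_id: str
    vector: List[float]
    allowed_roles: FrozenSet[str]


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def search(
    chunks: List[Chunk],
    query_vec: List[float],
    query_text: str,
    user_id: str,
    user_roles: Set[str],
    audit_log: List[Dict],
    k: int = 5,
) -> List[str]:
    # Pre-filter: chunks the user may not see are dropped before any scoring.
    visible = [c for c in chunks if c.allowed_roles & user_roles]
    ranked = sorted(visible, key=lambda c: dot(query_vec, c.vector), reverse=True)
    doc_ids = [c.doc_id for c in ranked[:k]]
    # One audit record per request, including requests that return nothing.
    audit_log.append({
        "user": user_id,
        "query": query_text,
        "doc_ids_returned": doc_ids,
        "timestamp": time.time(),
    })
    return doc_ids
```

The point of the ordering is that an unauthorized chunk never reaches the ranking step, so nothing downstream has to remember to hide it.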
6. Latency that's fine for a demo and unusable at scale
A 2-second response in a demo feels snappy. The same 2-second response on every message in a busy internal chat tool feels interminable, and at 5 concurrent users on a single-replica inference endpoint, the tail latency starts hitting 8–10 seconds. Users churn off the assistant within a week.
The fix
- Stream tokens. Time-to-first-token matters far more than total completion time. Streaming a 4-second response that starts arriving in 600ms feels twice as fast as a 2-second response delivered all at once.
- Cache aggressively. Cache the embedding for the query (not just the answer). Cache the retrieved doc IDs for identical queries. Cache the final answer for FAQ-shaped traffic. Three layers of cache typically absorb 30–50% of production load.
- Smaller model for routing, bigger for synthesis. Use a 7B or smaller model for query rewriting, intent classification, and retrieval filtering. Reserve the 70B (or GPT-4-class) model for the final synthesis step. This single change cut p95 latency by 40% on one of our recent engagements.
- Right-size your inference infrastructure. Load test before launch, not after. Most teams discover their batching configuration is wrong only after a Tuesday morning load spike melts the endpoint.
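The three cache layers can be sketched as follows. This is an in-memory illustration with hypothetical names (`LayeredCache`, `normalize`, and the `embed`, `retrieve`, `generate` callables), with no eviction or TTL. One assumption worth calling out: the retrieval and answer layers are keyed by a permission `scope` as well as the query, so a cached result is never served across access boundaries.

```python
def normalize(query: str) -> str:
    """Lowercase and collapse whitespace so trivially different queries share a key."""
    return " ".join(query.lower().split())


class LayeredCache:
    """Three cache layers in front of an embed -> retrieve -> generate pipeline."""

    def __init__(self, embed, retrieve, generate):
        self.embed = embed
        self.retrieve = retrieve
        self.generate = generate
        self.embeddings = {}  # normalized query -> vector (safe to share across users)
        self.doc_ids = {}     # (scope, normalized query) -> retrieved doc IDs
        self.answers = {}     # (scope, normalized query) -> final answer

    def answer(self, query: str, scope: str):
        text = normalize(query)
        key = (scope, text)
        if key in self.answers:
            return self.answers[key]
        if text not in self.embeddings:
            self.embeddings[text] = self.embed(text)
        if key not in self.doc_ids:
            self.doc_ids[key] = self.retrieve(self.embeddings[text], scope)
        self.answers[key] = self.generate(text, self.doc_ids[key])
        return self.answers[key]
```

In a real deployment the retrieval and answer layers would also need invalidation whenever the index is refreshed; otherwise the cache reintroduces the staleness problem from failure mode 1.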
7. No evaluation harness (so you can't tell if changes help or hurt)
Every RAG team we've worked with that doesn't have an evaluation harness ends up in the same place: a graveyard of well-intentioned "improvements" that nobody is sure actually improved anything. Someone swaps the embedding model. Someone tweaks the chunking. Someone tries a different reranker. Each change is shipped on vibes, and the system slowly drifts in a direction nobody can quantify.
The fix: build the eval harness on day one
- A golden set of 50–200 query/answer pairs drawn from real (or realistic) user questions, labeled by subject matter experts. This is the single most valuable artifact in a RAG program.
- Retrieval metrics: recall@k and nDCG@k against the golden set. Cheap to compute, runs in CI.
- Generation metrics: use an LLM-as-judge with a strict rubric (faithfulness, completeness, refusal-quality). Frameworks like Ragas, TruLens, and Promptfoo are all fine starting points. Don't trust raw BLEU or ROUGE for generative answers; they correlate poorly with actual quality.
- Production telemetry beats offline eval. Log thumbs-up/thumbs-down, edit-on-copy events, and no-answer rates. Pipe them into a dashboard that someone actually looks at every week.
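The retrieval metrics above are cheap enough to write by hand. The sketch below assumes binary relevance labels (a doc ID is either in the golden set for a query or it is not) and that retrieved IDs are unique; the function names are ours.

```python
import math
from typing import List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant docs that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def ndcg_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance nDCG: discounted gain divided by the best achievable gain."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0
```

Averaging both numbers over the golden set gives a single score per pipeline version, which is what makes a CI threshold possible.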
The harsh truth: a RAG system without an eval harness is a system that nobody knows the quality of, including its own authors. Build the harness before you optimize anything.
Putting it all together
The seven failures aren't independent — they interact. Stale embeddings make eval scores drift downward; naive chunking amplifies the noise that reranking has to clean up; missing ACLs become catastrophic precisely when latency optimizations push more queries through more cache layers.
Production-grade RAG is mostly about turning the loose, demo-shaped pipeline into a disciplined data system: known sources, known chunks, known retrieval contract, known access policy, known metrics. The architecture isn't exotic. The discipline is.
Stuck on a RAG system that won't make it to production?
We've taken enterprise RAG pipelines from "great demo" to "trusted internal tool" across healthcare, finance, and logistics. If any of the seven failure modes above sound familiar, let's talk.