RAG in Production: 7 Failure Modes We Fixed for Enterprise Clients (and How)

Published May 26, 2026 · 12 min read · By the wehamd engineering team

A two-week Retrieval-Augmented Generation (RAG) prototype is genuinely easy: chunk some documents, embed them, drop them in a vector store, bolt on an LLM, ship a demo. We've seen dozens of these prototypes. We've also been called in to fix many of them after they stalled on the path to production. Below are the seven failure modes we see most often, along with the architectural patterns that make them go away.

1. Stale embeddings (no re-indexing pipeline)

The most common production failure is also the most boring: nobody built the pipeline that keeps embeddings in sync with the source data. The demo was built off a one-time export of last quarter's documents. Three months later, sales reps are asking the assistant about a pricing tier that no longer exists, and engineering is surprised when it answers confidently with last year's API spec.

The fix

Treat the embedding index as a derived data store, just like a search index or a cache, and build the same kind of refresh discipline around it, so that changes and deletions in the source data propagate to the index.
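One way to get that refresh discipline is to store a content hash alongside each indexed document and diff it against the source on a schedule. The sketch below is a minimal illustration under assumptions: `content_hash` and `plan_reindex` are hypothetical helpers, documents are plain strings keyed by ID, and the actual embedding and upsert calls are left to whatever vector store you use.

```python
import hashlib
from typing import Dict, List, Tuple


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's current content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(
    source_docs: Dict[str, str],      # doc_id -> current text in the source system
    indexed_hashes: Dict[str, str],   # doc_id -> hash recorded when it was last embedded
) -> Tuple[List[str], List[str]]:
    """Return (doc_ids to re-embed, doc_ids to delete from the index)."""
    # New or changed documents: no stored hash, or the hash no longer matches.
    to_upsert = [
        doc_id for doc_id, text in source_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    ]
    # Documents that vanished from the source must leave the index too.
    to_delete = [doc_id for doc_id in indexed_hashes if doc_id not in source_docs]
    return sorted(to_upsert), sorted(to_delete)


source = {"pricing": "Tier A: $10", "api": "v2 spec"}
indexed = {
    "pricing": content_hash("Tier A: $8"),   # edited since last index run
    "api": content_hash("v2 spec"),          # unchanged
    "old-tier": content_hash("Tier Z"),      # deleted at the source
}
to_upsert, to_delete = plan_reindex(source, indexed)
print(to_upsert, to_delete)  # ['pricing'] ['old-tier']
```

Running this diff on a schedule (or on change events, if the source system emits them) keeps re-embedding costs proportional to what actually changed.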

Symptom you'll see in production logs: rising "I don't know" or hallucinated responses for queries that should be well supported, concentrated in the most-edited document sets. This almost always traces back to stale embeddings.

2. Naive chunking that destroys the meaning of the text

A fixed 1000-token sliding window is the simplest chunking strategy, which is why most prototypes use it. It's also the strategy that most aggressively splits semantic units in half. A regulatory paragraph gets its conclusion in one chunk and its conditional clauses in the next; an answer-key Q&A pair gets split between the question and the answer. Retrieval then returns one half of the meaning, and the LLM cheerfully completes the rest from its own training data — which is exactly the hallucination pattern you're trying to avoid.

The fix
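One common way to stop splitting semantic units is to chunk on the document's own boundaries instead of a fixed window. The sketch below is an illustration under assumptions, not a definitive implementation: it treats blank lines as unit boundaries, uses whitespace word count as a rough stand-in for tokens, and `chunk_by_paragraph` and `max_words` are hypothetical names.

```python
from typing import List


def chunk_by_paragraph(text: str, max_words: int = 200) -> List[str]:
    """Pack whole paragraphs into chunks without ever cutting one in half."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for para in paragraphs:
        n_words = len(para.split())
        # Close the current chunk if adding this paragraph would exceed the budget.
        if current and current_len + n_words > max_words:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        # A paragraph longer than the budget still stays whole, as its own chunk.
        current.append(para)
        current_len += n_words
    if current:
        chunks.append("\n\n".join(current))
    return chunks


faq = "Q: What is X?\nA: X is Y.\n\nQ: What is Z?\nA: Z is W."
for chunk in chunk_by_paragraph(faq, max_words=8):
    print(repr(chunk))
```

Each Q&A pair lands in a single chunk, so retrieval returns the question together with its answer. Real corpora usually need format-specific boundaries (headings, clauses, table rows) rather than blank lines alone.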

3. Vector-only search (which silently fails on names, IDs, and codes)

Pure semantic search is great at "find me something about Q3 inventory shortfalls" and terrible at "find me ticket INC-48213." Embeddings smear named entities, product SKUs, error codes, and version numbers into a fuzzy representational soup. The user thinks the system is broken; really, the retrieval is just doing exactly what dense vectors were never designed to do.

The fix: hybrid retrieval

Run BM25 (or any sparse lexical search) and dense vector search in parallel, then fuse the results. Reciprocal Rank Fusion (RRF) is the simplest fusion algorithm and works embarrassingly well:

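def rrf(rank_lists, k=60):
    """Fuse several ranked lists of doc IDs with Reciprocal Rank Fusion."""
    scores = {}
    for ranked in rank_lists:
        # RRF is conventionally defined with 1-based ranks: the top hit adds 1 / (k + 1).
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)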

Most modern vector stores (Weaviate, Qdrant, Azure AI Search, Elastic) ship hybrid retrieval out of the box; on Postgres, pgvector plus a `tsvector` column gets you the same result with a fusion step in SQL or application code. Turn it on by default for any enterprise corpus where identifiers, names, or jargon matter, which is almost all of them.

4. No reranker (so the top-k is mostly noise)

Embedding similarity gives you a coarse ordering. The top result is usually relevant. The 2nd through 10th are a mix of relevant and "vaguely on-topic but unhelpful." Feed all ten into the LLM and you've diluted the prompt with noise: the model either averages everything into a vague answer or, worse, latches onto a tangentially related chunk and confidently expands on it.

The fix: cross-encoder rerank between retrieval and generation

A small cross-encoder reranker (Cohere Rerank, BGE Reranker, Jina Reranker, or a fine-tuned MiniLM) takes your top-50 candidates and reorders them by true query-document relevance. You then feed the top 5 to the LLM. The latency cost is modest (50–150 ms on most rerankers) and the answer-quality lift is consistently the largest you'll get from any single retrieval tweak: typically 10–25 points of nDCG@5 in our evals.

Rule of thumb: retrieve 50, rerank to 5, generate from 5. Skip the rerank step and you're paying LLM tokens for noise.
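The retrieve-50, rerank-to-5 shape can be sketched as a small function that takes the scorer as a parameter. This is an illustration under assumptions: `retrieve_then_rerank` and `toy_overlap_score` are hypothetical names, and the toy scorer only stands in for a real cross-encoder, which would score each (query, document) pair jointly.

```python
from typing import Callable, List, Sequence


def retrieve_then_rerank(
    query: str,
    candidates: Sequence[str],               # first-stage results, best first
    score_fn: Callable[[str, str], float],   # (query, doc) -> relevance score
    retrieve_k: int = 50,
    final_k: int = 5,
) -> List[str]:
    """Keep the top retrieve_k candidates, rescore them, return the best final_k."""
    pool = list(candidates[:retrieve_k])
    scored = [(score_fn(query, doc), doc) for doc in pool]
    # Python's sort is stable, so ties keep their first-stage order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:final_k]]


def toy_overlap_score(query: str, doc: str) -> float:
    """Placeholder scorer: fraction of query words present in the document."""
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / len(q_words) if q_words else 0.0


docs = [
    "refund policy for enterprise plans",
    "office party schedule",
    "enterprise refund request form",
]
print(retrieve_then_rerank("enterprise refund", docs, toy_overlap_score, final_k=2))
```

Passing the scorer in as `score_fn` keeps the pipeline independent of any one reranker, so swapping models is a one-line change that an eval run can then confirm or reject.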

5. Access control bolted on at the API layer (which leaks PII across users)

The most dangerous bug we've fixed in enterprise RAG isn't a hallucination — it's data leakage. The classic shape: documents from all of Finance, HR, and Legal end up in the same vector index, with access enforced only at the chat API layer ("if the user is in Sales, hide the response"). One race condition, one misrouted request, one prompt-injection trick, and a Sales rep sees a salary spreadsheet.

The fix: filter at the retrieval layer, not the response layer
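Filtering at the retrieval layer means a chunk the user cannot read never becomes a candidate in the first place. The sketch below is a minimal in-memory illustration under assumptions: `Chunk`, `allowed_groups`, `keyword_overlap`, and `search` are hypothetical names, and the keyword score stands in for vector similarity. In a real deployment the same permission set would typically be passed as a metadata filter on the vector-store query.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Set


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    text: str
    allowed_groups: FrozenSet[str]  # hypothetical ACL metadata stored with each chunk


def keyword_overlap(query: str, text: str) -> int:
    """Toy relevance score standing in for vector similarity."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def search(query: str, chunks: List[Chunk], user_groups: Set[str], top_k: int = 5) -> List[Chunk]:
    # Filter first: chunks the user may not read never enter ranking,
    # so they cannot reach the prompt or the response.
    permitted = [c for c in chunks if c.allowed_groups & user_groups]
    ranked = sorted(permitted, key=lambda c: keyword_overlap(query, c.text), reverse=True)
    return ranked[:top_k]


corpus = [
    Chunk("hr-1", "salary bands for 2026", frozenset({"hr"})),
    Chunk("sales-1", "sales playbook and salary commission rules", frozenset({"sales"})),
]
print([c.chunk_id for c in search("salary", corpus, {"sales"})])  # ['sales-1']
```

Because the HR chunk is excluded before ranking, no downstream bug in the chat layer or prompt-injection trick can surface it to a Sales user.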

If you do nothing else from this post, audit your access-control story before you onboard the next department's documents. This is the failure mode that ends RAG programs.

6. Latency that's fine for a demo and unusable at scale

A 2-second response in a demo feels snappy. The same 2-second response on every message in a busy internal chat tool feels interminable, and with 5 concurrent users on a single-replica inference endpoint, the tail latency starts hitting 8–10 seconds. Users abandon the assistant within a week.

The fix
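Caching repeated questions is one common latency lever, and it is also where access control can quietly break if the cache key ignores who is asking. The sketch below is a minimal illustration under assumptions: `AnswerCache` is a hypothetical in-memory class, the TTL value is arbitrary, and the key combines the normalized query with the user's permission groups so that a cached answer is only reused by users with the same access.

```python
import time
from typing import Callable, Dict, FrozenSet, Iterable, Optional, Tuple

CacheKey = Tuple[str, FrozenSet[str]]


class AnswerCache:
    """TTL cache keyed on (normalized query, user permission groups)."""

    def __init__(self, ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._store: Dict[CacheKey, Tuple[float, str]] = {}

    @staticmethod
    def _key(query: str, user_groups: Iterable[str]) -> CacheKey:
        # Collapse case and whitespace so trivially different phrasings share an entry.
        normalized = " ".join(query.lower().split())
        return (normalized, frozenset(user_groups))

    def get(self, query: str, user_groups: Iterable[str]) -> Optional[str]:
        key = self._key(query, user_groups)
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, answer = entry
        if self._clock() - stored_at > self._ttl:
            del self._store[key]  # expired: drop it so stale answers are not served
            return None
        return answer

    def put(self, query: str, user_groups: Iterable[str], answer: str) -> None:
        self._store[self._key(query, user_groups)] = (self._clock(), answer)


cache = AnswerCache(ttl_seconds=300)
cache.put("What is  the Q3 plan?", {"sales"}, "answer-for-sales")
print(cache.get("what is the q3 plan?", {"sales"}))  # answer-for-sales
print(cache.get("what is the q3 plan?", {"hr"}))     # None
```

The TTL should be shorter than your re-indexing interval, otherwise the cache reintroduces the stale-answer problem from failure mode 1.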

7. No evaluation harness (so you can't tell if changes help or hurt)

Every RAG team we've worked with that doesn't have an evaluation harness ends up in the same place: a graveyard of well-intentioned "improvements" that nobody is sure actually improved anything. Someone swaps the embedding model. Someone tweaks the chunking. Someone tries a different reranker. Each change is shipped on vibes, and the system slowly drifts in a direction nobody can quantify.

The fix: build the eval harness on day one
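A first harness can be very small: a golden set of queries with known relevant documents, plus a couple of retrieval metrics computed on every change. The sketch below is an illustration under assumptions: `evaluate`, `recall_at_k`, and `ndcg_at_k` are hypothetical helper names, relevance is binary, and the retriever is any callable that returns a ranked list of unique document IDs.

```python
import math
from typing import Callable, Dict, List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def ndcg_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance nDCG: rewards placing relevant documents near the top."""
    dcg = sum(
        1.0 / math.log2(position + 2)
        for position, doc_id in enumerate(retrieved[:k])
        if doc_id in relevant
    )
    # Ideal ordering puts every relevant document first.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(position + 2) for position in range(ideal_hits))
    return dcg / idcg if idcg else 0.0


def evaluate(
    retriever: Callable[[str], List[str]],
    golden_set: Dict[str, Set[str]],   # query -> IDs of documents that answer it
    k: int = 5,
) -> Dict[str, float]:
    """Average recall@k and nDCG@k over the golden set."""
    if not golden_set:
        return {"recall@k": 0.0, "ndcg@k": 0.0}
    recalls, ndcgs = [], []
    for query, relevant in golden_set.items():
        retrieved = retriever(query)
        recalls.append(recall_at_k(retrieved, relevant, k))
        ndcgs.append(ndcg_at_k(retrieved, relevant, k))
    n = len(golden_set)
    return {"recall@k": sum(recalls) / n, "ndcg@k": sum(ndcgs) / n}


golden = {"q1": {"a"}, "q2": {"b", "c"}}
fake_results = {"q1": ["a", "x"], "q2": ["x", "b"]}
print(evaluate(lambda q: fake_results[q], golden, k=2))
```

Run the same golden set before and after every embedding, chunking, or reranker change, and record the numbers with the commit so each "improvement" has evidence behind it.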

The harsh truth: a RAG system without an eval harness is a system whose quality nobody knows, its own authors included. Build the harness before you optimize anything.

Putting it all together

The seven failures aren't independent — they interact. Stale embeddings make eval scores drift downward; naive chunking amplifies the noise that reranking has to clean up; missing ACLs become catastrophic precisely when latency optimizations push more queries through more cache layers.

Production-grade RAG is mostly about turning the loose, demo-shaped pipeline into a disciplined data system: known sources, known chunks, known retrieval contract, known access policy, known metrics. The architecture isn't exotic. The discipline is.

Stuck on a RAG system that won't make it to production?

We've taken enterprise RAG pipelines from "great demo" to "trusted internal tool" across healthcare, finance, and logistics. If any of the seven failure modes above sound familiar, let's talk.
