If you have spent any time in the AI space lately, you have probably seen a dozen tutorials that show you how to build a "RAG chatbot" in under 10 minutes. You copy-paste some Python, connect it to a PDF, ask it a question, and it answers correctly. Magic, right?
Then you try to do the same thing with real data, in a real app, with real users, and everything falls apart. The chatbot hallucinates. It retrieves the wrong chunks. It forgets context. Your users are unhappy. You are confused. The tutorial author is nowhere to be found.
This guide is for that second scenario. We are going to cover what a production-ready RAG pipeline actually looks like, why the naive version fails, and what decisions you need to make to build something that lasts longer than your demo.
Fair warning: this is a long one. Get a coffee.
What is RAG (and Why Should You Care)?
RAG stands for Retrieval Augmented Generation. The core idea is simple: instead of relying solely on what an LLM "knows" from training, you retrieve relevant information from a custom knowledge base and inject it into the prompt before asking the model to answer.
The result? The model answers based on your data, not just its training corpus. This is why RAG is everywhere right now. It solves a real problem without requiring you to fine-tune or host your own model. You get the reasoning capabilities of a frontier LLM applied to your specific documents, databases, or knowledge base.
flowchart LR
User["User Query"] --> Embed["Embed Query"]
Embed --> VDB["Vector Database\n(ANN Search)"]
VDB --> Chunks["Retrieved Chunks"]
Chunks --> Prompt["Augmented Prompt"]
Prompt --> LLM["LLM"]
LLM --> Response["Response"]
Simple concept. Deceptively difficult to execute well at scale. Let's get into it.
The Real Problem with Naive RAG
Most tutorials build what I call "demo RAG." It works on clean PDFs, in a local Jupyter notebook, with a small dataset, and a very patient user who asks questions that happen to match perfectly with the embedded content.
Here are the three places it falls apart in production.
Chunking
What most tutorials say: Split your documents into 500-token chunks with 50-token overlap, embed them all, and you are done.
What actually happens in production: Your documents are a mess. PDFs have headers, footers, page numbers, and tables. HTML has navigation menus, cookie banners, and sidebar content. Word documents have inline metadata and tracked changes. Your "clean" input data is not clean at all.
Fixed-size chunking blindly splits sentences mid-thought, separates questions from their answers, and turns coherent paragraphs into noise. The chunks you embed are incoherent fragments, and you wonder why the model returns garbage.
Retrieval Quality
What most tutorials say: Embed the query, do a cosine similarity search, take the top 5 chunks.
What actually happens in production: The top 5 chunks are often irrelevant, redundant, or both. Pure vector search struggles with exact-match queries (like product codes or invoice numbers), misses context-dependent relevance, and returns stale results if your index has not been refreshed lately. Users ask questions your embedding model was never optimised to handle well.
LLM Response Quality
What most tutorials say: Feed the chunks to the LLM and it will figure out what is relevant.
What actually happens in production: The LLM confidently stitches together an answer from chunks that are technically present in the context window but semantically disconnected from the actual question. It hallucinates citations. It contradicts itself across paragraphs. If your chunks contain any noise (stray HTML, table headers, leftover metadata), the model gets confused and outputs polished nonsense with full confidence.
The root cause in all three cases is the same: RAG quality is determined upstream. You cannot fix bad retrieval with a clever prompt.
Data Ingestion: The Messiest Part Nobody Talks About
Real-world data sources are a disaster zone. Before you even think about embeddings or vector databases, you need to invest seriously in your ingestion pipeline. This is where most teams underinvest and pay for it later.
Document Parsing
HTML: Strip it properly. Simply calling something like strip_tags() and moving on will leave navigation menus, cookie banners, ad scripts, and footer links in your chunks. Use a parser that understands content structure and throws away everything that is not the main body.
PDFs: These are the worst offenders. Scanned PDFs need OCR. Multi-column PDFs confuse most text extractors. Tables become garbled nonsense when linearised. Images are invisible to the text pipeline entirely. Use a dedicated PDF extraction library and be prepared to write document-specific handling for anything that matters.
Office documents: Word files have tracked changes, comments, and embedded objects. Excel files have merged cells and multiple sheets. Each format has its own failure mode. Know the formats in your data source and handle them explicitly.
A rough ingestion pipeline looks like this:
flowchart TD
Source["Raw Source\n(PDF, HTML, DOCX, DB)"] --> Parse["Document Parser"]
Parse --> Clean["Content Cleaner\n(strip boilerplate, fix encoding)"]
Clean --> Chunk["Chunker"]
Chunk --> Meta["Metadata Enrichment"]
Meta --> Embed["Embedding Service"]
Embed --> Index["Vector Index"]
Meta --> Store["Object Storage\n(raw originals)"]
Cleaning and Preprocessing
Before chunking, clean the text:
- Remove boilerplate: headers, footers, repeated navigation text
- Fix encoding issues (UTF-8 problems are surprisingly common with legacy content)
- Normalise whitespace and line breaks
- Handle special characters and ligatures
This is the most boring part of building a RAG system. It is also the difference between a system that works and one that produces mysterious failures at 2 AM. Do not skip it.
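As an illustrative sketch of those cleaning steps (the boilerplate patterns and mojibake fixes below are assumptions; derive yours from the sources you actually ingest):

```typescript
// Illustrative cleaner: strip boilerplate lines, repair common
// mojibake and ligatures, and normalise whitespace before chunking.
const BOILERPLATE_PATTERNS = [
  /^Page \d+ of \d+$/i, // page footers
  /^Copyright ©.*$/i, // repeated legal footers
  /^(Home|About|Contact)(\s*\|\s*(Home|About|Contact))+$/i, // nav menus
];

function cleanText(raw: string): string {
  return raw
    .replace(/\u00e2\u0080\u0099/g, "'") // common UTF-8 mojibake for an apostrophe
    .replace(/\ufb01/g, "fi") // ligature fi
    .replace(/\ufb02/g, "fl") // ligature fl
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => !BOILERPLATE_PATTERNS.some((p) => p.test(line)))
    .join("\n")
    .replace(/\n{3,}/g, "\n\n") // collapse runs of blank lines
    .trim();
}
```

The pattern list is the part that grows over time: every new data source contributes its own repeated junk, and the only way to find it is to read your chunks.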
Chunking Strategies
This is where a surprising number of systems go wrong. Here are the main approaches:
| Strategy | Complexity | Quality | Best For |
|---|---|---|---|
| Fixed-size | Low | Low-Medium | Quick prototypes, dense prose |
| Sentence-based | Low | Medium | General narrative content |
| Semantic | Medium | High | Mixed or varied content |
| Document-aware | High | Very High | Structured docs (FAQs, product pages) |
Fixed-size chunking: Split every N tokens with an M-token overlap. Simple to implement, terrible in practice for anything except uniform prose. Good enough to get started, not good enough to stay with.
Sentence-based chunking: Split by sentence boundaries. Preserves semantic units better than fixed-size. The downside is that sentences vary wildly in information density, and short sentences lose context without their neighbours.
Semantic chunking: Use embedding similarity to detect topic shifts and split at natural boundaries. Produces more coherent chunks. Slower and more expensive at ingestion time, but the quality improvement is real.
Document-aware chunking: Respect the structure of your content. A FAQ document should be chunked per question-answer pair. A product page should be chunked per section. An API reference should be chunked per endpoint. This is the highest quality approach, but requires custom logic per document type. Worth it for high-value content.
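To make the document-aware idea concrete, here is a sketch of an FAQ chunker that emits one chunk per question-answer pair. It assumes questions are `## ` markdown headings; adapt the boundary detection to whatever structure your documents actually have:

```typescript
interface FaqChunk {
  question: string;
  content: string;
}

// Split an FAQ document into one chunk per Q&A pair, assuming
// each question is a markdown "## " heading.
function chunkFaq(doc: string): FaqChunk[] {
  const chunks: FaqChunk[] = [];
  let current: FaqChunk | null = null;
  for (const line of doc.split("\n")) {
    const heading = line.match(/^##\s+(.*)/);
    if (heading) {
      if (current) chunks.push(current);
      current = { question: heading[1], content: "" };
    } else if (current) {
      current.content += line + "\n";
    }
  }
  if (current) chunks.push(current);
  // Keep the question inside the chunk text so the embedding sees it too.
  return chunks.map((c) => ({
    ...c,
    content: `${c.question}\n${c.content.trim()}`,
  }));
}
```

Including the question text in the chunk body matters: user queries tend to resemble the question far more than the answer, so the embedding should cover both.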
The Importance of Metadata
Every chunk must carry metadata: source URL, document title, section heading, creation date, document type, and any domain-specific fields. This metadata enables filtering, attribution, and freshness checks at query time.
A chunk without metadata is a decontextualised blob. Do not produce decontextualised blobs.
interface DocumentChunk {
id: string;
content: string;
embedding?: number[];
metadata: {
sourceId: string;
sourceUrl: string;
title: string;
section?: string;
createdAt: string;
updatedAt: string;
docType: "faq" | "article" | "product" | "policy";
tags?: string[];
tenantId?: string;
};
}
Embeddings: Where People Make Expensive Mistakes
Embeddings convert your text into high-dimensional vectors that capture semantic meaning. The embedding model you choose matters more than most teams realise, and the mistakes here are expensive because they affect every query your system ever processes.
Model Selection Tradeoffs
| Model Type | Cost | Latency | Quality | Notes |
|---|---|---|---|---|
| API-based (e.g. text-embedding-3) | Per token | Low-Med | High | Easy to start, vendor dependency |
| Self-hosted small (e.g. all-MiniLM-L6-v2) | Infra cost | Very Low | Medium | Great baseline, cheap to run |
| Self-hosted large (e.g. E5-large-v2) | Infra cost | Medium | High | Needs GPU, excellent accuracy |
| Domain-fine-tuned | High upfront | Varies | Very High | Overkill unless you have a very specific domain |
What most tutorials say: Just use the OpenAI embeddings API. It is the best and the easiest.
What actually happens in production:
For many use cases, a self-hosted model like all-MiniLM-L6-v2 gives you 80-90% of the quality at a fraction of the cost, with zero vendor dependency and sub-5ms latency. The sweet spot for most production systems is a mid-sized, open-source embedding model you control. If you are in a regulated industry (finance, healthcare, legal), sending all your document content to a third-party API may not even be an option.
There is also the migration trap: the embedding model you use during ingestion must be the same model you use at query time. This sounds obvious, but every few months a team quietly migrates embedding providers, forgets to re-index their documents, and spends three days debugging mysterious relevance degradation.
When NOT to Embed Everything
Not every piece of text needs a vector embedding. Frequently-changing content, highly structured data (JSON records with well-defined fields), and very short strings under 20 tokens are often better served by traditional database queries or keyword search. Embedding has a cost and a latency budget. Be selective about what actually benefits from semantic search.
Batch vs Real-Time Embedding
For ingestion, always batch your embedding calls. Sending 10,000 individual API requests is painfully slow and expensive. For query-time embedding (the user's question), you need low latency, so treat them separately.
A good ingestion pipeline batches in groups of 100-500 documents, handles retries with exponential backoff, and tracks which documents have already been embedded so you are not re-processing unchanged content.
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function batchEmbed(
  chunks: DocumentChunk[],
  batchSize = 200,
): Promise<DocumentChunk[]> {
const results: DocumentChunk[] = [];
for (let i = 0; i < chunks.length; i += batchSize) {
const batch = chunks.slice(i, i + batchSize);
const texts = batch.map((c) => c.content);
// Replace with your embedding provider
const embeddings = await embeddingService.embed(texts);
results.push(
...batch.map((chunk, idx) => ({
...chunk,
embedding: embeddings[idx],
})),
);
// Brief pause to respect rate limits
await sleep(50);
}
return results;
}
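The retry-with-exponential-backoff behaviour mentioned above can be a small wrapper around the embedding call. A sketch (which errors count as retryable depends on your provider; this version retries everything up to a cap):

```typescript
// Retry an async operation with exponential backoff plus jitter.
// Attempt delays grow as baseDelayMs * 2^(attempt - 1).
async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 5,
  baseDelayMs = 250,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxAttempts) throw err;
      // e.g. 250ms, 500ms, 1s, 2s... plus up to 100ms of jitter
      const delay = baseDelayMs * 2 ** (attempt - 1) + Math.random() * 100;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

Inside `batchEmbed`, the embedding call then becomes `await withRetry(() => embeddingService.embed(texts))`.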
Retrieval: The Layer Where Systems Live or Die
This is the most important section. Get retrieval wrong, and no amount of prompt engineering will save you. Seriously.
Why Pure Vector Search is Not Enough
Most RAG tutorials stop at cosine similarity search: embed the query, find the nearest neighbours in vector space, return the top K results. This works reasonably well for semantic similarity but breaks down in the cases your users actually care about:
- Exact keyword matches: A user searches for "Invoice #INV-2024-0091." Vector search has no idea what to do with that. It will return semantically similar content about invoices in general, which is completely useless.
- Rare terms and jargon: Technical product names, internal terminology, and proper nouns are poorly represented in general-purpose embedding space.
- Recency: Vector search has no concept of "newest first." A document from three years ago ranks identically to one from last week if the embeddings are similar.
Hybrid Search: The Real Answer
The solution is hybrid search: combining BM25 keyword search with vector search. Keyword search handles exact matches and rare terms. Vector search handles semantic similarity. Together, they cover each other's blind spots.
flowchart LR
Query["User Query"] --> KW["BM25\nKeyword Search"]
Query --> VS["Vector\nSemantic Search"]
KW --> Merge["Score Fusion\n(RRF)"]
VS --> Merge
Merge --> Reranker["Cross-Encoder\nRe-ranker"]
Reranker --> TopK["Top K Results\n(context)"]
Most modern vector databases (Qdrant, Weaviate, OpenSearch, Elasticsearch) support hybrid search natively, and Postgres can do it by combining pgvector with its built-in full-text search. Use it. The complexity overhead is small and the quality improvement is significant. There is genuinely no reason to ship pure vector search in 2026.
Score fusion: When merging results from keyword and vector search, use RRF. It is simple, needs essentially no tuning, and works well in practice:
RRF(doc) = sum of [ 1 / (k + rank_in_list) ]
where k is typically 60. No tuning required.
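A minimal sketch of RRF over ranked lists of document ids, with k = 60 as above:

```typescript
// Reciprocal Rank Fusion: merge ranked lists of document ids.
// Each list contributes 1 / (k + rank) per document, ranks starting at 1.
function rrfFuse(lists: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const list of lists) {
    list.forEach((id, i) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + i + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```

A document ranked well by both BM25 and vector search accumulates score from both lists, so it outranks a document that appears in only one.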
Re-Ranking: The Single Best Quality Improvement
Even with hybrid search, your top K candidates are still ranked by a fast approximate relevance score. Re-ranking is a second pass that uses a more powerful cross-encoder model to precisely score query-document pairs from the shortlist.
The pattern is:
- Retrieve top 20-50 candidates cheaply using hybrid search
- Re-rank with a cross-encoder that sees the full query and document together
- Return the top 5-10
The re-ranker sees the query and document simultaneously, giving it far better context than the bi-encoder used for initial retrieval. This two-stage approach is used by every serious production RAG system I have encountered. It is not optional.
async function retrieve(
query: string,
filter: Record<string, unknown>,
): Promise<DocumentChunk[]> {
// Stage 1: Fast hybrid retrieval (30 candidates)
const candidates = await vectorDb.hybridSearch(query, {
filter,
topK: 30,
});
// Stage 2: Precise re-ranking
const scores = await reranker.score(
query,
candidates.map((c) => c.content),
);
// Return top 5 after re-ranking
return scores
.map((score, index) => ({ score, chunk: candidates[index] }))
.sort((a, b) => b.score - a.score)
.slice(0, 5)
.map((r) => r.chunk);
}
Metadata Filtering
Applied before or alongside retrieval, metadata filtering dramatically improves both precision and security. If a user is asking about a specific product, filter to only chunks from that product's documentation. In a multi-tenant system, filter strictly to that tenant's data.
Most vector databases support pre-filtering (applied before the ANN search) or post-filtering. Pre-filtering is faster but can hurt recall if the filtered subset is small. Post-filtering is safer for correctness. Know the difference and choose deliberately.
What most tutorials say: Retrieve from the full index and let the LLM decide what is relevant.
What actually happens in production: An unfiltered index in a multi-tenant system is simultaneously a quality problem and a security problem. Tenant A can, in theory, retrieve information that belongs to Tenant B. Metadata-based tenant isolation is not a nice-to-have. It is a hard requirement.
Prompt Engineering for RAG
Good retrieval gets you the right information. Good prompting gets the LLM to use it correctly. Both are necessary.
Context Injection
The simplest approach is concatenating retrieved chunks into the prompt:
You are a helpful assistant. Answer the question based only on the context below.
Context:
[chunk 1]
[chunk 2]
[chunk 3]
Question: {user_question}
This works, but it has known failure modes.
The lost-in-the-middle problem: LLMs are better at using context that appears at the beginning and end of the prompt. Important information buried in the middle of a long context gets underweighted or ignored. Put your most relevant chunk first or last, not in the middle of a pile.
Context window overflow: If your chunks are large and you are injecting many of them, you will hit context limits. Define a hard cap on injected context (for example, 2000 tokens maximum) and prioritise based on re-ranker scores. Padding the prompt with marginally relevant text actively hurts answer quality.
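One way to enforce that cap is a greedy selection over the re-ranked chunks. This sketch uses a rough four-characters-per-token estimate; a real implementation should use your model's actual tokenizer:

```typescript
// Crude token estimate; swap in your model's real tokenizer.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Greedily take chunks (assumed sorted best-first by re-ranker score)
// until the context budget is spent.
function fitToBudget<T extends { content: string }>(
  chunks: T[],
  maxTokens = 2000,
): T[] {
  const selected: T[] = [];
  let used = 0;
  for (const chunk of chunks) {
    const cost = estimateTokens(chunk.content);
    if (used + cost > maxTokens) break;
    selected.push(chunk);
    used += cost;
  }
  return selected;
}
```

Because the input is already sorted by re-ranker score, truncation drops the least relevant chunks first, which is exactly the behaviour you want when the budget is tight.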
Source attribution: Add source labels to each chunk. This makes hallucinations visible and lets users verify answers. It also makes debugging much easier when the system returns something wrong.
function buildPrompt(query: string, chunks: DocumentChunk[]): string {
const contextBlocks = chunks
.map((c, i) => `[Source ${i + 1}: ${c.metadata.title}]\n${c.content}`)
.join("\n\n---\n\n");
return `You are a helpful assistant. Answer the user's question using only the information provided in the context below. If the context does not contain enough information to answer clearly, say so.
Context:
${contextBlocks}
Question: ${query}
Answer:`;
}
Guardrails and Structured Output
For production systems, predictable LLM behaviour matters more than maximum creativity. A few patterns that help:
- Constrain the format: Ask for JSON output when you need structured responses. Most modern LLMs handle this reliably.
- Source attribution in the response: Instruct the model to reference which source it used. Hallucinations are much easier to catch when the model has to claim a source.
- Explicit fallback instruction: "If you cannot find the answer in the context, respond with: I do not have information about that." Without this, models make things up rather than admitting ignorance.
- Confidence signalling: Asking the model to indicate its confidence is imperfect, but useful as a weak signal for when to trigger a human review workflow.
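Even with a format constraint, the model's output needs defensive parsing, because LLMs occasionally emit malformed JSON despite instructions. A sketch (the response shape here is an illustrative choice, not a standard):

```typescript
interface RagAnswer {
  answer: string;
  sources: number[]; // indices into the injected context blocks
  confidence: "high" | "medium" | "low";
}

// Parse the model's JSON output defensively; malformed or
// incomplete output degrades to the explicit fallback message.
function parseRagAnswer(raw: string): RagAnswer {
  const fallback: RagAnswer = {
    answer: "I do not have information about that.",
    sources: [],
    confidence: "low",
  };
  try {
    const parsed = JSON.parse(raw);
    if (typeof parsed.answer !== "string" || !Array.isArray(parsed.sources)) {
      return fallback;
    }
    return { confidence: "low", ...parsed };
  } catch {
    return fallback;
  }
}
```

Defaulting confidence to "low" when the model omits it keeps the weak-signal behaviour conservative: missing metadata routes toward review, not away from it.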
Avoiding Prompt Bloat
Every token costs money and adds latency. Keep your system prompt concise. Avoid repeating instructions that are already implicit. Test the minimum prompt that produces acceptable behaviour and stick to it. The instinct to add more instructions when quality is bad is understandable, but usually the real fix is in retrieval, not in the prompt.
Evaluation: The Most Underrated Part of the Entire Stack
Most teams ship a RAG system, test it on five representative questions, get acceptable answers, and ship it. Three months later, users are reporting quality degradation, but nobody has the data to understand what changed or why.
Evaluation is how you know if your system is actually working. It is also how you justify engineering investment in hybrid search, re-ranking, and better chunking to stakeholders who want to know why you are spending time on retrieval infrastructure instead of shipping features.
What to Measure
Retrieval quality:
- Recall@K: Of the relevant documents in your knowledge base, how many did retrieval surface in the top K? If this is low, your retrieval pipeline is your problem.
- Precision@K: Of the K documents retrieved, how many were actually relevant? If this is low, your ranking is the problem.
- MRR: Measures how highly the first relevant result is ranked. A useful signal for single-answer lookup scenarios.
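These retrieval metrics are cheap to compute once you have annotated which chunks are relevant for each question. A sketch for a single query (MRR is then the mean of `reciprocalRank` across your test set):

```typescript
// Retrieval metrics for one query: a ranked list of retrieved ids
// against the annotated set of relevant ids.
function recallAtK(retrieved: string[], relevant: Set<string>, k: number): number {
  const hits = retrieved.slice(0, k).filter((id) => relevant.has(id)).length;
  return relevant.size === 0 ? 0 : hits / relevant.size;
}

function precisionAtK(retrieved: string[], relevant: Set<string>, k: number): number {
  const hits = retrieved.slice(0, k).filter((id) => relevant.has(id)).length;
  return hits / k;
}

function reciprocalRank(retrieved: string[], relevant: Set<string>): number {
  const index = retrieved.findIndex((id) => relevant.has(id));
  return index === -1 ? 0 : 1 / (index + 1);
}
```
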
Generation quality:
- Answer relevance: Does the response actually address the question?
- Faithfulness: Is the response supported by the retrieved context, or is the model generating information beyond what was provided?
- Latency: End-to-end response time from query to first token and to full completion.
A Practical Evaluation Approach
You do not need a heavy framework to start. A minimal evaluation setup:
- Collect 50-100 representative questions from real user queries or synthetic data
- Annotate expected answers and the chunks that should be retrieved for each
- Run your pipeline and compare outputs
- Track metrics over time as you make changes to the pipeline
For automated evaluation, the LLM-as-judge pattern works well for generation quality. Send the question, the retrieved context, and the generated answer to a separate LLM call and ask it to score relevance and faithfulness.
async function evaluateResponse(
question: string,
context: string,
answer: string,
): Promise<{ relevance: number; faithfulness: number }> {
const prompt = `You are an evaluator for a question-answering system.
Question: ${question}
Context: ${context}
Answer: ${answer}
Rate the answer on two dimensions from 1 to 5:
- Relevance: Does the answer address the question asked?
- Faithfulness: Is the answer supported by the provided context (not invented)?
Respond with JSON only: { "relevance": <number>, "faithfulness": <number> }`;
const response = await llm.complete(prompt, { format: "json" });
return JSON.parse(response);
}
This approach is imperfect (LLM judges have their own biases), but it is far better than no evaluation at all. Treat it as a relative signal, not an absolute score.
Production Challenges: The Part Tutorials Skip
Now for the parts tutorials never cover because they require infrastructure to care about.
Latency Optimisation
A naive RAG pipeline has multiple sequential steps: embed the query, search the index, re-rank, build the prompt, call the LLM. Each step adds latency. In practice:
| Step | Typical Latency |
|---|---|
| Query embedding (API) | 10-50ms |
| Query embedding (self-hosted) | 2-10ms |
| Vector search | 5-20ms |
| Re-ranking (20 candidates) | 50-200ms |
| LLM generation | 500ms-5s |
The LLM call dominates, and you should stream the response so users see output immediately rather than waiting for the full completion. Beyond that:
- Cache embedding results for common or repeated queries
- Run embedding and search steps in parallel where possible
- Set aggressive timeouts on each step and have fallback behaviour when they are exceeded
- Pre-warm your embedding service to avoid cold-start latency spikes
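A sketch of the per-step timeout idea using `Promise.race`; the fallback value lets a slow step degrade gracefully instead of failing the whole request:

```typescript
// Run a pipeline step with a hard timeout; if it misses the budget,
// resolve with the fallback value instead of blocking the request.
async function withTimeout<T>(
  step: Promise<T>,
  ms: number,
  fallback: T,
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<T>((resolve) => {
    timer = setTimeout(() => resolve(fallback), ms);
  });
  try {
    return await Promise.race([step, timeout]);
  } finally {
    clearTimeout(timer);
  }
}
```

For example, if the re-ranker misses its budget you can fall back to the hybrid-search ordering: `await withTimeout(reranker.rerank(query, candidates), 200, candidates)`.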
Cost Control
RAG systems get expensive faster than you expect. The main cost drivers and how to manage them:
Embedding API costs: Batch aggressively at ingestion time. Cache embeddings and only re-embed when content actually changes. Track your embedding token usage the same way you track LLM token usage.
LLM costs: Minimise context window usage by being strict about how many chunks you inject. Cache responses for repeated identical queries. Consider using smaller, cheaper models for lower-stakes queries and reserving frontier models for complex ones.
Vector database hosting: Managed services are easy to start but expensive at scale. Self-hosted (Qdrant, Weaviate, Milvus) is cheaper at volume but requires operational overhead. Plan your scale before committing to a tier.
Index Updates
Your knowledge base is not static. Documents get added, updated, and deleted. Your vector index needs to stay in sync. The decisions you need to make:
Full re-index vs incremental updates: For large corpora, always prefer incremental. Full re-indexing is a heavyweight operation that blocks fresh content from appearing in results.
Deletions: Hard deletes from vector databases are often slow or expensive depending on the implementation. Use a soft-delete pattern: mark deleted documents with a metadata flag, filter them out at query time, and run periodic bulk cleanups during low-traffic windows.
Index versioning: When you change your chunking strategy or switch embedding models, every document needs to be re-processed. Track your index version as a metadata field so you know which documents need refreshing after a schema change.
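The soft-delete flag and index version then become routine query-time concerns. A sketch with illustrative field names:

```typescript
const CURRENT_INDEX_VERSION = 3; // bump on chunking or embedding changes

// Illustrative query-time filter: exclude soft-deleted chunks and
// chunks embedded under an older pipeline version.
function buildQueryFilter(tenantId: string): Record<string, unknown> {
  return {
    tenantId,
    deleted: false,
    indexVersion: CURRENT_INDEX_VERSION,
  };
}

// During a migration, identify chunks that still need re-processing.
function needsRefresh(chunk: { metadata: { indexVersion?: number } }): boolean {
  return (chunk.metadata.indexVersion ?? 0) < CURRENT_INDEX_VERSION;
}
```
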
Multi-Tenant Architecture
If you are building a product where each customer has their own knowledge base:
- Use metadata filtering for tenant isolation (fast, requires careful implementation and thorough testing)
- Alternatively, use separate namespaces or collections per tenant (cleaner isolation, more management overhead)
Critically: never mix tenant data into the same chunks. The metadata boundary is your security boundary. Test tenant isolation the same way you test any other security control.
Scaling
Vector databases have their own scaling characteristics. Know your numbers before you hit them:
- How many documents will you index at peak?
- What is your expected query-per-second throughput?
- What is acceptable latency at p99?
Most managed vector databases scale horizontally, but there is a cost involved. Size for the load you have now with a known upgrade path, not for hypothetical future scale.
Reference Architecture
Here is what a real-world production RAG system looks like, with the components and their responsibilities:
flowchart TD
subgraph Ingestion["Ingestion Pipeline"]
Src["Data Sources\n(S3, DB, CMS, API)"] --> Parser["Document Parser"]
Parser --> Chunker["Chunker + Metadata Tagger"]
Chunker --> EmbSvc["Embedding Service"]
EmbSvc --> VDB["Vector Database\n(hybrid index)"]
Chunker --> ObjStore["Object Storage\n(raw originals)"]
end
subgraph Query["Query Path"]
API["REST API"] --> AuthZ["Auth + Tenant Filter"]
AuthZ --> Cache["Response Cache\n(Redis)"]
AuthZ --> QEmb["Query Embedder"]
QEmb --> Hybrid["Hybrid Search\n(BM25 + Vector)"]
Hybrid --> Rerank["Re-ranker"]
Rerank --> CtxBuild["Context Builder\n(prompt assembly)"]
CtxBuild --> LLMLayer["LLM Abstraction Layer"]
LLMLayer --> LLMProvider["LLM Provider\n(OpenAI / Anthropic / Local)"]
LLMProvider --> Stream["Streaming Response"]
end
VDB --> Hybrid
Component Responsibilities
REST API: Entry point for all user queries. Handles authentication, rate limiting, input validation, and response streaming. This is your standard backend service and should be treated like one.
Embedding Service: A thin wrapper around your embedding model. Can be a self-hosted open-source model (on CPU or GPU) or a call to a managed API. The key requirement is that it is fast and deterministic: same input always produces the same output.
Vector Database: Stores embeddings and metadata. Handles ANN search, metadata filtering, and document CRUD operations. Popular choices include Qdrant, Weaviate, Milvus, Pinecone, and pgvector for PostgreSQL. Each has different tradeoffs around hosting, performance, and feature set.
LLM Abstraction Layer: A thin wrapper that lets you swap LLM providers without changing any other code. This is worth building from day one. You will change providers, whether because of pricing, a new model release, or a latency requirement. Keep the swap cost at zero.
Response Cache: Cache responses for identical or near-identical queries. A semantic cache can match similar questions to existing cached answers and skip the LLM call entirely. Redis is the standard choice here.
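A semantic cache can be as simple as a cosine-similarity match over cached query embeddings. A sketch (the 0.95 threshold is an assumption you would tune against false-positive cache hits, and a production version would hold entries in Redis rather than in memory):

```typescript
interface CacheEntry {
  embedding: number[];
  answer: string;
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Return the cached answer for the most similar past query,
// or null if nothing clears the similarity threshold.
function semanticLookup(
  queryEmbedding: number[],
  entries: CacheEntry[],
  threshold = 0.95,
): string | null {
  let best: CacheEntry | null = null;
  let bestScore = threshold;
  for (const entry of entries) {
    const score = cosine(queryEmbedding, entry.embedding);
    if (score >= bestScore) {
      best = entry;
      bestScore = score;
    }
  }
  return best ? best.answer : null;
}
```

The threshold is the whole game: too low and users get answers to subtly different questions, too high and the cache never hits.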
Object Storage: Stores the raw source documents. The vector database stores embeddings; object storage stores the originals. You will need originals for re-indexing after a schema change and for showing source previews to users.
A Sample Query Flow
// Simplified RAG query handler
app.post("/api/chat", async (req, res) => {
const { query, tenantId } = req.body;
// Validate inputs
if (!query || typeof query !== "string" || query.length > 2000) {
return res.status(400).json({ error: "Invalid query" });
}
// Check response cache
const cached = await cache.get(tenantId, query);
if (cached) {
return res.json({ answer: cached, fromCache: true });
}
// Embed the query
const queryEmbedding = await embeddingService.embed(query);
// Hybrid retrieval with tenant isolation
const candidates = await vectorDb.hybridSearch({
query,
embedding: queryEmbedding,
filter: { tenantId },
topK: 30,
});
// Re-rank candidates
const reranked = await reranker.rerank(query, candidates);
const topChunks = reranked.slice(0, 5);
// Build the prompt
const prompt = buildPrompt(query, topChunks);
// Stream the LLM response
res.setHeader("Content-Type", "text/event-stream");
res.setHeader("Cache-Control", "no-cache");
const stream = await llm.stream(prompt);
let fullResponse = "";
for await (const token of stream) {
res.write(`data: ${JSON.stringify({ token })}\n\n`);
fullResponse += token;
}
res.end();
// Cache asynchronously, do not block the response
cache.set(tenantId, query, fullResponse).catch(console.error);
});
Key Takeaways
After covering all of this, here is what actually moves the needle, in rough order of impact:
- Invest in ingestion first. Garbage in, garbage out. Your chunking and cleaning pipeline is the foundation of everything else. No amount of prompt engineering compensates for bad chunks.
- Use hybrid search. Pure vector search is not enough for production use cases. BM25 combined with vector search and re-ranking is the industry standard for a reason.
- Re-rank your results. If you are not running a cross-encoder re-ranker over your retrieval candidates, you are leaving significant quality on the table for a relatively small latency and infrastructure cost.
- Metadata is not optional. Tag everything at ingestion time. Use filters aggressively at query time. This improves both relevance and security.
- Abstract your dependencies. Embedding models, LLM providers, and vector databases will all change over the lifetime of your system. Keep them behind interfaces you own. The swap cost should always be close to zero.
- Set up evaluation before you optimise. Without a baseline, every change is guesswork. Even a simple 50-question test set with expected answers tells you more than instinct.
- Plan for freshness from day one. Documents change. Implement incremental indexing, soft-delete patterns, and index versioning before you have thousands of documents to deal with. Retrofitting these later is genuinely painful.
- Multi-tenant isolation is a security control. Treat it with the same rigour you would give to any other access control. Test it explicitly.
The gap between a toy RAG system and a production RAG system is not a single clever technique. It is ten boring engineering decisions made carefully, one after another. Most teams get three or four right on the first attempt and spend the following year fixing the rest.
Now you know which ones to get right upfront.
Acronym Legend
| Acronym | Full Form | Notes |
|---|---|---|
| RAG | Retrieval-Augmented Generation | A technique that retrieves relevant documents before asking an LLM to generate an answer |
| LLM | Large Language Model | AI models trained on large text corpora, e.g. GPT-4, Claude, Llama |
| ANN | Approximate Nearest Neighbour | A fast vector search algorithm used in vector databases |
| BM25 | Best Match 25 | A probabilistic keyword ranking algorithm, the backbone of traditional full-text search |
| RRF | Reciprocal Rank Fusion | A score fusion algorithm for merging ranked lists from different retrieval systems |
| MRR | Mean Reciprocal Rank | A metric for evaluating retrieval quality based on the rank of the first relevant result |
| OCR | Optical Character Recognition | Converting scanned images of text into machine-readable characters |
| GPU | Graphics Processing Unit | Hardware used to accelerate machine learning model inference |
| API | Application Programming Interface | A contract for communication between software components |
| p99 | 99th Percentile Latency | The response time that 99% of requests complete within; a standard worst-case latency metric |
| FAQ | Frequently Asked Questions | A document format of common questions paired with their answers |
| CMS | Content Management System | A platform for managing digital content, e.g. WordPress, Contentful |