RAG Architecture Explained: How Retrieval-Augmented Generation Works in 2026
Author: ZTABS Team
Retrieval-Augmented Generation (RAG) has become the default approach for building AI systems that need to answer questions using private or current data. Instead of fine-tuning a model on your data (expensive, slow, hard to update), RAG retrieves relevant information at query time and feeds it to the LLM as context.
The concept is simple. The execution, in production, is anything but. This guide covers RAG architecture from first principles through advanced production patterns, so you can build systems that actually deliver accurate answers.
What Is RAG and Why Does It Matter?
RAG solves the fundamental limitation of large language models: they only know what they were trained on. An LLM trained in early 2025 knows nothing about your company's internal docs, your product catalog updates from last week, or the policy changes you made yesterday.
RAG bridges this gap by combining two capabilities:
- Retrieval — find the most relevant pieces of information from your data
- Generation — use an LLM to synthesize a natural language answer from that information
The result: an AI system that gives accurate, grounded, up-to-date answers about your specific data, without the cost and delay of model training.
RAG vs Fine-Tuning vs Prompt Engineering
| Approach | Best For | Data Freshness | Cost | Accuracy on Private Data |
|----------|----------|----------------|------|--------------------------|
| Prompt engineering | Small, static context | Manual updates | Low | Limited by context window |
| RAG | Large, dynamic knowledge bases | Real-time | Moderate | High (with good retrieval) |
| Fine-tuning | Changing model behavior/style | Requires retraining | High | Moderate (can hallucinate) |
| RAG + Fine-tuning | Maximum accuracy on domain data | Real-time + trained behavior | Highest | Highest |
For most use cases in 2026, RAG is the right starting point. Fine-tuning complements RAG when you also need to change how the model writes, reasons, or follows domain-specific patterns.
The RAG Pipeline: End to End
Every RAG system has two main pipelines: indexing (offline, preparing your data) and querying (online, answering questions).
Indexing Pipeline
The indexing pipeline runs offline (or on a schedule) to prepare your data for retrieval.
Documents → Load → Chunk → Embed → Store in Vector DB
Query Pipeline
The query pipeline runs in real-time when a user asks a question.
User Query → Embed → Retrieve → Re-rank → Augment Prompt → LLM → Answer
Let's go through each stage in detail.
Stage 1: Document Loading and Preprocessing
Before anything else, you need to get your data into a format the pipeline can process.
Common Data Sources
| Source Type | Tools | Challenges |
|-------------|-------|------------|
| PDFs | PyPDF2, Unstructured, LlamaParse | Tables, images, multi-column layouts |
| Web pages | BeautifulSoup, Firecrawl | Dynamic content, navigation cruft |
| Databases | SQL connectors, ORM queries | Schema mapping, joining related data |
| APIs | REST/GraphQL clients | Rate limits, pagination |
| Confluence/Notion | Official APIs, community loaders | Permission handling, nested pages |
| Slack/Teams | Bot APIs | Thread context, noise filtering |
Preprocessing Best Practices
Clean your data before chunking. Remove boilerplate (headers, footers, navigation text), normalize formatting, resolve abbreviations in technical docs, and extract metadata (title, author, date, source URL) that you'll use later for filtering.
```python
from unstructured.partition.auto import partition

elements = partition(filename="annual_report.pdf")

# Drop page headers and footers; they add noise without content
cleaned_elements = [
    element for element in elements
    if element.category not in ("Header", "Footer")
]
```
Stage 2: Chunking Strategies
Chunking is where most RAG systems succeed or fail. The goal is to split documents into pieces that are semantically coherent and the right size for retrieval.
Chunking Methods Compared
| Method | How It Works | Best For | Chunk Quality |
|--------|--------------|----------|---------------|
| Fixed-size | Split every N characters/tokens | Quick prototype | Low |
| Recursive character | Split by separators (paragraphs, sentences) | General purpose | Medium |
| Semantic | Split when embedding similarity drops | Documents with varied topics | High |
| Document-structure | Split by headings, sections | Well-structured docs (Markdown, HTML) | High |
| Agentic chunking | LLM decides chunk boundaries | Complex, mixed-format docs | Highest |
Recommended Approach: Recursive with Overlap
For most use cases, recursive character splitting with overlap gives the best balance of quality and speed.
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""],
    length_function=len,
)
chunks = splitter.split_documents(documents)
```
Chunk size selection guidelines:
- 500–800 tokens — best for precise factual Q&A (support docs, FAQs)
- 800–1200 tokens — good general-purpose size for most RAG applications
- 1200–2000 tokens — better for complex topics that need more context (legal, medical)
Always use overlap. Without overlap, information that spans chunk boundaries gets lost. 10–20% overlap is standard.
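The effect of overlap is easy to demonstrate with a toy character-based splitter (a deliberate simplification of what `RecursiveCharacterTextSplitter` does; the text and sizes here are made up for illustration):

```python
def split_with_overlap(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Naive character-based splitter illustrating why overlap matters."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

# A 19-character "fact" sitting right on a 100-character chunk boundary
text = "A" * 90 + "KEY FACT SPANS HERE" + "B" * 91

no_overlap = split_with_overlap(text, chunk_size=100, overlap=0)
with_overlap = split_with_overlap(text, chunk_size=100, overlap=20)

# Without overlap the fact is cut in half across two chunks;
# with 20% overlap, at least one chunk contains it whole.
print(any("KEY FACT SPANS HERE" in c for c in no_overlap))    # False
print(any("KEY FACT SPANS HERE" in c for c in with_overlap))  # True
```

Neither half-fact would match a query about it at retrieval time, which is exactly the failure mode overlap prevents.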
Metadata Enrichment
Attach metadata to every chunk. This enables filtered retrieval and better source attribution.
```python
for chunk in chunks:
    chunk.metadata.update({
        "source": document.metadata["source"],
        "title": document.metadata["title"],
        "section": extract_section_heading(chunk),  # your own heading-extraction helper
        "date": document.metadata.get("date"),
        "doc_type": document.metadata.get("type", "general"),
    })
```
Stage 3: Embedding Models
Embeddings convert text chunks into numerical vectors that capture semantic meaning. Similar concepts end up close together in vector space, enabling semantic search.
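"Close together" typically means high cosine similarity between vectors. A minimal sketch with made-up 3-dimensional vectors (real embeddings have hundreds to thousands of dimensions, as the table below shows):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 = same direction, ~0.0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy embeddings; values chosen by hand purely for illustration
query = [0.9, 0.1, 0.0]   # "reset password"
doc_a = [0.8, 0.2, 0.1]   # a password-reset help article
doc_b = [0.0, 0.1, 0.9]   # an unrelated billing FAQ

print(cosine_similarity(query, doc_a) > cosine_similarity(query, doc_b))  # True
```

Vector databases run this comparison (or an approximation of it) across millions of stored vectors at once.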
Embedding Model Comparison (2026)
| Model | Dimensions | Max Tokens | Performance (MTEB) | Cost | Speed |
|-------|------------|------------|--------------------|------|-------|
| OpenAI text-embedding-3-large | 3072 | 8191 | Excellent | $0.13/1M tokens | Fast |
| OpenAI text-embedding-3-small | 1536 | 8191 | Good | $0.02/1M tokens | Very fast |
| Cohere embed-v3 | 1024 | 512 | Excellent | $0.10/1M tokens | Fast |
| Voyage AI voyage-3 | 1024 | 32000 | Excellent | $0.06/1M tokens | Fast |
| BGE-M3 (open-source) | 1024 | 8192 | Good | Free (self-hosted) | Variable |
| Nomic embed-text-v1.5 | 768 | 8192 | Good | Free (self-hosted) | Variable |
Choosing an Embedding Model
For most production RAG systems, text-embedding-3-small from OpenAI is the best starting point: good quality, low cost, and fast. Upgrade to text-embedding-3-large or Voyage AI if retrieval quality is critical and you have the budget.
If data privacy requires self-hosting, BGE-M3 and Nomic are strong open-source options.
```python
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectors = embeddings.embed_documents([chunk.page_content for chunk in chunks])
```
Stage 4: Vector Storage
Vector databases store your embeddings and enable fast similarity search across millions of vectors.
Vector Database Comparison
| Database | Type | Managed | Filtering | Max Vectors | Best For |
|----------|------|---------|-----------|-------------|----------|
| Pinecone | Cloud-native | Yes | Advanced | Billions | Production SaaS, enterprise |
| Weaviate | Open-source + cloud | Both | Advanced | Billions | Hybrid search, multi-modal |
| Qdrant | Open-source + cloud | Both | Advanced | Billions | Performance-critical apps |
| ChromaDB | Open-source | Self-hosted | Basic | Millions | Prototyping, small datasets |
| pgvector | Postgres extension | Any Postgres | SQL-based | Millions | Teams already using Postgres |
Indexing Your Data
```python
from langchain_pinecone import PineconeVectorStore

vector_store = PineconeVectorStore.from_documents(
    documents=chunks,
    embedding=embeddings,
    index_name="knowledge-base",
    namespace="support-docs",
)
```
Stage 5: Retrieval Strategies
Retrieval is the most impactful stage for answer quality. If you retrieve the wrong chunks, the LLM will generate a wrong (but confident-sounding) answer.
Retrieval Methods
Semantic Search (Dense Retrieval)
The standard approach. Convert the query to a vector and find the nearest vectors in your database.
```python
results = vector_store.similarity_search(query, k=5)
```
Strengths: Understands meaning, handles paraphrasing. Weaknesses: Misses exact keyword matches, can retrieve semantically similar but factually irrelevant chunks.
Keyword Search (Sparse Retrieval)
Traditional BM25-style search. Matches exact terms and their frequencies.
Strengths: Great for exact terms, product names, codes, acronyms. Weaknesses: No semantic understanding, misses synonyms.
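For intuition, here is a simplified, self-contained BM25-style scorer. Real implementations (such as the one behind LangChain's `BM25Retriever`) add proper tokenization and tuning; the corpus and query below are illustrative:

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_tokens, corpus, k1=1.5, b=0.75):
    """Simplified BM25: rewards exact term matches, normalized by doc length."""
    avg_len = sum(len(d) for d in corpus) / len(corpus)
    tf = Counter(doc_tokens)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)  # how many docs contain the term
        if df == 0:
            continue
        idf = math.log((len(corpus) - df + 0.5) / (df + 0.5) + 1)
        freq = tf[term]
        # Term frequency saturates (k1) and long docs are penalized (b)
        score += idf * (freq * (k1 + 1)) / (
            freq + k1 * (1 - b + b * len(doc_tokens) / avg_len)
        )
    return score

corpus = [
    "reset your password from the account settings page".split(),
    "our billing cycle runs monthly".split(),
]
query = "password reset".split()
scores = [bm25_score(query, doc, corpus) for doc in corpus]
print(scores[0] > scores[1])  # True: the exact terms only appear in the first doc
```

Note the flip side: a query phrased as "forgot my login credentials" would score zero against both documents here, which is why keyword search alone is not enough.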
Hybrid Search
Combines semantic and keyword search. This is the recommended approach for production RAG.
```python
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

bm25_retriever = BM25Retriever.from_documents(chunks, k=5)
semantic_retriever = vector_store.as_retriever(search_kwargs={"k": 5})

hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, semantic_retriever],
    weights=[0.3, 0.7],
)
```
Re-Ranking
Initial retrieval casts a wide net. Re-ranking uses a more powerful (but slower) model to score and reorder the results by actual relevance to the query.
```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

reranker = CohereRerank(model="rerank-v3.5", top_n=3)
retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=hybrid_retriever,
)

final_docs = retriever.invoke("How do I reset my password?")
```
Re-ranking typically improves answer accuracy by 10–25% at the cost of 100–300ms additional latency. For production systems, this trade-off is almost always worth it.
Query Transformation
Sometimes the user's query isn't optimal for retrieval. Query transformation techniques rewrite the query before searching.
| Technique | How It Works | When to Use |
|-----------|--------------|-------------|
| HyDE | Generate a hypothetical answer, use it as the search query | Vague or short queries |
| Multi-query | Generate 3–5 query variations, retrieve for each, combine results | Complex questions |
| Step-back | Abstract the query to a broader topic first | Specific questions needing general context |
| Decomposition | Break complex question into sub-questions | Multi-part questions |
```python
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_openai import ChatOpenAI

multi_retriever = MultiQueryRetriever.from_llm(
    retriever=vector_store.as_retriever(),
    llm=ChatOpenAI(model="gpt-4o-mini"),
)
```
Stage 6: Generation with Context
Once you have the relevant chunks, construct a prompt that gives the LLM the context it needs to answer accurately.
```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_messages([
    ("system", """Answer the user's question based ONLY on the provided context.
If the context doesn't contain enough information to answer, say so.
Always cite the source document for your claims.

Context:
{context}"""),
    ("human", "{question}"),
])

def format_docs(docs):
    # Prefix each chunk with its source title so the model can cite it
    return "\n\n---\n\n".join(
        f"Source: {doc.metadata.get('title', 'Unknown')}\n{doc.page_content}"
        for doc in docs
    )

chain = (
    {"context": retriever | format_docs, "question": lambda x: x}
    | prompt
    | ChatOpenAI(model="gpt-4o", temperature=0)
)

answer = chain.invoke("How do I reset my password?")
```
Reducing Hallucination
The biggest risk in RAG is the LLM generating information that isn't in the retrieved context. Mitigation strategies:
- Explicit grounding instructions — tell the model to only use provided context
- Citation requirements — force the model to reference specific sources
- Lower temperature — use temperature 0 or 0.1 for factual tasks
- Smaller context windows — less noise means fewer opportunities to hallucinate
- Faithful generation checks — post-process to verify claims against sources
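The citation-requirement and faithfulness-check strategies combine into a cheap post-processing step: flag any cited source that does not match a retrieved document. A sketch, assuming the system prompt asks the model to cite as `[Source: Title]` (the citation format, regex, and example strings here are assumptions for illustration, not a standard):

```python
import re

def check_citations(answer: str, retrieved_titles: set[str]) -> list[str]:
    """Return cited sources in the answer that match no retrieved document.

    Assumes citations follow the [Source: Title] convention from the prompt.
    """
    cited = re.findall(r"\[Source:\s*([^\]]+)\]", answer)
    return [title.strip() for title in cited if title.strip() not in retrieved_titles]

retrieved = {"Password Reset Guide", "Account Security FAQ"}
answer = ("Go to Settings and click Reset [Source: Password Reset Guide]. "
          "Premium users get priority support [Source: Pricing Page].")

print(check_citations(answer, retrieved))  # ['Pricing Page'] — a likely hallucination
```

A flagged citation is a strong signal to suppress that sentence or regenerate the answer; a full faithfulness check would go further and verify each claim's content against the source text, usually with an LLM judge.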
Evaluating Your RAG System
Without evaluation, you're guessing about quality. RAG evaluation measures both retrieval quality and generation quality.
Key Metrics
| Metric | What It Measures | How to Compute | Target |
|--------|------------------|----------------|--------|
| Context Precision | Are retrieved chunks actually relevant? | Relevant chunks / Total chunks retrieved | > 0.8 |
| Context Recall | Did we retrieve all relevant information? | Relevant info found / Total relevant info | > 0.7 |
| Faithfulness | Is the answer grounded in context? | Claims supported by context / Total claims | > 0.9 |
| Answer Relevance | Does the answer address the question? | LLM-as-judge scoring | > 0.8 |
| Answer Correctness | Is the answer factually correct? | Comparison against ground truth | > 0.85 |
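The retrieval metrics are simple ratios once you have relevance labels. A sketch with hypothetical chunk IDs (frameworks like Ragas, shown below, estimate these labels with an LLM judge instead of requiring hand annotation):

```python
def context_precision(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Fraction of retrieved chunks that are actually relevant."""
    if not retrieved_ids:
        return 0.0
    return sum(1 for c in retrieved_ids if c in relevant_ids) / len(retrieved_ids)

def context_recall(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Fraction of all relevant chunks that were retrieved."""
    if not relevant_ids:
        return 1.0
    return sum(1 for c in relevant_ids if c in retrieved_ids) / len(relevant_ids)

retrieved = ["chunk_1", "chunk_4", "chunk_9", "chunk_2"]
relevant = {"chunk_1", "chunk_2", "chunk_7"}

print(context_precision(retrieved, relevant))  # 0.5 (2 of 4 retrieved are relevant)
print(round(context_recall(retrieved, relevant), 2))  # 0.67 (2 of 3 relevant found)
```

Low precision means noisy context (tune re-ranking); low recall means missing information (tune chunking or raise k).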
Evaluation Frameworks
```python
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

results = evaluate(
    dataset=eval_dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    llm=ChatOpenAI(model="gpt-4o"),
    embeddings=OpenAIEmbeddings(),
)
print(results)
```
Build your evaluation dataset incrementally. Start with 50–100 question-answer pairs that cover your main use cases. Add edge cases as you discover them in production.
Naive RAG vs Advanced RAG
Most tutorials teach naive RAG. Production systems need advanced RAG. Here's what changes.
| Aspect | Naive RAG | Advanced RAG |
|--------|-----------|--------------|
| Chunking | Fixed-size, no overlap | Semantic or structure-aware, with overlap |
| Retrieval | Single semantic search | Hybrid search + re-ranking |
| Query handling | Pass query directly | Query transformation (multi-query, HyDE) |
| Context | Dump all chunks into prompt | Filtered, compressed, deduplicated context |
| Evaluation | None or manual | Automated evaluation pipeline |
| Indexing | One-time batch | Incremental updates, stale data removal |
| Metadata | None | Rich metadata for filtering and attribution |
The jump from naive to advanced RAG typically improves answer accuracy from 60–70% to 85–95%, depending on the domain and data quality.
Production Considerations
Scaling the Indexing Pipeline
For large document collections (100K+ documents), batch your indexing operations and run them asynchronously.
```python
import asyncio
from langchain_core.documents import Document

async def index_batch(documents: list[Document], batch_size: int = 100):
    for i in range(0, len(documents), batch_size):
        batch = documents[i:i + batch_size]
        await vector_store.aadd_documents(batch)
```
Keeping Data Fresh
Stale data is a silent killer in RAG systems. Implement an update strategy:
- Webhooks — trigger re-indexing when source documents change
- Scheduled crawls — re-index on a schedule (hourly, daily) for web sources
- Version tracking — store document hashes and only re-index changed content
- TTL (Time-to-Live) — expire old chunks automatically
Cost Optimization
RAG costs come from three places: embedding generation, vector storage, and LLM inference.
| Cost Component | Optimization Strategy |
|----------------|-----------------------|
| Embedding API calls | Batch embeddings, cache common queries, use cheaper models for non-critical data |
| Vector DB storage | Reduce dimensions (MRL), archive unused namespaces, compress vectors |
| LLM inference | Cache common answers, use cheaper models for simple questions, reduce context size |
| Re-ranking | Only re-rank when initial retrieval confidence is low |
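A back-of-envelope model makes these trade-offs concrete. All prices, volumes, and token counts below are illustrative assumptions, not current rates:

```python
# Assumed prices for illustration only; substitute your providers' real rates.
EMBED_PRICE_PER_1M_TOKENS = 0.02  # a small embedding model
LLM_PRICE_PER_1M_TOKENS = 2.50    # blended input cost for generation

def embedding_cost(total_doc_tokens: int) -> float:
    """One-time indexing cost, amortized if documents rarely change."""
    return total_doc_tokens / 1_000_000 * EMBED_PRICE_PER_1M_TOKENS

def monthly_query_cost(queries: int, context_tokens: int) -> float:
    """Assume each query embeds ~50 tokens and sends retrieved context to the LLM."""
    embed = queries * 50 / 1_000_000 * EMBED_PRICE_PER_1M_TOKENS
    llm = queries * context_tokens / 1_000_000 * LLM_PRICE_PER_1M_TOKENS
    return embed + llm

# 10K documents x 2K tokens each; 50K queries/month with 4K tokens of context
print(round(embedding_cost(10_000 * 2_000), 2))     # 0.4
print(round(monthly_query_cost(50_000, 4_000), 2))  # 500.05
```

Note what dominates: under these assumptions, indexing is pennies while LLM inference on retrieved context is nearly the whole bill, which is why the context-size and answer-caching optimizations in the table pay off first.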
Use our RAG Cost Estimator to model costs for your specific document volume and query patterns before committing to an architecture.
Real-World RAG Use Cases
Customer Support
Ingest help center articles, product docs, and past ticket resolutions. The RAG system answers customer questions instantly, reducing ticket volume by 40–60%.
Legal Document Analysis
Index contracts, case law, and regulatory documents. Lawyers search across thousands of documents in seconds instead of hours.
Internal Knowledge Management
Connect Confluence, Slack, and Google Drive. Employees ask questions in natural language and get answers sourced from across the organization.
E-commerce Product Search
RAG-powered product search understands natural language queries ("waterproof jacket for hiking in cold weather") and retrieves relevant products with explanations of why they match.
Getting Started
Building a production RAG system requires expertise in data engineering, LLM orchestration, and infrastructure. The architecture decisions you make early—chunking strategy, embedding model, retrieval approach—determine the quality ceiling of your entire system.
If you need a RAG system that works reliably at scale, ZTABS specializes in RAG development. We build production pipelines using Pinecone, Weaviate, and other leading vector databases, tailored to your specific data and accuracy requirements.
Start with a focused use case, measure relentlessly, and iterate. That's how you build RAG systems that deliver real value.