Advanced RAG Techniques
Naive RAG is simple:
- Chunk documents
- Embed everything
- Retrieve top‑k
- Paste the chunks into the LLM prompt
- Hope it works
That version is mostly dead in production systems.
Modern RAG is retrieval engineering.
Below is a structured reference of advanced RAG techniques, grouped by layer, with concrete examples and tradeoffs.
Retrieval-Level Techniques
These techniques improve recall and precision before the LLM generates anything.
Chunking R&D
What it is
Designing how documents are split before embedding.
Example
Instead of:
- Fixed 1000-token chunks
Try:
- 400-token chunks with 100-token overlap
- Split by headings (`##`, `###`)
- Semantic chunking using sentence boundaries
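A minimal sketch of overlapping chunking. The token values and sizes here are toy stand-ins; a real pipeline would tokenize with the embedding model's own tokenizer rather than splitting on words.

```python
def chunk_tokens(tokens, size=400, overlap=100):
    """Split a token list into overlapping chunks.

    stride = size - overlap, so each chunk repeats the last
    `overlap` tokens of the previous one (boundary context).
    """
    stride = size - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks

# Toy "tokens": just labeled words.
doc = [f"tok{i}" for i in range(1000)]
chunks = chunk_tokens(doc, size=400, overlap=100)
# 3 chunks; chunk 1 starts with the last 100 tokens of chunk 0.
```

The overlap is what preserves sentences that straddle a chunk boundary; without it, a fact split across two chunks may embed poorly in both.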
Why it matters
Too small → context fragmentation
Too large → noisy embeddings
No overlap → boundary information loss
When to use
Always. Chunking is often the highest-ROI optimization in RAG.
Tradeoff
More chunks → higher storage + indexing cost.
Encoder (Embedding Model) R&D
What it is
Evaluating and selecting the best embedding model.
Example
Compare models using Recall@K on a test set:
- text-embedding-3-large
- bge-large
- e5-large
Measure:
- Recall@5
- MRR
- nDCG
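Recall@K and MRR are simple enough to implement directly. This sketch uses a tiny hypothetical test set; in practice you would run each candidate embedding model over a labeled set of (query, relevant-doc-ids) pairs and compare averages.

```python
def recall_at_k(retrieved, relevant, k=5):
    # Fraction of the relevant docs that appear in the top-k results.
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def mrr(retrieved, relevant):
    # Reciprocal rank of the first relevant result (0 if none found).
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1 / rank
    return 0.0

# Toy labeled set: query -> (retrieved ids in rank order, relevant ids)
test_set = {
    "q1": (["d3", "d1", "d9"], {"d1"}),
    "q2": (["d7", "d2", "d8"], {"d4"}),
}
avg_mrr = sum(mrr(r, rel) for r, rel in test_set.values()) / len(test_set)
```

Run the same test set against each candidate model's index and pick the model with the best averaged metrics, not the one with the best benchmark marketing.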
Why it matters
Embedding quality directly determines retrieval quality.
When to use
When retrieval feels “almost correct” but not reliable.
Tradeoff
Better models may be slower or more expensive.
Document Pre-processing
What it is
Cleaning or transforming documents before embedding.
Example
- Remove navigation bars from scraped HTML
- Convert tables into structured text
- Rewrite messy PDF content into normalized paragraphs
Example transformation:
Raw:
Header | Footer | Legal disclaimer
Processed:
Product specification: ...
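A minimal cleaning pass along these lines. The boilerplate patterns are illustrative; each corpus needs its own list, and heavier transformations (table linearization, PDF repair) usually need dedicated tooling.

```python
import re

# Hypothetical boilerplate markers for this corpus.
BOILERPLATE = re.compile(r"^(Header|Footer|Legal disclaimer|Cookie notice)\b", re.I)

def clean_document(raw: str) -> str:
    """Drop boilerplate lines and blank lines before embedding."""
    kept = []
    for line in raw.splitlines():
        line = line.strip()
        if not line or BOILERPLATE.match(line):
            continue
        kept.append(line)
    return "\n".join(kept)

raw = "Header | Footer | Legal disclaimer\nProduct specification: 500W power supply"
cleaned = clean_document(raw)
```

Every boilerplate line you drop raises the signal density of the chunks that get embedded.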
Why it matters
Embeddings capture signal. Garbage input reduces signal density.
When to use
Enterprise documents, PDFs, scraped content.
Tradeoff
Pre-processing pipelines increase system complexity.
Query Rewriting
What it is
Transforming user questions into retrieval-friendly queries.
Example
User:
Why is my service slow?
Rewrite:
Common causes of high latency in distributed microservices
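The rewrite itself is an LLM call, so the interesting part is the prompt. A sketch of the template; the wording and the client used to send it are deployment-specific assumptions.

```python
REWRITE_PROMPT = """Rewrite the user's question as a standalone, \
keyword-rich search query for a technical knowledge base. \
Return only the query.

Question: {question}
Query:"""

def build_rewrite_prompt(question: str) -> str:
    # Send the returned prompt to whatever chat/completion model
    # the stack uses; the rewritten query replaces the raw question
    # at retrieval time only (the original is still shown to the user).
    return REWRITE_PROMPT.format(question=question)

prompt = build_rewrite_prompt("Why is my service slow?")
```

Keeping the original question for generation while retrieving with the rewrite gives you the best of both.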
Why it matters
Embedding search works better with explicit semantic signals.
When to use
User questions are vague or conversational.
Tradeoff
Adds an extra LLM step.
Query Expansion
What it is
Generating multiple retrieval queries.
Example
User:
How do I scale my backend?
Expand into:
- Horizontal scaling strategies
- Vertical scaling tradeoffs
- Load balancing approaches
- Caching techniques
Retrieve for each and merge results.
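The merge step can be sketched as a round-robin interleave with deduplication by document id. This is one simple strategy; reciprocal-rank fusion is a common alternative.

```python
def merge_results(result_lists):
    """Merge ranked result lists from multiple expanded queries.

    Interleaves the lists rank by rank so each query's best hit
    surfaces early, skipping ids already seen.
    """
    merged, seen = [], set()
    for rank in range(max(len(r) for r in result_lists)):
        for results in result_lists:
            if rank < len(results) and results[rank] not in seen:
                seen.add(results[rank])
                merged.append(results[rank])
    return merged

# Results for three expanded queries, best-first.
merged = merge_results([["d1", "d2"], ["d2", "d3"], ["d4"]])
```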
Why it matters
Improves recall.
When to use
Complex or multi-faceted questions.
Tradeoff
Higher compute and retrieval cost.
Re-ranking
What it is
Improving precision after vector retrieval.
Pipeline
- Retrieve top 20 via vector similarity
- Use cross-encoder or LLM to score each
- Keep best 5
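A sketch of the retrieve-then-rerank step. The word-overlap scorer below is a toy stand-in so the example runs anywhere; a real pipeline would score each (query, chunk) pair with a cross-encoder model or an LLM call instead.

```python
def rerank(query, chunks, keep=5, score=None):
    """Re-rank retrieved chunks and keep the best `keep`.

    `score` stands in for a cross-encoder; pass a real model's
    scoring function in production.
    """
    if score is None:
        # Toy scorer: count of query words appearing in the chunk.
        q = set(query.lower().split())
        score = lambda c: len(q & set(c.lower().split()))
    return sorted(chunks, key=score, reverse=True)[:keep]

# Pretend these are the top-20 from vector search (trimmed to 3).
candidates = ["cooking recipes",
              "index latency in databases",
              "latency tuning"]
top = rerank("database index latency", candidates, keep=2)
```

The pattern stays the same regardless of the scorer: cast a wide net with cheap vector similarity, then spend expensive compute only on the shortlist.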
Why it matters
Vector similarity ≠ semantic relevance.
Re-ranking often yields major quality improvements.
When to use
When retrieval returns partially relevant chunks.
Tradeoff
Slower due to cross-encoder inference.
Architecture-Level Techniques
These techniques change how retrieval is structured.
Hierarchical RAG
What it is
Multi-level retrieval or summarization.
Example
Step 1: Retrieve relevant documents
Step 2: Retrieve relevant sections within them
Step 3: Summarize sections
Alternative:
- Pre-summarize large documents
- Embed summaries
- Retrieve summaries first
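The three steps above can be sketched as two chained searches. `overlap_search` is a toy word-overlap ranker so the example is self-contained; in practice both levels would hit a vector index or BM25.

```python
def hierarchical_retrieve(query, documents, search, top_docs=2, top_sections=3):
    """Two-level retrieval: rank documents first, then sections within them."""
    ranked_docs = search(query, [d["summary"] for d in documents])
    docs = [documents[i] for i in ranked_docs[:top_docs]]
    sections = [s for d in docs for s in d["sections"]]
    ranked = search(query, sections)
    return [sections[i] for i in ranked[:top_sections]]

def overlap_search(query, texts):
    # Toy ranker: indices of `texts` sorted by shared words with the query.
    q = set(query.lower().split())
    return sorted(range(len(texts)),
                  key=lambda i: len(q & set(texts[i].lower().split())),
                  reverse=True)

docs = [
    {"summary": "payments api", "sections": ["refund flow", "payments retries"]},
    {"summary": "frontend styling", "sections": ["css grid"]},
]
hit = hierarchical_retrieve("payments retries", docs, overlap_search,
                            top_docs=1, top_sections=1)
```

Only the sections of the winning documents are searched at the second level, which is what keeps the approach tractable on very large corpora.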
Why it matters
Scales better for very large corpora.
When to use
Large knowledge bases or long documents.
Tradeoff
More orchestration logic.
Graph RAG
What it is
Combining vector retrieval with structured relationships.
Example
Instead of only similarity search:
- Retrieve documents linked to the same entity
- Traverse knowledge graph edges
- Expand based on shared metadata
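A minimal sketch of the expansion step: take the ids from vector search as seeds, then walk graph edges to pull in related documents. The edge map is a toy assumption; in practice it comes from a graph database or metadata joins.

```python
def graph_expand(seed_ids, edges, hops=1):
    """Expand a vector-retrieved seed set along knowledge-graph edges.

    `edges` maps a doc id to the ids that share an entity or link
    with it. Each hop adds the unseen neighbors of the frontier.
    """
    frontier, seen = set(seed_ids), set(seed_ids)
    for _ in range(hops):
        frontier = {n for d in frontier for n in edges.get(d, ())} - seen
        seen |= frontier
    return seen

# d1 mentions the same entity as d2 and d3; d3 links to d4.
edges = {"d1": ["d2", "d3"], "d3": ["d4"]}
expanded = graph_expand({"d1"}, edges, hops=1)
```

Vector search found only `d1`; the graph hop surfaces `d2` and `d3`, which similarity alone would have missed.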
Why it matters
Captures relational structure beyond semantic similarity.
When to use
Enterprise knowledge graphs, legal corpora, technical documentation.
Tradeoff
Requires maintaining structured metadata or graph database.
Agentic RAG
What it is
Letting an agent decide how and where to retrieve.
Example
Agent workflow:
- Decide whether to query vector DB
- Query SQL database
- Call API
- Chain multiple retrieval steps
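A heavily simplified routing sketch. A real agent would let the LLM pick a tool via function calling; the keyword rules and tool names here are illustrative assumptions only.

```python
def route(question, tools):
    """Naive router: pick a retrieval source based on the question.

    Stand-in for LLM tool selection; real systems also chain
    multiple retrieval steps and merge the results.
    """
    q = question.lower()
    if any(w in q for w in ("revenue", "count", "average")):
        return tools["sql"](question)      # structured/aggregate queries
    if "weather" in q:
        return tools["api"](question)      # live external data
    return tools["vector_db"](question)    # default: semantic search

# Toy tool implementations.
tools = {
    "sql": lambda q: f"SQL result for: {q}",
    "api": lambda q: f"API result for: {q}",
    "vector_db": lambda q: f"Docs for: {q}",
}
answer = route("What was last quarter's revenue?", tools)
```

The routing decision is exactly where most of the added latency and debugging pain lives.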
Why it matters
Supports multi-source reasoning.
When to use
Complex enterprise workflows.
Tradeoff
Higher latency, higher complexity, harder to debug.
Generation-Level Improvements
These techniques improve answer quality after retrieval.
Prompt Engineering
What it is
Structuring context and instructions before generation.
Example
Instead of:
Here are some documents: {docs}
Use:
Today is {date}. Use only the verified documents below. If unsure, say you don’t know.
Include:
- Clear system instructions
- Context formatting
- Conversation history
- Source citation format
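A sketch assembling those pieces into one prompt. The exact wording and citation format are choices, not a standard; adapt them to your model and domain.

```python
def build_prompt(question, docs, date, history=()):
    """Assemble a grounded RAG prompt: instructions, numbered
    context, conversation history, then the question."""
    context = "\n\n".join(f"[{i + 1}] {d}" for i, d in enumerate(docs))
    turns = "\n".join(history)
    return (
        f"Today is {date}. Use only the verified documents below. "
        "Cite sources as [n]. If the answer is not in the documents, "
        "say you don't know.\n\n"
        f"Documents:\n{context}\n\n"
        f"Conversation so far:\n{turns}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_prompt("What is the SLA?",
                      ["SLA is 99.9% uptime."],
                      "2024-06-01")
```

Numbering the documents gives the model a concrete citation handle, which makes hallucinated sources easy to spot downstream.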
Why it matters
Reduces hallucination and improves answer reliability.
When to use
Always, but only after retrieval quality is stable.
Tradeoff
Prompt tweaks cannot fix broken retrieval.
Summary
Modern RAG is not about “adding GPT to a vector database.”
It is about:
- Retrieval evaluation
- Chunk design
- Embedding selection
- Query control
- Ranking precision
- Architectural scaling
If retrieval is weak, generation cannot save it.
RAG engineering is retrieval engineering.