RAG
Created: 2026-02-20 10:00
#note
Retrieval-Augmented Generation (RAG) is a pattern that augments LLM generation with retrieved context from an external knowledge base. By retrieving relevant documents before generation, RAG significantly reduces hallucinations, enables knowledge beyond the model's training cutoff, and provides grounded, verifiable answers. This makes it essential for applications requiring up-to-date or domain-specific information.
Core Flow
graph LR
A["User Query"] --> B["Retriever"]
B --> C["Retrieved Context"]
C --> D["Generator/LLM"]
D --> E["Grounded Answer"]
Indexing Pipeline
graph LR
A["Raw Documents"] --> B["Chunking"]
B --> C["Embeddings"]
C --> D["Vector Store"]
D --> E["Index Ready"]
Query Pipeline
graph LR
A["User Query"] --> B["Embed Query"]
B --> C["Vector Search"]
C --> D["Rerank Candidates"]
D --> E["Context + Query to LLM"]
E --> F["Generate Response"]
Key Components
| Component | Purpose | Example |
|---|---|---|
| Embeddings | Convert text to vector representations | OpenAI, Cohere, all-MiniLM |
| Vector DB | Store and retrieve embeddings at scale | Pinecone, Weaviate, Chroma |
| Chunker | Split documents into retrievable units | Token-based, semantic boundaries |
| Reranker | Rank retrieved docs by relevance | Cross-encoder models |
| LLM | Generate final answer from context | GPT-4, Claude, Llama |
Chunking Strategies
| Strategy | How It Works | Best For |
|---|---|---|
| Fixed-size | Split at token/word count (e.g., 512 tokens) | Simple, uniform documents |
| Recursive | Split by delimiters (headers, paragraphs) | Markdown, structured text |
| Semantic | Split at semantic boundaries | Long-form, coherent passages |
| Document-aware | Preserve document structure (tables, lists) | Complex layouts, mixed content |
Vector Databases
| Database | Deployment | Hosting | Scalability | Key Strength |
|---|---|---|---|---|
| Pinecone | Managed | Cloud-only | Multi-region | Simple, production-ready |
| Weaviate | Self-hosted or cloud | Both | Excellent | GraphQL API, flexible |
| Chroma | Lightweight | Local/cloud | Small-medium | Embedded, developer-friendly |
| pgvector | Self-hosted | PostgreSQL | Good | Integration with SQL |
| FAISS | Library | Local | CPU-optimized | High-speed similarity search |
| Qdrant | Self-hosted or cloud | Both | Excellent | Fast, filtering-rich |
Retrieval Strategies
- Dense retrieval: Use embeddings for semantic similarity
- Sparse retrieval: BM25, keyword-based matching
- Hybrid: Combine dense and sparse (better coverage)
- Multi-query: Generate multiple query reformulations, retrieve for each
- HyDE: Hypothetical Document Embeddings (generate likely answers, embed, retrieve similar)
- Parent-child: Retrieve child chunks, return parent context for larger context window
Evaluation
The RAG Triad measures RAG system quality:
- Context Relevance: Is retrieved context relevant to the query?
- Groundedness: Is the answer supported by the context (not hallucinated)?
- Answer Relevance: Does the answer address the user's query?
See LLM Evaluation for detailed evaluation frameworks and metrics.
Common Failure Modes
| Issue | Cause | Fix |
|---|---|---|
| Missing relevant documents | Poor chunking, query-document mismatch | Refine chunking, use multi-query retrieval |
| Irrelevant context retrieved | Weak embeddings, semantic gap | Use domain-tuned embeddings, reranking |
| Hallucinations despite context | Model ignores context | Few-shot prompts, stricter system instructions |
| Low latency | Large retrieval set, slow reranking | Limit retrieved docs, optimize vector DB |
| Outdated knowledge | Stale index | Implement incremental indexing, scheduled refresh |
Advanced Patterns
Multi-hop RAG: Chain multiple retrieval steps to answer complex questions requiring reasoning across documents.
Agentic RAG: Combine RAG with agents that decide when to retrieve, what queries to run, and how to synthesize results. See AI Agents, Agentic AI Frameworks.
Graph RAG: Represent documents as knowledge graphs; retrieve via graph traversal for structured reasoning.
Corrective RAG: Evaluate retrieved context quality and dynamically adjust retrieval strategy if confidence is low.
References
Tags
#llm #rag #retrieval #embeddings #genai