# RAG

Created: 2026-02-20 10:00
#note

Retrieval-Augmented Generation (RAG) is a pattern that augments LLM generation with context retrieved from an external knowledge base. By fetching relevant documents before generating, RAG can substantially reduce hallucinations, supply knowledge beyond the model's training cutoff, and produce grounded, verifiable answers. This makes it essential for applications that need up-to-date or domain-specific information.

## Core Flow

```mermaid
graph LR
    A["User Query"] --> B["Retriever"]
    B --> C["Retrieved Context"]
    C --> D["Generator/LLM"]
    D --> E["Grounded Answer"]
```

## Indexing Pipeline

```mermaid
graph LR
    A["Raw Documents"] --> B["Chunking"]
    B --> C["Embeddings"]
    C --> D["Vector Store"]
    D --> E["Index Ready"]
```
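The indexing steps above can be sketched in a few lines of Python. Everything here is a deliberate toy: `embed` (a normalized letter-frequency vector) stands in for a real embedding model, and the "vector store" is just an in-memory list of `(chunk, vector)` pairs.

```python
# Toy indexing pipeline: chunk -> embed -> store.

def chunk(text: str, size: int = 40) -> list[str]:
    """Fixed-size character chunking (real systems chunk by tokens)."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(text: str) -> list[float]:
    """Toy embedding: unit-norm letter-frequency vector (26 dims).
    A real pipeline would call an embedding model here."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    norm = sum(c * c for c in counts) ** 0.5 or 1.0
    return [c / norm for c in counts]

def build_index(docs: list[str]) -> list[tuple[str, list[float]]]:
    """'Vector store': a list of (chunk, vector) pairs."""
    return [(c, embed(c)) for doc in docs for c in chunk(doc)]

index = build_index(["RAG retrieves relevant context before generation."])
```

The same shape survives in production systems; only the collaborators change (token-aware chunkers, hosted embedding APIs, a real vector database).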

## Query Pipeline

```mermaid
graph LR
    A["User Query"] --> B["Embed Query"]
    B --> C["Vector Search"]
    C --> D["Rerank Candidates"]
    D --> E["Context + Query to LLM"]
    E --> F["Generate Response"]
```
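A minimal sketch of the retrieval half of this pipeline (embed query, vector search, take top-k), omitting the rerank and generation steps. The toy letter-frequency `embed` stands in for a real embedding model; vectors are unit-norm, so the dot product is the cosine similarity.

```python
# Toy query pipeline over a prebuilt (chunk, vector) index.

def embed(text: str) -> list[float]:
    """Toy embedding: unit-norm letter-frequency vector."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    norm = sum(c * c for c in counts) ** 0.5 or 1.0
    return [c / norm for c in counts]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))  # vectors are unit-norm

def search(query: str, index, k: int = 2) -> list[str]:
    """Embed the query, rank stored chunks by cosine similarity."""
    qv = embed(query)
    ranked = sorted(index, key=lambda item: cosine(qv, item[1]), reverse=True)
    return [c for c, _ in ranked[:k]]

index = [(c, embed(c)) for c in [
    "Vector search finds semantically similar chunks.",
    "Bananas are rich in potassium.",
]]
top = search("semantic similarity search", index, k=1)
```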

## Key Components

| Component | Purpose | Example |
| --- | --- | --- |
| Embeddings | Convert text to vector representations | OpenAI, Cohere, all-MiniLM |
| Vector DB | Store and retrieve embeddings at scale | Pinecone, Weaviate, Chroma |
| Chunker | Split documents into retrievable units | Token-based, semantic boundaries |
| Reranker | Rank retrieved docs by relevance | Cross-encoder models |
| LLM | Generate final answer from context | GPT-4, Claude, Llama |

## Chunking Strategies

| Strategy | How It Works | Best For |
| --- | --- | --- |
| Fixed-size | Split at a fixed token/word count (e.g., 512 tokens) | Simple, uniform documents |
| Recursive | Split by delimiters (headers, paragraphs) | Markdown, structured text |
| Semantic | Split at semantic boundaries | Long-form, coherent passages |
| Document-aware | Preserve document structure (tables, lists) | Complex layouts, mixed content |
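The first two strategies are easy to sketch; here word counts stand in for token counts, and the recursive splitter falls back from paragraphs to sentences when a paragraph is too long.

```python
# Two chunking sketches: fixed-size with overlap, and a simple
# recursive splitter (paragraphs first, then sentences).

def fixed_size_chunks(text: str, size: int = 8, overlap: int = 2) -> list[str]:
    """Windows of `size` words, each overlapping the previous by `overlap`."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

def recursive_chunks(text: str, max_words: int = 8) -> list[str]:
    """Split by paragraphs; re-split oversized paragraphs by sentences."""
    out = []
    for para in text.split("\n\n"):
        if len(para.split()) <= max_words:
            out.append(para.strip())
        else:
            out.extend(s.strip() + "." for s in para.split(".") if s.strip())
    return out
```

The overlap in `fixed_size_chunks` is the standard trick for keeping a sentence that straddles a chunk boundary retrievable from both sides.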

## Vector Databases

| Database | Deployment | Hosting | Scalability | Key Strength |
| --- | --- | --- | --- | --- |
| Pinecone | Managed | Cloud-only | Multi-region | Simple, production-ready |
| Weaviate | Self-hosted or cloud | Both | Excellent | GraphQL API, flexible |
| Chroma | Lightweight | Local/cloud | Small-medium | Embedded, developer-friendly |
| pgvector | Postgres extension | Self-hosted | Good | Native SQL integration |
| FAISS | Library | Local | Single-node | High-speed similarity search |
| Qdrant | Self-hosted or cloud | Both | Excellent | Fast, filtering-rich |

## Retrieval Strategies

- Dense retrieval: use embeddings for semantic similarity
- Sparse retrieval: BM25 and other keyword-based matching
- Hybrid: combine dense and sparse for better coverage
- Multi-query: generate multiple query reformulations and retrieve for each
- HyDE (Hypothetical Document Embeddings): generate a likely answer, embed it, and retrieve documents similar to it
- Parent-child: retrieve small child chunks, then return their parent sections so the LLM sees fuller context
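Hybrid retrieval needs a way to merge the dense and sparse rankings; Reciprocal Rank Fusion (RRF) is a common, training-free choice. A sketch, where the two input rankings are assumed to come from a vector search and BM25 respectively:

```python
# Hybrid retrieval via Reciprocal Rank Fusion (RRF):
# score(doc) = sum over rankings of 1 / (k + rank_in_that_ranking).

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several rankings of doc ids into one, best first."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["doc3", "doc1", "doc2"]   # e.g. from vector search
sparse = ["doc1", "doc4", "doc3"]  # e.g. from BM25
fused = rrf([dense, sparse])
```

`k = 60` is the conventional damping constant; it keeps a single top rank from dominating documents that appear in both rankings.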

## Evaluation

The RAG Triad measures RAG system quality:

- Context Relevance: is the retrieved context relevant to the query?
- Groundedness: is the answer supported by the context (not hallucinated)?
- Answer Relevance: does the answer address the user's query?

See LLM Evaluation for detailed evaluation frameworks and metrics.
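As an illustration only, groundedness can be approximated with a crude lexical-overlap heuristic: the fraction of the answer's content words that appear in the retrieved context. Real evaluators use an LLM judge or an NLI model instead; this toy only conveys the shape of the metric.

```python
# Toy groundedness score: share of answer content words found in context.

def groundedness(answer: str, context: str) -> float:
    """1.0 = every content word of the answer appears in the context."""
    stop = {"the", "a", "an", "is", "are", "of", "in", "to", "and"}
    ctx_words = set(context.lower().split())
    words = [w for w in answer.lower().split() if w not in stop]
    if not words:
        return 0.0
    return sum(w in ctx_words for w in words) / len(words)

ctx = "rag retrieves relevant documents before generation"
```

A fully grounded answer scores 1.0; an answer that introduces words absent from the context scores lower, hinting at possible hallucination.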

## Common Failure Modes

| Issue | Cause | Fix |
| --- | --- | --- |
| Missing relevant documents | Poor chunking, query-document mismatch | Refine chunking, use multi-query retrieval |
| Irrelevant context retrieved | Weak embeddings, semantic gap | Use domain-tuned embeddings, add reranking |
| Hallucinations despite context | Model ignores context | Few-shot prompts, stricter system instructions |
| High latency | Large retrieval set, slow reranking | Limit retrieved docs, optimize vector DB |
| Outdated knowledge | Stale index | Incremental indexing, scheduled refresh |

## Advanced Patterns

Multi-hop RAG: Chain multiple retrieval steps to answer complex questions requiring reasoning across documents.

Agentic RAG: Combine RAG with agents that decide when to retrieve, what queries to run, and how to synthesize results. See AI Agents, Agentic AI Frameworks.

Graph RAG: Represent documents as knowledge graphs; retrieve via graph traversal for structured reasoning.

Corrective RAG: Evaluate retrieved context quality and dynamically adjust retrieval strategy if confidence is low.
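The corrective pattern reduces to a small control loop. In this sketch, `retrieve`, `relevance`, and `rewrite` are hypothetical stand-ins for a vector search, a grader model, and a query rewriter; none of these names comes from a real library.

```python
# Corrective RAG sketch: retrieve, grade the context, and retry with a
# rewritten query when confidence is below a threshold.

def corrective_retrieve(query, retrieve, relevance, rewrite, threshold=0.5):
    """Return context only after the grader deems it relevant enough."""
    ctx = retrieve(query)
    if relevance(query, ctx) >= threshold:
        return ctx
    # Low confidence: rewrite the query and retry once.
    return retrieve(rewrite(query))

# Toy collaborators for demonstration:
docs = {"apollo 11 landing year": "Apollo 11 landed in 1969."}
retrieve = lambda q: docs.get(q, "")
relevance = lambda q, c: 1.0 if c else 0.0
rewrite = lambda q: "apollo 11 landing year"

ctx = corrective_retrieve("when did apollo land", retrieve, relevance, rewrite)
```

Production variants add more fallbacks after the rewrite step, such as widening the search or escalating to web retrieval.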

## References

  1. LangChain RAG docs
  2. LlamaIndex RAG guide
  3. Pinecone - What is RAG?

## Tags

#llm #rag #retrieval #embeddings #genai