LLM Evaluation

Created: 2026-02-20 10:00
#note

Large Language Models and AI agents produce probabilistic outputs that demand systematic evaluation beyond traditional unit testing. Unlike deterministic software, LLM outputs vary with temperature settings, model versions, and small changes to the input, so teams need measurement frameworks that capture quality across multiple dimensions. Effective evaluation lets teams detect regressions, compare model versions, and confirm that a system meets acceptable performance standards before deployment.

Evaluation Types

| Type | Description | Best For |
|---|---|---|
| Assertion-based | Hard checks on outputs (length, format, presence of keywords) | Deterministic requirements |
| LLM-as-judge | Another LLM grades agent outputs against rubrics | Fast iteration, nuanced quality |
| Human annotation | Domain experts label outputs on quality scales | Ground truth, critical domains |
| Reference-based | Compare against gold-standard outputs (BLEU, ROUGE) | Text generation, summarization |
| RAG Triad | Context relevance, groundedness, answer relevance | Retrieval-augmented systems |
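
As an illustration of the assertion-based row, the sketch below runs three hard checks on a single output: a length limit, a format check (here, that the output parses as a JSON object), and keyword presence. The function name `check_output` and its parameters are hypothetical, chosen only for this example.

```python
import json


def check_output(output: str, max_chars: int, required_keywords: list[str]) -> dict[str, bool]:
    """Run hard, deterministic checks on one model output."""
    # Format check: the output must parse as a JSON object.
    try:
        is_json_object = isinstance(json.loads(output), dict)
    except json.JSONDecodeError:
        is_json_object = False

    lowered = output.lower()
    return {
        "within_length": len(output) <= max_chars,
        "valid_json_object": is_json_object,
        # Case-insensitive substring match for every required keyword.
        "has_keywords": all(k.lower() in lowered for k in required_keywords),
    }


result = check_output(
    '{"status": "refund approved"}',
    max_chars=200,
    required_keywords=["refund"],
)
passed = all(result.values())
```

Because each check is a plain boolean, a failing case reports exactly which requirement was violated, which is harder to get from a single judge score.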

LLM-as-Judge Pattern

The LLM-as-judge approach uses a capable model to evaluate another model's outputs against explicit rubrics. The judge model receives the input prompt, the generated output, and the evaluation criteria, and returns structured scores with reasoning. This pattern enables rapid iteration without expensive human review; evaluation runs automatically and at scale during testing phases.

However, LLM-as-judge carries systematic biases worth acknowledging. Position bias causes judges to favor outputs in a particular position, often the first one presented, in comparative evaluations. Verbosity bias leads judges to prefer longer, more detailed responses over concise correct answers. Self-preference bias (also called self-enhancement bias) causes judges to favor outputs that they generated themselves or that match their own writing style. Mitigation strategies include employing multiple independent judges (typically 3-5), randomizing output presentation order, and using explicit rubric anchors that define what excellent, good, acceptable, and poor performance actually entail.
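
Two of the mitigations above, multiple judges and randomized presentation order, can be sketched as follows. This is a minimal illustration, not a definitive implementation: the `Judge` signature, `panel_preference`, and both stub judges are hypothetical stand-ins for real model calls.

```python
import random
from collections import Counter
from typing import Callable

# Hypothetical judge interface: (prompt, first_output, second_output) -> "first" or "second".
Judge = Callable[[str, str, str], str]


def panel_preference(prompt: str, output_a: str, output_b: str,
                     judges: list[Judge], rng: random.Random) -> str:
    """Return "A" or "B" by majority vote, shuffling presentation order per judge."""
    if len(judges) % 2 == 0:
        raise ValueError("use an odd number of judges to avoid ties")
    outputs = {"A": output_a, "B": output_b}
    votes = []
    for judge in judges:
        order = ["A", "B"]
        if rng.random() < 0.5:
            order.reverse()  # randomize which output the judge sees first
        verdict = judge(prompt, outputs[order[0]], outputs[order[1]])
        # Map the positional verdict back to the underlying output label.
        votes.append(order[0] if verdict == "first" else order[1])
    return Counter(votes).most_common(1)[0][0]


# Stub judges standing in for real LLM calls (assumptions for this example).
def content_judge(prompt: str, first: str, second: str) -> str:
    return "first" if "Paris" in first else "second"


def position_biased_judge(prompt: str, first: str, second: str) -> str:
    return "first"  # always prefers whatever is shown first


winner = panel_preference(
    "What is the capital of France?", "Paris", "Lyon",
    [content_judge, content_judge, position_biased_judge],
    random.Random(0),
)
```

Here the position-biased stub is outvoted by the two content-based stubs whatever order each judge sees, so `winner` is `"A"`. Rubric anchors are not shown; they would live in the judge prompt rather than in this aggregation logic.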

Metrics for Agent Evaluation

Functional Metrics

  • task_complete: Boolean indicating whether the agent successfully completed the requested task
  • tool_call_success_rate: Percentage of tool invocations that executed without errors
  • steps_to_completion: Number of agent steps required to solve the task (fewer is better)
  • error_rate: Percentage of runs encountering failures or exceptions
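
The functional metrics above could be aggregated from per-run records along these lines. The `AgentRun` fields are an assumed logging schema for this sketch, not a format defined in this note.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class AgentRun:
    """One agent run; field names are hypothetical."""
    task_complete: bool
    tool_calls: int
    tool_call_failures: int
    steps: int
    raised_error: bool


def functional_metrics(runs: list[AgentRun]) -> dict[str, float]:
    if not runs:
        raise ValueError("need at least one run")
    total_calls = sum(r.tool_calls for r in runs)
    failed_calls = sum(r.tool_call_failures for r in runs)
    completed = [r for r in runs if r.task_complete]
    return {
        "task_completion_rate": len(completed) / len(runs),
        # Treat "no tool calls at all" as a perfect success rate.
        "tool_call_success_rate": (total_calls - failed_calls) / total_calls if total_calls else 1.0,
        # Only completed runs count: steps of a failed run say little about efficiency.
        "mean_steps_to_completion": mean(r.steps for r in completed) if completed else float("nan"),
        "error_rate": sum(r.raised_error for r in runs) / len(runs),
    }


runs = [
    AgentRun(True, 5, 0, 6, False),
    AgentRun(True, 3, 1, 4, False),
    AgentRun(False, 2, 2, 10, True),
    AgentRun(True, 0, 0, 2, False),
]
metrics = functional_metrics(runs)
```

Averaging steps over completed runs only is a design choice made for this sketch; including failed runs would penalize agents that give up early less than agents that loop.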

Quality Metrics

  • output_quality: LLM-judge score on overall response quality (typically 1-10 scale)
  • instruction_following: Degree to which outputs adhere to specified constraints and requirements
  • hallucination_rate: Percentage of factual claims that cannot be verified against source materials
  • relevance: Proportion of output content directly addressing the user's query
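
Two of these quality metrics reduce to simple arithmetic once the per-claim or per-judge labels exist. The sketch below assumes the labels were produced upstream (by a judge model or a human annotator); both function names are hypothetical.

```python
from statistics import mean


def hallucination_rate(claim_supported: list[bool]) -> float:
    """Fraction of factual claims that could NOT be verified against the sources."""
    if not claim_supported:
        return 0.0  # no factual claims, so nothing to hallucinate
    return claim_supported.count(False) / len(claim_supported)


def output_quality(judge_scores: list[int], low: int = 1, high: int = 10) -> float:
    """Mean judge score on the 1-10 scale, rejecting out-of-range scores."""
    if not judge_scores:
        raise ValueError("need at least one judge score")
    if any(s < low or s > high for s in judge_scores):
        raise ValueError(f"scores must lie in [{low}, {high}]")
    return mean(judge_scores)


rate = hallucination_rate([True, True, False, True])  # one unsupported claim out of four
quality = output_quality([7, 8, 9])
```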

Efficiency Metrics

  • latency_p50, latency_p95, latency_p99: 50th, 95th, and 99th percentile response times in milliseconds
  • token_usage: Total input and output tokens consumed per task
  • cost_per_task: Actual monetary cost including API calls and infrastructure
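
A minimal sketch of the efficiency metrics, assuming nearest-rank percentiles and a per-1,000-token pricing model. The prices and the `infra_usd` parameter are made-up values for illustration, not real vendor rates.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def cost_per_task(input_tokens: int, output_tokens: int,
                  usd_per_1k_input: float, usd_per_1k_output: float,
                  infra_usd: float = 0.0) -> float:
    """API token cost plus an optional per-task infrastructure share."""
    return (input_tokens / 1000 * usd_per_1k_input
            + output_tokens / 1000 * usd_per_1k_output
            + infra_usd)


latencies_ms = [float(ms) for ms in range(100, 2100, 100)]  # 20 samples: 100 .. 2000
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)

# Hypothetical prices: 0.5 USD per 1k input tokens, 1.5 USD per 1k output tokens.
cost = cost_per_task(2000, 500, usd_per_1k_input=0.5, usd_per_1k_output=1.5)
```

Nearest-rank always returns an observed latency rather than an interpolated one, which keeps tail values such as p99 easy to trace back to a concrete request.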

Evaluation Dataset

A robust evaluation dataset forms the foundation for reliable quality measurement. Strong datasets exhibit diverse inputs spanning different difficulty levels, edge cases, and user personas. Each example requires expected outputs defined by domain experts or derived from human demonstrations. Metadata annotations capture task category, expected difficulty, relevant knowledge domains, and evaluation strategy.

Dataset bootstrapping follows a staged approach: begin with seed examples (10-50 high-quality hand-written cases), apply systematic augmentation through paraphrasing and parameter variation, generate synthetic examples using templates or models, and continuously capture production failures to expand coverage. This progression balances human effort against dataset scale.
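
The sketch below shows one possible shape for a dataset record carrying the metadata described above, plus the parameter-variation step of bootstrapping. `EvalExample`, `expand_template`, and the order-ID task are hypothetical and exist only for this example.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class EvalExample:
    input: str
    expected_output: str
    category: str
    difficulty: str
    domains: tuple[str, ...]
    eval_strategy: str
    source: str = "seed"  # seed | augmented | synthetic | production_failure


def expand_template(template: str, expected_template: str,
                    params: dict[str, list[str]], **metadata) -> list[EvalExample]:
    """Create one augmented example per combination of parameter values."""
    keys = list(params)
    examples = []
    for combo in product(*(params[k] for k in keys)):
        values = dict(zip(keys, combo))
        examples.append(EvalExample(
            input=template.format(**values),
            expected_output=expected_template.format(**values),
            source="augmented",
            **metadata,
        ))
    return examples


augmented = expand_template(
    "Extract the order ID from: 'Order {order_id} shipped via {carrier}.'",
    "{order_id}",
    {"order_id": ["A-1001", "B-2002"], "carrier": ["UPS", "DHL", "FedEx"]},
    category="extraction",
    difficulty="easy",
    domains=("logistics",),
    eval_strategy="assertion",
)
```

Two order IDs and three carriers yield six examples from one hand-written template. Tracking `source` on every record makes it possible to report metrics separately for seed, augmented, and production-derived cases.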

CI/CD Integration

Evaluation frameworks integrate into continuous integration pipelines to prevent regressions. Each pull request triggers evaluation runs against the standard dataset, generating quality reports that block merges if metrics fall below established quality gates. Common gates specify minimum thresholds for task completion rate (>95%), hallucination rate (<5%), and p95 latency (<2000ms). This automated feedback loop accelerates iteration while maintaining production readiness.
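
A quality gate like the one described could be checked with a small script whose result becomes the CI step's exit status. The thresholds mirror the ones quoted above; the metric names and the `failed_gates` and `exit_code` helpers are assumptions for this sketch.

```python
# Gate definitions: metric name -> (comparison, threshold).
GATES = {
    "task_completion_rate": (">", 0.95),
    "hallucination_rate": ("<", 0.05),
    "latency_p95_ms": ("<", 2000),
}


def failed_gates(metrics: dict[str, float], gates: dict = GATES) -> list[str]:
    """Return a human-readable message for every gate that does not pass."""
    failures = []
    for name, (op, threshold) in gates.items():
        value = metrics.get(name)
        if value is None:
            # A missing metric fails the gate rather than silently passing.
            failures.append(f"{name}: missing from report")
            continue
        passed = value > threshold if op == ">" else value < threshold
        if not passed:
            failures.append(f"{name}={value} (required {op} {threshold})")
    return failures


def exit_code(metrics: dict[str, float]) -> int:
    """0 when every gate passes, 1 otherwise."""
    return 1 if failed_gates(metrics) else 0


report = {"task_completion_rate": 0.97, "hallucination_rate": 0.08, "latency_p95_ms": 1500}
failures = failed_gates(report)
```

In a pipeline, the final line of the script would typically be `sys.exit(exit_code(report))`, since most CI systems treat a non-zero exit status as a failed step and can be configured to block the merge on it.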

RAG Evaluation (RAG Triad)

Retrieval-augmented generation systems require specialized evaluation beyond standard LLM metrics. The RAG Triad framework measures three complementary dimensions:

  • Context relevance: Assesses whether retrieved documents contain information needed to answer the query
  • Groundedness: Measures whether generated responses derive from retrieved context rather than model hallucinations
  • Answer relevance: Evaluates whether the final response actually addresses the user's question

See RAG for architectural details on retrieval pipelines.
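
One way to use the three triad scores together is to flag which dimension falls below a threshold, since each one points at a different part of the pipeline. The sketch assumes scores in [0, 1] produced upstream by a judge model; `TriadScores`, `weak_dimensions`, the 0.7 threshold, and the hint texts are illustrative assumptions, not part of any framework's API.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class TriadScores:
    context_relevance: float
    groundedness: float
    answer_relevance: float


# Rough pointers to where to look first; a heuristic, not a rule.
HINTS = {
    "context_relevance": "retriever returned documents that do not cover the query",
    "groundedness": "generator made claims not supported by the retrieved context",
    "answer_relevance": "response does not address the question that was asked",
}


def weak_dimensions(scores: TriadScores, threshold: float = 0.7) -> list[str]:
    """Names of triad dimensions scoring below the threshold, in declaration order."""
    values = asdict(scores)
    if any(not 0.0 <= v <= 1.0 for v in values.values()):
        raise ValueError("scores must lie in [0, 1]")
    return [name for name, value in values.items() if value < threshold]


weak = weak_dimensions(TriadScores(context_relevance=0.9, groundedness=0.4, answer_relevance=0.8))
hints = [HINTS[name] for name in weak]
```

Keeping the three scores separate, rather than averaging them into one number, preserves the information about whether retrieval or generation needs attention.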

Long-Context Evaluation

A growing area of concern is evaluating whether models can effectively use their full context window. Traditional retrieval-based benchmarks (e.g., needle-in-a-haystack) test only a narrow slice of long-context ability. More recent benchmarks like Oolong (Bertsch et al., 2025) test aggregation, requiring models to reason over information distributed across the entire context and combine atomic analyses into global answers. Even frontier models fail to exceed 50% accuracy at 128K tokens on these aggregation tasks, which shows that context window size alone does not indicate reasoning capability.

This has practical implications for how we provide context to agents: more context is not always better. See Context Constraints for AI Agents for the broader analysis of how context files, reasoning length, and task instructions interact.

Evaluation Frameworks

| Framework | Strengths | Use Case |
|---|---|---|
| RAGAS | Purpose-built for RAG systems, comprehensive metrics, open-source | RAG-specific evaluation |
| DeepEval | Simple assertion syntax, rich built-in metrics, LLM-as-judge support | General LLM evaluation |
| LangSmith Evals | Integrated with LangChain ecosystem, dataset management, tracing | Agent/chain evaluation |
| Braintrust | Real-time collaboration, detailed diffs, regression detection | Team-based evaluation |
| MLflow Evaluate | Model comparison, built-in metrics, experiment tracking | End-to-end MLOps |

References

Tags: #mlops #evaluation #testing #llm #genai

Links to: LLM Observability, Langfuse, RAG, Prompts as Infrastructure, AI Agents, AGENTS.md Evaluation