LLM Observability
Created: 2026-02-20 10:00
#note
Observability in LLM-powered applications is critical for understanding system behavior at production scale. Unlike traditional software, LLM applications produce non-deterministic outputs, highly variable latencies, and complex chains of reasoning that span multiple models and external tools. Observability lets teams diagnose failures, optimize costs, and measure quality in real time across the entire application lifecycle. Without it, organizations remain blind to performance degradation, unexpected cost spikes, and silent quality regressions that users experience.
Three-Layer Observability Stack
LLM observability comprises three complementary layers that together provide end-to-end visibility into application behavior and economics.
graph TD
A["Agent Traces"] --> B["Evaluation Layer"]
A --> C["Cost Tracking"]
A --> D["Application Behavior"]
B --> E["Quality Scores"]
C --> F["Token Accounting"]
D --> G["Span Metrics"]
Agent Traces capture structured events (spans and their metrics) across the entire request lifecycle. Evaluation attaches quality judgments, meaning scores from language model judges or heuristic scorers, to execution traces. Cost Tracking aggregates token consumption and compute expenditure for budgeting and resource optimization.
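To make the Cost Tracking layer concrete, here is a minimal sketch of token-based cost aggregation. The model names and the per-million-token prices are hypothetical placeholders, not real provider pricing, and `Usage`, `cost_usd`, and `cost_by_model` are illustrative names.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical prices in USD per one million tokens (assumption for illustration only).
PRICES_PER_MILLION = {
    "example-small": {"input": 0.50, "output": 1.50},
    "example-large": {"input": 5.00, "output": 15.00},
}


@dataclass
class Usage:
    """Token counts recorded for one LLM call."""
    model: str
    input_tokens: int
    output_tokens: int


def cost_usd(usage: Usage) -> float:
    """Convert one call's token counts into a dollar amount."""
    price = PRICES_PER_MILLION[usage.model]
    total = usage.input_tokens * price["input"] + usage.output_tokens * price["output"]
    return total / 1_000_000


def cost_by_model(usages) -> dict:
    """Sum the cost of many calls, grouped by model name."""
    totals = defaultdict(float)
    for usage in usages:
        totals[usage.model] += cost_usd(usage)
    return dict(totals)
```

Grouping by model is only one possible choice; the same loop could group by user, feature, or trace ID if those fields were recorded alongside the token counts.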
Key Concepts
| Term | Definition |
|---|---|
| Trace | Complete record of a single user request from start to finish, containing all nested operations |
| Span | Individual operation within a trace, such as an LLM inference, database query, or tool invocation |
| Generation | Output produced by an LLM within a span, including tokens, logprobs, and reasoning |
| Score | Quantitative evaluation of generation quality, assigned by judges or heuristic evaluators |
| Dataset | Curated collection of examples used for reproducible evaluation and regression testing |
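The relationships between these terms can be sketched as a small data model. This is an illustrative assumption about how the concepts nest, not the schema of any particular platform; all class and field names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Generation:
    """LLM output attached to a span."""
    text: str
    input_tokens: int
    output_tokens: int


@dataclass
class Score:
    """Quality judgment attached to a trace."""
    name: str
    value: float
    source: str  # for example "judge" or "heuristic"


@dataclass
class Span:
    """One operation; may contain nested child operations."""
    name: str
    start_ms: float
    end_ms: float
    generation: Optional[Generation] = None
    children: List["Span"] = field(default_factory=list)

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


@dataclass
class Trace:
    """Full record of one request: a root span plus any scores."""
    trace_id: str
    root: Span
    scores: List[Score] = field(default_factory=list)

    def walk(self) -> Iterator[Span]:
        """Yield every span depth-first, parents before children."""
        stack = [self.root]
        while stack:
            span = stack.pop()
            yield span
            stack.extend(reversed(span.children))
```

A Dataset, in this sketch, would simply be a list of input and expected-output pairs that is replayed against the application to produce new traces for comparison.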
Tool Landscape
| Platform | Strength | Best For |
|---|---|---|
| Langfuse | Developer-friendly, open-source, strong trace UI | Early-stage teams, cost-conscious organizations |
| Braintrust | Evals-first design, model-based judges | Continuous evaluation, quality assurance |
| MLflow 3.0+ | Ecosystem integration, experiment tracking | ML teams already using MLflow infrastructure |
| Arize Phoenix | Large-scale analytics, drift detection | Production monitoring at enterprise scale |
| LangSmith | LangChain integration, prompt management | Teams heavily invested in LangChain ecosystem |
| W&B Weave | Experiment tracking, team collaboration | Organizations using Weights & Biases platform |
| OpenLLMetry | Open-source, OpenTelemetry-based instrumentation | Privacy-conscious teams, on-prem deployment |
OpenTelemetry for AI
OpenTelemetry is emerging as the de facto standard for distributed tracing across LLM applications. The OpenTelemetry semantic conventions for generative AI define standardized attribute names and structures that enable interoperability between observability tools. Key conventions include gen_ai.system (identifying the AI provider such as OpenAI or Anthropic; newer versions of the conventions rename this attribute to gen_ai.provider.name), gen_ai.request.model (the specific model identifier), gen_ai.usage.input_tokens (input token count), and gen_ai.usage.output_tokens (output, or completion, token count). These conventions are still evolving, so attribute names should be checked against the version a given tool supports. Where tools agree on them, organizations can migrate between observability platforms with little or no change to instrumentation, reducing vendor lock-in and enabling cost-aware tool selection.
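As a rough illustration, the four attributes named above can be assembled into a plain dictionary. The helper name `gen_ai_attributes` is hypothetical, the example values are placeholders, and the sketch deliberately avoids importing any OpenTelemetry package so that it stays self-contained.

```python
def gen_ai_attributes(system: str, model: str, input_tokens: int, output_tokens: int) -> dict:
    """Build span attributes keyed by the generative AI convention names quoted above."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    return {
        "gen_ai.system": system,
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }


# Placeholder values for illustration only.
attrs = gen_ai_attributes("openai", "example-model", input_tokens=120, output_tokens=45)
```

With the OpenTelemetry Python API, a dictionary like this could be passed to a span's `set_attributes` method; the exact attribute set to emit depends on the convention version the receiving backend expects.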
Instrumentation Patterns
Auto-instrumentation uses libraries and framework hooks to emit traces with little or no change to application code, typically a single initialization call at startup. Frameworks like LangChain and LlamaIndex expose callback or integration hooks that observability tools plug into, while patching libraries wrap SDK calls at runtime to capture spans. This approach minimizes code changes but may miss custom logic outside the instrumented libraries.
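The runtime-patching idea can be sketched with the standard library alone. `FakeLLMClient` is a hypothetical stand-in for a third-party SDK class, and `RECORDED_SPANS` stands in for an exporter; real instrumentation packages are more careful about async code, streaming, and context propagation.

```python
import functools
import time

RECORDED_SPANS = []  # stand-in for a real span exporter (assumption)


class FakeLLMClient:
    """Hypothetical SDK client; application code calls complete() as usual."""

    def complete(self, prompt: str) -> str:
        return prompt.upper()


def instrument(cls, method_name: str) -> None:
    """Replace cls.method_name with a wrapper that records one span per call."""
    original = getattr(cls, method_name)
    if getattr(original, "_is_instrumented", False):
        return  # already patched; avoid double wrapping

    @functools.wraps(original)
    def wrapper(self, *args, **kwargs):
        start = time.perf_counter()
        error = None
        try:
            return original(self, *args, **kwargs)
        except Exception as exc:
            error = repr(exc)
            raise
        finally:
            # Runs on success and on failure, so every call yields a span.
            RECORDED_SPANS.append({
                "name": f"{cls.__name__}.{method_name}",
                "duration_s": time.perf_counter() - start,
                "error": error,
            })

    wrapper._is_instrumented = True
    setattr(cls, method_name, wrapper)


# One call at startup; call sites elsewhere stay unchanged.
instrument(FakeLLMClient, "complete")
```

Because the patch is applied to the class, every existing and future instance is traced, which is also why logic that never passes through the patched method goes unobserved.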
Manual instrumentation requires developers to explicitly create spans and record attributes within application code. While more verbose, it provides precise control over what is captured and allows tracking of business logic, custom tool calls, and application-specific branching. Many production systems take a hybrid approach: auto-instrumentation for framework operations combined with manual spans for application-specific behavior.
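A manual span can be sketched as a context manager that the developer places around the code worth tracing. The `span` helper, the `FINISHED_SPANS` list, and the `answer` function are hypothetical; the module-level stack used for parent tracking is a simplification that is not safe across threads or async tasks.

```python
import time
from contextlib import contextmanager

FINISHED_SPANS = []  # stand-in for an exporter (assumption)
_active = []         # currently open spans, innermost last (not thread-safe)


@contextmanager
def span(name: str, **attributes):
    """Open a span, yield it for attribute updates, and record it on exit."""
    record = {
        "name": name,
        "parent": _active[-1]["name"] if _active else None,
        "attributes": dict(attributes),
        "status": "ok",
    }
    _active.append(record)
    start = time.perf_counter()
    try:
        yield record
    except Exception as exc:
        record["status"] = "error"
        record["attributes"]["error.type"] = type(exc).__name__
        raise
    finally:
        record["duration_s"] = time.perf_counter() - start
        _active.pop()
        FINISHED_SPANS.append(record)


def answer(question: str) -> str:
    """Application logic with explicit spans around a business decision."""
    with span("handle_request", question_length=len(question)):
        with span("choose_route") as current:
            route = "search" if "?" in question else "chat"
            current["attributes"]["route"] = route  # application-specific branching
        return route
```

The routing decision recorded here is exactly the kind of detail that auto-instrumentation would not see, since it happens outside any patched library.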
What to Track
| Metric | Purpose | Signal |
|---|---|---|
| Latency | Time from request to response | Model delays, infrastructure issues |
| Token Usage | Input and output token counts | Cost prediction, quota management |
| Error Rate | Proportion of failed requests | Reliability, service health |
| Quality Scores | Judge or heuristic assessments | Output correctness, user satisfaction |
| Tool Call Success | Agent tool invocation outcomes | Reasoning quality, planning failures |
| Retry Rate | Proportion of requests requiring retry | Transient failure patterns, system instability |
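Most of the metrics in this table reduce to simple aggregations over per-request records. The sketch below assumes a hypothetical record shape (`latency_ms`, `input_tokens`, `output_tokens`, `error`, `retries`, `tool_calls`) and uses a nearest-rank percentile; quality scores are left out because they come from the evaluation layer.

```python
import math


def percentile(values, pct: float):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(requests) -> dict:
    """Aggregate per-request records into the headline metrics."""
    n = len(requests)
    # Flatten tool call outcomes (True = success) across all requests.
    tool_calls = [ok for r in requests for ok in r["tool_calls"]]
    return {
        "p95_latency_ms": percentile([r["latency_ms"] for r in requests], 95),
        "total_input_tokens": sum(r["input_tokens"] for r in requests),
        "total_output_tokens": sum(r["output_tokens"] for r in requests),
        "error_rate": sum(1 for r in requests if r["error"]) / n,
        "retry_rate": sum(1 for r in requests if r["retries"] > 0) / n,
        "tool_call_success_rate": (
            sum(1 for ok in tool_calls if ok) / len(tool_calls) if tool_calls else None
        ),
    }


# Made-up sample records for illustration.
SAMPLE = [
    {"latency_ms": 100, "input_tokens": 50, "output_tokens": 10, "error": False, "retries": 0, "tool_calls": [True]},
    {"latency_ms": 200, "input_tokens": 60, "output_tokens": 20, "error": False, "retries": 1, "tool_calls": [True, False]},
    {"latency_ms": 300, "input_tokens": 70, "output_tokens": 30, "error": True, "retries": 0, "tool_calls": []},
    {"latency_ms": 400, "input_tokens": 80, "output_tokens": 40, "error": False, "retries": 2, "tool_calls": [True]},
]
summary = summarize(SAMPLE)
```

Percentiles are generally more informative than averages for latency, since a small number of slow model calls can dominate the user experience without moving the mean much.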
Evaluation Integration
Attaching evaluation scores to traces enables continuous quality assurance without a separate evaluation pipeline. After a trace completes, an evaluation job can asynchronously compute scores using language model judges, heuristics, or reference-based metrics. Linking scores back to the original trace creates a queryable dataset for debugging regressions, analyzing failure patterns, and comparing model versions. This pattern turns evaluation from an occasional batch process into a continuous feedback loop that surfaces quality issues shortly after they occur, enabling rapid iteration on prompts, retrieval strategies, and tool compositions. See LLM Evaluation for detailed scoring approaches.
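The score-linking pattern can be sketched with an in-memory trace store and a background worker pool. `TRACES`, `non_empty_scorer`, and the other names are hypothetical; a real system would read completed traces from its observability backend and would likely use an LLM judge in place of this trivial heuristic.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical store of completed traces, keyed by trace ID.
TRACES = {
    "t1": {"output": "Paris is the capital of France.", "scores": []},
    "t2": {"output": "   ", "scores": []},
}


def non_empty_scorer(output: str) -> float:
    """Heuristic scorer: 1.0 if the model produced any visible text, else 0.0."""
    return 1.0 if output.strip() else 0.0


def evaluate_trace(trace_id: str):
    """Score one trace and attach the result to it under the same trace ID."""
    trace = TRACES[trace_id]
    score = {"name": "non_empty", "value": non_empty_scorer(trace["output"]), "source": "heuristic"}
    trace["scores"].append(score)
    return trace_id, score


def evaluate_async(trace_ids) -> dict:
    """Run evaluations on worker threads, off the request path."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        return dict(pool.map(evaluate_trace, trace_ids))


def low_scoring(threshold: float = 0.5) -> list:
    """Query traces with at least one score below the threshold."""
    return sorted(
        trace_id
        for trace_id, trace in TRACES.items()
        if any(s["value"] < threshold for s in trace["scores"])
    )


results = evaluate_async(["t1", "t2"])
```

Because each score is stored on the trace it judges, a query such as `low_scoring()` leads straight back to the inputs, spans, and outputs needed to debug the failure.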