Langfuse
Created: 2026-02-20 10:00
#note
Langfuse is an open-source observability platform purpose-built for large language model applications. As LLM systems grow in complexity, the ability to trace execution paths, evaluate model outputs, manage prompts as versioned artifacts, and organize evaluation datasets becomes critical for production reliability. Langfuse addresses these needs through integrated tracing, evaluation, prompt management, and dataset tools that enable teams to monitor, debug, and optimize LLM applications at scale.
Key Features
| Feature | Description |
|---|---|
| Tracing | Capture complete execution traces with nested spans, latency, costs, and error tracking |
| Evaluation | LLM-as-judge scoring, human annotation workflows, and custom assertion validation |
| Prompt Management | Versioned prompt templates with A/B testing and runtime fetching |
| Datasets | Organized collections of test cases and golden examples for evaluation |
| Cost Tracking | Per-model and per-request cost attribution across traces |
| Sessions | Grouped traces for conversation-like interactions and multi-turn workflows |
Tracing and Instrumentation
Langfuse captures detailed execution traces by instrumenting API calls and custom code. When an LLM request is made, Langfuse records a trace containing multiple spans—discrete units representing function calls, API requests, or agent decisions. The platform employs a decorator pattern that allows developers to automatically create spans around functions without boilerplate instrumentation code. Developers can wrap functions with the @langfuse_context decorator, which extracts timing, error information, input/output data, and model metadata. Langfuse also provides SDKs for popular frameworks that auto-instrument calls, capturing spans for LLM completions, embedding operations, and vector database queries. Each span includes parent-child relationships, forming a tree structure that represents the logical flow of execution.
Evaluation and Scoring
Evaluation in Langfuse combines automated and human-driven assessment. The LLM-as-judge pattern uses another LLM to score outputs against defined rubrics, enabling rapid evaluation of large datasets. Human annotators can review traces through a collaborative interface, assigning scores or labels that feed into datasets for future fine-tuning. Custom assertions allow developers to attach rule-based checks—such as regex validation or output length constraints—directly to traces. Scores are attached to individual spans or entire traces, creating a queryable history of model performance over time. This multi-modal evaluation approach supports both rapid iteration during development and rigorous quality gates in production.
Prompt Management
Prompts in Langfuse are versioned, tagged, and fetched at runtime from a centralized registry. Developers define prompt templates with variables, commit them to the registry, and retrieve them by name and optional version tag at application startup or request time. A/B testing is native—multiple prompt versions can be deployed simultaneously, with trace tags indicating which version was used. This decouples prompt changes from code deployment, allowing product teams to experiment with wording, instruction clarity, and few-shot examples without triggering new releases. Version history and performance metrics are linked, making it straightforward to identify high-performing prompt variants.
Framework Integrations
Langfuse provides native integrations with LangChain, LlamaIndex, LangGraph, and other agentic frameworks, enabling automatic trace capture without instrumentation. Direct API integrations support OpenAI, Anthropic, Cohere, and other model providers. SDKs in Python, JavaScript, and Go offer low-level control for custom implementations. These integrations reduce friction in adopting observability, making it practical to add Langfuse to existing applications with minimal code changes.
Deployment Options
| Deployment | Recommended For | Trade-offs |
|---|---|---|
| Cloud (Langfuse) | Teams prioritizing ease of setup and zero ops overhead | Vendor lock-in; data residency may not suit regulated workloads |
| Docker (Self-hosted) | Small teams wanting data privacy; development and staging environments | Modest infrastructure overhead; single-node limitations |
| Kubernetes (Self-hosted Helm) | Enterprises with existing K8s infrastructure; large-scale deployments | Operational complexity; requires cluster management expertise |
Comparison with Alternatives
| Dimension | Langfuse | Braintrust | MLflow |
|---|---|---|---|
| Self-hosted | Yes (Docker, K8s) | Limited | Yes |
| Tracing | First-class; detailed spans | Yes; UI-focused | Added in MLflow 3.0 |
| Prompt Management | Core feature; versioning & A/B | Minimal | Prompt Registry (MLflow 3.0) |
| Datasets | Built-in; linked to evals | Yes | Secondary to experiments |
| ML Experiments | Not focused | Not focused | Primary focus |
| Ease of Setup | Cloud-first; SDKs available | Web UI emphasis | Requires MLflow server |
Langfuse excels at LLM-specific observability and prompt iteration, while MLflow caters to traditional ML experiments with growing LLM support. Braintrust emphasizes collaborative evaluation workflows.