Langfuse

Created: 2026-02-20 10:00
#note

Langfuse is an open-source observability platform purpose-built for large language model applications. As LLM systems grow in complexity, the ability to trace execution paths, evaluate model outputs, manage prompts as versioned artifacts, and organize evaluation datasets becomes critical for production reliability. Langfuse addresses these needs through integrated tracing, evaluation, prompt management, and dataset tools that enable teams to monitor, debug, and optimize LLM applications at scale.

Key Features

Feature	Description
Tracing	Capture complete execution traces with nested spans, latency, costs, and error tracking
Evaluation	LLM-as-judge scoring, human annotation workflows, and custom assertion validation
Prompt Management	Versioned prompt templates with A/B testing and runtime fetching
Datasets	Organized collections of test cases and golden examples for evaluation
Cost Tracking	Per-model and per-request cost attribution across traces
Sessions	Grouped traces for conversation-like interactions and multi-turn workflows
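
The cost and session rows above come down to simple bookkeeping: price each model call from its token counts, then sum by model or by session. The following is a minimal sketch of that arithmetic, not Langfuse's implementation. The model names, the prices in `PRICES`, and the `Generation` record are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical prices in USD per 1M tokens (illustrative, not real price data).
PRICES = {
    "model-small": {"input": 0.50, "output": 1.50},
    "model-large": {"input": 5.00, "output": 15.00},
}


@dataclass
class Generation:
    """One model call inside a trace (toy record, not a Langfuse class)."""
    session_id: str
    trace_id: str
    model: str
    input_tokens: int
    output_tokens: int


def cost_usd(gen: Generation) -> float:
    """Price a single call from its token counts."""
    price = PRICES[gen.model]
    total = gen.input_tokens * price["input"] + gen.output_tokens * price["output"]
    return total / 1_000_000


def totals(generations: List[Generation], key: str) -> Dict[str, float]:
    """Sum cost grouped by an attribute such as 'model' or 'session_id'."""
    grouped: Dict[str, float] = defaultdict(float)
    for gen in generations:
        grouped[getattr(gen, key)] += cost_usd(gen)
    return dict(grouped)


generations = [
    Generation("chat-1", "t1", "model-small", 1000, 200),
    Generation("chat-1", "t2", "model-large", 2000, 500),
    Generation("chat-2", "t3", "model-small", 4000, 1000),
]
by_model = totals(generations, "model")
by_session = totals(generations, "session_id")
```

With these made-up numbers, the two traces in session chat-1 together cost about 0.0183 USD, most of it from the single model-large call.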

Tracing and Instrumentation

Langfuse captures detailed execution traces by instrumenting API calls and custom code. When an LLM request is made, Langfuse records a trace containing one or more spans: discrete units representing function calls, API requests, or agent decisions. In the Python SDK, the @observe() decorator creates a span around a function without boilerplate instrumentation code, recording timing, errors, and input/output data. Inside a decorated function, additional details such as model metadata can be attached to the current span through the SDK's context helper (langfuse_context in SDK v2). Langfuse also provides integrations for popular frameworks that auto-instrument calls, capturing spans for LLM completions, embedding operations, and vector database queries. Each span records its parent, so the spans of a trace form a tree that mirrors the logical flow of execution.
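
To make the decorator pattern and the span tree concrete, here is a small stdlib-only sketch. It is a toy stand-in, not the Langfuse SDK: the `traced` decorator, the `Span` class, and the `finished_traces` list are hypothetical names invented for this example. It shows how a decorator can record timing, inputs, outputs, and errors, and how a context variable lets nested calls attach themselves to their parent span.

```python
import contextvars
import functools
import time
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional


@dataclass
class Span:
    """One unit of work in a trace (toy stand-in, not a Langfuse class)."""
    name: str
    input: Any = None
    output: Any = None
    error: Optional[str] = None
    duration_s: float = 0.0
    children: List["Span"] = field(default_factory=list)


# The span that is currently executing, so nested calls can find their parent.
_current_span = contextvars.ContextVar("current_span", default=None)
# Root spans (one per trace), kept in memory for inspection.
finished_traces: List[Span] = []


def traced(func: Callable) -> Callable:
    """Hypothetical decorator: wrap each call of `func` in a span."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        parent = _current_span.get()
        span = Span(name=func.__name__, input={"args": args, "kwargs": kwargs})
        if parent is not None:
            parent.children.append(span)  # build the parent-child tree
        token = _current_span.set(span)
        start = time.perf_counter()
        try:
            span.output = func(*args, **kwargs)
            return span.output
        except Exception as exc:
            span.error = repr(exc)  # record the failure, then re-raise
            raise
        finally:
            span.duration_s = time.perf_counter() - start
            _current_span.reset(token)
            if parent is None:
                finished_traces.append(span)  # a root span closes the trace

    return wrapper


@traced
def retrieve(query: str) -> List[str]:
    return [f"doc about {query}"]


@traced
def generate(query: str, docs: List[str]) -> str:
    return f"answer to {query!r} using {len(docs)} doc(s)"


@traced
def answer(query: str) -> str:
    return generate(query, retrieve(query))


def render(span: Span, depth: int = 0) -> List[str]:
    """Flatten the span tree into indented lines, one per span."""
    lines = ["  " * depth + span.name]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


result = answer("tracing")
tree_lines = render(finished_traces[0])
```

Calling `answer("tracing")` yields one trace whose root span `answer` has two children, `retrieve` and `generate`, in call order. A production tracer would additionally export spans to a backend and handle async code, which this sketch leaves out.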

Evaluation and Scoring

Evaluation in Langfuse combines automated and human-driven assessment. The LLM-as-judge pattern uses another LLM to score outputs against defined rubrics, enabling rapid evaluation of large datasets. Human annotators can review traces through a collaborative interface, assigning scores or labels that feed into datasets for future fine-tuning. Custom assertions allow developers to attach the results of rule-based checks, such as regex validation or output length constraints, directly to traces as scores. Scores are attached to individual spans or entire traces, creating a queryable history of model performance over time. Combining these three methods supports both rapid iteration during development and rigorous quality gates in production.
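
The rule-based checks mentioned above can be sketched in a few lines. This is an illustration under stated assumptions, not the Langfuse API: the `Score` record and the `max_length`, `matches`, and `evaluate` helpers are hypothetical. Each check maps an output string to a named score, which is the shape of data one would then attach to a trace or span.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Score:
    """A named result attached to an output (toy record, not a Langfuse class)."""
    name: str
    value: float  # 1.0 = pass, 0.0 = fail
    comment: str = ""


Check = Callable[[str], Score]


def max_length(limit: int) -> Check:
    """Build a check that passes when the output has at most `limit` characters."""
    def check(output: str) -> Score:
        ok = len(output) <= limit
        return Score("max_length", float(ok), f"{len(output)} chars, limit {limit}")
    return check


def matches(pattern: str, name: str) -> Check:
    """Build a check that passes when `pattern` occurs anywhere in the output."""
    compiled = re.compile(pattern)

    def check(output: str) -> Score:
        ok = compiled.search(output) is not None
        return Score(name, float(ok), f"pattern {pattern!r}")
    return check


def evaluate(output: str, checks: List[Check]) -> List[Score]:
    """Run every check against one output."""
    return [check(output) for check in checks]


checks = [max_length(80), matches(r"\b\d{4}-\d{2}-\d{2}\b", "contains_iso_date")]
scores = evaluate("Your order ships on 2026-03-01.", checks)
summary = {score.name: score.value for score in scores}
```

Here both checks pass for the sample output. Keeping the score numeric makes it easy to average pass rates across many traces later.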

Prompt Management

Prompts in Langfuse are versioned, labeled, and fetched at runtime from a centralized registry. Developers define prompt templates with variables, save them to the registry, and retrieve them by name, optionally pinned to a version number or a label such as production, at application startup or request time. A/B testing is supported: multiple prompt versions can be live at the same time under different labels, the application selects one per request, and the trace records which version was used. This decouples prompt changes from code deployment, allowing product teams to experiment with wording, instruction clarity, and few-shot examples without triggering new releases. Version history and performance metrics are linked, making it straightforward to identify high-performing prompt variants.
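
The registry workflow described above can be sketched with an in-memory toy. Everything here is an assumption made for illustration: `PromptRegistry`, `PromptVersion`, `pick_variant`, the label names, and the use of Python `str.format` placeholders are not the Langfuse SDK. The sketch shows versions accumulating under one prompt name, lookup by version or label, and a random A/B split whose choice is recorded as a trace tag.

```python
import random
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional, Set


@dataclass
class PromptVersion:
    name: str
    version: int
    template: str
    labels: Set[str] = field(default_factory=set)

    def compile(self, **variables: str) -> str:
        # str.format placeholders keep the sketch stdlib-only.
        return self.template.format(**variables)


class PromptRegistry:
    """In-memory toy registry; a real one would sit behind an API."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[PromptVersion]] = {}

    def create(self, name: str, template: str,
               labels: Iterable[str] = ()) -> PromptVersion:
        new_labels = set(labels)
        versions = self._versions.setdefault(name, [])
        for existing in versions:
            existing.labels -= new_labels  # a label points at one version only
        prompt = PromptVersion(name, len(versions) + 1, template, new_labels)
        versions.append(prompt)
        return prompt

    def get(self, name: str, version: Optional[int] = None,
            label: str = "production") -> PromptVersion:
        versions = self._versions[name]
        if version is not None:
            return versions[version - 1]  # versions are numbered from 1
        for candidate in versions:
            if label in candidate.labels:
                return candidate
        raise KeyError(f"no version of {name!r} carries label {label!r}")


def pick_variant(registry: PromptRegistry, name: str,
                 labels: List[str], rng: random.Random) -> PromptVersion:
    """Choose one labeled variant uniformly at random for this request."""
    return registry.get(name, label=rng.choice(labels))


registry = PromptRegistry()
registry.create("summarize", "Summarize {text} in one sentence.", labels=["prod-a"])
registry.create("summarize", "Give a one-sentence summary of: {text}", labels=["prod-b"])

rng = random.Random(0)
prompt = pick_variant(registry, "summarize", ["prod-a", "prod-b"], rng)
rendered = prompt.compile(text="the release notes")
# Record which variant served the request so results can be compared per version.
trace_tags = [f"prompt:{prompt.name}@v{prompt.version}"]
```

Because the application only asks for a label, moving that label to a new version changes behavior without a code release, which is the decoupling the paragraph above describes.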

Framework Integrations

Langfuse provides native integrations with LangChain, LlamaIndex, LangGraph, and other agentic frameworks, enabling automatic trace capture without manual instrumentation. Model provider integrations cover OpenAI, Anthropic, Cohere, and others. Official SDKs in Python and JavaScript/TypeScript offer low-level control for custom implementations; other languages, including Go, can report through the public API or OpenTelemetry. These integrations reduce friction in adopting observability, making it practical to add Langfuse to existing applications with minimal code changes.

Deployment Options

Deployment	Recommended For	Trade-offs
Cloud (Langfuse)	Teams prioritizing ease of setup and zero ops overhead	Vendor lock-in; data residency may not suit regulated workloads
Docker (Self-hosted)	Small teams wanting data privacy; development and staging environments	Modest infrastructure overhead; single-node limitations
Kubernetes (Self-hosted Helm)	Enterprises with existing K8s infrastructure; large-scale deployments	Operational complexity; requires cluster management expertise

Comparison with Alternatives

Dimension	Langfuse	Braintrust	MLflow
Self-hosted	Yes (Docker, K8s)	Limited	Yes
Tracing	First-class; detailed spans	Yes; UI-focused	Yes; introduced in MLflow 2.x, expanded in 3.0
Prompt Management	Core feature; versioning & A/B	Minimal	Prompt Registry (introduced in MLflow 2.x)
Datasets	Built-in; linked to evals	Yes	Secondary to experiments
ML Experiments	Not focused	Not focused	Primary focus
Ease of Setup	Cloud-first; SDKs available	Web UI emphasis	Requires MLflow server

Langfuse excels at LLM-specific observability and prompt iteration, while MLflow caters to traditional ML experiments with growing LLM support. Braintrust emphasizes collaborative evaluation workflows.