Langfuse

Created: 2026-02-20 10:00
#note

Langfuse is an open-source observability platform purpose-built for large language model applications. As LLM systems grow in complexity, the ability to trace execution paths, evaluate model outputs, manage prompts as versioned artifacts, and organize evaluation datasets becomes critical for production reliability. Langfuse addresses these needs through integrated tracing, evaluation, prompt management, and dataset tools that enable teams to monitor, debug, and optimize LLM applications at scale.

Key Features

| Feature | Description |
| --- | --- |
| Tracing | Capture complete execution traces with nested spans, latency, costs, and error tracking |
| Evaluation | LLM-as-judge scoring, human annotation workflows, and custom assertion validation |
| Prompt Management | Versioned prompt templates with A/B testing and runtime fetching |
| Datasets | Organized collections of test cases and golden examples for evaluation |
| Cost Tracking | Per-model and per-request cost attribution across traces |
| Sessions | Grouped traces for conversation-like interactions and multi-turn workflows |
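
The cost-tracking and session features can be pictured as simple aggregation over recorded generations. The sketch below is not Langfuse code: the `Generation` record, the `PRICES` table, and the model names are hypothetical, and the per-1,000-token prices are made up for illustration. It only shows how token counts roll up into per-model, per-trace, and per-session totals.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical prices in USD per 1,000 tokens. Real prices differ by
# provider and change over time; these numbers are illustrative only.
PRICES = {
    "model-a": {"input": 0.0010, "output": 0.0020},
    "model-b": {"input": 0.0100, "output": 0.0300},
}


@dataclass
class Generation:
    """One LLM call recorded inside a trace (hypothetical record shape)."""
    trace_id: str
    session_id: str
    model: str
    input_tokens: int
    output_tokens: int


def cost_usd(g: Generation) -> float:
    """Cost of a single generation from its token counts."""
    price = PRICES[g.model]
    return (g.input_tokens * price["input"] + g.output_tokens * price["output"]) / 1000


def total_by(generations: List[Generation], key: str) -> Dict[str, float]:
    """Sum costs grouped by an attribute such as 'model' or 'session_id'."""
    totals: Dict[str, float] = defaultdict(float)
    for g in generations:
        totals[getattr(g, key)] += cost_usd(g)
    return dict(totals)


generations = [
    Generation("t1", "s1", "model-a", 1000, 500),
    Generation("t1", "s1", "model-b", 2000, 1000),
    Generation("t2", "s1", "model-a", 3000, 0),
]

per_model = total_by(generations, "model")
per_trace = total_by(generations, "trace_id")
per_session = total_by(generations, "session_id")
```

Grouping by `session_id` is what turns two separate traces (`t1`, `t2`) into one conversation-level total.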

Tracing and Instrumentation

Langfuse captures detailed execution traces by instrumenting API calls and custom code. When an LLM request is made, Langfuse records a trace containing one or more spans: discrete units representing function calls, API requests, or agent decisions. The Python SDK offers a decorator pattern that creates spans around functions without boilerplate instrumentation code. Developers wrap functions with the @observe decorator, which records timing, error information, and input/output data; langfuse_context, in SDK versions that expose it, is a helper for updating the current trace or span from inside a decorated function, not a decorator itself. Langfuse also provides integrations for popular frameworks that auto-instrument calls, capturing spans for LLM completions, embedding operations, and vector database queries. Each span records its parent, so the spans of a trace form a tree that represents the logical flow of execution.
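
To make the decorator pattern and the span tree concrete, here is a minimal standard-library sketch. It is not the Langfuse SDK: `Span`, `traced`, and `TRACES` are hypothetical names, and a real SDK would export spans to a backend rather than keep them in a list. The point is only how a decorator plus a context variable yields parent-child spans automatically.

```python
import contextvars
import functools
import time
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional


@dataclass
class Span:
    """Hypothetical span record: one unit of work inside a trace."""
    name: str
    input: Any = None
    output: Any = None
    error: Optional[str] = None
    duration_s: float = 0.0
    children: List["Span"] = field(default_factory=list)


# The span that is currently open, so nested calls know their parent.
_current_span = contextvars.ContextVar("current_span", default=None)

# Root spans (one per top-level decorated call) are collected here.
TRACES: List[Span] = []


def traced(func: Callable) -> Callable:
    """Record each call to func as a span under the currently open span."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        span = Span(name=func.__name__, input={"args": args, "kwargs": kwargs})
        parent = _current_span.get()
        if parent is None:
            TRACES.append(span)           # no open span: this starts a trace
        else:
            parent.children.append(span)  # nested call: attach to the parent
        token = _current_span.set(span)
        start = time.perf_counter()
        try:
            span.output = func(*args, **kwargs)
            return span.output
        except Exception as exc:
            span.error = repr(exc)        # keep the error, then re-raise
            raise
        finally:
            span.duration_s = time.perf_counter() - start
            _current_span.reset(token)    # restore the previous open span

    return wrapper


@traced
def retrieve(query: str) -> List[str]:
    return [f"doc about {query}"]


@traced
def generate(query: str, docs: List[str]) -> str:
    return f"answer to {query!r} using {len(docs)} doc(s)"


@traced
def answer(query: str) -> str:
    return generate(query, retrieve(query))


result = answer("tracing")
root = TRACES[0]
```

After the call, `root` is the `answer` span and its `children` are the `retrieve` and `generate` spans, in call order.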

Evaluation and Scoring

Evaluation in Langfuse combines automated and human-driven assessment. The LLM-as-judge pattern uses another LLM to score outputs against defined rubrics, enabling rapid evaluation of large datasets. Human annotators can review traces through a collaborative interface, assigning scores or labels that feed into datasets for future fine-tuning. Custom assertions allow developers to attach rule-based checks, such as regex validation or output length constraints, directly to traces. Scores are attached to individual spans or entire traces, creating a queryable history of model performance over time. Combining these methods supports both rapid iteration during development and rigorous quality gates in production.
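
The rule-based checks mentioned above can be sketched as plain functions that map an output to a numeric score. This is an illustration, not the Langfuse API: `Score`, `regex_check`, `max_length_check`, and `evaluate` are hypothetical names, and the order-ID pattern is invented for the example.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Score:
    """Hypothetical score record attached to a trace or to one span."""
    trace_id: str
    name: str
    value: float                   # 1.0 = pass, 0.0 = fail for rule checks
    span_id: Optional[str] = None  # None means the score covers the whole trace


def regex_check(pattern: str) -> Callable[[str], float]:
    """Build a check that passes when the pattern occurs in the output."""
    compiled = re.compile(pattern)
    return lambda output: 1.0 if compiled.search(output) else 0.0


def max_length_check(limit: int) -> Callable[[str], float]:
    """Build a check that passes when the output has at most `limit` characters."""
    return lambda output: 1.0 if len(output) <= limit else 0.0


def evaluate(trace_id: str, output: str,
             checks: Dict[str, Callable[[str], float]]) -> List[Score]:
    """Run every named check against one output and return trace-level scores."""
    return [Score(trace_id, name, check(output)) for name, check in checks.items()]


checks = {
    "cites_order_id": regex_check(r"\bORD-\d{4}\b"),
    "under_80_chars": max_length_check(80),
}

scores = evaluate("trace-1", "Your order ORD-1234 has shipped.", checks)
by_name = {s.name: s.value for s in scores}
```

An LLM-as-judge scorer would fit the same shape: any callable that takes the output and returns a float can be added to `checks`.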

Prompt Management

Prompts in Langfuse are versioned, labeled, and fetched at runtime from a centralized registry. Developers define prompt templates with variables, save them to the registry, and retrieve them by name and an optional version number or label at application startup or request time. A/B testing builds on this: several prompt versions can be live at once under different labels, the application chooses between them, and each trace records which version was used. This decouples prompt changes from code deployment, allowing product teams to experiment with wording, instruction clarity, and few-shot examples without triggering new releases. Because version history is linked to trace metrics, it is straightforward to identify high-performing prompt variants.
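
The registry workflow described above can be modeled with a small in-memory stand-in. Nothing here is the Langfuse SDK: `PromptRegistry`, `PromptVersion`, the label names `prod-a` and `prod-b`, and the double-brace placeholder syntax are assumptions of this sketch. It shows versioned storage, lookup by version or label, variable substitution, and a random A/B split.

```python
import random
import re
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class PromptVersion:
    """Hypothetical stored prompt: a template plus its version and labels."""
    version: int
    template: str
    labels: List[str] = field(default_factory=list)

    def compile(self, **variables: str) -> str:
        # Fill each {{name}} placeholder; a missing variable raises KeyError.
        return re.sub(r"\{\{\s*(\w+)\s*\}\}",
                      lambda m: variables[m.group(1)], self.template)


class PromptRegistry:
    """Hypothetical in-memory registry keyed by prompt name."""

    def __init__(self) -> None:
        self._prompts: Dict[str, List[PromptVersion]] = {}

    def create(self, name: str, template: str,
               labels: Optional[List[str]] = None) -> PromptVersion:
        versions = self._prompts.setdefault(name, [])
        new_labels = list(labels or [])
        # A label points at one version at a time, so move it off older ones.
        for old in versions:
            old.labels = [lab for lab in old.labels if lab not in new_labels]
        created = PromptVersion(len(versions) + 1, template, new_labels)
        versions.append(created)
        return created

    def get(self, name: str, version: Optional[int] = None,
            label: str = "production") -> PromptVersion:
        # An explicit version wins; otherwise resolve by label.
        for candidate in self._prompts[name]:
            if version is not None and candidate.version == version:
                return candidate
            if version is None and label in candidate.labels:
                return candidate
        raise LookupError(f"no matching version for prompt {name!r}")


registry = PromptRegistry()
registry.create("greeting", "Hello {{name}}, how can I help?",
                labels=["production", "prod-a"])
registry.create("greeting", "Hi {{name}}! What do you need today?",
                labels=["prod-b"])


def pick_variant(rng: random.Random) -> PromptVersion:
    """A/B split: pick a label at random; log variant.version with the trace."""
    return registry.get("greeting", label=rng.choice(["prod-a", "prod-b"]))


variant = pick_variant(random.Random(0))
text = registry.get("greeting").compile(name="Ada")
```

Because the application resolves `production` at runtime, moving that label to a new version changes the prompt without a code release.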

Framework Integrations

Langfuse provides native integrations with LangChain, LlamaIndex, LangGraph, and other agentic frameworks, enabling automatic trace capture with little manual instrumentation. Direct integrations support OpenAI, Anthropic, Cohere, and other model providers. SDKs for Python and JavaScript/TypeScript offer low-level control for custom implementations; other languages can send data through the public API. These integrations reduce friction in adopting observability, making it practical to add Langfuse to existing applications with minimal code changes.

Deployment Options

| Deployment | Recommended For | Trade-offs |
| --- | --- | --- |
| Cloud (Langfuse) | Teams prioritizing ease of setup and zero ops overhead | Vendor lock-in; data residency may not suit regulated workloads |
| Docker (Self-hosted) | Small teams wanting data privacy; development and staging environments | Modest infrastructure overhead; single-node limitations |
| Kubernetes (Self-hosted Helm) | Enterprises with existing K8s infrastructure; large-scale deployments | Operational complexity; requires cluster management expertise |

Comparison with Alternatives

| Dimension | Langfuse | Braintrust | MLflow |
| --- | --- | --- | --- |
| Self-hosted | Yes (Docker, K8s) | Limited | Yes |
| Tracing | First-class; detailed spans | Yes; UI-focused | Yes; introduced in MLflow 2.x and expanded in 3.0 |
| Prompt Management | Core feature; versioning and A/B testing | Minimal | Prompt Registry (introduced in MLflow 2.x) |
| Datasets | Built-in; linked to evals | Yes | Secondary to experiments |
| ML Experiments | Not focused | Not focused | Primary focus |
| Ease of Setup | Cloud-first; SDKs available | Web UI emphasis | Requires MLflow server |

Langfuse excels at LLM-specific observability and prompt iteration, while MLflow caters to traditional ML experiments with growing LLM support. Braintrust emphasizes collaborative evaluation workflows.

Tags: #mlops #observability #langfuse #tracing #llm