# LangSmith vs Langfuse vs Braintrust: LLM Tracing 2026
## TL;DR
- **Langfuse** for most teams — it's open-source (MIT), has the most generous free tier (1M trace spans/month), self-hosts cleanly, and doesn't lock you into LangChain.
- **LangSmith** if your entire stack is LangChain/LangGraph — the zero-config tracing integration is genuinely useful, but you pay for it in per-seat pricing and vendor lock-in.
- **Braintrust** for teams that need CI/CD evaluation pipelines — its deployment-blocking eval workflow is the most complete in the category, and the managed platform removes infrastructure toil.
## Key Takeaways
- Langfuse: 1M trace spans/month free, fully open-source (MIT), self-hostable on Docker Compose, OpenTelemetry native, no LangChain dependency
- LangSmith: 5K traces/month free, $39/seat/month paid, best-in-class for LangChain/LangGraph tracing, tight LangChain integration is also its ceiling
- Braintrust: 1M trace spans/month free, managed platform, CI/CD deployment blocking with automated evals — the most production-oriented evaluation story
- Self-hosting: Langfuse is the only true open-source option; LangSmith has limited self-hosting; Braintrust is managed-only
- Pricing model: Langfuse bills by usage (spans), LangSmith by seat, Braintrust by usage — Langfuse wins for high-trace-volume, low-team-size workloads
- OpenTelemetry: Langfuse has full OTel support; LangSmith has partial; Braintrust has partial
## Why LLM Tracing Matters in 2026
Debugging a hallucinating RAG pipeline or a broken agent chain without tracing is archaeology — you're examining artifacts after the fact with no execution context. LLM tracing captures the full call graph: which prompt was sent, which model responded, what tool calls fired, how long each step took, and what the token costs were.
In 2026, the category has matured from "log your prompts" to full-stack LLM observability: structured traces, evaluators that score outputs automatically, prompt versioning with A/B experiments, and CI/CD hooks that block deployments when eval scores regress.
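As a mental model, a trace is just a tree of timed spans that rolls up cost and latency. A minimal sketch (hypothetical field names, not any vendor's actual schema):

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    # One step in the pipeline: an LLM call, a retrieval, a tool call
    name: str
    input: str
    output: str = ""
    latency_ms: float = 0.0
    tokens: int = 0
    children: list["Span"] = field(default_factory=list)

    def total_tokens(self) -> int:
        # Token cost rolls up the whole call graph
        return self.tokens + sum(c.total_tokens() for c in self.children)

# A single RAG request becomes one parent span with child spans
trace = Span(name="rag-pipeline", input="What is RAG?")
trace.children.append(Span(name="retrieve", input="What is RAG?", tokens=0))
trace.children.append(Span(name="llm-call", input="prompt...", tokens=350))
print(trace.total_tokens())  # 350
```

All three platforms store something shaped like this; they differ in how you get the data in and what they let you do with it afterward.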
The three platforms that dominate this space each represent a different bet:
- LangSmith bets you're building with LangChain
- Langfuse bets you want open-source infrastructure you can own
- Braintrust bets you want the most complete managed evaluation platform
## Pricing Breakdown
| Tier | LangSmith | Langfuse | Braintrust |
|---|---|---|---|
| Free | 5K traces/mo, 1 seat | 1M spans/mo, unlimited users | 1M spans/mo, unlimited users |
| Paid base | $39/seat/month | $249/month | $249/month |
| Trace overage | $0.50 / 1K traces | Usage-based | Usage-based |
| Extended retention | $5.00 / 1K traces | Custom | Custom |
| Self-host | Limited (Plus+) | ✅ Free, OSS | ❌ |
| Enterprise | Custom | Custom | Custom |
The pricing model difference matters as much as the price:
LangSmith charges per seat — $39/user/month. A 5-person AI team costs $195/month before you count trace overages. A 20-person org hits $780/month just for seats. This model is expensive for growing teams.
Langfuse charges per usage — based on trace spans ingested on cloud, or free if you self-host. For teams with 10+ engineers all viewing traces, Langfuse Cloud's flat-rate Pro tier ($249/month, unlimited users) frequently beats LangSmith's per-seat math.
Braintrust charges per usage similarly, with the same $249/month Pro entry point and 1M spans on the free tier.
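The break-even point is easy to compute from the list prices above (baked in here as constants; actual pricing may change — check the vendors' pages):

```python
LANGSMITH_PER_SEAT = 39   # $/seat/month, per the table above
LANGFUSE_PRO_FLAT = 249   # $/month flat, unlimited users (cloud Pro tier)

def langsmith_monthly(seats: int) -> int:
    # Seat-based pricing, before any trace overages
    return seats * LANGSMITH_PER_SEAT

# Find the team size where the flat-rate plan becomes cheaper
break_even = next(n for n in range(1, 50) if langsmith_monthly(n) > LANGFUSE_PRO_FLAT)
print(break_even)  # 7 — at 7+ seats, per-seat pricing ($273/mo) exceeds the $249 flat rate
```

In other words, a team of six or fewer pays less per month on per-seat pricing; at seven seats and up, the flat-rate plans win even before trace volume enters the picture.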
## LangSmith: The LangChain-Native Option
LangSmith's killer feature is zero-config tracing for LangChain applications. Two environment variables and every chain, agent, and LLM call automatically appears in the trace UI:
```python
import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your-api-key"

# That's it. All LangChain calls are now traced automatically.
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

llm = ChatOpenAI(model="gpt-4o")
prompt = ChatPromptTemplate.from_template("Answer this question: {question}")
chain = prompt | llm

# This call appears in LangSmith with a full trace including:
# - Input/output at each step
# - Token counts and costs
# - Latency per step
result = chain.invoke({"question": "What is RAG?"})
```
For teams running LangGraph agents, LangSmith traces the full graph execution — node by node, edge by edge. No instrumentation code required beyond the environment variables.
### LangSmith for Non-LangChain Code
Outside the LangChain ecosystem, LangSmith becomes more work. You instrument manually using the @traceable decorator:
```python
from openai import OpenAI
from langsmith import traceable

openai_client = OpenAI()

@traceable(name="generate-answer")
def generate_answer(question: str, context: str) -> str:
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Context: {context}"},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

@traceable(name="rag-pipeline")
def rag_pipeline(query: str) -> str:
    # Creates a parent trace; nested @traceable calls become child spans
    docs = retrieve_docs(query)  # your retriever — not shown here
    context = "\n".join(d.page_content for d in docs)
    return generate_answer(query, context)
```
Manual instrumentation works, but it's boilerplate you maintain. Langfuse and Braintrust have similar patterns with slightly cleaner APIs.
## Langfuse: The Open-Source Choice
Langfuse is the only genuinely open-source option with a permissive license (MIT for all core features). The self-hosted version has no usage limits and no feature gates — the same codebase runs Langfuse Cloud and self-hosted deployments.
```yaml
# docker-compose.yml — minimal Langfuse self-hosted stack
# Note: newer Langfuse releases may also require S3-compatible blob storage
# and a worker container — check the self-hosting docs for your version.
version: "3"
services:
  langfuse-server:
    image: langfuse/langfuse:latest
    depends_on:
      - db
      - clickhouse
      - redis
    ports:
      - "3000:3000"
    environment:
      DATABASE_URL: postgresql://langfuse:password@db:5432/langfuse
      CLICKHOUSE_URL: http://clickhouse:8123
      REDIS_URL: redis://redis:6379
      NEXTAUTH_SECRET: your-secret-here
      NEXTAUTH_URL: http://localhost:3000
      SALT: your-salt-here
  db:
    image: postgres:15
    environment:
      POSTGRES_USER: langfuse
      POSTGRES_PASSWORD: password
      POSTGRES_DB: langfuse
    volumes:
      - postgres_data:/var/lib/postgresql/data
  clickhouse:
    image: clickhouse/clickhouse-server:24
    volumes:
      - clickhouse_data:/var/lib/clickhouse
  redis:
    image: redis:7
volumes:
  postgres_data:
  clickhouse_data:
```
Production self-hosting typically means Kubernetes (or managed services) for ClickHouse and Postgres high availability, but for development or small teams, Docker Compose is the full stack.
### Langfuse SDK Integration
```python
from openai import OpenAI
from langfuse import Langfuse
from langfuse.decorators import observe, langfuse_context

openai_client = OpenAI()
langfuse = Langfuse(
    public_key="pk-...",
    secret_key="sk-...",
    host="https://cloud.langfuse.com",  # or your self-hosted URL
)

@observe()
def retrieve_documents(query: str) -> list[str]:
    # Auto-traced as a span; vector_store is your retriever (not shown)
    return vector_store.similarity_search(query, k=3)

@observe(name="rag-pipeline")
def answer_question(question: str) -> str:
    docs = retrieve_documents(question)
    context = "\n".join(docs)
    # Attach custom metadata to the current span
    langfuse_context.update_current_observation(
        metadata={"num_docs": len(docs), "retrieval_method": "semantic"},
        input={"question": question},
    )
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Use this context: {context}"},
            {"role": "user", "content": question},
        ],
    )
    answer = response.choices[0].message.content
    langfuse_context.update_current_observation(output=answer)
    return answer
```
Langfuse supports OpenTelemetry natively — if your codebase uses OTel for general observability, you can route LLM traces to Langfuse alongside your existing infrastructure.
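If you go the OTel route, the exporter just needs an endpoint and Basic-auth headers built from your Langfuse key pair. A sketch of that wiring — the endpoint path and auth scheme here are assumptions based on Langfuse's OTLP intake; verify them against the Langfuse OpenTelemetry docs for your deployment:

```python
import base64

LANGFUSE_HOST = "https://cloud.langfuse.com"  # or your self-hosted URL
public_key, secret_key = "pk-...", "sk-..."

# OTLP exporters authenticate to Langfuse with HTTP Basic auth
# built from the public/secret key pair
token = base64.b64encode(f"{public_key}:{secret_key}".encode()).decode()
otlp_config = {
    "endpoint": f"{LANGFUSE_HOST}/api/public/otel/v1/traces",
    "headers": {"Authorization": f"Basic {token}"},
}
# Pass these to your OTLPSpanExporter (opentelemetry-exporter-otlp)
print(otlp_config["endpoint"])
```

The payoff is that LLM spans land next to your existing service traces, so a slow retrieval shows up in the same waterfall as the database query behind it.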
### Langfuse Evaluations
```python
# Score traces programmatically (e.g. from an LLM-as-a-judge eval)
from langfuse import Langfuse

langfuse = Langfuse()

# Fetch recent traces
traces = langfuse.fetch_traces(limit=100).data

for trace in traces:
    # Attach a score to each trace
    langfuse.score(
        trace_id=trace.id,
        name="relevance",
        value=0.85,  # 0-1 scale
        comment="Response addresses the question directly",
    )
```
## Braintrust: The Eval-First Platform
Braintrust's differentiation is building evaluation into the deployment pipeline — not as a reporting tool you check after the fact, but as a blocking gate in CI/CD:
```python
from braintrust import Eval

# Define your eval dataset
examples = [
    {"input": "What is RAG?", "expected": "Retrieval-Augmented Generation..."},
    {"input": "Explain embeddings", "expected": "Embeddings are numerical..."},
]

# The task receives each example's "input" value directly
def my_rag_pipeline(input):
    return rag_pipeline(input)  # the pipeline under evaluation (defined elsewhere)

# Scorers receive the input, the task's output, and the expected value
def semantic_similarity(input, output, expected):
    # Returns a 0-1 score; embed/cosine_similarity are your own helpers
    return cosine_similarity(embed(output), embed(expected))

# Run the eval and record an experiment — in CI, a regression against
# the baseline experiment can block the deploy
result = Eval(
    "my-rag-experiment",
    data=examples,
    task=my_rag_pipeline,
    scores=[semantic_similarity],
)
print(result.summary)
# Summarizes scores against the experiment baseline, e.g.:
#   score: 0.89 (vs baseline: 0.87, +0.02 ✅)
#   95% CI: [0.84, 0.93]
```
The experiment system tracks score regressions across deployments. When you run evals in CI and a new version scores worse than the baseline, the deployment blocks automatically.
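Braintrust's CLI and CI integration handle that comparison for you, but conceptually the gate is a few lines of logic — a sketch with hypothetical threshold values, not Braintrust's actual implementation:

```python
import sys

def gate(current_score: float, baseline_score: float, tolerance: float = 0.01) -> bool:
    # Pass only if the new version hasn't regressed beyond the tolerance
    return current_score >= baseline_score - tolerance

# In CI, a non-zero exit code fails the job and blocks the deploy step
if not gate(current_score=0.89, baseline_score=0.87):
    sys.exit(1)
print("eval gate passed")
```

The tolerance matters in practice: LLM evals are noisy, so gating on any decrease at all tends to block deploys for run-to-run variance rather than real regressions.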
### Braintrust Tracing
```python
import braintrust

# Point traces at a project (reads BRAINTRUST_API_KEY from the environment)
braintrust.init_logger(project="my-llm-app")

# Wrap any function for auto-tracing
@braintrust.traced
def generate_response(prompt: str, model: str = "gpt-4o") -> str:
    response = openai_client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Or trace manually with nested spans
with braintrust.start_span(name="rag-pipeline") as span:
    docs = retrieve_docs(query)  # your retriever and query (not shown)
    span.log(metadata={"num_docs": len(docs)})
    with braintrust.start_span(name="llm-call") as llm_span:
        response = generate_response(build_prompt(docs, query))
        llm_span.log(output=response)
```
## Feature Comparison
| Feature | LangSmith | Langfuse | Braintrust |
|---|---|---|---|
| Tracing | ✅ | ✅ | ✅ |
| Prompt management | ✅ | ✅ | ✅ |
| Evaluations | ✅ | ✅ | ✅ Best-in-class |
| CI/CD eval blocking | ⚠️ Manual | ⚠️ Manual | ✅ Native |
| LangChain auto-trace | ✅ Native | ✅ Plugin | ✅ Plugin |
| OpenTelemetry | ⚠️ Partial | ✅ Full | ⚠️ Partial |
| Open source | ❌ | ✅ MIT | ❌ |
| Self-hosting | ⚠️ Limited | ✅ Full | ❌ |
| Dataset management | ✅ | ✅ | ✅ |
| Human annotation | ✅ | ✅ | ✅ |
| A/B prompt experiments | ✅ | ✅ | ✅ |
| Cost tracking | ✅ | ✅ | ✅ |
| HIPAA/SOC2 | Enterprise | Enterprise | Enterprise |
## How to Choose
Choose LangSmith if:
- Your application uses LangChain or LangGraph as the primary framework
- You want zero-config tracing without any instrumentation code
- Per-seat pricing fits your team size (< 10 engineers)
Choose Langfuse if:
- You want open-source infrastructure you can fully own and self-host
- Your team is larger (unlimited user seats on cloud plan)
- You use multiple LLM frameworks or raw API calls (OpenTelemetry integration is cleaner)
- Vendor lock-in is a concern — Langfuse can be swapped with a URL change
Choose Braintrust if:
- You need CI/CD deployment gates that block on eval regressions
- Your team runs systematic A/B experiments on prompt changes
- You prefer a fully managed platform with no infrastructure overhead
- Evaluation automation is the primary use case, not just tracing
Find and compare LLM observability APIs at APIScout.
Related: OpenAI API vs Anthropic Claude API 2026 · LangChain vs Vercel AI SDK 2026