# LangSmith vs Langfuse vs Braintrust: LLM Tracing 2026
## TL;DR
- **Langfuse** for most teams — it's open-source (MIT), has the most generous free tier (1M trace spans/month), self-hosts cleanly, and doesn't lock you into LangChain.
- **LangSmith** if your entire stack is LangChain/LangGraph — the zero-config tracing integration is genuinely useful, but you pay for it in per-seat pricing and vendor lock-in.
- **Braintrust** for teams that need CI/CD evaluation pipelines — its deployment-blocking eval workflow is the most complete in the category, and the managed platform removes infrastructure toil.
## Key Takeaways
- Langfuse: 1M trace spans/month free, fully open-source (MIT), self-hostable on Docker Compose, OpenTelemetry native, no LangChain dependency
- LangSmith: 5K traces/month free, $39/seat/month paid, best-in-class for LangChain/LangGraph tracing, tight LangChain integration is also its ceiling
- Braintrust: 1M trace spans/month free, managed platform, CI/CD deployment blocking with automated evals — the most production-oriented evaluation story
- Self-hosting: Langfuse is the only true open-source option; LangSmith has limited self-hosting; Braintrust is managed-only
- Pricing model: Langfuse bills by usage (spans), LangSmith by seat, Braintrust by usage — Langfuse wins for high-trace-volume, low-team-size workloads
- OpenTelemetry: Langfuse has full OTel support; LangSmith has partial; Braintrust has partial
## Why LLM Tracing Matters in 2026
Debugging a hallucinating RAG pipeline or a broken agent chain without tracing is archaeology — you're examining artifacts after the fact with no execution context. LLM tracing captures the full call graph: which prompt was sent, which model responded, what tool calls fired, how long each step took, and what the token costs were.
In 2026, the category has matured from "log your prompts" to full-stack LLM observability: structured traces, evaluators that score outputs automatically, prompt versioning with A/B experiments, and CI/CD hooks that block deployments when eval scores regress.
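As a mental model, a trace is just a tree of timed spans that rolls up cost and latency. A minimal sketch (hypothetical field names, not any vendor's actual schema):

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    # One step in the pipeline: an LLM call, a retrieval, a tool call
    name: str
    input: str
    output: str = ""
    latency_ms: float = 0.0
    tokens: int = 0
    children: list["Span"] = field(default_factory=list)

    def total_tokens(self) -> int:
        # Token cost rolls up the whole call graph
        return self.tokens + sum(c.total_tokens() for c in self.children)

# A single RAG request becomes one parent span with child spans
trace = Span(name="rag-pipeline", input="What is RAG?")
trace.children.append(Span(name="retrieve", input="What is RAG?", tokens=0))
trace.children.append(Span(name="llm-call", input="prompt...", tokens=350))
print(trace.total_tokens())  # 350
```

All three platforms store something shaped like this; they differ in how you get the data in and what they let you do with it afterward.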
The three platforms that dominate this space each represent a different bet:
- LangSmith bets you're building with LangChain
- Langfuse bets you want open-source infrastructure you can own
- Braintrust bets you want the most complete managed evaluation platform
## Pricing Breakdown
| Tier | LangSmith | Langfuse | Braintrust |
|---|---|---|---|
| Free | 5K traces/mo, 1 seat | 1M spans/mo, unlimited users | 1M spans/mo, unlimited users |
| Paid base | $39/seat/month | $249/month | $249/month |
| Trace overage | $0.50 / 1K traces | Usage-based | Usage-based |
| Extended retention | $5.00 / 1K traces | Custom | Custom |
| Self-host | Limited (Plus+) | ✅ Free, OSS | ❌ |
| Enterprise | Custom | Custom | Custom |
The pricing model difference matters as much as the price:
LangSmith charges per seat — $39/user/month. A 5-person AI team costs $195/month before you count trace overages. A 20-person org hits $780/month just for seats. This model is expensive for growing teams.
Langfuse charges per usage — based on trace spans ingested on cloud, or free if you self-host. For teams with 10+ engineers all viewing traces, Langfuse Cloud's flat-rate Pro tier ($249/month, unlimited users) frequently beats LangSmith's per-seat math.
Braintrust charges per usage similarly, with the same $249/month Pro entry point and 1M spans on the free tier.
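The break-even point is easy to compute from the list prices above (baked in here as constants; actual pricing may change — check the vendors' pages):

```python
LANGSMITH_PER_SEAT = 39   # $/seat/month, per the table above
LANGFUSE_PRO_FLAT = 249   # $/month flat, unlimited users (cloud Pro tier)

def langsmith_monthly(seats: int) -> int:
    # Seat-based pricing, before any trace overages
    return seats * LANGSMITH_PER_SEAT

# Find the team size where the flat-rate plan becomes cheaper
break_even = next(n for n in range(1, 50) if langsmith_monthly(n) > LANGFUSE_PRO_FLAT)
print(break_even)  # 7 — at 7+ seats, per-seat pricing ($273/mo) exceeds the $249 flat rate
```

In other words, a team of six or fewer pays less per month on per-seat pricing; at seven seats and up, the flat-rate plans win even before trace volume enters the picture.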
## LangSmith: The LangChain-Native Option
LangSmith's killer feature is zero-config tracing for LangChain applications. Two environment variables and every chain, agent, and LLM call automatically appears in the trace UI:
```python
import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your-api-key"

# That's it. All LangChain calls are now traced automatically.
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

llm = ChatOpenAI(model="gpt-4o")
prompt = ChatPromptTemplate.from_template("Answer this question: {question}")
chain = prompt | llm

# This call appears in LangSmith with a full trace including:
# - Input/output at each step
# - Token counts and costs
# - Latency per step
result = chain.invoke({"question": "What is RAG?"})
```
For teams running LangGraph agents, LangSmith traces the full graph execution — node by node, edge by edge. No instrumentation code required beyond the environment variables.
### LangSmith for Non-LangChain Code
Outside the LangChain ecosystem, LangSmith becomes more work. You instrument manually using the @traceable decorator:
```python
from openai import OpenAI
from langsmith import traceable

openai_client = OpenAI()

@traceable(name="generate-answer")
def generate_answer(question: str, context: str) -> str:
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Context: {context}"},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

@traceable(name="rag-pipeline")
def rag_pipeline(query: str) -> str:
    # Creates a parent trace; nested @traceable calls become child spans
    docs = retrieve_docs(query)  # your retriever — not shown here
    context = "\n".join(d.page_content for d in docs)
    return generate_answer(query, context)
```
Manual instrumentation works, but it's boilerplate you maintain. Langfuse and Braintrust have similar patterns with slightly cleaner APIs.
## Langfuse: The Open-Source Choice
Langfuse is the only genuinely open-source option with a permissive license (MIT for all core features). The self-hosted version has no usage limits and no feature gates — the same codebase runs Langfuse Cloud and self-hosted deployments.
```yaml
# docker-compose.yml — minimal Langfuse self-hosted stack
# Note: newer Langfuse releases may also require S3-compatible blob storage
# and a worker container — check the self-hosting docs for your version.
version: "3"
services:
  langfuse-server:
    image: langfuse/langfuse:latest
    depends_on:
      - db
      - clickhouse
      - redis
    ports:
      - "3000:3000"
    environment:
      DATABASE_URL: postgresql://langfuse:password@db:5432/langfuse
      CLICKHOUSE_URL: http://clickhouse:8123
      REDIS_URL: redis://redis:6379
      NEXTAUTH_SECRET: your-secret-here
      NEXTAUTH_URL: http://localhost:3000
      SALT: your-salt-here
  db:
    image: postgres:15
    environment:
      POSTGRES_USER: langfuse
      POSTGRES_PASSWORD: password
      POSTGRES_DB: langfuse
    volumes:
      - postgres_data:/var/lib/postgresql/data
  clickhouse:
    image: clickhouse/clickhouse-server:24
    volumes:
      - clickhouse_data:/var/lib/clickhouse
  redis:
    image: redis:7
volumes:
  postgres_data:
  clickhouse_data:
```
Production self-hosting typically means Kubernetes (or managed services) for ClickHouse and Postgres high availability, but for development or small teams, Docker Compose is the full stack.
### Langfuse SDK Integration
```python
from openai import OpenAI
from langfuse import Langfuse
from langfuse.decorators import observe, langfuse_context

openai_client = OpenAI()
langfuse = Langfuse(
    public_key="pk-...",
    secret_key="sk-...",
    host="https://cloud.langfuse.com",  # or your self-hosted URL
)

@observe()
def retrieve_documents(query: str) -> list[str]:
    # Auto-traced as a span; vector_store is your retriever (not shown)
    return vector_store.similarity_search(query, k=3)

@observe(name="rag-pipeline")
def answer_question(question: str) -> str:
    docs = retrieve_documents(question)
    context = "\n".join(docs)
    # Attach custom metadata to the current span
    langfuse_context.update_current_observation(
        metadata={"num_docs": len(docs), "retrieval_method": "semantic"},
        input={"question": question},
    )
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Use this context: {context}"},
            {"role": "user", "content": question},
        ],
    )
    answer = response.choices[0].message.content
    langfuse_context.update_current_observation(output=answer)
    return answer
```
Langfuse supports OpenTelemetry natively — if your codebase uses OTel for general observability, you can route LLM traces to Langfuse alongside your existing infrastructure.
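If you go the OTel route, the exporter just needs an endpoint and Basic-auth headers built from your Langfuse key pair. A sketch of that wiring — the endpoint path and auth scheme here are assumptions based on Langfuse's OTLP intake; verify them against the Langfuse OpenTelemetry docs for your deployment:

```python
import base64

LANGFUSE_HOST = "https://cloud.langfuse.com"  # or your self-hosted URL
public_key, secret_key = "pk-...", "sk-..."

# OTLP exporters authenticate to Langfuse with HTTP Basic auth
# built from the public/secret key pair
token = base64.b64encode(f"{public_key}:{secret_key}".encode()).decode()
otlp_config = {
    "endpoint": f"{LANGFUSE_HOST}/api/public/otel/v1/traces",
    "headers": {"Authorization": f"Basic {token}"},
}
# Pass these to your OTLPSpanExporter (opentelemetry-exporter-otlp)
print(otlp_config["endpoint"])
```

The payoff is that LLM spans land next to your existing service traces, so a slow retrieval shows up in the same waterfall as the database query behind it.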
### Langfuse Evaluations
```python
# Score traces programmatically (e.g. from an LLM-as-a-judge eval)
from langfuse import Langfuse

langfuse = Langfuse()

# Fetch recent traces
traces = langfuse.fetch_traces(limit=100).data

for trace in traces:
    # Attach a score to each trace
    langfuse.score(
        trace_id=trace.id,
        name="relevance",
        value=0.85,  # 0-1 scale
        comment="Response addresses the question directly",
    )
```
## Braintrust: The Eval-First Platform
Braintrust's differentiation is building evaluation into the deployment pipeline — not as a reporting tool you check after the fact, but as a blocking gate in CI/CD:
```python
from braintrust import Eval

# Define your eval dataset
examples = [
    {"input": "What is RAG?", "expected": "Retrieval-Augmented Generation..."},
    {"input": "Explain embeddings", "expected": "Embeddings are numerical..."},
]

# The task receives each example's "input" value directly
def my_rag_pipeline(input):
    return rag_pipeline(input)  # the pipeline under evaluation (defined elsewhere)

# Scorers receive the input, the task's output, and the expected value
def semantic_similarity(input, output, expected):
    # Returns a 0-1 score; embed/cosine_similarity are your own helpers
    return cosine_similarity(embed(output), embed(expected))

# Run the eval and record an experiment — in CI, a regression against
# the baseline experiment can block the deploy
result = Eval(
    "my-rag-experiment",
    data=examples,
    task=my_rag_pipeline,
    scores=[semantic_similarity],
)
print(result.summary)
# Summarizes scores against the experiment baseline, e.g.:
#   score: 0.89 (vs baseline: 0.87, +0.02 ✅)
#   95% CI: [0.84, 0.93]
```
The experiment system tracks score regressions across deployments. When you run evals in CI and a new version scores worse than the baseline, the deployment blocks automatically.
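Braintrust's CLI and CI integration handle that comparison for you, but conceptually the gate is a few lines of logic — a sketch with hypothetical threshold values, not Braintrust's actual implementation:

```python
import sys

def gate(current_score: float, baseline_score: float, tolerance: float = 0.01) -> bool:
    # Pass only if the new version hasn't regressed beyond the tolerance
    return current_score >= baseline_score - tolerance

# In CI, a non-zero exit code fails the job and blocks the deploy step
if not gate(current_score=0.89, baseline_score=0.87):
    sys.exit(1)
print("eval gate passed")
```

The tolerance matters in practice: LLM evals are noisy, so gating on any decrease at all tends to block deploys for run-to-run variance rather than real regressions.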
### Braintrust Tracing
```python
import braintrust

# Point traces at a project (reads BRAINTRUST_API_KEY from the environment)
braintrust.init_logger(project="my-llm-app")

# Wrap any function for auto-tracing
@braintrust.traced
def generate_response(prompt: str, model: str = "gpt-4o") -> str:
    response = openai_client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Or trace manually with nested spans
with braintrust.start_span(name="rag-pipeline") as span:
    docs = retrieve_docs(query)  # your retriever and query (not shown)
    span.log(metadata={"num_docs": len(docs)})
    with braintrust.start_span(name="llm-call") as llm_span:
        response = generate_response(build_prompt(docs, query))
        llm_span.log(output=response)
```
## Feature Comparison
| Feature | LangSmith | Langfuse | Braintrust |
|---|---|---|---|
| Tracing | ✅ | ✅ | ✅ |
| Prompt management | ✅ | ✅ | ✅ |
| Evaluations | ✅ | ✅ | ✅ Best-in-class |
| CI/CD eval blocking | ⚠️ Manual | ⚠️ Manual | ✅ Native |
| LangChain auto-trace | ✅ Native | ✅ Plugin | ✅ Plugin |
| OpenTelemetry | ⚠️ Partial | ✅ Full | ⚠️ Partial |
| Open source | ❌ | ✅ MIT | ❌ |
| Self-hosting | ⚠️ Limited | ✅ Full | ❌ |
| Dataset management | ✅ | ✅ | ✅ |
| Human annotation | ✅ | ✅ | ✅ |
| A/B prompt experiments | ✅ | ✅ | ✅ |
| Cost tracking | ✅ | ✅ | ✅ |
| HIPAA/SOC2 | Enterprise | Enterprise | Enterprise |
## How to Choose
Choose LangSmith if:
- Your application uses LangChain or LangGraph as the primary framework
- You want zero-config tracing without any instrumentation code
- Per-seat pricing fits your team size (< 10 engineers)
Choose Langfuse if:
- You want open-source infrastructure you can fully own and self-host
- Your team is larger (unlimited user seats on cloud plan)
- You use multiple LLM frameworks or raw API calls (OpenTelemetry integration is cleaner)
- Vendor lock-in is a concern — Langfuse can be swapped with a URL change
Choose Braintrust if:
- You need CI/CD deployment gates that block on eval regressions
- Your team runs systematic A/B experiments on prompt changes
- You prefer a fully managed platform with no infrastructure overhead
- Evaluation automation is the primary use case, not just tracing
Find and compare LLM observability APIs at APIScout.
Related: OpenAI API vs Anthropic Claude API 2026 · LangChain vs Vercel AI SDK 2026