
LangSmith vs Langfuse vs Braintrust: LLM Tracing 2026

· APIScout Team
Tags: langsmith · langfuse · braintrust · llm-observability · tracing · llmops · 2026

TL;DR

Langfuse for most teams — it's open-source, has the most generous free tier (1M trace spans/month), self-hosts cleanly, and doesn't lock you into LangChain. LangSmith if your entire stack is LangChain/LangGraph — the zero-config tracing integration is genuinely useful but you pay in per-seat pricing and vendor lock-in. Braintrust for teams that need CI/CD evaluation pipelines — the deployment-blocking eval workflow is the most complete in the category, and the managed platform removes infrastructure toil.

Key Takeaways

  • Langfuse: 1M trace spans/month free, fully open-source (MIT), self-hostable on Docker Compose, OpenTelemetry native, no LangChain dependency
  • LangSmith: 5K traces/month free, $39/seat/month paid, best-in-class for LangChain/LangGraph tracing, tight LangChain integration is also its ceiling
  • Braintrust: 1M trace spans/month free, managed platform, CI/CD deployment blocking with automated evals — the most production-oriented evaluation story
  • Self-hosting: Langfuse is the only true open-source option; LangSmith has limited self-hosting; Braintrust is managed-only
  • Pricing model: Langfuse bills by usage (spans), LangSmith by seat, Braintrust by usage — Langfuse wins for high-trace-volume, low-team-size workloads
  • OpenTelemetry: Langfuse has full OTel support; LangSmith has partial; Braintrust has partial

Why LLM Tracing Matters in 2026

Debugging a hallucinating RAG pipeline or a broken agent chain without tracing is archaeology — you're examining artifacts after the fact with no execution context. LLM tracing captures the full call graph: which prompt was sent, which model responded, what tool calls fired, how long each step took, and what the token costs were.

In 2026, the category has matured from "log your prompts" to full-stack LLM observability: structured traces, evaluators that score outputs automatically, prompt versioning with A/B experiments, and CI/CD hooks that block deployments when eval scores regress.
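To make the call-graph idea concrete, here is a minimal sketch of the kind of record a tracing SDK captures per step. The field names are illustrative, not any vendor's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    """One step in an LLM call graph: a prompt, tool call, or retrieval."""
    name: str                 # e.g. "llm-call" or "retrieve-docs"
    input: str                # prompt or tool arguments sent
    output: str               # model response or tool result
    latency_ms: float         # wall-clock time for this step
    prompt_tokens: int = 0
    completion_tokens: int = 0
    children: list["Span"] = field(default_factory=list)  # nested sub-steps

    def total_tokens(self) -> int:
        """Token usage for this span plus all nested child spans."""
        return (self.prompt_tokens + self.completion_tokens
                + sum(c.total_tokens() for c in self.children))

# A two-step RAG trace: a retrieval span nested under the pipeline root
root = Span("rag-pipeline", "What is RAG?", "RAG is...", 820.0, 310, 95,
            children=[Span("retrieve-docs", "What is RAG?", "3 docs", 120.0)])
print(root.total_tokens())  # → 405
```

All three platforms record some variant of this structure; they differ in how it gets populated (auto-instrumentation vs. decorators vs. manual spans) and where it's stored.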

The three platforms that dominate this space each represent a different bet:

  • LangSmith bets you're building with LangChain
  • Langfuse bets you want open-source infrastructure you can own
  • Braintrust bets you want the most complete managed evaluation platform

Pricing Breakdown

| Tier | LangSmith | Langfuse | Braintrust |
|---|---|---|---|
| Free | 5K traces/mo, 1 seat | 1M spans/mo, unlimited users | 1M spans/mo, unlimited users |
| Paid base | $39/seat/month | $249/month | $249/month |
| Trace overage | $0.50 / 1K traces | Usage-based | Usage-based |
| Extended retention | $5.00 / 1K traces | Custom | Custom |
| Self-host | Limited (Plus+) | ✅ Free, OSS | ❌ Managed only |
| Enterprise | Custom | Custom | Custom |

The pricing model difference matters as much as the price:

LangSmith charges per seat at $39/user/month. A 5-person AI team costs $195/month before trace overages; a 20-person org pays $780/month for seats alone. Because the bill scales with headcount rather than usage, costs keep climbing as the team grows even if trace volume stays flat.

Langfuse charges per usage — based on trace spans ingested on cloud, or free if you self-host. For teams with 10+ engineers all viewing traces, Langfuse Cloud's flat-rate Pro tier ($249/month, unlimited users) frequently beats LangSmith's per-seat math.

Braintrust charges per usage similarly, with the same $249/month Pro entry point and 1M spans on the free tier.
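The crossover point is easy to compute from the list prices above (real bills add usage overages on both sides):

```python
def monthly_cost_per_seat(seats: int, seat_price: float = 39.0) -> float:
    """Per-seat model (LangSmith): cost scales with team size."""
    return seats * seat_price

def monthly_cost_flat(flat_price: float = 249.0) -> float:
    """Flat Pro tier (Langfuse Cloud / Braintrust): seats are unlimited."""
    return flat_price

# Find the team size where per-seat pricing overtakes the flat tier
crossover = next(n for n in range(1, 100)
                 if monthly_cost_per_seat(n) > monthly_cost_flat())
print(crossover)  # → 7  (7 seats × $39 = $273 > $249)
```

At seven or more engineers viewing traces, the flat-rate tiers win on seats alone, before counting trace volume.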


LangSmith: The LangChain-Native Option

LangSmith's killer feature is zero-config tracing for LangChain applications. Two environment variables and every chain, agent, and LLM call automatically appears in the trace UI:

import os
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your-api-key"

# That's it. All LangChain calls are now traced automatically.
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

llm = ChatOpenAI(model="gpt-4o")
prompt = ChatPromptTemplate.from_template("Answer this question: {question}")
chain = prompt | llm

# This call appears in LangSmith with full trace including:
# - Input/output at each step
# - Token counts and costs
# - Latency per step
result = chain.invoke({"question": "What is RAG?"})

For teams running LangGraph agents, LangSmith traces the full graph execution — node by node, edge by edge. No instrumentation code required beyond the environment variables.

LangSmith for Non-LangChain Code

Outside the LangChain ecosystem, LangSmith becomes more work. You instrument manually using the @traceable decorator:

from langsmith import traceable
from openai import OpenAI

openai_client = OpenAI()  # the decorator works with any provider client

@traceable(name="generate-answer")
def generate_answer(question: str, context: str) -> str:
    # Your OpenAI / Anthropic call here
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Context: {context}"},
            {"role": "user", "content": question}
        ]
    )
    return response.choices[0].message.content

@traceable(name="rag-pipeline")
def rag_pipeline(query: str) -> str:
    # This creates a parent trace with child spans
    docs = retrieve_docs(query)
    context = "\n".join([d.page_content for d in docs])
    return generate_answer(query, context)

Manual instrumentation works, but it's boilerplate you maintain. Langfuse and Braintrust have similar patterns with slightly cleaner APIs.


Langfuse: The Open-Source Choice

Langfuse is the only genuinely open-source option with a permissive license (MIT for all core features). The self-hosted version has no usage limits and no feature gates — the same codebase runs Langfuse Cloud and self-hosted deployments.

# docker-compose.yml — minimal Langfuse self-hosted stack
version: "3"
services:
  langfuse-server:
    image: langfuse/langfuse:latest
    depends_on:
      - db
      - clickhouse
      - redis
    ports:
      - "3000:3000"
    environment:
      DATABASE_URL: postgresql://langfuse:password@db:5432/langfuse
      CLICKHOUSE_URL: http://clickhouse:8123
      REDIS_URL: redis://redis:6379
      NEXTAUTH_SECRET: your-secret-here
      NEXTAUTH_URL: http://localhost:3000
      SALT: your-salt-here

  db:
    image: postgres:15
    environment:
      POSTGRES_USER: langfuse
      POSTGRES_PASSWORD: password
      POSTGRES_DB: langfuse
    volumes:
      - postgres_data:/var/lib/postgresql/data

  clickhouse:
    image: clickhouse/clickhouse-server:24
    volumes:
      - clickhouse_data:/var/lib/clickhouse

  redis:
    image: redis:7

volumes:
  postgres_data:
  clickhouse_data:

Production self-hosting needs Kubernetes for ClickHouse and Postgres HA, but for development or small teams, Docker Compose is the full stack.

Langfuse SDK Integration

from langfuse import Langfuse
from langfuse.decorators import observe, langfuse_context

# The @observe decorators pick up credentials from the LANGFUSE_PUBLIC_KEY,
# LANGFUSE_SECRET_KEY, and LANGFUSE_HOST environment variables; this explicit
# client is for direct API calls such as scoring and fetching traces.
langfuse = Langfuse(
    public_key="pk-...",
    secret_key="sk-...",
    host="https://cloud.langfuse.com"  # or self-hosted URL
)

@observe()
def retrieve_documents(query: str) -> list[str]:
    # Auto-traced as a span
    return vector_store.similarity_search(query, k=3)

@observe(name="rag-pipeline")
def answer_question(question: str) -> str:
    docs = retrieve_documents(question)
    context = "\n".join(docs)

    # Add custom metadata to the trace
    langfuse_context.update_current_observation(
        metadata={"num_docs": len(docs), "retrieval_method": "semantic"},
        input={"question": question},
    )

    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Use this context: {context}"},
            {"role": "user", "content": question}
        ]
    )

    answer = response.choices[0].message.content
    langfuse_context.update_current_observation(output=answer)
    return answer

Langfuse supports OpenTelemetry natively — if your codebase uses OTel for general observability, you can route LLM traces to Langfuse alongside your existing infrastructure.
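As a sketch of what that routing involves: Langfuse accepts OTLP traffic authenticated with HTTP Basic auth built from the project's key pair. The endpoint path shown here is an assumption to check against the Langfuse OTel docs for your version:

```python
import base64

def langfuse_otlp_headers(public_key: str, secret_key: str) -> dict[str, str]:
    """Build the Basic-auth header an OTLP exporter would send to Langfuse."""
    token = base64.b64encode(f"{public_key}:{secret_key}".encode()).decode()
    return {"Authorization": f"Basic {token}"}

# Point any OTLP/HTTP exporter at the Langfuse collector with these headers
endpoint = "https://cloud.langfuse.com/api/public/otel"  # or your self-hosted URL
headers = langfuse_otlp_headers("pk-lf-...", "sk-lf-...")
print(headers["Authorization"].startswith("Basic "))  # → True
```

Because this is plain OTLP, the same exporter configuration that feeds your existing tracing backend can fan LLM spans out to Langfuse without a second instrumentation layer.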

Langfuse Evaluations

# Run LLM-as-a-Judge evaluations on traces
from langfuse import Langfuse

langfuse = Langfuse()

# Fetch recent traces
traces = langfuse.fetch_traces(limit=100).data

for trace in traces:
    # Score a trace (e.g., from your custom eval)
    langfuse.score(
        trace_id=trace.id,
        name="relevance",
        value=0.85,  # 0-1 scale
        comment="Response addresses the question directly"
    )

Braintrust: The Eval-First Platform

Braintrust's differentiation is building evaluation into the deployment pipeline — not as a reporting tool you check after the fact, but as a blocking gate in CI/CD:

import braintrust
from braintrust import Eval

# Define your eval dataset
examples = [
    {"input": "What is RAG?", "expected": "Retrieval-Augmented Generation..."},
    {"input": "Explain embeddings", "expected": "Embeddings are numerical..."},
]

# Define your task (the thing you're evaluating)
def my_rag_pipeline(input_data):
    return rag_pipeline(input_data["input"])

# Define your scorer
def semantic_similarity(input, output, expected):
    # Returns a 0-1 score
    return cosine_similarity(embed(output), embed(expected))

# Run the eval; in CI, a score regression vs. the baseline fails the job
result = Eval(
    "my-rag-experiment",
    data=examples,
    task=my_rag_pipeline,
    scores=[semantic_similarity],
)

print(result.summary)
# EvalResultWithSummary:
#   score: 0.89 (vs baseline: 0.87, +0.02 ✅)
#   95% CI: [0.84, 0.93]

The experiment system tracks score regressions across deployments. When you run evals in CI and a new version scores worse than the baseline, the deployment blocks automatically.
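The `semantic_similarity` scorer above leans on `embed` and `cosine_similarity` helpers the snippet leaves undefined. The embedding call would come from your model provider, but the similarity itself is a few lines of pure Python:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Parallel vectors score 1.0; orthogonal vectors score 0.0
print(round(cosine_similarity([1.0, 2.0], [2.0, 4.0]), 6))  # → 1.0
print(round(cosine_similarity([1.0, 0.0], [0.0, 1.0]), 6))  # → 0.0
```

Embedding-based scorers like this are cheap enough to run on every CI invocation; LLM-as-a-judge scorers trade cost for nuance.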

Braintrust Tracing

import braintrust

# Initialize a logger once per process so spans land in a project
braintrust.init_logger(project="my-project")

# Wrap any function for auto-tracing
@braintrust.traced
def generate_response(prompt: str, model: str = "gpt-4o") -> str:
    response = openai_client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

# Or trace manually with spans
with braintrust.start_span(name="rag-pipeline") as span:
    docs = retrieve_docs(query)
    span.log(metadata={"num_docs": len(docs)})

    with braintrust.start_span(name="llm-call") as llm_span:
        response = generate_response(build_prompt(docs, query))
        llm_span.log(output=response)

Feature Comparison

| Feature | LangSmith | Langfuse | Braintrust |
|---|---|---|---|
| Tracing | ✅ | ✅ | ✅ |
| Prompt management | ✅ | ✅ | ✅ |
| Evaluations | ✅ | ✅ | ✅ Best-in-class |
| CI/CD eval blocking | ⚠️ Manual | ⚠️ Manual | ✅ Native |
| LangChain auto-trace | ✅ Native | ✅ Plugin | ✅ Plugin |
| OpenTelemetry | ⚠️ Partial | ✅ Full | ⚠️ Partial |
| Open source | ❌ | ✅ MIT | ❌ |
| Self-hosting | ⚠️ Limited | ✅ Full | ❌ |
| Dataset management | ✅ | ✅ | ✅ |
| Human annotation | ✅ | ✅ | ✅ |
| A/B prompt experiments | ✅ | ✅ | ✅ |
| Cost tracking | ✅ | ✅ | ✅ |
| HIPAA/SOC2 | Enterprise | Enterprise | Enterprise |

How to Choose

Choose LangSmith if:

  • Your application uses LangChain or LangGraph as the primary framework
  • You want zero-config tracing without any instrumentation code
  • Per-seat pricing fits your team size (< 10 engineers)

Choose Langfuse if:

  • You want open-source infrastructure you can fully own and self-host
  • Your team is larger (unlimited user seats on cloud plan)
  • You use multiple LLM frameworks or raw API calls (OpenTelemetry integration is cleaner)
  • Vendor lock-in is a concern — Langfuse can be swapped with a URL change

Choose Braintrust if:

  • You need CI/CD deployment gates that block on eval regressions
  • Your team runs systematic A/B experiments on prompt changes
  • You prefer a fully managed platform with no infrastructure overhead
  • Evaluation automation is the primary use case, not just tracing

Find and compare LLM observability APIs at APIScout.

Related: OpenAI API vs Anthropic Claude API 2026 · LangChain vs Vercel AI SDK 2026
