
Claude Extended Thinking API: Cost & When to Use

By the APIScout Team
Tags: claude api, anthropic, extended thinking, llm api, ai reasoning, 2026

What Extended Thinking Actually Is

Extended thinking is Anthropic's implementation of chain-of-thought reasoning at the API level. When enabled, Claude spends tokens on a hidden reasoning scratchpad before generating its final response. The thinking block is visible in the API response but not shown to end users unless you explicitly display it.

The key claim: extended thinking makes Claude meaningfully better on tasks that require multi-step reasoning — math, complex coding, nuanced analysis — at the cost of higher token usage, higher latency, and higher per-call cost.

Important 2026 update: On the latest models (claude-opus-4.6, claude-sonnet-4.6), the budget_tokens parameter is deprecated in favor of adaptive thinking (thinking: { type: "adaptive" }), which lets Claude dynamically decide when and how much to think. The manual budget_tokens approach documented here still works and is the right API for claude-3-7-sonnet and claude-opus-4, but plan to migrate to adaptive thinking on newer models.

This guide is for developers evaluating whether to turn it on, and for what.

TL;DR

Enable extended thinking for complex reasoning tasks — AIME-level math, multi-constraint programming problems, long-form analysis with conflicting evidence. Skip it for straightforward generation tasks (summarization, classification, simple Q&A) where the overhead adds cost and latency without improving output quality. Budget 5,000–10,000 thinking tokens for most use cases; 20,000+ for genuinely hard problems.

Key Takeaways

  • Supported models: claude-3-7-sonnet, claude-opus-4 (and newer models that ship with thinking capability)
  • Pricing: Thinking tokens are billed as output tokens at the model's standard output rate — they're not free
  • Minimum budget: 1,024 thinking tokens (hard minimum); maximum varies by model (up to 128K on Opus 4.6)
  • Latency impact: Significant — adding 5K thinking tokens adds ~5–15 seconds to response time depending on model
  • Claude 4+ billing gotcha: Billed for full thinking tokens but API returns only a summary — visible thinking ≠ billed thinking
  • Tool choice restriction: With thinking enabled, only tool_choice: "auto" or "none" are valid — forcing specific tools returns an API error
  • Best use cases: Multi-step math, competitive programming, complex multi-constraint decisions, long-document analysis
  • Skip it for: Summarization, classification, simple generation, conversational responses

How to Enable Extended Thinking

Extended thinking is activated via the thinking parameter in the API request:

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=16000,
    thinking={
        "type": "enabled",
        "budget_tokens": 10000  # How many tokens Claude can use for thinking
    },
    messages=[{
        "role": "user",
        "content": "Solve this step by step: A train leaves Chicago at 60 mph. Another leaves NYC at 80 mph toward Chicago. The cities are 790 miles apart. When and where do they meet?"
    }]
)

# Response contains both thinking blocks and text blocks
for block in response.content:
    if block.type == "thinking":
        print("Thinking:", block.thinking[:500])  # First 500 chars
    elif block.type == "text":
        print("Answer:", block.text)

The response structure returns a list of content blocks. Thinking blocks have type: "thinking" and contain Claude's raw reasoning. Text blocks have type: "text" and contain the final response.

// TypeScript equivalent
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const response = await client.messages.create({
  model: "claude-opus-4-5",
  max_tokens: 16000,
  thinking: {
    type: "enabled",
    budget_tokens: 10000,
  },
  messages: [
    {
      role: "user",
      content: "Your complex problem here",
    },
  ],
});

for (const block of response.content) {
  if (block.type === "thinking") {
    console.log("Thinking:", block.thinking);
  } else if (block.type === "text") {
    console.log("Response:", block.text);
  }
}

Streaming Extended Thinking

Extended thinking supports streaming. Thinking blocks stream first, then the text response:

with client.messages.stream(
    model="claude-opus-4-5",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 8000},
    messages=[{"role": "user", "content": "Your question here"}]
) as stream:
    for event in stream:
        if hasattr(event, 'type'):
            if event.type == 'content_block_start':
                if event.content_block.type == 'thinking':
                    print("Thinking started...")
            elif event.type == 'content_block_delta':
                if event.delta.type == 'thinking_delta':
                    print(event.delta.thinking, end='', flush=True)
                elif event.delta.type == 'text_delta':
                    print(event.delta.text, end='', flush=True)

For most applications, you'll want to stream the text block to the user and either discard the thinking block or log it for debugging.

Faster streaming with display: "omitted": If you don't need the thinking content in the response, set display: "omitted" to skip streaming the thinking block entirely and get faster time-to-first-text-token. You still pay for full thinking tokens — this only affects what's returned, not what's computed:

response = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=16000,
    thinking={
        "type": "enabled",
        "budget_tokens": 8000,
        "display": "omitted"  # Don't return thinking in response; faster TTFT
    },
    messages=[{"role": "user", "content": "Your question here"}]
)
# response.content will only have text blocks; thinking is computed but not returned

Pricing: What Thinking Tokens Cost

Thinking tokens are billed as output tokens at the model's output token rate. They are not free — thinking is compute-intensive.

| Model | Input (per 1M) | Output (per 1M) | Adaptive thinking |
|---|---|---|---|
| claude-haiku-4-5 | $1.00 | $5.00 | — |
| claude-sonnet-4-5 / claude-3-7-sonnet | $3.00 | $15.00 | — |
| claude-sonnet-4-6 | $3.00 | $15.00 | ✅ |
| claude-opus-4-5 | $5.00 | $25.00 | — |
| claude-opus-4-6 | $5.00 | $25.00 | ✅ (recommended) |
| claude-opus-4 / claude-opus-4-1 | $15.00 | $75.00 | — |

All thinking tokens billed as output tokens. Claude 4+ models return a summarized thinking block but bill for the full internal reasoning — budget accordingly.

A request with 10,000 thinking tokens on claude-3-7-sonnet costs an additional $0.15 in thinking overhead. On claude-opus-4-5, the same budget costs $0.25 extra.

Cost Scenario: Complex Reasoning at Scale

Suppose you're building a code review tool that uses 8,000 thinking tokens + 1,000 input tokens + 2,000 output tokens per review:

| Model | Per Review | 1,000 Reviews/Month |
|---|---|---|
| claude-3-7-sonnet | $0.153 | $153 |
| claude-opus-4-5 | $0.255 | $255 |

Compare to the same request without thinking (just 1,000 input + 2,000 output tokens):

| Model | Per Review | 1,000 Reviews/Month |
|---|---|---|
| claude-3-7-sonnet | $0.033 | $33 |
| claude-opus-4-5 | $0.055 | $55 |

Extended thinking adds ~4-5x cost overhead. This is only justified when the quality improvement is significant and measurable.
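The per-review numbers above can be reproduced with a small helper (not part of the Anthropic SDK); the rates are the per-1M-token prices from the pricing table:

```python
def request_cost(input_tokens: int, output_tokens: int, thinking_tokens: int,
                 input_rate: float, output_rate: float) -> float:
    """Return USD cost for one request; thinking tokens bill at the output rate."""
    return (input_tokens * input_rate
            + (output_tokens + thinking_tokens) * output_rate) / 1_000_000

# Code-review scenario on claude-3-7-sonnet ($3 in / $15 out per 1M tokens):
with_thinking = request_cost(1_000, 2_000, 8_000, 3.00, 15.00)
without_thinking = request_cost(1_000, 2_000, 0, 3.00, 15.00)
print(with_thinking, without_thinking)  # 0.153 0.033
```

Running the same comparison for claude-opus-4-5 ($5/$25) gives $0.255 vs $0.055, so the overhead ratio is roughly 4.6x on both models in this scenario.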

When Extended Thinking Helps (and Doesn't)

✅ High-benefit use cases

Multi-step math and symbolic reasoning
Extended thinking most clearly shines on problems where humans write long scratch-paper derivations: AIME competition problems, physics word problems with multiple unknowns, optimization problems. Claude's performance on AIME 2024 problems improves substantially with extended thinking enabled.

Competitive programming
Problems requiring algorithm design, edge case analysis, and correctness proofs. Extended thinking helps Claude reason through time complexity, identify edge cases, and verify its own solution before outputting code.

Complex code review and debugging
Multi-file debugging where understanding the call chain, state mutations, and async behavior requires holding a lot of context simultaneously. Extended thinking gives Claude space to trace execution paths before drawing conclusions.

Multi-constraint decision analysis
"Given 15 competing requirements, conflicting stakeholder preferences, and three viable architectures, which should we choose?" — tasks where the answer requires genuinely weighing competing considerations rather than pattern-matching to a common answer.

Long document analysis
When asked to synthesize findings from a 50-page document or identify contradictions across a legal contract, extended thinking gives Claude space to work through sections systematically.

❌ Low-benefit or net-negative use cases

Summarization and extraction
Summarizing a document or extracting structured data doesn't benefit from extended thinking — the task is mostly reading comprehension and formatting. The thinking tokens add cost with no quality gain.

Simple generation
Writing a product description, generating email copy, or producing boilerplate code — these are pattern-completion tasks. Extended thinking won't make them better and will add latency.

Classification and routing
Determining the sentiment of a review or routing a customer ticket to the right department — binary or multi-class classification doesn't need reasoning space.

Conversational responses
Chatbot responses to simple questions. Extended thinking feels unnatural in real-time chat (a 5–15 second delay before any response) and the quality improvement is marginal.

Time-sensitive generation
Any use case where P99 latency matters — autocomplete, real-time assistance, interactive applications — should avoid extended thinking.

Budget Token Guidance

The budget_tokens parameter controls how much thinking Claude can do. More isn't always better:

| Budget | Use Case | Approx Latency Add |
|---|---|---|
| 1,024–2,000 | Simple multi-step problems | +2–4s |
| 4,000–8,000 | Medium complexity reasoning | +5–10s |
| 10,000–20,000 | Hard reasoning tasks | +10–25s |
| 30,000–100,000 | Very hard problems (AIME, complex research) | +30–120s |

Claude won't always use the full budget — it stops thinking when it has a confident answer. Setting a high budget doesn't guarantee high thinking token usage, but it does set an upper cost ceiling.

Practical guidance: Start with 8,000 for most reasoning tasks. Bump to 16,000 if results are unsatisfactory. Only go beyond 20,000 for genuinely hard single-answer problems where accuracy is paramount.
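This guidance can be encoded as a small tiering helper; the tier names and values below are illustrative, not an Anthropic API concept:

```python
# Illustrative budget tiers following the table above.
BUDGET_TIERS = {
    "simple": 2_000,   # simple multi-step problems
    "medium": 8_000,   # starting point for most reasoning tasks
    "hard": 16_000,    # bump here if results are unsatisfactory
    "max": 32_000,     # genuinely hard single-answer problems
}

def thinking_config(tier: str) -> dict:
    """Return a thinking parameter dict for client.messages.create()."""
    budget = BUDGET_TIERS[tier]
    if budget < 1_024:  # the API's hard minimum for budget_tokens
        raise ValueError("budget_tokens must be at least 1,024")
    return {"type": "enabled", "budget_tokens": budget}

print(thinking_config("medium"))  # {'type': 'enabled', 'budget_tokens': 8000}
```

Centralizing the tiers makes it easy to tune budgets per workload without touching call sites.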

Interleaved Thinking (Advanced)

Some models support interleaved thinking — where thinking blocks and text blocks alternate in a multi-turn conversation. This allows Claude to reason about each turn in a dialogue rather than only at the start.

With interleaved thinking enabled:

  • Claude can reconsider its reasoning mid-conversation
  • Multi-turn agentic tasks benefit most (tool calls, iterative refinement)
  • The thinking blocks from prior turns are included in subsequent turns' context

This is most useful for agentic applications where Claude is executing multi-step plans with external tool calls between turns.
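On Claude 4 models (before the 4.6 adaptive-thinking era), interleaved thinking is opted into via a beta header. A sketch of building such a request follows; the header value `interleaved-thinking-2025-05-14` is Anthropic's published beta flag, but verify it against the current docs before relying on it:

```python
def interleaved_request(tools: list, messages: list) -> dict:
    """Build kwargs for client.messages.create(**kwargs) with interleaved thinking."""
    return {
        "model": "claude-opus-4",
        "max_tokens": 16000,
        # Beta opt-in header for interleaved thinking on Claude 4 models.
        "extra_headers": {"anthropic-beta": "interleaved-thinking-2025-05-14"},
        "thinking": {"type": "enabled", "budget_tokens": 10000},
        "tools": tools,
        "messages": messages,
    }

kwargs = interleaved_request(tools=[], messages=[{"role": "user", "content": "Plan the migration"}])
```

On Opus 4.6 / Sonnet 4.6 with adaptive thinking, this header is unnecessary because interleaving is automatic.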

Debugging Extended Thinking

The thinking block is a goldmine for debugging why Claude produced a wrong answer:

for block in response.content:
    if block.type == "thinking":
        thinking_text = block.thinking
        # Log to your observability platform
        logger.debug(f"Claude thinking: {thinking_text}")

        # Check if Claude expressed doubt
        if "I'm not sure" in thinking_text or "actually" in thinking_text:
            logger.warning("Claude expressed uncertainty — consider higher budget")

When Claude's final answer is wrong, reading its thinking block usually reveals exactly where the reasoning broke down — a wrong assumption, a skipped step, or a miscalculation. This feedback loop is useful when tuning your prompts.

Comparison: Extended Thinking vs Few-Shot Prompting

A common question: is extended thinking better than chain-of-thought few-shot prompting?

| Approach | Pros | Cons |
|---|---|---|
| Extended thinking | No prompt engineering; works on novel problems; thinking is separable | More expensive; higher latency; budget management |
| Few-shot CoT | Cheaper (no thinking tokens); faster; controllable format | Requires examples; brittle on distribution shift; examples consume input context |

For structured problem types where you have good examples (accounting reconciliation, code review rubrics), few-shot CoT is often more cost-effective. For open-ended reasoning where the problem type varies, extended thinking is more reliable.
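For comparison, a minimal few-shot CoT prompt looks like this; the worked example and format below are made up for demonstration:

```python
# One worked example teaches the step-by-step answer format.
FEW_SHOT_TEMPLATE = """\
Q: A queue holds 5 jobs. 3 finish and 4 new jobs arrive. How many are queued?
A: Start with 5. 3 finish: 5 - 3 = 2. 4 arrive: 2 + 4 = 6. Answer: 6.

Q: {question}
A:"""

prompt = FEW_SHOT_TEMPLATE.format(
    question="A cache holds 12 entries. 5 expire and 9 are inserted. How many entries?"
)
# Send `prompt` as an ordinary user message with no thinking parameter: the
# reasoning pattern comes from the example, and no thinking tokens are billed.
```

The trade-off from the table applies directly: the example consumes input context on every call but costs far less than a thinking budget.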

Adaptive Thinking: The New Default (Opus 4.6+)

For the latest Anthropic models, the preferred approach is adaptive thinking rather than manual budget allocation. The key addition is the effort level in output_config:

# Adaptive thinking — recommended for claude-opus-4-6 and claude-sonnet-4-6
response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=16000,
    thinking={"type": "adaptive"},
    output_config={"effort": "high"},  # "low" | "medium" | "high" (default) | "max" (Opus 4.6 only)
    messages=[{"role": "user", "content": "Your complex problem here"}]
)

effort controls how aggressively Claude applies reasoning:

  • "low" — Claude often skips thinking for simple queries; lowest cost and latency
  • "medium" — Balanced; Claude thinks when it judges the task warrants it
  • "high" (default) — Claude almost always reasons; best for production reasoning tasks
  • "max" — Opus 4.6 only; maximum reasoning effort for the hardest problems

With adaptive thinking, interleaved thinking (reasoning between tool calls) is automatic — no beta header required on Opus 4.6/Sonnet 4.6.

⚠️ Claude 4+ billing gotcha: On Claude 4 models, you are billed for the full thinking tokens generated internally — but the API only returns a summary of the thinking. The visible thinking field is shorter than what was actually computed. Budget with this in mind: if you set budget_tokens: 10000, expect to be billed for close to 10K output tokens even though the returned thinking summary may be 500 tokens.

(Claude 3.7 Sonnet does not have this behavior — it returns full thinking text, so billed tokens match what you see.)

When to stick with budget_tokens:

  • You're using claude-3-7-sonnet, claude-opus-4, or claude-opus-4-1 (not the .6 variants)
  • You need a hard cost ceiling per request
  • You want predictable latency guarantees

When to use adaptive thinking:

  • You're on Opus 4.6 or Sonnet 4.6 (where budget_tokens is deprecated)
  • Task complexity varies widely across requests
  • You want automatic interleaved thinking in agentic workflows

When to Use Which Model

claude-haiku-4-5 + extended thinking ($1/$5 per MTok): Cheapest entry point for reasoning tasks. Extended thinking only (no adaptive). Right for high-volume workloads where cost is critical and tasks benefit from structured reasoning.

claude-sonnet-4-5 / claude-3-7-sonnet + extended thinking ($3/$15 per MTok): The production sweet spot. Capable reasoning at an affordable output rate. Good for code review pipelines, analysis tasks, and mixed-complexity workloads.

claude-opus-4-5 + extended thinking ($5/$25 per MTok): Strong capability for hard problems — noticeably cheaper than the original Opus 4/4.1 ($15/$75) while delivering comparable reasoning quality. Use for architecture decisions, complex debugging, research synthesis.

claude-opus-4-6 + adaptive thinking ($5/$25 per MTok): The latest model with adaptive thinking + effort levels. Automatic interleaved thinking in agentic workflows. Use effort: "max" for the hardest problems. Best for new projects where you want the most capable, modern API.

Production Integration Patterns

Pattern 1: Selective thinking activation

Don't use extended thinking for every request. Classify the query first:

def should_use_thinking(query: str) -> bool:
    """Use simple heuristics to decide if a query benefits from thinking."""
    complexity_signals = [
        "step by step", "optimize", "algorithm", "prove", "explain why",
        "compare", "trade-off", "architecture", "debug", "analyze"
    ]
    return any(signal in query.lower() for signal in complexity_signals)

# In your API handler:
thinking_config = {"type": "enabled", "budget_tokens": 8000} if should_use_thinking(user_query) else {"type": "disabled"}

Pattern 2: Thinking token caching

For repeated similar problems (e.g., a code review pipeline), consider whether you can structure prompts to take advantage of prompt caching — though thinking blocks themselves are not cacheable.
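A sketch of this pattern for a review pipeline: the large, stable rubric carries a `cache_control` marker (Anthropic's prompt-caching mechanism) so it is cached across calls, while the per-review diff varies. The function and field values here are an assumed shape, not a drop-in implementation:

```python
def review_request(rubric: str, diff: str) -> dict:
    """Build kwargs for client.messages.create(**kwargs) with a cached rubric."""
    return {
        "model": "claude-3-7-sonnet",
        "max_tokens": 16000,
        "thinking": {"type": "enabled", "budget_tokens": 8000},
        "system": [
            {"type": "text", "text": rubric,
             "cache_control": {"type": "ephemeral"}},  # cached across reviews
        ],
        # Only the diff changes between requests.
        "messages": [{"role": "user", "content": f"Review this diff:\n{diff}"}],
    }
```

Within the cache TTL, subsequent requests read the rubric at the discounted cached-input rate; the thinking tokens are still billed in full each time.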

Pattern 3: Expose thinking as a "show your work" feature

Some applications benefit from surfacing thinking to users. A math tutoring app, a legal research tool, or a code review system might show Claude's reasoning as part of the value proposition:

import streamlit as st  # assuming a Streamlit UI

for block in response.content:
    if block.type == "thinking":
        st.expander("💭 Claude's reasoning process").write(block.thinking)
    elif block.type == "text":
        st.write(block.text)

Track Claude API pricing, uptime, and updates at APIScout.

Related: LangChain vs CrewAI vs OpenAI Agents SDK 2026 · LLM API Pricing Comparison 2026
