Claude Extended Thinking API: Cost & When to Use
What Extended Thinking Actually Is
Extended thinking is Anthropic's implementation of chain-of-thought reasoning at the API level. When enabled, Claude spends tokens on a hidden reasoning scratchpad before generating its final response. The thinking block is visible in the API response but not shown to end users unless you explicitly display it.
The key claim: extended thinking makes Claude meaningfully better on tasks that require multi-step reasoning — math, complex coding, nuanced analysis — at the cost of higher token usage, higher latency, and higher per-call cost.
Important 2026 update: On the latest models (claude-opus-4.6, claude-sonnet-4.6), the budget_tokens parameter is deprecated in favor of adaptive thinking (thinking: { type: "adaptive" }), which lets Claude dynamically decide when and how much to think. The manual budget_tokens approach documented here still works and is the right API for claude-3-7-sonnet and claude-opus-4, but plan to migrate to adaptive thinking on newer models.
This guide is for developers evaluating whether to turn it on, and for what.
TL;DR
Enable extended thinking for complex reasoning tasks — AIME-level math, multi-constraint programming problems, long-form analysis with conflicting evidence. Skip it for straightforward generation tasks (summarization, classification, simple Q&A) where the overhead adds cost and latency without improving output quality. Budget 5,000–10,000 thinking tokens for most use cases; 20,000+ for genuinely hard problems.
Key Takeaways
- Supported models: claude-3-7-sonnet, claude-opus-4 (and newer models that ship with thinking capability)
- Pricing: Thinking tokens are billed as output tokens at the model's standard output rate — they're not free
- Minimum budget: 1,024 thinking tokens (hard minimum); maximum varies by model (up to 128K on Opus 4.6)
- Latency impact: Significant — adding 5K thinking tokens adds ~5–15 seconds to response time depending on model
- Claude 4+ billing gotcha: Billed for full thinking tokens but API returns only a summary — visible thinking ≠ billed thinking
- Tool choice restriction: With thinking enabled, only `tool_choice: "auto"` or `"none"` are valid — forcing specific tools returns an API error
- Best use cases: Multi-step math, competitive programming, complex multi-constraint decisions, long-document analysis
- Skip it for: Summarization, classification, simple generation, conversational responses
How to Enable Extended Thinking
Extended thinking is activated via the thinking parameter in the API request:
```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=16000,
    thinking={
        "type": "enabled",
        "budget_tokens": 10000  # How many tokens Claude can use for thinking
    },
    messages=[{
        "role": "user",
        "content": "Solve this step by step: A train leaves Chicago at 60 mph. Another leaves NYC at 80 mph toward Chicago. The cities are 790 miles apart. When and where do they meet?"
    }]
)

# Response contains both thinking blocks and text blocks
for block in response.content:
    if block.type == "thinking":
        print("Thinking:", block.thinking[:500])  # First 500 chars
    elif block.type == "text":
        print("Answer:", block.text)
```
The response structure returns a list of content blocks. Thinking blocks have type: "thinking" and contain Claude's raw reasoning. Text blocks have type: "text" and contain the final response.
The TypeScript equivalent:

```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const response = await client.messages.create({
  model: "claude-opus-4-5",
  max_tokens: 16000,
  thinking: {
    type: "enabled",
    budget_tokens: 10000,
  },
  messages: [
    {
      role: "user",
      content: "Your complex problem here",
    },
  ],
});

for (const block of response.content) {
  if (block.type === "thinking") {
    console.log("Thinking:", block.thinking);
  } else if (block.type === "text") {
    console.log("Response:", block.text);
  }
}
```
Streaming Extended Thinking
Extended thinking supports streaming. Thinking blocks stream first, then the text response:
```python
with client.messages.stream(
    model="claude-opus-4-5",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 8000},
    messages=[{"role": "user", "content": "Your question here"}]
) as stream:
    for event in stream:
        if event.type == "content_block_start":
            if event.content_block.type == "thinking":
                print("Thinking started...")
        elif event.type == "content_block_delta":
            if event.delta.type == "thinking_delta":
                print(event.delta.thinking, end="", flush=True)
            elif event.delta.type == "text_delta":
                print(event.delta.text, end="", flush=True)
```
For most applications, you'll want to stream the text block to the user and either discard the thinking block or log it for debugging.
Faster streaming with display: "omitted": If you don't need the thinking content in the response, set display: "omitted" to skip streaming the thinking block entirely and get faster time-to-first-text-token. You still pay for full thinking tokens — this only affects what's returned, not what's computed:
```python
response = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=16000,
    thinking={
        "type": "enabled",
        "budget_tokens": 8000,
        "display": "omitted"  # Don't return thinking in response; faster TTFT
    },
    messages=[{"role": "user", "content": "Your question here"}]
)
# response.content will only have text blocks; thinking is computed but not returned
```
Pricing: What Thinking Tokens Cost
Thinking tokens are billed as output tokens at the model's output token rate. They are not free — thinking is compute-intensive.
| Model | Input (per 1M) | Output (per 1M) | Adaptive thinking |
|---|---|---|---|
| claude-haiku-4-5 | $1.00 | $5.00 | ❌ |
| claude-sonnet-4-5 / claude-3-7-sonnet | $3.00 | $15.00 | ❌ |
| claude-sonnet-4-6 | $3.00 | $15.00 | ✅ |
| claude-opus-4-5 | $5.00 | $25.00 | ❌ |
| claude-opus-4-6 | $5.00 | $25.00 | ✅ (recommended) |
| claude-opus-4 / claude-opus-4-1 | $15.00 | $75.00 | ❌ |
All thinking tokens billed as output tokens. Claude 4+ models return a summarized thinking block but bill for the full internal reasoning — budget accordingly.
A request with 10,000 thinking tokens on claude-3-7-sonnet costs an additional $0.15 in thinking overhead. On claude-opus-4-5, the same budget costs $0.25 extra.
Cost Scenario: Complex Reasoning at Scale
Suppose you're building a code review tool that uses 8,000 thinking tokens + 1,000 input tokens + 2,000 output tokens per review:
| Model | Per Review | 1,000 Reviews/Month |
|---|---|---|
| claude-3-7-sonnet | $0.153 | $153 |
| claude-opus-4-5 | $0.255 | $255 |
Compare to the same request without thinking (just 1,000 input + 2,000 output tokens):
| Model | Per Review | 1,000 Reviews/Month |
|---|---|---|
| claude-3-7-sonnet | $0.033 | $33 |
| claude-opus-4-5 | $0.055 | $55 |
Extended thinking adds ~4-5x cost overhead. This is only justified when the quality improvement is significant and measurable.
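The scenario above can be reproduced with a small cost helper. The rates are hardcoded from the pricing table; adjust them if Anthropic's published pricing changes:

```python
# Rough per-request cost estimator. Rates are USD per million tokens,
# taken from the pricing table above; thinking tokens bill at the output rate.
RATES = {
    "claude-3-7-sonnet": {"input": 3.00, "output": 15.00},
    "claude-opus-4-5": {"input": 5.00, "output": 25.00},
}

def request_cost(model: str, input_tokens: int, output_tokens: int,
                 thinking_tokens: int = 0) -> float:
    r = RATES[model]
    billable_output = output_tokens + thinking_tokens  # thinking bills as output
    return (input_tokens * r["input"] + billable_output * r["output"]) / 1_000_000

# Code-review scenario: 1,000 input + 2,000 output + 8,000 thinking tokens
print(request_cost("claude-3-7-sonnet", 1_000, 2_000, 8_000))  # 0.153
print(request_cost("claude-opus-4-5", 1_000, 2_000, 8_000))    # 0.255
```

The same function with `thinking_tokens=0` reproduces the no-thinking baseline, which makes it easy to quantify the overhead for your own token mix before committing.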
When Extended Thinking Helps (and Doesn't)
✅ High-benefit use cases
**Multi-step math and symbolic reasoning:** Extended thinking shines most clearly on problems where humans write long scratch-paper derivations: AIME math competition problems, physics word problems with multiple unknowns, optimization problems. Claude's performance on AIME 2024 problems improves substantially with extended thinking enabled.
**Competitive programming:** Problems requiring algorithm design, edge case analysis, and correctness proofs. Extended thinking helps Claude reason through time complexity, identify edge cases, and verify its own solution before outputting code.
**Complex code review and debugging:** Multi-file debugging where understanding the call chain, state mutations, and async behavior requires holding a lot of context simultaneously. Extended thinking gives Claude space to trace execution paths before drawing conclusions.
**Multi-constraint decision analysis:** "Given 15 competing requirements, conflicting stakeholder preferences, and three viable architectures, which should we choose?" — tasks where the answer requires genuinely weighing competing considerations rather than pattern-matching to a common answer.
**Long document analysis:** When asked to synthesize findings from a 50-page document or identify contradictions across a legal contract, extended thinking gives Claude space to work through sections systematically.
❌ Low-benefit or net-negative use cases
**Summarization and extraction:** Summarizing a document or extracting structured data doesn't benefit from extended thinking — the task is mostly reading comprehension and formatting. The thinking tokens add cost with no quality gain.
**Simple generation:** Writing a product description, generating email copy, or producing boilerplate code — these are pattern-completion tasks. Extended thinking won't make them better and will add latency.
**Classification and routing:** Determining the sentiment of a review or routing a customer ticket to the right department — binary or multi-class classification doesn't need reasoning space.
**Conversational responses:** Chatbot responses to simple questions. Extended thinking feels unnatural in real-time chat (a 5–15 second delay before any response) and the quality improvement is marginal.
**Time-sensitive generation:** Any use case where P99 latency matters — autocomplete, real-time assistance, interactive applications — should avoid extended thinking.
Budget Token Guidance
The budget_tokens parameter controls how much thinking Claude can do. More isn't always better:
| Budget | Use Case | Approx Latency Add |
|---|---|---|
| 1,024–2,000 | Simple multi-step problems | +2–4s |
| 4,000–8,000 | Medium complexity reasoning | +5–10s |
| 10,000–20,000 | Hard reasoning tasks | +10–25s |
| 30,000–100,000 | Very hard problems (AIME, complex research) | +30–120s |
Claude won't always use the full budget — it stops thinking when it has a confident answer. Setting a high budget doesn't guarantee high thinking token usage, but it does set an upper cost ceiling.
Practical guidance: Start with 8,000 for most reasoning tasks. Bump to 16,000 if results are unsatisfactory. Only go beyond 20,000 for genuinely hard single-answer problems where accuracy is paramount.
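The guidance above can be encoded as a simple starting-point lookup. A sketch with illustrative tier names (the tiers are this guide's heuristic, not an API concept):

```python
# Starting thinking budgets by rough task difficulty, following the table above.
# Tier names are illustrative only; tune thresholds for your own workload.
BUDGET_TIERS = {
    "simple": 2_000,      # simple multi-step problems
    "medium": 8_000,      # medium-complexity reasoning
    "hard": 16_000,       # hard reasoning tasks
    "very_hard": 32_000,  # AIME-level math, complex research
}

def starting_budget(difficulty: str) -> int:
    # The API enforces a hard minimum of 1,024 thinking tokens
    return max(BUDGET_TIERS[difficulty], 1_024)

thinking = {"type": "enabled", "budget_tokens": starting_budget("medium")}
```

If results at a tier are unsatisfactory, step up one tier rather than jumping straight to the maximum: the budget is a cost ceiling, and higher tiers buy latency as well as tokens.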
Interleaved Thinking (Advanced)
Some models support interleaved thinking — where thinking blocks and text blocks alternate in a multi-turn conversation. This allows Claude to reason about each turn in a dialogue rather than only at the start.
With interleaved thinking enabled:
- Claude can reconsider its reasoning mid-conversation
- Multi-turn agentic tasks benefit most (tool calls, iterative refinement)
- The thinking blocks from prior turns are included in subsequent turns' context
This is most useful for agentic applications where Claude is executing multi-step plans with external tool calls between turns.
Debugging Extended Thinking
The thinking block is a goldmine for debugging why Claude produced a wrong answer:
```python
for block in response.content:
    if block.type == "thinking":
        thinking_text = block.thinking
        # Log to your observability platform
        logger.debug(f"Claude thinking: {thinking_text}")
        # Check if Claude expressed doubt
        if "I'm not sure" in thinking_text or "actually" in thinking_text:
            logger.warning("Claude expressed uncertainty — consider higher budget")
```
When Claude's final answer is wrong, reading its thinking block usually reveals exactly where the reasoning broke down — a wrong assumption, a skipped step, or a miscalculation. This feedback loop is useful when tuning your prompts.
Comparison: Extended Thinking vs Few-Shot Prompting
A common question: is extended thinking better than chain-of-thought few-shot prompting?
| Approach | Pros | Cons |
|---|---|---|
| Extended thinking | No prompt engineering; works on novel problems; thinking is separable | More expensive; higher latency; budget management |
| Few-shot CoT | Cheaper (no thinking tokens); faster; controllable format | Requires examples; brittle on distribution shift; examples consume input context |
For structured problem types where you have good examples (accounting reconciliation, code review rubrics), few-shot CoT is often more cost-effective. For open-ended reasoning where the problem type varies, extended thinking is more reliable.
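One way to frame the cost side of that comparison: few-shot examples bill at the input rate while thinking bills at the output rate, so a thinking budget is equivalent in cost to a much larger volume of example tokens. A sketch at claude-3-7-sonnet rates (the 5:1 output/input ratio is the assumption here; recompute for your model):

```python
# How many input tokens of few-shot examples cost the same as a given
# thinking budget? Rates are claude-3-7-sonnet: $3 input / $15 output per MTok.
INPUT_RATE, OUTPUT_RATE = 3.00, 15.00  # USD per million tokens

def breakeven_example_tokens(thinking_tokens: int) -> int:
    """Few-shot example tokens whose input cost equals the thinking-token cost."""
    return int(thinking_tokens * OUTPUT_RATE / INPUT_RATE)

print(breakeven_example_tokens(8_000))  # 40000
```

In other words, an 8,000-token thinking budget costs as much as roughly 40,000 tokens of examples per request, which is why few-shot CoT usually wins on cost when good examples exist.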
Adaptive Thinking: The New Default (Opus 4.6+)
For the latest Anthropic models, the preferred approach is adaptive thinking rather than manual budget allocation. The key addition is the effort level in output_config:
```python
# Adaptive thinking — recommended for claude-opus-4-6 and claude-sonnet-4-6
response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=16000,
    thinking={"type": "adaptive"},
    output_config={"effort": "high"},  # "low" | "medium" | "high" (default) | "max" (Opus 4.6 only)
    messages=[{"role": "user", "content": "Your complex problem here"}]
)
```
`effort` controls how aggressively Claude applies reasoning:
- `"low"` — Claude often skips thinking for simple queries; lowest cost and latency
- `"medium"` — Balanced; Claude thinks when it judges the task warrants it
- `"high"` (default) — Claude almost always reasons; best for production reasoning tasks
- `"max"` — Opus 4.6 only; maximum reasoning effort for the hardest problems
With adaptive thinking, interleaved thinking (reasoning between tool calls) is automatic — no beta header required on Opus 4.6/Sonnet 4.6.
⚠️ Claude 4+ billing gotcha: On Claude 4 models, you are billed for the full thinking tokens generated internally — but the API only returns a summary of the thinking. The visible thinking field is shorter than what was actually computed. Budget with this in mind: if you set budget_tokens: 10000, expect to be billed for close to 10K output tokens even though the returned thinking summary may be 500 tokens.
(Claude 3.7 Sonnet does not have this behavior — it returns full thinking text, so billed tokens match what you see.)
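To see the gap in practice, compare the billed `usage.output_tokens` (which includes the full internal thinking) against what the response actually contains. A rough estimator, assuming ~4 characters per token (a crude heuristic, not a tokenizer):

```python
def billed_vs_visible(usage_output_tokens: int, visible_text: str) -> dict:
    """Rough comparison of billed output tokens vs returned content.

    Uses a crude ~4 chars/token estimate for the visible side; on Claude 4
    models the billed number will typically be much larger than the estimate.
    """
    visible_est = len(visible_text) // 4
    return {
        "billed_output_tokens": usage_output_tokens,
        "visible_tokens_est": visible_est,
        "hidden_tokens_est": max(usage_output_tokens - visible_est, 0),
    }

# e.g. 10,500 billed output tokens but only a short thinking summary returned
report = billed_vs_visible(10_500, "short thinking summary " * 100)
```

Logging this per request makes the billing behavior visible in your metrics instead of surfacing only on the invoice.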
When to stick with `budget_tokens`:
- You're using `claude-3-7-sonnet`, `claude-opus-4`, or `claude-opus-4-1` (not the .6 variants)
- You need a hard cost ceiling per request
- You want predictable latency guarantees
When to use adaptive thinking:
- You're on Opus 4.6 or Sonnet 4.6 (where `budget_tokens` is deprecated)
- Task complexity varies widely across requests
- You want automatic interleaved thinking in agentic workflows
When to Use Which Model
claude-haiku-4-5 + extended thinking ($1/$5 per MTok): Cheapest entry point for reasoning tasks. Extended thinking only (no adaptive). Right for high-volume workloads where cost is critical and tasks benefit from structured reasoning.
claude-sonnet-4-5 / claude-3-7-sonnet + extended thinking ($3/$15 per MTok): The production sweet spot. Capable reasoning at an affordable output rate. Good for code review pipelines, analysis tasks, and mixed-complexity workloads.
claude-opus-4-5 + extended thinking ($5/$25 per MTok): Strong capability for hard problems — noticeably cheaper than the original Opus 4/4.1 ($15/$75) while delivering comparable reasoning quality. Use for architecture decisions, complex debugging, research synthesis.
claude-opus-4-6 + adaptive thinking ($5/$25 per MTok): The latest model with adaptive thinking + effort levels. Automatic interleaved thinking in agentic workflows. Use effort: "max" for the hardest problems. Best for new projects where you want the most capable, modern API.
Production Integration Patterns
Pattern 1: Selective thinking activation
Don't use extended thinking for every request. Classify the query first:
```python
def should_use_thinking(query: str) -> bool:
    """Use simple heuristics to decide if a query benefits from thinking."""
    complexity_signals = [
        "step by step", "optimize", "algorithm", "prove", "explain why",
        "compare", "trade-off", "architecture", "debug", "analyze"
    ]
    return any(signal in query.lower() for signal in complexity_signals)

# In your API handler:
thinking_config = (
    {"type": "enabled", "budget_tokens": 8000}
    if should_use_thinking(user_query)
    else {"type": "disabled"}
)
```
Pattern 2: Thinking token caching
For repeated similar problems (e.g., a code review pipeline), consider whether you can structure prompts to take advantage of prompt caching — though thinking blocks themselves are not cacheable.
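A sketch of that structure, assuming a stable review rubric in the system prompt marked with a `cache_control` breakpoint (the standard prompt-caching mechanism) so that only the diff varies per request:

```python
def build_review_request(rubric: str, diff: str) -> dict:
    """Request kwargs for client.messages.create(**kwargs).

    The rubric is marked cacheable so repeated reviews reuse the prompt
    cache; the thinking happens fresh on every call (thinking is not cached).
    """
    return {
        "model": "claude-opus-4-5",
        "max_tokens": 16_000,
        "thinking": {"type": "enabled", "budget_tokens": 8_000},
        "system": [
            {"type": "text", "text": rubric,
             "cache_control": {"type": "ephemeral"}},
        ],
        "messages": [{"role": "user", "content": diff}],
    }
```

Cached input tokens are billed at a reduced rate, so for a pipeline with a long shared rubric this offsets part of the thinking overhead.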
Pattern 3: Expose thinking as a "show your work" feature
Some applications benefit from surfacing thinking to users. A math tutoring app, a legal research tool, or a code review system might show Claude's reasoning as part of the value proposition:
```python
import streamlit as st

for block in response.content:
    if block.type == "thinking":
        st.expander("💭 Claude's reasoning process").write(block.thinking)
    elif block.type == "text":
        st.write(block.text)
```
Track Claude API pricing, uptime, and updates at APIScout.
Related: LangChain vs CrewAI vs OpenAI Agents SDK 2026 · LLM API Pricing Comparison 2026