LLM API Pricing 2026: GPT-5 vs Claude vs Gemini

By the APIScout Team
Tags: llm-apis, openai, anthropic, google-gemini, pricing, ai, 2026

The LLM Pricing Landscape Has Shifted Dramatically

Flagship LLM API prices have fallen 40–60% per generation year-over-year. Models that cost $30/MTok in 2024 cost $5/MTok today. Ultra-budget entrants like Mistral Nemo ($0.02/$0.04 per MTok) and Gemini 2.0 Flash-Lite ($0.075/$0.30) have compressed the bottom of the market to near-zero. At the top, GPT-5.2 Pro charges $21/$168 per MTok, more than 1,000x the input price of the cheapest commercial option.

Choosing the wrong model for your workload is now a significant cost risk. Build a high-volume production pipeline on GPT-5.2 Pro instead of DeepSeek V3.2 and you could overspend by an order of magnitude. Choose Gemini 2.0 Flash-Lite for a task that requires frontier reasoning and your quality metrics will fall apart.

This guide gives you the complete pricing picture for March 2026: every major model, every discount mechanism, and practical decision frameworks for real production workloads.

TL;DR

  • Cheapest commercial option: Mistral Nemo at $0.02/$0.04 per MTok
  • Best budget frontier model: DeepSeek V3.2 at $0.28/$0.42 — 90% prompt cache discount drops effective input cost to $0.028/MTok
  • Best value flagship: Gemini 2.5 Pro at $1.25/$10 with 1M context
  • Best for long-context tasks (flat-rate 1M): Claude Opus 4.6 or Sonnet 4.6 — no surcharge above 200K tokens
  • Most expensive option: GPT-5.2 Pro at $21/$168 per MTok (specialized use cases only)
  • Batch + caching combined: up to 95% total savings on qualifying workloads

Key Takeaways

  • Prices have dropped 40–60% year-over-year. What you budgeted for in 2024 likely costs half as much today.
  • DeepSeek V3.2's prompt cache discount is exceptional. A 90% discount on cached input tokens means $0.28/MTok becomes $0.028/MTok, approaching Mistral Nemo's $0.02 list rate while delivering far more capable output on cached workloads.
  • Claude Opus 4.6, GPT-5.4, and Gemini 2.5 Pro all ship 1M+ context windows in GA. Claude's 1M is flat-rate (no surcharge); Gemini 3 Pro charges a 2x premium above 200K tokens. Context window size is now table stakes at the flagship tier.
  • Batch API discounts (50% off) are available from OpenAI, Anthropic, and Google. If your workload tolerates 24-hour async processing, this is free money.
  • Combining batch and caching can reach 95% total savings. A $25/MTok output model becomes effectively $1.25/MTok.
  • Mistral Nemo at $0.02/$0.04 is the cheapest commercial API for classification, extraction, and structured output tasks that do not need frontier capability.
  • GPT-5.2 Pro and O3 Pro are in a category of their own. At $21/$168 and $150/MTok input respectively, they serve specialized enterprise and research workloads where cost is secondary to capability.

Full Pricing Comparison Table

Prices are per million tokens (MTok). All prices reflect standard list pricing as of March 2026.

| Model | Provider | Input ($/MTok) | Output ($/MTok) | Context Window | Notes |
|---|---|---|---|---|---|
| GPT-5.4 | OpenAI | $2.50 | $15.00 | 1.05M | General flagship; input cost doubles above 272K ctx |
| GPT-5.4 Pro | OpenAI | $30.00 | $180.00 | 1.05M | Enterprise deep reasoning |
| GPT-5.3 Codex | OpenAI | $3.00 | $15.00 | 128K | Code-optimized |
| GPT-5.2 Pro | OpenAI | $21.00 | $168.00 | 128K | Advanced reasoning |
| O3 Pro | OpenAI | $150.00 | n/a | 128K | Input pricing only published |
| Claude Opus 4.6 | Anthropic | $5.00 | $25.00 | 1M | Flat rate; GA March 13, 2026 |
| Claude Sonnet 4.6 | Anthropic | $3.00 | $15.00 | 1M | Best value Anthropic |
| Claude Haiku 3.5 | Anthropic | $0.25 | $1.25 | 200K | High-volume, low-latency |
| Gemini 3 Pro Preview | Google | $2.00/$4.00 | $12.00/$18.00 | 1M | 2x input premium >200K ctx |
| Gemini 2.5 Pro | Google | $1.25 | $10.00 | 1M | Best value with long context |
| Gemini 2.0 Flash | Google | $0.30 | $2.50 | 1M | Fast, cost-efficient |
| Gemini 2.0 Flash-Lite | Google | $0.075 | $0.30 | 1M | Ultra-budget |
| DeepSeek V3.2 | DeepSeek | $0.28 | $0.42 | 128K | 90% cache discount |
| DeepSeek R1 | DeepSeek | $0.55 | $2.19 | 128K | Reasoning model |
| Grok 3 | xAI | $3.00 | $15.00 | 131K | Real-time web access |
| Mistral Nemo | Mistral | $0.02 | $0.04 | 128K | Cheapest commercial API |
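As a quick sanity check on any of these rates, per-request cost is just tokens times price over a million. A minimal sketch in Python, using a handful of the list prices above (the model keys are illustrative shorthand, not official API identifiers):

```python
# List prices from the table above, as (input $/MTok, output $/MTok).
PRICES = {
    "gpt-5.4": (2.50, 15.00),
    "claude-opus-4.6": (5.00, 25.00),
    "gemini-2.5-pro": (1.25, 10.00),
    "deepseek-v3.2": (0.28, 0.42),
    "mistral-nemo": (0.02, 0.04),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the list-price cost in USD for one request."""
    in_rate, out_rate = PRICES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# A 10K-token document plus a 1K-token summary on Claude Opus 4.6:
print(round(request_cost("claude-opus-4.6", 10_000, 1_000), 4))  # 0.075
```

Running the same call across several entries in `PRICES` is the fastest way to see how model choice dominates every other cost variable.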

Context Window Comparison

Context window size determines how much text, code, or data a model can process in a single request. This directly affects what use cases are feasible without chunking.

| Context Window | Models |
|---|---|
| 1M+ tokens | GPT-5.4 (1.05M), Claude Opus 4.6 (1M), Claude Sonnet 4.6 (1M), Gemini 3 Pro Preview (1M), Gemini 2.5 Pro (1M), Gemini 2.0 Flash (1M), Gemini 2.0 Flash-Lite (1M) |
| 200K tokens | Claude Haiku 3.5 |
| 128–131K tokens | GPT-5.3 Codex, GPT-5.2 Pro, O3 Pro, DeepSeek V3.2, DeepSeek R1, Grok 3, Mistral Nemo |

What a 1M context window enables in practice:

  • Entire codebases (medium-sized projects) in a single request
  • Full legal contracts, financial filings, or research papers without chunking
  • Long conversation history without summarization hacks
  • Multi-document analysis without retrieval pipelines

1M context is now standard among flagship models. The key differentiator is pricing above 200K tokens: Claude charges the same flat rate across the full 1M window, while Gemini 3 Pro charges 2x input / 1.5x output above 200K tokens, which is meaningful for very long-context workloads. GPT-5.4 input costs likewise double beyond 272K tokens per session.

The practical winner for sustained long-context work: Claude. Flat pricing across 1M tokens makes cost predictable at any context depth.

Cost Optimization: Batch API and Prompt Caching

The headline prices above are not what most production teams actually pay. Two discount mechanisms can dramatically reduce effective costs.

Batch API: 50% Discount for Async Workloads

OpenAI, Anthropic, and Google all offer batch processing APIs that accept jobs and return results within 24 hours. The tradeoff: no real-time responses. The reward: 50% off standard pricing.

Batch API math:

  • Claude Opus 4.6 output at list price: $25.00/MTok
  • Claude Opus 4.6 output with batch API: $12.50/MTok
  • Gemini 2.5 Pro input with batch API: $0.625/MTok
  • GPT-5.4 input with batch API: $1.25/MTok
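The batch figures above all follow from a single 50%-off multiplier:

```python
BATCH_DISCOUNT = 0.50  # 50% off list price, per the providers' batch APIs

def batch_price(list_price_per_mtok: float) -> float:
    """Effective $/MTok after the 50% batch discount."""
    return list_price_per_mtok * (1 - BATCH_DISCOUNT)

print(batch_price(25.00))  # 12.5   (Claude Opus 4.6 output)
print(batch_price(1.25))   # 0.625  (Gemini 2.5 Pro input)
print(batch_price(2.50))   # 1.25   (GPT-5.4 input)
```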

Batch API is the right choice for: data labeling pipelines, document summarization at scale, overnight report generation, content moderation queues, and any workload that does not require sub-second latency.

Prompt Caching: Up to 90% Savings on Repeated Context

Prompt caching lets you reuse a cached version of a long system prompt, document, or tool definition across many requests. Only new tokens are billed at full price. Cached tokens are billed at a steep discount.

| Provider | Cache Read Discount |
|---|---|
| Anthropic | 90% off input price |
| OpenAI | ~50% off input price |
| Google | ~75% off input price |
| DeepSeek | 90% off input price |

DeepSeek V3.2 caching math:

DeepSeek V3.2 input at list price: $0.28/MTok. With the 90% cache discount: $0.028/MTok for cached input, close to Mistral Nemo's already-minimal $0.02 list rate while delivering frontier-quality output. For workloads with heavy context reuse (RAG pipelines, agent loops with fixed tool definitions, document Q&A), DeepSeek V3.2 is the most cost-efficient frontier model available.
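The effective input price under caching depends on your cache hit rate, not just the discount. A sketch of the blended-rate arithmetic (it ignores cache-write costs, which providers typically bill at or near full input price):

```python
def blended_input_price(list_price: float, cache_discount: float,
                        hit_rate: float) -> float:
    """Effective input $/MTok when a fraction `hit_rate` of input tokens
    are served as cache reads at (1 - cache_discount) * list price."""
    cache_read_price = list_price * (1 - cache_discount)
    return hit_rate * cache_read_price + (1 - hit_rate) * list_price

# DeepSeek V3.2: $0.28 list input, 90% cache-read discount
print(round(blended_input_price(0.28, 0.90, 1.0), 4))  # 0.028  (fully cached)
print(round(blended_input_price(0.28, 0.90, 0.8), 4))  # 0.0784 (80% hit rate)
```

Even at a realistic 80% hit rate the blended price stays well under every flagship's list rate.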

Anthropic caching math:

Claude Opus 4.6 at $5.00/MTok input. A RAG pipeline sending a 50K-token knowledge base with every request:

  • Without caching: 50K tokens × $5.00/MTok = $0.25 per request
  • With caching (90% discount): 50K tokens × $0.50/MTok = $0.025 per request
  • Savings: 90% per cached token

This can fully invert cost comparisons. Claude Opus 4.6 with aggressive caching can be cheaper than Gemini 2.5 Pro at list price for the right workload.

Combined Savings: Batch + Caching

When batch API and prompt caching stack, savings compound:

  • Claude Opus 4.6 list output: $25.00/MTok
  • After batch API (50% off): $12.50/MTok
  • After cache read on input (90% off): effective blended cost drops significantly
  • Real-world combined savings: up to 95% on qualifying workloads

A workload that looks like $25,000/month at list price could cost $1,250/month with both mechanisms applied. That changes business cases fundamentally.
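The stacking arithmetic can be sketched as follows. It assumes the batch and cache discounts simply multiply; verify the stacking rules with each provider, since they vary by model tier:

```python
def effective_price(list_price: float, batch: bool = False,
                    cache_discount: float = 0.0, hit_rate: float = 0.0) -> float:
    """Effective $/MTok with the 50% batch discount and cache-read
    discount stacked multiplicatively (an assumption -- verify per provider)."""
    price = list_price * (0.5 if batch else 1.0)
    cache_read = price * (1 - cache_discount)
    return hit_rate * cache_read + (1 - hit_rate) * price

# Claude Opus 4.6 input at $5.00/MTok, batched, fully cached at 90% off:
print(round(effective_price(5.00, batch=True, cache_discount=0.90,
                            hit_rate=1.0), 2))  # 0.25 -- 95% below list
```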

Real Dollar Examples

Abstract per-million-token pricing can be hard to reason about. Here are concrete examples.

Generating 1M output tokens (roughly 750,000 words, several novels' worth of text):

| Model | Cost |
|---|---|
| Mistral Nemo | $0.04 |
| Gemini 2.0 Flash-Lite | $0.30 |
| DeepSeek V3.2 | $0.42 |
| Claude Haiku 3.5 | $1.25 |
| Gemini 2.0 Flash | $2.50 |
| Gemini 2.5 Pro | $10.00 |
| Claude Sonnet 4.6 | $15.00 |
| GPT-5.4 | $15.00 |
| Claude Opus 4.6 | $25.00 |
| GPT-5.2 Pro | $168.00 |

The spread from cheapest to most expensive: 4,200x. Mistral Nemo costs $0.04 to generate a million output tokens. GPT-5.2 Pro costs $168.00 for the same volume. Model selection is the single largest cost lever in any LLM-powered application.

Processing a 10,000-token document (roughly 30-40 pages):

| Model | Input Cost |
|---|---|
| Mistral Nemo | $0.0002 |
| Gemini 2.0 Flash-Lite | $0.00075 |
| Claude Haiku 3.5 | $0.0025 |
| DeepSeek V3.2 | $0.0028 |
| Gemini 2.5 Pro | $0.0125 |
| GPT-5.4 | $0.025 |
| Claude Opus 4.6 | $0.05 |
| GPT-5.2 Pro | $0.21 |

At this document scale, absolute costs are small for any model. The economics only become significant at volume: processing 100,000 such documents per month puts Claude Opus 4.6 at $5,000/month vs $75/month for Gemini 2.0 Flash-Lite.
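The volume math is simple enough to script. A sketch:

```python
def monthly_input_cost(doc_tokens: int, docs_per_month: int,
                       input_price_per_mtok: float) -> float:
    """Monthly input spend in USD for a fixed-size document pipeline."""
    return doc_tokens * docs_per_month * input_price_per_mtok / 1_000_000

# 10K-token documents, 100,000 per month:
print(round(monthly_input_cost(10_000, 100_000, 5.00), 2))   # 5000.0 (Claude Opus 4.6)
print(round(monthly_input_cost(10_000, 100_000, 0.075), 2))  # 75.0   (Gemini 2.0 Flash-Lite)
```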

Use Case Recommendations

Budget-Tight: Maximum Volume at Minimum Cost

Best options: Mistral Nemo, Gemini 2.0 Flash-Lite, DeepSeek V3.2

Use Mistral Nemo ($0.02/$0.04) for: text classification, entity extraction, structured output generation, content moderation, and any task where a capable-but-not-frontier model works.

Use Gemini 2.0 Flash-Lite ($0.075/$0.30) when you need a well-supported platform with Google's ecosystem and want sub-dollar costs at high volume.

Use DeepSeek V3.2 ($0.28/$0.42, $0.028 cached) when you need frontier-quality output at budget prices. For complex reasoning, DeepSeek R1 at $0.55/$2.19 remains competitive with GPT-5.4 while costing a fraction of the price (roughly a fifth on input, a seventh on output).

Explore DeepSeek's API pricing and capabilities and see how it compares in our DeepSeek vs OpenAI API comparison.

Performance-First: Best Quality Regardless of Cost

Best options: Claude Opus 4.6, GPT-5.2 Pro, O3 Pro

Claude Opus 4.6 ($5/$25) is the correct default for performance-first teams. It leads SWE-bench Verified benchmarks, offers 1M context, and has a well-documented API with strong tooling. It is the model to beat on complex reasoning, agentic coding, and nuanced document analysis.

GPT-5.2 Pro ($21/$168) and O3 Pro ($150/MTok input) are justified for specialized enterprise scenarios: high-stakes financial modeling, medical reasoning, research synthesis, or cases where even marginal quality improvements translate to significant downstream value. Most teams will not need them.

Explore Anthropic's API and OpenAI's API for full capability documentation.

Long-Context Workloads

Best options: Claude Opus 4.6, Claude Sonnet 4.6, Gemini 2.5 Pro

If your workload requires processing more than 100K tokens in a single request, you have three strong options. The key differentiator is not context window size (all flagships now hit 1M+), but how they price extended context:

  • Claude Opus 4.6 ($5/$25, 1M context, flat rate): Best for long-form reasoning, multi-document synthesis, codebase analysis. No surcharge at any context depth.
  • Claude Sonnet 4.6 ($3/$15, 1M context, flat rate): Best value for long-context tasks — same flat-rate 1M window at a lower price point.
  • Gemini 2.5 Pro ($1.25/$10, 1M context): Most cost-efficient for workloads under 200K tokens; charges a 2x input premium above that threshold.
  • GPT-5.4 ($2.50/$15, 1.05M context): Input costs double beyond 272K tokens per session — plan accordingly for sustained long-context workloads.
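To see how tiered pricing plays out, here is a sketch comparing Gemini 3 Pro's surcharged input billing against Claude's flat rate. It assumes the 2x premium applies to the whole request once it crosses 200K tokens (mirroring Google's earlier tiered-pricing behavior; verify against the current rate card):

```python
def gemini_3_pro_input_cost(tokens: int) -> float:
    """Gemini 3 Pro input: $2.00/MTok up to 200K, $4.00/MTok (2x premium)
    once a request exceeds 200K tokens. Whole-request tiering is an
    assumption -- check whether the premium is marginal or whole-request."""
    rate = 4.00 if tokens > 200_000 else 2.00
    return tokens * rate / 1_000_000

def claude_opus_input_cost(tokens: int) -> float:
    """Claude Opus 4.6: flat $5.00/MTok across the full 1M window."""
    return tokens * 5.00 / 1_000_000

# A 500K-token request:
print(gemini_3_pro_input_cost(500_000))  # 2.0
print(claude_opus_input_cost(500_000))   # 2.5
```

Note that even with the premium, the surcharged model can remain cheaper on raw input cost; Claude's advantage is predictability, a single rate at any context depth.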

Explore Google Gemini's API for full context window and pricing details. See also our Anthropic vs Google Gemini comparison.

Balanced: Cost-Efficient Frontier Quality

Best options: Gemini 2.5 Pro, Claude Sonnet 4.6, Grok 3

For teams that want strong performance without paying flagship premiums:

  • Gemini 2.5 Pro at $1.25/$10 is the most underrated model in this comparison. Frontier-quality output, 1M context, and a price point well below GPT-5.4 or Claude Opus 4.6.
  • Claude Sonnet 4.6 at $3/$15 is Anthropic's most popular model for good reason: it sits at the sweet spot between Haiku's cost-efficiency and Opus's reasoning depth.
  • Grok 3 at $3/$15 adds real-time web access, useful for applications requiring current information without separate retrieval pipelines.

When NOT to Use Premium Models

Premium models are often the wrong choice. Here are the specific cases to avoid them.

Do not use GPT-5.2 Pro or O3 Pro for:

  • High-volume text classification or entity extraction
  • Standard chatbot conversations
  • Summarization of documents where quality is "good enough" at lower tiers
  • Any workload where GPT-5.4 or Claude Sonnet 4.6 produces acceptable output

The quality gap between GPT-5.4 ($2.50/$15) and GPT-5.2 Pro ($21/$168) does not justify an 8–11x price increase for most applications. Run quality evaluations on your specific task with a tier-down model before committing to premium pricing.

Do not use Claude Opus 4.6 for:

  • Tasks that Sonnet 4.6 handles equally well (check with evals)
  • High-throughput, latency-sensitive production endpoints where Haiku 3.5 suffices
  • Simple extraction or formatting tasks

Rule of thumb: Always eval down. Start with the cheapest model that might work. Move up only when quality metrics fail. Most teams skip this step and overspend significantly.
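The eval-down loop is straightforward to operationalize. A sketch, where `call_model` and `passes_eval` are hypothetical stand-ins for your API client and task-specific eval harness:

```python
# Cheapest-first tiers; escalate only when the quality check fails.
TIERS = ["mistral-nemo", "claude-sonnet-4.6", "claude-opus-4.6"]

def cheapest_passing(prompt: str, call_model, passes_eval):
    """Return (model, output) from the cheapest tier whose output passes."""
    output = None
    for model in TIERS:
        output = call_model(model, prompt)
        if passes_eval(prompt, output):
            return model, output
    # Nothing passed: keep the top tier's answer and flag it for review.
    return TIERS[-1], output
```

In production you would run this routing logic offline during evals, then pin each task type to the cheapest tier that passed, rather than paying for every tier on every live request.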

See our API cost optimization strategies guide for a full framework on tiered model selection and eval-driven cost reduction.

Methodology and Sources

Pricing data in this article reflects publicly published list prices from each provider as of March 2026. All prices are per million tokens (MTok) in USD. Input and output token prices are listed separately where providers distinguish between them.

Sources:

  • OpenAI pricing page (platform.openai.com/pricing): GPT-5.4, GPT-5.3 Codex, GPT-5.2 Pro, O3 Pro
  • Anthropic pricing page (anthropic.com/pricing): Claude Opus 4.6, Claude Sonnet 4.6, Claude Haiku 3.5
  • Google AI pricing (ai.google.dev/pricing): Gemini 2.5 Pro, Gemini 2.0 Flash, Gemini 2.0 Flash-Lite
  • DeepSeek pricing (platform.deepseek.com/api-docs/pricing): DeepSeek V3.2, DeepSeek R1
  • xAI pricing (docs.x.ai/docs/models): Grok 3
  • Mistral pricing (mistral.ai/pricing): Mistral Nemo

Methodology notes:

  • Batch API discounts are stated as 50% off list price; actual discounts may vary by provider and model tier — always verify at time of purchase.
  • Prompt cache discount percentages are based on published cache read rates; write costs (to populate the cache) are typically billed at full input price.
  • "Combined savings" estimates assume ideal conditions: high cache hit rate and batch-eligible workload patterns. Real-world savings will vary.
  • Context window sizes reflect the maximum published context; performance on tasks filling the full window may degrade — test your specific use case.
  • Prices change frequently. Verify current rates via each provider's pricing page or the APIScout pricing tracker before making budget decisions.

We update this comparison quarterly. Last updated: March 16, 2026.
