Speech-to-Text APIs 2026: Whisper vs Deepgram vs AssemblyAI
The Speech API Market Has Stratified
In 2023, OpenAI Whisper disrupted the speech recognition market by open-sourcing a model that matched commercial APIs at zero cost for self-hosting. In 2025–2026, the market has responded: Deepgram, AssemblyAI, and Google have released specialized models that beat Whisper on specific tasks — real-time streaming, speaker diarization, domain-specific accuracy — while remaining cost-competitive at scale.
The choice now depends heavily on your use case: batch transcription, real-time streaming, feature richness, or cost minimization.
TL;DR
Deepgram Nova-3 leads on batch accuracy (5.26% WER) and real-time streaming latency at $0.0077/min. AssemblyAI wins on features AND is surprisingly cheap for streaming at $0.15/hr (Universal-Streaming) — cheaper than Deepgram for real-time workloads. OpenAI's new gpt-4o-mini-transcribe ($0.003/min, launched March 2025) is the budget batch pick — half the price of whisper-1 with better accuracy and streaming support. Google Chirp 3 (also March 2025) leads on multilingual accuracy with a built-in denoiser.
Key Takeaways
- Deepgram Nova-3: $0.0077/min; 5.26% WER (batch); best real-time streaming latency; Nova-3 Medical at 3.44% WER; GA February 2025
- AssemblyAI: $0.37/hr pre-recorded; $0.15/hr streaming (cheaper than Deepgram for real-time); LeMUR for LLM analysis; Slam-1 launched October 2025
- OpenAI: whisper-1 $0.006/min (batch only); gpt-4o-mini-transcribe $0.003/min with streaming (launched March 2025, now recommended)
- Google Chirp 3: $0.016/min standard; $0.004/min Dynamic Batch (75% off); built-in denoiser; native diarization; launched March 2025
- WER benchmarks: Deepgram Nova-3 at 5.26%, AssemblyAI Universal-2 at 5.9%, gpt-4o-transcribe ~5-7%; Google Chirp 3 multilingual is best
- Self-hosting: Whisper large-v3-turbo is the practical choice — 6x faster than large-v3 with <2% accuracy loss
Pricing Comparison
| Provider | Model | Price per Minute | Free Tier | Streaming |
|---|---|---|---|---|
| OpenAI | whisper-1 (legacy) | $0.006/min | None | ❌ |
| OpenAI | gpt-4o-mini-transcribe ⭐ | $0.003/min | None | ✅ |
| OpenAI | gpt-4o-transcribe | $0.006/min | None | ✅ |
| Deepgram | Nova-3 | $0.0077/min | $200 credit | ✅ |
| AssemblyAI | Universal (pre-recorded) | ~$0.0062/min ($0.37/hr) | $50 credit | ✅ |
| AssemblyAI | Universal-Streaming ⭐ | $0.0025/min ($0.15/hr) | $50 credit | ✅ |
| Google | Chirp 3 | $0.016/min | 60 min/month | ✅ |
| Google | Chirp 3 Dynamic Batch | $0.004/min | 60 min/month | ❌ |
| Whisper (self-hosted) | large-v3-turbo | Infrastructure only | — | ❌ |
Pricing Notes
OpenAI now has three transcription models (as of March 2025):
- whisper-1: $0.006/min, batch only, still works but no longer recommended
- gpt-4o-transcribe: $0.006/min, same price as whisper-1 but ~35% better accuracy + streaming
- gpt-4o-mini-transcribe: $0.003/min, streaming supported, now the recommended choice for most use cases
Deepgram pricing varies by plan:
- Nova-3 pay-as-you-go: $0.0077/min ($0.462/hr)
- Nova-3 Growth plan: $0.0065/min ($0.39/hr)
- Free tier: $200 in credits (~433 hours at pay-as-you-go rates)
AssemblyAI has two distinct pricing tiers worth knowing:
- Universal (pre-recorded/batch): $0.37/hr — best accuracy + features
- Universal-Streaming: $0.15/hr — cheaper than Deepgram for real-time; 6 languages (English, Spanish, French, German, Italian, Portuguese)
- Nano: $0.12/hr — lower accuracy, budget option
- LeMUR (LLM analysis): token-based on top of transcription cost
- Free tier: $50 in credits (~135 hours Universal or 333 hours streaming)
Google Speech-to-Text (Chirp 3 is now current, launched March 2025):
- Standard rate: $0.016/min (most expensive in this group)
- Dynamic Batch (async, 24-hour SLA): $0.004/min — 75% discount for non-real-time workloads
- Free tier: 60 minutes/month (permanent, not trial credit)
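To compare these rates at your own monthly volume, a small helper is enough. The per-minute rates below are copied from the pricing table above; the dictionary and function names are mine, for illustration only:

```python
# Per-minute rates from the pricing table above (pay-as-you-go, USD)
RATES_PER_MIN = {
    "gpt-4o-mini-transcribe": 0.003,
    "whisper-1": 0.006,
    "deepgram-nova-3": 0.0077,
    "assemblyai-universal": 0.37 / 60,   # quoted as $0.37/hr
    "chirp-3-dynamic-batch": 0.004,
    "chirp-3-standard": 0.016,
}

def monthly_cost(provider: str, hours: float) -> float:
    """Dollar cost for `hours` of audio per month at the listed rate."""
    return round(hours * 60 * RATES_PER_MIN[provider], 2)

for name in RATES_PER_MIN:
    print(f"{name}: ${monthly_cost(name, 100):,.2f}/mo at 100 hours")
```

At 100 hours/month this reproduces the Scenario 1 numbers later in the article ($18 for gpt-4o-mini-transcribe, $96 for Chirp 3 standard).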
OpenAI: Whisper → GPT-4o Transcribe Models
OpenAI launched gpt-4o-transcribe and gpt-4o-mini-transcribe in March 2025. whisper-1 still works, but OpenAI now recommends the GPT-4o-based models. The key improvements: ~35% lower WER than whisper-1, and both support real-time streaming via WebSocket — something whisper-1 never had.
from openai import OpenAI

client = OpenAI()

# Recommended: gpt-4o-mini-transcribe (cheaper, streaming-capable)
with open("meeting-recording.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="gpt-4o-mini-transcribe",  # $0.003/min — half the price of whisper-1
        file=audio_file,
    )
print(transcription.text)

# Word-level timestamps still require whisper-1: the gpt-4o transcribe
# models return plain json/text and do not support verbose_json
with open("meeting-recording.mp3", "rb") as audio_file:
    verbose = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="verbose_json",
        timestamp_granularities=["word", "segment"],
    )
for word in verbose.words:
    print(f"{word.start:.2f}s - {word.end:.2f}s: {word.word}")
# Real-time streaming with gpt-4o-transcribe (WebSocket)
import json
import os

import websocket  # pip install websocket-client

OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]

ws = websocket.WebSocket()
ws.connect(
    "wss://api.openai.com/v1/realtime?model=gpt-4o-transcribe",
    header={
        "Authorization": f"Bearer {OPENAI_API_KEY}",
        "OpenAI-Beta": "realtime=v1",
    },
)
# Stream base64-encoded audio chunks as JSON events and read
# transcript deltas from the responses
# (WebSocket streaming pattern similar to other real-time STT APIs)
Strengths:
- Simplest API integration in the group
- gpt-4o-mini-transcribe at $0.003/min is now the cheapest managed option
- 99+ languages supported
- Word-level timestamps available
- Open-source Whisper model available for self-hosting
- GPT-4o models now support real-time streaming (March 2025)
Weaknesses:
- 25MB file size limit per request
- No speaker diarization built-in
- No LLM-powered post-processing features (unlike AssemblyAI LeMUR)
- whisper-1 is batch-only (use gpt-4o-mini-transcribe for streaming)
When to use OpenAI transcription: Budget batch transcription (gpt-4o-mini-transcribe at $0.003/min), or teams already integrated with OpenAI who want streaming without adopting another SDK. Not the best choice if you need diarization or post-processing.
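The 25MB upload cap noted above means longer recordings must be split client-side before upload. Chunk length follows directly from the encoded bitrate; this helper (the function name and defaults are mine, not OpenAI's) computes the longest chunk that fits:

```python
def max_chunk_seconds(limit_mb: float = 25.0, bitrate_kbps: int = 128) -> float:
    """Longest audio chunk (seconds) that fits under the upload limit.

    limit_mb is interpreted as MiB; bitrate_kbps is the encoded
    bitrate of the audio file (e.g. 128 for a typical MP3).
    """
    limit_bits = limit_mb * 1024 * 1024 * 8
    return limit_bits / (bitrate_kbps * 1000)

# A 128 kbps MP3 fits ~27 minutes per request
print(max_chunk_seconds())                # 1638.4 seconds
# Halving the bitrate doubles the chunk length
print(max_chunk_seconds(bitrate_kbps=64))
```

In practice you would split slightly below this bound (and on silence boundaries, to avoid cutting words) with a tool like ffmpeg.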
Deepgram Nova-3
Deepgram built its own speech recognition architecture from scratch. Nova-3 reached General Availability in February 2025 with the strongest batch WER benchmark in this comparison (5.26%) and a standout feature: self-serve custom keyword injection — up to 100 domain-specific terms without model retraining. Nova-3 Medical launched alongside it with 3.44% WER on healthcare audio.
from deepgram import DeepgramClient, PrerecordedOptions
# Batch transcription
client = DeepgramClient(api_key="YOUR_DEEPGRAM_API_KEY")
with open("audio.mp3", "rb") as audio_file:
response = client.listen.rest.v("1").transcribe_file(
{"buffer": audio_file, "mimetype": "audio/mp3"},
PrerecordedOptions(
model="nova-3",
smart_format=True,
diarize=True, # Speaker detection
punctuate=True,
utterances=True,
)
)
# Access structured results
for utterance in response.results.utterances:
print(f"Speaker {utterance.speaker}: {utterance.transcript}")
# Real-time streaming with Deepgram
import asyncio
import json
import os

import websockets

DEEPGRAM_API_KEY = os.environ["DEEPGRAM_API_KEY"]

async def transcribe_stream(audio_source):
    uri = "wss://api.deepgram.com/v1/listen?model=nova-3&smart_format=true&diarize=true"
    headers = {"Authorization": f"Token {DEEPGRAM_API_KEY}"}
    async with websockets.connect(uri, extra_headers=headers) as ws:

        async def sender():
            # Stream audio chunks
            async for audio_chunk in audio_source():
                await ws.send(audio_chunk)
            await ws.send(json.dumps({"type": "CloseStream"}))

        async def receiver():
            # Receive transcripts in real-time
            async for message in ws:
                result = json.loads(message)
                if result.get("is_final"):
                    transcript = result["channel"]["alternatives"][0]["transcript"]
                    if transcript:
                        print(f"Final: {transcript}")

        # Send and receive concurrently; running the two loops
        # sequentially would block until the stream ends
        await asyncio.gather(sender(), receiver())
Strengths:
- Best-in-class real-time streaming latency (sub-300ms to first word)
- Strong accuracy on conversational speech, call center audio, and domain vocabulary
- Speaker diarization included
- Keyword boosting for domain-specific terms
- Language detection (identify the language automatically)
- Good filler word detection (um, uh, like)
- Reliable WebSocket streaming API
Weaknesses:
- More expensive than Whisper for simple batch transcription
- Fewer LLM-powered post-processing features than AssemblyAI
- Less multilingual strength than Google Chirp 3
When to use Deepgram: Real-time transcription (live captioning, voice assistants, call center analytics) where latency is critical. Best batch WER in this group at 5.26%. Strong for call centers using domain vocabulary (custom keyword injection without retraining). Note: for pure cost on real-time streaming, AssemblyAI Universal-Streaming at $0.15/hr is cheaper than Deepgram Nova-3 at $0.462/hr — evaluate if the latency difference matters for your use case.
AssemblyAI
AssemblyAI's pricing structure has a hidden advantage: Universal-Streaming at $0.15/hr is 3x cheaper than Deepgram Nova-3 for real-time ($0.462/hr). For teams that need streaming + features, this changes the math significantly. They also launched Slam-1 in October 2025 — a new speech-language model that goes beyond transcription toward deeper audio understanding.
import assemblyai as aai
aai.settings.api_key = "YOUR_ASSEMBLYAI_API_KEY"
transcriber = aai.Transcriber()
# Transcription with all features enabled
transcript = transcriber.transcribe(
"https://example.com/podcast-episode.mp3",
config=aai.TranscriptionConfig(
speech_model=aai.SpeechModel.universal, # Universal-2
speaker_labels=True, # Diarization
auto_chapters=True, # Chapter detection with summaries
sentiment_analysis=True,
auto_highlights=True, # Key phrases
entity_detection=True, # Named entities
iab_categories=True, # Topic classification
)
)
# Access speakers
for utterance in transcript.utterances:
print(f"Speaker {utterance.speaker}: {utterance.text}")
# Auto-chapters
for chapter in transcript.chapters:
print(f"{chapter.start/1000:.0f}s - {chapter.end/1000:.0f}s: {chapter.headline}")
# Ask questions about the transcript via LeMUR
result = transcript.lemur.task(
"What were the three main topics discussed and what conclusions were reached on each?",
final_model=aai.LemurModel.claude3_5_sonnet
)
print(result.response)
Strengths:
- Best post-processing feature set: auto chapters, sentiment analysis, entity detection, topic classification, PII redaction
- LeMUR: ask free-form questions about transcribed content using Claude or other LLMs
- Strong real-time streaming (WebSocket)
- Universal-2 model is highly accurate on diverse audio types
- Best for podcast/interview/meeting analysis workflows
Weaknesses:
- Most expensive per-minute for raw transcription
- LeMUR adds additional token cost on top of transcription
- More complex setup than Whisper
- Fewer language options than Google
When to use AssemblyAI: Meeting intelligence, podcast workflows, qualitative research, customer call analysis needing structured insights. Also the best-value real-time streaming choice at $0.15/hr. LeMUR is genuinely unique — ask "what action items came up in this meeting?" and get structured answers. Note: Universal-Streaming currently supports 6 languages only; use Universal (batch) for broader language coverage.
Google Speech-to-Text (Chirp 3)
Google launched Chirp 3 in March 2025 — the same month OpenAI launched its GPT-4o transcription models. Chirp 3 adds a built-in audio denoiser, native diarization, and automatic language detection. For non-English audio, Google's multilingual training scale (28 billion text sentences across 100+ languages) gives it an edge no specialist provider can match. The Dynamic Batch option at $0.004/min (75% off standard) makes Google cost-competitive for non-real-time workloads.
from google.cloud import speech_v2
client = speech_v2.SpeechClient()
project_id = "your-project-id"
# Batch transcription
with open("audio.wav", "rb") as f:
audio_data = f.read()
config = speech_v2.RecognitionConfig(
    auto_decoding_config=speech_v2.AutoDetectDecodingConfig(),
    language_codes=["en-US"],
    model="chirp_3",
    features=speech_v2.RecognitionFeatures(
        enable_automatic_punctuation=True,
        enable_word_time_offsets=True,
        # The v2 API configures diarization via a nested config object
        diarization_config=speech_v2.SpeakerDiarizationConfig(
            min_speaker_count=2,
            max_speaker_count=2,
        ),
    )
)
request = speech_v2.RecognizeRequest(
recognizer=f"projects/{project_id}/locations/global/recognizers/_",
config=config,
content=audio_data,
)
response = client.recognize(request=request)
for result in response.results:
print(result.alternatives[0].transcript)
Strengths:
- Best multilingual support (100+ languages with Chirp)
- Strong for code-switched audio (multiple languages in same recording)
- Streaming available via gRPC
- Deep GCP integration (Cloud Storage, BigQuery, Pub/Sub)
- Speaker diarization available
- Proven production reliability (powers Google's own products)
Weaknesses:
- Most expensive managed API in this comparison
- GCP setup required (service accounts, project configuration)
- More complex SDK/client library compared to Whisper or Deepgram
- Overkill for English-only use cases
When to use Google Chirp 3: Non-English or multilingual audio (85+ languages, best multilingual accuracy). Applications where audio quality varies and the built-in denoiser helps. GCP-native applications. Dynamic Batch at $0.004/min makes it competitive for non-real-time workloads where 24-hour turnaround is acceptable.
Accuracy Benchmarks: Word Error Rate (WER)
Lower WER = better accuracy. Approximate WER figures on common benchmarks:
| Provider | Model | Batch WER | Streaming WER | Multilingual |
|---|---|---|---|---|
| Deepgram | Nova-3 | 5.26% | 6.84% | Good (Nova-3 multilingual, 6+ languages) |
| AssemblyAI | Universal-2 | ~5.9% | ~7–9% | Good (99 languages batch; 6 languages streaming) |
| OpenAI | gpt-4o-transcribe | ~5–7% | ~6–8% | Strong (99+ languages) |
| OpenAI | gpt-4o-mini-transcribe | ~6–8% | ~7–9% | Strong (99+ languages) |
| Google | Chirp 3 | ~8–11% | ~9–12% | Best (85+ languages with denoising) |
Deepgram WER figures from Deepgram's own benchmarks (2,703 files, 81.69 hours). Independent benchmarks show smaller gaps. Always test on your own audio.
Note: WER varies significantly by audio quality, accent, domain, and benchmark. These are approximate figures from published evaluations; your domain's accuracy may differ substantially.
No single model wins on accuracy across all conditions. For your production use case, benchmark with a sample of your own audio before committing.
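WER itself is simple to compute: the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. A minimal sketch for benchmarking your own audio (production evaluations should normalize case and punctuation first, e.g. with a library such as jiwer):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count.

    Assumes a non-empty reference; normalize text before calling.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i ref words into the first j hyp words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1] / len(ref)

print(wer("the quick brown fox", "the quick brown fox"))   # 0.0
print(wer("the quick brown fox", "the quick browne fox"))  # 0.25
```

Run each provider's output through this against a hand-corrected reference for 30 to 60 minutes of your own audio and you have a far better signal than any published benchmark.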
Real-Time Streaming Comparison
| Provider | Model | Protocol | Time to First Word | Monthly Cost (1K hr) |
|---|---|---|---|---|
| Deepgram | Nova-3 | WebSocket | ~200–300ms | ~$462 |
| AssemblyAI | Universal-Streaming | WebSocket | ~300–500ms | ~$150 |
| Google | Chirp 3 | gRPC | ~400–700ms | ~$960 |
| OpenAI | gpt-4o-transcribe | WebSocket | ~300–500ms | ~$360 |
| OpenAI | whisper-1 | Batch only | N/A | ~$360 |
Deepgram has the lowest latency — meaningful for voice assistants and real-time captioning. But AssemblyAI Universal-Streaming at $0.15/hr undercuts Deepgram by 3x for teams where ~300ms additional latency is acceptable. For call center analytics where you don't need sub-300ms, AssemblyAI is the cost winner.
Self-Hosting with Whisper
For very high-volume batch transcription, self-hosting Whisper on your own GPU infrastructure can dramatically reduce costs:
from faster_whisper import WhisperModel
# large-v3-turbo: 6x faster than large-v3, <2% accuracy loss — the practical default
model = WhisperModel("large-v3-turbo", device="cuda", compute_type="float16")
# Transcribe
segments, info = model.transcribe(
"audio.mp3",
beam_size=5,
word_timestamps=True,
vad_filter=True, # Voice activity detection — skip silence
)
for segment in segments:
print(f"[{segment.start:.2f}s → {segment.end:.2f}s] {segment.text}")
faster-whisper (CTranslate2-based) runs Whisper large-v3 at 4-8x the speed of the original implementation and requires ~6GB VRAM. On a single A100 GPU at ~$2/hour, you can transcribe roughly 1,000 minutes (about 17 hours) of audio per hour of compute — about $0.002/minute vs the whisper-1 API's $0.006/minute. Factoring in fixed overhead, self-hosting typically breaks even around ~500 hours/month.
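That crossover point can be sanity-checked with a simple break-even formula. The ~$120/month fixed-overhead figure below is an assumption for illustration (idle GPU time, monitoring, maintenance), not a measured number:

```python
def breakeven_hours(api_rate_per_min: float,
                    selfhost_rate_per_min: float,
                    fixed_monthly_cost: float) -> float:
    """Monthly audio hours at which self-hosting becomes cheaper.

    fixed_monthly_cost is the always-on overhead that the managed
    API price already absorbs for you.
    """
    saving_per_min = api_rate_per_min - selfhost_rate_per_min
    return fixed_monthly_cost / (saving_per_min * 60)

# whisper-1 API at $0.006/min vs self-hosted ~$0.002/min
print(breakeven_hours(0.006, 0.002, 120.0))  # ≈ 500 hours/month
```

Below that volume the API's zero fixed cost wins; above it, the marginal GPU cost dominates and self-hosting pulls ahead.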
Cost Scenarios
Scenario 1: Podcast transcription (batch) — 100 hours/month
| Provider | Monthly Cost | Notes |
|---|---|---|
| gpt-4o-mini-transcribe | 100h × 60 × $0.003 = $18 | Best value for batch |
| Deepgram Nova-3 | 100h × 60 × $0.0077 = $46.20 | Best accuracy |
| AssemblyAI Universal | 100h × $0.37 = $37 | Best features |
| Google Chirp 3 Dynamic Batch | 100h × 60 × $0.004 = $24 | Best for multilingual |
| Google Chirp 3 Standard | 100h × 60 × $0.016 = $96 | |
| Whisper self-hosted | ~$15–30 (GPU cost) | Cheapest at volume; ops overhead |
Scenario 2: Call center — 1,000 hours/month real-time streaming
| Provider | Monthly Cost | Streaming Latency |
|---|---|---|
| AssemblyAI Universal-Streaming | 1,000 × $0.15 = $150 | ~300–500ms |
| Deepgram Nova-3 | 1,000 × $0.462 = $462 | ~200–300ms |
| OpenAI gpt-4o-transcribe | 1,000h × 60 × $0.006 = $360 | ~300–500ms |
| Google Chirp 3 | 1,000 × $0.96 = $960 | ~400–700ms |
AssemblyAI Universal-Streaming's pricing advantage is substantial at scale. For call centers where ~300ms latency is acceptable and features (diarization, summaries) matter, it's the clear winner. Deepgram is worth the premium only when sub-300ms latency is a hard requirement.
When to Choose Each
Choose OpenAI (gpt-4o-mini-transcribe) if:
- Best price for managed batch transcription ($0.003/min — cheapest managed option)
- Already integrated with OpenAI and want one SDK
- Need streaming but don't want Deepgram or AssemblyAI onboarding
- English audio, moderate volume, simplicity priority
- Note: use gpt-4o-mini-transcribe, not whisper-1 — better accuracy, same or lower cost, streaming support
Choose Deepgram if:
- Real-time streaming is required (live captions, voice assistants, call center)
- Sub-300ms latency is a hard requirement (lowest latency among managed streaming APIs; AssemblyAI streams for less per minute)
- Speaker diarization is needed at scale
- Call center, customer service, or live meeting transcription
Choose AssemblyAI if:
- You need post-processing features: summaries, chapters, sentiment, entities
- Building meeting intelligence or podcast analysis workflows
- LeMUR (ask questions about audio content) is valuable for your use case
- You want the richest feature set and can pay the premium
Choose Google Chirp 3 if:
- Non-English or multilingual audio is primary (85+ languages, best multilingual accuracy)
- Audio quality varies and the built-in denoiser helps
- You're already on GCP and want ecosystem integration
- Non-real-time workloads where Dynamic Batch pricing ($0.004/min) makes Google cost-competitive
Compare STT API pricing, accuracy, and uptime at APIScout.
Related: ElevenLabs vs Cartesia: Best Voice AI API 2026 · Best Free APIs for Developers 2026