API guide
Speech-to-Text APIs (2026)
Deepgram Nova-3: 5.26% WER, best real-time. AssemblyAI streaming cheapest at $0.15/hr. OpenAI's new gpt-4o-transcribe beats Whisper. 2026 STT API comparison.
The Speech API Market Has Stratified
In 2023, OpenAI Whisper disrupted the speech recognition market by open-sourcing a model that matched commercial APIs at zero cost for self-hosting. In 2025–2026, the market has responded: Deepgram, AssemblyAI, and Google have released specialized models that beat Whisper on specific tasks — real-time streaming, speaker diarization, domain-specific accuracy — while remaining cost-competitive at scale.
The choice now depends heavily on your use case: batch transcription, real-time streaming, feature richness, or cost minimization.
TL;DR
Deepgram Nova-3 leads on batch accuracy (5.26% WER) and real-time streaming latency at $0.0077/min. AssemblyAI wins on features AND is surprisingly cheap for streaming at $0.15/hr (Universal-Streaming) — cheaper than Deepgram for real-time workloads. OpenAI's new gpt-4o-mini-transcribe ($0.003/min, launched March 2025) is the budget batch pick — half the price of whisper-1 with better accuracy and streaming support. Google Chirp 3 (also March 2025) leads on multilingual accuracy with a built-in denoiser.
Key Takeaways
- Deepgram Nova-3: $0.0077/min; 5.26% WER (batch); best real-time streaming latency; Nova-3 Medical at 3.44% WER; GA February 2025
- AssemblyAI: $0.37/hr pre-recorded; $0.15/hr streaming (cheaper than Deepgram for real-time); LeMUR for LLM analysis; Slam-1 launched October 2025
- OpenAI: `whisper-1` $0.006/min (batch only); `gpt-4o-mini-transcribe` $0.003/min with streaming (launched March 2025, now recommended)
- Google Chirp 3: $0.016/min standard; $0.004/min Dynamic Batch (75% off); built-in denoiser; native diarization; launched March 2025
- WER benchmarks: Deepgram Nova-3 at 5.26%, AssemblyAI Universal-2 at 5.9%, gpt-4o-transcribe ~5-7%; Google Chirp 3 multilingual is best
- Self-hosting: Whisper `large-v3-turbo` is the practical choice — 6x faster than large-v3 with <2% accuracy loss
Pricing Comparison
| Provider | Model | Price per Minute | Free Tier | Streaming |
|---|---|---|---|---|
| OpenAI | whisper-1 (legacy) | $0.006/min | None | ❌ |
| OpenAI | gpt-4o-mini-transcribe ⭐ | $0.003/min | None | ✅ |
| OpenAI | gpt-4o-transcribe | $0.006/min | None | ✅ |
| Deepgram | Nova-3 | $0.0077/min | $200 credit | ✅ |
| AssemblyAI | Universal (pre-recorded) | ~$0.0062/min ($0.37/hr) | $50 credit | ✅ |
| AssemblyAI | Universal-Streaming ⭐ | $0.0025/min ($0.15/hr) | $50 credit | ✅ |
| Google | Chirp 3 | $0.016/min | 60 min/month | ✅ |
| Google | Chirp 3 Dynamic Batch | $0.004/min | 60 min/month | ❌ |
| Whisper (self-hosted) | large-v3-turbo | Infrastructure only | — | ❌ |
Pricing Notes
OpenAI now has three transcription models (as of March 2025):
- `whisper-1`: $0.006/min, batch only, still works but no longer recommended
- `gpt-4o-transcribe`: $0.006/min, same price as whisper-1 but ~35% better accuracy + streaming
- `gpt-4o-mini-transcribe`: $0.003/min, streaming supported, now the recommended choice for most use cases
Deepgram pricing varies by plan:
- Nova-3 pay-as-you-go: $0.0077/min ($0.462/hr)
- Nova-3 Growth plan: $0.0065/min ($0.39/hr)
- Free tier: $200 in credits (~433 hours at pay-as-you-go rates)
AssemblyAI has two distinct pricing tiers worth knowing:
- Universal (pre-recorded/batch): $0.37/hr — best accuracy + features
- Universal-Streaming: $0.15/hr — cheaper than Deepgram for real-time; 6 languages (English, Spanish, French, German, Italian, Portuguese)
- Nano: $0.12/hr — lower accuracy, budget option
- LeMUR (LLM analysis): token-based on top of transcription cost
- Free tier: $50 in credits (~135 hours Universal or 333 hours streaming)
Google Speech-to-Text (Chirp 3 is now current, launched March 2025):
- Standard rate: $0.016/min (most expensive in this group)
- Dynamic Batch (async, 24-hour SLA): $0.004/min — 75% discount for non-real-time workloads
- Free tier: 60 minutes/month (permanent, not trial credit)
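For budgeting, the per-minute rates above reduce to simple arithmetic. A minimal sketch using the rates quoted in this article's tables (the dictionary keys are labels for this example, not SDK model identifiers; re-check providers' pricing pages before committing):

RATES_PER_MIN = {
    "openai gpt-4o-mini-transcribe": 0.003,
    "openai gpt-4o-transcribe": 0.006,
    "deepgram nova-3": 0.0077,
    "assemblyai universal (batch)": 0.37 / 60,
    "assemblyai universal-streaming": 0.15 / 60,
    "google chirp 3 standard": 0.016,
    "google chirp 3 dynamic batch": 0.004,
}

def monthly_cost(rate_per_min: float, hours_per_month: float) -> float:
    return rate_per_min * hours_per_month * 60

hours = 100  # e.g. 100 hours of audio per month
for name, rate in sorted(RATES_PER_MIN.items(), key=lambda kv: kv[1]):
    print(f"{name:34s} ${monthly_cost(rate, hours):7.2f}/month")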
OpenAI: Whisper → GPT-4o Transcribe Models
OpenAI launched gpt-4o-transcribe and gpt-4o-mini-transcribe in March 2025. whisper-1 still works, but OpenAI now recommends the GPT-4o-based models. The key improvements: ~35% lower WER than whisper-1, and both support real-time streaming via WebSocket — something whisper-1 never had.
from openai import OpenAI

client = OpenAI()

# Recommended: gpt-4o-mini-transcribe (cheaper, streaming-capable)
with open("meeting-recording.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="gpt-4o-mini-transcribe",  # $0.003/min — half the price of whisper-1
        file=audio_file,
        response_format="verbose_json",
        timestamp_granularities=["word", "segment"]
    )

print(transcription.text)
for word in transcription.words:
    print(f"{word.start:.2f}s - {word.end:.2f}s: {word.word}")
# Real-time streaming with gpt-4o-transcribe (WebSocket)
import base64
import json
import os

import websocket  # pip install websocket-client

OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]

ws = websocket.WebSocket()
ws.connect(
    "wss://api.openai.com/v1/realtime?model=gpt-4o-transcribe",
    header={"Authorization": f"Bearer {OPENAI_API_KEY}", "OpenAI-Beta": "realtime=v1"}
)
# Stream base64-encoded audio chunks as JSON events and receive transcripts back
# (WebSocket streaming pattern similar to other real-time STT APIs)
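If you would rather not manage a WebSocket yourself, the transcriptions endpoint also accepts stream=True for the GPT-4o models and emits incremental text events. A sketch; the event type names (transcript.text.delta / transcript.text.done) are my reading of the current SDK and worth verifying against OpenAI's docs:

from openai import OpenAI

client = OpenAI()

with open("meeting-recording.mp3", "rb") as audio_file:
    stream = client.audio.transcriptions.create(
        model="gpt-4o-mini-transcribe",
        file=audio_file,
        stream=True,  # emit partial text as the file is transcribed
    )
    for event in stream:
        if event.type == "transcript.text.delta":
            print(event.delta, end="", flush=True)  # incremental text
        elif event.type == "transcript.text.done":
            print()  # full transcript available as event.text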
Strengths:
- Simplest API integration in the group
- `gpt-4o-mini-transcribe` at $0.003/min is now the cheapest managed option
- 99+ languages supported
- Word-level timestamps available
- Open-source Whisper model available for self-hosting
- GPT-4o models now support real-time streaming (March 2025)
Weaknesses:
- 25MB file size limit per request
- No speaker diarization built-in
- No LLM-powered post-processing features (unlike AssemblyAI LeMUR)
- `whisper-1` is batch-only (use `gpt-4o-mini-transcribe` for streaming)
When to use OpenAI transcription: Budget batch transcription (gpt-4o-mini-transcribe at $0.003/min), or teams already integrated with OpenAI who want streaming without adopting another SDK. Not the best choice if you need diarization or post-processing.
Deepgram Nova-3
Deepgram built its own speech recognition architecture from scratch. Nova-3 reached General Availability in February 2025 with the strongest batch WER benchmark in this comparison (5.26%) and a standout feature: self-serve custom keyword injection — up to 100 domain-specific terms without model retraining. Nova-3 Medical launched with a 3.44% WER for healthcare audio.
from deepgram import DeepgramClient, PrerecordedOptions

# Batch transcription
client = DeepgramClient(api_key="YOUR_DEEPGRAM_API_KEY")

with open("audio.mp3", "rb") as audio_file:
    response = client.listen.rest.v("1").transcribe_file(
        {"buffer": audio_file.read(), "mimetype": "audio/mp3"},
        PrerecordedOptions(
            model="nova-3",
            smart_format=True,
            diarize=True,  # Speaker detection
            punctuate=True,
            utterances=True,
        )
    )

# Access structured results
for utterance in response.results.utterances:
    print(f"Speaker {utterance.speaker}: {utterance.transcript}")
# Real-time streaming with Deepgram
import asyncio
import json
import os

import websockets  # websockets>=14 renames extra_headers to additional_headers

DEEPGRAM_API_KEY = os.environ["DEEPGRAM_API_KEY"]

async def transcribe_stream():
    uri = "wss://api.deepgram.com/v1/listen?model=nova-3&smart_format=true&diarize=true"
    headers = {"Authorization": f"Token {DEEPGRAM_API_KEY}"}
    async with websockets.connect(uri, extra_headers=headers) as ws:

        async def send_audio():
            # audio_source() is a placeholder async generator yielding raw audio chunks
            async for audio_chunk in audio_source():
                await ws.send(audio_chunk)

        async def receive_transcripts():
            # Receive transcripts in real time
            async for message in ws:
                result = json.loads(message)
                if result.get("is_final"):
                    transcript = result["channel"]["alternatives"][0]["transcript"]
                    if transcript:
                        print(f"Final: {transcript}")

        # Send and receive concurrently so transcripts arrive while audio still streams
        await asyncio.gather(send_audio(), receive_transcripts())

# asyncio.run(transcribe_stream())  # run once audio_source() is defined
Strengths:
- Best-in-class real-time streaming latency (sub-300ms to first word)
- Strong accuracy on conversational speech, call center audio, and domain vocabulary
- Speaker diarization included
- Keyword boosting for domain-specific terms
- Language detection (identify the language automatically)
- Good filler word detection (um, uh, like)
- Reliable WebSocket streaming API
Weaknesses:
- More expensive than Whisper for simple batch transcription
- Fewer LLM-powered post-processing features than AssemblyAI
- Less multilingual strength than Google Chirp 3
When to use Deepgram: Real-time transcription (live captioning, voice assistants, call center analytics) where latency is critical. Best batch WER in this group at 5.26%. Strong for call centers using domain vocabulary (custom keyword injection without retraining). Note: for pure cost on real-time streaming, AssemblyAI Universal-Streaming at $0.15/hr is cheaper than Deepgram Nova-3 at $0.462/hr — evaluate if the latency difference matters for your use case.
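The keyword injection mentioned above is just a request option. A minimal sketch, assuming Nova-3's keyterm prompting is exposed as the keyterm option in the Python SDK; the example terms are placeholders, and the exact parameter name and limits should be checked against Deepgram's docs:

from deepgram import DeepgramClient, PrerecordedOptions

client = DeepgramClient(api_key="YOUR_DEEPGRAM_API_KEY")

options = PrerecordedOptions(
    model="nova-3",
    smart_format=True,
    keyterm=["metoprolol", "HbA1c", "telehealth"],  # hypothetical domain terms, up to ~100
)

with open("clinic-call.mp3", "rb") as f:
    response = client.listen.rest.v("1").transcribe_file(
        {"buffer": f.read(), "mimetype": "audio/mp3"},
        options,
    )

print(response.results.channels[0].alternatives[0].transcript)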
AssemblyAI
AssemblyAI's pricing structure has a hidden advantage: Universal-Streaming at $0.15/hr is 3x cheaper than Deepgram Nova-3 for real-time ($0.462/hr). For teams that need streaming + features, this changes the math significantly. They also launched Slam-1 in October 2025 — a new speech-language model that goes beyond transcription toward deeper audio understanding.
import assemblyai as aai

aai.settings.api_key = "YOUR_ASSEMBLYAI_API_KEY"
transcriber = aai.Transcriber()

# Transcription with all features enabled
transcript = transcriber.transcribe(
    "https://example.com/podcast-episode.mp3",
    config=aai.TranscriptionConfig(
        speech_model=aai.SpeechModel.universal,  # Universal-2
        speaker_labels=True,       # Diarization
        auto_chapters=True,        # Chapter detection with summaries
        sentiment_analysis=True,
        auto_highlights=True,      # Key phrases
        entity_detection=True,     # Named entities
        iab_categories=True,       # Topic classification
    )
)

# Access speakers
for utterance in transcript.utterances:
    print(f"Speaker {utterance.speaker}: {utterance.text}")

# Auto-chapters
for chapter in transcript.chapters:
    print(f"{chapter.start/1000:.0f}s - {chapter.end/1000:.0f}s: {chapter.headline}")

# Ask questions about the transcript via LeMUR
result = transcript.lemur.task(
    "What were the three main topics discussed and what conclusions were reached on each?",
    final_model=aai.LemurModel.claude3_5_sonnet
)
print(result.response)
Strengths:
- Best post-processing feature set: auto chapters, sentiment analysis, entity detection, topic classification, PII redaction
- LeMUR: ask free-form questions about transcribed content using Claude or other LLMs
- Strong real-time streaming (WebSocket)
- Universal-2 model is highly accurate on diverse audio types
- Best for podcast/interview/meeting analysis workflows
Weaknesses:
- Raw batch transcription ($0.37/hr) costs roughly twice OpenAI's gpt-4o-mini-transcribe ($0.18/hr)
- LeMUR adds additional token cost on top of transcription
- More complex setup than Whisper
- Fewer language options than Google
When to use AssemblyAI: Meeting intelligence, podcast workflows, qualitative research, customer call analysis needing structured insights. Also the best-value real-time streaming choice at $0.15/hr. LeMUR is genuinely unique — ask "what action items came up in this meeting?" and get structured answers. Note: Universal-Streaming currently supports 6 languages only; use Universal (batch) for broader language coverage.
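For the Universal-Streaming tier, the Python SDK exposes a real-time transcriber. A sketch using the SDK's RealtimeTranscriber and microphone helper; if Universal-Streaming has moved to a newer streaming client in your SDK version, the callback pattern is similar but the class names may differ:

import assemblyai as aai

aai.settings.api_key = "YOUR_ASSEMBLYAI_API_KEY"

def on_data(transcript: aai.RealtimeTranscript):
    if not transcript.text:
        return
    if isinstance(transcript, aai.RealtimeFinalTranscript):
        print("Final:", transcript.text)
    else:
        print("Partial:", transcript.text)

def on_error(error: aai.RealtimeError):
    print("Streaming error:", error)

transcriber = aai.RealtimeTranscriber(
    sample_rate=16_000,
    on_data=on_data,
    on_error=on_error,
)
transcriber.connect()

# Stream from the microphone (requires the pyaudio extra); any PCM source works
microphone = aai.extras.MicrophoneStream(sample_rate=16_000)
transcriber.stream(microphone)
transcriber.close()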
Google Speech-to-Text (Chirp 3)
Google launched Chirp 3 in March 2025 — the same month OpenAI launched its GPT-4o transcription models. Chirp 3 adds a built-in audio denoiser, native diarization, and automatic language detection. For non-English audio, Google's multilingual training scale (28 billion text sentences across 100+ languages) gives it an edge no specialist provider can match. The Dynamic Batch option at $0.004/min (75% off standard) makes Google cost-competitive for non-real-time workloads.
from google.cloud import speech_v2

client = speech_v2.SpeechClient()
project_id = "your-project-id"

# Batch transcription
with open("audio.wav", "rb") as f:
    audio_data = f.read()

config = speech_v2.RecognitionConfig(
    auto_decoding_config=speech_v2.AutoDetectDecodingConfig(),
    language_codes=["en-US"],
    model="chirp_3",
    features=speech_v2.RecognitionFeatures(
        enable_automatic_punctuation=True,
        enable_word_time_offsets=True,
        # Speech-to-Text v2 configures diarization via SpeakerDiarizationConfig
        diarization_config=speech_v2.SpeakerDiarizationConfig(
            min_speaker_count=2,
            max_speaker_count=2,
        ),
    )
)

request = speech_v2.RecognizeRequest(
    recognizer=f"projects/{project_id}/locations/global/recognizers/_",
    config=config,
    content=audio_data,
)
response = client.recognize(request=request)

for result in response.results:
    print(result.alternatives[0].transcript)
Strengths:
- Best multilingual support (100+ languages with Chirp)
- Strong for code-switched audio (multiple languages in same recording)
- Streaming available via gRPC
- Deep GCP integration (Cloud Storage, BigQuery, Pub/Sub)
- Speaker diarization available
- Proven production reliability (powers Google's own products)
Weaknesses:
- Most expensive managed API in this comparison
- GCP setup required (service accounts, project configuration)
- More complex SDK/client library compared to Whisper or Deepgram
- Overkill for English-only use cases
When to use Google Chirp 3: Non-English or multilingual audio (85+ languages, best multilingual accuracy). Applications where audio quality varies and the built-in denoiser helps. GCP-native applications. Dynamic Batch at $0.004/min makes it competitive for non-real-time workloads where 24-hour turnaround is acceptable.
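The Dynamic Batch rate applies to asynchronous batch recognition rather than the synchronous recognize call shown above. A sketch, assuming the v2 batch_recognize API with the DYNAMIC_BATCHING processing strategy and Cloud Storage input/output; the bucket paths are placeholders and the strategy field should be verified against current docs:

from google.cloud import speech_v2

client = speech_v2.SpeechClient()
project_id = "your-project-id"

config = speech_v2.RecognitionConfig(
    auto_decoding_config=speech_v2.AutoDetectDecodingConfig(),
    language_codes=["en-US"],
    model="chirp_3",
)

request = speech_v2.BatchRecognizeRequest(
    recognizer=f"projects/{project_id}/locations/global/recognizers/_",
    config=config,
    files=[speech_v2.BatchRecognizeFileMetadata(uri="gs://your-bucket/audio.wav")],
    recognition_output_config=speech_v2.RecognitionOutputConfig(
        gcs_output_config=speech_v2.GcsOutputConfig(uri="gs://your-bucket/transcripts/"),
    ),
    processing_strategy=speech_v2.BatchRecognizeRequest.ProcessingStrategy.DYNAMIC_BATCHING,
)

operation = client.batch_recognize(request=request)
response = operation.result(timeout=24 * 3600)  # Dynamic Batch: results within 24 hours
print(response)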
Accuracy Benchmarks: Word Error Rate (WER)
Lower WER = better accuracy. Approximate WER figures on common benchmarks:
| Provider | Model | Batch WER | Streaming WER | Multilingual |
|---|---|---|---|---|
| Deepgram | Nova-3 | 5.26% | 6.84% | Good (Nova-3 multilingual, 6+ languages) |
| AssemblyAI | Universal-2 | ~5.9% | ~7–9% | Good (99 languages batch; 6 languages streaming) |
| OpenAI | gpt-4o-transcribe | ~5–7% | ~6–8% | Strong (99+ languages) |
| OpenAI | gpt-4o-mini-transcribe | ~6–8% | ~7–9% | Strong (99+ languages) |
| Google | Chirp 3 | ~8–11% | ~9–12% | Best (85+ languages with denoising) |
Deepgram WER figures from Deepgram's own benchmarks (2,703 files, 81.69 hours). Independent benchmarks show smaller gaps. Always test on your own audio.
Note: WER varies significantly by audio quality, accent, domain, and benchmark. These are approximate figures from published evaluations; your domain's accuracy may differ substantially.
No single model wins on accuracy across all conditions. For your production use case, benchmark with a sample of your own audio before committing.
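Measuring WER on your own audio is straightforward: transcribe a held-out set with each provider and score against human reference transcripts. A minimal sketch using the jiwer package (the toy strings stand in for your reference/hypothesis pairs):

import jiwer

# Human-verified reference vs. one provider's output for the same clip
reference = "the patient was prescribed ten milligrams of metoprolol daily"
hypothesis = "the patient was prescribed ten milligrams of metoprolol twice daily"

# jiwer counts word-level substitutions, insertions, and deletions
error = jiwer.wer(reference, hypothesis)
print(f"WER: {error:.2%}")  # one insertion over nine reference words, roughly 11%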
Real-Time Streaming Comparison
| Provider | Model | Protocol | Time to First Word | Monthly Cost (1K hr) |
|---|---|---|---|---|
| Deepgram | Nova-3 | WebSocket | ~200–300ms | ~$462 |
| AssemblyAI | Universal-Streaming | WebSocket | ~300–500ms | ~$150 |
| Google | Chirp 3 | gRPC | ~400–700ms | ~$960 |
| OpenAI | gpt-4o-transcribe | WebSocket | ~300–500ms | ~$360 |
| OpenAI | whisper-1 | Batch only | N/A | ~$360 |
Deepgram has the lowest latency — meaningful for voice assistants and real-time captioning. But AssemblyAI Universal-Streaming at $0.15/hr undercuts Deepgram by 3x for teams where ~300ms additional latency is acceptable. For call center analytics where you don't need sub-300ms, AssemblyAI is the cost winner.
Self-Hosting with Whisper
For very high-volume batch transcription, self-hosting Whisper on your own GPU infrastructure can dramatically reduce costs:
from faster_whisper import WhisperModel

# large-v3-turbo: 6x faster than large-v3, <2% accuracy loss — the practical default
model = WhisperModel("large-v3-turbo", device="cuda", compute_type="float16")

# Transcribe
segments, info = model.transcribe(
    "audio.mp3",
    beam_size=5,
    word_timestamps=True,
    vad_filter=True,  # Voice activity detection — skip silence
)

for segment in segments:
    print(f"[{segment.start:.2f}s → {segment.end:.2f}s] {segment.text}")
faster-whisper (CTranslate2-based) runs Whisper large-v3 at 4-8x faster than the original implementation and requires ~6GB VRAM. On a single A100 GPU at ~$2/hour, throughput on the order of 15-20 hours of audio per hour of compute works out to roughly $0.002/minute vs the Whisper API's $0.006/minute; once GPU and operational overhead are factored in, self-hosting typically starts paying off somewhere around 500 hours/month.
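A rough break-even check using the figures above (assumed: ~$2/hour for the GPU, ~17x real-time throughput, and ignoring engineering/ops time, which is usually the real crossover factor):

API_RATE_PER_MIN = 0.006   # whisper-1 / gpt-4o-transcribe batch rate
GPU_HOURLY = 2.00          # rough on-demand A100 price (assumed)
REALTIME_FACTOR = 17       # audio hours transcribed per GPU hour (assumed)

def api_cost(audio_hours: float) -> float:
    return audio_hours * 60 * API_RATE_PER_MIN

def gpu_cost(audio_hours: float) -> float:
    return (audio_hours / REALTIME_FACTOR) * GPU_HOURLY

for hours in (100, 500, 1_000, 5_000):
    print(f"{hours:>5} h/mo   API ${api_cost(hours):>8.2f}   self-hosted GPU ${gpu_cost(hours):>7.2f}")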
Cost Scenarios
Scenario 1: Podcast transcription (batch) — 100 hours/month
| Provider | Monthly Cost | Notes |
|---|---|---|
| gpt-4o-mini-transcribe | 100h × 60 × $0.003 = $18 | Best value for batch |
| Deepgram Nova-3 | 100h × 60 × $0.0077 = $46.20 | Best accuracy |
| AssemblyAI Universal | 100h × $0.37 = $37 | Best features |
| Google Chirp 3 Dynamic Batch | 100h × 60 × $0.004 = $24 | Best for multilingual |
| Google Chirp 3 Standard | 100h × 60 × $0.016 = $96 | |
| Whisper self-hosted | ~$15–30 (GPU cost) | Infrastructure cost only |
Scenario 2: Call center — 1,000 hours/month real-time streaming
| Provider | Monthly Cost | Streaming Latency |
|---|---|---|
| AssemblyAI Universal-Streaming | 1,000 × $0.15 = $150 | ~300–500ms |
| Deepgram Nova-3 | 1,000 × $0.462 = $462 | ~200–300ms |
| OpenAI gpt-4o-transcribe | 1,000h × 60 × $0.006 = $360 | ~300–500ms |
| Google Chirp 3 | 1,000 × $0.96 = $960 | ~400–700ms |
AssemblyAI Universal-Streaming's pricing advantage is substantial at scale. For call centers where ~300ms latency is acceptable and features (diarization, summaries) matter, it's the clear winner. Deepgram is worth the premium only when sub-300ms latency is a hard requirement.
When to Choose Each
Choose OpenAI (gpt-4o-mini-transcribe) if:
- Best price for managed batch transcription ($0.003/min — cheapest managed option)
- Already integrated with OpenAI and want one SDK
- Need streaming but don't want Deepgram or AssemblyAI onboarding
- English audio, moderate volume, simplicity priority
- Note: use `gpt-4o-mini-transcribe`, not `whisper-1` — better accuracy, same or lower cost, streaming support
Choose Deepgram if:
- Real-time streaming is required (live captions, voice assistants, call center)
- Lowest latency among managed streaming APIs (sub-300ms time to first word)
- Speaker diarization is needed at scale
- Call center, customer service, or live meeting transcription
Choose AssemblyAI if:
- You need post-processing features: summaries, chapters, sentiment, entities
- Building meeting intelligence or podcast analysis workflows
- LeMUR (ask questions about audio content) is valuable for your use case
- You want the richest feature set and can pay the premium
Choose Google Chirp 3 if:
- Non-English or multilingual audio is primary (85+ languages, best multilingual accuracy)
- Audio quality varies and the built-in denoiser helps
- You're already on GCP and want ecosystem integration
- Non-real-time workloads where Dynamic Batch pricing ($0.004/min) makes Google cost-competitive
Compare STT API pricing, accuracy, and uptime at APIScout.