Best Speech-to-Text APIs 2026: Whisper vs Deepgram vs AssemblyAI
Voice Is Eating Software
Real-time transcription. Voice agents. Meeting intelligence. Podcast search. Call center analytics. Audio content accessibility. The list of applications requiring production-grade speech-to-text has expanded dramatically in 2025-2026, and the API market has responded with genuinely impressive advances in accuracy, latency, and specialized audio intelligence features.
Three platforms lead the market for developers: Deepgram (real-time speed leader), AssemblyAI (audio intelligence and LLM integration), and OpenAI Whisper (language breadth and accuracy at scale). Each has a distinct position — the right choice depends on your use case.
TL;DR
Deepgram Nova-3 at $0.0059/minute is the fastest and cheapest for real-time voice applications (200-400ms latency, 5.26% WER). AssemblyAI at $0.37/hour leads on audio intelligence — sentiment, topic detection, auto-highlights, and the LeMUR framework for LLM-over-audio. OpenAI's gpt-4o-transcribe handles the broadest language coverage (99 languages) with the best accuracy on multilingual content. For voice agents: Deepgram. For meeting intelligence: AssemblyAI. For multilingual applications: OpenAI/Whisper.
Key Takeaways
- Deepgram Nova-3 achieves 5.26% Word Error Rate on benchmarks with real-time streaming in 200-400ms — the fastest production STT API available.
- AssemblyAI reduced pricing 43% to $0.37/hour and released Slam-1 (October 2025) with multilingual streaming in six languages and LLM Gateway integration.
- OpenAI released gpt-4o-transcribe and gpt-4o-mini-transcribe in March 2025, outperforming Whisper Large-v2 on accuracy across most languages.
- AssemblyAI's LeMUR framework applies LLMs directly to transcribed audio — summarization, Q&A, and analysis of 10+ hours of audio in a single API call.
- Deepgram's Nova-3 Medical reaches 1-10% WER on healthcare vocabulary — the most specialized domain model in the market.
- Real-world WER is 3-4x higher than benchmarks on challenging audio (noise, accents, jargon) — test on your actual production audio, not published benchmarks.
- $200 free credit on Deepgram signup vs $0 free tier for OpenAI Whisper — Deepgram wins for experimentation budget.
Pricing Comparison
| Provider | Model | Price | Billing | Free Credit |
|---|---|---|---|---|
| Deepgram | Nova-3 | $0.0059/min ($5.90/1K min) | Per minute | $200 |
| Deepgram | Nova-3 Batch | $0.0043/min ($4.30/1K min) | Per minute | $200 |
| AssemblyAI | Universal-2 | $0.37/hour ($6.17/1K min) | Per hour | Free testing credits |
| OpenAI | gpt-4o-transcribe | $0.006/min ($6.00/1K min) | Per minute | None |
| OpenAI | Whisper-1 | $0.006/min ($6.00/1K min) | Per minute | None |
| Google Cloud | Standard | $0.004/min | Per 15 sec | $300 trial |
| Amazon Transcribe | Standard | $0.0004/sec ($0.024/min) | Per second | AWS Free Tier |
| Azure Cognitive | Standard | $1.00/hour | Per second | Azure credits |
Cost for 1,000 hours of audio:
- Deepgram Nova-3: ~$354
- Deepgram Batch: ~$258
- AssemblyAI: $370
- OpenAI Whisper: $360
- Amazon Transcribe: $1,440
Deepgram and AssemblyAI are nearly tied for cost at production volume, with OpenAI in the same range. Amazon Transcribe is roughly 4x more expensive.
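The 1,000-hour figures above reduce to simple arithmetic. A sketch using the per-minute rates from the pricing table (verify current rates on each provider's pricing page before budgeting):

```python
# Estimate transcription cost from per-minute rates (taken from the pricing
# table above; rates change, so confirm against each provider's pricing page).
RATES_PER_MIN = {
    "deepgram_nova3": 0.0059,
    "deepgram_batch": 0.0043,
    "assemblyai": 0.37 / 60,       # billed per hour
    "openai_whisper": 0.006,
    "amazon_transcribe": 0.024,
}

def transcription_cost(hours_of_audio: float, provider: str) -> float:
    """Cost in USD for the given number of audio hours."""
    return round(hours_of_audio * 60 * RATES_PER_MIN[provider], 2)

for name in RATES_PER_MIN:
    print(f"{name}: ${transcription_cost(1000, name):,.2f}")
```

Running this reproduces the 1,000-hour numbers above ($354, $258, $370, $360, $1,440).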
Deepgram
Best for: Real-time voice agents, low-latency transcription, high-volume batch processing
Deepgram is the speed and cost leader for production speech-to-text. Nova-3, their latest model, delivers 5.26% WER on benchmark audio with real-time streaming that produces words within 200-400ms of speech ending.
Models
| Model | WER | Use Case |
|---|---|---|
| Nova-3 | 5.26% | General purpose, best accuracy |
| Nova-3 Medical | 1-10% | Healthcare vocabulary |
| Nova-3 Finance | Low | Financial terminology |
| Whisper Cloud | Variable | Whisper compatibility layer |
Real-Time Streaming
```python
import asyncio
import json
import os

import websockets

DEEPGRAM_API_KEY = os.environ["DEEPGRAM_API_KEY"]

async def transcribe_realtime(audio_source):
    # audio_source: any async iterator yielding raw audio chunks (e.g. a mic stream)
    url = "wss://api.deepgram.com/v1/listen?model=nova-3&smart_format=true&diarize=true"
    async with websockets.connect(
        url,
        # websockets >= 14 renamed extra_headers to additional_headers
        extra_headers={"Authorization": f"Token {DEEPGRAM_API_KEY}"},
    ) as ws:
        # Send audio chunks as they arrive
        async def send_audio():
            async for chunk in audio_source:
                await ws.send(chunk)
            await ws.send(json.dumps({"type": "CloseStream"}))

        async def receive_transcripts():
            async for message in ws:
                result = json.loads(message)
                if result.get("is_final"):
                    transcript = result["channel"]["alternatives"][0]["transcript"]
                    print(f"Final: {transcript}")
                else:
                    # Interim results for immediate display
                    interim = result["channel"]["alternatives"][0]["transcript"]
                    print(f"Interim: {interim}", end="\r")

        await asyncio.gather(send_audio(), receive_transcripts())
```
Batch Transcription
```python
import httpx

with open("audio.wav", "rb") as f:
    audio_bytes = f.read()

response = httpx.post(
    "https://api.deepgram.com/v1/listen",
    headers={
        "Authorization": f"Token {DEEPGRAM_API_KEY}",
        "Content-Type": "audio/wav",
    },
    params={
        "model": "nova-3",
        "smart_format": "true",
        "diarize": "true",
        "punctuate": "true",
        "paragraphs": "true",
    },
    content=audio_bytes,
)
response.raise_for_status()
result = response.json()
transcript = result["results"]["channels"][0]["alternatives"][0]["transcript"]
words = result["results"]["channels"][0]["alternatives"][0]["words"]  # word-level timestamps
```
Voice Agent Features
Deepgram's Aura TTS and Flux STT combination is specifically designed for voice agent pipelines:
- Model-integrated end-of-turn detection (knows when user stops speaking)
- Configurable turn-taking dynamics
- Ultra-low latency optimized for conversation
- Voice Activity Detection (VAD) built in
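Deepgram's VAD is model-integrated, but the underlying idea is easy to illustrate. A deliberately naive energy-threshold sketch (the frame size and threshold are arbitrary illustrative values; production VADs are trained models, and this is not how Deepgram implements it):

```python
# Naive energy-threshold voice activity detection over 16-bit PCM frames.
# Threshold and frame length are illustrative guesses, not Deepgram's values.
import struct

FRAME_SAMPLES = 320          # 20 ms at 16 kHz
ENERGY_THRESHOLD = 500.0     # tune against your audio's noise floor

def frame_energy(pcm_frame: bytes) -> float:
    """Root-mean-square energy of a frame of 16-bit little-endian PCM."""
    samples = struct.unpack(f"<{len(pcm_frame) // 2}h", pcm_frame)
    return (sum(s * s for s in samples) / max(len(samples), 1)) ** 0.5

def is_speech(pcm_frame: bytes) -> bool:
    return frame_energy(pcm_frame) > ENERGY_THRESHOLD

# A silent frame vs a loud square-wave frame
silence = struct.pack(f"<{FRAME_SAMPLES}h", *([0] * FRAME_SAMPLES))
loud = struct.pack(f"<{FRAME_SAMPLES}h", *([8000, -8000] * (FRAME_SAMPLES // 2)))
print(is_speech(silence), is_speech(loud))  # False True
```

Fixed thresholds break down exactly where benchmarks do (noise, soft speakers), which is why model-integrated end-of-turn detection matters for voice agents.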
Strengths
- Fastest real-time transcription (200-400ms latency)
- Cheapest at scale ($0.0059/min vs $0.006 for Whisper)
- Domain-specific models (Medical, Finance)
- $200 free credit on signup
- Voice agent pipeline features (Flux, Aura)
- 36+ languages supported
- Self-serve model customization
When to choose Deepgram
Voice agents requiring real-time transcription, high-volume batch transcription at lowest cost, healthcare/finance applications with domain-specific vocabulary, any application where latency is the primary constraint.
AssemblyAI
Best for: Audio intelligence, meeting analytics, LLM-over-audio applications
AssemblyAI's differentiation in 2026 isn't transcription accuracy — it's what you can do with transcribed audio. The LeMUR framework and their suite of audio intelligence features (sentiment analysis, topic detection, content safety, PII redaction) make AssemblyAI the choice for applications that need to understand audio, not just transcribe it.
Models
| Model | WER (benchmark) | Notes |
|---|---|---|
| Universal-2 | 8.4% | General purpose, best intelligence features |
| Slam-1 (Oct 2025) | TBD | New architecture, multilingual streaming |
Audio Intelligence Features
AssemblyAI includes these features in the base transcription API:
```python
import assemblyai as aai

aai.settings.api_key = "YOUR_API_KEY"

config = aai.TranscriptionConfig(
    sentiment_analysis=True,   # positive/negative/neutral per utterance
    auto_highlights=True,      # key points automatically extracted
    iab_categories=True,       # IAB topic classification
    entity_detection=True,     # named entity recognition
    speaker_labels=True,       # speaker diarization
    content_safety=True,       # hate speech, profanity detection
    redact_pii=True,           # remove PII (pair with redact_pii_policies to pick categories)
    summarization=True,        # automatic summary (mutually exclusive with auto_chapters)
    # auto_chapters=True,      # chapter segmentation; enable instead of summarization
)

transcriber = aai.Transcriber()
transcript = transcriber.transcribe("https://your-audio-url.com/file.mp3", config=config)

# Sentiment per utterance
for result in transcript.sentiment_analysis:
    print(f"{result.speaker}: {result.text} [{result.sentiment}]")

# Auto-extracted highlights
for result in transcript.auto_highlights.results:
    print(f"Highlight: {result.text} (count: {result.count})")
```
LeMUR Framework
LeMUR (Leveraging Large Language Models to Understand Recognized Speech) is AssemblyAI's most distinctive feature:
```python
# Apply an LLM directly to the transcribed audio
lemur_response = transcript.lemur.task(
    prompt="What were the main decisions made in this meeting? Format as a bulleted list.",
    final_model=aai.LemurModel.claude3_5_sonnet,
)

# Q&A over audio
qa_response = transcript.lemur.question_answer(
    questions=[
        aai.LemurQuestion(question="What was the total deal size discussed?"),
        aai.LemurQuestion(question="Who are the key stakeholders mentioned?"),
    ]
)

# Structured output
action_items = transcript.lemur.action_items()
```
Process up to 10 hours of audio through LeMUR in a single API call — summarizing hours of podcast content, extracting decisions from long recordings, or generating reports from call center sessions.
Real-Time Streaming (Slam-1)
AssemblyAI's October 2025 Slam-1 model introduced:
- Real-time streaming transcription (latency comparable to Deepgram)
- Six language support for streaming (English, Spanish, French, German, Portuguese, Dutch)
- Safety guardrails during transcription
- LLM Gateway integration for immediate LLM processing
Pricing
| Feature | Cost |
|---|---|
| Transcription | $0.37/hour |
| Real-time streaming | $0.37/hour |
| LeMUR (base) | Free with transcription |
| LeMUR (LLM costs) | Model-dependent |
| Audio Intelligence | Included |
Strengths
- Best audio intelligence suite (sentiment, topics, entities, safety)
- LeMUR framework for LLM-over-audio
- Content safety and PII redaction built in
- Auto-chapters, auto-highlights, auto-summarization
- Straightforward hourly pricing (no per-feature add-ons)
- Free testing credits
When to choose AssemblyAI
Meeting intelligence and analytics, call center analysis, podcast intelligence, any application that needs to understand audio beyond transcription, applications requiring content moderation on audio content.
OpenAI Whisper / gpt-4o-transcribe
Best for: Language breadth, highest accuracy on multilingual audio, research/academic use
OpenAI's transcription story evolved significantly in 2025. gpt-4o-transcribe, released in March 2025, outperforms the original Whisper Large-v2 on most benchmarks. Whisper remains available as whisper-1 for legacy integrations.
Models
| Model | Languages | WER | Latency | Price |
|---|---|---|---|---|
| gpt-4o-transcribe | 99+ | Low | 1-3s (batch) | $0.006/min |
| gpt-4o-mini-transcribe | 99+ | Good | Faster | Lower |
| whisper-1 (legacy) | 99 | ~5-7% | 1-3s | $0.006/min |
API Integration
```python
from openai import OpenAI

client = OpenAI()

# Batch transcription
with open("audio.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
        response_format="json",
        language="es",  # optional: specify the language for better accuracy
    )
print(transcription.text)

# Word-level timestamps require whisper-1 with verbose_json:
# client.audio.transcriptions.create(model="whisper-1", file=audio_file,
#     response_format="verbose_json", timestamp_granularities=["word"])
```
Language Coverage
Whisper/gpt-4o-transcribe supports 99 languages — significantly more than Deepgram (36+) or AssemblyAI's streaming (6 languages for Slam-1). For applications handling multilingual audio from diverse user bases, OpenAI's language breadth is the decisive factor.
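One practical consequence: if your traffic is multilingual, you can route requests by language code rather than standardizing on a single provider. A sketch (the language sets below are illustrative subsets I've assumed for the example, not the providers' full lists; check each provider's docs):

```python
# Route audio to a provider by language code. The sets below are illustrative
# subsets, not the full supported-language lists.
DEEPGRAM_LANGS = {"en", "es", "fr", "de", "pt", "nl", "ja", "ko", "zh"}
ASSEMBLYAI_STREAMING_LANGS = {"en", "es", "fr", "de", "pt", "nl"}

def pick_provider(lang: str, realtime: bool) -> str:
    lang = lang.split("-")[0].lower()  # "en-US" -> "en"
    if realtime:
        if lang in DEEPGRAM_LANGS:
            return "deepgram"
        if lang in ASSEMBLYAI_STREAMING_LANGS:
            return "assemblyai"
        raise ValueError(f"No real-time provider for {lang!r}")
    # Batch: fall back to OpenAI's 99-language coverage for long-tail languages
    return "deepgram" if lang in DEEPGRAM_LANGS else "openai"

print(pick_provider("sw", realtime=False))    # Swahili -> openai
print(pick_provider("en-US", realtime=True))  # -> deepgram
```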
Limitations
- No real-time streaming API (batch only) — gpt-4o-realtime handles real-time audio separately but at higher cost
- No free tier — every minute costs $0.006
- 1-3 second latency for batch — too slow for real-time voice agents
- No audio intelligence features built in — transcription only
When to choose OpenAI Whisper/gpt-4o-transcribe
Applications requiring 99-language support, highest accuracy on challenging multilingual audio, research and academic transcription, applications already deeply in the OpenAI ecosystem, cases where batch processing (1-3s) is acceptable.
Feature Comparison
| Feature | Deepgram | AssemblyAI | OpenAI |
|---|---|---|---|
| Real-time streaming | Yes (200-400ms) | Yes (Slam-1) | No (batch only) |
| Word-level timestamps | Yes | Yes | Yes |
| Speaker diarization | Yes | Yes | Limited |
| Sentiment analysis | No | Yes | No |
| Topic detection | No | Yes | No |
| Entity extraction | No | Yes | No |
| Content safety | No | Yes | No |
| PII redaction | No | Yes | No |
| Auto-summary | No | Yes | No |
| LLM integration | No | Yes (LeMUR) | Basic |
| Language count | 36+ | 6 (streaming), more (batch) | 99+ |
| Domain models | Medical, Finance | None | None |
| Free credits | $200 | Yes (limited) | None |
| Pricing | $0.0059/min | $0.37/hour | $0.006/min |
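The table above reduces to a small decision function. A sketch encoding this article's recommendations (the priorities and the 36-language cutoff are this article's rules of thumb, not hard limits):

```python
def recommend_stt(needs_realtime: bool, language_count: int,
                  needs_audio_intelligence: bool, domain: str = "") -> str:
    """Encode this article's provider recommendations as a decision function."""
    if domain in ("medical", "finance"):
        return "deepgram"      # Nova-3 domain-specific models
    if language_count > 36:
        return "openai"        # 99-language coverage
    if needs_audio_intelligence:
        return "assemblyai"    # LeMUR + audio intelligence suite
    if needs_realtime:
        return "deepgram"      # 200-400 ms streaming
    return "deepgram"          # cheapest batch at scale

print(recommend_stt(needs_realtime=True, language_count=1,
                    needs_audio_intelligence=False))  # deepgram
```

The ordering is the interesting design choice: domain vocabulary and language breadth are hard constraints, so they are checked before the softer latency and cost preferences.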
Choosing the Right STT API
For real-time voice applications (< 500ms latency required)
Choose Deepgram Nova-3. Nothing else delivers 200-400ms end-to-end latency for production real-time transcription. Voice agents, live captions, and interactive audio applications need Deepgram.
For meeting intelligence and audio analysis
Choose AssemblyAI. The LeMUR framework, audio intelligence features, and auto-chapters/highlights/summaries make it purpose-built for meeting analytics, podcast intelligence, and call center analysis.
For multilingual applications (> 36 languages)
Choose OpenAI gpt-4o-transcribe. 99 languages with good accuracy across all of them. Deepgram's 36 and AssemblyAI's limited streaming language support don't compare for truly multilingual applications.
For healthcare/medical applications
Choose Deepgram Nova-3 Medical. Specialized training on medical vocabulary reduces WER to 1-10% on clinical audio — significantly better than general models.
For maximum cost efficiency at batch scale
Choose Deepgram Nova-3 Batch ($0.0043/min) or Rev AI Standard ($0.002/min) if accuracy requirements are modest.
For content moderation and safety on audio
Choose AssemblyAI. Built-in content safety, profanity detection, and PII redaction are unique in the market.
Testing Recommendation
Published benchmarks use clean studio audio. Your production audio will have:
- Background noise
- Multiple overlapping speakers
- Accents and non-native speech
- Domain-specific terminology
- Variable recording quality
Before committing, test all three APIs on 30-60 minutes of your actual production audio. WER on your data is the only metric that matters: a few points of benchmark difference between providers can vanish or widen dramatically depending on audio conditions, and your use case determines which direction.
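Running that test means computing WER yourself against a hand-corrected reference transcript. A minimal word-level edit-distance implementation (libraries like jiwer do the same with more text normalization):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference words."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Standard Levenshtein distance over words, via dynamic programming
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / max(len(ref), 1)

print(wer("the quick brown fox", "the quick brown fox"))  # 0.0
print(wer("the quick brown fox", "the quick brown fix"))  # 0.25
```

Compute this per provider over the same reference transcripts; real-world comparisons should also normalize punctuation and numbers before scoring, which is where a library like jiwer earns its keep.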
Verdict
Deepgram is the default choice for real-time voice applications and cost-sensitive batch processing. The combination of speed, price, and the $200 free credit makes it the best starting point for most voice projects.
AssemblyAI is the right choice when transcription is just the beginning — when you need to understand, summarize, analyze, and extract structured insights from audio content.
OpenAI is the choice for maximum language coverage and applications already in the OpenAI ecosystem. The accuracy improvements in gpt-4o-transcribe are real, but the lack of real-time streaming and no free tier limit its appeal outside its strengths.
Compare speech-to-text API pricing, features, and documentation at APIScout — find the right transcription API for your application.