
Best Speech-to-Text APIs 2026: Whisper vs Deepgram vs AssemblyAI

APIScout Team

Tags: speech-to-text api, whisper, deepgram, assemblyai, stt, transcription api, voice, audio api

Voice Is Eating Software

Real-time transcription. Voice agents. Meeting intelligence. Podcast search. Call center analytics. Audio content accessibility. The list of applications requiring production-grade speech-to-text has expanded dramatically in 2025-2026, and the API market has responded with genuinely impressive advances in accuracy, latency, and specialized audio intelligence features.

Three platforms lead the market for developers: Deepgram (real-time speed leader), AssemblyAI (audio intelligence and LLM integration), and OpenAI Whisper (language breadth and accuracy at scale). Each has a distinct position — the right choice depends on your use case.

TL;DR

Deepgram Nova-3 at $0.0059/minute is the fastest and cheapest for real-time voice applications (200-400ms latency, 5.26% WER). AssemblyAI at $0.37/hour leads on audio intelligence — sentiment, topic detection, auto-highlights, and the LeMUR framework for LLM-over-audio. OpenAI's gpt-4o-transcribe handles the broadest language coverage (99 languages) with the best accuracy on multilingual content. For voice agents: Deepgram. For meeting intelligence: AssemblyAI. For multilingual applications: OpenAI/Whisper.

Key Takeaways

  • Deepgram Nova-3 achieves 5.26% Word Error Rate on benchmarks with real-time streaming in 200-400ms — the fastest production STT API available.
  • AssemblyAI reduced pricing 43% to $0.37/hour and released Slam-1 (October 2025) with multilingual streaming in six languages and LLM Gateway integration.
  • OpenAI released gpt-4o-transcribe and gpt-4o-mini-transcribe in March 2025, outperforming Whisper Large-v2 on accuracy across most languages.
  • AssemblyAI's LeMUR framework applies LLMs directly to transcribed audio — summarization, Q&A, and analysis of up to 10 hours of audio in a single API call.
  • Deepgram's Nova-3 Medical reaches 1-10% WER on healthcare vocabulary — the most specialized domain model in the market.
  • Real-world WER is 3-4x higher than benchmarks on challenging audio (noise, accents, jargon) — test on your actual production audio, not published benchmarks.
  • $200 free credit on Deepgram signup vs $0 free tier for OpenAI Whisper — Deepgram wins for experimentation budget.

Pricing Comparison

| Provider | Model | Price | Billing | Free Credit |
| --- | --- | --- | --- | --- |
| Deepgram | Nova-3 | $0.0059/min ($5.90/1K min) | Per minute | $200 |
| Deepgram | Nova-3 Batch | $0.0043/min ($4.30/1K min) | Per minute | $200 |
| AssemblyAI | Universal-2 | $0.37/hour ($6.17/1K min) | Per hour | Free testing credits |
| OpenAI | gpt-4o-transcribe | $0.006/min ($6.00/1K min) | Per minute | None |
| OpenAI | Whisper-1 | $0.006/min ($6.00/1K min) | Per minute | None |
| Google Cloud | Standard | $0.004/min | Per 15 sec | $300 trial |
| Amazon Transcribe | Standard | $0.0004/sec ($0.024/min) | Per second | AWS Free Tier |
| Azure Cognitive | Standard | $1.00/hour | Per second | Azure credits |

Cost for 1,000 hours of audio:

  • Deepgram Nova-3: ~$354
  • Deepgram Batch: ~$258
  • AssemblyAI: $370
  • OpenAI Whisper: $360
  • Amazon Transcribe: $1,440

At production volume, Deepgram, OpenAI, and AssemblyAI land within about 5% of each other ($354-$370 per 1,000 hours), while Amazon Transcribe costs roughly 4x more.
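The arithmetic behind these totals is easy to reproduce. A quick sketch converting the per-minute rates above into a 1,000-hour bill (illustrative only — check each provider's current pricing page):

```python
# Convert a per-minute rate into the cost of a given number of audio hours.
def stt_cost(rate_per_min: float, hours: float) -> float:
    return round(rate_per_min * hours * 60, 2)

# Rates as listed in the pricing table above (USD per minute).
rates_per_min = {
    "Deepgram Nova-3": 0.0059,
    "Deepgram Nova-3 Batch": 0.0043,
    "AssemblyAI Universal-2": 0.37 / 60,  # $0.37/hour
    "OpenAI gpt-4o-transcribe": 0.006,
    "Amazon Transcribe": 0.024,
}

for name, rate in rates_per_min.items():
    print(f"{name}: ${stt_cost(rate, 1000):,.2f} per 1,000 hours")
```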

Deepgram

Best for: Real-time voice agents, low-latency transcription, high-volume batch processing

Deepgram is the speed and cost leader for production speech-to-text. Nova-3, their latest model, delivers 5.26% WER on benchmark audio with real-time streaming that produces words within 200-400ms of speech ending.

Models

| Model | WER | Use Case |
| --- | --- | --- |
| Nova-3 | 5.26% | General purpose, best accuracy |
| Nova-3 Medical | 1-10% | Healthcare vocabulary |
| Nova-3 Finance | Low | Financial terminology |
| Whisper Cloud | Variable | Whisper compatibility layer |

Real-Time Streaming

import asyncio
import json
import os

import websockets

DEEPGRAM_API_KEY = os.environ["DEEPGRAM_API_KEY"]

async def transcribe_realtime():
    url = "wss://api.deepgram.com/v1/listen?model=nova-3&smart_format=true&diarize=true"

    async with websockets.connect(
        url,
        # websockets >= 14 renames this parameter to additional_headers
        extra_headers={"Authorization": f"Token {DEEPGRAM_API_KEY}"},
    ) as ws:
        # Send audio chunks as they arrive
        async def send_audio():
            # Your microphone/audio source
            async for chunk in audio_source:
                await ws.send(chunk)
            await ws.send(json.dumps({"type": "CloseStream"}))

        async def receive_transcripts():
            async for message in ws:
                result = json.loads(message)
                if result.get("is_final"):
                    transcript = result["channel"]["alternatives"][0]["transcript"]
                    print(f"Final: {transcript}")
                else:
                    # Interim results for immediate display
                    interim = result["channel"]["alternatives"][0]["transcript"]
                    print(f"Interim: {interim}", end="\r")

        await asyncio.gather(send_audio(), receive_transcripts())

Batch Transcription

import os

import httpx

DEEPGRAM_API_KEY = os.environ["DEEPGRAM_API_KEY"]

with open("audio.mp3", "rb") as f:
    audio_bytes = f.read()

response = httpx.post(
    "https://api.deepgram.com/v1/listen",
    headers={"Authorization": f"Token {DEEPGRAM_API_KEY}"},
    params={
        "model": "nova-3",
        "smart_format": "true",
        "diarize": "true",
        "punctuate": "true",
        "paragraphs": "true",
    },
    content=audio_bytes,
)

result = response.json()
transcript = result["results"]["channels"][0]["alternatives"][0]["transcript"]
words = result["results"]["channels"][0]["alternatives"][0]["words"]  # Word-level timestamps
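With diarize=true, each word object in the Deepgram response also carries a speaker index, which makes it straightforward to fold the word list into per-speaker utterances. A minimal sketch — `group_by_speaker` is a hypothetical helper, and the sample word dicts mirror the documented response shape:

```python
# Fold Deepgram's word-level output into consecutive per-speaker utterances.
# Assumes each word dict carries "word", "start", "end", and (with
# diarize=true) a "speaker" index, per Deepgram's response format.
def group_by_speaker(words):
    utterances = []
    for w in words:
        if utterances and utterances[-1]["speaker"] == w["speaker"]:
            # Same speaker as the previous word: extend the utterance.
            utterances[-1]["text"] += " " + w["word"]
            utterances[-1]["end"] = w["end"]
        else:
            # Speaker change: start a new utterance.
            utterances.append({
                "speaker": w["speaker"],
                "text": w["word"],
                "start": w["start"],
                "end": w["end"],
            })
    return utterances

words = [
    {"word": "hello", "start": 0.0, "end": 0.4, "speaker": 0},
    {"word": "there", "start": 0.4, "end": 0.8, "speaker": 0},
    {"word": "hi", "start": 1.0, "end": 1.2, "speaker": 1},
]
for u in group_by_speaker(words):
    print(f"Speaker {u['speaker']} [{u['start']:.1f}-{u['end']:.1f}]: {u['text']}")
```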

Voice Agent Features

Deepgram's Aura TTS and Flux STT combination is specifically designed for voice agent pipelines:

  • Model-integrated end-of-turn detection (knows when user stops speaking)
  • Configurable turn-taking dynamics
  • Ultra-low latency optimized for conversation
  • Voice Activity Detection (VAD) built in

Strengths

  • Fastest real-time transcription (200-400ms latency)
  • Cheapest at scale ($0.0059/min vs $0.006 for Whisper)
  • Domain-specific models (Medical, Finance)
  • $200 free credit on signup
  • Voice agent pipeline features (Flux, Aura)
  • 36+ languages supported
  • Self-serve model customization

When to choose Deepgram

Voice agents requiring real-time transcription, high-volume batch transcription at lowest cost, healthcare/finance applications with domain-specific vocabulary, any application where latency is the primary constraint.

AssemblyAI

Best for: Audio intelligence, meeting analytics, LLM-over-audio applications

AssemblyAI's differentiation in 2026 isn't transcription accuracy — it's what you can do with transcribed audio. The LeMUR framework and their suite of audio intelligence features (sentiment analysis, topic detection, content safety, PII redaction) make AssemblyAI the choice for applications that need to understand audio, not just transcribe it.

Models

| Model | WER (benchmark) | Notes |
| --- | --- | --- |
| Universal-2 | 8.4% | General purpose, best intelligence features |
| Slam-1 (Oct 2025) | TBD | New architecture, multilingual streaming |

Audio Intelligence Features

AssemblyAI includes these features in the base transcription API:

import os

import assemblyai as aai

aai.settings.api_key = os.environ["ASSEMBLYAI_API_KEY"]

config = aai.TranscriptionConfig(
    sentiment_analysis=True,        # Positive/negative/neutral per utterance
    auto_highlights=True,           # Key points automatically extracted
    iab_categories=True,            # IAB topic classification
    entity_detection=True,          # Named entity recognition
    speaker_labels=True,            # Speaker diarization
    content_safety=True,            # Hate speech, profanity detection
    redact_pii=True,                # Remove PII from transcript
    summarization=True,             # Automatic summary
    auto_chapters=True,             # Chapter segmentation
)

transcriber = aai.Transcriber()
transcript = transcriber.transcribe("https://your-audio-url.com/file.mp3", config)

# Sentiment per utterance
for result in transcript.sentiment_analysis:
    print(f"{result.speaker}: {result.text} [{result.sentiment}]")

# Auto-extracted highlights
for result in transcript.auto_highlights.results:
    print(f"Highlight: {result.text} (count: {result.count})")

LeMUR Framework

LeMUR (Leveraging Large Language Models to Understand Recognized Speech) is AssemblyAI's most distinctive feature:

# Apply LLM directly to transcribed audio
lemur_response = transcript.lemur.task(
    prompt="What were the main decisions made in this meeting? Format as a bulleted list.",
    final_model=aai.LemurModel.claude3_5_sonnet,
)

# Q&A over audio
qa_response = transcript.lemur.question_answer(
    questions=[
        aai.LemurQuestion(question="What was the total deal size discussed?"),
        aai.LemurQuestion(question="Who are the key stakeholders mentioned?"),
    ]
)

# Structured output
action_items = transcript.lemur.action_items()

Process up to 10 hours of audio through LeMUR in a single API call — summarizing hours of podcast content, extracting decisions from long recordings, or generating reports from call center sessions.

Real-Time Streaming (Slam-1)

AssemblyAI's October 2025 Slam-1 model introduced:

  • Real-time streaming transcription (latency comparable to Deepgram)
  • Six language support for streaming (English, Spanish, French, German, Portuguese, Dutch)
  • Safety guardrails during transcription
  • LLM Gateway integration for immediate LLM processing

Pricing

| Feature | Cost |
| --- | --- |
| Transcription | $0.37/hour |
| Real-time streaming | $0.37/hour |
| LeMUR (base) | Free with transcription |
| LeMUR (LLM costs) | Model-dependent |
| Audio Intelligence | Included |

Strengths

  • Best audio intelligence suite (sentiment, topics, entities, safety)
  • LeMUR framework for LLM-over-audio
  • Content safety and PII redaction built in
  • Auto-chapters, auto-highlights, auto-summarization
  • Straightforward hourly pricing (no per-feature add-ons)
  • Free testing credits

When to choose AssemblyAI

Meeting intelligence and analytics, call center analysis, podcast intelligence, any application that needs to understand audio beyond transcription, applications requiring content moderation on audio content.

OpenAI Whisper / gpt-4o-transcribe

Best for: Language breadth, highest accuracy on multilingual audio, research/academic use

OpenAI's transcription story evolved significantly in 2025. gpt-4o-transcribe, released in March 2025, outperforms the original Whisper Large-v2 on most benchmarks. Whisper remains available as whisper-1 for legacy integrations.

Models

| Model | Languages | WER | Latency | Price |
| --- | --- | --- | --- | --- |
| gpt-4o-transcribe | 99+ | Low | 1-3s (batch) | $0.006/min |
| gpt-4o-mini-transcribe | 99+ | Good | Faster | Lower |
| whisper-1 (legacy) | 99 | ~5-7% | 1-3s | $0.006/min |

API Integration

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Batch transcription
with open("audio.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
        response_format="json",
        language="es",  # Optional: specify language for better accuracy
    )

print(transcription.text)

# Word-level timestamps require whisper-1 with verbose_json output;
# gpt-4o-transcribe supports only json/text response formats.
with open("audio.mp3", "rb") as audio_file:
    verbose = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="verbose_json",
        timestamp_granularities=["word"],
    )

print(verbose.words)

Language Coverage

Whisper/gpt-4o-transcribe supports 99 languages — significantly more than Deepgram (36+) or AssemblyAI's streaming (6 languages for Slam-1). For applications handling multilingual audio from diverse user bases, OpenAI's language breadth is the decisive factor.

Limitations

  • No real-time streaming API (batch only) — gpt-4o-realtime handles real-time audio separately but at higher cost
  • No free tier — every minute costs $0.006
  • 1-3 second latency for batch — too slow for real-time voice agents
  • No audio intelligence features built in — transcription only

When to choose OpenAI Whisper/gpt-4o-transcribe

Applications requiring 99-language support, highest accuracy on challenging multilingual audio, research and academic transcription, applications already deeply in the OpenAI ecosystem, cases where batch processing (1-3s) is acceptable.

Feature Comparison

| Feature | Deepgram | AssemblyAI | OpenAI |
| --- | --- | --- | --- |
| Real-time streaming | Yes (200-400ms) | Yes (Slam-1) | No (batch only) |
| Word-level timestamps | Yes | Yes | Yes |
| Speaker diarization | Yes | Yes | Limited |
| Sentiment analysis | No | Yes | No |
| Topic detection | No | Yes | No |
| Entity extraction | No | Yes | No |
| Content safety | No | Yes | No |
| PII redaction | No | Yes | No |
| Auto-summary | No | Yes | No |
| LLM integration | No | Yes (LeMUR) | Basic |
| Language count | 36+ | 6 (streaming), more (batch) | 99+ |
| Domain models | Medical, Finance | None | None |
| Free credits | $200 | Yes (limited) | None |
| Pricing | $0.0059/min | $0.37/hour | $0.006/min |

Choosing the Right STT API

For real-time voice applications (< 500ms latency required)

Choose Deepgram Nova-3. Nothing else delivers 200-400ms end-to-end latency for production real-time transcription. Voice agents, live captions, and interactive audio applications need Deepgram.

For meeting intelligence and audio analysis

Choose AssemblyAI. The LeMUR framework, audio intelligence features, and auto-chapters/highlights/summaries make it purpose-built for meeting analytics, podcast intelligence, and call center analysis.

For multilingual applications (> 36 languages)

Choose OpenAI gpt-4o-transcribe. 99 languages with good accuracy across all of them. Deepgram's 36+ languages and AssemblyAI's limited streaming language support don't compare for truly multilingual applications.

For healthcare/medical applications

Choose Deepgram Nova-3 Medical. Specialized training on medical vocabulary reduces WER to 1-10% on clinical audio — significantly better than general models.

For maximum cost efficiency at batch scale

Choose Deepgram Nova-3 Batch ($0.0043/min) or Rev AI Standard ($0.002/min) if accuracy requirements are modest.

For content moderation and safety on audio

Choose AssemblyAI. Built-in content safety, profanity detection, and PII redaction are unique in the market.
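The decision rules above can be condensed into a simple routing sketch — `pick_stt_provider` is illustrative, not an official API; adjust the priority order to your own constraints:

```python
# Route a workload to a provider following this guide's decision rules.
# Priorities: domain vocabulary first, then latency, then audio
# intelligence, then language breadth, then cost.
def pick_stt_provider(*, realtime: bool = False,
                      audio_intelligence: bool = False,
                      languages: int = 1,
                      medical: bool = False) -> str:
    if medical:
        return "Deepgram Nova-3 Medical"
    if realtime:
        return "Deepgram Nova-3"
    if audio_intelligence:
        return "AssemblyAI"
    if languages > 36:
        return "OpenAI gpt-4o-transcribe"
    return "Deepgram Nova-3 Batch"  # cheapest default for batch workloads

print(pick_stt_provider(realtime=True))   # voice agent
print(pick_stt_provider(languages=99))    # multilingual app
```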

Testing Recommendation

Published benchmarks use clean studio audio. Your production audio will have:

  • Background noise
  • Multiple overlapping speakers
  • Accents and non-native speech
  • Domain-specific terminology
  • Variable recording quality

Before committing, test all three APIs on 30-60 minutes of your actual production audio. WER on your data is the only metric that matters: a 5% benchmark gap between providers can shrink to 2% or widen to 20% depending on audio conditions, and only testing on your own recordings tells you which way it goes.
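To score each provider on your own audio, compare its transcript against a human reference with a minimal WER implementation. A sketch using word-level Levenshtein distance (production evaluations typically also normalize punctuation and numerals, e.g. with the jiwer library):

```python
# WER = (substitutions + deletions + insertions) / reference word count,
# computed as word-level Levenshtein edit distance.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[-1][-1] / max(len(ref), 1)

print(wer("the quick brown fox", "the quick brow fox"))  # → 0.25
```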

Verdict

Deepgram is the default choice for real-time voice applications and cost-sensitive batch processing. The combination of speed, price, and the $200 free credit makes it the best starting point for most voice projects.

AssemblyAI is the right choice when transcription is just the beginning — when you need to understand, summarize, analyze, and extract structured insights from audio content.

OpenAI is the choice for maximum language coverage and applications already in the OpenAI ecosystem. The accuracy improvements in gpt-4o-transcribe are real, but the lack of real-time streaming and no free tier limit its appeal outside its strengths.


Compare speech-to-text API pricing, features, and documentation at APIScout — find the right transcription API for your application.
