
Speech-to-Text APIs 2026: Whisper vs Deepgram vs AssemblyAI

APIScout Team

The Speech API Market Has Stratified

In 2023, OpenAI Whisper disrupted the speech recognition market by open-sourcing a model that matched commercial APIs at zero cost for self-hosting. In 2025–2026, the market has responded: Deepgram, AssemblyAI, and Google have released specialized models that beat Whisper on specific tasks — real-time streaming, speaker diarization, domain-specific accuracy — while remaining cost-competitive at scale.

The choice now depends heavily on your use case: batch transcription, real-time streaming, feature richness, or cost minimization.

TL;DR

Deepgram Nova-3 leads on batch accuracy (5.26% WER) and real-time streaming latency at $0.0077/min. AssemblyAI wins on features AND is surprisingly cheap for streaming at $0.15/hr (Universal-Streaming) — cheaper than Deepgram for real-time workloads. OpenAI's new gpt-4o-mini-transcribe ($0.003/min, launched March 2025) is the budget batch pick — half the price of whisper-1 with better accuracy and streaming support. Google Chirp 3 (also March 2025) leads on multilingual accuracy with a built-in denoiser.

Key Takeaways

  • Deepgram Nova-3: $0.0077/min; 5.26% WER (batch); best real-time streaming latency; Nova-3 Medical at 3.44% WER; GA February 2025
  • AssemblyAI: $0.37/hr pre-recorded; $0.15/hr streaming (cheaper than Deepgram for real-time); LeMUR for LLM analysis; Slam-1 launched October 2025
  • OpenAI: whisper-1 $0.006/min (batch only); gpt-4o-mini-transcribe $0.003/min with streaming (launched March 2025, now recommended)
  • Google Chirp 3: $0.016/min standard; $0.004/min Dynamic Batch (75% off); built-in denoiser; native diarization; launched March 2025
  • WER benchmarks: Deepgram Nova-3 at 5.26%, AssemblyAI Universal-2 at 5.9%, gpt-4o-transcribe ~5-7%; Google Chirp 3 multilingual is best
  • Self-hosting: Whisper large-v3-turbo is the practical choice — 6x faster than large-v3 with <2% accuracy loss

Pricing Comparison

| Provider | Model | Price per Minute | Free Tier | Streaming |
|---|---|---|---|---|
| OpenAI | whisper-1 (legacy) | $0.006/min | None | No |
| OpenAI | gpt-4o-mini-transcribe ⭐ | $0.003/min | None | Yes |
| OpenAI | gpt-4o-transcribe | $0.006/min | None | Yes |
| Deepgram | Nova-3 | $0.0077/min | $200 credit | Yes |
| AssemblyAI | Universal (pre-recorded) | ~$0.0062/min ($0.37/hr) | $50 credit | No |
| AssemblyAI | Universal-Streaming ⭐ | $0.0025/min ($0.15/hr) | $50 credit | Yes |
| Google | Chirp 3 | $0.016/min | 60 min/month | Yes |
| Google | Chirp 3 Dynamic Batch | $0.004/min | 60 min/month | No |
| Whisper (self-hosted) | large-v3-turbo | Infrastructure only | — | — |

Pricing Notes

OpenAI now has three transcription models (as of March 2025):

  • whisper-1: $0.006/min, batch only, still works but no longer recommended
  • gpt-4o-transcribe: $0.006/min, same price as whisper-1 but ~35% better accuracy + streaming
  • gpt-4o-mini-transcribe: $0.003/min, streaming supported, now the recommended choice for most use cases

Deepgram pricing varies by plan:

  • Nova-3 pay-as-you-go: $0.0077/min ($0.462/hr)
  • Nova-3 Growth plan: $0.0065/min ($0.39/hr)
  • Free tier: $200 in credits (~433 hours at pay-as-you-go rates)

AssemblyAI has two distinct pricing tiers worth knowing:

  • Universal (pre-recorded/batch): $0.37/hr — best accuracy + features
  • Universal-Streaming: $0.15/hr — cheaper than Deepgram for real-time; 6 languages (English, Spanish, French, German, Italian, Portuguese)
  • Nano: $0.12/hr — lower accuracy, budget option
  • LeMUR (LLM analysis): token-based on top of transcription cost
  • Free tier: $50 in credits (~135 hours Universal or 333 hours streaming)

Google Speech-to-Text (Chirp 3 is now current, launched March 2025):

  • Standard rate: $0.016/min (most expensive in this group)
  • Dynamic Batch (async, 24-hour SLA): $0.004/min — 75% discount for non-real-time workloads
  • Free tier: 60 minutes/month (permanent, not trial credit)
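Per-minute and per-hour rates are easy to misconvert when comparing providers. A small sketch that normalizes everything to a monthly figure — the rates are hard-coded from the table above, so verify them against each provider's current pricing page:

```python
# Hypothetical rate table in USD per minute, taken from the comparison above.
RATES_PER_MIN = {
    "gpt-4o-mini-transcribe": 0.003,
    "whisper-1": 0.006,
    "deepgram-nova-3": 0.0077,
    "assemblyai-universal": 0.37 / 60,    # quoted per hour
    "assemblyai-streaming": 0.15 / 60,    # quoted per hour
    "google-chirp-3": 0.016,
    "google-chirp-3-batch": 0.004,
}

def monthly_cost(provider: str, hours_per_month: float) -> float:
    """Estimated monthly spend in USD for a given audio volume."""
    return round(RATES_PER_MIN[provider] * hours_per_month * 60, 2)

# Rank providers by cost for a 100-hour/month workload
for name in sorted(RATES_PER_MIN, key=RATES_PER_MIN.get):
    print(f"{name:28s} ${monthly_cost(name, 100):>7.2f} / 100 h")
```

The same helper reproduces the scenario tables later in this article by swapping the hours argument.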

OpenAI: Whisper → GPT-4o Transcribe Models

OpenAI launched gpt-4o-transcribe and gpt-4o-mini-transcribe in March 2025. whisper-1 still works, but OpenAI now recommends the GPT-4o-based models. The key improvements: ~35% lower WER than whisper-1, and both support real-time streaming via WebSocket — something whisper-1 never had.

from openai import OpenAI

client = OpenAI()

# Recommended: gpt-4o-mini-transcribe (cheaper, streaming-capable)
with open("meeting-recording.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="gpt-4o-mini-transcribe",  # $0.003/min — half the price of whisper-1
        file=audio_file,
        response_format="json",  # the GPT-4o transcribe models support json/text only
    )

print(transcription.text)

# Word-level timestamps currently require whisper-1 with verbose_json:
with open("meeting-recording.mp3", "rb") as audio_file:
    verbose = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="verbose_json",
        timestamp_granularities=["word"],
    )

for word in verbose.words:
    print(f"{word.start:.2f}s - {word.end:.2f}s: {word.word}")

# Real-time streaming with gpt-4o-transcribe (Realtime WebSocket API)
import os
import websocket  # websocket-client package

OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]

ws = websocket.WebSocket()
ws.connect(
    "wss://api.openai.com/v1/realtime?model=gpt-4o-transcribe",
    header={"Authorization": f"Bearer {OPENAI_API_KEY}", "OpenAI-Beta": "realtime=v1"},
)

# Stream base64-encoded audio chunks and receive incremental transcripts
# (WebSocket streaming pattern similar to other real-time STT APIs)

Strengths:

  • Simplest API integration in the group
  • gpt-4o-mini-transcribe at $0.003/min is now the cheapest managed option
  • 99+ languages supported
  • Word-level timestamps available
  • Open-source Whisper model available for self-hosting
  • GPT-4o models now support real-time streaming (March 2025)

Weaknesses:

  • 25MB file size limit per request
  • No speaker diarization built-in
  • No LLM-powered post-processing features (unlike AssemblyAI LeMUR)
  • whisper-1 is batch-only (use gpt-4o-mini-transcribe for streaming)

When to use OpenAI transcription: Budget batch transcription (gpt-4o-mini-transcribe at $0.003/min), or teams already integrated with OpenAI who want streaming without adopting another SDK. Not the best choice if you need diarization or post-processing.

Deepgram Nova-3

Deepgram built its own speech recognition architecture from scratch. Nova-3 reached General Availability in February 2025 with the strongest batch WER benchmark in this comparison (5.26%) and a standout feature: self-serve custom keyword injection — up to 100 domain-specific terms without model retraining. Nova-3 Medical launched with 3.44% WER for healthcare audio.

from deepgram import DeepgramClient, PrerecordedOptions

# Batch transcription
client = DeepgramClient(api_key="YOUR_DEEPGRAM_API_KEY")

with open("audio.mp3", "rb") as audio_file:
    response = client.listen.rest.v("1").transcribe_file(
        {"buffer": audio_file, "mimetype": "audio/mp3"},
        PrerecordedOptions(
            model="nova-3",
            smart_format=True,
            diarize=True,  # Speaker detection
            punctuate=True,
            utterances=True,
        )
    )

# Access structured results
for utterance in response.results.utterances:
    print(f"Speaker {utterance.speaker}: {utterance.transcript}")
# Real-time streaming with Deepgram
import asyncio
import json
import websockets

DEEPGRAM_API_KEY = "YOUR_DEEPGRAM_API_KEY"

async def transcribe_stream(audio_source):
    """audio_source: an async iterator yielding raw audio chunks (e.g. from a mic)."""
    uri = "wss://api.deepgram.com/v1/listen?model=nova-3&smart_format=true&diarize=true"
    headers = {"Authorization": f"Token {DEEPGRAM_API_KEY}"}

    # Note: older websockets releases use extra_headers=; newer ones use additional_headers=
    async with websockets.connect(uri, extra_headers=headers) as ws:

        async def sender():
            async for audio_chunk in audio_source:
                await ws.send(audio_chunk)
            await ws.send(json.dumps({"type": "CloseStream"}))

        async def receiver():
            async for message in ws:
                result = json.loads(message)
                if result.get("is_final"):
                    transcript = result["channel"]["alternatives"][0]["transcript"]
                    if transcript:
                        print(f"Final: {transcript}")

        # Send and receive concurrently — interleaving is what makes it real-time
        await asyncio.gather(sender(), receiver())

Strengths:

  • Best-in-class real-time streaming latency (sub-300ms to first word)
  • Strong accuracy on conversational speech, call center audio, and domain vocabulary
  • Speaker diarization included
  • Keyword boosting for domain-specific terms
  • Language detection (identify the language automatically)
  • Good filler word detection (um, uh, like)
  • Reliable WebSocket streaming API

Weaknesses:

  • More expensive than Whisper for simple batch transcription
  • Fewer LLM-powered post-processing features than AssemblyAI
  • Less multilingual strength than Google Chirp 3

When to use Deepgram: Real-time transcription (live captioning, voice assistants, call center analytics) where latency is critical. Best batch WER in this group at 5.26%. Strong for call centers using domain vocabulary (custom keyword injection without retraining). Note: for pure cost on real-time streaming, AssemblyAI Universal-Streaming at $0.15/hr is cheaper than Deepgram Nova-3 at $0.462/hr — evaluate if the latency difference matters for your use case.
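Keyword injection is passed as request parameters rather than via retraining. A sketch of building the streaming URL with repeated `keyterm` parameters — the parameter name follows Deepgram's Nova-3 keyterm prompting docs, so verify it against the current API reference:

```python
from urllib.parse import urlencode

def deepgram_stream_url(model: str, keyterms: list[str]) -> str:
    """Build the wss:// listen URL with repeated keyterm params for domain vocabulary."""
    params = [("model", model), ("smart_format", "true")]
    params += [("keyterm", term) for term in keyterms[:100]]  # Nova-3 caps at 100 terms
    return "wss://api.deepgram.com/v1/listen?" + urlencode(params)

# Medical terms the base model might otherwise mis-transcribe
url = deepgram_stream_url("nova-3", ["metoprolol", "troponin", "STEMI"])
print(url)
```

The same terms can be sent with batch requests; the point is that no custom model training is involved.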

AssemblyAI

AssemblyAI's pricing structure has a hidden advantage: Universal-Streaming at $0.15/hr is 3x cheaper than Deepgram Nova-3 for real-time ($0.462/hr). For teams that need streaming + features, this changes the math significantly. They also launched Slam-1 in October 2025 — a new speech-language model that goes beyond transcription toward deeper audio understanding.

import assemblyai as aai

aai.settings.api_key = "YOUR_ASSEMBLYAI_API_KEY"

transcriber = aai.Transcriber()

# Transcription with all features enabled
transcript = transcriber.transcribe(
    "https://example.com/podcast-episode.mp3",
    config=aai.TranscriptionConfig(
        speech_model=aai.SpeechModel.universal,  # Universal-2
        speaker_labels=True,   # Diarization
        auto_chapters=True,    # Chapter detection with summaries
        sentiment_analysis=True,
        auto_highlights=True,  # Key phrases
        entity_detection=True, # Named entities
        iab_categories=True,   # Topic classification
    )
)

# Access speakers
for utterance in transcript.utterances:
    print(f"Speaker {utterance.speaker}: {utterance.text}")

# Auto-chapters
for chapter in transcript.chapters:
    print(f"{chapter.start/1000:.0f}s - {chapter.end/1000:.0f}s: {chapter.headline}")

# Ask questions about the transcript via LeMUR
result = transcript.lemur.task(
    "What were the three main topics discussed and what conclusions were reached on each?",
    final_model=aai.LemurModel.claude3_5_sonnet
)
print(result.response)

Strengths:

  • Best post-processing feature set: auto chapters, sentiment analysis, entity detection, topic classification, PII redaction
  • LeMUR: ask free-form questions about transcribed content using Claude or other LLMs
  • Strong real-time streaming (WebSocket)
  • Universal-2 model is highly accurate on diverse audio types
  • Best for podcast/interview/meeting analysis workflows

Weaknesses:

  • Most expensive per-minute for raw transcription
  • LeMUR adds additional token cost on top of transcription
  • More complex setup than Whisper
  • Fewer language options than Google

When to use AssemblyAI: Meeting intelligence, podcast workflows, qualitative research, customer call analysis needing structured insights. Also the best-value real-time streaming choice at $0.15/hr. LeMUR is genuinely unique — ask "what action items came up in this meeting?" and get structured answers. Note: Universal-Streaming currently supports 6 languages only; use Universal (batch) for broader language coverage.
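Both AssemblyAI and Deepgram return flat utterance lists, but downstream UIs usually want merged speaker turns. A minimal, SDK-agnostic sketch — the `Utterance` dataclass here is a hypothetical stand-in for the SDK's utterance objects:

```python
from dataclasses import dataclass

@dataclass
class Utterance:
    """Hypothetical stand-in for an SDK utterance (speaker label + text)."""
    speaker: str
    text: str

def merge_turns(utterances: list[Utterance]) -> list[Utterance]:
    """Collapse consecutive utterances from the same speaker into single turns."""
    turns: list[Utterance] = []
    for u in utterances:
        if turns and turns[-1].speaker == u.speaker:
            turns[-1].text += " " + u.text  # extend the current speaker's turn
        else:
            turns.append(Utterance(u.speaker, u.text))  # new speaker, new turn
    return turns

turns = merge_turns([
    Utterance("A", "Hi there."), Utterance("A", "How are you?"),
    Utterance("B", "Fine, thanks."),
])
print([(t.speaker, t.text) for t in turns])  # two turns: A then B
```

Adapting it to a real SDK is a matter of mapping the field names (`utterance.text` for AssemblyAI, `utterance.transcript` for Deepgram).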

Google Speech-to-Text (Chirp 3)

Google launched Chirp 3 in March 2025 — the same month OpenAI launched its GPT-4o transcription models. Chirp 3 adds a built-in audio denoiser, native diarization, and automatic language detection. For non-English audio, Google's multilingual training scale (28 billion text sentences across 100+ languages) gives it an edge no specialist provider can match. The Dynamic Batch option at $0.004/min (75% off standard) makes Google cost-competitive for non-real-time workloads.

from google.cloud import speech_v2

client = speech_v2.SpeechClient()
project_id = "your-project-id"

# Batch transcription
with open("audio.wav", "rb") as f:
    audio_data = f.read()

config = speech_v2.RecognitionConfig(
    auto_decoding_config=speech_v2.AutoDetectDecodingConfig(),
    language_codes=["en-US"],
    model="chirp_2",  # substitute the newest Chirp model ID available in your region
    features=speech_v2.RecognitionFeatures(
        enable_automatic_punctuation=True,
        enable_word_time_offsets=True,
        diarization_config=speech_v2.SpeakerDiarizationConfig(
            min_speaker_count=2,
            max_speaker_count=2,
        ),
    )
)

request = speech_v2.RecognizeRequest(
    recognizer=f"projects/{project_id}/locations/global/recognizers/_",
    config=config,
    content=audio_data,
)

response = client.recognize(request=request)
for result in response.results:
    print(result.alternatives[0].transcript)

Strengths:

  • Best multilingual support (100+ languages with Chirp)
  • Strong for code-switched audio (multiple languages in same recording)
  • Streaming available via gRPC
  • Deep GCP integration (Cloud Storage, BigQuery, Pub/Sub)
  • Speaker diarization available
  • Proven production reliability (powers Google's own products)

Weaknesses:

  • Most expensive managed API in this comparison
  • GCP setup required (service accounts, project configuration)
  • More complex SDK/client library compared to Whisper or Deepgram
  • Overkill for English-only use cases

When to use Google Chirp 3: Non-English or multilingual audio (85+ languages, best multilingual accuracy). Applications where audio quality varies and the built-in denoiser helps. GCP-native applications. Dynamic Batch at $0.004/min makes it competitive for non-real-time workloads where 24-hour turnaround is acceptable.

Accuracy Benchmarks: Word Error Rate (WER)

Lower WER = better accuracy. Approximate WER figures on common benchmarks:

| Provider | Model | Batch WER | Streaming WER | Multilingual |
|---|---|---|---|---|
| Deepgram | Nova-3 | 5.26% | 6.84% | Good (Nova-3 multilingual, 6+ languages) |
| AssemblyAI | Universal-2 | ~5.9% | ~7–9% | Good (99 languages batch; 6 languages streaming) |
| OpenAI | gpt-4o-transcribe | ~5–7% | ~6–8% | Strong (99+ languages) |
| OpenAI | gpt-4o-mini-transcribe | ~6–8% | ~7–9% | Strong (99+ languages) |
| Google | Chirp 3 | ~8–11% | ~9–12% | Best (85+ languages with denoising) |

Deepgram WER figures from Deepgram's own benchmarks (2,703 files, 81.69 hours). Independent benchmarks show smaller gaps. Always test on your own audio.

Note: WER varies significantly by audio quality, accent, domain, and benchmark. These are approximate figures from published evaluations; your domain's accuracy may differ substantially.

No single model wins on accuracy across all conditions. For your production use case, benchmark with a sample of your own audio before committing.
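Benchmarking on your own audio only requires the metric itself: WER is the word-level edit distance (substitutions + insertions + deletions) divided by the reference word count. A minimal sketch — production evaluations should add normalization for punctuation, casing, and numerals, which this omits beyond lowercasing:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference words."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Standard edit-distance dynamic program over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the quick brown fox", "the quick brown box"))  # 1 sub / 4 words = 0.25
```

Run each candidate API over the same held-out sample of your production audio and compare the resulting WERs directly.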

Real-Time Streaming Comparison

| Provider | Model | Protocol | Time to First Word | Monthly Cost (1K hr) |
|---|---|---|---|---|
| Deepgram | Nova-3 | WebSocket | ~200–300ms | ~$462 |
| AssemblyAI | Universal-Streaming | WebSocket | ~300–500ms | ~$150 |
| Google | Chirp 3 | gRPC | ~400–700ms | ~$960 |
| OpenAI | gpt-4o-transcribe | WebSocket | ~300–500ms | ~$360 |
| OpenAI | whisper-1 | Batch only | N/A | ~$360 |

Deepgram has the lowest latency — meaningful for voice assistants and real-time captioning. But AssemblyAI Universal-Streaming at $0.15/hr undercuts Deepgram by 3x for teams where ~300ms additional latency is acceptable. For call center analytics where you don't need sub-300ms, AssemblyAI is the cost winner.
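Whichever streaming API you choose, you will be sending fixed-duration PCM frames, and frame size is a latency/overhead trade-off: smaller frames cut time-to-first-word but increase message overhead. A quick sizing calculation, assuming 16 kHz 16-bit mono PCM (a common streaming STT input format):

```python
def chunk_bytes(sample_rate_hz: int, sample_width_bytes: int,
                channels: int, chunk_ms: int) -> int:
    """Bytes per audio chunk for raw PCM at the given frame duration."""
    return sample_rate_hz * sample_width_bytes * channels * chunk_ms // 1000

# 16 kHz, 16-bit (2 bytes), mono, 100 ms frames
print(chunk_bytes(16_000, 2, 1, 100))  # 3200 bytes per chunk

# Telephony audio: 8 kHz, 16-bit mono, 20 ms frames
print(chunk_bytes(8_000, 2, 1, 20))  # 320 bytes per chunk
```

Frames in the 20–100 ms range are typical; check each provider's docs for its preferred chunk duration.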

Self-Hosting with Whisper

For very high-volume batch transcription, self-hosting Whisper on your own GPU infrastructure can dramatically reduce costs:

from faster_whisper import WhisperModel

# large-v3-turbo: 6x faster than large-v3, <2% accuracy loss — the practical default
model = WhisperModel("large-v3-turbo", device="cuda", compute_type="float16")

# Transcribe
segments, info = model.transcribe(
    "audio.mp3",
    beam_size=5,
    word_timestamps=True,
    vad_filter=True,  # Voice activity detection — skip silence
)

for segment in segments:
    print(f"[{segment.start:.2f}s → {segment.end:.2f}s] {segment.text}")

faster-whisper (CTranslate2-based) runs Whisper large-v3 4–8x faster than the original implementation and requires ~6GB VRAM. On a single A100 GPU at ~$2/hour, throughput on the order of ~1,000 minutes of audio per hour of compute works out to roughly $0.002/minute — a third of whisper-1's $0.006/minute. The savings typically start justifying the operational overhead somewhere around 500 hours/month.
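The exact crossover depends on your GPU rate, realized throughput, and the ops overhead you attribute to self-hosting. A sketch with illustrative numbers — the 17x realtime factor and $100/month overhead are assumptions for the example, not benchmarks:

```python
def breakeven_hours(api_rate_per_min: float, gpu_rate_per_hour: float,
                    realtime_factor: float, fixed_monthly_ops: float) -> float:
    """Monthly audio hours at which self-hosting matches the managed API.

    realtime_factor: hours of audio transcribed per GPU-hour (assumed throughput).
    fixed_monthly_ops: engineering/maintenance cost attributed per month (USD).
    """
    api_per_hour = api_rate_per_min * 60
    gpu_per_audio_hour = gpu_rate_per_hour / realtime_factor
    if api_per_hour <= gpu_per_audio_hour:
        return float("inf")  # the managed API is already cheaper per audio hour
    return fixed_monthly_ops / (api_per_hour - gpu_per_audio_hour)

# whisper-1 at $0.006/min vs an A100 at ~$2/h doing ~17x realtime, $100/mo ops overhead
print(round(breakeven_hours(0.006, 2.0, 17, 100)))
```

With these assumptions the break-even lands in the low hundreds of hours per month, consistent with the rough 500 hours/month rule of thumb once you budget more ops overhead.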

Cost Scenarios

Scenario 1: Podcast transcription (batch) — 100 hours/month

| Provider | Monthly Cost | Notes |
|---|---|---|
| gpt-4o-mini-transcribe | 100h × 60 × $0.003 = $18 | Best value for batch |
| Deepgram Nova-3 | 100h × 60 × $0.0077 = $46.20 | Best accuracy |
| AssemblyAI Universal | 100h × $0.37 = $37 | Best features |
| Google Chirp 3 Dynamic Batch | 100h × 60 × $0.004 = $24 | Best for multilingual |
| Google Chirp 3 Standard | 100h × 60 × $0.016 = $96 | |
| Whisper self-hosted | ~$15–30 (GPU cost) | |

Scenario 2: Call center — 1,000 hours/month real-time streaming

| Provider | Monthly Cost | Streaming Latency |
|---|---|---|
| AssemblyAI Universal-Streaming | 1,000h × $0.15 = $150 | ~300–500ms |
| Deepgram Nova-3 | 1,000h × $0.462 = $462 | ~200–300ms |
| OpenAI gpt-4o-transcribe | 1,000h × 60 × $0.006 = $360 | ~300–500ms |
| Google Chirp 3 | 1,000h × $0.96 = $960 | ~400–700ms |

AssemblyAI Universal-Streaming's pricing advantage is substantial at scale. For call centers where ~300ms latency is acceptable and features (diarization, summaries) matter, it's the clear winner. Deepgram is worth the premium only when sub-300ms latency is a hard requirement.

When to Choose Each

Choose OpenAI (gpt-4o-mini-transcribe) if:

  • Best price for managed batch transcription ($0.003/min — cheapest managed option)
  • Already integrated with OpenAI and want one SDK
  • Need streaming but don't want Deepgram or AssemblyAI onboarding
  • English audio, moderate volume, simplicity priority
  • Note: use gpt-4o-mini-transcribe not whisper-1 — better accuracy, same or lower cost, streaming support

Choose Deepgram if:

  • Real-time streaming is required (live captions, voice assistants, call center)
  • Sub-300ms latency — the lowest among managed streaming APIs
  • Speaker diarization is needed at scale
  • Call center, customer service, or live meeting transcription

Choose AssemblyAI if:

  • You need post-processing features: summaries, chapters, sentiment, entities
  • Building meeting intelligence or podcast analysis workflows
  • LeMUR (ask questions about audio content) is valuable for your use case
  • You want the richest feature set and can pay the premium

Choose Google Chirp 3 if:

  • Non-English or multilingual audio is primary (85+ languages, best multilingual accuracy)
  • Audio quality varies and the built-in denoiser helps
  • You're already on GCP and want ecosystem integration
  • Non-real-time workloads where Dynamic Batch pricing ($0.004/min) makes Google cost-competitive

Compare STT API pricing, accuracy, and uptime at APIScout.

Related: ElevenLabs vs Cartesia: Best Voice AI API 2026 · Best Free APIs for Developers 2026
