Speech-to-Text APIs (2026)

Deepgram Nova-3: 5.26% WER, best real-time. AssemblyAI streaming cheapest at $0.15/hr. OpenAI's new gpt-4o-transcribe beats Whisper. 2026 STT API comparison.

APIScout Team

The Speech API Market Has Stratified

In 2022, OpenAI disrupted the speech recognition market by open-sourcing Whisper, a model that matched commercial APIs at zero licensing cost for self-hosting. By 2025–2026 the market has responded: Deepgram, AssemblyAI, and Google have released specialized models that beat Whisper on specific tasks (real-time streaming, speaker diarization, domain-specific accuracy) while remaining cost-competitive at scale.

The choice now depends heavily on your use case: batch transcription, real-time streaming, feature richness, or cost minimization.

TL;DR

Deepgram Nova-3 leads on batch accuracy (5.26% WER) and real-time streaming latency at $0.0077/min. AssemblyAI wins on features AND is surprisingly cheap for streaming at $0.15/hr (Universal-Streaming) — cheaper than Deepgram for real-time workloads. OpenAI's new gpt-4o-mini-transcribe ($0.003/min, launched March 2025) is the budget batch pick — half the price of whisper-1 with better accuracy and streaming support. Google Chirp 3 (also March 2025) leads on multilingual accuracy with a built-in denoiser.

Key Takeaways

  • Deepgram Nova-3: $0.0077/min; 5.26% WER (batch); best real-time streaming latency; Nova-3 Medical at 3.44% WER; GA February 2025
  • AssemblyAI: $0.37/hr pre-recorded; $0.15/hr streaming (cheaper than Deepgram for real-time); LeMUR for LLM analysis; Slam-1 launched October 2025
  • OpenAI: whisper-1 $0.006/min (batch only); gpt-4o-mini-transcribe $0.003/min with streaming (launched March 2025, now recommended)
  • Google Chirp 3: $0.016/min standard; $0.004/min Dynamic Batch (75% off); built-in denoiser; native diarization; launched March 2025
  • WER benchmarks: Deepgram Nova-3 at 5.26%, AssemblyAI Universal-2 at 5.9%, gpt-4o-transcribe ~5-7%; Google Chirp 3 multilingual is best
  • Self-hosting: Whisper large-v3-turbo is the practical choice — 6x faster than large-v3 with <2% accuracy loss

Pricing Comparison

| Provider | Model | Price per Minute | Free Tier | Streaming |
|---|---|---|---|---|
| OpenAI | whisper-1 (legacy) | $0.006/min | None | No (batch only) |
| OpenAI | gpt-4o-mini-transcribe ⭐ | $0.003/min | None | Yes |
| OpenAI | gpt-4o-transcribe | $0.006/min | None | Yes |
| Deepgram | Nova-3 | $0.0077/min | $200 credit | Yes |
| AssemblyAI | Universal (pre-recorded) | ~$0.0062/min ($0.37/hr) | $50 credit | No |
| AssemblyAI | Universal-Streaming ⭐ | $0.0025/min ($0.15/hr) | $50 credit | Yes |
| Google | Chirp 3 | $0.016/min | 60 min/month | Yes |
| Google | Chirp 3 Dynamic Batch | $0.004/min | 60 min/month | No |
| Whisper (self-hosted) | large-v3-turbo | Infrastructure only | | |

Pricing Notes

OpenAI now has three transcription models (as of March 2025):

  • whisper-1: $0.006/min, batch only, still works but no longer recommended
  • gpt-4o-transcribe: $0.006/min, same price as whisper-1 but ~35% better accuracy + streaming
  • gpt-4o-mini-transcribe: $0.003/min, streaming supported, now the recommended choice for most use cases

Deepgram pricing varies by plan:

  • Nova-3 pay-as-you-go: $0.0077/min ($0.462/hr)
  • Nova-3 Growth plan: $0.0065/min ($0.39/hr)
  • Free tier: $200 in credits (~433 hours at pay-as-you-go rates)

AssemblyAI has two distinct pricing tiers worth knowing:

  • Universal (pre-recorded/batch): $0.37/hr — best accuracy + features
  • Universal-Streaming: $0.15/hr — cheaper than Deepgram for real-time; 6 languages (English, Spanish, French, German, Italian, Portuguese)
  • Nano: $0.12/hr — lower accuracy, budget option
  • LeMUR (LLM analysis): token-based on top of transcription cost
  • Free tier: $50 in credits (~135 hours Universal or 333 hours streaming)

Google Speech-to-Text (Chirp 3 is now current, launched March 2025):

  • Standard rate: $0.016/min (most expensive in this group)
  • Dynamic Batch (async, 24-hour SLA): $0.004/min — 75% discount for non-real-time workloads
  • Free tier: 60 minutes/month (permanent, not trial credit)
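
The per-minute, per-hour, and free-credit figures in these notes are related by simple arithmetic. The sketch below shows the conversion; the helper names are illustrative, and the rates are the ones quoted in this article, so check them against the providers' current pricing pages before relying on them.

```python
def per_hour(price_per_min: float) -> float:
    """Convert a per-minute price to a per-hour price."""
    return price_per_min * 60


def credit_hours(credit_usd: float, price_per_hour: float) -> float:
    """Hours of audio that a free credit covers at a given hourly price."""
    return credit_usd / price_per_hour


# Rates quoted in this article (USD).
deepgram_hourly = per_hour(0.0077)  # Nova-3 pay-as-you-go
print(f"Deepgram Nova-3: ${deepgram_hourly:.3f}/hr")
print(f"Deepgram $200 credit: {credit_hours(200, deepgram_hourly):.0f} hours")
print(f"AssemblyAI $50 credit, streaming: {credit_hours(50, 0.15):.0f} hours")
```

Running the same two functions over every provider's rate is a quick way to sanity-check a pricing table before building a budget on it.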

OpenAI: Whisper → GPT-4o Transcribe Models

OpenAI launched gpt-4o-transcribe and gpt-4o-mini-transcribe in March 2025. whisper-1 still works, but OpenAI now recommends the GPT-4o-based models. The key improvements: ~35% lower WER than whisper-1, and both support real-time streaming via WebSocket — something whisper-1 never had.

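from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Recommended: gpt-4o-mini-transcribe (cheaper, streaming-capable)
with open("meeting-recording.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="gpt-4o-mini-transcribe",  # $0.003/min, half the price of whisper-1
        file=audio_file,
        response_format="json",  # the gpt-4o transcribe models accept json or text only
    )

print(transcription.text)

# Word-level timestamps require whisper-1 with verbose_json
with open("meeting-recording.mp3", "rb") as audio_file:
    detailed = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="verbose_json",
        timestamp_granularities=["word", "segment"],
    )

for word in detailed.words:
    print(f"{word.start:.2f}s - {word.end:.2f}s: {word.word}")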

# Real-time streaming with gpt-4o-transcribe (WebSocket)
import base64
import json
import os

import websocket  # pip install websocket-client

OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]

ws = websocket.WebSocket()
# Check the current Realtime API docs for the exact endpoint and session setup
ws.connect(
    "wss://api.openai.com/v1/realtime?model=gpt-4o-transcribe",
    header={"Authorization": f"Bearer {OPENAI_API_KEY}", "OpenAI-Beta": "realtime=v1"}
)

# Stream audio chunks and receive transcripts
# (WebSocket streaming pattern similar to other real-time STT APIs)
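
The streaming snippet imports `base64` and `json` but stops before sending any audio. As a hedged sketch of the missing step: the Realtime API takes audio as base64 text inside an `input_audio_buffer.append` event. The helper name below is hypothetical, and the expected audio format and session configuration are assumptions to confirm against OpenAI's current documentation.

```python
import base64
import json


def audio_append_event(audio_chunk: bytes) -> str:
    """Wrap raw audio bytes in an input_audio_buffer.append event (JSON text)."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(audio_chunk).decode("ascii"),
    })


# With an open connection you would call: ws.send(audio_append_event(chunk))
event = json.loads(audio_append_event(b"\x00\x01" * 4))
print(event["type"], len(event["audio"]))
```

Keeping the encoding in a small pure function makes it easy to unit-test without opening a socket.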

Strengths:

  • Simplest API integration in the group
  • gpt-4o-mini-transcribe at $0.003/min is the cheapest full-accuracy managed batch option here (AssemblyAI Nano is cheaper at $0.12/hr, with lower accuracy)
  • 99+ languages supported
  • Word-level timestamps available (whisper-1 with verbose_json)
  • Open-source Whisper model available for self-hosting
  • GPT-4o models now support real-time streaming (March 2025)

Weaknesses:

  • 25MB file size limit per request
  • No speaker diarization built-in
  • No LLM-powered post-processing features (unlike AssemblyAI LeMUR)
  • whisper-1 is batch-only (use gpt-4o-mini-transcribe for streaming)

When to use OpenAI transcription: Budget batch transcription (gpt-4o-mini-transcribe at $0.003/min), or teams already integrated with OpenAI who want streaming without adopting another SDK. Not the best choice if you need diarization or post-processing.
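
The 25MB per-request limit means long recordings have to be split before upload. The sketch below only plans the chunk boundaries. It assumes a constant bitrate, treats the limit as 25,000,000 bytes with a 10% safety margin, and leaves the actual cutting to a tool such as ffmpeg. The function name and defaults are illustrative.

```python
import math


def plan_chunks(duration_s: float, bitrate_kbps: float,
                limit_bytes: int = 25_000_000, margin: float = 0.9):
    """Return (start, end) second offsets so each chunk stays under the limit."""
    bytes_per_second = bitrate_kbps * 1000 / 8
    max_chunk_s = (limit_bytes * margin) / bytes_per_second
    n_chunks = max(1, math.ceil(duration_s / max_chunk_s))
    chunk_s = duration_s / n_chunks  # equal-length chunks
    return [(i * chunk_s, min((i + 1) * chunk_s, duration_s))
            for i in range(n_chunks)]


# A one-hour recording at 128 kbps is about 57.6 MB, so it needs three chunks.
for start, end in plan_chunks(3600, 128):
    print(f"{start:7.1f}s -> {end:7.1f}s")
```

Cutting at fixed offsets can split a word in half. In practice you may prefer to cut at silences or overlap chunks by a second or two and de-duplicate the text.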

Deepgram Nova-3

Deepgram built its own speech recognition architecture from scratch. Nova-3 reached General Availability in February 2025 with the strongest batch WER benchmark in this comparison (5.26%) and a standout feature: self-serve custom keyword injection of up to 100 domain-specific terms without model retraining. Nova-3 Medical reports 3.44% WER on healthcare audio.

from deepgram import DeepgramClient, PrerecordedOptions

# Batch transcription
client = DeepgramClient(api_key="YOUR_DEEPGRAM_API_KEY")

with open("audio.mp3", "rb") as audio_file:
    payload = {"buffer": audio_file.read()}  # the SDK expects raw bytes here

response = client.listen.rest.v("1").transcribe_file(
    payload,
    PrerecordedOptions(
        model="nova-3",
        smart_format=True,
        diarize=True,  # Speaker detection
        punctuate=True,
        utterances=True,
    )
)

# Access structured results
for utterance in response.results.utterances:
    print(f"Speaker {utterance.speaker}: {utterance.transcript}")
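
# Real-time streaming with Deepgram
import asyncio
import json
import os

import websockets

DEEPGRAM_API_KEY = os.environ["DEEPGRAM_API_KEY"]

async def transcribe_stream(audio_source):
    # audio_source: an async iterator of audio byte chunks that you supply
    uri = "wss://api.deepgram.com/v1/listen?model=nova-3&smart_format=true&diarize=true"
    headers = {"Authorization": f"Token {DEEPGRAM_API_KEY}"}

    # websockets 14+ uses additional_headers; older releases call it extra_headers
    async with websockets.connect(uri, additional_headers=headers) as ws:

        async def send_audio():
            async for audio_chunk in audio_source:
                await ws.send(audio_chunk)
            # Tell Deepgram the stream is finished so it flushes final results
            await ws.send(json.dumps({"type": "CloseStream"}))

        async def receive_transcripts():
            async for message in ws:
                result = json.loads(message)
                if result.get("is_final"):
                    transcript = result["channel"]["alternatives"][0]["transcript"]
                    if transcript:
                        print(f"Final: {transcript}")

        # Send and receive concurrently so transcripts arrive while audio streams
        await asyncio.gather(send_audio(), receive_transcripts())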
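
The streaming snippet relies on an audio source that yields byte chunks but does not define one. One hypothetical way to supply it, for testing against a recorded file, is an async generator like the sketch below. The function names, chunk size, and pacing delay are assumptions rather than anything Deepgram prescribes.

```python
import asyncio
from typing import AsyncIterator


async def file_audio_source(path: str, chunk_bytes: int = 4096,
                            delay_s: float = 0.0) -> AsyncIterator[bytes]:
    """Yield a file's bytes in fixed-size chunks, optionally paced like live audio."""
    with open(path, "rb") as f:
        while chunk := f.read(chunk_bytes):
            yield chunk
            if delay_s:
                await asyncio.sleep(delay_s)  # simulate real-time arrival


async def count_chunks(path: str) -> int:
    """Small demo consumer: count how many chunks the source yields."""
    return len([chunk async for chunk in file_audio_source(path)])
```

Setting `delay_s` to the duration of one chunk approximates a live microphone feed, which gives more realistic latency measurements than sending the whole file at once.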

Strengths:

  • Best-in-class real-time streaming latency (sub-300ms to first word)
  • Strong accuracy on conversational speech, call center audio, and domain vocabulary
  • Speaker diarization included
  • Keyword boosting for domain-specific terms
  • Language detection (identify the language automatically)
  • Good filler word detection (um, uh, like)
  • Reliable WebSocket streaming API

Weaknesses:

  • More expensive than Whisper for simple batch transcription
  • Fewer LLM-powered post-processing features than AssemblyAI
  • Less multilingual strength than Google Chirp 3

When to use Deepgram: Real-time transcription (live captioning, voice assistants, call center analytics) where latency is critical. Best batch WER in this group at 5.26%. Strong for call centers using domain vocabulary (custom keyword injection without retraining). Note: for pure cost on real-time streaming, AssemblyAI Universal-Streaming at $0.15/hr is cheaper than Deepgram Nova-3 at $0.462/hr — evaluate if the latency difference matters for your use case.
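
Custom keyword injection is passed as request parameters, so a small URL builder is enough to try it. This sketch assumes the Nova-3 parameter is named `keyterm` and repeats once per term; that name, the helper, and the sample terms are assumptions to confirm in Deepgram's documentation. The 100-term cap is the limit cited in this article.

```python
from urllib.parse import urlencode


def deepgram_listen_url(keyterms, model="nova-3", max_terms=100):
    """Build a /v1/listen URL with one keyterm parameter per domain term."""
    # Drop blanks and duplicates while keeping the original order
    terms = list(dict.fromkeys(t.strip() for t in keyterms if t.strip()))
    if len(terms) > max_terms:
        raise ValueError(f"{len(terms)} terms given; the limit cited here is {max_terms}")
    params = [("model", model), ("smart_format", "true")]
    params += [("keyterm", term) for term in terms]  # assumed parameter name
    return "wss://api.deepgram.com/v1/listen?" + urlencode(params)


print(deepgram_listen_url(["tretinoin", "prior authorization"]))
```

Validating the term count locally surfaces the limit as a clear error instead of a rejected connection.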

AssemblyAI

AssemblyAI's pricing structure has a hidden advantage: Universal-Streaming at $0.15/hr is 3x cheaper than Deepgram Nova-3 for real-time ($0.462/hr). For teams that need streaming + features, this changes the math significantly. They also launched Slam-1 in October 2025 — a new speech-language model that goes beyond transcription toward deeper audio understanding.

import assemblyai as aai

aai.settings.api_key = "YOUR_ASSEMBLYAI_API_KEY"

transcriber = aai.Transcriber()

# Transcription with all features enabled
transcript = transcriber.transcribe(
    "https://example.com/podcast-episode.mp3",
    config=aai.TranscriptionConfig(
        speech_model=aai.SpeechModel.universal,  # Universal-2
        speaker_labels=True,   # Diarization
        auto_chapters=True,    # Chapter detection with summaries
        sentiment_analysis=True,
        auto_highlights=True,  # Key phrases
        entity_detection=True, # Named entities
        iab_categories=True,   # Topic classification
    )
)

# Access speakers
for utterance in transcript.utterances:
    print(f"Speaker {utterance.speaker}: {utterance.text}")

# Auto-chapters
for chapter in transcript.chapters:
    print(f"{chapter.start/1000:.0f}s - {chapter.end/1000:.0f}s: {chapter.headline}")

# Ask questions about the transcript via LeMUR
result = transcript.lemur.task(
    "What were the three main topics discussed and what conclusions were reached on each?",
    final_model=aai.LemurModel.claude3_5_sonnet
)
print(result.response)

Strengths:

  • Best post-processing feature set: auto chapters, sentiment analysis, entity detection, topic classification, PII redaction
  • LeMUR: ask free-form questions about transcribed content using Claude or other LLMs
  • Strong real-time streaming (WebSocket)
  • Universal-2 model is highly accurate on diverse audio types
  • Best for podcast/interview/meeting analysis workflows

Weaknesses:

  • Not the cheapest for raw batch transcription ($0.37/hr, about $0.0062/min, vs. $0.003/min for gpt-4o-mini-transcribe)
  • LeMUR adds additional token cost on top of transcription
  • More complex setup than Whisper
  • Fewer language options than Google

When to use AssemblyAI: Meeting intelligence, podcast workflows, qualitative research, customer call analysis needing structured insights. Also the best-value real-time streaming choice at $0.15/hr. LeMUR is genuinely unique — ask "what action items came up in this meeting?" and get structured answers. Note: Universal-Streaming currently supports 6 languages only; use Universal (batch) for broader language coverage.

Google Speech-to-Text (Chirp 3)

Google launched Chirp 3 in March 2025 — the same month OpenAI launched its GPT-4o transcription models. Chirp 3 adds a built-in audio denoiser, native diarization, and automatic language detection. For non-English audio, Google's multilingual training scale (28 billion text sentences across 100+ languages) gives it an edge no specialist provider can match. The Dynamic Batch option at $0.004/min (75% off standard) makes Google cost-competitive for non-real-time workloads.

from google.api_core.client_options import ClientOptions
from google.cloud import speech_v2

project_id = "your-project-id"
location = "us"  # Chirp models are regional, not "global"; check which regions offer chirp_3

client = speech_v2.SpeechClient(
    client_options=ClientOptions(api_endpoint=f"{location}-speech.googleapis.com")
)

# Batch transcription
with open("audio.wav", "rb") as f:
    audio_data = f.read()

config = speech_v2.RecognitionConfig(
    auto_decoding_config=speech_v2.AutoDetectDecodingConfig(),
    language_codes=["en-US"],
    model="chirp_3",
    features=speech_v2.RecognitionFeatures(
        enable_automatic_punctuation=True,
        enable_word_time_offsets=True,
        # Speech-to-Text v2 configures diarization through diarization_config
        diarization_config=speech_v2.SpeakerDiarizationConfig(
            min_speaker_count=2,
            max_speaker_count=2,
        ),
    )
)

request = speech_v2.RecognizeRequest(
    recognizer=f"projects/{project_id}/locations/{location}/recognizers/_",
    config=config,
    content=audio_data,
)

response = client.recognize(request=request)
for result in response.results:
    print(result.alternatives[0].transcript)

Strengths:

  • Best multilingual support (100+ languages with Chirp)
  • Strong for code-switched audio (multiple languages in same recording)
  • Streaming available via gRPC
  • Deep GCP integration (Cloud Storage, BigQuery, Pub/Sub)
  • Speaker diarization available
  • Proven production reliability (powers Google's own products)

Weaknesses:

  • Most expensive managed API in this comparison
  • GCP setup required (service accounts, project configuration)
  • More complex SDK/client library compared to Whisper or Deepgram
  • Overkill for English-only use cases

When to use Google Chirp 3: Non-English or multilingual audio (85+ languages, best multilingual accuracy). Applications where audio quality varies and the built-in denoiser helps. GCP-native applications. Dynamic Batch at $0.004/min makes it competitive for non-real-time workloads where 24-hour turnaround is acceptable.

Accuracy Benchmarks: Word Error Rate (WER)

Lower WER = better accuracy. Approximate WER figures on common benchmarks:

| Provider | Model | Batch WER | Streaming WER | Multilingual |
|---|---|---|---|---|
| Deepgram | Nova-3 | 5.26% | 6.84% | Good (Nova-3 multilingual, 6+ languages) |
| AssemblyAI | Universal-2 | ~5.9% | ~7–9% | Good (99 languages batch; 6 languages streaming) |
| OpenAI | gpt-4o-transcribe | ~5–7% | ~6–8% | Strong (99+ languages) |
| OpenAI | gpt-4o-mini-transcribe | ~6–8% | ~7–9% | Strong (99+ languages) |
| Google | Chirp 3 | ~8–11% | ~9–12% | Best (85+ languages with denoising) |

Deepgram WER figures from Deepgram's own benchmarks (2,703 files, 81.69 hours). Independent benchmarks show smaller gaps. Always test on your own audio.

Note: WER varies significantly by audio quality, accent, domain, and benchmark. These are approximate figures from published evaluations; your domain's accuracy may differ substantially.

No single model wins on accuracy across all conditions. For your production use case, benchmark with a sample of your own audio before committing.
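
Benchmarking on your own audio comes down to computing WER between a trusted reference transcript and each provider's output. A minimal sketch using word-level edit distance follows. It only lowercases and splits on whitespace, whereas published benchmarks usually also normalize punctuation and numbers, so treat the absolute values as rough.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# One substitution (sat -> sit) and one deletion (the) over six reference words
print(f"{word_error_rate('the cat sat on the mat', 'the cat sit on mat'):.3f}")  # prints 0.333
```

Run it over the same set of files for every provider and compare the averages, since a single file can swing the result by several points.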

Real-Time Streaming Comparison

| Provider | Model | Protocol | Time to First Word | Monthly Cost (1K hr) |
|---|---|---|---|---|
| Deepgram | Nova-3 | WebSocket | ~200–300ms | ~$462 |
| AssemblyAI | Universal-Streaming | WebSocket | ~300–500ms | ~$150 |
| Google | Chirp 3 | gRPC | ~400–700ms | ~$960 |
| OpenAI | gpt-4o-transcribe | WebSocket | ~300–500ms | ~$360 |
| OpenAI | whisper-1 | Batch only | N/A | ~$360 |

Deepgram has the lowest latency, which matters for voice assistants and real-time captioning. But AssemblyAI Universal-Streaming at $0.15/hr undercuts Deepgram by about 3x for teams that can accept up to ~300ms of additional latency (~300–500ms versus ~200–300ms to first word). For call center analytics where you don't need sub-300ms, AssemblyAI is the cost winner.

Self-Hosting with Whisper

For very high-volume batch transcription, self-hosting Whisper on your own GPU infrastructure can dramatically reduce costs:

from faster_whisper import WhisperModel

# large-v3-turbo: 6x faster than large-v3, <2% accuracy loss — the practical default
model = WhisperModel("large-v3-turbo", device="cuda", compute_type="float16")

# Transcribe
segments, info = model.transcribe(
    "audio.mp3",
    beam_size=5,
    word_timestamps=True,
    vad_filter=True,  # Voice activity detection — skip silence
)

for segment in segments:
    print(f"[{segment.start:.2f}s → {segment.end:.2f}s] {segment.text}")

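faster-whisper (CTranslate2-based) runs Whisper large-v3 4-8x faster than the original implementation and requires ~6GB VRAM. Cost per audio minute is the hourly GPU price divided by the minutes of audio processed per hour, so it depends on the throughput you actually sustain. On a single A100 GPU at ~$2/hour, an effective throughput of about 17 hours of audio per hour of compute works out to $0.002/minute ($2 ÷ (17 × 60) ≈ $0.002) vs the Whisper API's $0.006/minute, with a crossover point of ~500 hours/month.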
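
The self-hosting economics reduce to two small formulas: cost per audio minute from GPU price and throughput, and the monthly volume at which savings cover fixed overhead. The sketch below uses hypothetical inputs (20x real-time throughput, $100/month of fixed overhead). Measure your own throughput and costs before drawing conclusions.

```python
def self_hosted_cost_per_min(gpu_usd_per_hour: float, realtime_factor: float) -> float:
    """USD per audio minute; realtime_factor = audio hours processed per GPU hour."""
    return gpu_usd_per_hour / (realtime_factor * 60)


def break_even_hours(fixed_usd_per_month: float, api_usd_per_min: float,
                     self_usd_per_min: float) -> float:
    """Monthly audio hours at which self-hosting matches the API bill."""
    saving_per_hour = (api_usd_per_min - self_usd_per_min) * 60
    if saving_per_hour <= 0:
        raise ValueError("self-hosting never breaks even at these rates")
    return fixed_usd_per_month / saving_per_hour


# Hypothetical inputs: $2/hr GPU, 20x real-time throughput, $100/month fixed overhead.
cost = self_hosted_cost_per_min(2.0, 20)
print(f"self-hosted: ${cost:.4f}/min")
print(f"break-even: {break_even_hours(100, 0.006, cost):.0f} hours/month")
```

The break-even point is very sensitive to the fixed overhead term, which is where engineering time and idle GPU hours belong.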

Cost Scenarios

Scenario 1: Podcast transcription (batch) — 100 hours/month

| Provider | Monthly Cost | Notes |
|---|---|---|
| gpt-4o-mini-transcribe | 100h × 60 × $0.003 = $18 | Best value for batch |
| Deepgram Nova-3 | 100h × 60 × $0.0077 = $46.20 | Best accuracy |
| AssemblyAI Universal | 100h × $0.37 = $37 | Best features |
| Google Chirp 3 Dynamic Batch | 100h × 60 × $0.004 = $24 | Best for multilingual |
| Google Chirp 3 Standard | 100h × 60 × $0.016 = $96 | |
| Whisper self-hosted | ~$15–30 (GPU cost) | |

Scenario 2: Call center — 1,000 hours/month real-time streaming

| Provider | Monthly Cost | Streaming Latency |
|---|---|---|
| AssemblyAI Universal-Streaming | 1,000 × $0.15 = $150 | ~300–500ms |
| Deepgram Nova-3 | 1,000 × $0.462 = $462 | ~200–300ms |
| OpenAI gpt-4o-transcribe | 1,000h × 60 × $0.006 = $360 | ~300–500ms |
| Google Chirp 3 | 1,000 × $0.96 = $960 | ~400–700ms |

AssemblyAI Universal-Streaming's pricing advantage is substantial at scale. For call centers where ~300–500ms latency is acceptable and features (diarization, summaries) matter, it's the clear winner. Deepgram is worth the premium only when sub-300ms latency is a hard requirement.
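
The streaming scenario can be recomputed for any monthly volume with a few lines. This sketch hard-codes the hourly rates quoted in this article (per-minute prices multiplied by 60), so treat them as assumptions that need refreshing whenever providers change pricing.

```python
HOURLY_RATES_USD = {
    "AssemblyAI Universal-Streaming": 0.15,
    "Deepgram Nova-3": 0.0077 * 60,
    "OpenAI gpt-4o-transcribe": 0.006 * 60,
    "Google Chirp 3": 0.016 * 60,
}


def monthly_costs(hours: float, rates=HOURLY_RATES_USD):
    """Return (provider, monthly USD) pairs sorted from cheapest to most expensive."""
    return sorted(((name, rate * hours) for name, rate in rates.items()),
                  key=lambda pair: pair[1])


for name, cost in monthly_costs(1_000):
    print(f"{name:32s} ${cost:,.0f}")
```

Because the ranking is purely linear in hours, the order never changes with volume. What changes is whether the absolute gap justifies a latency or feature trade-off.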

When to Choose Each

Choose OpenAI (gpt-4o-mini-transcribe) if:

  • You want the best price for full-accuracy managed batch transcription ($0.003/min)
  • Already integrated with OpenAI and want one SDK
  • Need streaming but don't want Deepgram or AssemblyAI onboarding
  • English audio, moderate volume, simplicity priority
  • Note: use gpt-4o-mini-transcribe not whisper-1 — better accuracy, same or lower cost, streaming support

Choose Deepgram if:

  • Real-time streaming is required (live captions, voice assistants, call center)
  • Sub-300ms latency matters more than the lowest streaming price (AssemblyAI Universal-Streaming is cheaper per hour)
  • Speaker diarization is needed at scale
  • Call center, customer service, or live meeting transcription

Choose AssemblyAI if:

  • You need post-processing features: summaries, chapters, sentiment, entities
  • Building meeting intelligence or podcast analysis workflows
  • LeMUR (ask questions about audio content) is valuable for your use case
  • You want the richest feature set, or the lowest-cost managed streaming ($0.15/hr)

Choose Google Chirp 3 if:

  • Non-English or multilingual audio is primary (85+ languages, best multilingual accuracy)
  • Audio quality varies and the built-in denoiser helps
  • You're already on GCP and want ecosystem integration
  • Non-real-time workloads where Dynamic Batch pricing ($0.004/min) makes Google cost-competitive
