Speech-to-Text APIs (2026)

The Speech API Market Has Stratified

OpenAI's Whisper, open-sourced in September 2022, disrupted the speech recognition market with a model that matched commercial APIs and cost nothing to license for self-hosting. In 2025–2026, the market has responded: Deepgram, AssemblyAI, and Google have released specialized models that beat Whisper on specific tasks (real-time streaming, speaker diarization, domain-specific accuracy) while remaining cost-competitive at scale.

The choice now depends heavily on your use case: batch transcription, real-time streaming, feature richness, or cost minimization.

TL;DR

Deepgram Nova-3 leads on batch accuracy (5.26% WER) and real-time streaming latency at $0.0077/min. AssemblyAI wins on features AND is surprisingly cheap for streaming at $0.15/hr (Universal-Streaming) — cheaper than Deepgram for real-time workloads. OpenAI's new gpt-4o-mini-transcribe ($0.003/min, launched March 2025) is the budget batch pick — half the price of whisper-1 with better accuracy and streaming support. Google Chirp 3 (also March 2025) leads on multilingual accuracy with a built-in denoiser.

Key Takeaways

Deepgram Nova-3: $0.0077/min; 5.26% WER (batch); best real-time streaming latency; Nova-3 Medical at 3.44% WER; GA February 2025
AssemblyAI: $0.37/hr pre-recorded; $0.15/hr streaming (cheaper than Deepgram for real-time); LeMUR for LLM analysis; Slam-1 launched October 2025
OpenAI: whisper-1 $0.006/min (batch only); gpt-4o-mini-transcribe $0.003/min with streaming (launched March 2025, now recommended)
Google Chirp 3: $0.016/min standard; $0.004/min Dynamic Batch (75% off); built-in denoiser; native diarization; launched March 2025
WER benchmarks: Deepgram Nova-3 at 5.26%, AssemblyAI Universal-2 at ~5.9%, gpt-4o-transcribe at ~5–7%; Google Chirp 3 is strongest on multilingual audio
Self-hosting: Whisper large-v3-turbo is the practical choice — 6x faster than large-v3 with <2% accuracy loss

Pricing Comparison

Provider	Model	Price per Minute	Free Tier	Streaming
OpenAI	whisper-1 (legacy)	$0.006/min	None	❌
OpenAI	gpt-4o-mini-transcribe ⭐	$0.003/min	None	✅
OpenAI	gpt-4o-transcribe	$0.006/min	None	✅
Deepgram	Nova-3	$0.0077/min	$200 credit	✅
AssemblyAI	Universal (pre-recorded)	~$0.0062/min ($0.37/hr)	$50 credit	✅
AssemblyAI	Universal-Streaming ⭐	$0.0025/min ($0.15/hr)	$50 credit	✅
Google	Chirp 3	$0.016/min	60 min/month	✅
Google	Chirp 3 Dynamic Batch	$0.004/min	60 min/month	❌
Whisper (self-hosted)	large-v3-turbo	Infrastructure only	—	❌

Pricing Notes

OpenAI now has three transcription models (as of March 2025):

whisper-1: $0.006/min, batch only, still works but no longer recommended
gpt-4o-transcribe: $0.006/min, same price as whisper-1 but ~35% lower WER, plus streaming
gpt-4o-mini-transcribe: $0.003/min, streaming supported, now the recommended choice for most use cases

Deepgram pricing varies by plan:

Nova-3 pay-as-you-go: $0.0077/min ($0.462/hr)
Nova-3 Growth plan: $0.0065/min ($0.39/hr)
Free tier: $200 in credits (~433 hours at pay-as-you-go rates)

AssemblyAI has two distinct pricing tiers worth knowing:

Universal (pre-recorded/batch): $0.37/hr — best accuracy + features
Universal-Streaming: $0.15/hr — cheaper than Deepgram for real-time; 6 languages (English, Spanish, French, German, Italian, Portuguese)
Nano: $0.12/hr — lower accuracy, budget option
LeMUR (LLM analysis): token-based on top of transcription cost
Free tier: $50 in credits (~135 hours Universal or ~333 hours streaming)

Google Speech-to-Text (Chirp 3 is now current, launched March 2025):

Standard rate: $0.016/min (most expensive in this group)
Dynamic Batch (async, 24-hour SLA): $0.004/min — 75% discount for non-real-time workloads
Free tier: 60 minutes/month (permanent, not trial credit)
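
Because the providers quote prices in different units (per minute versus per hour), it helps to normalize them before comparing. The following is a minimal sketch using only the rates quoted in the notes above; the dictionary keys and function names are illustrative, not part of any provider SDK.

```python
# Rates quoted in the pricing notes above, normalized to USD per hour of audio.
RATES_PER_HOUR = {
    "openai/gpt-4o-mini-transcribe": 0.003 * 60,
    "openai/whisper-1": 0.006 * 60,
    "deepgram/nova-3": 0.0077 * 60,
    "assemblyai/universal": 0.37,
    "assemblyai/universal-streaming": 0.15,
    "google/chirp-3": 0.016 * 60,
    "google/chirp-3-dynamic-batch": 0.004 * 60,
}


def per_minute(rate_per_hour: float) -> float:
    """Convert an hourly rate to a per-minute rate."""
    return rate_per_hour / 60


def hours_from_credit(credit_usd: float, rate_per_hour: float) -> float:
    """Hours of audio that a free credit covers at a given hourly rate."""
    return credit_usd / rate_per_hour


if __name__ == "__main__":
    # Print the providers from cheapest to most expensive per hour of audio.
    for name, rate in sorted(RATES_PER_HOUR.items(), key=lambda kv: kv[1]):
        print(f"{name:32s} ${rate:.3f}/hr  ${per_minute(rate):.4f}/min")
    deepgram_hours = hours_from_credit(200, RATES_PER_HOUR["deepgram/nova-3"])
    print(f"Deepgram $200 credit covers about {deepgram_hours:.0f} hours")
```

The same helper makes it easy to re-check free-tier claims whenever a provider changes its rates.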

OpenAI: Whisper → GPT-4o Transcribe Models

OpenAI launched gpt-4o-transcribe and gpt-4o-mini-transcribe in March 2025. whisper-1 still works, but OpenAI now recommends the GPT-4o-based models. The key improvements: ~35% lower WER than whisper-1, and both support real-time streaming via WebSocket — something whisper-1 never had.

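from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Recommended: gpt-4o-mini-transcribe (cheaper, streaming-capable)
with open("meeting-recording.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="gpt-4o-mini-transcribe",  # $0.003/min, half the price of whisper-1
        file=audio_file,
        response_format="json",  # the gpt-4o transcribe models return json or text only
    )

print(transcription.text)

# Word-level timestamps need verbose_json, which only whisper-1 supports
with open("meeting-recording.mp3", "rb") as audio_file:
    timed = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="verbose_json",
        timestamp_granularities=["word", "segment"],
    )

for word in timed.words:
    print(f"{word.start:.2f}s - {word.end:.2f}s: {word.word}")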

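# Real-time streaming with gpt-4o-transcribe (WebSocket)
import base64  # audio chunks are sent base64-encoded inside JSON events
import json
import os

import websocket  # pip install websocket-client

ws = websocket.WebSocket()
ws.connect(
    "wss://api.openai.com/v1/realtime?model=gpt-4o-transcribe",
    header={
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    },
)

# Stream audio chunks and receive transcripts
# (WebSocket streaming pattern similar to other real-time STT APIs).
# The connection URL and event names have changed between Realtime API versions,
# so check the current Realtime transcription guide before relying on this.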

Strengths:

Simplest API integration in the group
gpt-4o-mini-transcribe at $0.003/min is the cheapest managed batch model in the pricing table above
99+ languages supported
Word-level timestamps available (whisper-1 with verbose_json output)
Open-source Whisper model available for self-hosting
GPT-4o models now support real-time streaming (March 2025)

Weaknesses:

25MB file size limit per request
No speaker diarization built-in
No LLM-powered post-processing features (unlike AssemblyAI LeMUR)
whisper-1 is batch-only (use gpt-4o-mini-transcribe for streaming)

When to use OpenAI transcription: Budget batch transcription (gpt-4o-mini-transcribe at $0.003/min), or teams already integrated with OpenAI who want streaming without adopting another SDK. Not the best choice if you need diarization or post-processing.
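
The 25MB per-request limit means long recordings have to be split before upload. As a rough sketch, the maximum clip length can be estimated from the audio bitrate and used to plan chunk boundaries. This assumes a constant bitrate and reads "25MB" as 25 MiB; both are assumptions, and the function names are illustrative. The actual cutting (for example with ffmpeg) is not shown.

```python
import math

MAX_UPLOAD_BYTES = 25 * 1024 * 1024  # assumption: "25MB" interpreted as 25 MiB


def max_seconds_per_request(bitrate_kbps: int, safety: float = 0.95) -> float:
    """Longest constant-bitrate clip that stays under the upload limit."""
    bytes_per_second = bitrate_kbps * 1000 / 8
    return MAX_UPLOAD_BYTES * safety / bytes_per_second


def plan_chunks(duration_s: float, bitrate_kbps: int) -> list[tuple[float, float]]:
    """Return (start, end) offsets in seconds for equally sized chunks."""
    limit = max_seconds_per_request(bitrate_kbps)
    count = max(1, math.ceil(duration_s / limit))
    size = duration_s / count
    return [(i * size, min((i + 1) * size, duration_s)) for i in range(count)]


if __name__ == "__main__":
    # A two-hour recording encoded at 128 kbps
    for start, end in plan_chunks(7200, 128):
        print(f"{start:8.1f}s -> {end:8.1f}s")
```

Equal-sized chunks can cut words in half at the boundaries, so in practice it is usually better to nudge each boundary to a nearby silence.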

Deepgram Nova-3

Deepgram built its own speech recognition architecture from scratch. Nova-3 reached General Availability in February 2025 with the strongest batch WER benchmark in this comparison (5.26%) and a standout feature: self-serve custom keyword injection, which accepts up to 100 domain-specific terms without model retraining. Nova-3 Medical launched with 3.44% WER for healthcare audio.

# Written against the deepgram-sdk interface that exposes client.listen.rest;
# other major versions of the SDK may differ
from deepgram import DeepgramClient, PrerecordedOptions

# Batch transcription
client = DeepgramClient(api_key="YOUR_DEEPGRAM_API_KEY")

with open("audio.mp3", "rb") as audio_file:
    # Read the file into memory; the SDK sends these bytes as the request body
    payload = {"buffer": audio_file.read()}

response = client.listen.rest.v("1").transcribe_file(
    payload,
    PrerecordedOptions(
        model="nova-3",
        smart_format=True,
        diarize=True,  # Speaker detection
        punctuate=True,
        utterances=True,
    ),
)

# Access structured results
for utterance in response.results.utterances:
    print(f"Speaker {utterance.speaker}: {utterance.transcript}")

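# Real-time streaming with Deepgram
import asyncio
import json
import os

import websockets

DEEPGRAM_API_KEY = os.environ["DEEPGRAM_API_KEY"]

async def transcribe_stream(audio_source):
    """audio_source: an async iterator that yields audio chunks as bytes."""
    uri = "wss://api.deepgram.com/v1/listen?model=nova-3&smart_format=true&diarize=true"
    headers = {"Authorization": f"Token {DEEPGRAM_API_KEY}"}

    # websockets < 14 takes extra_headers; newer releases name it additional_headers
    async with websockets.connect(uri, extra_headers=headers) as ws:

        async def send_audio():
            # Stream audio chunks
            async for audio_chunk in audio_source:
                await ws.send(audio_chunk)
            # Signal end of audio so the server flushes its final results
            await ws.send(json.dumps({"type": "CloseStream"}))

        async def receive_transcripts():
            # Receive transcripts in real-time
            async for message in ws:
                result = json.loads(message)
                if result.get("is_final"):
                    transcript = result["channel"]["alternatives"][0]["transcript"]
                    if transcript:
                        print(f"Final: {transcript}")

        # Run both directions concurrently; sending everything first would
        # hold back every transcript until the audio ends
        await asyncio.gather(send_audio(), receive_transcripts())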

Strengths:

Best-in-class real-time streaming latency (sub-300ms to first word)
Strong accuracy on conversational speech, call center audio, and domain vocabulary
Speaker diarization included
Keyword boosting for domain-specific terms
Language detection (identify the language automatically)
Good filler word detection (um, uh, like)
Reliable WebSocket streaming API

Weaknesses:

More expensive than Whisper for simple batch transcription
Fewer LLM-powered post-processing features than AssemblyAI
Less multilingual strength than Google Chirp 3

When to use Deepgram: Real-time transcription (live captioning, voice assistants, call center analytics) where latency is critical. Best batch WER in this group at 5.26%. Strong for call centers using domain vocabulary (custom keyword injection without retraining). Note: for pure cost on real-time streaming, AssemblyAI Universal-Streaming at $0.15/hr is cheaper than Deepgram Nova-3 at $0.462/hr — evaluate if the latency difference matters for your use case.
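
To make the custom keyword injection concrete, here is a hedged sketch that builds a request URL carrying domain terms as repeated query parameters. The parameter name `keyterm` is an assumption about how Nova-3 accepts these terms and should be verified against Deepgram's current documentation; the 100-term cap is the figure quoted in this article, and `build_listen_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

MAX_TERMS = 100  # limit quoted in this article; verify against Deepgram's docs


def build_listen_url(terms, model="nova-3"):
    """Build a pre-recorded transcription URL with one query parameter per term."""
    # Strip whitespace, drop empties, and de-duplicate while keeping order.
    unique = list(dict.fromkeys(t.strip() for t in terms if t.strip()))
    if len(unique) > MAX_TERMS:
        raise ValueError(f"at most {MAX_TERMS} terms are supported, got {len(unique)}")
    params = [("model", model), ("smart_format", "true")]
    # "keyterm" is the assumed parameter name for Nova-3 term injection.
    params += [("keyterm", term) for term in unique]
    return "https://api.deepgram.com/v1/listen?" + urlencode(params)


if __name__ == "__main__":
    print(build_listen_url(["metoprolol", "atrial fibrillation", "metoprolol"]))
```

Keeping the term list in configuration rather than in code lets domain experts update it without a deploy.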

AssemblyAI

AssemblyAI's pricing structure has a hidden advantage: Universal-Streaming at $0.15/hr is 3x cheaper than Deepgram Nova-3 for real-time ($0.462/hr). For teams that need streaming + features, this changes the math significantly. They also launched Slam-1 in October 2025 — a new speech-language model that goes beyond transcription toward deeper audio understanding.

import assemblyai as aai

aai.settings.api_key = "YOUR_ASSEMBLYAI_API_KEY"

transcriber = aai.Transcriber()

# Transcription with all features enabled
transcript = transcriber.transcribe(
    "https://example.com/podcast-episode.mp3",
    config=aai.TranscriptionConfig(
        speech_model=aai.SpeechModel.universal,  # Universal-2
        speaker_labels=True,   # Diarization
        auto_chapters=True,    # Chapter detection with summaries
        sentiment_analysis=True,
        auto_highlights=True,  # Key phrases
        entity_detection=True, # Named entities
        iab_categories=True,   # Topic classification
    )
)

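# Stop early if the transcription failed
if transcript.status == aai.TranscriptStatus.error:
    raise RuntimeError(transcript.error)

# Access speakers
for utterance in transcript.utterances:
    print(f"Speaker {utterance.speaker}: {utterance.text}")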

# Auto-chapters
for chapter in transcript.chapters:
    print(f"{chapter.start/1000:.0f}s - {chapter.end/1000:.0f}s: {chapter.headline}")

# Ask questions about the transcript via LeMUR
result = transcript.lemur.task(
    "What were the three main topics discussed and what conclusions were reached on each?",
    final_model=aai.LemurModel.claude3_5_sonnet
)
print(result.response)

Strengths:

Best post-processing feature set: auto chapters, sentiment analysis, entity detection, topic classification, PII redaction
LeMUR: ask free-form questions about transcribed content using Claude or other LLMs
Strong real-time streaming (WebSocket)
Universal-2 model is highly accurate on diverse audio types
Best for podcast/interview/meeting analysis workflows

Weaknesses:

Not the cheapest for raw batch transcription (~$0.0062/min versus $0.003/min for gpt-4o-mini-transcribe)
LeMUR adds additional token cost on top of transcription
More complex setup than Whisper
Fewer language options than Google

When to use AssemblyAI: Meeting intelligence, podcast workflows, qualitative research, customer call analysis needing structured insights. Also the best-value real-time streaming choice at $0.15/hr. LeMUR is genuinely unique — ask "what action items came up in this meeting?" and get structured answers. Note: Universal-Streaming currently supports 6 languages only; use Universal (batch) for broader language coverage.
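
Diarized output is often most useful once aggregated. The sketch below totals talk time per speaker from utterance records. It uses plain dictionaries with `speaker`, `start`, and `end` (in milliseconds) as a stand-in for the SDK's utterance objects, and the sample values are invented purely for illustration.

```python
from collections import defaultdict


def talk_time_seconds(utterances):
    """Sum speaking time per speaker; start and end are in milliseconds."""
    totals_ms = defaultdict(int)
    for utterance in utterances:
        totals_ms[utterance["speaker"]] += utterance["end"] - utterance["start"]
    # Convert to seconds only once, at the end, to avoid accumulating float error.
    return {speaker: ms / 1000 for speaker, ms in totals_ms.items()}


# Made-up sample data shaped like diarized utterances
sample = [
    {"speaker": "A", "start": 0, "end": 4200},
    {"speaker": "B", "start": 4200, "end": 7200},
    {"speaker": "A", "start": 7200, "end": 10200},
]

if __name__ == "__main__":
    for speaker, seconds in talk_time_seconds(sample).items():
        print(f"Speaker {speaker}: {seconds:.1f}s")
```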

Google Speech-to-Text (Chirp 3)

Google launched Chirp 3 in March 2025 — the same month OpenAI launched its GPT-4o transcription models. Chirp 3 adds a built-in audio denoiser, native diarization, and automatic language detection. For non-English audio, Google's multilingual training scale (28 billion text sentences across 100+ languages) gives it an edge no specialist provider can match. The Dynamic Batch option at $0.004/min (75% off standard) makes Google cost-competitive for non-real-time workloads.

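from google.api_core.client_options import ClientOptions
from google.cloud import speech_v2

project_id = "your-project-id"
location = "us"  # Chirp models are regional; confirm which locations serve chirp_3

# A regional recognizer needs the matching regional endpoint
client = speech_v2.SpeechClient(
    client_options=ClientOptions(api_endpoint=f"{location}-speech.googleapis.com")
)

# Batch transcription
with open("audio.wav", "rb") as f:
    audio_data = f.read()

config = speech_v2.RecognitionConfig(
    auto_decoding_config=speech_v2.AutoDetectDecodingConfig(),
    language_codes=["en-US"],
    model="chirp_3",
    features=speech_v2.RecognitionFeatures(
        enable_automatic_punctuation=True,
        enable_word_time_offsets=True,
        # Speech-to-Text v2 configures diarization through diarization_config
        diarization_config=speech_v2.SpeakerDiarizationConfig(
            min_speaker_count=2,
            max_speaker_count=2,
        ),
    ),
)

request = speech_v2.RecognizeRequest(
    recognizer=f"projects/{project_id}/locations/{location}/recognizers/_",
    config=config,
    content=audio_data,
)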

response = client.recognize(request=request)
for result in response.results:
    print(result.alternatives[0].transcript)

Strengths:

Best multilingual support (100+ languages with Chirp)
Strong for code-switched audio (multiple languages in same recording)
Streaming available via gRPC
Deep GCP integration (Cloud Storage, BigQuery, Pub/Sub)
Speaker diarization available
Proven production reliability (powers Google's own products)

Weaknesses:

Most expensive managed API in this comparison at the standard rate ($0.016/min)
GCP setup required (service accounts, project configuration)
More complex SDK/client library compared to Whisper or Deepgram
Overkill for English-only use cases

When to use Google Chirp 3: Non-English or multilingual audio (85+ languages, best multilingual accuracy). Applications where audio quality varies and the built-in denoiser helps. GCP-native applications. Dynamic Batch at $0.004/min makes it competitive for non-real-time workloads where 24-hour turnaround is acceptable.

Accuracy Benchmarks: Word Error Rate (WER)

Lower WER = better accuracy. Approximate WER figures on common benchmarks:

Provider	Model	Batch WER	Streaming WER	Multilingual
Deepgram	Nova-3	5.26%	6.84%	Good (Nova-3 multilingual, 6+ languages)
AssemblyAI	Universal-2	~5.9%	~7–9%	Good (99 languages batch; 6 languages streaming)
OpenAI	gpt-4o-transcribe	~5–7%	~6–8%	Strong (99+ languages)
OpenAI	gpt-4o-mini-transcribe	~6–8%	~7–9%	Strong (99+ languages)
Google	Chirp 3	~8–11%	~9–12%	Best (85+ languages with denoising)

Deepgram WER figures from Deepgram's own benchmarks (2,703 files, 81.69 hours). Independent benchmarks show smaller gaps. Always test on your own audio.

Note: WER varies significantly by audio quality, accent, domain, and benchmark. These are approximate figures from published evaluations; your domain's accuracy may differ substantially.

No single model wins on accuracy across all conditions. For your production use case, benchmark with a sample of your own audio before committing.
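
For benchmarking on your own audio, WER is simple to compute: it is the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the model output, divided by the number of reference words. The following is a minimal sketch with only lowercasing and whitespace splitting as normalization. Published benchmarks typically also normalize punctuation and numerals, so figures from this function will not match vendor numbers exactly.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(wer("the quick brown fox", "the quick brown box"))
```

Running the same reference set through each provider and comparing the resulting scores gives a like-for-like ranking for your domain.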

Real-Time Streaming Comparison

Provider	Model	Protocol	Time to First Word	Monthly Cost (1K hr)
Deepgram	Nova-3	WebSocket	~200–300ms	~$462
AssemblyAI	Universal-Streaming	WebSocket	~300–500ms	~$150
Google	Chirp 3	gRPC	~400–700ms	~$960
OpenAI	gpt-4o-transcribe	WebSocket	~300–500ms	~$360
OpenAI	whisper-1	Batch only	N/A	~$360

Deepgram has the lowest latency, which is meaningful for voice assistants and real-time captioning. But AssemblyAI Universal-Streaming at $0.15/hr undercuts Deepgram by 3x for teams that can accept roughly 100–200ms of additional latency. For call center analytics where you don't need sub-300ms, AssemblyAI is the cost winner.
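
The latency-versus-cost trade-off above can be expressed as a small selection rule: filter providers by a latency budget, then pick the cheapest. This sketch uses the upper end of each latency range and the hourly rates from the table; the data structure and function name are illustrative.

```python
# (provider and model, worst-case time to first word in ms, USD per hour)
STREAMING_OPTIONS = [
    ("Deepgram Nova-3", 300, 0.462),
    ("AssemblyAI Universal-Streaming", 500, 0.15),
    ("Google Chirp 3", 700, 0.96),
    ("OpenAI gpt-4o-transcribe", 500, 0.36),
]


def cheapest_within(latency_budget_ms: int, hours_per_month: float):
    """Return (name, monthly_cost) for the cheapest option inside the budget."""
    candidates = [
        (rate * hours_per_month, name)
        for name, latency_ms, rate in STREAMING_OPTIONS
        if latency_ms <= latency_budget_ms
    ]
    if not candidates:
        return None  # nothing in the table meets this latency budget
    cost, name = min(candidates)
    return name, round(cost, 2)


if __name__ == "__main__":
    for budget in (300, 500, 700):
        print(budget, cheapest_within(budget, 1000))
```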

Self-Hosting with Whisper

For very high-volume batch transcription, self-hosting Whisper on your own GPU infrastructure can dramatically reduce costs:

from faster_whisper import WhisperModel

# large-v3-turbo: 6x faster than large-v3, <2% accuracy loss — the practical default
model = WhisperModel("large-v3-turbo", device="cuda", compute_type="float16")

# Transcribe
segments, info = model.transcribe(
    "audio.mp3",
    beam_size=5,
    word_timestamps=True,
    vad_filter=True,  # Voice activity detection — skip silence
)

for segment in segments:
    print(f"[{segment.start:.2f}s → {segment.end:.2f}s] {segment.text}")

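faster-whisper (CTranslate2-based) runs Whisper large-v3 4–8x faster than the original implementation and requires ~6GB VRAM. If a single A100 GPU at ~$2/hour transcribes ~1,000 hours of audio per hour of compute, the raw GPU cost works out to $0.002 per hour of audio (about $0.00003/minute), far below the Whisper API's $0.006/minute. That throughput is an optimistic figure and should be measured on your own hardware. Once fixed costs such as idle GPU time and operations overhead are included, self-hosting starts to pay off at roughly 500 hours/month.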
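
The break-even reasoning above can be sketched as a short calculation: divide the fixed monthly cost of running your own stack by the per-hour saving over the API. The GPU price and throughput below come from the paragraph above; the fixed monthly overhead is a hypothetical placeholder to replace with your own number, and the function names are illustrative.

```python
def self_host_cost_per_audio_hour(gpu_usd_per_hour: float,
                                  audio_hours_per_gpu_hour: float) -> float:
    """GPU cost attributable to one hour of transcribed audio."""
    return gpu_usd_per_hour / audio_hours_per_gpu_hour


def break_even_hours(fixed_monthly_usd: float, api_usd_per_min: float,
                     gpu_usd_per_hour: float, audio_hours_per_gpu_hour: float) -> float:
    """Monthly audio hours at which self-hosting matches the API bill."""
    api_per_hour = api_usd_per_min * 60
    own_per_hour = self_host_cost_per_audio_hour(gpu_usd_per_hour, audio_hours_per_gpu_hour)
    saving_per_hour = api_per_hour - own_per_hour
    if saving_per_hour <= 0:
        return float("inf")  # the API is cheaper at any volume
    return fixed_monthly_usd / saving_per_hour


if __name__ == "__main__":
    # Hypothetical $180/month of fixed overhead, compared against whisper-1 pricing
    print(f"{break_even_hours(180, 0.006, 2.0, 1000):.0f} hours/month")
```

The result is most sensitive to the throughput figure, so it is worth measuring on a sample of real audio before committing to hardware.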

Cost Scenarios

Scenario 1: Podcast transcription (batch) — 100 hours/month

Provider	Monthly Cost	Notes
gpt-4o-mini-transcribe	100h × 60 × $0.003 = $18	Best value for batch
Deepgram Nova-3	100h × 60 × $0.0077 = $46.20	Best accuracy
AssemblyAI Universal	100h × $0.37 = $37	Best features
Google Chirp 3 Dynamic Batch	100h × 60 × $0.004 = $24	Best for multilingual
Google Chirp 3 Standard	100h × 60 × $0.016 = $96
Whisper self-hosted	~$15–30 (GPU cost)

Scenario 2: Call center — 1,000 hours/month real-time streaming

Provider	Monthly Cost	Streaming Latency
AssemblyAI Universal-Streaming	1,000 × $0.15 = $150	~300–500ms
Deepgram Nova-3	1,000 × $0.462 = $462	~200–300ms
OpenAI gpt-4o-transcribe	1,000h × 60 × $0.006 = $360	~300–500ms
Google Chirp 3	1,000 × $0.96 = $960	~400–700ms

AssemblyAI Universal-Streaming's pricing advantage is substantial at scale. For call centers where ~300–500ms latency is acceptable and features (diarization, summaries) matter, it's the clear winner. Deepgram is worth the premium only when sub-300ms latency is a hard requirement.

When to Choose Each

Choose OpenAI (gpt-4o-mini-transcribe) if:

You want the lowest price for managed batch transcription ($0.003/min)
You're already integrated with OpenAI and want one SDK
You need streaming but don't want to onboard Deepgram or AssemblyAI
Your audio is mostly English, volume is moderate, and simplicity is the priority
Note: use gpt-4o-mini-transcribe rather than whisper-1 for better accuracy, lower cost, and streaming support

Choose Deepgram if:

Real-time streaming is required (live captions, voice assistants, call center)
Lowest streaming latency matters more than lowest cost (AssemblyAI Universal-Streaming is cheaper per hour)
Speaker diarization is needed at scale
Call center, customer service, or live meeting transcription

Choose AssemblyAI if:

You need post-processing features: summaries, chapters, sentiment, entities
Building meeting intelligence or podcast analysis workflows
LeMUR (ask questions about audio content) is valuable for your use case
You want the richest feature set, or low-cost real-time streaming at $0.15/hr

Choose Google Chirp 3 if:

Non-English or multilingual audio is primary (85+ languages, best multilingual accuracy)
Audio quality varies and the built-in denoiser helps
You're already on GCP and want ecosystem integration
Non-real-time workloads where Dynamic Batch pricing ($0.004/min) makes Google cost-competitive

