<!-- APIScout AI-readable guide source -->
<!-- Canonical: https://apiscout.dev/guides/speech-to-text-api-comparison-2026 -->
<!-- Raw Markdown: https://apiscout.dev/guides/speech-to-text-api-comparison-2026/raw.md -->
<!-- Source path: content/guides/speech-to-text-api-comparison-2026.mdx -->

---
og_image: "/images/guides/speech-to-text-api-comparison-2026.webp"
title: "Speech-to-Text APIs (2026)"
description: "Deepgram Nova-3: 5.26% WER, best real-time. AssemblyAI streaming cheapest at $0.15/hr. OpenAI's new gpt-4o-transcribe beats Whisper. 2026 STT API comparison."
date: "2026-03-16"
author: "APIScout Team"
tags: ["speech-to-text", "whisper-api", "deepgram", "assemblyai", "google-speech", "stt-api", "api-comparison", "2026"]
---

## The Speech API Market Has Stratified

In late 2022, OpenAI's Whisper disrupted the speech recognition market by open-sourcing a model that matched commercial APIs at zero cost for self-hosting. In 2025–2026, the market has responded: Deepgram, AssemblyAI, and Google have released specialized models that beat Whisper on specific tasks (real-time streaming, speaker diarization, domain-specific accuracy) while remaining cost-competitive at scale.

The choice now depends heavily on your use case: batch transcription, real-time streaming, feature richness, or cost minimization.

## TL;DR

**Deepgram Nova-3** leads on batch accuracy (5.26% WER) and real-time streaming latency at $0.0077/min. **AssemblyAI** wins on features AND is surprisingly cheap for streaming at $0.15/hr (Universal-Streaming) — cheaper than Deepgram for real-time workloads. **OpenAI's new `gpt-4o-mini-transcribe`** ($0.003/min, launched March 2025) is the budget batch pick — half the price of `whisper-1` with better accuracy and streaming support. **Google Chirp 3** (also March 2025) leads on multilingual accuracy with a built-in denoiser.

## Key Takeaways

- **Deepgram Nova-3**: $0.0077/min; 5.26% WER (batch); best real-time streaming latency; Nova-3 Medical at 3.44% WER; GA February 2025
- **AssemblyAI**: $0.37/hr pre-recorded; **$0.15/hr streaming** (cheaper than Deepgram for real-time); LeMUR for LLM analysis; Slam-1 launched October 2025
- **OpenAI**: `whisper-1` $0.006/min (batch only); **`gpt-4o-mini-transcribe` $0.003/min** with streaming (launched March 2025, now recommended)
- **Google Chirp 3**: $0.016/min standard; **$0.004/min Dynamic Batch** (75% off); built-in denoiser; native diarization; launched March 2025
- **WER benchmarks**: Deepgram Nova-3 at 5.26%, AssemblyAI Universal-2 at 5.9%, gpt-4o-transcribe ~5–7%; Google Chirp 3 is strongest on multilingual audio
- **Self-hosting**: Whisper `large-v3-turbo` is the practical choice — 6x faster than large-v3 with <2% accuracy loss

## Pricing Comparison

| Provider | Model | Price per Minute | Free Tier | Streaming |
|----------|-------|-----------------|-----------|-----------|
| OpenAI | whisper-1 (legacy) | $0.006/min | None | ❌ |
| OpenAI | gpt-4o-mini-transcribe ⭐ | $0.003/min | None | ✅ |
| OpenAI | gpt-4o-transcribe | $0.006/min | None | ✅ |
| Deepgram | Nova-3 | $0.0077/min | $200 credit | ✅ |
| AssemblyAI | Universal (pre-recorded) | ~$0.0062/min ($0.37/hr) | $50 credit | ❌ |
| AssemblyAI | Universal-Streaming ⭐ | $0.0025/min ($0.15/hr) | $50 credit | ✅ |
| Google | Chirp 3 | $0.016/min | 60 min/month | ✅ |
| Google | Chirp 3 Dynamic Batch | $0.004/min | 60 min/month | ❌ |
| Whisper (self-hosted) | large-v3-turbo | Infrastructure only | — | ❌ |

### Pricing Notes

**OpenAI** now has three transcription models (as of March 2025):
- `whisper-1`: $0.006/min, batch only, still works but no longer recommended
- `gpt-4o-transcribe`: $0.006/min, same price as whisper-1 but ~35% better accuracy + streaming
- `gpt-4o-mini-transcribe`: $0.003/min, streaming supported, now the recommended choice for most use cases

**Deepgram** pricing varies by plan:
- Nova-3 pay-as-you-go: $0.0077/min ($0.462/hr)
- Nova-3 Growth plan: $0.0065/min ($0.39/hr)
- Free tier: $200 in credits (~433 hours at pay-as-you-go rates)

**AssemblyAI** has two distinct pricing tiers worth knowing:
- Universal (pre-recorded/batch): $0.37/hr — best accuracy + features
- Universal-Streaming: $0.15/hr — cheaper than Deepgram for real-time; 6 languages (English, Spanish, French, German, Italian, Portuguese)
- Nano: $0.12/hr — lower accuracy, budget option
- LeMUR (LLM analysis): token-based on top of transcription cost
- Free tier: $50 in credits (~135 hours Universal or ~333 hours streaming)

**Google Speech-to-Text** (Chirp 3 is now current, launched March 2025):
- Standard rate: $0.016/min (most expensive in this group)
- **Dynamic Batch** (async, 24-hour SLA): $0.004/min — 75% discount for non-real-time workloads
- Free tier: 60 minutes/month (permanent, not trial credit)
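
The per-minute and per-hour figures above are the same prices in different units, and the free-credit estimates follow directly from them. A minimal sketch of that arithmetic, using only rates quoted in this guide (the function names are illustrative and not part of any provider SDK):

```python
def per_hour(price_per_min: float) -> float:
    """Convert a per-minute price to a per-hour price."""
    return price_per_min * 60


def per_minute(price_per_hour: float) -> float:
    """Convert a per-hour price to a per-minute price."""
    return price_per_hour / 60


def credit_hours(credit_usd: float, price_per_hour: float) -> float:
    """Hours of audio a free credit covers at a given hourly rate."""
    return credit_usd / price_per_hour


# Deepgram Nova-3 pay-as-you-go: $0.0077/min, $200 free credit
deepgram_hourly = per_hour(0.0077)
print(f"Deepgram Nova-3: ${deepgram_hourly:.3f}/hr")
print(f"$200 credit covers ~{credit_hours(200, deepgram_hourly):.0f} hours")

# AssemblyAI Universal-Streaming: $0.15/hr, $50 free credit
print(f"AssemblyAI streaming: ${per_minute(0.15):.4f}/min")
print(f"$50 credit covers ~{credit_hours(50, 0.15):.0f} hours of streaming")
```

Normalizing every provider to one unit before comparing avoids the most common mistake when reading mixed per-minute and per-hour price sheets.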

## OpenAI: Whisper → GPT-4o Transcribe Models

OpenAI launched `gpt-4o-transcribe` and `gpt-4o-mini-transcribe` in March 2025. `whisper-1` still works, but OpenAI now recommends the GPT-4o-based models. The key improvements: ~35% lower WER than whisper-1, and both support **real-time streaming via WebSocket** — something whisper-1 never had.

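```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Recommended: gpt-4o-mini-transcribe (cheaper, streaming-capable)
with open("meeting-recording.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="gpt-4o-mini-transcribe",  # $0.003/min, half the price of whisper-1
        file=audio_file,
        response_format="json",  # the gpt-4o transcribe models return json or text only
    )

print(transcription.text)

# Word-level timestamps require whisper-1 with verbose_json
with open("meeting-recording.mp3", "rb") as audio_file:
    timed = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="verbose_json",
        timestamp_granularities=["word", "segment"],
    )

for word in timed.words:
    print(f"{word.start:.2f}s - {word.end:.2f}s: {word.word}")
```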

```python
# Real-time streaming with gpt-4o-transcribe (WebSocket)
import base64
import json
import os

import websocket  # pip install websocket-client

# Read the key from the environment instead of hard-coding it
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]

ws = websocket.WebSocket()
ws.connect(
    "wss://api.openai.com/v1/realtime?model=gpt-4o-transcribe",
    header={"Authorization": f"Bearer {OPENAI_API_KEY}", "OpenAI-Beta": "realtime=v1"}
)

# Stream audio chunks and receive transcripts
# (WebSocket streaming pattern similar to other real-time STT APIs)
```

**Strengths:**
- Simplest API integration in the group
- `gpt-4o-mini-transcribe` at $0.003/min is the cheapest managed batch model in the pricing table (AssemblyAI's Nano tier is cheaper at $0.12/hr, with lower accuracy)
- 99+ languages supported
- Word-level timestamps available
- Open-source Whisper model available for self-hosting
- GPT-4o models now support real-time streaming (March 2025)

**Weaknesses:**
- 25MB file size limit per request
- No speaker diarization built-in
- No LLM-powered post-processing features (unlike AssemblyAI LeMUR)
- `whisper-1` is batch-only (use gpt-4o-mini-transcribe for streaming)

**When to use OpenAI transcription:** Budget batch transcription (`gpt-4o-mini-transcribe` at $0.003/min), or teams already integrated with OpenAI who want streaming without adopting another SDK. Not the best choice if you need diarization or post-processing.

## Deepgram Nova-3

Deepgram built its own speech recognition architecture from scratch. Nova-3 reached General Availability in February 2025 with the strongest batch WER benchmark in this comparison (5.26%) and a standout feature: **self-serve custom keyword injection**, which accepts up to 100 domain-specific terms without model retraining. Nova-3 Medical launched with 3.44% WER for healthcare audio.

```python
from deepgram import DeepgramClient, PrerecordedOptions

# Batch transcription
client = DeepgramClient(api_key="YOUR_DEEPGRAM_API_KEY")

with open("audio.mp3", "rb") as audio_file:
    response = client.listen.rest.v("1").transcribe_file(
        {"buffer": audio_file, "mimetype": "audio/mp3"},
        PrerecordedOptions(
            model="nova-3",
            smart_format=True,
            diarize=True,  # Speaker detection
            punctuate=True,
            utterances=True,
        )
    )

# Access structured results
for utterance in response.results.utterances:
    print(f"Speaker {utterance.speaker}: {utterance.transcript}")
```
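
Keyword injection is supplied per request rather than trained into the model. A minimal sketch of building a request URL with domain terms, assuming the query parameter is named `keyterm` and is repeated once per term (check Deepgram's current API reference before relying on this). The helper name and the sample terms are hypothetical; the 100-term cap is the limit quoted above.

```python
from urllib.parse import urlencode

MAX_KEYTERMS = 100  # limit quoted in this guide


def build_listen_url(keyterms: list[str], model: str = "nova-3") -> str:
    """Build a /v1/listen URL with one `keyterm` parameter per term.

    ASSUMPTION: the parameter name `keyterm` and the repeat-per-term
    convention; verify both against Deepgram's API reference.
    """
    if len(keyterms) > MAX_KEYTERMS:
        raise ValueError(f"at most {MAX_KEYTERMS} keyterms, got {len(keyterms)}")
    params = [("model", model), ("smart_format", "true")]
    params += [("keyterm", term) for term in keyterms]
    return "wss://api.deepgram.com/v1/listen?" + urlencode(params)


# Hypothetical medical vocabulary
url = build_listen_url(["metoprolol", "tachycardia"])
print(url)
```

Validating the term count client-side turns an oversized vocabulary list into an immediate error instead of a rejected request.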

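```python
# Real-time streaming with Deepgram
import asyncio
import json
import os

import websockets

DEEPGRAM_API_KEY = os.environ["DEEPGRAM_API_KEY"]

async def audio_source(path="audio.wav", chunk_size=8192):
    """Placeholder source: yields a file in chunks. Swap in a microphone or telephony feed."""
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            yield chunk

async def transcribe_stream():
    uri = "wss://api.deepgram.com/v1/listen?model=nova-3&smart_format=true&diarize=true"
    headers = {"Authorization": f"Token {DEEPGRAM_API_KEY}"}

    # websockets >= 14 names this argument additional_headers
    async with websockets.connect(uri, extra_headers=headers) as ws:

        async def send_audio():
            async for audio_chunk in audio_source():
                await ws.send(audio_chunk)
            # Signal end of audio so the server flushes and closes the stream
            await ws.send(json.dumps({"type": "CloseStream"}))

        async def receive_transcripts():
            async for message in ws:
                result = json.loads(message)
                if result.get("is_final"):
                    transcript = result["channel"]["alternatives"][0]["transcript"]
                    if transcript:
                        print(f"Final: {transcript}")

        # Run both directions concurrently so transcripts arrive while audio is still being sent
        await asyncio.gather(send_audio(), receive_transcripts())

asyncio.run(transcribe_stream())
```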

**Strengths:**
- Best-in-class real-time streaming latency (sub-300ms to first word)
- Strong accuracy on conversational speech, call center audio, and domain vocabulary
- Speaker diarization included
- Keyword boosting for domain-specific terms
- Language detection (identify the language automatically)
- Good filler word detection (um, uh, like)
- Reliable WebSocket streaming API

**Weaknesses:**
- More expensive than Whisper for simple batch transcription
- Fewer LLM-powered post-processing features than AssemblyAI
- Less multilingual strength than Google Chirp 3

**When to use Deepgram:** Real-time transcription (live captioning, voice assistants, call center analytics) where latency is critical. Best batch WER in this group at 5.26%. Strong for call centers using domain vocabulary (custom keyword injection without retraining). Note: for pure cost on real-time streaming, AssemblyAI Universal-Streaming at $0.15/hr is cheaper than Deepgram Nova-3 at $0.462/hr — evaluate if the latency difference matters for your use case.

## AssemblyAI

AssemblyAI's pricing structure has a hidden advantage: **Universal-Streaming at $0.15/hr is 3x cheaper than Deepgram Nova-3 for real-time** ($0.462/hr). For teams that need streaming + features, this changes the math significantly. They also launched Slam-1 in October 2025 — a new speech-language model that goes beyond transcription toward deeper audio understanding.

```python
import assemblyai as aai

aai.settings.api_key = "YOUR_ASSEMBLYAI_API_KEY"

transcriber = aai.Transcriber()

# Transcription with all features enabled
transcript = transcriber.transcribe(
    "https://example.com/podcast-episode.mp3",
    config=aai.TranscriptionConfig(
        speech_model=aai.SpeechModel.universal,  # Universal-2
        speaker_labels=True,   # Diarization
        auto_chapters=True,    # Chapter detection with summaries
        sentiment_analysis=True,
        auto_highlights=True,  # Key phrases
        entity_detection=True, # Named entities
        iab_categories=True,   # Topic classification
    )
)

# Access speakers
for utterance in transcript.utterances:
    print(f"Speaker {utterance.speaker}: {utterance.text}")

# Auto-chapters
for chapter in transcript.chapters:
    print(f"{chapter.start/1000:.0f}s - {chapter.end/1000:.0f}s: {chapter.headline}")

# Ask questions about the transcript via LeMUR
result = transcript.lemur.task(
    "What were the three main topics discussed and what conclusions were reached on each?",
    final_model=aai.LemurModel.claude3_5_sonnet
)
print(result.response)
```

**Strengths:**
- Best post-processing feature set: auto chapters, sentiment analysis, entity detection, topic classification, PII redaction
- LeMUR: ask free-form questions about transcribed content using Claude or other LLMs
- Strong real-time streaming (WebSocket)
- Universal-2 model is highly accurate on diverse audio types
- Best for podcast/interview/meeting analysis workflows

**Weaknesses:**
- Not the cheapest for raw batch transcription ($0.37/hr vs. $0.18/hr for `gpt-4o-mini-transcribe`)
- LeMUR adds additional token cost on top of transcription
- More complex setup than Whisper
- Fewer language options than Google

**When to use AssemblyAI:** Meeting intelligence, podcast workflows, qualitative research, customer call analysis needing structured insights. Also the best-value real-time streaming choice at $0.15/hr. LeMUR is genuinely unique — ask "what action items came up in this meeting?" and get structured answers. Note: Universal-Streaming currently supports 6 languages only; use Universal (batch) for broader language coverage.

## Google Speech-to-Text (Chirp 3)

Google launched Chirp 3 in March 2025 — the same month OpenAI launched its GPT-4o transcription models. Chirp 3 adds a **built-in audio denoiser**, native diarization, and automatic language detection. For non-English audio, Google's multilingual training scale (28 billion text sentences across 100+ languages) gives it an edge no specialist provider can match. The **Dynamic Batch** option at $0.004/min (75% off standard) makes Google cost-competitive for non-real-time workloads.

```python
from google.cloud import speech_v2

client = speech_v2.SpeechClient()
project_id = "your-project-id"

# Batch transcription
with open("audio.wav", "rb") as f:
    audio_data = f.read()

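config = speech_v2.RecognitionConfig(
    auto_decoding_config=speech_v2.AutoDetectDecodingConfig(),
    language_codes=["en-US"],
    model="chirp_3",  # Chirp models may need a regional endpoint; check Google's docs
    features=speech_v2.RecognitionFeatures(
        enable_automatic_punctuation=True,
        enable_word_time_offsets=True,
        # The v2 API configures diarization through a nested config object
        diarization_config=speech_v2.SpeakerDiarizationConfig(
            min_speaker_count=2,
            max_speaker_count=2,
        ),
    )
)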

request = speech_v2.RecognizeRequest(
    recognizer=f"projects/{project_id}/locations/global/recognizers/_",
    config=config,
    content=audio_data,
)

response = client.recognize(request=request)
for result in response.results:
    print(result.alternatives[0].transcript)
```

**Strengths:**
- Best multilingual support (85+ languages with Chirp 3)
- Strong for code-switched audio (multiple languages in same recording)
- Streaming available via gRPC
- Deep GCP integration (Cloud Storage, BigQuery, Pub/Sub)
- Speaker diarization available
- Proven production reliability (powers Google's own products)

**Weaknesses:**
- Most expensive managed API in this comparison
- GCP setup required (service accounts, project configuration)
- More complex SDK/client library compared to Whisper or Deepgram
- Overkill for English-only use cases

**When to use Google Chirp 3:** Non-English or multilingual audio (85+ languages, best multilingual accuracy). Applications where audio quality varies and the built-in denoiser helps. GCP-native applications. **Dynamic Batch** at $0.004/min makes it competitive for non-real-time workloads where 24-hour turnaround is acceptable.

## Accuracy Benchmarks: Word Error Rate (WER)

Lower WER = better accuracy. Approximate WER figures on common benchmarks:

| Provider | Model | Batch WER | Streaming WER | Multilingual |
|----------|-------|-----------|--------------|-------------|
| Deepgram | Nova-3 | **5.26%** | **6.84%** | Good (Nova-3 multilingual, 6+ languages) |
| AssemblyAI | Universal-2 | ~5.9% | ~7–9% | Good (99 languages batch; 6 languages streaming) |
| OpenAI | gpt-4o-transcribe | ~5–7% | ~6–8% | Strong (99+ languages) |
| OpenAI | gpt-4o-mini-transcribe | ~6–8% | ~7–9% | Strong (99+ languages) |
| Google | Chirp 3 | ~8–11% | ~9–12% | **Best** (85+ languages with denoising) |

*Deepgram WER figures from Deepgram's own benchmarks (2,703 files, 81.69 hours). Independent benchmarks show smaller gaps. Always test on your own audio.*

*Note: WER varies significantly by audio quality, accent, domain, and benchmark. These are approximate figures from published evaluations; your domain's accuracy may differ substantially.*

No single model wins on accuracy across all conditions. For your production use case, benchmark with a sample of **your own audio** before committing.
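
Benchmarking on your own audio comes down to computing WER between a human-checked reference transcript and each provider's output. A minimal sketch of the standard word-level edit-distance calculation; the sample sentences are made up, and this version only lowercases text, so it does not normalize punctuation or numerals the way published benchmarks usually do.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # Levenshtein distance over words, keeping one row at a time
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical reference transcript and provider output
reference = "please refill the metoprolol prescription today"
hypothesis = "please refill the metropolis prescription"
print(f"WER: {word_error_rate(reference, hypothesis):.1%}")
```

Run the same reference set through each provider and compare the averages; a few hours of representative audio usually says more than any published leaderboard.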

## Real-Time Streaming Comparison

| Provider | Model | Protocol | Time to First Word | Monthly Cost (1K hr) |
|----------|-------|----------|-------------------|---------------------|
| Deepgram | Nova-3 | WebSocket | ~200–300ms | ~$462 |
| AssemblyAI | Universal-Streaming | WebSocket | ~300–500ms | **~$150** |
| Google | Chirp 3 | gRPC | ~400–700ms | ~$960 |
| OpenAI | gpt-4o-transcribe | WebSocket | ~300–500ms | ~$360 |
| OpenAI | whisper-1 | Batch only | N/A | ~$360 |

Deepgram has the lowest latency, which matters for voice assistants and real-time captioning. But AssemblyAI Universal-Streaming at $0.15/hr undercuts Deepgram by roughly 3x for teams that can accept an extra ~100–200ms to first word. For call center analytics where you don't need sub-300ms, AssemblyAI is the cost winner.

## Self-Hosting with Whisper

For very high-volume batch transcription, self-hosting Whisper on your own GPU infrastructure can dramatically reduce costs:

```python
from faster_whisper import WhisperModel

# large-v3-turbo: 6x faster than large-v3, <2% accuracy loss — the practical default
model = WhisperModel("large-v3-turbo", device="cuda", compute_type="float16")

# Transcribe
segments, info = model.transcribe(
    "audio.mp3",
    beam_size=5,
    word_timestamps=True,
    vad_filter=True,  # Voice activity detection — skip silence
)

for segment in segments:
    print(f"[{segment.start:.2f}s → {segment.end:.2f}s] {segment.text}")
```

`faster-whisper` (CTranslate2-based) runs Whisper large-v3 4–8x faster than the original implementation and requires ~6GB VRAM. On a single A100 GPU at ~$2/hour, you can transcribe ~1,000 hours of audio per hour of compute. At full utilization that works out to about $0.002 per hour of audio, against $0.36 per hour ($0.006/minute) for the Whisper API. Idle GPU time and operational overhead make up most of the real bill, so the practical crossover point is around 500 hours/month.
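
Whether self-hosting pays off depends on fixed costs as much as on GPU pricing. A rough sketch of the break-even arithmetic: the GPU rate, throughput, and API price are the figures quoted above, while `fixed_monthly_usd` is a hypothetical placeholder for idle capacity and operations that you would replace with your own number.

```python
def break_even_hours(
    fixed_monthly_usd: float,
    api_price_per_min: float,
    gpu_usd_per_hour: float,
    audio_hours_per_gpu_hour: float,
) -> float:
    """Monthly audio hours at which self-hosting matches the API bill."""
    api_per_audio_hour = api_price_per_min * 60
    gpu_per_audio_hour = gpu_usd_per_hour / audio_hours_per_gpu_hour
    saving_per_hour = api_per_audio_hour - gpu_per_audio_hour
    if saving_per_hour <= 0:
        raise ValueError("self-hosting never breaks even at these rates")
    return fixed_monthly_usd / saving_per_hour


hours = break_even_hours(
    fixed_monthly_usd=180.0,        # HYPOTHETICAL: idle GPU time, ops, storage
    api_price_per_min=0.006,        # whisper-1 rate from this guide
    gpu_usd_per_hour=2.0,           # A100 rate from this guide
    audio_hours_per_gpu_hour=1000,  # throughput figure from this guide
)
print(f"Break-even at ~{hours:.0f} audio hours/month")
```

The result is very sensitive to the fixed-cost input, so measure your real utilization before committing to GPU infrastructure.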

## Cost Scenarios

### Scenario 1: Podcast transcription (batch) — 100 hours/month

| Provider | Monthly Cost | Notes |
|----------|-------------|-------|
| gpt-4o-mini-transcribe | 100h × 60 × $0.003 = **$18** | Best value for batch |
| Deepgram Nova-3 | 100h × 60 × $0.0077 = **$46.20** | Best accuracy |
| AssemblyAI Universal | 100h × $0.37 = **$37** | Best features |
| Google Chirp 3 Dynamic Batch | 100h × 60 × $0.004 = **$24** | Best for multilingual |
| Google Chirp 3 Standard | 100h × 60 × $0.016 = **$96** | |
| Whisper self-hosted | ~$15–30 (GPU cost) | |

### Scenario 2: Call center — 1,000 hours/month real-time streaming

| Provider | Monthly Cost | Streaming Latency |
|----------|-------------|------------------|
| AssemblyAI Universal-Streaming | 1,000 × $0.15 = **$150** | ~300–500ms |
| Deepgram Nova-3 | 1,000 × $0.462 = **$462** | ~200–300ms |
| OpenAI gpt-4o-transcribe | 1,000h × 60 × $0.006 = **$360** | ~300–500ms |
| Google Chirp 3 | 1,000 × $0.96 = **$960** | ~400–700ms |

AssemblyAI Universal-Streaming's pricing advantage is substantial at scale. For call centers where ~300–500ms latency is acceptable and features (diarization, summaries) matter, it's the clear winner. Deepgram is worth the premium only when sub-300ms latency is a hard requirement.
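
The scenario tables are simple rate-times-volume products, so it is easy to rerun them for your own volume. A small sketch using the streaming rates quoted in this guide (the dictionary and function names are illustrative):

```python
# Streaming rates quoted in this guide, in USD per audio hour
STREAMING_RATES = {
    "AssemblyAI Universal-Streaming": 0.15,
    "OpenAI gpt-4o-transcribe": 0.006 * 60,
    "Deepgram Nova-3": 0.0077 * 60,
    "Google Chirp 3": 0.016 * 60,
}


def monthly_costs(hours: float, rates: dict[str, float]) -> list[tuple[str, float]]:
    """Return (provider, monthly cost) pairs sorted cheapest first."""
    costs = [(name, round(rate * hours, 2)) for name, rate in rates.items()]
    return sorted(costs, key=lambda pair: pair[1])


# Scenario 2: 1,000 hours/month of real-time streaming
for name, cost in monthly_costs(1_000, STREAMING_RATES):
    print(f"{name:<32} ${cost:>8,.2f}")
```

Volume discounts and committed-use plans are not modeled here; treat the output as list-price estimates.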

## When to Choose Each

### Choose OpenAI (gpt-4o-mini-transcribe) if:
- You want the lowest batch price among the models in the pricing table ($0.003/min)
- You're already integrated with OpenAI and want one SDK
- You need streaming but don't want Deepgram or AssemblyAI onboarding
- Your audio is English, your volume is moderate, and simplicity is the priority
- Note: use `gpt-4o-mini-transcribe` not `whisper-1` for better accuracy, same or lower cost, and streaming support

### Choose Deepgram if:
- Real-time streaming is required (live captions, voice assistants, call center)
- Lowest streaming latency (~200–300ms to first word) is a hard requirement
- Speaker diarization is needed at scale
- Call center, customer service, or live meeting transcription

### Choose AssemblyAI if:
- You need post-processing features: summaries, chapters, sentiment, entities
- Building meeting intelligence or podcast analysis workflows
- LeMUR (ask questions about audio content) is valuable for your use case
- You want the richest feature set, or the lowest-cost managed streaming ($0.15/hr) in one of its six supported languages

### Choose Google Chirp 3 if:
- Non-English or multilingual audio is primary (85+ languages, best multilingual accuracy)
- Audio quality varies and the built-in denoiser helps
- You're already on GCP and want ecosystem integration
- Non-real-time workloads where Dynamic Batch pricing ($0.004/min) makes Google cost-competitive
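
The checklists above can be condensed into a small rule-based helper. This is a deliberately simplified sketch of this guide's recommendations, not a substitute for benchmarking; the `Workload` fields and the order of the rules are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    realtime: bool = False
    multilingual: bool = False        # non-English or mixed-language audio
    needs_analysis: bool = False      # summaries, chapters, sentiment, LeMUR
    hard_latency_limit: bool = False  # sub-300ms to first word required


def recommend(w: Workload) -> str:
    """Return this guide's pick for a workload (a simplification)."""
    if w.multilingual:
        return "Google Chirp 3"
    if w.realtime:
        if w.hard_latency_limit:
            return "Deepgram Nova-3"
        return "AssemblyAI Universal-Streaming"
    if w.needs_analysis:
        return "AssemblyAI Universal"
    return "OpenAI gpt-4o-mini-transcribe"


print(recommend(Workload(realtime=True, hard_latency_limit=True)))
print(recommend(Workload(needs_analysis=True)))
```

Real decisions usually weigh several of these factors at once, so use the output as a starting shortlist and then test the candidates on your own audio.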

---

*Compare STT API pricing, accuracy, and uptime at [APIScout](https://apiscout.dev).*

## Related APIScout guides

- [Deepgram vs AssemblyAI vs Gladia Guide for Speech-to-Text APIs](/guides/deepgram-vs-assemblyai-vs-gladia-guide-for-speech-to-text-apis-2026)
- [Best Voice and Speech APIs 2026](/guides/best-voice-speech-apis-2026)
- [Realtime Voice AI APIs Comparison 2026](/guides/realtime-voice-ai-apis-comparison-2026)
- [OpenAI Realtime API: Building Voice Applications 2026](/guides/openai-realtime-api-building-voice-applications-2026)

*Evaluate Deepgram and compare alternatives on [APIScout](https://apiscout.dev/compare/deepgram-vs-openai).*
