ElevenLabs vs Cartesia: Best Voice AI API 2026
TL;DR
Cartesia for real-time voice agents — 199ms TTFA (vs ElevenLabs' 832ms on the self-serve tier), roughly 12–20x cheaper at $0.011/1K chars, and an SSM architecture that makes it the strongest latency choice for conversational AI. ElevenLabs for quality-first audio production — 70+ languages, the broadest voice library, dubbing, sound effects, and a complete platform for content creators and multilingual apps. The practical split in 2026: Cartesia for AI phone agents and voice assistants; ElevenLabs for narration, dubbing, and premium voice experiences.
Key Takeaways
- Cartesia pricing: $0.011/1K chars (roughly 12–20x cheaper than ElevenLabs, depending on plan)
- ElevenLabs pricing: ~$0.132–$0.22/1K chars depending on plan
- Cartesia TTFA: 199ms (Sonic model, self-serve tier)
- ElevenLabs TTFA: 832ms (self-serve tier), ~300ms on enterprise tier
- Architecture: Cartesia uses State Space Models (SSMs); ElevenLabs uses transformers
- Language support: ElevenLabs 70+ languages; Cartesia 15 languages
- Voice cloning: Cartesia requires 3 seconds; ElevenLabs requires 30 seconds
- Platform scope: ElevenLabs is a full audio platform; Cartesia is API-only TTS
Why Voice AI Latency Matters
For voice agents (AI phone calls, real-time assistants, customer support bots), latency is the bottleneck. A 200ms TTFA feels like a natural conversation. An 800ms TTFA creates an awkward pause that feels broken.
User speaks → STT transcription → LLM inference → TTS → User hears response
Full turn latency budget:
STT: ~200ms (Deepgram/Whisper real-time)
LLM: ~400ms (streaming first token)
TTS: target <300ms TTFA
Total: ~900ms for natural conversation
Cartesia TTFA: 199ms → Total ~799ms (below 1s threshold)
ElevenLabs TTFA: 832ms → Total ~1432ms (above 1s, feels slow)
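The budget arithmetic above can be sanity-checked in a few lines (the component latencies are the illustrative p50 figures from this budget, not measurements):

```python
# Component latencies are the illustrative p50 figures from the budget above.
STT_MS = 200  # Deepgram/Whisper real-time transcription
LLM_MS = 400  # streaming LLM, time to first token

def turn_latency_ms(tts_ttfa_ms: int) -> int:
    """End of user speech to first audible agent audio."""
    return STT_MS + LLM_MS + tts_ttfa_ms

for name, ttfa in [("Cartesia Sonic", 199), ("ElevenLabs self-serve", 832)]:
    total = turn_latency_ms(ttfa)
    verdict = "natural" if total < 1000 else "feels slow"
    print(f"{name}: {total}ms total ({verdict})")
```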
This is why Cartesia has dominated new voice agent deployments in 2026 — the latency advantage directly translates to better conversation quality.
Cartesia
Architecture: State Space Models
Cartesia's Sonic model is built on State Space Models (SSMs) — a fundamentally different architecture from transformer-based TTS. SSMs maintain a compact recurrent state that updates incrementally as text arrives, enabling streaming synthesis before the full sentence is processed.
# Cartesia Python SDK
import os

import pyaudio
from cartesia import Cartesia

client = Cartesia(api_key=os.environ["CARTESIA_API_KEY"])

# Stream audio for low-latency playback
p = pyaudio.PyAudio()
rate = 44100
stream = p.open(format=pyaudio.paFloat32, channels=1, rate=rate, output=True)

# Generate and stream immediately
output_format = {
    "container": "raw",
    "encoding": "pcm_f32le",
    "sample_rate": rate,
}

for output in client.tts.sse(
    model_id="sonic-2",
    transcript="Hello, how can I help you today?",
    voice={"mode": "id", "id": "a0e99841-438c-4a64-b679-ae501e7d6091"},
    output_format=output_format,
    stream=True,
):
    buffer = output.get("audio")
    if buffer:
        stream.write(buffer)

stream.stop_stream()
stream.close()
p.terminate()
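One practical note: pcm_f32le at 44.1kHz is convenient for local playback, but telephony stacks usually expect 16-bit PCM at a lower sample rate. You can request a different output_format from the API, or convert in process. A minimal stdlib-only sketch of the float32-to-int16 conversion (the helper name is ours, not part of the SDK):

```python
import struct

def f32le_to_s16le(raw: bytes) -> bytes:
    """Convert little-endian float32 PCM samples to 16-bit signed PCM."""
    n = len(raw) // 4
    samples = struct.unpack(f"<{n}f", raw)
    # Clamp to [-1.0, 1.0], then scale into the int16 range.
    clamped = (max(-1.0, min(1.0, s)) for s in samples)
    return struct.pack(f"<{n}h", *(int(s * 32767) for s in clamped))
```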
WebSocket API for Real-Time Agents
For voice agents, use the WebSocket API to send text chunks as they arrive from the LLM:
import asyncio
import base64
import json
import os

import websockets

async def voice_agent_response(llm_text_stream, voice_id: str):
    """Stream LLM output directly to Cartesia for ultra-low latency."""
    uri = "wss://api.cartesia.ai/tts/websocket"
    headers = {
        "Cartesia-Version": "2024-06-10",
        "X-API-Key": os.environ["CARTESIA_API_KEY"],
    }
    async with websockets.connect(uri, additional_headers=headers) as ws:
        context_id = "ctx-001"
        # Send text chunks as they arrive from LLM streaming
        async for text_chunk in llm_text_stream:
            await ws.send(json.dumps({
                "context_id": context_id,
                "model_id": "sonic-2",
                "transcript": text_chunk,
                "voice": {"mode": "id", "id": voice_id},
                "output_format": {
                    "container": "raw",
                    "encoding": "pcm_f32le",
                    "sample_rate": 16000,
                },
                "continue": True,  # More chunks coming
            }))
        # Signal end of utterance
        await ws.send(json.dumps({
            "context_id": context_id,
            "transcript": "",
            "continue": False,
        }))
        # Receive audio chunks and play/send to telephony
        async for message in ws:
            data = json.loads(message)
            if audio := data.get("audio"):
                yield base64.b64decode(audio)
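One detail worth handling before the socket: LLM tokens often arrive mid-word, and sending every tiny fragment as its own transcript message can hurt prosody. A common pattern is to buffer tokens and flush on clause boundaries. A minimal sketch (the helper, the flush characters, and the buffer cap are our own tuning choices, not Cartesia requirements):

```python
from typing import Iterable, Iterator

# Flush points and buffer cap are arbitrary tuning choices, not API requirements.
CLAUSE_ENDINGS = set(".!?,;:")

def chunk_for_tts(tokens: Iterable[str], max_buffer: int = 80) -> Iterator[str]:
    """Buffer streamed LLM tokens into clause-sized chunks for TTS transcripts."""
    buf = ""
    for tok in tokens:
        buf += tok
        # Flush on clause-ending punctuation, or when the buffer grows long.
        if buf and (buf[-1] in CLAUSE_ENDINGS or len(buf) >= max_buffer):
            yield buf
            buf = ""
    if buf:  # flush any trailing text
        yield buf

# Each yielded chunk would become one transcript message on the WebSocket.
chunks = list(chunk_for_tts(["Hello", ",", " how", " can", " I", " help", "?"]))
```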
Voice Cloning (3 seconds of audio)
# Clone a voice from 3 seconds of audio
import os

import requests

response = requests.post(
    "https://api.cartesia.ai/voices/clone/clip",
    headers={
        "Cartesia-Version": "2024-06-10",
        "X-API-Key": os.environ["CARTESIA_API_KEY"],
    },
    files={"clip": open("sample.wav", "rb")},
    data={"name": "Custom Voice"},
)
voice_id = response.json()["id"]

# Use immediately in generation (client from the SDK example above)
for output in client.tts.sse(
    model_id="sonic-2",
    transcript="Your cloned voice is ready.",
    voice={"mode": "id", "id": voice_id},
    output_format={"container": "mp3", "bit_rate": 128000, "sample_rate": 44100},
):
    pass  # Process audio chunks
ElevenLabs
The Full Audio Platform
ElevenLabs is more than TTS — it's a complete audio production platform. Beyond the API, it includes:
- Conversational AI: Pre-built voice agent framework with turn detection, interruption handling, and telephony integrations
- AI Dubbing: Automatically dub content into 29 languages while preserving the original speaker's voice
- Text to Sound Effects: Generate custom SFX from text descriptions
- Studio: Long-form audio editor for narration and audiobooks
- ElevenReader: iOS/Android app that reads any content aloud
For developers, the API covers TTS, speech-to-speech, voice cloning, and the Conversational AI framework.
TTS API
import os

from elevenlabs import ElevenLabs

client = ElevenLabs(api_key=os.environ["ELEVENLABS_API_KEY"])

# Basic TTS with Turbo v2.5
audio = client.text_to_speech.convert(
    voice_id="JBFqnCBsd6RMkjVDRZzb",  # George — deep British narrator
    model_id="eleven_turbo_v2_5",  # Best latency/quality balance
    text="The quick brown fox jumps over the lazy dog.",
    output_format="mp3_44100_128",
    voice_settings={
        "stability": 0.5,
        "similarity_boost": 0.75,
        "style": 0.0,
        "use_speaker_boost": True,
    },
)

with open("output.mp3", "wb") as f:
    for chunk in audio:
        f.write(chunk)
Streaming for Low-Latency Apps
# Streaming TTS for voice agents
for audio_chunk in client.text_to_speech.convert_as_stream(
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    model_id="eleven_flash_v2_5",  # Fastest ElevenLabs model (~300ms enterprise)
    text="How can I help you today?",
    output_format="pcm_16000",  # Raw PCM for telephony
):
    # Send to telephony / WebSocket / audio buffer
    send_audio(audio_chunk)
Multilingual TTS (70+ Languages)
# ElevenLabs handles non-English natively
audio = client.text_to_speech.convert(
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    model_id="eleven_turbo_v2_5",  # language_code enforcement requires Turbo/Flash v2.5
    text="Bonjour, comment puis-je vous aider aujourd'hui?",  # French
    language_code="fr",
    output_format="mp3_44100_128",
)

# Auto-detect language (no language_code needed)
audio_ja = client.text_to_speech.convert(
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    model_id="eleven_multilingual_v2",
    text="こんにちは、本日はどのようなご用件でしょうか?",  # Japanese
)
Conversational AI (Voice Agent Framework)
ElevenLabs includes a full voice agent SDK — not just TTS:
from elevenlabs.conversational_ai.conversation import Conversation
from elevenlabs.conversational_ai.default_audio_interface import DefaultAudioInterface

conversation = Conversation(
    client=client,
    agent_id=os.environ["ELEVENLABS_AGENT_ID"],
    requires_auth=False,
    audio_interface=DefaultAudioInterface(),
    callback_agent_response=lambda response: print(f"Agent: {response}"),
    callback_user_transcript=lambda transcript: print(f"User: {transcript}"),
)

conversation.start_session()
# Real-time two-way voice conversation — handles STT + LLM + TTS
Pricing Comparison
Cartesia (2026 pricing):
Free: 1,000 characters/month
Scale: $0.011 per 1,000 characters
Enterprise: Custom (volume discounts)
Example: 10M characters/month → $110/month
ElevenLabs (2026 pricing):
Free: 10,000 chars/month
Starter: $5/month — 30,000 chars ($0.167/K chars)
Creator: $22/month — 100,000 chars ($0.22/K chars)
Pro: $99/month — 500,000 chars ($0.198/K chars)
Scale: $330/month — 2,000,000 chars ($0.165/K chars)
Business: $1,320/month — 10,000,000 chars ($0.132/K chars)
Enterprise: Custom
Example: 10M characters/month → $1,320/month (vs Cartesia $110)
Cost ratio at 10M chars/month: ElevenLabs costs ~12x more. At 100M chars, Cartesia wins by an even larger margin. ElevenLabs' per-character rate improves with volume but never approaches Cartesia's pricing.
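The comparison above can be expressed as a small cost model. The rates and quotas are the article's published figures; the plan-selection helper is a deliberate simplification (it just picks the cheapest self-serve plan whose quota covers the volume, ignoring overage billing and annual discounts):

```python
# Rates and quotas below are the article's published 2026 figures.
CARTESIA_RATE = 0.011  # $ per 1,000 characters (Scale plan)

# ElevenLabs self-serve plans: (name, monthly price $, included characters)
ELEVENLABS_PLANS = [
    ("Starter", 5, 30_000),
    ("Creator", 22, 100_000),
    ("Pro", 99, 500_000),
    ("Scale", 330, 2_000_000),
    ("Business", 1_320, 10_000_000),
]

def cartesia_cost(chars: int) -> float:
    """Usage-based monthly cost on Cartesia's Scale plan."""
    return chars / 1_000 * CARTESIA_RATE

def elevenlabs_cost(chars: int):
    """Cheapest self-serve plan whose quota covers the volume, or None.

    Simplification: ignores overage billing and annual discounts.
    """
    for name, price, quota in ELEVENLABS_PLANS:
        if chars <= quota:
            return name, price
    return None  # beyond self-serve tiers (enterprise pricing)

print(cartesia_cost(10_000_000))    # ~110.0
print(elevenlabs_cost(10_000_000))  # ('Business', 1320)
```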
Latency Benchmarks
Time-to-First-Audio (TTFA) — p50 measurements:
Self-serve tier:
Cartesia Sonic: 199ms ← best for voice agents
ElevenLabs Turbo v2.5: ~450ms
ElevenLabs Flash v2.5: ~350ms
ElevenLabs Standard: ~832ms
Enterprise tier (dedicated infra):
Cartesia Sonic: ~150ms
ElevenLabs Flash v2.5: ~280ms
ElevenLabs Turbo: ~320ms
For context:
<300ms: Natural-feeling real-time conversation
300-600ms: Slight but noticeable delay
>600ms: Clearly perceptible pause, breaks conversational flow
Feature Comparison
| Feature | Cartesia | ElevenLabs |
|---|---|---|
| Price (per 1K chars) | ~$0.011 | ~$0.132–$0.22 |
| Best TTFA | 199ms | ~280ms (enterprise Flash) |
| Architecture | SSMs (recurrent) | Transformer |
| Languages | 15 | 70+ |
| Voice cloning speed | 3 seconds | 30 seconds |
| Voice cloning slots | Unlimited | 10–660 (plan-dependent) |
| WebSocket streaming | ✅ | ✅ |
| Conversational AI SDK | ❌ | ✅ full framework |
| AI dubbing | ❌ | ✅ (29 languages) |
| Sound effects | ❌ | ✅ |
| Voice design | ✅ | ✅ |
| Voice library | Limited | Massive (thousands) |
| Speech-to-speech | ❌ | ✅ |
| SOC 2 | ✅ | ✅ |
| HIPAA | Enterprise | Enterprise |
| SDKs | Python, JS/TS, Go | Python, JS/TS |
| Free tier | 1K chars/month | 10K chars/month |
Decision Guide
Choose Cartesia when:
- Building real-time voice agents (phone bots, voice assistants)
- Latency is critical — you need TTFA under 300ms
- Cost efficiency matters — high character volume
- English-primary products, or Cartesia's 15 supported languages cover your needs
- API-only, no platform features needed
Choose ElevenLabs when:
- You need 70+ language support
- Building multilingual dubbing pipelines
- Quality and voice variety matter more than latency
- Using the Conversational AI framework (built-in STT + LLM + TTS orchestration)
- Content creation, audiobooks, narration (not just real-time voice agents)
- You want the full audio platform (sound effects, studio)
Browse all voice AI and TTS APIs at APIScout.
Related: ElevenLabs vs OpenAI TTS vs Deepgram Aura · Best Voice and Speech APIs 2026