
ElevenLabs vs Cartesia: Best Voice AI API 2026

APIScout Team
elevenlabs · cartesia · voice-ai · text-to-speech · tts-api · 2026

TL;DR

Cartesia for real-time voice agents — 199ms TTFA (vs ElevenLabs' 832ms), roughly 12–20x cheaper at $0.011/1K chars, and an SSM architecture that makes it the best latency choice for conversational AI. ElevenLabs for quality-first audio production — 70+ languages, the broadest voice library, dubbing, sound effects, and a complete platform for content creators and multilingual apps. The practical split in 2026: Cartesia for AI phone agents and voice assistants; ElevenLabs for narration, dubbing, and premium voice experiences.

Key Takeaways

  • Cartesia pricing: $0.011/1K chars (roughly 12–20x cheaper than ElevenLabs, depending on plan)
  • ElevenLabs pricing: ~$0.132–$0.22/1K chars (effective rate on self-serve plans)
  • Cartesia TTFA: 199ms (Sonic model, self-serve tier)
  • ElevenLabs TTFA: 832ms (self-serve tier), ~300ms on enterprise tier
  • Architecture: Cartesia uses State Space Models (SSMs); ElevenLabs uses transformers
  • Language support: ElevenLabs 70+ languages; Cartesia 15 languages
  • Voice cloning: Cartesia requires 3 seconds; ElevenLabs requires 30 seconds
  • Platform scope: ElevenLabs is full audio platform; Cartesia is API-only TTS

Why Voice AI Latency Matters

For voice agents (AI phone calls, real-time assistants, customer support bots), latency is the bottleneck. A 200ms TTFA feels like a natural conversation. An 800ms TTFA creates an awkward pause that feels broken.

User speaks → STT transcription → LLM inference → TTS → User hears response

Full turn latency budget:
  STT:    ~200ms (Deepgram/Whisper real-time)
  LLM:    ~400ms (streaming first token)
  TTS:    target <300ms TTFA
  Total:  ~900ms for natural conversation

Cartesia TTFA:   199ms → Total ~799ms (below 1s threshold)
ElevenLabs TTFA: 832ms → Total ~1432ms (above 1s, feels slow)

This is why Cartesia has dominated new voice agent deployments in 2026 — the latency advantage directly translates to better conversation quality.


Cartesia

Architecture: State Space Models

Cartesia's Sonic model is built on State Space Models (SSMs) — a fundamentally different architecture from transformer-based TTS. SSMs maintain a compact recurrent state that updates incrementally as text arrives, enabling streaming synthesis before the full sentence is processed.
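
The streaming property is easy to see in a toy linear state-space layer — a few lines of NumPy, purely illustrative (Sonic's real SSM layers are far larger, nonlinear, and trained end-to-end):

```python
import numpy as np

# Toy discretized state-space layer: h_t = A @ h_{t-1} + B @ x_t, y_t = C @ h_t.
# Each input frame updates a fixed-size state in O(1), so an output frame can be
# emitted immediately -- no re-attending over the whole prefix as in a transformer.
rng = np.random.default_rng(0)
d_state, d_in, d_out = 8, 4, 4
A = 0.1 * rng.standard_normal((d_state, d_state))
B = rng.standard_normal((d_state, d_in))
C = rng.standard_normal((d_out, d_state))

h = np.zeros(d_state)
outputs = []
for t in range(5):                 # frames arrive one at a time
    x = rng.standard_normal(d_in)  # stand-in for an incoming text/audio frame
    h = A @ h + B @ x              # constant-size recurrent state update
    outputs.append(C @ h)          # output available before later frames exist

print(len(outputs), outputs[0].shape)
```

The per-step cost is constant regardless of how much text has already been consumed, which is the architectural reason an SSM can start producing audio earlier than an attention stack over a growing sequence.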

# Cartesia Python SDK
import os

import pyaudio
from cartesia import Cartesia

client = Cartesia(api_key=os.environ["CARTESIA_API_KEY"])

# Stream audio for low-latency playback
p = pyaudio.PyAudio()
rate = 44100
stream = p.open(format=pyaudio.paFloat32, channels=1, rate=rate, output=True)

# Generate and stream immediately
output_format = {
    "container": "raw",
    "encoding": "pcm_f32le",
    "sample_rate": rate,
}

for output in client.tts.sse(
    model_id="sonic-2",
    transcript="Hello, how can I help you today?",
    voice={"mode": "id", "id": "a0e99841-438c-4a64-b679-ae501e7d6091"},
    output_format=output_format,
    stream=True,
):
    buffer = output.get("audio")
    if buffer:
        stream.write(buffer)

stream.stop_stream()
stream.close()
p.terminate()

WebSocket API for Real-Time Agents

For voice agents, use the WebSocket API to send text chunks as they arrive from the LLM:

import asyncio
import base64
import json
import os

import websockets

async def voice_agent_response(llm_text_stream, voice_id: str):
    """Stream LLM output directly to Cartesia for ultra-low latency."""
    uri = "wss://api.cartesia.ai/tts/websocket"
    headers = {
        "Cartesia-Version": "2024-06-10",
        "X-API-Key": os.environ["CARTESIA_API_KEY"],
    }

    async with websockets.connect(uri, additional_headers=headers) as ws:
        context_id = "ctx-001"

        # Send text chunks as they arrive from LLM streaming
        async for text_chunk in llm_text_stream:
            await ws.send(json.dumps({
                "context_id": context_id,
                "model_id": "sonic-2",
                "transcript": text_chunk,
                "voice": {"mode": "id", "id": voice_id},
                "output_format": {
                    "container": "raw",
                    "encoding": "pcm_f32le",
                    "sample_rate": 16000,
                },
                "continue": True,  # More chunks coming
            }))

        # Signal end of utterance
        await ws.send(json.dumps({
            "context_id": context_id,
            "transcript": "",
            "continue": False,
        }))

        # Receive audio chunks and play/send to telephony
        async for message in ws:
            data = json.loads(message)
            if audio := data.get("audio"):
                yield base64.b64decode(audio)
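
In a real agent you usually don't forward raw LLM tokens one by one; buffering them into clause-sized chunks before each `ws.send` gives the TTS model enough context for natural prosody. A minimal sketch — the helper name and threshold are illustrative, not part of the Cartesia SDK:

```python
import re

def chunk_for_tts(token_stream, min_chars=30):
    """Accumulate LLM tokens and flush on punctuation once a minimum
    length is reached, so each TTS request gets a complete clause."""
    buf = ""
    for tok in token_stream:
        buf += tok
        if len(buf) >= min_chars and re.search(r"[.!?,;:]\s*$", buf):
            yield buf.strip()
            buf = ""
    if buf.strip():  # flush whatever remains at end of turn
        yield buf.strip()

tokens = ["Hello, how are you? ", "I am doing fine. ", "Thanks!"]
print(list(chunk_for_tts(tokens)))
# ['Hello, how are you? I am doing fine.', 'Thanks!']
```

Tuning `min_chars` trades prosody (longer chunks) against time-to-first-audio (shorter chunks).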

Voice Cloning (3 seconds of audio)

# Clone a voice from 3 seconds of audio
import os

import requests

with open("sample.wav", "rb") as clip:
    response = requests.post(
        "https://api.cartesia.ai/voices/clone/clip",
        headers={
            "Cartesia-Version": "2024-06-10",
            "X-API-Key": os.environ["CARTESIA_API_KEY"],
        },
        files={"clip": clip},
        data={"name": "Custom Voice"},
    )
voice_id = response.json()["id"]

# Use immediately in generation
for output in client.tts.sse(
    model_id="sonic-2",
    transcript="Your cloned voice is ready.",
    voice={"mode": "id", "id": voice_id},
    output_format={"container": "mp3", "bit_rate": 128000, "sample_rate": 44100},
):
    pass  # Process audio chunks

ElevenLabs

The Full Audio Platform

ElevenLabs is more than TTS — it's a complete audio production platform. Beyond the API, it includes:

  • Conversational AI: Pre-built voice agent framework with turn detection, interruption handling, and telephony integrations
  • AI Dubbing: Automatically dub content into 29 languages while preserving the original speaker's voice
  • Text to Sound Effects: Generate custom SFX from text descriptions
  • Studio: Long-form audio editor for narration and audiobooks
  • ElevenReader: iOS/Android app that reads any content aloud

For developers, the API covers TTS, speech-to-speech, voice cloning, and the Conversational AI framework.

TTS API

import os

from elevenlabs import ElevenLabs

client = ElevenLabs(api_key=os.environ["ELEVENLABS_API_KEY"])

# Basic TTS — Turbo v2.5 offers the best latency/quality balance
audio = client.text_to_speech.convert(
    voice_id="JBFqnCBsd6RMkjVDRZzb",  # George — deep British narrator
    model_id="eleven_turbo_v2_5",
    text="The quick brown fox jumps over the lazy dog.",
    output_format="mp3_44100_128",
    voice_settings={
        "stability": 0.5,
        "similarity_boost": 0.75,
        "style": 0.0,
        "use_speaker_boost": True,
    },
)

with open("output.mp3", "wb") as f:
    for chunk in audio:
        f.write(chunk)

Streaming for Low-Latency Apps

# Streaming TTS for voice agents
for audio_chunk in client.text_to_speech.convert_as_stream(
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    model_id="eleven_flash_v2_5",  # Fastest ElevenLabs model (~300ms enterprise)
    text="How can I help you today?",
    output_format="pcm_16000",  # Raw PCM for telephony
):
    # Send to telephony / WebSocket / audio buffer
    send_audio(audio_chunk)

Multilingual TTS (70+ Languages)

# ElevenLabs handles non-English natively
audio = client.text_to_speech.convert(
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    model_id="eleven_multilingual_v2",
    text="Bonjour, comment puis-je vous aider aujourd'hui?",  # French
    language_code="fr",
    output_format="mp3_44100_128",
)

# Auto-detect language (no language_code needed)
audio_ja = client.text_to_speech.convert(
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    model_id="eleven_multilingual_v2",
    text="こんにちは、本日はどのようなご用件でしょうか?",  # Japanese
)

Conversational AI (Voice Agent Framework)

ElevenLabs includes a full voice agent SDK — not just TTS:

import os

from elevenlabs.conversational_ai.conversation import Conversation
from elevenlabs.conversational_ai.default_audio_interface import DefaultAudioInterface

conversation = Conversation(
    client=client,
    agent_id=os.environ["ELEVENLABS_AGENT_ID"],
    requires_auth=False,
    audio_interface=DefaultAudioInterface(),
    callback_agent_response=lambda response: print(f"Agent: {response}"),
    callback_user_transcript=lambda transcript: print(f"User: {transcript}"),
)

conversation.start_session()
# Real-time two-way voice conversation — handles STT + LLM + TTS

Pricing Comparison

Cartesia (2026 pricing):
  Free:        1,000 characters/month
  Scale:       $0.011 per 1,000 characters
  Enterprise:  Custom (volume discounts)

  Example: 10M characters/month → $110/month

ElevenLabs (2026 pricing):
  Free:        10,000 chars/month
  Starter:     $5/month — 30,000 chars ($0.167/K chars)
  Creator:     $22/month — 100,000 chars ($0.22/K chars)
  Pro:         $99/month — 500,000 chars ($0.198/K chars)
  Scale:       $330/month — 2,000,000 chars ($0.165/K chars)
  Business:    $1,320/month — 10,000,000 chars ($0.132/K chars)
  Enterprise:  Custom

  Example: 10M characters/month → $1,320/month (vs Cartesia $110)

Cost ratio at 10M chars/month: ElevenLabs costs ~12x more. At 100M chars, Cartesia wins by an even larger margin. ElevenLabs' per-character rate improves with volume but never approaches Cartesia's pricing.
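
The plan math above can be sanity-checked in a few lines — a deliberate simplification that picks the cheapest listed self-serve plan covering the volume and ignores overage rates and annual discounts:

```python
def elevenlabs_monthly(chars: int):
    """Cheapest listed self-serve plan whose quota covers the volume."""
    plans = [(5, 30_000), (22, 100_000), (99, 500_000),
             (330, 2_000_000), (1_320, 10_000_000)]
    for fee, quota in plans:
        if chars <= quota:
            return fee
    return None  # beyond listed plans -> enterprise pricing

def cartesia_monthly(chars: int) -> float:
    return chars * 11 / 1_000_000  # $0.011 per 1K chars = $11 per 1M

for volume in (100_000, 2_000_000, 10_000_000):
    print(volume, cartesia_monthly(volume), elevenlabs_monthly(volume))
# 100000 1.1 22
# 2000000 22.0 330
# 10000000 110.0 1320
```

At every listed volume tier the gap stays roughly an order of magnitude, which is why the cost argument holds across plan sizes and not just at the 10M-character example.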


Latency Benchmarks

Time-to-First-Audio (TTFA) — p50 measurements:

Self-serve tier:
  Cartesia Sonic:          199ms ← best for voice agents
  ElevenLabs Turbo v2.5:  ~450ms
  ElevenLabs Flash v2.5:  ~350ms
  ElevenLabs Standard:    ~832ms

Enterprise tier (dedicated infra):
  Cartesia Sonic:          ~150ms
  ElevenLabs Flash v2.5:  ~280ms
  ElevenLabs Turbo:        ~320ms

For context:
  <300ms:   Natural-feeling real-time conversation
  300-600ms: Slight but noticeable delay
  >600ms:   Clearly perceptible pause, breaks conversational flow

Feature Comparison

Feature                  Cartesia             ElevenLabs
Price (per 1K chars)     ~$0.011              ~$0.132–$0.22
Best TTFA                199ms                ~280ms (enterprise Flash)
Architecture             SSMs (recurrent)     Transformer
Languages                15                   70+
Voice cloning sample     3 seconds            30 seconds
Voice cloning slots      Unlimited            10–660 (plan-dependent)
WebSocket streaming      ✅                   ✅
Conversational AI SDK    ❌                   ✅ full framework
AI dubbing               ❌                   ✅ (29 languages)
Sound effects            ❌                   ✅
Voice design             ❌                   ✅
Voice library            Limited              Massive (thousands)
Speech-to-speech         ❌                   ✅
SOC 2                    ✅                   ✅
HIPAA                    Enterprise           Enterprise
SDKs                     Python, JS/TS, Go    Python, JS/TS
Free tier                1K chars/month       10K chars/month

Decision Guide

Choose Cartesia when:

  • Building real-time voice agents (phone bots, voice assistants)
  • Latency is critical — you need TTFA under 300ms
  • Cost efficiency matters — high character volume
  • English-primary, or the 15 supported languages cover your needs
  • API-only, no platform features needed

Choose ElevenLabs when:

  • You need 70+ language support
  • Building multilingual dubbing pipelines
  • Quality and voice variety matter more than latency
  • Using the Conversational AI framework (built-in STT + LLM + TTS orchestration)
  • Content creation, audiobooks, narration (not just real-time voice agents)
  • You want the full audio platform (sound effects, studio)

Browse all voice AI and TTS APIs at APIScout.

Related: ElevenLabs vs OpenAI TTS vs Deepgram Aura · Best Voice and Speech APIs 2026
