OpenAI Realtime API Voice Apps: WebRTC Guide (2026)

TL;DR

Use the OpenAI Realtime API when the product needs a live spoken conversation, not just a recorded audio request. For 2026 voice assistants, the practical starting point is gpt-realtime-2 with a browser WebRTC session or the Agents SDK RealtimeAgent / RealtimeSession helpers. Use WebSocket from a trusted server for phone, contact-center, or backend media pipelines. Keep API keys server-side, create ephemeral client secrets for browser/mobile clients, set OpenAI-Safety-Identifier from your backend, and confirm current model, voice, and pricing details before launch because Realtime API surfaces change faster than standard text APIs.

June 19 source refresh: the current OpenAI Realtime and WebRTC docs still support the same architecture split: WebRTC for browser/mobile media, WebSocket for trusted server pipelines, and official model/pricing pages as the final source for SKU-level cost math.

OpenAI's current Realtime voice options are alloy, ash, ballad, coral, echo, sage, shimmer, verse, marin, and cedar; OpenAI recommends marin or cedar for best quality in the Realtime conversations documentation. Voice selection matters because the voice cannot be changed after the model has emitted audio in a session.

Key takeaways

Best fit: low-latency speech-to-speech agents, live support copilots, language practice, interruptible voice search, and workflows where users expect natural turn taking.
Not always best: file transcription, voice notes, batch summarization, or approval-heavy workflows where a chained speech-to-text → text agent → text-to-speech pipeline is easier to inspect and control.
Model path: OpenAI's Realtime overview points low-latency voice agents at gpt-realtime-2; older gpt-4o-realtime-preview names should be treated as migration/deprecation context, not a new-build default.
Transport: WebRTC is the default recommendation for browser or mobile clients; WebSocket is the server-to-server path.
Security: standard OpenAI API keys belong only on your backend. Browser/mobile sessions should use the unified WebRTC interface or short-lived client secrets created by your server.
Cost control: Realtime billing for conversational sessions is token based and varies by model. Read usage from response.done, account for separately billed input transcription, and design sessions so prompt caching can apply.

Quick decision table

If your application needs...	Start with	Why
A browser voice assistant with barge-in and natural turn taking	Agents SDK `RealtimeAgent` + `RealtimeSession` over WebRTC	OpenAI's voice-agent docs put browser speech-to-speech here first.
A custom browser/mobile media client	Realtime WebRTC guide	WebRTC avoids relaying every audio packet through your app server and is the recommended client transport.
A phone, SIP, broadcast, or backend media pipeline	Realtime WebSocket or SIP path	The secret stays on your server and your backend owns audio framing, routing, and logging.
A predictable workflow with auditable intermediate text	Chained STT → text agent → TTS	You can inspect transcript text, run policy checks, and swap each vendor independently.
File upload transcription or generated speech from text	Request-based audio APIs	A persistent realtime session adds cost and complexity you do not need.

Current Realtime voice list for 2026

The voice list is the most common lookup, so it comes before the implementation details:

Voice	Practical note
`marin`	Current OpenAI docs recommend this for best Realtime output quality. Start here for most production assistants.
`cedar`	Also recommended by OpenAI for best Realtime output quality; useful as the second finalist in voice QA.
`alloy`	Neutral fallback that appears in many OpenAI audio examples.
`ash`, `ballad`, `coral`, `echo`, `sage`, `shimmer`, `verse`	Available Realtime output voices; evaluate them against your real call scripts rather than demos.

Two caveats matter in production:

Voice availability is model-specific. The Realtime conversations guide is the source to check before launch because Text-to-Speech and Realtime voice lists are not identical.
Voice is effectively a session-level product decision. OpenAI documents that most session properties can be updated at any time, but voice cannot be changed after the model has responded with audio once. If enterprise and consumer users need different tones, choose the voice before audio output begins.
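
Because the voice locks once audio has been emitted, one option is to resolve it on the backend before the session config is built. The sketch below is only an illustration: the tier names, the tier-to-voice mapping, and the `audioEmitted` flag are hypothetical, and the voice list is the one quoted in this article, so recheck it against the current docs.

```javascript
// Voices listed in this article for Realtime output; recheck the current docs before launch.
const REALTIME_VOICES = new Set([
  "alloy", "ash", "ballad", "coral", "echo",
  "sage", "shimmer", "verse", "marin", "cedar",
]);

// Hypothetical mapping from a product tier to a voice; the tier names are illustrative.
const VOICE_BY_TIER = new Map([
  ["enterprise", "cedar"],
  ["consumer", "marin"],
]);

// Resolve the voice once, before the session is created.
// An unknown requested voice falls back to the tier default, then to "marin".
function resolveVoice(tier, requestedVoice) {
  if (requestedVoice && REALTIME_VOICES.has(requestedVoice)) return requestedVoice;
  return VOICE_BY_TIER.get(tier) ?? "marin";
}

// UI guard: only offer a voice picker while the session has not produced audio.
// `audioEmitted` is a flag your own client would maintain (an assumption of this sketch).
function canChangeVoice(sessionState) {
  return sessionState.audioEmitted === false;
}
```

In a UI, `canChangeVoice` would typically disable the voice picker as soon as the first audio delta arrives.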

Recommended architecture for a browser voice agent

OpenAI's voice-agent guidance describes the usual browser flow this way: your server creates an ephemeral client secret, the frontend creates a RealtimeSession, the session connects over WebRTC, and the agent handles audio turns, tools, interruptions, and handoffs in that session.

A production browser architecture usually looks like this:

Browser microphone
  → WebRTC peer connection
  → OpenAI Realtime session
  → remote audio stream back to browser

Browser data channel
  ↔ session events, tool results, UI state, transcripts

Your backend
  → creates client secret / unified WebRTC call
  → attaches safety identifier
  → owns tools, account permissions, logging policy, budgets

Use this shape when latency is the product. Do not route all browser audio through your application server unless you need server-side media processing, compliance recording, telephony integration, or a network environment where WebRTC is not viable.

WebRTC setup choices: unified interface vs client secrets

OpenAI's WebRTC guide describes two browser connection options:

WebRTC option	How it works	Tradeoff
Unified interface	Browser posts SDP to your server; your server sends SDP + session config to `/v1/realtime/calls` with a standard API key and returns OpenAI's SDP answer.	Simpler credential model; your server is in the critical path during session initialization.
Ephemeral client secret	Browser asks your server for a client secret; browser then sends SDP to the Realtime API with that short-lived credential.	Keeps the standard API key off the client while letting the browser establish the peer connection directly.

In both designs, the standard OpenAI API key stays on your backend. If you assign stable end-user identifiers, set OpenAI-Safety-Identifier from your trusted server. OpenAI's docs call out that with ephemeral tokens, the safety identifier should be set on the server-side request that creates the client secret so the identifier is bound to the resulting session.
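
The two options differ mainly in which request your backend sends. The following is a minimal sketch of both server-side requests, built as plain objects so they can be inspected before being passed to `fetch`. The `/v1/realtime/calls` path comes from the table above; the `/client_secrets` path, the `sdp` and `session` form field names, and the `{ session }` JSON body are assumptions to verify against the current WebRTC guide. It relies on the global `FormData` available in Node 18 and later.

```javascript
const OPENAI_REALTIME_BASE = "https://api.openai.com/v1/realtime";

// Headers shared by both server-side requests. The standard API key never leaves the backend.
function serverHeaders(apiKey, safetyIdentifier) {
  const headers = { Authorization: `Bearer ${apiKey}` };
  // Only attach the identifier when you have a stable (ideally hashed) end-user ID.
  if (safetyIdentifier) headers["OpenAI-Safety-Identifier"] = safetyIdentifier;
  return headers;
}

// Option 1, unified interface: forward the browser's SDP offer plus the session config.
// The form field names are assumptions of this sketch.
function buildUnifiedCallRequest({ apiKey, safetyIdentifier, sdpOffer, session }) {
  const form = new FormData();
  form.set("sdp", sdpOffer);
  form.set("session", JSON.stringify(session));
  return {
    url: `${OPENAI_REALTIME_BASE}/calls`,
    init: { method: "POST", headers: serverHeaders(apiKey, safetyIdentifier), body: form },
  };
}

// Option 2, ephemeral client secret: the endpoint path and body shape are assumptions.
// The safety identifier is set here so it is bound to the resulting session.
function buildClientSecretRequest({ apiKey, safetyIdentifier, session }) {
  return {
    url: `${OPENAI_REALTIME_BASE}/client_secrets`,
    init: {
      method: "POST",
      headers: {
        ...serverHeaders(apiKey, safetyIdentifier),
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ session }),
    },
  };
}
```

A route handler would then call `fetch(url, init)` and return either OpenAI's SDP answer (option 1) or the short-lived secret (option 2) to the browser, without logging either value.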

A current-style session configuration puts audio settings under session.audio:

const session = {
  type: "realtime",
  model: "gpt-realtime-2",
  output_modalities: ["audio"],
  audio: {
    input: {
      format: { type: "audio/pcm", rate: 24000 },
      turn_detection: { type: "semantic_vad" },
    },
    output: {
      format: { type: "audio/pcm" },
      voice: "marin",
    },
  },
  instructions: "Speak clearly and briefly. Confirm before taking actions.",
};

If you are updating an older tutorial, check for stale fields. The GA docs use session.type, output_modalities, audio.input, audio.output, and newer response events such as response.output_audio.delta and response.output_audio_transcript.delta. Older snippets that use gpt-4o-realtime-preview, modalities, a top-level voice, or response.audio.delta need review before you copy and paste them.
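
A small lint-style check can catch those stale fields in configs copied from older tutorials. This is a sketch that only covers the fields named in the paragraph above; the function name and the wording of the hints are hypothetical, and it is not a full schema validator.

```javascript
// Top-level keys that older snippets used, with a hint for the newer shape.
const STALE_TOP_LEVEL_KEYS = {
  modalities: "use output_modalities",
  voice: "move to audio.output.voice",
};

// Returns a list of human-readable findings; an empty list means nothing stale was spotted.
function findStaleSessionFields(session) {
  const findings = [];
  for (const [key, hint] of Object.entries(STALE_TOP_LEVEL_KEYS)) {
    if (key in session) findings.push(`${key}: ${hint}`);
  }
  if (typeof session.model === "string" && session.model.startsWith("gpt-4o-realtime-preview")) {
    findings.push("model: preview model name, confirm the current model in the docs");
  }
  if (session.type === undefined) {
    findings.push("type: missing session.type");
  }
  return findings;
}
```

Running it in a unit test or CI step is usually enough; it does not need to ship in the client.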

When to choose WebSocket instead

WebSocket is still the right path for server-to-server Realtime integrations. OpenAI's WebSocket guide explicitly positions it for backend systems and recommends WebRTC for browser/mobile clients.

Choose WebSocket when:

you are connecting a phone/SIP/contact-center media stream;
you need your backend to inspect, transform, or store audio packets before they reach OpenAI;
your workflow already runs in a stateful worker and browser WebRTC is not involved;
you need tighter control over reconnect behavior, call recording, region routing, or custom queueing.

The minimal server-side connection shape is simple:

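import WebSocket from "ws";

// Server-side only: the standard API key is read from the environment and never sent to a client.
const ws = new WebSocket(
  "wss://api.openai.com/v1/realtime?model=gpt-realtime-2",
  {
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      // Stable, hashed end-user identifier set by the trusted backend.
      "OpenAI-Safety-Identifier": "hashed-internal-user-id",
    },
  },
);

// Configure the session as soon as the socket opens.
ws.on("open", () => {
  ws.send(JSON.stringify({
    type: "session.update",
    session: {
      type: "realtime",
      model: "gpt-realtime-2",
      output_modalities: ["audio"],
      audio: {
        input: { format: { type: "audio/pcm", rate: 24000 } },
        output: { format: { type: "audio/pcm" }, voice: "marin" },
      },
      instructions: "Be concise and ask before taking irreversible actions.",
    },
  }));
});

// Handle connection failures explicitly; an unhandled "error" event would crash the process.
ws.on("error", (err) => {
  console.error("Realtime WebSocket error:", err.message);
});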

With WebSocket you also own more of the audio loop: base64-encoded input audio chunks, playback buffering, interruption handling, transcript storage, reconnect UI, and session cleanup. That extra control is useful for backend media products, but it is unnecessary work for most browser assistants.
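
As an illustration of the base64 input step, the sketch below turns 16-bit PCM samples into a client event. The `input_audio_buffer.append` event name and its `audio` field are assumptions not stated in this article, so confirm them in the WebSocket guide. The helper names are hypothetical, and the code assumes a little-endian host, which is what most servers are.

```javascript
// Convert 16-bit PCM samples into a base64 string without copying the underlying bytes twice.
function pcm16ToBase64(samples) {
  return Buffer.from(samples.buffer, samples.byteOffset, samples.byteLength).toString("base64");
}

// Wrap one chunk in a client event. Event name and field are assumptions of this sketch.
function buildAudioAppendEvent(samples) {
  return JSON.stringify({
    type: "input_audio_buffer.append",
    audio: pcm16ToBase64(samples),
  });
}

// Split a longer capture into fixed-size chunks so each message stays small.
function* chunkSamples(samples, chunkSize) {
  for (let i = 0; i < samples.length; i += chunkSize) {
    yield samples.subarray(i, i + chunkSize);
  }
}
```

With the socket from the example above, sending would look like `for (const chunk of chunkSamples(pcm, 2400)) ws.send(buildAudioAppendEvent(chunk));`, where 2400 samples is 100 ms at 24 kHz.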

Build the product contract before the demo

A Realtime demo can feel impressive in an afternoon. A production voice app needs a stricter product contract:

Allowed actions: which tools may run during a spoken session, which need confirmation, and which are read-only.
Turn-taking rules: when the assistant should interrupt itself, ask a clarifying question, or stop speaking.
Transcript policy: what is stored, for how long, and whether users can delete it.
Fallback path: what the UI does when the session expires, the network drops, or the session runs into a rate limit mid-call.
Voice identity disclosure: users should know they are speaking with an AI-generated voice, especially in support, sales, education, healthcare, or regulated workflows.
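
The allowed-actions part of that contract can be written down as data and enforced before any tool runs. The sketch below is one possible shape; the tool names, the policy levels, and the `decideToolCall` helper are all hypothetical.

```javascript
// Hypothetical tool names mapped to a policy level.
const TOOL_POLICY = new Map([
  ["lookup_order", "read_only"],
  ["update_address", "confirm"],
  ["issue_refund", "confirm"],
]);

// Decide whether a tool call requested during a spoken session may run.
// Unknown tools are rejected; "confirm" tools need an explicit user confirmation first.
function decideToolCall(toolName, userConfirmed) {
  const policy = TOOL_POLICY.get(toolName);
  if (policy === undefined) return { allow: false, reason: "tool_not_allowed" };
  if (policy === "confirm" && !userConfirmed) return { allow: false, reason: "needs_confirmation" };
  return { allow: true, reason: policy };
}
```

Keeping the check on the backend means a prompt change cannot widen what the agent is able to do.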

OpenAI's voice-agent docs also make a strategic point: voice uses the same core agent concepts as text. Tools, handoffs, guardrails, human review, observability, and durable state still belong in your application design; the audio surface only changes the transport and interaction loop.

Pricing and usage caveats

Avoid hard-coding dollar math into the product plan unless you have just checked the current model page. OpenAI's Realtime cost guide says conversational Realtime sessions accrue input and output tokens across text, audio, and image modalities, while streaming translation and transcription sessions are billed by audio duration. Prices vary by model and are listed on the model pages.

What you can monitor reliably:

response.done events include usage details such as input_tokens, output_tokens, audio token details, and cached token details.
Input transcription, if enabled, is billed separately from the speech-to-speech response model and has usage details on the conversation.item.input_audio_transcription.completed event.
Prompt caching can reduce repeated input cost in multi-turn sessions, but it is best-effort. Keep instructions, tool definitions, and stable context early in the session history to improve cache reuse.
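
One way to use those events is to keep a running total per session. In this sketch the nesting (`event.response.usage` and `usage.input_token_details.cached_tokens`) is an assumption about the payload shape, so the code reads every field defensively; only `input_tokens` and `output_tokens` are named in the text above.

```javascript
// Starting point for a per-session usage total.
function emptyUsage() {
  return { inputTokens: 0, outputTokens: 0, cachedTokens: 0, responses: 0 };
}

// Fold one server event into the totals. Events other than response.done are ignored.
function addResponseUsage(totals, event) {
  if (event.type !== "response.done") return totals;
  const usage = event.response?.usage ?? {};
  return {
    inputTokens: totals.inputTokens + (usage.input_tokens ?? 0),
    outputTokens: totals.outputTokens + (usage.output_tokens ?? 0),
    // Assumed location of the cached-token count; missing fields count as zero.
    cachedTokens: totals.cachedTokens + (usage.input_token_details?.cached_tokens ?? 0),
    responses: totals.responses + 1,
  };
}
```

Logging the totals when the session closes gives a per-session record you can later join with task completion data.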

Practical cost controls:

cap session length at the product level instead of letting idle tabs run;
use concrete output token limits for verbose assistants;
summarize or trim transcript history deliberately rather than constantly mutating the whole conversation;
track cost per completed task, not just cost per minute, because low latency may increase completion rates enough to justify the premium;
keep a chained STT → text agent → TTS fallback in the comparison set for high-volume or compliance-heavy workloads.

Testing checklist before launch

Use a small live test suite, not only happy-path microphone demos:

Can a new user complete the first WebRTC permission prompt and hear a response?
Does barge-in stop local playback quickly when the user interrupts?
Are response.output_audio.delta, transcript deltas, and response.done handled without duplicating UI text?
Do tool calls require confirmation before account changes, purchases, sends, bookings, or deletes?
Does the session end gracefully at OpenAI's documented maximum duration rather than looking frozen?
Does the backend create client secrets without logging raw API keys, client secrets, cookies, or full user identifiers?
Do staging tests cover noisy audio, silence, accented speech, fast interruptions, slow networks, and expired sessions?
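
The duplicate-text check is easier to test when transcript handling is a pure function. The reducer below is a sketch: the `item_id`, `delta`, and `transcript` fields and the `response.output_audio_transcript.done` event are assumptions about the event payloads, and only the delta event name appears in this article.

```javascript
// State is a Map of item_id -> { text, final }. Each call returns a new Map.
function applyTranscriptEvent(state, event) {
  const isDelta = event.type === "response.output_audio_transcript.delta";
  const isDone = event.type === "response.output_audio_transcript.done";
  if (!isDelta && !isDone) return state;

  const next = new Map(state);
  const current = next.get(event.item_id) ?? { text: "", final: false };
  if (isDelta) {
    // Ignore late deltas for an item that was already finalized, so text is not doubled.
    if (!current.final) {
      next.set(event.item_id, { text: current.text + event.delta, final: false });
    }
  } else {
    // Prefer the full transcript from the done event; fall back to the accumulated deltas.
    next.set(event.item_id, { text: event.transcript ?? current.text, final: true });
  }
  return next;
}
```

Rendering from this map, keyed by item, keeps the UI to one entry per assistant turn even if events are replayed after a reconnect.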

OpenAI's Realtime conversations guide currently states a maximum Realtime session duration of 60 minutes. Treat that as a hard product boundary: design the UI to end or renew sessions intentionally instead of assuming an all-day voice connection.
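
A small timer check can make that boundary explicit in the client. The sketch below uses the 60-minute figure quoted above; the idle limit, the renew lead time, and the returned action names are hypothetical product choices.

```javascript
// Maximum session duration quoted in this article; recheck the docs before relying on it.
const MAX_SESSION_MS = 60 * 60 * 1000;

// Decide what the client should do next. All timestamps are milliseconds.
function sessionAction({ startedAt, lastActivityAt, now }, { idleLimitMs, renewLeadMs }) {
  const age = now - startedAt;
  if (age >= MAX_SESSION_MS) return "end";                     // past the hard limit
  if (now - lastActivityAt >= idleLimitMs) return "end_idle";  // idle tab, stop billing
  if (age >= MAX_SESSION_MS - renewLeadMs) return "renew";     // warn or reconnect early
  return "continue";
}
```

Calling it on an interval, for example every few seconds with `Date.now()`, lets the UI show a renewal prompt instead of appearing frozen when the session ends.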

Common mistakes

Copying old preview snippets. gpt-4o-realtime-preview examples and top-level voice settings are common in older articles. Check the current docs before shipping.
Using WebSocket in the browser by default. It can work with ephemeral credentials, but OpenAI recommends WebRTC for client apps because it is more robust for browser/mobile media.
Letting users pick a voice after the first response. The voice must be set before the model emits its first audio in a session; it cannot be changed afterward.
Treating transcripts as free. Input transcription has its own usage and billing path when enabled.
Building a voice layer without tool guardrails. Spoken confirmation and human review are not nice-to-haves when the agent can touch real account data.
Ignoring OpenAI status and rate-limit signals. Voice products feel broken faster than text products when latency spikes; watch status, rate_limits.updated, and your own first-audio latency.
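
First-audio latency can be measured from the event stream itself. In this sketch, `input_audio_buffer.speech_stopped` is assumed to mark the end of the user's turn (it is not named in this article), and the first `response.output_audio.delta` after it marks the start of playback. The tracker API is hypothetical and takes timestamps as arguments so it can be tested without a live session.

```javascript
// Tracks the gap between end of user speech and the first audio delta of the reply.
function createLatencyTracker() {
  let turnEndedAt = null;
  const samples = [];
  return {
    handle(event, now) {
      if (event.type === "input_audio_buffer.speech_stopped") {
        turnEndedAt = now;
      } else if (event.type === "response.output_audio.delta" && turnEndedAt !== null) {
        samples.push(now - turnEndedAt);
        turnEndedAt = null; // only the first delta of each turn counts
      }
    },
    samples() {
      return [...samples];
    },
    // Nearest-rank 95th percentile; returns null when there is no data yet.
    p95() {
      if (samples.length === 0) return null;
      const sorted = [...samples].sort((a, b) => a - b);
      return sorted[Math.ceil(0.95 * sorted.length) - 1];
    },
  };
}
```

In a live client, `now` would be `performance.now()` or `Date.now()` at the moment each event arrives.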


Source notes

Official sources checked on 2026-06-19:

OpenAI Realtime and audio overview: https://developers.openai.com/api/docs/guides/realtime
OpenAI Voice agents guide: https://developers.openai.com/api/docs/guides/voice-agents
OpenAI Realtime API with WebRTC: https://developers.openai.com/api/docs/guides/realtime-webrtc
OpenAI Realtime API with WebSocket: https://developers.openai.com/api/docs/guides/realtime-websocket
OpenAI Realtime conversations guide: https://developers.openai.com/api/docs/guides/realtime-conversations
OpenAI Realtime managing costs guide: https://developers.openai.com/api/docs/guides/realtime-costs
OpenAI status page: https://status.openai.com/

OpenAI's standard HTML docs and public pricing page may challenge automated clients, so this refresh used OpenAI's official developer Markdown docs where available and avoided unverified exact price claims. Recheck the model-specific pricing page before publishing a sales calculator or SLA promise.
