

OpenAI Realtime API Voice Apps: WebRTC Guide (2026)

Build OpenAI Realtime API voice apps with WebRTC sessions, current voices, WebSocket tradeoffs, safety identifiers, pricing caveats, and production checks.

APIScout Team

TL;DR

Use the OpenAI Realtime API when the product needs a live spoken conversation, not just a recorded audio request. For 2026 voice assistants, the practical starting point is gpt-realtime-2 with a browser WebRTC session or the Agents SDK RealtimeAgent / RealtimeSession helpers. Use WebSocket from a trusted server for phone, contact-center, or backend media pipelines. Keep API keys server-side, create ephemeral client secrets for browser/mobile clients, set OpenAI-Safety-Identifier from your backend, and confirm current model, voice, and pricing details before launch because Realtime API surfaces change faster than standard text APIs.

OpenAI's current Realtime voice options are alloy, ash, ballad, coral, echo, sage, shimmer, verse, marin, and cedar; OpenAI recommends marin or cedar for best quality in the Realtime conversations documentation. Voice selection matters because the voice cannot be changed after the model has emitted audio in a session.

Key takeaways

  • Best fit: low-latency speech-to-speech agents, live support copilots, language practice, interruptible voice search, and workflows where users expect natural turn taking.
  • Not always best: file transcription, voice notes, batch summarization, or approval-heavy workflows where a chained speech-to-text → text agent → text-to-speech pipeline is easier to inspect and control.
  • Model path: OpenAI's Realtime overview points low-latency voice agents at gpt-realtime-2; older gpt-4o-realtime-preview names should be treated as migration/deprecation context, not a new-build default.
  • Transport: WebRTC is the default recommendation for browser or mobile clients; WebSocket is the server-to-server path.
  • Security: standard OpenAI API keys belong only on your backend. Browser/mobile sessions should use the unified WebRTC interface or short-lived client secrets created by your server.
  • Cost control: Realtime billing is token/session based for conversational sessions and varies by model. Read usage from response.done, account for separate input transcription costs, and use prompt caching-friendly session design.

Quick decision table

| If your application needs... | Start with | Why |
|---|---|---|
| A browser voice assistant with barge-in and natural turn taking | Agents SDK RealtimeAgent + RealtimeSession over WebRTC | OpenAI's voice-agent docs put browser speech-to-speech here first. |
| A custom browser/mobile media client | Realtime WebRTC guide | WebRTC avoids relaying every audio packet through your app server and is the recommended client transport. |
| A phone, SIP, broadcast, or backend media pipeline | Realtime WebSocket or SIP path | The secret stays on your server and your backend owns audio framing, routing, and logging. |
| A predictable workflow with auditable intermediate text | Chained STT → text agent → TTS | You can inspect transcript text, run policy checks, and swap each vendor independently. |
| File upload transcription or generated speech from text | Request-based audio APIs | A persistent realtime session adds cost and complexity you do not need. |

Current Realtime voice list for 2026

The voice list is the detail most readers look for first, so it comes before the implementation details:

| Voice | Practical note |
|---|---|
| marin | Current OpenAI docs recommend this for best Realtime output quality. Start here for most production assistants. |
| cedar | Also recommended by OpenAI for best Realtime output quality; useful as the second finalist in voice QA. |
| alloy | Neutral fallback that appears in many OpenAI audio examples. |
| ash, ballad, coral, echo, sage, shimmer, verse | Available Realtime output voices; evaluate them against your real call scripts rather than demos. |

Two caveats matter in production:

  1. Voice availability is model-specific. The Realtime conversations guide is the source to check before launch because Text-to-Speech and Realtime voice lists are not identical.
  2. Voice is effectively a session-level product decision. OpenAI documents that most session properties can be updated at any time, but voice cannot be changed after the model has responded with audio once. If enterprise and consumer users need different tones, choose the voice before audio output begins.

OpenAI's voice-agent guidance describes the usual browser flow this way: your server creates an ephemeral client secret, the frontend creates a RealtimeSession, the session connects over WebRTC, and the agent handles audio turns, tools, interruptions, and handoffs in that session.

A production browser architecture usually looks like this:

Browser microphone
  → WebRTC peer connection
  → OpenAI Realtime session
  → remote audio stream back to browser

Browser data channel
  ↔ session events, tool results, UI state, transcripts

Your backend
  → creates client secret / unified WebRTC call
  → attaches safety identifier
  → owns tools, account permissions, logging policy, budgets

Use this shape when latency is the product. Do not route all browser audio through your application server unless you need server-side media processing, compliance recording, telephony integration, or a network environment where WebRTC is not viable.

WebRTC setup choices: unified interface vs client secrets

OpenAI's WebRTC guide describes two browser connection options:

| WebRTC option | How it works | Tradeoff |
|---|---|---|
| Unified interface | Browser posts SDP to your server; your server sends SDP + session config to /v1/realtime/calls with a standard API key and returns OpenAI's SDP answer. | Simpler credential model; your server is in the critical path during session initialization. |
| Ephemeral client secret | Browser asks your server for a client secret; browser then sends SDP to the Realtime API with that short-lived credential. | Keeps the standard API key off the client while letting the browser establish the peer connection directly. |

In both designs, the standard OpenAI API key stays on your backend. If you assign stable end-user identifiers, set OpenAI-Safety-Identifier from your trusted server. OpenAI's docs call out that with ephemeral tokens, the safety identifier should be set on the server-side request that creates the client secret so the identifier is bound to the resulting session.
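As a rough illustration of the unified interface, the sketch below builds the server-side request that forwards a browser's SDP offer to /v1/realtime/calls. The function names, the multipart field names `sdp` and `session`, and the assumption that the safety header is accepted on this endpoint are illustrative and not taken from this article; check them against the current WebRTC guide. The safety identifier is expected to be an already-hashed internal ID.

```javascript
// Sketch only: server-side request for the unified WebRTC interface.
// Field names "sdp" and "session" are assumptions to verify in the docs.
const REALTIME_CALLS_URL = "https://api.openai.com/v1/realtime/calls";

function buildRealtimeCallRequest({ apiKey, offerSdp, session, safetyIdentifier }) {
  if (!apiKey) throw new Error("apiKey is required and must stay on the server");

  // Multipart body: the browser's SDP offer plus the session config as JSON.
  const body = new FormData();
  body.set("sdp", offerSdp);
  body.set("session", JSON.stringify(session));

  const headers = { Authorization: `Bearer ${apiKey}` };
  // Attach the hashed end-user identifier only from the trusted backend.
  if (safetyIdentifier) headers["OpenAI-Safety-Identifier"] = safetyIdentifier;

  return { url: REALTIME_CALLS_URL, init: { method: "POST", headers, body } };
}

// Hypothetical handler helper: send the offer, return OpenAI's SDP answer text.
async function exchangeSdp(params, fetchImpl = fetch) {
  const { url, init } = buildRealtimeCallRequest(params);
  const res = await fetchImpl(url, init);
  if (!res.ok) throw new Error(`Realtime call setup failed: ${res.status}`);
  return res.text();
}
```

Separating request construction from the network call keeps the credential handling easy to unit test without contacting the API.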

A current-style session configuration puts audio settings under session.audio:

const session = {
  type: "realtime",
  model: "gpt-realtime-2",
  output_modalities: ["audio"],
  audio: {
    input: {
      format: { type: "audio/pcm", rate: 24000 },
      turn_detection: { type: "semantic_vad" },
    },
    output: {
      format: { type: "audio/pcm" },
      voice: "marin",
    },
  },
  instructions: "Speak clearly and briefly. Confirm before taking actions.",
};

If you are updating an older tutorial, check for stale fields. The GA-shaped docs use session.type, output_modalities, audio.input, audio.output, and newer response events such as response.output_audio.delta and response.output_audio_transcript.delta. Older snippets that use gpt-4o-realtime-preview, modalities, voice at the top level, or response.audio.delta need review before copy-paste.
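One way to catch the stale fields listed above during a migration is a small lint-style check. This is a hypothetical helper, not part of any OpenAI SDK; it only covers the legacy markers named in this article, and the function and constant names are made up for illustration.

```javascript
// Maps the legacy event name mentioned above to its current replacement.
const RENAMED_EVENTS = {
  "response.audio.delta": "response.output_audio.delta",
};

// Returns a list of human-readable warnings for a session config object.
function findStaleSessionFields(session) {
  const issues = [];
  if (typeof session.model === "string" && session.model.includes("gpt-4o-realtime-preview")) {
    issues.push("model uses a gpt-4o-realtime-preview name");
  }
  if ("modalities" in session) {
    issues.push("top-level modalities found; current shape uses output_modalities");
  }
  if ("voice" in session) {
    issues.push("top-level voice found; current shape uses audio.output.voice");
  }
  if (session.type === undefined) {
    issues.push("session.type is missing");
  }
  return issues;
}

// Returns the current event name if the given one is a known legacy name.
function currentEventName(name) {
  return RENAMED_EVENTS[name] ?? name;
}
```

Running this against session objects copied from older tutorials gives a quick review list before anything is pasted into production code.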

When to choose WebSocket instead

WebSocket is still the right path for server-to-server Realtime integrations. OpenAI's WebSocket guide explicitly positions it for backend systems and recommends WebRTC for browser/mobile clients.

Choose WebSocket when:

  • you are connecting a phone/SIP/contact-center media stream;
  • you need your backend to inspect, transform, or store audio packets before they reach OpenAI;
  • your workflow already runs in a stateful worker and browser WebRTC is not involved;
  • you need tighter control over reconnect behavior, call recording, region routing, or custom queueing.

The minimal server-side connection shape is simple:

import WebSocket from "ws";

const ws = new WebSocket(
  "wss://api.openai.com/v1/realtime?model=gpt-realtime-2",
  {
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "OpenAI-Safety-Identifier": "hashed-internal-user-id",
    },
  },
);

ws.on("open", () => {
  ws.send(JSON.stringify({
    type: "session.update",
    session: {
      type: "realtime",
      model: "gpt-realtime-2",
      output_modalities: ["audio"],
      audio: {
        input: { format: { type: "audio/pcm", rate: 24000 } },
        output: { format: { type: "audio/pcm" }, voice: "marin" },
      },
      instructions: "Be concise and ask before taking irreversible actions.",
    },
  }));
});

With WebSocket you also own more of the audio loop: base64-encoded input audio chunks, playback buffering, interruption handling, transcript storage, reconnect UI, and session cleanup. That extra control is useful for backend media products, but it is unnecessary work for most browser assistants.
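To make the "base64-encoded input audio chunks" part concrete, here is a minimal sketch that splits raw PCM16 audio into client events. The event type `input_audio_buffer.append` and its `audio` field are not named in this article and should be confirmed in the Realtime API reference; the helper name and the 100 ms chunk size are arbitrary choices for illustration.

```javascript
// 24 kHz, 16-bit mono PCM is 48,000 bytes per second, so 4,800 bytes is ~100 ms.
const DEFAULT_CHUNK_BYTES = 4800;

// Splits a Buffer/Uint8Array of PCM16 audio into base64 append events.
function pcm16ToAppendEvents(pcm, chunkBytes = DEFAULT_CHUNK_BYTES) {
  if (chunkBytes <= 0 || chunkBytes % 2 !== 0) {
    throw new Error("chunkBytes must be a positive even number to keep 16-bit samples aligned");
  }
  const events = [];
  for (let offset = 0; offset < pcm.length; offset += chunkBytes) {
    const chunk = pcm.subarray(offset, offset + chunkBytes);
    events.push({
      type: "input_audio_buffer.append", // assumed event name, verify in the API reference
      audio: Buffer.from(chunk).toString("base64"),
    });
  }
  return events;
}
```

Each returned object would then be serialized with JSON.stringify and sent over the open socket, in order.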

Build the product contract before the demo

A Realtime demo can feel impressive in an afternoon. A production voice app needs a stricter product contract:

  1. Allowed actions: which tools may run during a spoken session, which need confirmation, and which are read-only.
  2. Turn-taking rules: when the assistant should interrupt itself, ask a clarifying question, or stop speaking.
  3. Transcript policy: what is stored, for how long, and whether users can delete it.
  4. Fallback path: what the UI does when the session expires, the network drops, or Realtime rate limits change mid-call.
  5. Voice identity disclosure: users should know they are speaking with an AI-generated voice, especially in support, sales, education, healthcare, or regulated workflows.
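The "allowed actions" item in the list above can be expressed as a small policy table that the backend consults before running any tool. The tool names and policy labels here are hypothetical; the point is only that unknown tools are denied by default and write actions wait for an explicit confirmation.

```javascript
// Hypothetical tool policy: every tool the agent may call must be listed here.
const TOOL_POLICY = new Map([
  ["lookup_order", "read_only"],
  ["cancel_order", "confirm"],
  ["send_refund", "confirm"],
]);

// Decides what to do with a tool call requested during a spoken session.
function decideToolCall(toolName, userConfirmed = false) {
  const policy = TOOL_POLICY.get(toolName);
  if (policy === undefined) return "deny"; // not on the allowlist
  if (policy === "read_only") return "run"; // safe to run without confirmation
  return userConfirmed ? "run" : "ask_confirmation";
}
```

For example, `decideToolCall("cancel_order")` returns "ask_confirmation" until the user has confirmed, and any tool missing from the table is refused.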

OpenAI's voice-agent docs also make a strategic point: voice uses the same core agent concepts as text. Tools, handoffs, guardrails, human review, observability, and durable state still belong in your application design; the audio surface only changes the transport and interaction loop.

Pricing and usage caveats

Avoid hard-coding dollar math into the product plan unless you have just checked the current model page. OpenAI's Realtime cost guide says conversational Realtime sessions accrue input and output tokens across text, audio, and image modalities, while streaming translation and transcription sessions are billed by audio duration. Prices vary by model and are listed on the model pages.

What you can monitor reliably:

  • response.done events include usage details such as input_tokens, output_tokens, audio token details, and cached token details.
  • Input transcription, if enabled, is billed separately from the speech-to-speech response model and has usage details on the conversation.item.input_audio_transcription.completed event.
  • Prompt caching can reduce repeated input cost in multi-turn sessions, but it is best-effort. Keep instructions, tool definitions, and stable context early in the session history to improve cache reuse.
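A minimal sketch of reading usage from response.done events follows. The nesting assumed here (`event.response.usage`, with cached tokens under `input_token_details.cached_tokens`) is an assumption about the payload shape rather than something this article states; verify it against a real event before relying on the numbers.

```javascript
// Accumulates token usage across all response.done events in one session.
function createUsageTracker() {
  const totals = { inputTokens: 0, outputTokens: 0, cachedTokens: 0, responses: 0 };
  return {
    totals,
    record(event) {
      if (event.type !== "response.done") return totals;
      const usage = event.response?.usage; // assumed location of usage details
      if (!usage) return totals;
      totals.inputTokens += usage.input_tokens ?? 0;
      totals.outputTokens += usage.output_tokens ?? 0;
      totals.cachedTokens += usage.input_token_details?.cached_tokens ?? 0;
      totals.responses += 1;
      return totals;
    },
  };
}
```

Feeding every server event through `record` and logging `totals` at session end gives a per-session figure that can be joined with task completion data.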

Practical cost controls:

  • cap session length at the product level instead of letting idle tabs run;
  • use concrete output token limits for verbose assistants;
  • summarize or trim transcript history deliberately rather than constantly mutating the whole conversation;
  • track cost per completed task, not just cost per minute, because low latency may increase completion rates enough to justify the premium;
  • keep a chained STT → text agent → TTS fallback in the comparison set for high-volume or compliance-heavy workloads.

Testing checklist before launch

Use a small live test suite, not only happy-path microphone demos:

  • Can a new user complete the first WebRTC permission prompt and hear a response?
  • Does barge-in stop local playback quickly when the user interrupts?
  • Are response.output_audio.delta, transcript deltas, and response.done handled without duplicating UI text?
  • Do tool calls require confirmation before account changes, purchases, sends, bookings, or deletes?
  • Does the session end gracefully at OpenAI's documented maximum duration rather than looking frozen?
  • Does the backend create client secrets without logging raw API keys, client secrets, cookies, or full user identifiers?
  • Do staging tests cover noisy audio, silence, accented speech, fast interruptions, slow networks, and expired sessions?
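For the duplicate-text check in the list above, one approach is to key transcript text by output item so that a final transcript replaces the accumulated deltas instead of being appended to them. The field names `item_id`, `delta`, and `transcript`, and the `response.output_audio_transcript.done` event, are assumptions not confirmed by this article.

```javascript
// Stores assistant transcript text per output item to avoid duplicated UI text.
function createTranscriptStore() {
  const byItem = new Map();
  return {
    handle(event) {
      if (event.type === "response.output_audio_transcript.delta") {
        // Append streaming text to whatever has arrived for this item so far.
        byItem.set(event.item_id, (byItem.get(event.item_id) ?? "") + event.delta);
      } else if (event.type === "response.output_audio_transcript.done") {
        // The final transcript replaces the deltas rather than adding to them.
        byItem.set(event.item_id, event.transcript);
      }
    },
    text(itemId) {
      return byItem.get(itemId) ?? "";
    },
  };
}
```

The UI can then render `text(itemId)` for each item, which stays stable whether the final event arrives or not.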

OpenAI's Realtime conversations guide currently states a maximum Realtime session duration of 60 minutes. Treat that as a hard product boundary: design the UI to end or renew sessions intentionally instead of assuming an all-day voice connection.
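A simple way to treat the 60-minute limit as a product boundary is to classify the session by elapsed time and let the UI react. The five-minute warning window and the phase names below are illustrative choices, not values from OpenAI's documentation.

```javascript
// Documented maximum at the time of writing; recheck before launch.
const MAX_SESSION_MS = 60 * 60 * 1000;

// Returns "active", "renew_soon", or "expired" for a session started at startedAtMs.
function sessionPhase(startedAtMs, nowMs, warnBeforeMs = 5 * 60 * 1000) {
  const elapsed = nowMs - startedAtMs;
  if (elapsed >= MAX_SESSION_MS) return "expired";
  if (elapsed >= MAX_SESSION_MS - warnBeforeMs) return "renew_soon";
  return "active";
}
```

Polling `sessionPhase(startedAt, Date.now())` on a timer lets the client prompt for renewal before the connection ends rather than after it appears frozen.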

Common mistakes

  • Copying old preview snippets. gpt-4o-realtime-preview examples and top-level voice settings are common in older articles. Check the current docs before shipping.
  • Using WebSocket in the browser by default. It can work with ephemeral credentials, but OpenAI recommends WebRTC for client apps because it is more robust for browser/mobile media.
  • Letting users pick a voice after the first response. Select the voice before the model emits any audio; it cannot be changed afterward in that session.
  • Treating transcripts as free. Input transcription has its own usage and billing path when enabled.
  • Building a voice layer without tool guardrails. Spoken confirmation and human review are not nice-to-haves when the agent can touch real account data.
  • Ignoring OpenAI status and rate-limit signals. Voice products feel broken faster than text products when latency spikes; watch status, rate_limits.updated, and your own first-audio latency.
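The last point can be sketched as a small event monitor that records first-audio latency and keeps the latest rate-limit snapshot. Using `input_audio_buffer.speech_stopped` as the "user finished speaking" marker and reading `event.rate_limits` are assumptions about the event payloads; the clock is injectable so the logic can be tested without real audio.

```javascript
// Tracks time from end of user speech to the first audio delta of the reply.
function createLatencyMonitor(now = () => Date.now()) {
  let turnEndedAt = null;
  let lastRateLimits = null;
  const samples = [];
  return {
    samples,
    getRateLimits: () => lastRateLimits,
    handle(event) {
      switch (event.type) {
        case "input_audio_buffer.speech_stopped": // assumed end-of-turn marker
          turnEndedAt = now();
          break;
        case "response.output_audio.delta":
          // Only the first delta after a turn counts toward first-audio latency.
          if (turnEndedAt !== null) {
            samples.push(now() - turnEndedAt);
            turnEndedAt = null;
          }
          break;
        case "rate_limits.updated":
          lastRateLimits = event.rate_limits ?? null;
          break;
        default:
          break;
      }
    },
  };
}
```

Exporting `samples` to your metrics system makes latency regressions visible before users report that the assistant feels broken.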

Source notes

Official sources checked on 2026-05-15:

  • OpenAI Realtime and audio overview: https://developers.openai.com/api/docs/guides/realtime
  • OpenAI Voice agents guide: https://developers.openai.com/api/docs/guides/voice-agents
  • OpenAI Realtime API with WebRTC: https://developers.openai.com/api/docs/guides/realtime-webrtc
  • OpenAI Realtime API with WebSocket: https://developers.openai.com/api/docs/guides/realtime-websocket
  • OpenAI Realtime conversations guide: https://developers.openai.com/api/docs/guides/realtime-conversations
  • OpenAI Realtime managing costs guide: https://developers.openai.com/api/docs/guides/realtime-costs
  • OpenAI status page: https://status.openai.com/

OpenAI's standard HTML docs and public pricing page may challenge automated clients, so this refresh used OpenAI's official developer Markdown docs where available and avoided unverified exact price claims. Recheck the model-specific pricing page before publishing a sales calculator or SLA promise.

