<!-- APIScout AI-readable guide source -->
<!-- Canonical: https://apiscout.dev/guides/openai-realtime-api-building-voice-applications-2026 -->
<!-- Raw Markdown: https://apiscout.dev/guides/openai-realtime-api-building-voice-applications-2026/raw.md -->
<!-- Source path: content/guides/openai-realtime-api-building-voice-applications-2026.mdx -->

---
og_image: "/images/guides/openai-realtime-api-building-voice-applications-2026.webp"
title: "OpenAI Realtime API Voice Apps: WebRTC Guide (2026)"
description: "Build OpenAI Realtime API voice apps with WebRTC sessions, current voices, WebSocket tradeoffs, safety identifiers, pricing caveats, and production checks."
date: "2026-05-15"
author: "APIScout Team"
tags: ["openai", "realtime-api", "voice", "speech", "webrtc", "websocket", "2026"]
tier: 1
---

## TL;DR

Use the OpenAI Realtime API when the product needs a live spoken conversation, not just a recorded audio request. For 2026 voice assistants, the practical starting point is `gpt-realtime-2` with a browser WebRTC session or the Agents SDK `RealtimeAgent` / `RealtimeSession` helpers. Use WebSocket from a trusted server for phone, contact-center, or backend media pipelines. Keep API keys server-side, create ephemeral client secrets for browser/mobile clients, set `OpenAI-Safety-Identifier` from your backend, and confirm current model, voice, and pricing details before launch because Realtime API surfaces change faster than standard text APIs.

OpenAI's current Realtime voice options are `alloy`, `ash`, `ballad`, `coral`, `echo`, `sage`, `shimmer`, `verse`, `marin`, and `cedar`; OpenAI recommends `marin` or `cedar` for best quality in the Realtime conversations documentation. Voice selection matters because the `voice` cannot be changed after the model has emitted audio in a session.

## Key takeaways

- **Best fit:** low-latency speech-to-speech agents, live support copilots, language practice, interruptible voice search, and workflows where users expect natural turn taking.
- **Not always best:** file transcription, voice notes, batch summarization, or approval-heavy workflows where a chained speech-to-text → text agent → text-to-speech pipeline is easier to inspect and control.
- **Model path:** OpenAI's Realtime overview points low-latency voice agents at `gpt-realtime-2`; older `gpt-4o-realtime-preview` names should be treated as migration/deprecation context, not a new-build default.
- **Transport:** WebRTC is the default recommendation for browser or mobile clients; WebSocket is the server-to-server path.
- **Security:** standard OpenAI API keys belong only on your backend. Browser/mobile sessions should use the unified WebRTC interface or short-lived client secrets created by your server.
- **Cost control:** Realtime billing is token/session based for conversational sessions and varies by model. Read usage from `response.done`, account for separate input transcription costs, and design sessions so that prompt caching can reuse stable context.

## Quick decision table

| If your application needs... | Start with | Why |
|---|---|---|
| A browser voice assistant with barge-in and natural turn taking | Agents SDK `RealtimeAgent` + `RealtimeSession` over WebRTC | OpenAI's voice-agent docs put browser speech-to-speech here first. |
| A custom browser/mobile media client | Realtime WebRTC guide | WebRTC avoids relaying every audio packet through your app server and is the recommended client transport. |
| A phone, SIP, broadcast, or backend media pipeline | Realtime WebSocket or SIP path | The secret stays on your server and your backend owns audio framing, routing, and logging. |
| A predictable workflow with auditable intermediate text | Chained STT → text agent → TTS | You can inspect transcript text, run policy checks, and swap each vendor independently. |
| File upload transcription or generated speech from text | Request-based audio APIs | A persistent realtime session adds cost and complexity you do not need. |

## Current Realtime voice list for 2026

Many readers arrive looking for the voice list, so it comes before the implementation details:

| Voice | Practical note |
|---|---|
| `marin` | Current OpenAI docs recommend this for best Realtime output quality. Start here for most production assistants. |
| `cedar` | Also recommended by OpenAI for best Realtime output quality; useful as the second finalist in voice QA. |
| `alloy` | Neutral fallback that appears in many OpenAI audio examples. |
| `ash`, `ballad`, `coral`, `echo`, `sage`, `shimmer`, `verse` | Available Realtime output voices; evaluate them against your real call scripts rather than demos. |

Two caveats matter in production:

1. **Voice availability is model-specific.** The Realtime conversations guide is the source to check before launch because Text-to-Speech and Realtime voice lists are not identical.
2. **Voice is effectively a session-level product decision.** OpenAI documents that most session properties can be updated at any time, but `voice` cannot be changed after the model has responded with audio once. If enterprise and consumer users need different tones, choose the voice before audio output begins.
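Because the voice is fixed once audio has been emitted, it can help to enforce that rule in your own client state rather than discover it from a rejected update. The sketch below is a minimal, hypothetical helper (`createVoiceSelection` is not part of any OpenAI SDK); it only tracks local state and assumes you call `markAudioEmitted()` when the first output audio event arrives.

```typescript
// Voice names taken from the list above; recheck them against the current docs.
type VoiceName =
  | "alloy" | "ash" | "ballad" | "coral" | "echo"
  | "sage" | "shimmer" | "verse" | "marin" | "cedar";

// Hypothetical local guard: refuses voice changes once audio has been emitted.
function createVoiceSelection(initial: VoiceName) {
  let voice: VoiceName = initial;
  let locked = false;
  return {
    // Call this when the first output audio event of the session arrives.
    markAudioEmitted(): void {
      locked = true;
    },
    // Returns true if the change was accepted locally, false if it came too late.
    trySetVoice(next: VoiceName): boolean {
      if (locked) return false;
      voice = next;
      return true;
    },
    current(): VoiceName {
      return voice;
    },
  };
}
```

A settings UI could disable its voice picker whenever `trySetVoice` returns `false` and offer to start a new session instead.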

## Recommended architecture for a browser voice agent

OpenAI's voice-agent guidance describes the usual browser flow this way: your server creates an ephemeral client secret, the frontend creates a `RealtimeSession`, the session connects over WebRTC, and the agent handles audio turns, tools, interruptions, and handoffs in that session.

A production browser architecture usually looks like this:

```text
Browser microphone
  → WebRTC peer connection
  → OpenAI Realtime session
  → remote audio stream back to browser

Browser data channel
  ↔ session events, tool results, UI state, transcripts

Your backend
  → creates client secret / unified WebRTC call
  → attaches safety identifier
  → owns tools, account permissions, logging policy, budgets
```

Use this shape when latency is the product. Do not route all browser audio through your application server unless you need server-side media processing, compliance recording, telephony integration, or a network environment where WebRTC is not viable.

## WebRTC setup choices: unified interface vs client secrets

OpenAI's WebRTC guide describes two browser connection options:

| WebRTC option | How it works | Tradeoff |
|---|---|---|
| Unified interface | Browser posts SDP to your server; your server sends SDP + session config to `/v1/realtime/calls` with a standard API key and returns OpenAI's SDP answer. | Simpler credential model; your server is in the critical path during session initialization. |
| Ephemeral client secret | Browser asks your server for a client secret; browser then sends SDP to the Realtime API with that short-lived credential. | Keeps the standard API key off the client while letting the browser establish the peer connection directly. |

In both designs, the standard OpenAI API key stays on your backend. If you assign stable end-user identifiers, set `OpenAI-Safety-Identifier` from your trusted server. OpenAI's docs call out that with ephemeral tokens, the safety identifier should be set on the server-side request that creates the client secret so the identifier is bound to the resulting session.
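As a rough illustration of the ephemeral-secret flow, the sketch below builds the server-side request that would create a client secret with the safety identifier attached. The endpoint path `/v1/realtime/client_secrets` and the `{ session }` body shape are assumptions based on the GA-era docs, and `buildClientSecretRequest` is a hypothetical helper, so confirm both against the current WebRTC guide before relying on them.

```typescript
// Assumed endpoint; verify against the current Realtime WebRTC guide.
const CLIENT_SECRETS_URL = "https://api.openai.com/v1/realtime/client_secrets";

interface ClientSecretRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Hypothetical helper: runs on your backend only, never in the browser.
function buildClientSecretRequest(
  apiKey: string,
  safetyIdentifier: string,
  session: Record<string, unknown>,
): ClientSecretRequest {
  if (apiKey.length === 0) {
    throw new Error("The standard API key must come from the server environment");
  }
  return {
    url: CLIENT_SECRETS_URL,
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
      // Set here so the identifier is tied to the session the secret creates.
      "OpenAI-Safety-Identifier": safetyIdentifier,
    },
    // Assumed body shape: the session config nested under `session`.
    body: JSON.stringify({ session }),
  };
}
```

The returned object can be passed to `fetch(req.url, req)` from a server route; only the short-lived secret in the response should be forwarded to the browser.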

A current-style session configuration puts audio settings under `session.audio`:

```ts
const session = {
  type: "realtime",
  model: "gpt-realtime-2",
  output_modalities: ["audio"],
  audio: {
    input: {
      format: { type: "audio/pcm", rate: 24000 },
      turn_detection: { type: "semantic_vad" },
    },
    output: {
      format: { type: "audio/pcm" },
      voice: "marin",
    },
  },
  instructions: "Speak clearly and briefly. Confirm before taking actions.",
};
```

If you are updating an older tutorial, check for stale fields. The GA-shaped docs use `session.type`, `output_modalities`, `audio.input`, `audio.output`, and newer response events such as `response.output_audio.delta` and `response.output_audio_transcript.delta`. Older snippets that use `gpt-4o-realtime-preview`, `modalities`, `voice` at the top level, or `response.audio.delta` need review before copy-paste.
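When auditing older snippets, a small check can flag the stale fields named above before they reach production. This is a sketch only: `findLegacyFields` and `renameLegacyEvent` are hypothetical helpers, and the rules cover just the fields and the one event rename mentioned in this guide, not a complete migration map.

```typescript
interface LegacyFinding {
  path: string;
  hint: string;
}

// Flags session fields that this guide identifies as preview-era shapes.
function findLegacyFields(session: Record<string, unknown>): LegacyFinding[] {
  const findings: LegacyFinding[] = [];
  if ("modalities" in session) {
    findings.push({ path: "modalities", hint: "use output_modalities" });
  }
  if ("voice" in session) {
    findings.push({ path: "voice", hint: "move to audio.output.voice" });
  }
  const model = session["model"];
  if (typeof model === "string" && model.startsWith("gpt-4o-realtime-preview")) {
    findings.push({ path: "model", hint: "review against the current model list" });
  }
  if (!("type" in session)) {
    findings.push({ path: "type", hint: "GA-shaped configs set session.type" });
  }
  return findings;
}

// Only the rename named in this guide; extend after checking the docs.
const LEGACY_EVENT_NAMES = new Map<string, string>([
  ["response.audio.delta", "response.output_audio.delta"],
]);

function renameLegacyEvent(name: string): string {
  return LEGACY_EVENT_NAMES.get(name) ?? name;
}
```

Running `findLegacyFields` in a unit test over every session config in the codebase is a cheap way to catch copy-pasted preview snippets.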

## When to choose WebSocket instead

WebSocket is still the right path for server-to-server Realtime integrations. OpenAI's WebSocket guide explicitly positions it for backend systems and recommends WebRTC for browser/mobile clients.

Choose WebSocket when:

- you are connecting a phone/SIP/contact-center media stream;
- you need your backend to inspect, transform, or store audio packets before they reach OpenAI;
- your workflow already runs in a stateful worker and browser WebRTC is not involved;
- you need tighter control over reconnect behavior, call recording, region routing, or custom queueing.

The minimal server-side connection shape is simple:

```ts
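import WebSocket from "ws";

// Server-side only: the standard API key is read from the environment
// and is never sent to a browser or mobile client.
const ws = new WebSocket(
  "wss://api.openai.com/v1/realtime?model=gpt-realtime-2",
  {
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      // Placeholder value: send a stable, hashed identifier for the end user.
      "OpenAI-Safety-Identifier": "hashed-internal-user-id",
    },
  },
);

// Configure the session as soon as the socket opens.
ws.on("open", () => {
  ws.send(JSON.stringify({
    type: "session.update",
    session: {
      type: "realtime",
      model: "gpt-realtime-2",
      output_modalities: ["audio"],
      audio: {
        input: { format: { type: "audio/pcm", rate: 24000 } },
        // Choose the voice up front; it cannot change after audio is emitted.
        output: { format: { type: "audio/pcm" }, voice: "marin" },
      },
      instructions: "Be concise and ask before taking irreversible actions.",
    },
  }));
});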
```

With WebSocket you also own more of the audio loop: base64-encoded input audio chunks, playback buffering, interruption handling, transcript storage, reconnect UI, and session cleanup. That extra control is useful for backend media products, but it is unnecessary work for most browser assistants.
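To make the "base64-encoded input audio chunks" part concrete, here is a minimal sketch that converts float samples to 16-bit little-endian PCM and wraps them in an `input_audio_buffer.append` event. The helper names are hypothetical, the event shape (`type` plus a base64 `audio` field) should be checked against the current WebSocket guide, and the samples are assumed to already be mono at the configured rate.

```typescript
const B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Dependency-free base64 encoder so the sketch runs in any JS runtime.
function toBase64(bytes: Uint8Array): string {
  let out = "";
  for (let i = 0; i < bytes.length; i += 3) {
    const b0 = bytes[i] ?? 0;
    const b1 = bytes[i + 1] ?? 0;
    const b2 = bytes[i + 2] ?? 0;
    const n = (b0 << 16) | (b1 << 8) | b2;
    out += B64.charAt((n >> 18) & 63) + B64.charAt((n >> 12) & 63);
    out += i + 1 < bytes.length ? B64.charAt((n >> 6) & 63) : "=";
    out += i + 2 < bytes.length ? B64.charAt(n & 63) : "=";
  }
  return out;
}

// Converts samples in [-1, 1] to 16-bit little-endian PCM bytes.
function floatToPcm16(samples: Float32Array): Uint8Array {
  const out = new Uint8Array(samples.length * 2);
  const view = new DataView(out.buffer);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i] ?? 0));
    view.setInt16(i * 2, Math.round(s < 0 ? s * 0x8000 : s * 0x7fff), true);
  }
  return out;
}

// Assumed event shape for appending audio to the input buffer.
function buildAppendEvent(samples: Float32Array): { type: string; audio: string } {
  return { type: "input_audio_buffer.append", audio: toBase64(floatToPcm16(samples)) };
}
```

On a Node backend, `Buffer.from(bytes).toString("base64")` does the same job as `toBase64`; the manual encoder is here only to keep the example self-contained.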

## Build the product contract before the demo

A Realtime demo can feel impressive in an afternoon. A production voice app needs a stricter product contract:

1. **Allowed actions:** which tools may run during a spoken session, which need confirmation, and which are read-only.
2. **Turn-taking rules:** when the assistant should interrupt itself, ask a clarifying question, or stop speaking.
3. **Transcript policy:** what is stored, for how long, and whether users can delete it.
4. **Fallback path:** what the UI does when the session expires, the network drops, or the session runs into Realtime rate limits mid-call.
5. **Voice identity disclosure:** users should know they are speaking with an AI-generated voice, especially in support, sales, education, healthcare, or regulated workflows.
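The "allowed actions" item in the contract above can be written down as data rather than left in a design doc. The following is a minimal sketch with made-up tool names (`lookup_order`, `cancel_order`, `delete_account`); it is one possible policy shape, not something prescribed by OpenAI's SDKs.

```typescript
type ToolPolicy = "read_only" | "needs_confirmation" | "blocked";
type ToolDecision = "run" | "ask_user" | "refuse";

// Hypothetical tool names, used only for illustration.
const TOOL_POLICIES = new Map<string, ToolPolicy>([
  ["lookup_order", "read_only"],
  ["cancel_order", "needs_confirmation"],
  ["delete_account", "blocked"],
]);

// Decides what to do with a tool call requested during a spoken session.
function decideToolCall(
  toolName: string,
  userConfirmed: boolean,
  policies: Map<string, ToolPolicy> = TOOL_POLICIES,
): ToolDecision {
  // Tools missing from the table are treated as blocked.
  const policy = policies.get(toolName) ?? "blocked";
  if (policy === "read_only") return "run";
  if (policy === "needs_confirmation") return userConfirmed ? "run" : "ask_user";
  return "refuse";
}
```

Defaulting unknown tools to `refuse` means a newly added tool stays inert in voice sessions until someone classifies it.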

OpenAI's voice-agent docs also make a strategic point: voice uses the same core agent concepts as text. Tools, handoffs, guardrails, human review, observability, and durable state still belong in your application design; the audio surface only changes the transport and interaction loop.

## Pricing and usage caveats

Avoid hard-coding dollar math into the product plan unless you have just checked the current model page. OpenAI's Realtime cost guide says conversational Realtime sessions accrue input and output tokens across text, audio, and image modalities, while streaming translation and transcription sessions are billed by audio duration. Prices vary by model and are listed on the model pages.

What you can monitor reliably:

- `response.done` events include usage details such as `input_tokens`, `output_tokens`, audio token details, and cached token details.
- Input transcription, if enabled, is billed separately from the speech-to-speech response model and has usage details on the `conversation.item.input_audio_transcription.completed` event.
- Prompt caching can reduce repeated input cost in multi-turn sessions, but it is best-effort. Keep instructions, tool definitions, and stable context early in the session history to improve cache reuse.
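A running total of usage per session makes these signals actionable. The sketch below accumulates token counts from `response.done` events; the nested field names (`input_token_details.cached_tokens` and similar) are an assumption about the usage payload, so every field is treated as optional and should be checked against a real event from your account.

```typescript
// Assumed usage shape; all fields optional so missing details do not throw.
interface RealtimeUsage {
  input_tokens?: number;
  output_tokens?: number;
  input_token_details?: { cached_tokens?: number; audio_tokens?: number };
  output_token_details?: { audio_tokens?: number };
}

interface ResponseDoneEvent {
  type: "response.done";
  response: { usage?: RealtimeUsage };
}

interface UsageTotals {
  responses: number;
  inputTokens: number;
  outputTokens: number;
  cachedInputTokens: number;
  inputAudioTokens: number;
  outputAudioTokens: number;
}

function emptyTotals(): UsageTotals {
  return {
    responses: 0, inputTokens: 0, outputTokens: 0,
    cachedInputTokens: 0, inputAudioTokens: 0, outputAudioTokens: 0,
  };
}

// Returns a new totals object with one response.done event folded in.
function addUsage(totals: UsageTotals, event: ResponseDoneEvent): UsageTotals {
  const usage = event.response.usage ?? {};
  return {
    responses: totals.responses + 1,
    inputTokens: totals.inputTokens + (usage.input_tokens ?? 0),
    outputTokens: totals.outputTokens + (usage.output_tokens ?? 0),
    cachedInputTokens:
      totals.cachedInputTokens + (usage.input_token_details?.cached_tokens ?? 0),
    inputAudioTokens:
      totals.inputAudioTokens + (usage.input_token_details?.audio_tokens ?? 0),
    outputAudioTokens:
      totals.outputAudioTokens + (usage.output_token_details?.audio_tokens ?? 0),
  };
}
```

Comparing `cachedInputTokens` with `inputTokens` over a session gives a rough view of how much prompt caching is actually helping.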

Practical cost controls:

- cap session length at the product level instead of letting idle tabs run;
- use concrete output token limits for verbose assistants;
- summarize or trim transcript history deliberately rather than constantly mutating the whole conversation;
- track cost per completed task, not just cost per minute, because low latency may increase completion rates enough to justify the premium;
- keep a chained STT → text agent → TTS fallback in the comparison set for high-volume or compliance-heavy workloads.
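The "cost per completed task" metric from the list above is simple arithmetic once each session has a cost attached. This sketch assumes you already compute a per-session cost in cents from your own usage logs and current prices; `SessionRecord` and `summarizeCosts` are hypothetical names, and no real prices are embedded.

```typescript
// Hypothetical record your analytics pipeline would produce per session.
interface SessionRecord {
  costCents: number;      // computed from usage and current model prices
  minutes: number;        // connected session length
  taskCompleted: boolean; // did the user finish what they came to do?
}

interface CostSummary {
  centsPerMinute: number | null;
  centsPerCompletedTask: number | null;
}

// Spend on abandoned sessions still counts toward the per-task figure.
function summarizeCosts(sessions: SessionRecord[]): CostSummary {
  let totalCents = 0;
  let totalMinutes = 0;
  let completed = 0;
  for (const s of sessions) {
    totalCents += s.costCents;
    totalMinutes += s.minutes;
    if (s.taskCompleted) completed += 1;
  }
  return {
    centsPerMinute: totalMinutes > 0 ? totalCents / totalMinutes : null,
    centsPerCompletedTask: completed > 0 ? totalCents / completed : null,
  };
}
```

Running the same summary over a chained STT → text agent → TTS pilot gives a like-for-like comparison for the fallback decision.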

## Testing checklist before launch

Use a small live test suite, not only happy-path microphone demos:

- Can a new user complete the first WebRTC permission prompt and hear a response?
- Does barge-in stop local playback quickly when the user interrupts?
- Are `response.output_audio.delta`, transcript deltas, and `response.done` handled without duplicating UI text?
- Do tool calls require confirmation before account changes, purchases, sends, bookings, or deletes?
- Does the session end gracefully at OpenAI's documented maximum duration rather than looking frozen?
- Does the backend create client secrets without logging raw API keys, client secrets, cookies, or full user identifiers?
- Do staging tests cover noisy audio, silence, accented speech, fast interruptions, slow networks, and expired sessions?
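The duplicate-text check in the list above usually comes down to how transcript events are merged into UI state. Here is a minimal reducer sketch: deltas append, and the final event replaces the accumulated text instead of appending to it. The `.done` event name and its `transcript` field are assumptions alongside the `.delta` event named in this guide, so confirm them against the current event reference.

```typescript
// Assumed event shapes, keyed by the output item they belong to.
type TranscriptEvent =
  | { type: "response.output_audio_transcript.delta"; item_id: string; delta: string }
  | { type: "response.output_audio_transcript.done"; item_id: string; transcript: string };

// Pure reducer: returns a new map of item_id -> transcript text.
function applyTranscriptEvent(
  state: Map<string, string>,
  event: TranscriptEvent,
): Map<string, string> {
  const next = new Map(state);
  if (event.type === "response.output_audio_transcript.delta") {
    next.set(event.item_id, (next.get(event.item_id) ?? "") + event.delta);
  } else {
    // The final event carries the whole transcript, so replace, never append.
    next.set(event.item_id, event.transcript);
  }
  return next;
}
```

Keying by `item_id` keeps one UI row per assistant turn even when deltas for different items interleave.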

OpenAI's Realtime conversations guide currently states a maximum Realtime session duration of 60 minutes. Treat that as a hard product boundary: design the UI to end or renew sessions intentionally instead of assuming an all-day voice connection.
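One way to treat the limit as a product boundary is to decide, on a timer, whether the session should continue, renew, or end for idleness. The sketch below uses the 60-minute figure stated above; the idle limit and renewal lead time are illustrative defaults, and `nextSessionAction` is a hypothetical helper.

```typescript
// From the documented maximum cited above; recheck before launch.
const MAX_SESSION_MS = 60 * 60 * 1000;

type SessionAction = "continue" | "renew" | "end_idle";

interface SessionLimits {
  idleLimitMs: number; // end the session after this much silence
  renewLeadMs: number; // start renewal this long before the hard limit
}

// Illustrative defaults: 5 minutes idle, renew 2 minutes before the cap.
const DEFAULT_LIMITS: SessionLimits = {
  idleLimitMs: 5 * 60 * 1000,
  renewLeadMs: 2 * 60 * 1000,
};

// All times are epoch milliseconds supplied by the caller, which keeps it testable.
function nextSessionAction(
  nowMs: number,
  startedAtMs: number,
  lastActivityMs: number,
  limits: SessionLimits = DEFAULT_LIMITS,
): SessionAction {
  if (nowMs - lastActivityMs >= limits.idleLimitMs) return "end_idle";
  if (nowMs - startedAtMs >= MAX_SESSION_MS - limits.renewLeadMs) return "renew";
  return "continue";
}
```

Checking idleness first means an abandoned tab is closed rather than renewed into another full session.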

## Common mistakes

- **Copying old preview snippets.** `gpt-4o-realtime-preview` examples and top-level `voice` settings are common in older articles. Check the current docs before shipping.
- **Using WebSocket in the browser by default.** It can work with ephemeral credentials, but OpenAI recommends WebRTC for client apps because it is more robust for browser/mobile media.
- **Letting users pick a voice after the first response.** The voice must be chosen before the model emits its first audio in a session; it cannot be changed after that.
- **Treating transcripts as free.** Input transcription has its own usage and billing path when enabled.
- **Building a voice layer without tool guardrails.** Spoken confirmation and human review are not nice-to-haves when the agent can touch real account data.
- **Ignoring OpenAI status and rate-limit signals.** Voice products feel broken faster than text products when latency spikes; watch status, `rate_limits.updated`, and your own first-audio latency.
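For the rate-limit signal, a small check over `rate_limits.updated` events can surface low headroom before calls start failing. The payload shape used here (a `rate_limits` array of entries with `name`, `limit`, `remaining`, and `reset_seconds`) is an assumption to verify against the current event reference, and the 10% threshold is arbitrary.

```typescript
// Assumed entry shape for the rate_limits.updated event.
interface RateLimitEntry {
  name: string;
  limit: number;
  remaining: number;
  reset_seconds: number;
}

interface RateLimitsUpdatedEvent {
  type: "rate_limits.updated";
  rate_limits: RateLimitEntry[];
}

// Returns the limits whose remaining share has dropped below minFraction.
function findLowHeadroom(
  event: RateLimitsUpdatedEvent,
  minFraction: number = 0.1,
): RateLimitEntry[] {
  return event.rate_limits.filter(
    (entry) => entry.limit > 0 && entry.remaining / entry.limit < minFraction,
  );
}
```

A non-empty result is a reasonable trigger for shedding optional work, such as pausing background tool calls, before the user hears a stall.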

## Related APIScout guides

- [OpenAI Realtime API WebRTC Setup Guide 2026](/guides/openai-realtime-api-webrtc-setup-guide-2026) — narrower setup detail for browser WebRTC sessions.
- [Realtime voice AI APIs comparison](/guides/realtime-voice-ai-apis-comparison-2026) — start here when the decision is provider selection, not an OpenAI implementation.
- [Gemini Live API vs OpenAI Realtime vs Deepgram Voice Agent Guide](/guides/gemini-live-api-vs-openai-realtime-vs-deepgram-voice-agent-2026) — compare OpenAI against adjacent realtime voice providers.
- [ElevenLabs vs Cartesia vs Deepgram text-to-speech APIs](/guides/elevenlabs-vs-cartesia-vs-deepgram-tts-apis-2026) — use when a request/streaming TTS layer is enough.
- [Speech-to-text APIs comparison](/guides/speech-to-text-api-comparison-2026) — use when transcription quality matters more than live speech-to-speech.
- [Building Real-Time APIs: WebSockets vs SSE 2026](/guides/building-real-time-apis-websockets-vs-sse-2026) — transport tradeoffs for non-voice realtime products.

## Source notes

Official sources checked on 2026-05-15:

- OpenAI Realtime and audio overview: `https://developers.openai.com/api/docs/guides/realtime`
- OpenAI Voice agents guide: `https://developers.openai.com/api/docs/guides/voice-agents`
- OpenAI Realtime API with WebRTC: `https://developers.openai.com/api/docs/guides/realtime-webrtc`
- OpenAI Realtime API with WebSocket: `https://developers.openai.com/api/docs/guides/realtime-websocket`
- OpenAI Realtime conversations guide: `https://developers.openai.com/api/docs/guides/realtime-conversations`
- OpenAI Realtime managing costs guide: `https://developers.openai.com/api/docs/guides/realtime-costs`
- OpenAI status page: `https://status.openai.com/`

OpenAI's standard HTML docs and public pricing page may challenge automated clients, so this refresh used OpenAI's official developer Markdown docs where available and avoided unverified exact price claims. Recheck the model-specific pricing page before publishing a sales calculator or SLA promise.

---

*Find and compare voice, speech, and realtime APIs at [APIScout](https://apiscout.dev).*