
OpenAI Realtime API: Building Voice Applications 2026

APIScout Team

openai · realtime-api · voice · speech · websocket · gpt-4o · 2026

TL;DR

OpenAI's Realtime API lets you build voice applications where GPT-4o listens, thinks, and speaks — with sub-200ms latency. Unlike the older STT → LLM → TTS pipeline (which had 2-4 second lag), the Realtime API is end-to-end: raw audio in, GPT-4o processes it natively, audio out. It supports function calling mid-conversation, interruption handling, and multiple voice personas. As of 2026, this is the fastest way to build AI voice assistants. Here's everything you need to ship one.

Key Takeaways

  • Latency: ~200ms end-to-end vs 2-4 seconds for STT+LLM+TTS pipeline
  • Protocol: WebSocket (server-to-server) or WebRTC (browser-direct)
  • Modalities: audio+text simultaneously — transcripts included with audio
  • Function calling: works mid-conversation, model pauses and resumes speech
  • Cost: ~$0.10/min audio input + ~$0.20/min audio output — roughly 5-10x a text pipeline
  • Voices: alloy, ash, ballad, coral, echo, sage, shimmer, verse (8 options)

Two Connection Modes

Mode 1: WebSocket (Server-to-Server)
  Browser → Your Server → OpenAI Realtime API
  Your server relays audio streams
  Full control, works with any backend

Mode 2: WebRTC (Browser-Direct)
  Browser → OpenAI Realtime API directly
  Ephemeral tokens (short-lived — expire about a minute after issuance)
  Lower latency, less server infrastructure

Most production apps use WebSocket mode so the API key stays on the server. WebRTC mode is for demos and simple use cases.
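Whichever transport you pick, both sides exchange the same JSON events over the connection. A minimal sketch of the envelope shapes used throughout this post (illustrative, not an exhaustive schema):

```typescript
// Client → server events carry a `type` discriminator plus a payload;
// server → client events do the same. Shapes below mirror the events
// used later in this post (not a complete schema):
type ClientEvent =
  | { type: 'session.update'; session: Record<string, unknown> }
  | { type: 'input_audio_buffer.append'; audio: string }  // base64 PCM16
  | { type: 'response.create' };

type ServerEvent = {
  type: string;              // e.g. 'response.audio.delta'
  [key: string]: unknown;    // event-specific fields
};

// Every frame is a single JSON object; a tolerant parser:
function parseServerEvent(raw: string): ServerEvent | null {
  try {
    const data = JSON.parse(raw);
    return data && typeof data.type === 'string' ? (data as ServerEvent) : null;
  } catch {
    return null;
  }
}
```

The `type` field is the only universal key, which is why the relay server below can forward frames blindly without parsing them.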


WebSocket: Server-Side Setup

// server/realtime.ts — WebSocket relay server:
import WebSocket, { WebSocketServer } from 'ws';
import type { IncomingMessage } from 'http';

const OPENAI_API_KEY = process.env.OPENAI_API_KEY!;
const OPENAI_REALTIME_URL = 'wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview';

export function createRealtimeRelay(wss: WebSocketServer) {
  wss.on('connection', (clientWs: WebSocket, req: IncomingMessage) => {
    console.log('Client connected');

    // Connect to OpenAI Realtime API:
    const openaiWs = new WebSocket(OPENAI_REALTIME_URL, {
      headers: {
        Authorization: `Bearer ${OPENAI_API_KEY}`,
        'OpenAI-Beta': 'realtime=v1',
      },
    });

    // Forward client → OpenAI:
    clientWs.on('message', (message: Buffer) => {
      if (openaiWs.readyState === WebSocket.OPEN) {
        openaiWs.send(message);
      }
    });

    // Forward OpenAI → client:
    openaiWs.on('message', (message: Buffer) => {
      if (clientWs.readyState === WebSocket.OPEN) {
        clientWs.send(message);
      }
    });

    // Handle disconnections:
    clientWs.on('close', () => openaiWs.close());
    openaiWs.on('close', () => clientWs.close());

    openaiWs.on('open', () => {
      console.log('Connected to OpenAI Realtime');
    });

    openaiWs.on('error', (err) => {
      console.error('OpenAI WebSocket error:', err);
      clientWs.close();
    });
  });
}
// server.ts — Next.js route handlers can't hold a WebSocket upgrade,
// so run a small custom server alongside the app:
import { createServer } from 'http';
import { WebSocketServer } from 'ws';
import { createRealtimeRelay } from '@/server/realtime';

const server = createServer();
const wss = new WebSocketServer({ server, path: '/api/realtime' });
createRealtimeRelay(wss);
server.listen(3001);

Session Configuration

After connecting, send a session.update event to configure the session:

// Send immediately after WebSocket opens:
const sessionConfig = {
  type: 'session.update',
  session: {
    modalities: ['audio', 'text'],  // Get both audio + transcript
    instructions: `You are a helpful voice assistant.
      Keep responses concise and conversational.
      Do not use markdown in your responses.`,
    voice: 'alloy',        // alloy, ash, ballad, coral, echo, sage, shimmer, verse
    input_audio_format: 'pcm16',   // 24kHz, 16-bit, mono PCM
    output_audio_format: 'pcm16',
    input_audio_transcription: {
      model: 'whisper-1',          // Get text transcript of user speech
    },
    turn_detection: {
      type: 'server_vad',          // Server-side Voice Activity Detection
      threshold: 0.5,              // Sensitivity (0-1)
      prefix_padding_ms: 300,      // Audio before speech detected
      silence_duration_ms: 500,    // Silence before model responds
    },
    tools: [
      {
        type: 'function',
        name: 'get_weather',
        description: 'Get weather for a location',
        parameters: {
          type: 'object',
          properties: {
            location: { type: 'string', description: 'City name' },
          },
          required: ['location'],
        },
      },
    ],
    tool_choice: 'auto',
    temperature: 0.8,
    max_response_output_tokens: 'inf',  // or a number
  },
};

ws.send(JSON.stringify(sessionConfig));
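The server acknowledges a successful update with a `session.updated` event, or an `error` event if the config is rejected. A small sketch of an ack handler — field access follows those event shapes, and the logging is illustrative:

```typescript
// Classify the server's reply to a session.update. JSON.parse returns
// `any`, so the optional chaining below guards at runtime:
function handleConfigAck(raw: string): 'updated' | 'error' | 'other' {
  const event = JSON.parse(raw);
  if (event.type === 'session.updated') {
    console.log('Session configured, voice:', event.session?.voice);
    return 'updated';
  }
  if (event.type === 'error') {
    console.error('Config rejected:', event.error?.message);
    return 'error';
  }
  return 'other';  // unrelated event — ignore
}
```

Wire this into the relay's `openaiWs.on('message', …)` handler (or the browser's `ws.onmessage`) so a rejected config fails loudly instead of silently leaving the session on defaults.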

Browser: Capturing and Streaming Audio

// client/VoiceChat.tsx
'use client';
import { useEffect, useRef, useState } from 'react';

export function VoiceChat() {
  const wsRef = useRef<WebSocket | null>(null);
  const audioContextRef = useRef<AudioContext | null>(null);
  const processorRef = useRef<ScriptProcessorNode | null>(null);
  const [isConnected, setIsConnected] = useState(false);
  const [transcript, setTranscript] = useState('');

  const connect = async () => {
    // Get microphone access:
    const stream = await navigator.mediaDevices.getUserMedia({ audio: true });

    // Connect to relay server:
    const ws = new WebSocket('ws://localhost:3001/api/realtime');
    wsRef.current = ws;

    ws.onopen = () => {
      setIsConnected(true);

      // Set up audio capture (24kHz PCM16):
      const audioContext = new AudioContext({ sampleRate: 24000 });
      audioContextRef.current = audioContext;
      const source = audioContext.createMediaStreamSource(stream);

      // ScriptProcessor to capture raw PCM (deprecated in favor of AudioWorklet, but simple and widely supported):
      const processor = audioContext.createScriptProcessor(4096, 1, 1);
      processorRef.current = processor;

      processor.onaudioprocess = (e) => {
        if (ws.readyState !== WebSocket.OPEN) return;

        const inputData = e.inputBuffer.getChannelData(0);
        // Convert float32 → int16 PCM:
        const pcm16 = float32ToInt16(inputData);
        // Base64 encode and send:
        const base64Audio = btoa(
          String.fromCharCode(...new Uint8Array(pcm16.buffer))
        );

        ws.send(JSON.stringify({
          type: 'input_audio_buffer.append',
          audio: base64Audio,
        }));
      };

      source.connect(processor);
      processor.connect(audioContext.destination);
    };

    // Handle incoming events:
    ws.onmessage = (event) => {
      const data = JSON.parse(event.data);
      handleServerEvent(data);
    };
  };

  const handleServerEvent = (event: Record<string, unknown>) => {
    switch (event.type) {
      case 'conversation.item.input_audio_transcription.completed':
        // User speech transcribed:
        setTranscript(`You: ${event.transcript}`);
        break;

      case 'response.audio.delta':
        // Incremental audio chunk from model — play it:
        playAudioChunk(event.delta as string);
        break;

      case 'response.audio_transcript.delta':
        // Incremental text transcript of model speech:
        setTranscript((prev) => prev + (event.delta as string));
        break;

      case 'response.function_call_arguments.done':
        // Model wants to call a function:
        handleFunctionCall(
          event.name as string,
          JSON.parse(event.arguments as string),
          event.call_id as string
        );
        break;

      case 'response.done':
        console.log('Response complete');
        break;
    }
  };

  const handleFunctionCall = async (name: string, args: unknown, callId: string) => {
    let result: unknown;
    if (name === 'get_weather') {
      result = { temperature: 22, condition: 'sunny', location: (args as { location: string }).location };
    }

    // Return function result to model:
    wsRef.current?.send(JSON.stringify({
      type: 'conversation.item.create',
      item: {
        type: 'function_call_output',
        call_id: callId,
        output: JSON.stringify(result),
      },
    }));

    // Tell model to continue responding:
    wsRef.current?.send(JSON.stringify({ type: 'response.create' }));
  };

  // Audio playback queue — refs so the queue survives React re-renders
  // (plain variables here would reset whenever state updates):
  const audioQueueRef = useRef<AudioBuffer[]>([]);
  const isPlayingRef = useRef(false);

  const playAudioChunk = (base64Audio: string) => {
    if (!audioContextRef.current) return;

    const binary = atob(base64Audio);
    const bytes = new Uint8Array(binary.length);
    for (let i = 0; i < binary.length; i++) bytes[i] = binary.charCodeAt(i);

    // Decode PCM16 to float32:
    const int16 = new Int16Array(bytes.buffer);
    const float32 = new Float32Array(int16.length);
    for (let i = 0; i < int16.length; i++) float32[i] = int16[i] / 32768;

    const buffer = audioContextRef.current.createBuffer(1, float32.length, 24000);
    buffer.getChannelData(0).set(float32);
    audioQueueRef.current.push(buffer);

    if (!isPlayingRef.current) playNext();
  };

  const playNext = () => {
    if (!audioContextRef.current || audioQueueRef.current.length === 0) {
      isPlayingRef.current = false;
      return;
    }
    isPlayingRef.current = true;
    const buffer = audioQueueRef.current.shift()!;
    const source = audioContextRef.current.createBufferSource();
    source.buffer = buffer;
    source.connect(audioContextRef.current.destination);
    source.onended = playNext;
    source.start();
  };

  return (
    <div className="flex flex-col items-center gap-4 p-8">
      <h1 className="text-2xl font-bold">Voice Assistant</h1>
      {!isConnected ? (
        <button
          onClick={connect}
          className="px-6 py-3 bg-black text-white rounded-lg"
        >
          Start Conversation
        </button>
      ) : (
        <div className="text-center">
          <div className="w-4 h-4 bg-green-500 rounded-full animate-pulse mx-auto mb-2" />
          <p className="text-sm text-gray-500">Listening...</p>
          <p className="mt-4 max-w-md">{transcript}</p>
        </div>
      )}
    </div>
  );
}

function float32ToInt16(buffer: Float32Array): Int16Array {
  const int16 = new Int16Array(buffer.length);
  for (let i = 0; i < buffer.length; i++) {
    const s = Math.max(-1, Math.min(1, buffer[i]));
    int16[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return int16;
}

Handling Interruptions

One of the Realtime API's key features: users can interrupt the model mid-sentence:

// When VAD detects user started speaking during model response:
// Server automatically sends: { type: 'input_audio_buffer.speech_started' }
// The model's audio is cut off — you should also stop playback:

ws.onmessage = (event) => {
  const data = JSON.parse(event.data);

  if (data.type === 'input_audio_buffer.speech_started') {
    // User interrupted — stop current audio playback:
    audioQueue.length = 0;  // Clear queue
    if (audioContextRef.current) {
      // Stop all sources by recreating context (fast approach):
      audioContextRef.current.close();
      audioContextRef.current = new AudioContext({ sampleRate: 24000 });
    }
  }
};
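Stopping local playback keeps the UX clean, but the model's conversation state still contains the full response it generated. Two further client events keep the context honest — a sketch, assuming you track the current assistant item ID and elapsed playback time in your own state:

```typescript
// Cancel the in-flight response and truncate the interrupted assistant
// item so the model doesn't "remember" audio the user never heard:
function cancelAndTruncate(
  ws: { send(data: string): void },
  itemId: string,      // ID of the assistant item being played
  playedMs: number     // how much of its audio was actually heard
) {
  ws.send(JSON.stringify({ type: 'response.cancel' }));
  ws.send(JSON.stringify({
    type: 'conversation.item.truncate',
    item_id: itemId,
    content_index: 0,
    audio_end_ms: playedMs,
  }));
}
```

Without the truncate, the model believes the user heard its entire answer, and follow-up turns can reference words that were never played.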

WebRTC Mode (Browser-Direct)

For lower-latency browser apps, WebRTC bypasses your server:

// Step 1: Get an ephemeral token from your server:
// GET /api/realtime/token → { token: "..." }

// server: app/api/realtime/token/route.ts
export async function GET() {
  const res = await fetch('https://api.openai.com/v1/realtime/sessions', {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      model: 'gpt-4o-realtime-preview',
      voice: 'alloy',
    }),
  });
  const session = await res.json();
  return Response.json({ token: session.client_secret.value });
}

// Step 2: Use ephemeral token in browser for WebRTC:
const { token } = await fetch('/api/realtime/token').then((r) => r.json());

const pc = new RTCPeerConnection();
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
pc.addTrack(stream.getTracks()[0]);

const dc = pc.createDataChannel('oai-events');  // For sending/receiving events

const offer = await pc.createOffer();
await pc.setLocalDescription(offer);

const sdpResponse = await fetch(
  'https://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview',
  {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${token}`,
      'Content-Type': 'application/sdp',
    },
    body: offer.sdp,
  }
);

const answer = { type: 'answer' as const, sdp: await sdpResponse.text() };
await pc.setRemoteDescription(answer);
// Audio flows directly browser ↔ OpenAI
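One gap in the snippet above: nothing plays the model's audio yet. With WebRTC the remote track arrives via `ontrack` and the browser decodes it natively, so no manual PCM handling is needed. A sketch using structural types so the wiring stays testable outside a browser:

```typescript
// Minimal structural stand-ins for RTCPeerConnection,
// RTCDataChannel, and HTMLAudioElement:
type TrackSource = { ontrack: ((e: { streams: unknown[] }) => void) | null };
type DataChan = { onmessage: ((e: { data: string }) => void) | null };
type AudioSink = { srcObject: unknown };

function wireRealtimeMedia(
  pc: TrackSource,
  dc: DataChan,
  audioEl: AudioSink,
  onEvent: (type: string) => void
) {
  // Remote audio track → <audio> element (browser handles decoding):
  pc.ontrack = (e) => { audioEl.srcObject = e.streams[0]; };
  // Transcripts and function calls arrive as JSON on the data channel:
  dc.onmessage = (e) => { onEvent(JSON.parse(e.data).type); };
}

// In the browser, after the SDP exchange:
//   const audioEl = document.createElement('audio');
//   audioEl.autoplay = true;
//   wireRealtimeMedia(pc, dc, audioEl, (t) => console.log('event:', t));
```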

Pricing Reality

Realtime API (gpt-4o-realtime-preview):
  Audio input:  $0.10 / min  → 10-minute call = $1.00
  Audio output: $0.20 / min  → 10-minute call = $2.00
  Text tokens:  $2.50/1M input, $10/1M output

vs. Traditional STT + LLM + TTS pipeline:
  Whisper (STT):   $0.006/min
  GPT-4o:          ~$0.01-0.05 per conversation turn
  OpenAI TTS:      $0.015/1K chars (~$0.006/min at ~400 chars/min of speech)
  Total pipeline:  ~$0.03/min

Realtime is ~5-10x more expensive than the pipeline approach.
The tradeoff: 200ms latency vs 2-4 seconds.
For voice assistants where latency feels conversational,
the Realtime API is worth the cost.
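To make the tradeoff concrete, a back-of-envelope estimator using the rates above — a sketch: verify current pricing before budgeting, and note the 50/50 input/output split is an assumption:

```typescript
// Rates from the table above, in dollars per minute:
const REALTIME_AUDIO_IN_PER_MIN = 0.10;
const REALTIME_AUDIO_OUT_PER_MIN = 0.20;
const PIPELINE_PER_MIN = 0.03;

function estimateMonthlyCost(
  callsPerDay: number,
  avgCallMinutes: number,
  mode: 'realtime' | 'pipeline'
): number {
  const perMin = mode === 'realtime'
    // Assume roughly half of each call is user speech, half model speech:
    ? (REALTIME_AUDIO_IN_PER_MIN + REALTIME_AUDIO_OUT_PER_MIN) / 2
    : PIPELINE_PER_MIN;
  return callsPerDay * avgCallMinutes * perMin * 30;
}

// 100 calls/day × 5 min: realtime ≈ $2,250/mo vs pipeline ≈ $450/mo — a 5x gap.
```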

When to Use the Realtime API

Use Realtime API if:

  • You're building a voice assistant or voice chat interface
  • Sub-second response time is critical to the UX
  • Interruption handling matters (users will cut off responses)
  • You need mid-conversation function calling

Stick with STT+LLM+TTS if:

  • Cost is the primary constraint (5-10x cheaper)
  • Your use case tolerates 2-4 second delays (voice memos, dictation)
  • You need more control over each step (custom STT, specific TTS voices)
  • You're already invested in a Deepgram/ElevenLabs/Whisper pipeline

Find and compare voice and speech APIs at APIScout.
