
OpenAI’s WebRTC Problem: How to Actually Fix It

OpenAI's WebRTC problem isn't latency – it's silent failures. Here's why your Realtime API connection dies, plus working fixes for ICE, Firefox, and tokens.

7 min read · Intermediate

The number one mistake developers make with OpenAI’s WebRTC implementation: they copy-paste a tutorial from six months ago, hit “connect,” watch ICE freeze in checking, and assume their network is broken. It isn’t. The tutorial is.

The Realtime API moved fast. It went generally available with gpt-realtime on August 28, 2025 (OpenAI announcement), and the endpoint shape changed underneath everyone. The GA URLs are /v1/realtime/client_secrets and /v1/realtime/calls – the preview endpoint (/v1/realtime/sessions) is on borrowed time. If you’re following a post from before August 2025, you’re shipping dead code.

What “OpenAI’s WebRTC problem” actually means right now

It’s not one bug. It’s three categories of failure that all surface as “the mic light is on but nothing happens.”

  • Auth lifetime: ephemeral tokens are short. Like, really short.
  • ICE negotiation: peer connection silently dies behind NAT when you skip STUN config.
  • Browser quirks: Firefox is currently dropping sessions on the second speech turn while Chrome works fine.

Every one of these has a fix. None of them are in the “getting started” tutorial.

The token TTL trap (the #1 silent killer)

Older write-ups – including webrtcHacks’ beta-era deep dive – measured a 2-hour TTL on ephemeral tokens. That’s no longer true. The official Realtime API reference is clear: all tokens expire after one minute. The ephemeral key is a one-shot credential, meant to be minted server-side and used immediately in the browser.

Think of it like a boarding pass printed at the gate. It’s valid right now, for this flight. Cache it, hand it to someone else, or wait too long at the jetway – it doesn’t work. Same logic: if your frontend caches the token, the user reads a permission dialog slowly, or a tab sleeps and resumes, the SDP handshake hits a 401 and dies without a useful error message.

Here’s the correct mint flow against the GA endpoint:

// server.js - Node/Express
app.get("/token", async (req, res) => {
  const r = await fetch("https://api.openai.com/v1/realtime/client_secrets", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "Content-Type": "application/json",
      "OpenAI-Safety-Identifier": "hashed-user-id",
    },
    body: JSON.stringify({
      session: {
        type: "realtime",
        model: "gpt-realtime",
        audio: { output: { voice: "marin" } },
      },
    }),
  });
  const data = await r.json();
  res.json(data); // data.value starts with "ek_"
});

The response contains a value field starting with ek_. Use that as the client secret when establishing the WebRTC connection – and have your backend mint a fresh one every time the browser asks. Never cache it.
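One way to enforce that discipline client-side is a mint helper that always hits the backend plus a freshness check that refuses to reuse an aging key. This is a sketch, not part of any SDK: the "/token" route is the server endpoint shown above, and the 80% safety margin is an assumption, not an API rule.

```javascript
// Sanity check: treat a token older than ~80% of its 60-second TTL as stale.
// The margin is an assumption to absorb handshake time; tune as needed.
function isTokenFresh(mintedAtMs, nowMs = Date.now(), ttlMs = 60_000) {
  return nowMs - mintedAtMs < ttlMs * 0.8;
}

// Always mint fresh - never read from a cache, a global, or localStorage.
async function mintFreshToken() {
  const res = await fetch("/token"); // server route from the snippet above
  const { value } = await res.json();
  return { value, mintedAt: Date.now() };
}
```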

Fixing the ICE “checking” stall

This one bites everyone who tests locally and then deploys. Local works. Production hangs forever in iceConnectionState: checking. Community threads document exactly this failure mode: ICE stays in “checking” and eventually times out when only host candidates are available – and it’s reproducible even with the official WebRTC quickstart code, which calls new RTCPeerConnection() with no config.

Behind any NAT or corporate firewall, host candidates alone can’t reach the other side. Add STUN:

const pc = new RTCPeerConnection({
  iceServers: [
    { urls: "stun:stun.l.google.com:19302" },
    { urls: "stun:stun1.l.google.com:19302" },
  ],
});

For production, you’ll want a TURN server too – Google’s STUN is fine for hobby projects, but serious deployments use Twilio Network Traversal or a self-hosted coturn instance.
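If you do add TURN, a small helper keeps the config in one place. This is a sketch: the TURN URL and credentials below are placeholders for whatever your coturn instance or Twilio Network Traversal hands you.

```javascript
// Builds an iceServers list: STUN always, plus a TURN relay when
// credentials are provided. All TURN values here are placeholders.
function buildIceServers(turn) {
  const servers = [
    { urls: "stun:stun.l.google.com:19302" },
    { urls: "stun:stun1.l.google.com:19302" },
  ];
  if (turn) {
    servers.push({
      urls: turn.urls, // e.g. "turn:turn.example.com:3478"
      username: turn.username,
      credential: turn.credential,
    });
  }
  return servers;
}

// const pc = new RTCPeerConnection({ iceServers: buildIceServers(myTurn) });
```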

Why does the official tutorial skip this? Probably because it was written assuming a controlled demo environment. WebRTC’s NAT traversal is a solved problem in the broader ecosystem – it just requires you to actually configure it. The Realtime API doesn’t add any magic on top.
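To make the stall visible instead of silent, a small watchdog can flag a negotiation that never leaves “checking”. This is a sketch, not an API feature: the 10-second default is an assumption you should tune for your network.

```javascript
// If ICE is still stuck in "new"/"checking" after a timeout, report it
// instead of hanging silently. Likely cause: missing STUN/TURN config.
function watchIceStall(pc, onStall, timeoutMs = 10_000) {
  const timer = setTimeout(() => {
    if (pc.iceConnectionState === "new" || pc.iceConnectionState === "checking") {
      onStall(pc.iceConnectionState);
    }
  }, timeoutMs);
  pc.addEventListener("iceconnectionstatechange", () => {
    if (["connected", "completed"].includes(pc.iceConnectionState)) {
      clearTimeout(timer); // negotiation succeeded - stand down
    }
  });
  return timer;
}

// Usage: watchIceStall(pc, (state) => console.error(`ICE stuck in "${state}"`));
```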

The Firefox gotcha nobody is talking about

Realtime API sessions via WebRTC drop deterministically on the user’s second speech turn on Firefox – roughly 30 to 90 seconds in, depending on the AI’s first response length. The same code, same account, same gpt-realtime model works through 5+ minute sessions on Chrome and Edge. The drop fires on input_audio_buffer.speech_started and cascades through iceConnectionState: disconnected. This was documented in an OpenAI community bug report and has no patch yet.

Workaround until OpenAI patches this: Detect Firefox client-side (navigator.userAgent) and either show a “Chrome recommended” notice or fall back to the Agents SDK’s WebSocket transport for those users. Not elegant – but better than a session that dies mid-conversation.
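A minimal sketch of that detection, pulled out as a pure function so it’s testable. The UA sniff is deliberately crude; a production app might prefer feature detection or a maintained UA-parsing library.

```javascript
// Pick a transport based on the user agent string.
function pickTransport(userAgent) {
  const isFirefox = /firefox/i.test(userAgent);
  // Route Firefox to the Agents SDK's WebSocket transport (or show a
  // "Chrome recommended" banner) until the second-turn bug is patched.
  return isFirefox ? "websocket" : "webrtc";
}

// In the browser: const transport = pickTransport(navigator.userAgent);
```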

Wiring up the browser side correctly

Two things every tutorial gets wrong by omission: autoplay policy and HTTPS.

// client.js
const tokenRes = await fetch("/token");
const { value: EPHEMERAL_KEY } = await tokenRes.json();

const pc = new RTCPeerConnection({
  iceServers: [{ urls: "stun:stun.l.google.com:19302" }],
});

const audioEl = document.createElement("audio");
audioEl.autoplay = true;
pc.ontrack = (e) => (audioEl.srcObject = e.streams[0]);

const ms = await navigator.mediaDevices.getUserMedia({ audio: true });
pc.addTrack(ms.getTracks()[0]);

const dc = pc.createDataChannel("oai-events");
dc.onmessage = (e) => console.log(JSON.parse(e.data));

const offer = await pc.createOffer();
await pc.setLocalDescription(offer);

const sdpRes = await fetch("https://api.openai.com/v1/realtime/calls?model=gpt-realtime", {
  method: "POST",
  body: offer.sdp,
  headers: {
    Authorization: `Bearer ${EPHEMERAL_KEY}`,
    "Content-Type": "application/sdp",
  },
});
await pc.setRemoteDescription({ type: "answer", sdp: await sdpRes.text() });

Per Microsoft’s Azure OpenAI WebRTC documentation (which mirrors the browser security model): HTTPS is mandatory for getUserMedia – http://localhost is the only exception. And audio output can be silently blocked by browser autoplay policies unless the user has already interacted with the page. Always gate the connect call behind a button click, not a page load.
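The gating itself is a few lines. This sketch assumes a startSession function wrapping the connect code above and a connect button in your page; both names are placeholders.

```javascript
// Wire the whole connect flow to a user gesture so autoplay policies
// allow audio output. "startSession" stands in for the connect code above.
function wireConnectButton(button, startSession) {
  button.addEventListener("click", async () => {
    try {
      await startSession(); // mint token, build RTCPeerConnection, exchange SDP
    } catch (err) {
      console.error("Session failed to start:", err);
    }
  });
}

// In the browser:
// wireConnectButton(document.getElementById("connect"), startSession);
```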

When NOT to use WebRTC here

Use case                        Use WebRTC?  Why
Browser voice assistant         Yes          Lowest latency, handles media
Mobile app (native)             Yes          Same reasons, plus codec handling
Phone calls (SIP/Twilio)        No           Use SIP transport or WebSockets server-side
Server-to-server pipeline       No           WebSockets, no media plumbing needed
Recording / transcription only  No           WebSocket + gpt-realtime-whisper

OpenAI’s own guidance is straightforward: WebRTC for browser and mobile media, WebSockets for server-side pipelines like phone calls or broadcast ingest. If your audio originates server-side, WebRTC adds peer-connection overhead for nothing.

Specs and cost to plan around (as of August 2025)

Sessions can last up to 60 minutes – doubled from the 30-minute beta cap. The token window is 32,768 tokens total: 4,096 max output, 28,672 max input. Long support calls will hit that ceiling. The API truncates from the oldest messages automatically, so if you need full transcripts, save them client-side as they stream in.
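Saving transcripts as they stream can be a small accumulator fed from the data channel. A sketch under one big assumption: the event type names below follow the Realtime API’s delta/done pattern but are illustrative – check the current event reference before relying on them.

```javascript
// Accumulate streamed transcript deltas so nothing is lost when the API
// truncates old messages. Event type names are assumptions.
function makeTranscriptRecorder() {
  const lines = [];
  let current = "";
  return {
    handleEvent(evt) {
      if (evt.type && evt.type.endsWith("transcript.delta")) {
        current += evt.delta || "";
      } else if (evt.type && evt.type.endsWith("transcript.done")) {
        lines.push(current); // one completed utterance
        current = "";
      }
    },
    transcript() {
      return lines.slice();
    },
  };
}

// const recorder = makeTranscriptRecorder();
// dc.onmessage = (e) => recorder.handleEvent(JSON.parse(e.data));
```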

Pricing, per a third-party breakdown published at GA: $32 per million input tokens, $0.40 for cached input, $64 per million output tokens. Token-based, not minute-based – you pay for compute consumed by reasoning, not wall-clock session time. A short, decisive exchange costs almost nothing. A long back-and-forth with heavy context costs proportionally more.
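The arithmetic is simple enough to sketch as a back-of-envelope helper using the rates quoted above. Treat the defaults as a snapshot of third-party figures at GA, not canonical pricing.

```javascript
// Estimate session cost in USD from token counts, using the rates above:
// $32/M input, $0.40/M cached input, $64/M output. Rates change over time.
function estimateCostUSD({ inputTokens = 0, cachedInputTokens = 0, outputTokens = 0 }) {
  const PER_MILLION = 1_000_000;
  return (
    (inputTokens / PER_MILLION) * 32 +
    (cachedInputTokens / PER_MILLION) * 0.4 +
    (outputTokens / PER_MILLION) * 64
  );
}

// A modest call - 50k input, 10k cached, 20k output - lands under $3.
```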

Common pitfalls, ranked by how often they bite

  1. Caching the ephemeral token – 60-second TTL. Fetch a fresh one every connect.
  2. No STUN config – works on localhost, dies in production behind NAT.
  3. Calling the beta endpoint – /v1/realtime/sessions still works, but it’s slated for deprecation and blocks access to async function calling and MCP support. Migrate to /v1/realtime/client_secrets.
  4. Forgetting OpenAI-Safety-Identifier – recommended for user-facing apps to satisfy compliance and abuse-detection requirements.
  5. Ignoring the Firefox bug – if you skip browser detection, Firefox users get a broken experience with no obvious error. Build the detection in from day one.

FAQ

Is the old /v1/realtime/sessions endpoint dead?

Not yet – but migrate now. The GA endpoint is /v1/realtime/client_secrets, and the beta interface lacks async function calling and MCP support. Waiting means a forced rewrite later, under pressure.

Why does my session work in Chrome but break in Firefox after the second user turn?

Known bug, documented in the OpenAI community forum. The WebRTC data channel drops on input_audio_buffer.speech_started on the second turn – deterministically, Firefox only. Chrome and Edge run the same code through 5+ minute sessions without issue. It is not a bug in your code. Detect Firefox via navigator.userAgent and route those users to a WebSocket fallback or show a browser banner. That’s the only working workaround until OpenAI ships a fix.

Can I skip the backend and use my API key directly in the browser?

No. A visible API key in browser devtools is a leaked key – and one leaked key can generate thousands in charges before you notice. Mint ephemeral tokens server-side. Always.

Next action: Clone the openai-realtime-console repo, swap its token endpoint to the GA /v1/realtime/client_secrets URL, add the STUN config from this guide, and test on Chrome first. If Chrome works but Firefox doesn’t – you’ve reproduced the bug, and you know exactly what to do.