OpenAI ships three voice models + Realtime API GA — the voice agent era starts here
OpenAI unveiled GPT-Realtime-2, Realtime-Translate, and Realtime-Whisper on May 7, and graduated the Realtime API from beta to GA. Context expanded from 32K to 128K (4x), Big Bench Audio +15.2pp over 1.5. Voice agents are now plug-and-play in call centers, healthcare, education and translation.

Three voice models simultaneously + Realtime API GA — what May 7 means
Here's the deal. On May 7, OpenAI shipped the voice intelligence package in one go. (1) GPT-Realtime-2 — the first voice model with GPT-5-class reasoning, (2) GPT-Realtime-Translate — instant 50-language translation, (3) GPT-Realtime-Whisper — next-generation speech recognition. And most important: the Realtime API graduated from beta to GA. Which means SLAs, pricing stability, and enterprise contracts are now possible.
The numbers. GPT-Realtime-2 expands context from 32K to 128K — 4x. Big Bench Audio +15.2pp over 1.5, Audio MultiChallenge +13.8pp. Pricing: $32/1M input, $64/1M output (text-token equivalent). Translate $0.034/min, Whisper $0.017/min. A 6-minute call on Translate costs about $0.20, versus $180–600 for a human interpreter at $30–100/min: roughly three orders of magnitude cheaper.
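The per-minute arithmetic is easy to sanity-check. A minimal sketch, using the prices quoted in this article (not an official rate card):

```python
# Per-minute prices quoted in the launch coverage above (assumed figures,
# not an official rate card).
TRANSLATE_PER_MIN = 0.034  # USD/min, GPT-Realtime-Translate
WHISPER_PER_MIN = 0.017    # USD/min, GPT-Realtime-Whisper

def call_cost(minutes: float, per_min: float) -> float:
    """Cost in USD of a call billed per minute, rounded to a tenth of a cent."""
    return round(minutes * per_min, 3)

# A 6-minute translated call and a 1-hour translated meeting.
print(call_cost(6, TRANSLATE_PER_MIN))   # 0.204  (the ~$0.20 figure)
print(call_cost(60, TRANSLATE_PER_MIN))  # 2.04   (the one-hour meeting)
print(call_cost(60, WHISPER_PER_MIN))    # 1.02   (one hour of STT)
```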
Why this is an inflection. Prior voice AI was (1) too slow to maintain natural conversation, (2) too context-limited to carry a long call in one session, and (3) held back by the Realtime API's beta status, which scared teams away from production deployment. This launch addresses all three at once. Call centers, telehealth consultations, live translation, online education, drive-throughs — these categories are now on the runway for full voice-AI deployment within 12–18 months.
Each model — Realtime-2, Translate, Whisper
GPT-Realtime-2. True "voice in, voice out" with GPT-5-class reasoning. The biggest changes: (1) 32K → 128K context lets you carry a one-hour call without external state stitching, (2) tool / function calling executes mid-conversation without breaks, (3) barge-in (the user interrupting) is handled naturally. The +15.2pp on Big Bench Audio isn't just transcription accuracy — it's the composite of "audio reasoning + tool use + dialogue flow." Tone, prosody and emotion recognition also improved over 1.5.
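Mid-conversation tool calling rides on session configuration. A hedged sketch of what a session update with one tool might look like — the model name `gpt-realtime-2`, the `lookup_order` tool, and the exact field set are assumptions drawn from this article and the general shape of OpenAI's Realtime events, not confirmed API values:

```python
import json

# Sketch of a Realtime-style session update enabling one tool.
# Field names follow the general shape of OpenAI's Realtime events;
# the model name and tool are illustrative assumptions.
def build_session_update(model: str, tools: list) -> dict:
    return {
        "type": "session.update",
        "session": {
            "model": model,
            "modalities": ["audio", "text"],
            "tools": tools,
            "tool_choice": "auto",   # let the model decide when to call
        },
    }

lookup_order = {
    "type": "function",
    "name": "lookup_order",          # hypothetical call-center tool
    "description": "Fetch an order's status by order ID.",
    "parameters": {
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    },
}

event = build_session_update("gpt-realtime-2", [lookup_order])
print(json.dumps(event, indent=2))
```

The point of the sketch: the tool definition lives in the session, so the model can invoke it mid-dialogue without the client tearing down the audio stream.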
GPT-Realtime-Translate. Bidirectional instant translation across 50 languages. Speaker recognition is built in, so two people speaking different languages in a meeting each maintain their own context. At $0.034/min, a one-hour meeting costs $2.04. Cisco, Zoom, and MS Teams have already announced Realtime API integrations.
GPT-Realtime-Whisper. Next-generation Whisper-Large-v4. Word error rate (WER): English 1.8% (was 3.2% on v3), Korean 4.3% (was 7.1%), Spanish 2.1%, Mandarin 4.9%. Speaker diarization accuracy 92%, improved noise tolerance. At $0.017/min, STT pricing is now ~1/3.5 of last year's $0.06.
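Word error rate, the metric behind those Whisper numbers, is word-level edit distance divided by reference length. A minimal implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # DP table: d[i][j] = edits to turn ref[:i] into hyp[:j].
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution
    return d[-1][-1] / len(ref)

print(wer("please hold the line", "please hold line"))  # 0.25 (1 deletion / 4 words)
```

A 1.8% English WER means roughly one wrong, missing, or inserted word per 55 spoken words.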
Realtime API GA. Beta 7 months → GA. SLA 99.9%, 12-month price-lock option, enterprise admin / logging / SOC 2 Type II integration. Voice Activity Detection auto-handled, max 10MB audio buffer per session, WebRTC + WebSocket modes both supported.
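Server-side voice activity detection and the audio-buffer cap are both session-level concerns. A sketch of what the configuration might look like — field names mirror the publicly documented `server_vad` shape of the Realtime API, while the numeric thresholds are illustrative assumptions:

```python
# Sketch of a session config enabling server-side VAD, plus a check against
# the 10MB per-session audio buffer cap this article describes. Field names
# follow the published Realtime API shape; numeric values are illustrative.
session_config = {
    "turn_detection": {
        "type": "server_vad",        # let the server detect speech turns
        "threshold": 0.5,            # speech-probability cutoff (illustrative)
        "silence_duration_ms": 500,  # silence that ends a turn (illustrative)
    },
    "input_audio_format": "pcm16",
    "output_audio_format": "pcm16",
}

MAX_AUDIO_BUFFER_BYTES = 10 * 1024 * 1024  # 10MB per-session cap (per article)

def fits_buffer(num_bytes: int) -> bool:
    """Would this much buffered audio stay under the per-session cap?"""
    return num_bytes <= MAX_AUDIO_BUFFER_BYTES

print(fits_buffer(8 * 1024 * 1024))   # True
print(fits_buffer(12 * 1024 * 1024))  # False
```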
Core developments — pricing, benchmarks, integrations
| Model | Input | Output | Key benchmark | Notes |
|---|---|---|---|---|
| GPT-Realtime-2 | $32/1M tokens | $64/1M tokens | Big Bench Audio +15.2pp | 128K context |
| Realtime-Translate | $0.034/min | (included) | 50 languages bidirectional | Cisco/Zoom integrated |
| Realtime-Whisper | $0.017/min | (none) | English WER 1.8% | Speaker diarization |
| Prior GPT-Realtime-1.5 | $40/1M tokens | $80/1M tokens | Big Bench Audio baseline | 32K context |
Early adopter cases.
- Salesforce Service Cloud Voice: Realtime-2 deployed for primary call handling; average handle time dropped from 4:30 to 2:50 (37% reduction). First-call resolution went from 41% to 63%.
- Cisco Webex / MS Teams: Realtime-Translate integrated into multi-party meetings. Beta users report 30% meeting efficiency gain and 80% reduction in unnecessary interpreter costs.
- Khan Academy: GPT-Realtime-2-based voice tutor "Khanmigo Voice" launched. Students explain math reasoning aloud and get instant feedback.
- Anthropic Claude doesn't use the GPT-Realtime API — its own Voice Mode runs a separate path. The voice API standard is hardening around OpenAI Realtime.
Developer experience shift. Previously you had to glue (a) STT (Whisper) → (b) LLM (GPT-4) → (c) TTS (ElevenLabs) into a pipeline yourself, with 1–2 second latency. Realtime API GA collapses (a–c) into a single endpoint with 350ms average latency. The threshold for natural human conversation has been crossed. The ROI horizon for replacing call center IVR shrinks from 12 months to 6.
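The pipeline-versus-endpoint latency gap can be made concrete with a rough budget. The per-stage numbers below are illustrative assumptions, not measurements; only the 350ms single-endpoint figure comes from this article:

```python
# Rough latency budgets, in milliseconds, for the glued STT -> LLM -> TTS
# pipeline the paragraph describes. Stage values are illustrative
# assumptions; the 350ms endpoint figure is the article's claim.
glued_pipeline = {
    "stt_transcribe": 400,   # finalize audio chunk + transcribe
    "llm_generate": 700,     # first tokens from the text LLM
    "tts_synthesize": 300,   # first audio out of the TTS engine
    "extra_hops": 150,       # three vendor round trips instead of one
}
realtime_endpoint_ms = 350   # single bidirectional stream (article figure)

pipeline_ms = sum(glued_pipeline.values())
print(pipeline_ms)                         # 1550 (the "1-2 second" regime)
print(pipeline_ms / realtime_endpoint_ms)  # ~4.4x slower than one endpoint
```

Under these assumptions the glued pipeline sits squarely in the 1–2 second range the paragraph cites, mostly because each stage must finish (or at least start emitting) before the next begins.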
Who wins — beneficiary breakdown
OpenAI locks in the voice infrastructure standard. Running ChatGPT Voice Mode and the Realtime API on the same backbone makes "voice AI = OpenAI" the default for both enterprises and developers. On revenue, voice tokens price 4–5x text — a direct ARR accelerant.
SaaS integrators (Salesforce, Cisco, Zoom): They no longer need to build voice models in-house. Drop in OpenAI's API and overlay voice on existing workflows. Lower R&D spend, faster ship velocity. The lock-in risk: OpenAI pricing power directly determines voice AI unit economics.
Call center BPOs (Concentrix, Teleperformance): AHT down 30–40% reduces fulfillment cost. At the same time, the projected human-agent share of call volume shifts from 50% to 30%, putting structural pressure on the labor-based business model. Differentiation comes from (1) strength in non-English languages (Korean, Japanese, Vietnamese) and (2) hybrid human-AI workflow design.
Healthcare / telehealth (Teladoc, Babylon): Whisper v4 accuracy makes telehealth voice → chart auto-conversion clinically deployable. The category for HIPAA-compliant voice medical AI is now being defined.
Translation market (Lilt, DeepL): The market expands from text translation to voice translation. DeepL is responding with its own voice model, but it shoulders a heavier infrastructure-investment burden than OpenAI. The $0.034/min Translate price is direct pressure.
Past parallels — Whisper-1 2022, Twilio 2010, IBM Watson Voice 2014
Whisper-1 open-source release, 2022. When OpenAI first open-sourced speech recognition, the STT market price collapsed 80% in a year. Whisper-based startups exploded — Otter.ai, Fireflies, Read.ai. Realtime-2 extends the same pattern to "voice reasoning + dialogue + translation" wholesale. The impact on voice AI overall could be 10x what Whisper-1 did to STT.
Twilio 2010, voice/SMS API. Abstracting telecom infrastructure into an API cut build time for call centers and SaaS integrators by 90%. Twilio scaled to $30B-class market cap. Realtime API GA could put OpenAI in the "Twilio of voice AI" position. Valuation models extend to SaaS infrastructure.
IBM Watson Voice 2014. Bundled voice + natural language as integrated cognitive computing — failed on accuracy / pricing / integration friction. Realtime-2 differentiates with (1) WER 1.8%, half the prior generation, (2) competitive pricing, (3) single-API minimal integration friction. Even so, the IBM Watson lesson stands: "technical lead ≠ market adoption guarantee."
Counter-case: Amazon Lex / Google Dialogflow. Cloud call-center voice models have been on the market for 5+ years but adoption stalled. Reasons: (1) accuracy gap, (2) weak dialogue design tooling, (3) higher-than-expected pricing. Realtime-2 addresses all three. Amazon and Google now have to chase OpenAI quickly.
Competitor counterplays — Anthropic, Google, Amazon, ElevenLabs
Anthropic's counter. Claude Voice Mode competes directly with ChatGPT Voice Mode, but the timing for an external-developer API is still TBD. The May 5 Opus 4.7 release didn't mention a voice API. Strength: model accuracy. Weakness: voice latency infrastructure. A Realtime API peer must ship within 6–12 months — otherwise Anthropic risks losing the voice category to OpenAI entirely.
Google's counter. Gemini Live competes with ChatGPT Voice Mode. Vertex AI exposes Live API. Strengths: (1) Gemini 3.1 Ultra multimodal performance, (2) 50-language instant translation (already built). Weakness: lacks Cisco/Zoom-class enterprise integration. Google Cloud Contact Center AI needs to be re-packaged as a full stack.
Amazon's counter. AWS Polly + Transcribe + Bedrock Voice as a full stack. Strength: AWS-locked-in enterprise base. The May 7 Bedrock AgentCore Payments could have been paired with a voice agent announcement — wasn't. Voice trails OpenAI by 2–3 quarters.
ElevenLabs / Cartesia (voice specialists). ElevenLabs leads on TTS quality (naturalness). Right after Realtime-2, on May 9, ElevenLabs unveiled Conversational AI v3 targeting 200ms latency. Differentiation: (1) decisively superior tone/emotion customization, (2) high quality non-English voice synthesis, (3) proprietary voice cloning. Limit: voice reasoning still depends on external LLMs (OpenAI/Anthropic).
Korea-specific: Naver Clova / SKT. Clova Voice has a global edge on Korean STT/TTS accuracy, and SKT runs its own "A." (A-dot) assistant infrastructure. Both face pressure from Realtime-2's improved Korean accuracy, but Korean call-center data, pronunciation specifics, and cultural context remain differentiators. Key cards: Korean-English code-switching, dialect handling, and Korea-specific BPO datasets.
So what changes — by persona
Developers / startups: Voice-AI SaaS builds are now possible with a single API + 350ms latency. Call center automation, medical voice charting, AI tutors, voice coaches, voice-guided games — category diversity will explode. Pricing stability improves ROI modeling. Realtime-2's 128K context handles a one-hour session in a single window.
Call center managers / CCOs: A 30–40% AHT reduction plus 60%+ first-call resolution is achievable in 6–12 months. The operating model needs a gradual headcount shift (50% → 30%) plus reskilling of humans onto high-value escalations. The frame is not "automation" but an "AI-First, Human-Escalation" workflow.
Healthcare / telehealth leaders: Doctor-voice → chart automation cuts per-doctor visit time by 30%. HIPAA / privacy compliance is mandatory. Whisper-Realtime accuracy is now clinically usable, but medical-domain fine-tuning + human verification layers remain mandatory.
Education sector: Voice tutors like Khanmigo Voice become viable at $5–10 per student per month. Korean test-prep majors (Megastudy, Etoos) are likely close to voice-tutor adoption. Whether this pressures teacher hiring or settles into supplementary tooling depends on policy variables.
Translation / interpretation professionals: Live interpretation transitions from human to AI quickly. $0.034/min vs human interpreter $30–100/min — three orders of magnitude. Differentiation areas: (1) precision-critical (legal, medical), (2) context-deep (diplomatic, political). General business interpretation likely sees 80% AI conversion in 6–12 months.
Korea market implications: Direct shock to Korean BPO industry (KT cs, LG U+ etc.). Korean WER 4.3% puts Whisper-Realtime past the threshold. Korean call center automation share goes 30% → 60% in 12–24 months. Local voice-AI startups (Tridge Voice, Spitch.ai) face differentiation pressure but also infrastructure-cost relief.
References
- Advancing voice intelligence with new models in the API — OpenAI, 2026-05-07
- OpenAI launches new voice intelligence features in its API — TechCrunch, 2026-05-07
- OpenAI Releases Three Realtime Audio Models — MarkTechPost, 2026-05-08
- Realtime API documentation — OpenAI Platform
- Big Bench Audio benchmark — Google Research
- Whisper v4 model card — OpenAI
Key signals to track over the next 6 months
Five signals will determine whether the Realtime API GA truly opens the voice-agent era or stalls in adoption friction.
- Anthropic Voice API timing: if Anthropic ships a peer-grade Voice API within 90 days, the voice category becomes a true two-horse race; if not, OpenAI cements the standard alone.
- Salesforce Service Cloud Voice churn metrics: the 37% AHT reduction and 63% first-call resolution will either replicate at Concentrix and Teleperformance or stay isolated to Salesforce's well-tuned vertical. Replication signals broad adoption.
- ElevenLabs Conversational AI v3 latency: if the sub-200ms target ships and TTS quality stays #1, voice agents fragment between OpenAI (reasoning) and ElevenLabs (voice quality), creating a stack split.
- Korean WER 4.3% real-world performance: Korean telecoms (KT, SKT) and BPOs need to validate this in production before mass deployment; benchmark numbers and field data often diverge in agglutinative languages.
- Regulatory response on voice cloning: California, the EU, and Korea are drafting voice-clone disclosure requirements; the speed and strictness of those rules shape which voice features ship to consumer agents.
Bottom line
The May 7 launch isn't just a model update — it's the standardization moment for voice AI. Realtime API GA + 350ms latency + 128K context + 50-language translation + clinical-grade STT collapses what used to be five separate vendor pipelines into one endpoint. Call centers, telehealth, education, translation, and drive-throughs are now production-ready categories. The question is no longer "can voice AI work" but "which integrators ship first." OpenAI took pole position. The next 6 months show who chases.