GPT Realtime Is Now Generally Available: gpt-realtime, SIP Calling, MCP Tools, Image Input, and 20% Lower Prices
OpenAI just made their Realtime API generally available, along with a new speech-to-speech model called gpt-realtime.
Here's why you should care: you can now build voice apps that sound completely human without being a coding wizard.
If you've ever used a voice assistant and thought “this sounds robotic and dumb,” this is OpenAI’s attempt to fix that. The new model can laugh, sigh, switch languages mid-sentence, and actually follow complex instructions without losing track.
Here's what you can actually do with it:
- Build your first voice app: The API now handles everything—you don't need to stitch together multiple services. Just point it at your data and it starts talking naturally about your business, products, or services.
- Connect to real phones: New SIP support means your voice app can answer actual phone calls. Think customer support that doesn't suck, or a personal assistant people can call directly.
- Show it images: Mid-conversation, you can share a screenshot or photo and it'll describe what it sees while keeping the conversation flowing naturally.
- Add real tools: Connect it to your calendar, CRM, or any other service through MCP servers (Connectors). It'll actually use them correctly instead of hallucinating responses.
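For the coders: here's a rough sketch of what configuring a session with an MCP tool might look like. Treat it as illustrative, not authoritative: the event and field names follow OpenAI's Realtime API docs at the time of writing and may change, and the server label/URL are made-up placeholders. Check the official reference before shipping anything.

```python
import json


def build_session_update(mcp_server_url: str) -> dict:
    """Build the session.update event sent after opening a Realtime connection.

    Field names are illustrative (based on the Realtime API docs at the time
    of writing); the instructions, voice, label, and URL are placeholders.
    """
    return {
        "type": "session.update",
        "session": {
            "instructions": "You are a friendly support agent for Acme Co.",
            "voice": "marin",
            "modalities": ["audio", "text"],
            "tools": [
                {
                    "type": "mcp",                # remote MCP server exposed as a tool
                    "server_label": "crm",        # placeholder label
                    "server_url": mcp_server_url,
                    "require_approval": "never",  # let the model call tools freely
                }
            ],
        },
    }


if __name__ == "__main__":
    # In a real app you'd open a WebSocket to the Realtime endpoint
    # (wss://api.openai.com/v1/realtime?model=gpt-realtime, authenticated
    # with your API key) and send this as one of your first events.
    print(json.dumps(build_session_update("https://example.com/mcp"), indent=2))
```

The point of the MCP route is that you describe the server once and the model discovers and calls its tools itself, instead of you hand-writing a function schema for every action.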
For non-coders: This matters because every business you interact with will probably start using this. Customer service calls, appointment bookings, even educational tutoring could all get way better (or way weirder, depending on implementation).
And companies are already seeing results: T-Mobile is using it for phone upgrades, Zillow's testing it for home searches, StubHub reports "fast, natural, human-like" responses improving customer satisfaction, Oscar Health is exploring appointment scheduling that feels "magical," and Lemonade says they're resolving customer interactions in seconds while cutting operational costs.
Pricing
Now, this costs about $32 per million words spoken to it and $64 per million words it speaks back—which sounds expensive, until you do the math:
- The average person speaks ~130 words per minute, so 1 million words = ~128 hours of talking.
- That breaks down to $0.25 per hour for processing human speech and $0.50 per hour for AI responses.
- A typical 30-minute customer service call where both sides talk equally costs ~$0.19 total.
- Now, compare that to a human customer service rep at $15-25/hour, and you're looking at 99%+ cost savings… if it works (e.g., OpenAI Chairman Bret Taylor has his own AI customer service company, Sierra, that only charges you when the AI solves the problem autonomously).
Let's break down that math:
For human speech input ($32 per million words):
- Average person speaks ~130 words per minute.
- 1 million words = ~128 hours of talking.
- $32 ÷ 128 hours = $0.25 per hour of human speech processed.
For AI speech output ($64 per million words):
- At the same 130 words per minute pace.
- 1 million words = ~128 hours of AI talking.
- $64 ÷ 128 hours = $0.50 per hour of AI speech generated.
Why this is insanely cheap:
Customer service comparison: A human rep costs $15-25/hour. The AI processes their speech for $0.25/hour and responds for $0.50/hour.
So the total cost = $0.75/hour vs $20/hour for humans.
So when I said "sounds expensive but is actually cheap," I meant it sounds like big numbers until you realize you're paying basically nothing for human-quality voice interaction that would cost 50-100x more with humans. Great for businesses? Yes. Great for customers? That depends on whether the agents can actually resolve most issues. Nothing's worse than trying to convince a bot you really just need to speak to a human. Also? Probably means more robocalls that actually fool you.
Here's a real-world example: Consider a 30-minute customer service call where both sides talk equally...
- Human talks 15 minutes (~2,000 words) = $0.06.
- AI responds 15 minutes (~2,000 words) = $0.13.
- Total cost: $0.19 for the entire call.
For a phone service comparison: Most business phone services charge $0.02-0.10 per minute. This AI voice service works out to roughly $0.006 per minute for a full two-way conversation ($0.19 ÷ 30 minutes).
Now here's a voice actor comparison: Professional voice work costs $100-500+ per hour. AI does it for $0.50/hour. That's 99.9% cheaper.
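The arithmetic above is easy to sanity-check in a few lines (using the article's own assumptions of $32/$64 per million words and a 130-words-per-minute speaking pace):

```python
# Sanity-check the pricing math: $32 per million words of human speech in,
# $64 per million words of AI speech out, at ~130 words per minute.
WORDS_PER_MIN = 130
INPUT_PER_MILLION = 32.0   # USD, human speech processed
OUTPUT_PER_MILLION = 64.0  # USD, AI speech generated

# 1 million words at 130 wpm ≈ 128 hours of continuous talking
hours_per_million = 1_000_000 / WORDS_PER_MIN / 60

input_per_hour = INPUT_PER_MILLION / hours_per_million    # ≈ $0.25/hour
output_per_hour = OUTPUT_PER_MILLION / hours_per_million  # ≈ $0.50/hour

# 30-minute call, each side talking 15 minutes (~1,950 words each)
words_per_side = 15 * WORDS_PER_MIN
call_cost = words_per_side * (INPUT_PER_MILLION + OUTPUT_PER_MILLION) / 1_000_000

print(f"{hours_per_million:.0f} hours per million words; "
      f"${input_per_hour:.2f}/h in, ${output_per_hour:.2f}/h out; "
      f"30-min call ≈ ${call_cost:.2f}")
```

Running this reproduces the figures in the article: ~128 hours per million words, $0.25/hour in, $0.50/hour out, and about $0.19 for the 30-minute call.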
Quick tangent: this is NOT a reason to completely replace human voice talent; in fact, we think voice actors should be paid as part of this process. We need to get to the point where voice actors can license their unique voices to AI companies like OpenAI and get paid royalties for every hour (or every N million words) their voices are used. Spotify already does this with music (royalties per stream)... so why can't AI companies do it with real actors' voices? This (1) opens up the voice options for consumers to hundreds or thousands of voices to choose from (instead of the 6-12 standard options inside ChatGPT), and (2) creates a healthy, passive income stream for voice actors between commercial or narrative gigs. Just saying!
Keep in mind, OpenAI isn't alone here:
Kokoro, an 82-million parameter open-source model, has been crushing benchmarks despite being tiny compared to bigger models. It's Apache-licensed and costs under $1 per million characters to run.
Then there's Sesame, which some consider the current gold standard for voice quality. The fact that we're getting multiple high-quality options means voice interfaces might actually become usable.
Results from actual testing:
Wes Roth tested the tool live on his stream and compared it to Sesame (what he considers “the best of the best” in voice AI). The part where he asks Maya to plead for why he should use her voice instead of OpenAI’s is uncanny-valley-level freaky. gpt-realtime, by comparison, was fairly impressive, but not perfect.
X user kwindla described a dramatic shift from forcing themselves to try voice programming a few months ago to now finding it "difficult to imagine going back to writing code by hand with a keyboard." They use voice-only programming with OpenAI's Realtime API for pair programming with LLMs, treating voice dictation as a challenging benchmark that requires high contextual intelligence and reliable tool calling.
The new gpt-realtime model has crossed the "jagged frontier" threshold for them, making real-world voice programming tasks viable. Their goals include talking naturally to their computer (no rigid dictation), performing keyboard/mouse actions via voice, maintaining context across sessions, and integrating new tools. They've open-sourced their voice dictation code in the pipecat-dictation repo and note the cost is typically under $1 per hour of coding.
And then there's the Shoggoth mini use-case... which really, you just gotta see for yourself to understand lol.
Want to try it?
Here are the links you’ll need:
- The Realtime API docs to get started.
- The Realtime prompting guide and Realtime prompting cookbook.
- The playground link to try it out yourself.
Oh, and apparently you can test it out here or call 425-800-0042 to try it yourself.
Our take:
Voice interfaces are getting good enough that you'll soon prefer talking to clicking through websites. We're pretty bullish on voice becoming the new interaction framework. But being able to seamlessly switch between voice and text depending on your situation (quiet office versus hitting the road) is where things need to head.
Eventually, we want to get to the point where you can draw on your screen, highlight something, and say “wtf is this” and have AI respond instantly. But that’ll require an always-on voice assistant, which is a privacy nightmare we haven't figured out yet.
This is similar to how Google Pixel 10 works now—with Circle to Search, you can highlight or scribble on your screen, and Gemini Live can see what's on your screen while providing visual help and even highlighting solutions directly.
The pattern here is clear: We're moving from interfaces that force us to adapt to them (clicking through menus, typing specific commands) toward interfaces that adapt to how we naturally communicate. Voice was step one. Multimodal interaction—combining voice, vision, and gesture, but all the time, everywhere—is step two. The ultimate goal isn't just better voice assistants, but computers that understand context the way humans do.
Now here's the problem:
Our current operating systems weren't built for this. Windows, macOS, iOS, and Android all assume the old paradigm—discrete apps you launch, use, and close. They treat AI as another app running on top, not as the fundamental interface layer.
Does this mean we need an AI-native operating system? One where the AI assistant isn't an app, but is the OS itself?