OpenAI Drops GPT-4 Omni, New ChatGPT Free Plan, New ChatGPT Desktop App

May 14, 2024

Show Notes

In a surprise launch, OpenAI dropped GPT-4 Omni, their new leading model. They also made a bunch of paid features in ChatGPT free and announced a new desktop app. Pete breaks down what you should know and what this says about AI.

Transcripts: ⁠https://www.theneuron.ai/podcast⁠

Subscribe to the best newsletter on AI: https://theneurondaily.com⁠

Listen to The Neuron: https://lnk.to/theneuron

Watch The Neuron on YouTube: ⁠https://youtube.com/@theneuronai


Welcome to The Neuron! I’m Pete Huang.

We have a LOT to talk about today. OpenAI dropped GPT-4 Omni, their new leading model. In a word, it’s gonna make ChatGPT fantastic, but there’s a hidden piece that most people aren’t talking about.

It’s Tuesday, May 14th. Let’s dive in!

I’ll admit it. I was wrong.

Twice now, I have said that the big upcoming launch for Monday was going to be OpenAI’s competitor to Google, a search engine that could finally displace Google Search.

But as early as Friday, OpenAI leadership was ready to tell the world that Monday's launch was not, in fact, a search engine. Like, very directly. Here's the tweet, and it explicitly says: not gpt-5, not a search engine.

Ok we’ll get back to the leaker accounts that got this wrong in a second here. Let’s talk about what went down.

Streaming live from the OpenAI office in San Francisco, OpenAI announced three big changes:

The first one is GPT-4 Omni, their new leading AI model.

The second one is a set of updates to ChatGPT, which includes a bunch of paid features that are now free, plus a new Voice Mode that they spent a lot of time demoing.

The third is a desktop app, which makes it possible for you to directly ask ChatGPT to analyze and respond to what you’re looking at on your screen.

There’s a LOT behind each one, so let’s take them one at a time.

First, GPT-4 Omni, or GPT-4o. It’s OpenAI’s new best model that is a step above GPT-4. You can give it any text, image or audio. And it can generate any text, image or audio.

And yes, some of that was already possible with the previous ChatGPT. When you opened up Voice Mode in ChatGPT, you could talk to it and it could talk back. You could already upload images and have ChatGPT understand them.

But that was 3 AI models taped together. There was the core GPT-4 for text, there was the DALL-E 3 model for images, and there was the Whisper model for voice. It was like 3 kids in a trenchcoat.

GPT-4 Omni is a totally new model that handles all of those capabilities in one model. Doing this has made the performance a lot better for image and audio.

On tests gauging vision and image understanding abilities, GPT-4o scored 5-10 percentage points better than the latest GPT-4 across every single test. And of course that means GPT-4o is even better than Google Gemini and Anthropic Claude, which were starting to encroach on GPT-4 level quality with their latest models in the last few months.

For audio, they tested speech recognition capabilities across global language groups. The measurement here is word error rate, meaning the higher your number, the more errors you're making. The previous way of processing speech was the Whisper model, which for a region like Sub-Saharan Africa had an error rate of around 34%. GPT-4o cuts that roughly in half, down to around 16%. Still quite high, but way better than making errors a third of the time.
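For concreteness: word error rate is just the word-level edit distance between what the model heard and the reference transcript, divided by the length of the reference. A minimal sketch in Python (my illustration, not OpenAI's actual evaluation code):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance (substitutions,
    insertions, deletions) divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words, computed by dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)
```

So a 34% rate means roughly one word in three comes out wrong, which is what makes the drop to 16% such a big deal for those languages.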

Even for something less exotic to the West, like Eastern European languages, the error rate goes from around 15% down to around 5% when you move from Whisper to GPT-4o.

The previous GPT-4 model could already handle text and code quite well, so OpenAI is claiming that GPT-4o is about the same as GPT-4 capability-wise. However, they also admitted today that gpt2-chatbot, the mysterious chatbot that appeared online a couple weeks back, was indeed from OpenAI and, it turns out, was GPT-4o.

Just as a reminder, these bots were tested on something called the LMSys Chatbot Arena, which is basically people voting which bots were better, which is separate from other benchmarks that are more like standardized tests than popular votes.

The official results are out. Even though GPT-4 and GPT-4o score about the same for text and code on those standardized-test benchmarks, GPT-4o ended up with a significantly higher score on the Chatbot Arena. The latest GPT-4 had something like 1250 points, again with other models like Google Gemini and Anthropic Claude closing in just a few points behind. GPT-4o clocks in today at 1310, about 60 points ahead.

For coding, that gap goes even higher. GPT-4o is 100 points ahead of GPT-4 there, GPT-4o at 1369 and GPT-4 at 1269. Huge difference.
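For context on what those point gaps mean: arena scores are Elo-style ratings accumulated from pairwise votes (LMSys has used Elo and, more recently, Bradley-Terry fits; this is a generic Elo sketch with `k` as an assumed update factor, not their exact pipeline):

```python
def elo_update(rating_a: float, rating_b: float, a_won: bool,
               k: float = 32.0) -> tuple[float, float]:
    """One Elo update after a head-to-head vote between two models.
    The expected score follows the standard logistic curve, so the
    winner gains rating in proportion to how surprising the win was."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta
```

On this scale, a 60-point gap means the higher-rated model is expected to win roughly 58-59% of head-to-head votes, which is why a 60- or 100-point lead counts as a significant difference.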

One last thing about GPT-4o for now. It's freaking fast. Before, when you switched from GPT-3.5 to GPT-4, you could feel the tradeoff: faster output but dumb, or smarter output but slow. Now you have the best of both worlds. It's the best intelligence, but it starts generating instantly and finishes in a flash.

Let’s quickly go to ChatGPT before we talk more about the model.

First, a lot of stuff became free for everyone. You used to have to pay $20 per month in order to get GPT-4 and a whole bunch of other features, like memory, vision, GPTs. All of that is now free. Including GPT-4o.

Which means, if you or someone you know were put off by unimpressive outputs from GPT-3.5 and never got to try GPT-4, this is your moment to give ChatGPT another try and work it into your day-to-day.

If you are already paying for ChatGPT, I wouldn’t blame you if you no longer want to. The free plan got wayyyy better and if you can squeeze your usage under the free limit, you get to use the best AI model on the planet for free. Just for completeness, OpenAI said the paid plan will have higher usage limits and early access to features like the new desktop app, so if that’s valuable to you like it is to me, then go ahead and keep your paid plan.

Ok, let's talk about the big kahuna for ChatGPT updates, which is the new voice mode. It's powered by GPT-4o. That speed I mentioned earlier carries through beyond just text. OpenAI says the voice part of GPT-4o can respond in less than 300 milliseconds, which is about the same as a human in conversation.

The demo spent a lot of time showing this off, and you know, it really worked. As I was watching the demo, I had switched tabs a couple times and took my eyes off the screen. And the conversations they were having really did feel like they were talking with a real person in real time.

Let’s try it together! Close your eyes and listen to this:


Having a voice that responds as fast as a human can changes so much of the interaction. Your train of thought doesn’t get broken by the pauses, you feel like someone’s really listening and present.

But it's not just the speed, either. It's the dynamism, the tone, and the personality that the voices have. It's so much better than the existing state of play, including the leading candidates before today like ElevenLabs and Play.ht. It can joke with you, match the tone you want, and modify its voice to play certain characters.

Here’s another example:


You know, it’s amazing to me how much of a difference a solid voice can make. And when you go through the various demos they posted on their YouTube, it feels magical. If you get a chance, definitely go to the OpenAI YouTube channel to check those out.

Ever since ChatGPT, people have been referencing Her. It’s that movie from 2013 where a man falls in love with a computer program named Samantha.

ChatGPT when it first launched was close enough in concept that we knew it was possible but it was a matter of building the right pieces to get there.

And every few months, another one of those pieces would fall into place, bringing this vision closer: an AI basically as capable as a human at forming a relationship with you.

GPT-4 would come out and ChatGPT got smarter, but it was slow. Voice mode came out, but it was too formal, too stiff, and you couldn't interrupt it. Platforms like Character.ai exploded in popularity and showed us that people were willing to spend enormous time and emotional energy on AI characters, but that was just text on a website. Devices like Humane and Rabbit came out, and people wondered if we'd start carrying these new AI personalities around in their own little cases.

But GPT-4o with this new voice mode ties it all together. The conversation feels natural. It has personality. And because GPT-4o can understand what you’re pointing the camera at, you feel like it’s a friend that’s on FaceTime.

It’s really remarkable that in one and a half years, we’ve gotten Her. We have it and it’s happening. And over the next few weeks, OpenAI will be rolling out this voice mode to paying users, so I’m excited to play around with this.

Ok, I want to talk about two parts that I think need a lot more attention than people will give them.

Number one is the desktop app. For Mac users, OpenAI is launching a ChatGPT app on macOS today. At the time of recording, this is not out yet, but I'm hoping it'll be soon. And for Windows users, they said later in the year, with no specific timeline.

The desktop app has an interesting first part and an even more interesting second part.

The interesting first part is that it's a total upstaging of Apple's Siri. With a simple keyboard shortcut, you can bring up ChatGPT; it shows up like the Spotlight search you get when you press Command-Space. From there, you can ask it a question, and you can even take a screenshot right inside that launcher and include it.

And eventually that same voice mode that we just talked about will also be available in this app.

Hit a key and you can get an AI that can help you do anything. Hit a key and you can get an AI that can help you understand what’s on your screen.

I mean, is that not Siri? Like, that's what we're all hoping for when we ask Siri something and it fails because the question didn't fit into one of the only 10 things it understands.

This is what that was supposed to be, right? So now the question is how Apple will play this. As Apple nears its deal to integrate ChatGPT into the iPhone, and as WWDC, its flagship conference in mid-June, gets closer, you have to wonder what their approach will be and when they'll decide to roll it out.

Ok, so that’s the first part about the desktop app.

Here’s the second. The desktop app is the beginning of the ultimate vision for AI and computing.

Let’s rewind. At some point, computers were run by punch cards. You literally punched holes in cards and fed them into the machine and that was how you ran a program in a computer.

At another point, computers finally got keyboards, but your screen was still mostly black with green blinking text and there was a lot of typing to do the thing that you wanted to do.

Then fast forward another set of years and computers finally got screens with things that you could click. Heck, even the idea of moving a mouse around to point at things on the screen, that was a big moment.

So over time, things got easier and easier to use because you had to learn less and less about the computer. It got abstracted away from you. Instead of having to learn about punch cards, you typed stuff in. Then instead of typing stuff, you just moved your mouse around. Everything got easier to use.

What ChatGPT is doing by sitting on your actual computer is the next level of this. At some point, OpenAI will give ChatGPT the ability to actually do things on your computer. That seems so obvious at this point. It can already see what’s on the screen. It understands what’s on the screen and can tell you what to do next. So what’s stopping it from doing it for you?

Now, think about all the times when you had to point and click something on the screen. Maybe it’s a series of websites or links in order to get information. Maybe it’s opening an app. Replace that with English, either you speak it out loud or type it on the screen.

Isn’t that the next form of interacting with our computers? Isn’t that JARVIS from Iron Man?

For what it’s worth, this is very similar to the vision that Microsoft has laid out with its Copilot. I was at their launch event when they first released the repackaged Copilot brand in September 2023. There, in both the event itself and the private conversations I had with Microsoft leaders afterwards, they said the vision was to have one service that united all of your devices. One interface, one entry point for interacting with AI that could pull from everything that you had.

This ChatGPT desktop app is something similar. One AI app that sees everything and is smart enough to understand, process, explain things to you and eventually do things.

That is transformative.

Here's the second thing I think more people need to pay attention to.

GPT-4o is marketed as an AI model that can understand text, pictures and voice and output text, pictures and voice.

But it is, of course, an AI model for all images, not just pictures of things. And for all audio, not just voice. And its handling of these formats is very impressive.

Some examples: It can create a series of images with consistent characters, like in a storyboard. It's probably the best I've seen so far when it comes to generating images with clear wording, though you might have to give it a few tries, since my early testing shows it's still not a solved problem. It can combine and restyle images; give it two pictures of people and it can merge and style them into a movie poster.

It can even create new custom fonts that have never existed before. It can simulate what it would look like to put your logo on a piece of swag.

And as for the audio, we’re talking the usual use cases like transcribing things, but also producing new audio like sound effects, not just voices.

Interestingly enough, as impressive as this model is, there is another reaction to today’s announcement that is actually negative.

The reasoning is this: if this is supposed to be your new leading model, then the lack of improvement in GPT-4o's reasoning capabilities suggests that we're starting to hit a plateau.

Or put another way, if you had clear line of sight into building a truly better model that will blow us all away from an intelligence standpoint, you wouldn’t be spending your time making a model faster and cheaper, like you are with GPT-4o. This sort of release is an admission that GPT-4 intelligence is going to be roughly as good as it gets for a while, so now we’re focused on optimizing it and packaging it the right way.

I’m going to be very explicit about this. I completely disagree with this view. You can do both. This release was more about ChatGPT than it was about the model capabilities and how smart AI can get. It was also about PR.

Keep in mind that Google I/O is happening today, May 14th. They want to take as much oxygen out of the room as they can from Google. And Google, for what it’s worth, is planning on showcasing something quite similar. They had a preview video yesterday that showed someone pointing a camera at something, asking the AI about it and getting a response with an AI-generated voice.

Now, what Google launches will look completely unsurprising. There’s no wow factor anymore.

And OpenAI has been ready for a long time. GPT-4o was in beta as early as April of 2023. That’s over a year that they’ve had this in the wings and just waiting for the right moment to polish it up and release it.

Which brings me back to the start of this episode, the leaker accounts that got yesterday's launch wrong.

They’re now claiming that OpenAI’s search engine, the thing that started this whole rumor mill about the launch event, just wasn’t ready, and they’re simply punting the launch. And some of these leaking accounts are saying that they wouldn’t be surprised if there was a follow-on launch next week.

I don’t know anymore. Maybe. Let’s just get through this week, let’s see what Google has in store for us at Google I/O and take this one day at a time, shall we?

This is Pete wrapping up The Neuron for May 14th. I’ll see you in a couple days.