Welcome all you cool cats to The Neuron! I’m Pete Huang.
Today, we’re diving into an experiment: how researchers made AI doctors and nurses, and the surprising results when you let them play hospital for a day.
It’s Thursday, May 9th. Let’s dive in!
GPT-4 has now been out for over a year. OpenAI announced it in March of 2023, and it immediately reignited the shock that had been percolating through the Internet at the time.
Just 4 months prior, OpenAI released ChatGPT, which was then powered by GPT-3.5. 4 months clearly was not enough time for us to digest what was going on with ChatGPT. People were still going bananas watching ChatGPT do things that seemed impossible for a computer to do. It was writing clearly, it seemed to react in humanlike ways, it seemed to demonstrate real thought.
All that seemed to be in place with the GPT-3.5 version of ChatGPT. Then OpenAI drops GPT-4. Remember, just 4 months after the very first version of ChatGPT. And people are going nuts.
Of course, GPT-4 was way, way better when it came to coherence, the quality and wittiness of its responses, the craftiness of its jokes, its ability to solve puzzles that GPT-3.5 didn’t have a chance at, and its ability to handle complex tasks correctly.
But really, the reaction was not even about the quality of the responses anymore. It was how quickly we got GPT-4 right after ChatGPT. The narrative became, OK, if GPT-4 was just 4 months after ChatGPT, and it’s so much better, where the heck is AI going to be by the end of 2023? Like, is SkyNet coming? What are they working on now that is going to come out in November of 2023 and blow us all away?
In some ways, we’re lucky that there wasn’t a GPT-5 that came out last year. Not that anyone really expected it, it would’ve been a little too quick. But mostly, it gave people time to digest what we even had on our hands with GPT-4.
One of the major realizations was that we had a pathway to agents on our hands. And this pathway is ultimately behind the surprising research that came out this week.
Ok, so agents. “Agents” is one of those terms that lots of people talk about, but everyone seems to have a different definition. The simple example is this: right now, ChatGPT does pretty well at things that are super tightly scoped. For example, if you want it to write a blog post about something, the way you’d do it is you’d give it a simple action like “write a blog post” and then give it a lot of extra stuff that serves as the information it should use for that blog post.
So you’d tell it what information to include in that blog post. You’d give it guidelines for the tone and voice that you want. Maybe you’d even outline it for it.
Then it’d go and write the thing.
So even if the general idea is that you’ve told it to write a blog post, you’re really handholding it through the key points of that post, right? You’ve done a lot of the thinking for it. It hasn’t really had to come up with any novel ideas or any information that you haven’t given it. All it’s doing is essentially taking whatever you’re telling it and repackaging it in the form of a blog post. It’s saving you the time to type it all out, but you still had to spend the time on the important stuff.
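If you were doing that through the API instead of the chat window, it’d look something like this little sketch. To be clear, the model name and the prompt contents here are just illustrative, not anything from the episode:

```python
# A minimal sketch of a tightly scoped request: we hand the model the topic,
# the key points, and the tone, and it just repackages what we gave it.
# The prompt contents and model name are illustrative.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

prompt = """Write a blog post about content marketing.

Key points to cover:
- Why consistency beats volume
- How to repurpose one piece of content into five
- A simple editorial calendar to start with

Tone: casual, practical, no jargon. Keep it under 800 words."""

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```

Notice that every ingredient of the output is something we typed in ourselves; the model just assembles it.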
One interpretation of agents is that you’d be able to tell an agent “write a really interesting blog post about content marketing” and it would independently be able to think about what would be an interesting piece of content, gather a bunch of information, develop a coherent outline, and write the full output.
So in the end, you don’t even have to do the thinking. You just have to tell it to write the blog post about content marketing and it would do the rest. Heck, an agent could even do that part. You could just say “come up with and execute on a really good content strategy for this quarter” and theoretically the same skills that an agent would need to do that blog post independently would also allow it to do that, right? The same thinking, the same planning.
In fact, Andrew Ng, one of the leading computer scientists on AI, specifically calls out the ability to plan and the ability to reflect on its work as two key pieces of what it means to be an AI agent.
It turns out that GPT-4 was the first time we got a glimpse of those abilities to plan and reflect in a very visceral way. If you asked GPT-3.5 to plan and reflect on something, it could do it, but there was something about GPT-4 that made you really believe, for the first time, that a computer was doing it in a way that felt human.
In the 6 months after GPT-4’s release in March of 2023, we saw a lot of commentary about this type of capability. People here in San Francisco were actively readjusting their expectations for what the future would look like and when.
And in September 2023, we see a research paper, a really novel project, that catches everyone completely by surprise and that ultimately becomes the inspiration for the AI agent doctor research that I’m gonna walk through in a few minutes here.
Today, that research paper is nicknamed the Stanford Town paper. The official title: Generative Agents: Interactive Simulacra of Human Behavior. It got its nickname because it came from Stanford and it simulated a town.
The TL;DR of Stanford Town is that the researchers literally built Westworld. You know what I mean right? Westworld is basically a theme park filled with these robots that would interact with you like a human, then they’d reset.
And Stanford Town did something very similar.
Because GPT-4 exhibited this sort of planning and reflecting, they set up these little characters in a town. Each character got a short paragraph that described two things. First, they got some description of who they were, so a name, what they did for work, a bit of a personality. Second, they got relationships. So, John is Derek’s brother, John is friends with Christina, John thinks Chris is an annoying neighbor, etc. Relationships also include some memories of their interactions, so things like John and Derek love to talk about sports with each other. Or John and Christina first met at a friend’s birthday party.
And really, all they did from there was set them loose.
They told the agents: given these two things as your starter memory, plan your day and go about it. Along the way, whatever happens to you that you think you should remember, go ahead and add that to your memory.
Now, given just these simple ingredients (a starting personality, a memory of your interactions with others, and instructions to plan your day and remember the things that feel important), what emerged once again changed people’s expectations about what was possible with the technology we had on our hands.
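Here’s a rough sketch in Python of how thin that scaffolding can be. This is my own simplified illustration of the generative-agents pattern, not the actual Stanford code, and the llm() function is just a stand-in for whatever model you’d call:

```python
# A stripped-down illustration of the generative-agents pattern:
# persona + relationships as starter memory, plan the day, talk to people,
# and write anything notable back into memory. Everything here is a
# simplified stand-in, not the actual Stanford Town code.
from dataclasses import dataclass, field

def llm(prompt: str) -> str:
    # Placeholder for a real model call; swap in your API of choice.
    return f"[model response to: {prompt[:60]}...]"

@dataclass
class Agent:
    name: str
    persona: str                       # who they are: job, personality
    relationships: str                 # who they know, shared memories
    memory: list[str] = field(default_factory=list)

    def context(self) -> str:
        return "\n".join([self.persona, self.relationships] + self.memory)

    def plan_day(self) -> str:
        return llm(f"{self.context()}\n\nPlan {self.name}'s day, hour by hour.")

    def converse(self, other: "Agent", topic: str) -> str:
        reply = llm(
            f"{self.context()}\n\n{self.name} runs into {other.name} and they "
            f"talk about {topic}. What does {self.name} say?"
        )
        # Let the agent decide what's worth keeping from the exchange.
        note = llm(f"{self.context()}\n\nIn one line, note anything worth "
                   f"remembering from this exchange:\n{reply}")
        self.memory.append(note)
        return reply

john = Agent(
    name="John",
    persona="John is a pharmacist who loves small talk and local sports.",
    relationships="John is Derek's brother and is friends with Christina.",
)
print(john.plan_day())
```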
These little characters, these agents, actually did start to resemble a little town. You had people going to their respective jobs. They’d have coherent conversations with each other that built on their prior relationships. In those conversations, they’d spread information to each other. One of the characters had decided to run for office, and you could watch the entire town slowly learn about his campaign. They would coordinate with each other. One of the characters was told to plan a party at the cafe, and eventually a few characters decided to help plan, and a few more showed up for the actual event that evening. And they’d also remember things about each other. In one of the conversations, one character mentioned to another that they were working on a photography project. Later that day, the two bumped into each other again, and that second character asked the first how the project was going.
So you had this emergent social behavior, all stemming from GPT-4, these very simple setups for each character and a rather simple setup for the entire town.
This week, researchers released a paper that was essentially the same setup, but for a hospital. And the eventual result wasn’t just “hey, this is interesting, look at them interact like humans!” but something more impactful: this type of setup eventually produced characters who scored super high on certain medical questions, even beating AI models that were specialized for answering those questions.
Keeping the Stanford Town setup in mind, the adjustments we make are the following. Instead of characters with personalities, you have two groups. One group is patients, who randomly get sick with a disease and come down with symptoms. The other group is doctors and nurses, who work at a hospital.
Every day, some patients get sick, they show up at the hospital, and they basically go through the entire hospital cycle: you check in, you get triaged to a certain part of the hospital, you get examined, you get a diagnosis, you get a treatment.
And the next day, you either stay at home and get better because the treatment worked or you go back to the hospital because you’re feeling worse.
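To make that loop concrete, here’s a cartoon version of it in Python. The diseases, the coin-flip diagnosis and the recovery rule are all made up for illustration; this is not the paper’s actual simulation code:

```python
# A cartoon of the daily loop: patients randomly get sick, go through the
# hospital pipeline, and either recover or come back. The diseases, the
# random diagnosis and the Doctor stub are all illustrative.
import random

DISEASES = {
    "bronchitis": ["dry cough", "low-grade fever", "fatigue"],
    "asthma": ["wheezing", "shortness of breath"],
}

class Doctor:
    def __init__(self):
        self.case_log = []  # grows with every patient seen

    def diagnose(self, symptoms):
        # Stand-in for the real agent reasoning over its memories.
        return random.choice(list(DISEASES))

    def remember(self, symptoms, diagnosis, recovered):
        self.case_log.append((symptoms, diagnosis, recovered))

def simulate_day(num_patients, doctor):
    for _ in range(num_patients):
        # A patient randomly comes down with a disease and its symptoms.
        disease, symptoms = random.choice(list(DISEASES.items()))
        # check-in -> triage -> examination -> diagnosis -> treatment
        diagnosis = doctor.diagnose(symptoms)
        recovered = (diagnosis == disease)  # treatment works iff the diagnosis was right
        # Either way, the doctor logs the case as one more data point.
        doctor.remember(symptoms, diagnosis, recovered)

doctor = Doctor()
for day in range(30):          # a month of simulated days
    simulate_day(num_patients=5, doctor=doctor)
print(f"Cases seen so far: {len(doctor.case_log)}")
```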
There’s another important part of this setup, which is that the doctors are always learning, and they’re doing it in two ways. The primary way is experience, which means that as they’re interacting with patients, they’re adding to their bank of knowledge. Every patient they see, examine, diagnose and treat is another data point in their bank of knowledge.
The other thing they’re doing is reading stuff. The designers of the simulation downloaded a bunch of text related to the diseases they’d be covering and gave them reading material on their off hours.
Combined, the doctor agents have two sources of information when they’re diagnosing a patient. They have a memory of their prior experiences, and they have a memory of what they’ve read. And every time they see a patient, they look through those two memories and pull what they think are the most relevant cases to inform the case at hand.
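Here’s a rough sketch of what that lookup could look like, assuming each memory is just a text record and relevance is a crude word-overlap score. That scoring is my own simplification for illustration, not the paper’s actual retrieval method:

```python
# A rough sketch of the two-memory retrieval step a doctor agent might run
# before diagnosing. The word-overlap scoring and the record format are my
# own simplification, not the paper's method.
import re

def score(record: str, query: str) -> float:
    # Crude relevance score: fraction of query words that appear in the record.
    record_words = set(re.findall(r"[a-z]+", record.lower()))
    query_words = set(re.findall(r"[a-z]+", query.lower()))
    return len(record_words & query_words) / max(len(query_words), 1)

def most_relevant(records: list[str], query: str, k: int = 2) -> list[str]:
    return sorted(records, key=lambda r: score(r, query), reverse=True)[:k]

# Two memory banks: cases the agent has handled, and material it has read.
experience = [
    "Patient with fever, dry cough and fatigue; diagnosed bronchitis; rest and fluids.",
    "Patient with chest pain on exertion; referred to cardiology.",
]
reading = [
    "Note: a persistent dry cough with low-grade fever often points to bronchitis.",
    "Note: wheezing plus shortness of breath suggests asthma; consider spirometry.",
]

new_patient = "dry cough, low-grade fever, feeling tired for a week"
context = most_relevant(experience, new_patient) + most_relevant(reading, new_patient)

# The retrieved cases get stuffed into the diagnosis prompt alongside the
# new patient's symptoms before the model is asked for a diagnosis.
prompt = ("Relevant past cases and notes:\n" + "\n".join(context) +
          f"\n\nNew patient: {new_patient}\nWhat is the likely diagnosis?")
print(prompt)
```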
Sounds pretty similar to what we do as humans, right? Maybe a little simplistic, since they don’t factor in things like intuition or gut feel, which definitely come up in any decision-making we do, but still, not far off.
So, to recap: you have doctors and patients. Patients are getting sick, they go to the hospital. Doctors try to help, sometimes it works and sometimes it doesn’t, but they’re always learning. Right? Feels like an appropriate, scaled down version of what’s actually happening in real life, though you do take out all the insanities of US healthcare like the crazy billing and overworked healthcare professionals and all that.
They let this simulation run for 10,000 patient cases: 10,000 times the patients get sick and the doctors try to help.
The point is to see how good the doctors can get through this process.
And by the end, they have the doctor agents answer the questions from a standardized medical exam set that are relevant to the diseases included in the simulation. These doctor agents end up scoring 93.06%, which is the best score the researchers have seen.
It even beats Med-Gemini, which is an AI model that Google trained specifically to tackle medical questions. Med-Gemini scored 91.1%.
Now, the caveat again is that these doctor agents from the simulation only answered a subset of these medical questions, while Med-Gemini did the whole thing, so the two aren’t directly comparable.
Still, just like Stanford Town, a surprising conclusion to a simulation that had pretty simple starting conditions.
Your big takeaway on the AI agents hospital:
Turns out that playing The Sims works surprisingly well.
It used to be that for any specialized topic or area of knowledge, you’d assume you’d have to train a model that only tackled that specific area. For example, Google has Gemini as a base model trained for general purposes and reasoning. But they made Med-Gemini specifically for medical questions. Making that model is like taking a base model like Gemini and slapping on an extra power pack of specialized training. This is called fine-tuning: the base model did most of the work, and now you just give it an extra little nudge for a particular area.
This paper shows that we can create agents that plan, reason and access forms of memory, and if you let them accumulate enough data points, they can match or even exceed those specialized models. So you’re comparing three things: a general model, a specialized model, and a general model with specialized memory.
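In code-shaped terms, the three options look roughly like this. Both helpers here are placeholders I made up for the sake of illustration: ask_model() stands in for a real API call and retrieve() for the kind of memory lookup sketched earlier:

```python
# Three ways to answer a medical question, matching the comparison above.
# Everything here is schematic: the model names and helpers are placeholders.

def ask_model(model: str, question: str) -> str:
    # Placeholder for a real model API call.
    return f"[{model}'s answer to: {question}]"

def retrieve(memory: list[str], question: str, k: int = 3) -> list[str]:
    # Placeholder for the relevance-ranked lookup sketched earlier.
    return memory[:k]

question = "A patient presents with a persistent dry cough and low-grade fever..."

# 1. A general model, asked directly.
general = ask_model("general-model", question)

# 2. A specialized model, fine-tuned on medical data ahead of time.
specialized = ask_model("medical-fine-tuned-model", question)

# 3. A general model plus specialized memory pulled in at question time.
memory = ["case log entries...", "textbook excerpts...", "past diagnoses..."]
context = "\n".join(retrieve(memory, question))
general_with_memory = ask_model("general-model", f"{context}\n\n{question}")

print(general, specialized, general_with_memory, sep="\n")
```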
We’ll have to wait for further tests to see if the general model with specialized memory really holds up. For example, this test would have to expand to letting the simulation doctors deal with all types of diseases and answer the entire set of medical questions.
For now, let’s just enjoy another instance of a cool AI simulation.
Some quick hitters to leave you with:
After some further digging, people have figured out that the mysterious “gpt2-chatbot” is indeed from OpenAI. The chatbot is again available at lmsys.org, but it’s still unclear where exactly it’ll be used.
ChatGPT Search is delayed. It was supposed to be launched today, May 9th, but the rumors are that it got moved to next Monday, May 13th. The critical date is May 14th, which is the Google I/O conference. Their playbook is to upstage Google, so May 13th is the last day to make their move.
Google DeepMind announced AlphaFold 3, which predicts the structure and interactions of proteins and other biological molecules. AlphaFold has already played, and AlphaFold 3 will play, a very large part in accelerating drug discovery, crop science and new materials science. It’s a big deal for science that makes the world a better place.
This is Pete wrapping up The Neuron for May 9th. I’ll see you in a couple days.