Most AI still collaborates like a polite email thread.
You speak. It waits. You stop. It thinks. It responds. Then everything freezes while you decide what to say next. That rhythm has gotten us surprisingly far, but it also explains why even very smart AI can feel weirdly bad at working with you in the moment.
Thinking Machines Lab, the AI startup founded by former OpenAI CTO Mira Murati, wants to change that. In a new research post, the company introduced what it calls interaction models: AI systems trained to handle continuous, real-time interaction across audio, video, and text instead of relying on external software glue to fake conversation.
The big idea is simple: if AI is going to become a true collaborator, it can’t only get smarter. It also has to get better at timing.
That means knowing when to listen, when to interrupt, when to stay quiet, when to react to something on screen, and when to hand off deeper work to a background model while keeping the conversation alive. You know, the stuff humans do automatically in meetings, pair programming sessions, tutoring, coaching, and mildly chaotic family tech support calls.
Thinking Machines’ thesis is that today’s AI interface is the bottleneck. Models are improving quickly, but the interaction layer is still mostly turn-based. The user gives a complete input, the model gives a complete output, and anything more fluid gets patched together with speech detection, tool orchestration, memory systems, and other harnesses.
Interaction models try to make that behavior native to the model itself.
What Thinking Machines Announced
The announcement centers on TML-Interaction-Small, a research-preview model designed for real-time collaboration. It can continuously process audio, video, and text, while also speaking, responding, and using tools without waiting for neat conversational turns.
The company describes capabilities like:
- Dialog management without a separate turn-taking system
- Verbal and visual interjections
- Simultaneous speech, such as live translation
- Awareness of elapsed time
- Tool calls, browsing, and generative UI while the conversation continues
That last part matters. The model is not just trying to be a better voice assistant. It is trying to create a different shape of AI product: one where the front-facing model remains present with the user while another background model handles slower reasoning, searches, tool use, and longer tasks.
Think of it as a receptionist and analyst in one system. The interaction model keeps eye contact. The background model does the spreadsheet work in the back room. Ideally, neither drops the thread.
If it works, this could make AI feel less like “submit prompt, await answer” and more like working with someone who can see what you see, hear what you mean, notice what changed, and keep helping while the situation evolves.
My immediate curiosity was what happens when you put the model in a pair of camera-equipped Meta Ray-Bans, where it could literally see what you see, as you see it, tagging along as a companion of sorts.
Why This Matters
The current AI market is obsessed with agents that go off and do things autonomously. That is useful, especially for coding, research, workflow automation, and long-running tasks. But there is a quieter problem: lots of real work is not fully specifiable upfront.
A designer does not always know exactly what they want until they see a version. A developer catches bugs while explaining the code. A doctor, teacher, coach, analyst, or customer support rep often needs a back-and-forth process where the human stays involved.
Today’s AI tools often push humans out of the loop not because the human is unnecessary, but because the interface is too narrow.
That is the useful framing in Thinking Machines’ post. The company is not saying autonomy is bad. It is saying collaboration has been under-optimized. The future may not be “AI does the whole job while you disappear.” It may be “AI stays with you while the work changes.”
This lines up with the broader shift we’ve been tracking at The Neuron: voice and multimodal AI are moving from demo magic into actual product infrastructure. OpenAI’s Realtime API, for example, made it easier to build human-sounding voice apps. But Thinking Machines is pushing a deeper claim: real-time behavior should not just be an API wrapper. It should be trained into the model.
This also feels like part of a larger shift: more of the recent innovations that may shape the AI we use every day are coming out of smaller labs. Companies are working on memory, context, and multimodality, and all of it (like Thinking Machines' work) will likely be used to improve the AI in our lives over the coming months and years.
The interaction-model post also comes about two months after Thinking Machines announced a multi-year partnership with NVIDIA to deploy at least one gigawatt of next-generation Vera Rubin systems, with deployment targeted for early 2027. NVIDIA is also making what both companies describe as a significant investment.
How It Works
Traditional chat models flatten interaction into a sequence: user input, model output, user input, model output. That structure works fine for text. It gets awkward in real life, where people talk over each other, gesture, pause, hesitate, correct themselves, and react to visual changes.
Thinking Machines’ approach uses “micro-turns.” Instead of waiting for a complete human turn, the model processes and generates in 200-millisecond chunks. Audio, video, and text streams are interleaved in time, so silence, overlap, interruption, and visual context remain part of the model’s input.
That lets the model do things a normal turn-based system struggles with: count reps while watching a workout, correct pronunciation as someone speaks, translate while listening, or notice a visual cue before the user explicitly calls it out.
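To make the micro-turn idea concrete, here is a minimal sketch in Python. It is an illustration, not Thinking Machines' implementation: the `Event` type, the stream labels, and the `interleave` function are all hypothetical names. Only the 200 ms chunk length comes from the post. The sketch groups timestamped events from every stream into fixed-length chunks and keeps empty chunks, so silence stays visible in the sequence.

```python
from collections import defaultdict
from dataclasses import dataclass

CHUNK_MS = 200  # micro-turn length quoted in the post


@dataclass(frozen=True)
class Event:
    t_ms: int     # milliseconds since the session started
    stream: str   # "audio", "video", or "text" (illustrative labels)
    payload: str  # stand-in for real features such as audio frames or image patches


def interleave(events, chunk_ms=CHUNK_MS):
    """Group events from all streams into consecutive fixed-length micro-turns.

    Chunks with no events are kept as empty lists, so a pause in the
    conversation is still represented in the sequence.
    """
    if not events:
        return []
    buckets = defaultdict(list)
    for ev in sorted(events, key=lambda e: e.t_ms):
        buckets[ev.t_ms // chunk_ms].append(ev)
    last = max(buckets)
    return [buckets.get(i, []) for i in range(last + 1)]


# Hypothetical session: the user talks, pauses, then talks again while pasting text.
events = [
    Event(30, "audio", "so the bug is"),
    Event(120, "video", "frame 3"),
    Event(450, "audio", "in this loop"),
    Event(460, "text", "pasted stack trace"),
]
chunks = interleave(events)
for i, chunk in enumerate(chunks):
    print(i * CHUNK_MS, [(e.stream, e.payload) for e in chunk])
```

The middle chunk (200 to 399 ms) comes out empty. In a turn-based system that pause would simply vanish; here it is part of what the model sees.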
The architecture has two major pieces:
The interaction model handles the real-time loop. It sees and hears what is happening, responds quickly, and maintains the live collaboration.
The background model handles slower work. When a task requires deeper reasoning, tools, browsing, or agentic workflows, the interaction model delegates while continuing to talk with the user.
This is a smart split. Real-time conversation has brutal latency demands. Deep reasoning often needs more time. Trying to make one model do both perfectly at once is hard. Thinking Machines’ answer is to coordinate two systems sharing context.
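The delegation pattern can be sketched with Python's `asyncio`. This is a toy under loud assumptions: `background_task` and `interaction_loop` are hypothetical stand-ins, the sleeps simulate model latency, and the post does not describe how the two real systems share context. The point is only the shape: the live loop hands off slow work and keeps responding until the result comes back.

```python
import asyncio


async def background_task(query: str) -> str:
    """Stand-in for the slower background model (reasoning, tools, browsing)."""
    await asyncio.sleep(0.05)  # simulated slow work
    return f"result for {query!r}"


async def interaction_loop(query: str, tick_s: float = 0.01) -> list[str]:
    """Keep producing short responses each tick while delegated work runs."""
    transcript = []
    job = asyncio.create_task(background_task(query))  # delegate without blocking
    while not job.done():
        transcript.append("still with you...")  # the live loop keeps talking
        await asyncio.sleep(tick_s)
    transcript.append(job.result())  # fold the finished result into the conversation
    return transcript


transcript = asyncio.run(interaction_loop("quarterly revenue"))
print(transcript)
```

The foreground loop never awaits the background job directly, which is what keeps the conversation alive while the slow work finishes.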
Under the hood, the company says it trained the interaction model from scratch with early fusion across modalities. Audio is represented with dMel features, images are split into patches, and components are co-trained with the transformer rather than relying on large separate encoders and decoders. It also built streaming-session infrastructure so each 200ms chunk can be appended to a persistent GPU sequence instead of constantly rebuilding model state.
Translation: they are not just demoing a fancy voice UI. They are reworking the model and serving stack around continuous interaction.
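A rough way to see why appending matters: count how many tokens get processed under each strategy. The sketch below is a toy cost model, not the company's serving stack. `StreamingSession`, `rebuild_cost`, and the 10-tokens-per-chunk figure are made up for illustration, and real systems have costs (attention over the cached history, memory pressure) that this ignores.

```python
class StreamingSession:
    """Toy persistent session: each chunk is appended once and counted once."""

    def __init__(self):
        self.sequence = []       # stands in for state kept resident on the GPU
        self.tokens_encoded = 0

    def append_chunk(self, chunk):
        self.sequence.extend(chunk)
        self.tokens_encoded += len(chunk)  # only the new chunk is processed


def rebuild_cost(chunks):
    """Tokens processed if the full history is re-encoded on every chunk."""
    total, history = 0, 0
    for chunk in chunks:
        history += len(chunk)
        total += history
    return total


# 50 micro-turns of 200 ms is 10 seconds of interaction.
# 10 tokens per chunk is an invented number.
chunks = [["tok"] * 10 for _ in range(50)]
session = StreamingSession()
for chunk in chunks:
    session.append_chunk(chunk)

print(session.tokens_encoded)  # 500
print(rebuild_cost(chunks))    # 12750
```

Even in this simplified model, ten seconds of conversation costs about 25 times more tokens when state is rebuilt on each chunk, and the gap grows with session length.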
The Benchmarks
Benchmarks for this kind of system are still immature, which Thinking Machines openly acknowledges. That caveat is important. Measuring whether a model is “good at interaction” is much fuzzier than measuring whether it solved a math problem.
Still, the numbers are interesting.
On FD-bench v1, TML-Interaction-Small reports 0.40 seconds of turn-taking latency, compared with 1.18 seconds for GPT-realtime-2.0 in minimal mode, 0.59 seconds for GPT-realtime-1.5, and 0.57 seconds for Gemini-3.1-flash-live-preview in minimal mode.
On FD-bench v1.5 average quality, Thinking Machines reports 77.8, compared with 46.8 for GPT-realtime-2.0 minimal, 48.3 for GPT-realtime-1.5, and 54.3 for Gemini-3.1-flash-live-preview minimal.
On FD-bench v3 with audio and tools, it reports 82.8% response quality and 68.0% Pass@1 with the background agent enabled.
The model is not dominant everywhere. On QIVD video-audio accuracy, it reports 54.0, behind several baselines. On Audio MultiChallenge, it reports 43.4, below GPT-realtime-2.0 in xhigh mode at 48.5. On text IFEval, it is basically tied with GPT-realtime-2.0 minimal but below higher-reasoning systems.
That is the important nuance: Thinking Machines is not claiming this small model beats every frontier system on intelligence. It is claiming a new tradeoff point: strong enough intelligence plus much better real-time interaction.
The more novel results are on internal proactive benchmarks. On TimeSpeak, designed to test whether a model can initiate speech at specific times, TML-Interaction-Small scores 64.7 versus 4.3 for GPT-realtime-2.0 minimal. On CueSpeak, which tests whether it can speak at the right moment based on verbal cues, it scores 81.7 versus 2.9. On visual counting with RepCount-A, it scores 35.4 versus 1.3. On Charades temporal action localization, it scores 32.4 mIoU versus 0.
Those are the “if true, pay attention” numbers. They point to AI that can act on timing and context, not just content.
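For readers unfamiliar with the mIoU metric, here is the standard definition of temporal intersection-over-union as a short Python sketch. The article's source does not spell out the exact scoring protocol used for Charades, so treat this as the textbook version; the function names and the example intervals are hypothetical.

```python
def temporal_iou(pred, truth):
    """IoU of two (start, end) time intervals, in seconds."""
    inter = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(pairs):
    """Average IoU over (predicted, ground-truth) interval pairs."""
    return sum(temporal_iou(p, t) for p, t in pairs) / len(pairs)


# Made-up predictions: one partial overlap, one exact match, one complete miss.
pairs = [
    ((2.0, 6.0), (4.0, 8.0)),  # overlap 2 s, union 6 s
    ((0.0, 5.0), (0.0, 5.0)),  # exact match
    ((0.0, 1.0), (2.0, 3.0)),  # no overlap
]
score = mean_iou(pairs)
print(round(100 * score, 1))  # 44.4
```

A score of 0 on this metric means the predicted time spans never overlapped the true ones at all, which is why the gap in that last benchmark stands out.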
The Mira Murati Context
This announcement also matters because of who is making it.
Mira Murati is the founder and CEO of Thinking Machines Lab. Before starting the company, she spent six years at OpenAI, joining in 2018 as VP of applied AI and partnerships before becoming CTO in 2022. She helped lead some of OpenAI’s defining product work, including ChatGPT, DALL-E, and Codex, and briefly served as interim CEO during the 2023 leadership crisis.
That history matters because Thinking Machines is not a random new lab with a shiny demo. It is one of the most closely watched OpenAI diaspora companies, alongside other frontier-AI breakouts. The company reportedly raised a massive $2 billion seed round, with CNBC reporting that Murati said the startup would announce its first product in the following months.
The interaction-model post is the clearest signal yet of what Thinking Machines wants to be: not just another lab chasing a bigger chatbot, but a company betting that the interface between humans and AI is still weirdly primitive.
That is a very OpenAI-adjacent insight. ChatGPT’s breakthrough was not just model capability. It was packaging. It turned language models into something ordinary people could use. Thinking Machines seems to be asking the next packaging question: what happens when the chat box itself becomes the constraint?
What This Makes Possible
If interaction models work, the product surface gets much bigger.
In education, an AI tutor could watch a student solve a problem, interrupt only when the student goes off track, and adapt based on hesitation or confusion.
In coding, an AI pair programmer could notice a bug as it appears on screen, answer questions while tests run, and keep investigating in the background.
In healthcare and accessibility, AI could follow live context, support hands-free workflows, describe visual changes, or help people navigate tasks without requiring perfect prompts.
In robotics and embodied AI, real-time perception and response are not optional. A system that handles multimodal timing natively could become a useful bridge between software agents and physical-world agents.
In enterprise software, the implications are less flashy but maybe more valuable. Imagine AI copilots that can sit inside live workflows, watch dashboards change, take verbal corrections, search company systems in the background, and surface answers without turning every action into a prompt-writing exercise.
That is the real prize: less prompt engineering, more shared context.
The Caveats
The model is still a research preview. A limited preview is expected in the coming months, with a wider release later this year. Until developers and users can test it, the demos and benchmarks should be treated as promising, not definitive.
There are also hard deployment problems. Continuous video and audio chew through context and bandwidth. Low-latency serving is expensive. Real-time systems create new safety challenges because mistakes can happen mid-conversation, mid-task, or mid-visual interpretation. And the current model, while called “Small,” is a 276B-parameter mixture-of-experts model with 12B active parameters. That is not exactly pocket-sized.
There is also a product-design risk. A model that can interrupt, watch, and proactively speak can be magical. It can also be deeply annoying. The difference between “helpful collaborator” and “overcaffeinated Clippy with a webcam” will come down to taste, control, and trust.
The Bigger Signal
The AI industry has spent the last few years scaling intelligence. Now the competition is shifting toward interaction and interface.
That means intelligence has to show up in time, in context, and in the user’s actual workflow. A brilliant model that responds two beats too late can feel less useful than a smaller model that knows exactly when to jump in.
Thinking Machines’ announcement is important because it names the next frontier clearly: AI collaboration is not only about what the model knows. It is about whether the model can participate.
If Murati and team are right, the next great AI interface may not look like a bigger chat window. It may look like a system that can listen, watch, reason, wait, interrupt, and act without making the human constantly translate reality into prompts.
That would be a much bigger shift than faster voice chat.
It would mean AI is finally learning how to work in the room.