In a direct challenge to the “autonomous agent” trend dominating Silicon Valley, Mira Murati, the former CTO of OpenAI and founder of the newly minted Thinking Machines Lab, introduced a revolutionary class of AI on May 12, 2026. Dubbed “Interaction Models,” these systems are designed to collapse the awkward delay between human input and AI response. By processing audio and video streams continuously rather than waiting for a user to finish speaking, Thinking Machines is attempting to move AI from a “transactional chatbot” to a “fluid collaborator” that can listen, see, and interrupt in real time.
The End of the “Awkward Pause”
For years, the primary friction in human-AI interaction has been the “turn-taking” bottleneck. Conventional models operate like walkie-talkies: you speak, the model waits, it processes, and then it responds. This 1-to-2-second lag often makes natural conversation impossible.
Thinking Machines’ breakthrough model, TML-Interaction-Small, solves this through a “Full-Duplex” architecture. Instead of treating a conversation as a series of distinct turns, the model breaks communication into tiny 200-millisecond “micro-turns.” This allows the AI to perceive and react to cues, such as a user’s facial expression or a change in tone, while it is still in the middle of generating its own response.
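To make the micro-turn idea concrete, here is a minimal Python sketch that slices an audio stream into 200-millisecond chunks. Only the 200 ms figure comes from the announcement; the function name `micro_turns`, the 16 kHz sample rate, and the list-of-samples representation are assumptions made for illustration, not details of how TML-Interaction-Small actually works.

```python
from typing import Iterator, List

MICRO_TURN_MS = 200  # micro-turn length stated in the article


def micro_turns(samples: List[int], sample_rate_hz: int,
                turn_ms: int = MICRO_TURN_MS) -> Iterator[List[int]]:
    """Yield consecutive fixed-length slices of an audio sample stream.

    Hypothetical helper: the final slice may be shorter than the rest.
    """
    samples_per_turn = sample_rate_hz * turn_ms // 1000
    for start in range(0, len(samples), samples_per_turn):
        yield samples[start:start + samples_per_turn]


# One second of silent audio; the 16 kHz sample rate is an assumption.
one_second = [0] * 16_000
chunks = list(micro_turns(one_second, sample_rate_hz=16_000))
print(len(chunks), len(chunks[0]))  # 5 3200
```

A system built this way would get five chances per second to notice a new cue, rather than one chance at the end of each spoken turn.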
Encoder-Free Early Fusion: The Secret Sauce
To achieve its industry-leading latency of under 0.4 seconds, Thinking Machines has abandoned the traditional “bolted-on” approach to multimodality. Most AI systems use separate encoders to “translate” audio and video into text before the model can understand them.
TML-Interaction-Small uses a proprietary technique called “Encoder-Free Early Fusion.” This allows raw audio and visual signals to be processed directly through the model’s core transformer layers. On the “FD-bench” (a benchmark for interaction quality), the architecture’s sub-0.4-second latency beat Google’s Gemini 3.1 Flash (0.57s) and OpenAI’s GPT-Realtime 2.0 (1.18s), making it the fastest multimodal conversational model currently in existence.
Dual-Brain Processing: Fast Talk, Deep Thought
One of the most innovative features of the new system is its Asynchronous Background Model. Thinking Machines separates “interaction” from “reasoning”:
- The Interaction Model (276 billion parameters): This model handles the “front-end” of the conversation, managing presence, tone, and immediate reactions.
- The Background Model: While the interaction model keeps the conversation flowing, a deeper model works in parallel to search the web, run complex code, or retrieve data.
This means a user can keep talking to the AI while it simultaneously performs a complex task in the background, seamlessly integrating the results into the conversation without a “Processing…” loading screen.
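The split between a fast front end and a slower background worker can be sketched with Python's `asyncio`. This is an illustration of the general pattern only, not Thinking Machines' implementation: `background_task`, `converse`, the sleep durations, and the sample phrases are all hypothetical stand-ins.

```python
import asyncio
from typing import List


async def background_task(query: str) -> str:
    """Stand-in for the slower background model (search, code, retrieval)."""
    await asyncio.sleep(0.05)  # simulated slow work
    return f"result for {query!r}"


async def converse(user_turns: List[str]) -> List[str]:
    """Keep replying to the user while a background job runs in parallel."""
    transcript: List[str] = []
    job = asyncio.create_task(background_task(user_turns[0]))
    delivered = False
    for turn in user_turns:
        transcript.append(f"ack: {turn}")    # immediate front-end reply
        await asyncio.sleep(0.02)            # simulated gap between turns
        if job.done() and not delivered:
            transcript.append(job.result())  # weave the result into the talk
            delivered = True
    if not delivered:
        transcript.append(await job)         # job outlasted the conversation
    return transcript


log = asyncio.run(converse(["find flights", "also hotels", "and a car", "thanks"]))
print(log)
```

The front end never blocks on the background job: it acknowledges every turn immediately and inserts the result at whichever point in the conversation the job happens to finish.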
Visual Awareness and Time Sensitivity
Because the model “sees” continuously, it possesses a built-in sense of temporal and visual context. In one demonstration, the model was able to:
- Count Repetitions: It watched a video of a user exercising and counted “reps” in real time without being explicitly told to start.
- Monitor Safety: It alerted a lab researcher to a safety violation (not wearing gloves) the moment their hands entered the frame.
- Live Translation: It performed “simultaneous speech” translation, where both the human and the AI spoke at the same time, much like a professional human interpreter.
The Strategic Wedge: Collaboration vs. Autonomy
Murati’s launch of Interaction Models is a clear philosophical departure from the “Agentic” path taken by OpenAI and Anthropic. While those companies focus on “autonomous agents” that can perform tasks solo for hours, Thinking Machines is betting that the most valuable AI will be the one that stays “in the loop” with humans.
“The way we work with AI matters as much as how smart it is,” Murati stated during the unveiling. By focusing on messy, visual, and spoken collaboration, Thinking Machines is carving out a niche for AI in high-stakes environments like surgical suites, manufacturing floors, and creative studios, where every millisecond of human-AI synchronicity counts.
As of May 12, 2026, the Interaction Models are available only to a limited group of research partners, with a broader public rollout expected later this year. While the system currently struggles with “context bloat” during extremely long sessions (where continuous video data fills up the model’s memory), the technical achievement is undeniable.
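A rough back-of-the-envelope sketch shows why long sessions are a problem for a model that ingests input continuously. Only the 200 ms micro-turn length comes from the announcement. The context budget and the tokens-per-micro-turn rate below are hypothetical numbers chosen purely for illustration, and the sketch assumes a simple append-only context, which may not match the real system.

```python
MICRO_TURN_MS = 200  # micro-turn length stated in the article


def micro_turns_in(session_seconds: int) -> int:
    """Number of whole micro-turns in a session of the given length."""
    return session_seconds * 1000 // MICRO_TURN_MS


def seconds_until_full(context_tokens: int, tokens_per_micro_turn: int) -> float:
    """Time before an append-only context of the given size fills up.

    Both arguments are hypothetical; no such figures have been published.
    """
    return context_tokens / tokens_per_micro_turn * MICRO_TURN_MS / 1000


hour_of_turns = micro_turns_in(3600)                # micro-turns in one hour
fill_time = seconds_until_full(1_000_000, 100)      # made-up budget and rate
print(hour_of_turns, fill_time)
```

Under these made-up numbers, an hour-long session produces 18,000 micro-turns, and the context would fill in well under an hour, which illustrates why continuous video puts pressure on the model's memory.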
By turning the “digital arteries” of audio and video into a single, streaming flow of intelligence, Mira Murati has signaled that the future of AI isn’t just about thinking; it’s about staying present. The race for AGI has moved beyond who has the biggest brain. Now, it’s about who has the fastest reflexes.




