
AI voice chat sucks. This startup thinks it’s cracked it

May 13, 2026  Twila Rosenbaum

Voice chat with artificial intelligence has long felt stilted and unnatural, forcing users into a rigid turn-taking structure reminiscent of CB radio conversations. Current systems like ChatGPT or Gemini must wait for a user to finish speaking before generating a response, and cannot perceive anything else happening in the environment—including the passage of time. This limitation stems from the underlying single-threaded architecture of standard language models, which can neither think while listening nor react while speaking.
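
To make that bottleneck concrete, here is a minimal sketch of the blocking loop such systems effectively run. Every function name is a hypothetical stub for illustration, not any vendor's actual API:

```python
# Minimal sketch of the single-threaded, half-duplex loop behind
# conventional AI voice chat. All functions are hypothetical stubs.

def record_until_silence() -> bytes: ...   # blocks until the user stops talking
def transcribe(audio: bytes) -> str: ...
def generate_reply(text: str) -> str: ...
def speak(reply: str) -> None: ...

def voice_chat_loop() -> None:
    while True:
        audio = record_until_silence()  # deaf to everything while waiting
        text = transcribe(audio)
        reply = generate_reply(text)    # the awkward pause lives here
        speak(reply)                    # cannot hear the user talk over it
        # Only now does control return to listening: rigid turn-taking,
        # with no sense of elapsed time or visual context in between.
```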

Now Thinking Machines, a startup founded by former OpenAI executive Mira Murati, claims to have cracked the code for truly interactive AI communication. Its new 'interaction models' employ a multi-stream, micro-turn configuration that enables real-time, back-and-forth dialogue. In demonstration reels, the AI spots products held up to the camera, keeps a running tally of specific words as the user continues talking, and even interrupts to correct mispronunciations or factual errors, all while maintaining a natural conversational flow.

The technology behind natural AI conversation

Thinking Machines' approach pairs two AI models working in concert. The first is an 'interaction model' that stays continuously present with the user, processing audio and visual inputs in rapid 200-millisecond chunks. This model handles the quick give-and-take of conversation, including interruptions and contextual cues. The second, a 'background model,' tackles heavier computational tasks—such as complex reasoning or knowledge retrieval—and hands off results to the interaction model when ready.

This dual-model design overcomes the fundamental bottleneck of standard AI voice interfaces. In traditional systems, a single model must sequentially listen, think, and respond, creating latency and awkward pauses. The Thinking Machines architecture allows parallel processing: while the background model works on a deeper query, the interaction model can still react to new inputs, maintain engagement, and even interject appropriately.
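
A rough sketch of how that split might be orchestrated, assuming hypothetical interaction-model, background-model, and I/O interfaces (Thinking Machines has not published its implementation): the fast loop keeps consuming 200-millisecond chunks while any heavy query runs as a separate task.

```python
import asyncio

# Hedged sketch of the dual-model split described above. The model and I/O
# interfaces are assumptions; Thinking Machines has not released code.

CHUNK_MS = 200  # the slice size reported for the interaction model

async def conversation_loop(interaction_model, background_model, io):
    pending = None  # the in-flight deep-reasoning task, if any
    while True:
        chunk = await io.next_chunk(ms=CHUNK_MS)      # one audio+video slice
        if pending and pending.done():
            # Background result is ready: fold it into the fast model's context.
            interaction_model.ingest(pending.result())
            pending = None
        # The interaction model reacts every slice: speak, interject, or wait.
        action = interaction_model.step(chunk)
        if action.needs_deep_reasoning and pending is None:
            # Offload the heavy query without stalling the conversation.
            pending = asyncio.create_task(background_model.answer(action.query))
        if action.utterance:
            io.queue_playback(action.utterance)       # non-blocking playback
```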

The company demonstrated several impressive scenarios. In one, a user holding two water bottles asks the AI to identify them, and it names the brands and colors in real time. In another, during a free-flowing conversation, the AI waits silently as the user takes a sip of coffee, showing restraint instead of jumping in. In a more active example, the AI interrupts to correct a user who mispronounces 'açaí' and then challenges the user's claim that açaí bowls originated in Argentina, all while the user is still speaking.

Founder Mira Murati and the Thinking Machines vision

Mira Murati, who previously served as Chief Technology Officer at OpenAI and oversaw the development of ChatGPT, left that company in late 2024 and launched Thinking Machines with a mission to create AI that interacts more naturally with humans. Her background in building large language models and navigating the ethical challenges of AI deployment uniquely positions her to tackle the limitations of current voice interfaces. The startup has attracted talent from leading AI labs, including researchers specializing in multimodal processing and real-time systems.

The 'interaction model' paradigm represents a shift from the typical text-first approach. By treating conversation as a continuous stream rather than discrete turns, Thinking Machines aims to make AI feel more like a human counterpart—capable of following non-verbal cues, remembering context across interruptions, and engaging in simultaneous listening and speaking. This aligns with Murati's earlier emphasis on multimodal AI that integrates sight and sound.

Current limitations and future potential

Despite the breakthroughs, Thinking Machines acknowledges that the technology remains in a research phase. The current interaction model is relatively small, as larger models are too slow for real-time processing. The system also requires reliable connectivity and may struggle with very long conversations. However, the company is confident that scaling and optimization will address these issues over time.

The implications for voice assistants, customer service bots, and accessibility tools are significant. Today's AI voice modes often frustrate users because they cannot handle interruptions, mishear words when context is missing, and fail to notice visual cues. A system that understands conversational flow could transform tasks like ordering food, asking for directions, or simply having a casual discussion. For example, a user could interrupt mid-sentence to correct a recipe the AI is reciting, and the AI would adjust instantly.

Thinking Machines is not the only company working on this challenge. Other startups and research labs are exploring similar 'full-duplex' architectures, but Thinking Machines' demonstration of visual-cue integration and interruption handling suggests the company may have a lead. The key differentiator is micro-turn handling: each 200-millisecond slice is treated as a mini-interaction, letting the AI respond almost instantly while background processes carry out deeper analysis.
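
One way to picture that per-slice decision, again as an assumption-laden sketch rather than the company's published design: every 200-millisecond micro-turn ends with the model choosing one conversational move.

```python
from enum import Enum, auto

# Illustrative only: the move set and policy interface are guesses at what
# a micro-turn decision might look like, not Thinking Machines' design.

class Move(Enum):
    STAY_SILENT = auto()  # e.g. the user pauses to sip coffee
    BACKCHANNEL = auto()  # a brief "mm-hm" to signal attention
    INTERJECT = auto()    # cut in, e.g. to fix a mispronunciation
    SPEAK = auto()        # take or keep the conversational floor
    DEFER = auto()        # hand the request to the background model

def micro_turn(model, state, chunk):
    """Advance the conversation by one 200 ms slice."""
    state = model.update(state, chunk)   # fold in new audio/visual evidence
    move = model.choose_move(state)      # near-instant, slice-level decision
    return move, state
```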

As AI voice chat evolves, the barriers that have kept it in the realm of gimmickry are slowly falling. The clumsy 'over and out' exchanges of current systems may soon be a relic, replaced by fluid, human-like conversations that can flow as naturally as any dialogue between people. Thinking Machines' research preview offers a tantalizing glimpse into that future, where AI can listen, think, speak, and see—all at once.


Source: PCWorld News

