AIVO: The Observation That Predicted How We'd Talk to AI

A few years ago, we were building a conversational AI interface for a client, and we ran into a problem that had nothing to do with the technology.

The technology worked. The voice recognition was accurate. The responses were useful. But users weren't engaging with it the way we expected. They would open the interface, hesitate, and then type instead of speak — even though speaking was faster and the system was clearly designed for voice.

We spent a lot of time trying to figure out why. The answer, when we found it, was obvious in retrospect: people weren't yet comfortable talking to their phones in public. Or at their desks. Or anywhere that another human might observe them doing it. Talking to a machine felt strange. It felt performative. It felt like something that would require explanation.

This was a real constraint. And it was a constraint that had nothing to do with the quality of the AI.

What We Noticed

What we observed — and what became the foundation of what we now call AIVO, or Audio Input / Visual Output — was a behavioral pattern that was already deeply embedded in how people used their devices.

People were already talking into their phones. They were leaving voice notes. They were using Siri for timers and reminders. They were dictating text messages when their hands were full. The behavior existed. What was missing was the expectation of a visual response.

The pattern we identified was this: people are comfortable speaking into a device when they expect the device to respond visually. The audio input feels natural because it mirrors how we communicate with other humans. The visual output feels natural because it mirrors how we consume information — on a screen, at our own pace, with the ability to review and re-read.

The combination of audio input and visual output is not a novel interface paradigm. It is, in fact, the most natural interface paradigm available to us. It maps to how humans already communicate. The reason it took so long to become mainstream is not that the technology wasn't ready. It is that the social permission wasn't there yet.

The Social Permission Problem

This is the part of the AIVO story that I think is most instructive for anyone building AI-native products today.

Technology adoption is not purely a function of capability. It is a function of capability plus social permission. Social permission is the invisible threshold at which a behavior stops feeling strange and starts feeling normal. It is the moment when talking to your phone in public stops being something you do quietly, apologetically, and starts being something everyone around you is also doing.

We identified the AIVO pattern several years before that social permission threshold was crossed. At the time, there were serious, intelligent people arguing that voice interfaces would never achieve mainstream adoption because people simply weren't willing to talk to their devices in front of other people. They weren't wrong about the constraint. They were wrong about its permanence.

What changed was not the technology. What changed was that AirPods normalized talking to invisible things in public. That Siri and Google Assistant normalized asking questions out loud. That the pandemic normalized talking to screens. Each of these shifts moved the social permission threshold, and each of them validated the behavioral pattern we had identified years earlier.

What AIVO Means for Product Design

The practical implication of the AIVO observation is straightforward: when designing AI-native interfaces, the most natural interaction model for most use cases is voice in, visual out.

This is not a universal prescription. There are contexts where text input is more appropriate — when discretion matters, when precision matters, when the user is in an environment where speaking is impractical. But as a default design assumption, audio input / visual output aligns with the most deeply embedded behavioral patterns humans bring to their devices.

The interfaces that are winning right now — ChatGPT's voice mode, Apple Intelligence, the emerging class of AI wearables — are all converging on this pattern. The screen doesn't disappear. The keyboard doesn't disappear. What changes is that voice becomes a first-class input modality rather than a fallback.
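For teams prototyping the pattern in a browser, here is a minimal sketch of what voice-in, visual-out can look like in practice. It uses the Web Speech API for audio input (supported in Chromium-based browsers, where it is exposed with a webkit prefix) and ordinary DOM rendering for visual output. The /api/assistant endpoint, the askAssistant helper, and the element IDs are placeholders, not a prescribed implementation.

```typescript
// Minimal AIVO sketch: audio input via the Web Speech API, visual output via the DOM.
// Assumes a Chromium-based browser, where the API is exposed as webkitSpeechRecognition.
const SpeechRecognitionCtor =
  (window as any).SpeechRecognition ?? (window as any).webkitSpeechRecognition;

const recognition = new SpeechRecognitionCtor();
recognition.lang = "en-US";
recognition.interimResults = false;

// Hypothetical helper standing in for whatever AI backend answers the query.
async function askAssistant(query: string): Promise<string> {
  const res = await fetch("/api/assistant", {  // placeholder endpoint
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ query }),
  });
  const data = await res.json();
  return data.answer;
}

recognition.onresult = async (event: any) => {
  // Audio input: take the transcript of what the user said.
  const transcript = event.results[0][0].transcript;
  const answer = await askAssistant(transcript);

  // Visual output: render the response on screen so the user can read,
  // re-read, and review it at their own pace.
  const output = document.getElementById("assistant-output");
  if (output) output.textContent = answer;
};

// Voice as a first-class input: a single tap on a mic button starts listening.
document.getElementById("mic-button")?.addEventListener("click", () => recognition.start());
```

The point of the sketch is the shape of the loop, not the specific APIs: speech comes in once, and the response stays on screen where the user controls the pace of consumption.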

For organizations building AI-native products today, the AIVO framework offers a useful design heuristic: before defaulting to a text interface, ask whether the use case would be better served by a voice-first interaction model. The answer is not always yes. But asking the question will surface use cases where the answer is obviously yes — and where the text-first assumption is simply a habit inherited from a pre-voice era.

The Broader Lesson

The reason I keep returning to the AIVO story is not because the observation was particularly clever. It is because the observation was early — and being early, in the context of technology adoption, is a specific kind of uncomfortable.

When you identify a behavioral pattern before the social permission threshold is crossed, the people around you will often tell you that the pattern doesn't exist. They will point to current adoption rates. They will cite surveys showing that users prefer text. They will be right about the present and wrong about the future.

The discipline required to hold an early observation with conviction — to keep building toward a future that hasn't arrived yet — is one of the most underrated skills in technology strategy. It is also one of the most valuable, because the organizations that build for the future that is coming, rather than the present that exists, are the ones that find themselves with a durable advantage when the future arrives.

AIVO is, in the end, a small story about a big pattern: the gap between what technology can do and what people are ready to do with it. Understanding that gap — and having the patience to build across it — is what separates execution-based strategy from the kind that ages poorly.


Christopher Roberts is the founder of Intevate Labs, an AI-native strategy and execution firm based in Utah. He advises organizations on building AI-native products and interfaces. To discuss your product strategy, schedule a conversation here.