
VapiCon 2025: Foundations of Voice AI with Kwindla Kramer, Anoop Dawar, Dr Alberto Montilla & more

By Vapi

Summary

## Key takeaways

- **End-to-End Observability Crucial**: Observability goes beyond what the agent or model outputs; it includes network conditions like bad reception on mobile or poor Wi-Fi, which drastically alter the experience even with the same LLM and tuning, making it essential for great customer experience and scaling to millions of users. [06:20], [07:17]
- **Conversation Design Underestimated**: The art of conversation remains unsolved and left to builders, involving challenges like identifying primary speakers, switching turns, keeping customers engaged, and orchestrating natural flows, such as reducing an 8-minute call to under 30 seconds in modern voice AI. [07:54], [08:36]
- **Turn Detection Evolving Rapidly**: Turn detection, knowing when a person finishes speaking for the LLM to respond, drove Daily to train an open-source model now on V3 with community contributions, complemented by approaches like Gemini's selective refusal, as the ecosystem breaks off specialized pieces. [09:55], [10:26]
- **Perceived Latency Trumps Actual**: While actual latency matters, perceived latency is significantly more important in conversations, requiring better vendor orchestration of conversational pieces to address varying flows and contexts, as no single template fits all business use cases. [14:14], [15:18]
- **No Model Hits Five 9's Reliability**: No model provider delivers five 9's reliability yet due to GPU constraints and rapid growth, so builders must implement graceful failover across providers or regions, which is use-case dependent but essential for application-level reliability. [21:49], [22:06]
- **Domain-Specific Benchmarks Essential**: General benchmarks fail for domain-specific use cases as they're out-of-distribution, leading to surprises like low word error rate not translating to real performance; instead, create in-domain datasets, distinguish major vs. minor errors, and tie metrics to outcomes like successful interactions. [38:12], [40:25]

Topics Covered

  • Edge computing will eclipse server-side AI?
  • Network observability trumps model intelligence?
  • Conversation orchestration remains unsolved?
  • Perceived latency defines usability?
  • Domain-specific benchmarks beat general ones?

Full Transcript

Please welcome Kwindla, Anoop, Paige, Alberto, and Dan to the stage.

Well, uh, thank you everybody for joining us.

Thank you, esteemed speakers.

I'm very excited to dive into the foundations of voice AI with you all today.

Um, we are scheduled for 50 minutes.

uh we're going to start from the very beginning uh going into the orchestration then going into the real world constraints and like what are some of the pain points of building and deploying these systems in production.

Um, starting from the very beginning, uh, we will go for quick round of introductions.

Uh, my name is Dan Dusin.

I'm a developer relations and partnerships lead at Vapi.

And, uh, yeah, I have here Kwindla, Anoop, Alberto, Paige.

Uh, if you guys want to introduce yourself one by one, starting with you.

>> Sure.

Greetings every Wow, that was very loud.

Hi, greetings everyone. My name is Paige.

I lead developer experience for Google DeepMind, which means that I get to work on all of our great APIs for models as well as, you know, tools that you might have used like AI Studio, the AI agent features in Google Colab, and the like.

So uh very excited to be here and to see what y'all are building with voice AI systems. >> Beautiful Alberto.

>> Thank you. Hey folks, good day.

My name is Alberto Montilla. I'm a senior director of products at Twilio, and I lead the voice application services, which is basically anything that runs on top of a phone call, voice AI included.

Thank you.

>> Hello everyone. My name is Anoop Dawar.

I'm the chief strategy officer at Deepgram.

What does that mean? I it means I build strategy where the rubber meets the road not where the rubber meets the sky because startups have to be very focused on the here and now while also planning for the long term. And uh yeah, I'm really glad to be here with the panel.

>> It is a really good title.

I'm I'm envious of that chief strategy officer title.

It sounds great.

>> I prefer chief janitor, but they're very... >> Uh, that's what I am. I'm Kwindla.

I'm the CEO or chief janitor at a company called Daily.

We make uh networking, infrastructure, software developer tools.

Uh, we power Vapi.

And since the launch of GPT4, I personally have been really excited about what I used to call talk to an LLM use cases.

Now we call them voice agents. And I've had the privilege of working with all the people on this panel, because we are so low down in the network stack that if you need to move audio and video around, we can help you.

>> Beautiful. Uh thank you for introductions once again. Thank you for joining.

Uh it is really really exciting.

Uh I get to work same as you do Quinn with all of you uh on different parts.

uh of the stack. Uh we handle the entire orchestration uh uh everything you need to do to build a conversational AI experience.

We try to abstract away all the complexity so that the developers focus on actual business logic, not the integrations, not the adaptation for external connections, etc. But, starting from the very beginning, I'm going to take a deeper dive into what it means to actually orchestrate, to build a conversational AI experience. What is orchestration?

Uh there are a bunch of building blocks.

Well, actually, first of all, before we get into that, I've got to get a feel for the room. How many of you folks are developers here? If you can raise your hands.

Excellent. All right. Lovely. All right.

That's awesome. Uh then yeah uh we're going to go into the orchestration.

Um, the three primary components, the way I see it, in a conversational AI, voice AI experience are text-to-speech, LLM, and speech-to-text models, right? But this is not everything. There are a lot more components and a lot more pain points that need to be addressed in order to facilitate a great user experience, whether it's VAD, utterance detection, telephony, and so on. I'm curious on your opinion, and we'll go around and share our perspectives for a few minutes.

What's the most underestimated building block out of the entire pipeline? What are your thoughts on this, Paige?
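(Editor's note: as a rough sketch of the cascaded pipeline Dan describes above, one turn of a speech-to-text, LLM, text-to-speech loop might look like the Python below. The client objects and method names are hypothetical placeholders, not any particular vendor's SDK, and a production agent would stream every stage rather than block.)

```python
# Minimal, hypothetical sketch of one cascaded voice-agent turn:
# speech-to-text -> LLM -> text-to-speech. Placeholder clients, not a real SDK.
from dataclasses import dataclass, field
from typing import Any, List

@dataclass
class CascadedVoiceAgent:
    stt: Any   # hypothetical client: .transcribe(audio_bytes) -> str
    llm: Any   # hypothetical client: .respond(history) -> str
    tts: Any   # hypothetical client: .synthesize(text) -> bytes
    history: List[dict] = field(default_factory=list)

    def handle_turn(self, user_audio: bytes) -> bytes:
        """Run one user turn through the cascade and return reply audio."""
        transcript = self.stt.transcribe(user_audio)                   # 1. speech to text
        self.history.append({"role": "user", "content": transcript})
        reply = self.llm.respond(self.history)                          # 2. generate a reply
        self.history.append({"role": "assistant", "content": reply})
        return self.tts.synthesize(reply)                               # 3. text back to speech
```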

>> Excellent. So I'm I'm happy to start.

One thing that I really appreciate is how much these systems have evolved over the span of the last year. I'm not sure how many folks in the room might have heard of something called Project Astra, or seen it. It was a kind of prototype, first announced at I/O 2024, which allows you to have a voice conversation with the Gemini models.

So you speak to it um there's a reasoning step in the middle and then you get a response back.

We've now made that available for folks to use as the Gemini live API.

Um so it works in hundreds of different languages.

You can kind of use it in combination with many of the products that the Daily team has created to power the real-time voice-to-voice experiences for all of your apps. But something even cooler is that, just a few months ago, we've been able to get an experience that's roughly the same, but just running completely on a mobile device.

So instead of having to, you know, ping a server and get a response back, you could have the voice conversation with the speech-to-text, LLM, and text-to-speech components running completely on your mobile device, powered by Gemma, but also some of the smaller models that you might have experience using, which are all open models.

Um, so, so I think that that's one of the things that I I'm really excited to see future work on is that right now we're relying so much on server side uh pings and I I think that in the future um those will still be important but we'll be able to more elegantly toggle between um you know these more local experiences versus things that need a little bit more horsepower if that makes sense.

>> Yeah. And I really do share your passion for on-device, on-the-edge intelligence, because no matter how you look at it, it's where we're moving.

Uh it's going to happen and advancements in these smaller models such as Gemma uh it's very very exciting to observe.

Uh but curious uh in your opinion, Alberto, uh what is the most underestimated building block of conversational AI?

I think I'm going to go with a network item, and that has to do with end-to-end observability.

>> When we say observability, we talk about, you know, what the agent is doing, what the model is producing as output, but there is much more than that.

I mean, imagine that the end user, the customer, is on a mobile phone with bad reception, versus somebody that is at their desk connected via bad Wi-Fi, or, you know, someone that has excellent quality and HD audio. With the same LLM, with the same, you know, personality tuning, etc., etc., the experience is completely different.

Um some of them might complain to you know to you as as provider um how do you know what is going on?

Observability is critical to, A, provide a great customer experience, and B, to scale, because the moment you have thousands, millions of customers calling you every day, you're going to have customers across the spectrum: all types of devices, all types of connections.

So I think that's uh and that requires technology investment and uh operational and process investment as well and uh that's sometimes overlooked.

>> Oh yes uh very often overlooked.

People are like okay give me the black box just do the magic you know.

>> Yeah. Uh what are your thoughts?

>> Yeah, it's a great question. I think all of us here think we've solved some important problems. And I think we have; that's why we're here. But to me, the biggest underestimated piece is the art of conversation, because that's the thing that none of us have done yet. And so it's all on you. We have kind of left that incredibly complex piece of the equation to all of you builders. And what do I mean by that? Like, starting from the simplest thing of how do you identify the primary speaker? When do you switch?

How do you actually have a conversation when you have to do something long and reasoning related task?

How do you keep the customer engaged?

If you call a call center today and you do a simple thing... like, today I had to call for my wife to get an international plan because she's flying out to India, and so, okay, the easiest thing is to call. So I'm calling them, and the first thing you get is "thank you for calling" (pick your provider), and then it gives you "we are going to record your call for privacy," and I have called you 20 times, I know that, why are you wasting 14 seconds of my life to tell me that? And then you go through the whole conversation. It took me 8 minutes; in a modern voice experience it should take me less than 30 seconds. But how do you orchestrate it? How do you make it natural? We've kind of completely left it to all of you, and I think that's the most underestimated. >> Yeah.

And Vapi in particular actually shares your vision for this.

We've announced uh our mission earlier today in Jordan's talk uh about building a human interface for companies, right?

So very much resonates with us.

>> Um, what do you think, Kwindla?

>> I think it's useful to talk about sort of the trajectory, because we live in this moment where everything is changing really, really fast. And I talk a lot about 2024 problems, where we were trying to get all the basic building blocks to like the 80/20 level.

So we could actually deploy these things to production and see them scale.

I think we got there last year.

Now we have a sort of new set of pain points uh because we're trying to do more things because we have better building blocks.

One of those is turn detection.

So how do you know when the person is done talking and the LLM should respond?

I always said that at Daily we would never get distracted by training models, because we work with all the great companies that train models. And then, of course, you should never say never, because we got really tired of not being able to solve the turn detection problem.

So we trained an open-source, open-weights, open-datasets turn detection model last year. We're now on V3 of that model. Lots of people are contributing data and ideas to it.

It's exciting part of the open source ecosystem.

But now there are also other, equally good I think, models that have come out in the last couple of months, including really good turn detection in the model you announced today, and really interesting approaches like the Gemini models doing selective-refusal angles on turn detection, which I think is complementary to having a small specialized model do it. So the ecosystem is moving so fast, I think we're all trying to break off these pieces.
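(Editor's note: to make the turn-detection idea concrete, here is a minimal sketch, assuming a hypothetical end-of-turn classifier; it is not the open-weights model or the Gemini behavior described above, just the general shape of combining a silence timeout with a learned end-of-turn score.)

```python
# Sketch of a turn-end decision: a hard silence timeout as a fallback, plus a
# hypothetical end-of-turn classifier (eot_score) scoring whether the
# transcript so far sounds complete. Thresholds here are illustrative.
SILENCE_HARD_LIMIT_S = 2.0   # always respond after this much silence
MIN_SILENCE_S = 0.2          # don't even consider responding before this
EOT_THRESHOLD = 0.8          # classifier confidence needed to respond early

def should_respond(silence_s: float, transcript_so_far: str, eot_score) -> bool:
    """Decide whether the agent should start speaking now."""
    if silence_s >= SILENCE_HARD_LIMIT_S:
        return True            # long silence: assume the turn is over
    if silence_s < MIN_SILENCE_S:
        return False           # the user has barely paused
    # Ask a small specialized model whether the utterance sounds finished
    # ("I'd like to change my..." should score low, "...my billing address." high).
    return eot_score(transcript_so_far) >= EOT_THRESHOLD
```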

I see the underserved stuff, the ferment that's bubbling now, as, first, how do you use multiple models together effectively in one conversation? That actually turns out to be a lot of different things depending on what you're trying to do, but I'm super excited about multiple kinds of inference: maybe multiple different models, some of them fine-tuned, maybe some of the models running on device and some of the inference running in the cloud. There's just so much work to do there. And then the other thing, and this is again inspired by your release today, is that I am increasingly getting pulled into conversations that are about multiple people in a voice conversation.

And so you can address that from, like, the speaker diarization angle.

There's some other ways you can think about it.

But the next big step function in some parts of that human-like conversation is breaking out of the one-person, one-LLM mode: taking these models out into the real world, having them understand all the noises that we all hear and what those things mean, being better at processing cross-talk, and not having to work quite so hard in our system instruction to say, "Hey, LLM, there are three people in this conversation.

You need to do X for person Y, and A for..." You know, it's just so much of a prompt engineering challenge, but it really should get pushed down into the model and the orchestration layer.

So I'm excited about that too.

>> Right. That's a lot of pain points you actually mentioned. But since you started talking about pain points, I'm curious: if you are working on a real-time experience, right, on a conversation you're speaking to, if it takes forever to respond, if it takes more than a second, more than two seconds to respond, it doesn't matter how intelligent it is, it's just not usable, right? Humans don't have the patience to wait. >> Latency matters. >> Latency matters, and there's a compromise usually, at least from what I'm seeing, between latency and quality, and striking the balance.

Some people might find it complicated.

So, I'm curious, what have been your observations on balancing that?

And what is your general rule of thumb of figuring out when somebody's overoptimizing for one over the other?

>> I mean, I'm the guy who yells about latency all the time on social media.

>> Yeah.

>> So, I obviously care about latency, but I will say that there's been a transition again as we've started to mature in the ecosystem. Early on, all the demos we built, all the POCs we helped customers build, really focused on latency almost to the exclusion of everything else.

Partly because we knew it mattered, partly because people have this very strong walk up sensitivity to latency.

One of the first things you perceive, as you were saying, in that voice interaction experience is the voice-to-voice latency, actually measured from the time I know I stopped speaking to the time I hear the first audio hit my ears.

Now though, I tell people get it working because we can solve the latency problem.

Like it's we have best practices around latency and we can show people demos that are more than fast enough for them to feel natural.

And we've almost flipped it a little bit and said, you're trying to do some complicated thing with Voice AI.

Let's get the model choice right.

Let's get the orchestration pieces tuned well.

Let's plug it into your enterprise backend system and check those boxes for you.

I know we can solve the latency problem for you, but let's get it working. >> Right. Really curious to hear your thoughts on this, Anoop, because I know Deepgram works across the pipeline, text-to-speech, speech-to-text, and you guys have excellent time-to-first-byte metrics on your models. So what are your thoughts? >> I paused for two seconds, you guys notice? Well, so I think latency is important, absolutely, in a conversational context, but I think perceived latency is significantly more important. And, to Quinn's point, I think at this point we feel like even in cascaded systems the latency problem is solved, or solvable, if you know how to orchestrate these pieces together. But the perceived latency problem is still, you know, current and present, and that requires, again, the conversational pieces that we all as vendors can do significantly better, and do better for you, so you don't have to orchestrate. There's also a sense of taste and aesthetic.

There's also like different conversations flow differently.

So there's no one single like template that you can say okay it's a customer support template and boom I'm done.

In-distribution context is significantly underrepresented in what we all provide you, and as you go through all your business use cases, you have that context. Right now, I think that will continue to be the challenge with perceived latency versus actual latency.

>> Yeah. Um great. I'm curious to dive a bit deeper on the other end of like uh latency and delivering latency at scale.

uh scaling and working with infrastructure.

Infrastructure companies, they talk about five 9s reliability, right?

But there are a lot of other factors that play into delivering an excellent experience to a user.

We're dealing with jitter, packet loss, all these things in the field that people usually underestimate, don't think about.

Really curious, Alberto, on your perspective in particular, since you guys are building APIs targeting developers: what has been your point of view on what's the hardest part that people really don't pay attention to?

>> It's a great question, and it has been exacerbated by generative AI. I think generative AI, and some of the other technologies around voice AI, have made it extremely easy to put up a demo, to put up a POC.

However, scaling that to your entire customer base is still pretty much as challenging as it has always been, you know, from the days of predictive AI, and I think that's probably still the hardest part. First, from a technology perspective, getting from the 80% to the 95%, to the 100%, is extremely hard, and there is an understanding needed of what technology choices you make, not only to put up the POC but also to scale that POC to all your customer base. Examples: things as simple as language coverage. We all put up the POC in English. Do you have customers who speak Spanish and, you know, 10 or 20 other languages? What are you going to do about that? Is your model as good in, you know, Portuguese as it is in English? Probably not.

Uh the second part has to do with, you know, network connections.

You put up a POC, you use an SDK, and, you know, probably it goes fine using your desktop or your mobile phone. But then, do you require customers to connect via the PSTN, mobile phones, landlines, etc., etc.? Is your case inbound or outbound? Those are technology choices. And then the last part of it: you were talking about five 9s.

What models offer you five 9s on their own today?

Uh so you need to start thinking about okay how do I do um failover uh so that you know I have continuity of my services.

What does this degraded service look like? Because those are the kinds of things that your customers are going to complain about, those are the kinds of things that are going to break your service level agreement, and you need to look at that. So that's the technology part. Then the second part is the operations and procedures. Do you have a plan for how you're going to scale up your voice AI application, from the POC, to the moment you onboard the first beta cohort, to the moment you say I'm going to go GA with the first cohort, to the moment you are 100% deployed? You need to be able to gather the data, gather the feedback, have an improvement loop, to the point where, you know, you achieve or get very close to your objective. So that's a framework, an adoption framework, which by the way is not very different from the adoption frameworks for bots or any other technologies that have existed in the past.

So I think those are things that need to be taken into consideration from the day after you put up your POC, or the day you decide you want to build your POC, with the intention of building a product or a service out of it.

>> Right. Right. Makes a lot of sense.

What do you think, Quinn? Since, you know, you operate at the protocol level, it's your bread and butter, you probably have a few things to say. >> There's a lot of education of developers who are new to the space, and part of what I think you guys do really well at Vapi is provide these curated best-practices bundles that help people succeed. One of the conversations I have over and over is which kind of networking you should use in which circumstances: you know, use WebRTC for edge to cloud, use WebSockets server to server, and understand the telephony infrastructure that you're going to use, because most of us are building telephony use cases so far in the evolution of voice AI. Understand how PSTN and SIP work at a good enough level that you can debug and have good intuitions about your systems, and understand which providers are going to give you the right level of nines.

I can tell you because we have lots of qualitative and quantitative experience that Twilio is head and shoulders above their competitors in terms of both reliability and you know the sort of surface area of bugs that all complex systems have and it's worth using the providers that give you that level of service.

We're all so new at this, and one of the cognitive dissonance pieces I have is, as I think you were just saying, none of the model providers are anywhere close to the reliability all of us were trained to deliver to our customers in the infrastructure world. That's a problem, and none of us have quite figured out how to deal with that, building on top of the model providers. But, you know, nobody has enough GPUs, and running GPU-cluster supercomputers is like a new thing. Um, and everybody's growing.

So, one thing you have to figure out when you're building this stuff at the application level, and you all do this for your customers at Vapi, is how do you fail over gracefully from a model provider to some alternative, depending on what that is.

It might be another region, it might be another model. Um, but you know, it's use case dependent, but it's a problem.
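(Editor's note: a minimal sketch of the graceful-failover idea, assuming hypothetical provider clients that expose a generate(prompt, timeout=...) call; which fallbacks make sense, another region or another model, is use-case dependent, as Kwindla says.)

```python
# Sketch of graceful failover across model providers or regions.
# The provider objects and their generate() method are hypothetical placeholders.
import logging

def generate_with_failover(prompt: str, providers: list, timeout_s: float = 2.0) -> str:
    """Try each provider in priority order and return the first successful reply."""
    last_error = None
    for provider in providers:
        try:
            return provider.generate(prompt, timeout=timeout_s)
        except Exception as exc:          # timeouts, rate limits, regional outages
            logging.warning("provider %r failed, trying next: %s", provider, exc)
            last_error = exc
    raise RuntimeError("all model providers failed") from last_error
```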

>> Yeah, for sure.

Going back to serving developers: we sort of operate in this developer tooling, developer-facing space, all of us here, and I'm really curious on your thoughts. There's a spectrum, right? Some would say treat the API as a black box: magic happens, I don't care what happens there, just trust the provider. Others will say, you know, you need to deeply understand the technology under the hood, otherwise you're not going to be, you know, using it effectively.

Um, where do you fall on that spectrum?

Let's, uh, start with you, Paige.

>> I definitely feel like we need to have people be able to understand each component piece of the system so they can help debug and triage when all of the component parts inevitably will go wrong.

And that's some component of observability like you were mentioning as well as deeply understanding um all of the uh the the kind of changes that exist whenever you go from having a more local experience to a to a server side experience.

Um I also think that that helps you kind of design um much more bespoke and very customer centric experiences.

I loved what y'all were just saying about the perceived latency of a conversation. I know that if, you know, somebody calls a call center and they're emotionally charged, they probably want somebody to help defuse that situation while also, like, behind the scenes sort of prefetching any of the data that's required in order to help answer their questions.

I think, you know, voice is a new interface, just like a browser is an interface for interacting with websites, and I think that there are still many of these hard-learned lessons that have been incorporated into things like browsers, like prefetching, like being able to detect, oh, a person's likely to click this next button, that voice systems will need to design as well. Even if it's as simple as: this person's calling, they're emotionally charged, probably about money, I should be pulling up their account.

I should be like pulling up all of the documentation related to what they're worried about. Um, so I don't know if this is answering your question, but it is very much like I think we're at this kind of really exciting time where we're defining the new interface just like browsers were defining the way that people interact with the internet and voice and visual content is going to be that interface.

>> Right? So generally speaking, you have to understand the pieces at this.

>> Yes, you have to understand the pieces like like uh unfortunately black boxes just won't be sufficient. At least not for now.

>> Yeah, we're not magicians here.

We're engineers, right? So makes a lot of sense.

What do you think, Alberto?

>> Um um just to double click a little bit on that, there are um I think there are a couple of things, right, to to consider.

The first one is, every company that aims to be successful in this century has to be a digital company.

Uh that means you know you need to be deep down into developing your specific experiences for your customers because nobody else knows um your customer as you do. Um having said that uh we know that modern technologies are built uh on you know stacks that goes over you know on top of each other.

Do you have to build all of them? Oh, no. You need to understand them.

And you build those that are critical to delivering your experience.

Uh there is also a little bit of you know rule of economics.

You need to understand how much it takes to develop a specific technology, maintain a specific technology, whether you're going to get any return or enough return from it, and whether you can do a good job on it.

So maybe, you know, when I look at things like, for example, speech recognition: unless you are building a speech recognition engine yourself and that's your core business, you should go and use an existing technology provider for speech recognition, let's say, you know, Deepgram, let's say Google, because, you know, are you going to do a better job than them, and are you going to get the ROI from it? So understand the technology, and then build what is critical for delivering your unique experience.

>> Thank you for saying nice things about Google.

>> Oh yeah, >> we appreciate it.

>> Yeah. No, but that all tracks we're all like building for developers here.

So developer empathy important.

Very important. Um curious in your uh thoughts on >> I didn't think this was going to be the controversial question.

So I'm going to say something, and then let's see where it goes. So, I agree you should understand it, but I'm one of those people who spent a decade-plus in engineering: hacking Linux kernels, building switches and routers, wireless access points, large-scale distributed file systems, analytics, data science. Okay, so what I've learned is, if you're curious and you want to really understand everything, the depth goes all the way down to the chips, right? So, so yes, but at what?

So, so the rule of thumb I've built for myself to keep myself sane is if I'm using something, I need to go three levels down to understand, okay, if I'm if I'm using speech to text, well, what is the problem I'm solving?

How am I going to measure speech to text?

One level. Okay, if I measure it this way, what are the things that go wrong?

Second level. And then if these things go wrong, how am I going to engage with the vendor or myself to know that am I capable of fixing it or do I need help?

So I would say you have to figure out, in your curiosity system, how deep you need to go, whether it's 3, 7, 5, whatever that number is, because otherwise my brain explodes if I try to understand everything Deepgram does, and it's a small company of 135 people. I can't just get into the depth of everything, but I need to get my job done, and I think most developers are in the same place. So I'm saying yes, but not quite yes. >> That makes sense. I think it's a challenge too, in the sense that if you are getting into the business of kind of training your own models or fine-tuning your own models.

Um the the component complexity gets very large very quickly. Um as an example, if you did want to use models on device, then suddenly it matters like what kind of hardware are people running uh on their on their mobile device?

Um what is their their latest generation of cell phone?

Um has the model drifted?

Like do you need to upgrade to the latest version?

Do you need to fine-tune as soon as you fine-tune? Do you need to refine it because there's a new version that's out?

Like all of all of these things are um really really high maintenance costs that that I think that we're experiencing in a really painful way today because everything to Quinn's point is so new.

>> Um but that I I hope um over time will will get to be a little bit more manageable maybe by some of the companies that y'all are in the room already building.

I mean, I think these things that all of you have said are all complementary, and I think it's incredibly important.

I spend a lot of time helping people in shared Slack channels and Discord and Twitter DMs and LinkedIn DMs and email, and I have done this for a long time. Like you, I've been in developer tools for a long time, and I now have a heuristic about whether I'm going to be able to help a team or not.

And part of that heuristic is: are you trying to do something where I've told you what will work, and you are just continuing to tweak the parameters and change the pieces around and asking me why it's not working? And I'm super sympathetic, right?

Cuz I'm an engineer and I want to get things the way I want them to.

But it's a judgment thing.

>> And there's like the the build the stuff that has business value or you're probably going to have to go get another job.

There's the really valuable heuristic of: minimize complexity surface area, or you are just not going to be able to keep writing this piece of software sustainably.

And then there's the engineering taste which I think going down a certain number of levels is like a taste criterion >> and it's do stuff that's going to work.

Like if you're experimenting with stuff on the weekend because you want to see where the boundary conditions are, that is awesome.

Yeah. And that's useful and educational, but it's not going to actually help you build the product that you can sell, because you probably can find a set of building blocks that demonstrably work, and then stop fussing with those knobs. And I just think that's a really important thing.

And I I often see people sort of get stuck on that in something new in a domain like we're all doing where nothing's perfect, everything's new, there's a lot to figure out.

>> Yeah, very important observation and uh I really love that you brought up engineering taste too in there.

It's something people don't really pay attention to, but I found that the taste part of the equation is actually quite important for building software that serves people. Anyway, the next question is for you, Quinn. And it is touching once again upon latency and the definition of real time. Like, what the hell is real time? Like, there's always going to be some latency, some delay.

And I'm curious like usually people mean sub-second latency or in the ballpark.

What should real time mean in your opinion?

>> I think the answer from psych there's lots of psychology research. I think the answer from psychology research about where you want to land with perceived latency in a voice conversation is sub 500 milliseconds.

There's so much psychology research.

We could have a whole panel about that and we could talk about the subtleties, but I think that is the answer.

The practical answer though for building these systems is more complicated.

And we talked a little bit before about how we really tried to demonstrate that we could get to low latency as the primary thing we were demonstrating early in this arc.

And now it's a little more complicated than that. What I tell people today is we have a lot of data from deployments.

Now, if you're under 1,500 milliseconds voice-to-voice latency, you are probably fine for your enterprise use case, and you need to make sure your complex voice AI workflows are as reliable as possible before you do too much more, if you've gotten under 1,500 milliseconds.

But I do think, as an ecosystem and as an industry, we are pushing down towards that 500 millisecond number, and that's what we should be aiming for. >> Right. So that golden threshold, right, 500. What happens when we optimize below that? Does that matter? >> How much do you want to talk about the psychology research? One super interesting tidbit I'll just throw out is that there are some studies of how fast actual voice responses happen in real human conversations across cultures. And there are some, so like American English, like 400 milliseconds. A lot of stuff clusters around 400 milliseconds.

There's some that are a little slower, some they're a little faster.

There's some outliers that are negative.

There are cultures where it is so common that you start talking before your interlocutor is done talking that if you just do it numerically, it's like 17 milliseconds.

>> Wow. So there's a lot of human variety, and I think, to the point of we're trying to build human-like conversations, we should respect that. But I think the practical answer is almost everybody doesn't really notice once you're down below 500 milliseconds true voice-to-voice. >> Right?

>> There is plenty of data available.

You know, contact centers have been around for many, many years, focusing on business-to-consumer interactions, and there is plenty of data that illustrates not only latency but what the flow of a conversation is. Because, you know, when, for example, you call support, let's use a support use case, you're typically reaching level-one support, and they only know the basics, and they have to type, which means, you know, latency in the response.

There are also studies that say what the most rewarding conversations are, and they aren't necessarily the ones that are always, you know, good.

>> There is... we reward a little bit of, you know, ups and downs during the conversation. So I think the technical answer is, you know, 500 milliseconds, but that's only for the cases where we expect somebody to reply.

When you say you know good morning you expect somebody to tell you good morning right away but when you say hey I want to know uh when I'm due to refresh my iPhone probably you know it might take a few seconds.

>> So that's the the kind of things that you have to build and test uh in your u specific uh flows.

>> Yeah. And these are all conversations about, pun not intended, but I guess it works, conversations about synchronous conversations that are user-invoked, right? I think we're also getting to this place where you might have an agent that wants to initiate a more proactive conversation with a person, saying, like, hey, did you know that there's a concert happening this week that might be really interesting for you? Do you want me to kind of comb Craigslist to see if I can find any discounted tickets or something similar?

Or like, hey, your friend really enjoys this rock climbing place.

Do you want me to schedule a weekend adventure for the two of you to go and and explore? Um I I think we're also seeing for agents that are that are working on code bases. Um you know people feeling a lot uh friendlier towards these asynchronous scenarios where you can kind of ask for a change to be made across the codebase or or something similar.

like if if I want uh you know a conversational agent to plan out uh weeks full of meals and then go and order everything on Instacart, I'm okay with not getting a response back immediately because I I wouldn't necessarily um wouldn't necessarily need one.

So, so I think we're also um as engineers trying to understand what things need to be synchronous and kind of user invoked conversations versus what should be more proactive initiated by an agent or asynchronous conversation.

>> So much UI design to do, so much UI design. And there are totally places where push-to-talk is the right interface.

I think they're limited, but I definitely think they exist. I think one other important thing to say about latency is most people don't measure it correctly. >> And that is worth, like, understanding if you're in this room: basically all the latency numbers you ever see quoted in benchmarks are wrong for voice AI. If you're measuring latency for your agent, and you absolutely should, measure it manually, mechanically, so you really have a sense of it: record the call, put the audio file in an audio editor on your computer, draw a line in the gap between the waveform of the user stopping talking and the bot starting talking.

That's the ground truth.

Like you can't fake that. You can't get that wrong.

>> You should be doing that manually, regularly, with your voice agents. >> And we found that to be the way at Vapi too.
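(Editor's note: the manual audio-editor measurement described above is the ground truth; as a rough automated approximation, assuming a 16-bit PCM stereo recording with the caller on one channel and the bot on the other, and a single user utterance followed by the bot reply, you could estimate the gap with a simple energy threshold like this sketch.)

```python
# Rough voice-to-voice latency estimate from a stereo call recording.
# Assumptions: 16-bit PCM WAV, caller on channel 0, bot on channel 1,
# one user utterance followed by the bot reply. Threshold is illustrative.
import wave
import numpy as np

def voice_to_voice_latency_s(path: str, energy_threshold: int = 500) -> float:
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        samples = np.frombuffer(wav.readframes(wav.getnframes()), dtype=np.int16)
    stereo = samples.reshape(-1, 2).astype(np.int32)        # interleaved stereo frames
    user, bot = stereo[:, 0], stereo[:, 1]

    user_stop = np.flatnonzero(np.abs(user) > energy_threshold)[-1]   # last audible user sample
    bot_loud = np.flatnonzero(np.abs(bot) > energy_threshold)
    bot_start = bot_loud[bot_loud > user_stop][0]                     # first bot audio after that
    return (bot_start - user_stop) / rate
```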

So that tracks. >> That tracks. Anyway, I'm going to keep things moving along. We talked a bit about latency again. Now, I like to imagine a vector space when I think of all the things that can go wrong with the conversation experience, right? There's accuracy, there's reproducibility, consistency, speed, many, many angles. And this question is for you. I'm really curious: we talked a bit about measuring, about benchmarking, about achieving consistent outcomes. If you were to design, let's say, a benchmark of the next generation that would get things right?

How would you design it?

What would you measure? What would you focus on?

>> Yeah, that that's a great question.

So before I answer that question, I'll say this, right?

Most of the benchmarks that we use all of us that are public are sort of general purpose benchmarks.

And what ends up happening is, if your use case, especially if it's a business use case, it's highly likely that it's out of distribution for that benchmark.

So you will go to the benchmark and say, oh look, word error rate low, yay, let's go, and then you use it in your system and it doesn't work, right? So first of all, understand that's the truth. And then the second thing is, okay, so should we all publish our benchmarks so that you all can get them? Well, we could. Let's say we have an in-distribution benchmark for, let's say, automotive vehicles, okay, and we publish it. Now, as soon as we publish it, everybody has it, right? Including us. Now it's part of our training sets, and now there is no held-out, like, eval set that's hidden. So it's gamed, and again, it's going to be out of distribution, because no benchmark is ever representative of the whole distribution that you have.

So hopefully that helps you know like benchmarks are helpful but they're not helpful.

So just remember like your domain in distribution is sort of the holy grail for making it work for yourselves.

Uh so trust the benchmarks but don't trust the benchmarks.

Yeah.

>> Okay. So now, having said that, how do we think about this at Deepgram? We think that there is a general-purpose benchmark, but things like word error rate are probably not the right measurements in today's voice AI, real-time world. What do I mean by that?

Okay, there are certain things that, if I get them wrong... and I'm a non-native English speaker.

I mess up my grammar.

You understand me? But if I miss a word or say the wrong word, it can make a huge difference, right?

"I don't want the subscription" versus "I want the subscription." If I drop the "don't," it is a big difference in whether you're going to be happy with me or not.

So, there are things you could consider major errors, and there are things that are minor errors.

uh what are the major errors for your system?

How do you measure them?
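(Editor's note: one way to make the major-versus-minor distinction concrete is to weight errors on high-stakes words more heavily when scoring transcripts against an in-domain reference; the word list and weights below are illustrative only, and this is a crude bag-of-words check rather than an alignment-based word error rate.)

```python
# Illustrative weighted error score: dropping or swapping a high-stakes word
# (like "don't") costs more than a filler-word error. The word list and
# weights are made up for the example, not any vendor's real metric.
MAJOR_WORDS = {"don't", "not", "no", "cancel", "keep"}   # domain-specific in practice
MAJOR_WEIGHT, MINOR_WEIGHT = 5.0, 1.0

def weighted_error_score(reference: str, hypothesis: str) -> float:
    ref_words = reference.lower().split()
    hyp_words = set(hypothesis.lower().split())
    score = sum(
        MAJOR_WEIGHT if word in MAJOR_WORDS else MINOR_WEIGHT
        for word in ref_words
        if word not in hyp_words                 # reference word missing from the transcript
    )
    return score / max(len(ref_words), 1)

# "I don't want the subscription" vs. "I want the subscription":
# the dropped "don't" dominates the score, matching the panel's example.
```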

Right? So that's one thing, of how Deepgram thinks about it. And the second thing is, okay, in-domain: how do we create an in-domain data set?

So usually what we do is we end up working with customers directly. We've been fortunate enough to be in voice before voice took off, which means we worked with a lot of contact center and other customers, and we jointly built in-domain, meticulously labeled data sets. We've sort of carved them out, we're very religious about not contaminating them into our training systems, and we continue to update them.

So we have per-industry or per-domain benchmarks, and then we'll work with you as a customer to say, okay, you want to work with us and create a representative benchmark for yourself? Let's do it. You want us to train a model only for you that nobody else in the world gets? Let's do it. You want us to host it, or you want to host it yourself? Let's do it. And why do I say that? Because voice is not quite there yet where you can have one benchmark to rule the world, because what you're measuring differs.

And I'll give you one more example before I stop my monologue.

Uh if you're in a if you're in a if you're in a classroom and the professor is speaking, well, that's the primary voice you want to transcribe. And if there are two kids out there laughing and making a joke, that's not that important.

It's far field. I don't want to hear that.

Okay. Imagine you're in an E911 call where a mother is calling, and you hear a child saying something which shows that the child is now unsafe, >> way back in the background.

That's also farfield audio. But all of a sudden that's the most important thing you want to hear.

Yeah.

>> Right. So how can you have one benchmark for where your domain is different?

>> And so voice is therefore different than textbased systems and large language models.

And it's one one of the underappreciated aspects of >> and and I would say that even for the text and the codebased systems like one benchmark cannot at all rule the world for those either like if you really really deeply care about your user experience you're going to need to be constantly creating >> yeah all very excellent points like engineers love measuring things and it is very very common to measure and optimize for the wrong thing.

You have to take a step back, take a critical look.

All right, is the metric correct?

And as you mentioned, uh being domain specific is very very important. Uh these general uh use benchmarks, they don't really work for whatever uh you're optimizing.

>> Yeah. And then we'll look at each component and do the benchmark >> and you'll you'll pick the best of breed, but your outcome metric is still suffering, >> right?

So I think the benchmark needs to be connected to your outcome metric.

Is it average handle time? Is it the number of successful drive-thru orders completed?

And then do the whole system, and hold out an eval set and don't give it to anybody.

Don't give it to Deepgram.

Don't give it to anybody. You should have it, and you should keep updating it, and then anytime a new vendor comes in, you run them through the paces.

>> Yeah, you you should listen to this guy.

He knows what he's talking about.

>> But that's hard in Genai era, right?

Because different models perform differently partly depending on how you contextualize them.

How do you write a benchmark that actually works?

>> Yeah. Um anyway, going to keep things moving along.

Alberto, curious on your opinion, since you've been at the intersection, or rather Twilio has been at the intersection, of telecom and developer APIs, and you've been abstracting so much complexity away. Developers don't really appreciate enough how much is done for them. What would you say is the hardest thing about, you know, operating in this mode of just abstracting complexity away and shipping APIs that just work for people?

Um I think and and uh and I love the challenge.

We love this challenge.

Uh things are moving extremely fast.

So this imposes a challenge when we think about enterprise systems, because, you know, with enterprise systems we do have to honor compliance, whether it is telecom compliance or whether it is industry compliance.

So think of HIPAA for healthcare, for example, or PCI for payments. One of the key challenges is how we keep up to date, and not only up to date but at the forefront of innovation, providing our developers capabilities to continue innovating while still maintaining regulatory and industry compliance.

That's where partners like, you know, Deepgram, Daily, Google, Vapi come in, in which we basically divide the work, or split the challenge, into the chunks that we can all, you know, contribute to, to make it happen.

The other thing is, if you think about general, you know, GenAI models and how fast they are evolving: how do you build predictability?

you were talking about, you know, um benchmarks, um you know, benchmarking GPT4 versus GPT5 and um you know, how do you make sure that you know, not only you stay consistent or enhance the experience, but also you stay uh compliant. That's, you know, that's a a still a big challenge and that's that's part of the complexity we're also abstracting from you, right?

uh folks making sure that we allow you to innovate uh very consciously um in compliance with your you know industry requirements and government regulation.

>> Yeah, I I I can see this being challenging especially considering how fast this industry is moving.

Every day something new is happening. Uh as long as you're having fun while doing it.

>> Oh, we are we are we are.

>> Yeah.

>> Could I ask a question like a very selfish question though?

>> Sure. Yep. If y'all could wave a magic wand for the, kind of, generative model providers, so the ones who are providing the text-to-speech APIs, or kind of the speech-to-text, LLM, text-to-speech component systems. What would be... because I'm sure there are many things that you would want to see, like maybe one each. What would be the things that would be the nicest, like, Christmas wish list item to have?

>> Oh, like and you can say anything.

Uh, I'm going to put Anoop in trouble.

Uh, word error rate equal to zero at sub 500 milliseconds.

>> Gotcha.

>> Beautiful.

>> In every language, by the way.

>> Yeah. That's true.

Because multilingual stuff is a real pain point as our ecosystem is growing 100%.

>> And especially the UX for multilinguality.

Like we've just recently changed for the Gemini Live API.

you can specify in the system prompt what language the model should respond in or or kind of if the model should respond dynamically based on what the user addresses um as opposed to selecting a specific uh a specific language at the beginning and and that's we're hearing mixed feedback about that too.

>> Yeah.

>> Yeah.

>> Do you have anything to say any?

>> Yeah. I'll say this right and this is for text to text or large language.

The way I think about large language models is they do three things together that didn't happen together before in history.

Natural language understanding so it understands your intent.

Query construction, because it understands your intent and then tries to figure out how to query. And then it retrieves the right information from the system.

To me: continue to improve the understanding and intent, continue to build the best queries, but I think retrieval is where it starts becoming very bespoke and dispersed. So to me, like, having a much stronger, I don't know if it's only tool calling or what, but the retrieval system needs to be extremely fast and easy to build reliably. And that's where I feel, especially in real time, where you don't have a chance to do it again and again, and you don't really want to do model racing where you're sending it to three different systems. >> We're already melting GPUs, and now we are doing three times the GPU melting, like, what are we doing, right? >> We're inflating Nvidia stock price, we all are, right? >> But inflating is the wrong word, right, because there's so much utility. >> Well, there is, but imagine if we solved it. Hopefully we'll have enough GPUs for at least the next two years of our work, right? >> Um, so, like, providing a stronger retrieval system.

And I don't know what the solution is, but to me that's a interesting problem.

If that could be solved, we'll all have significantly faster adoption and less GPU melting.

>> Interesting. Do you have anything to add?

>> Yeah, Paige knows my answer to this because I've talked a lot to the deep team about this trying to come up with sort of common vocabulary and common understanding, but I actually think it's super useful to talk about this as a community.

My very strongly held belief is that writing voice agents is a code-shaped problem, not an API shaped problem.

I actually think a lot of Vapi's success, and I've known Nikhil and Jordan since before there was a Vapi, is approaching it from the beginning as a code-shaped problem, not an API-shaped problem.

The model providers have to write APIs because that's the way you give us access to the amazing things that the weights can do. I would like those model providers to view the API's job as to be the best possible partner for the orchestration layer for the code layer.

And historically that is not the way the APIs have been designed.

The APIs have been designed, kind of, from the view of the people who train the models.

I totally understand why.

But those of us who spend a lot of time in the trenches with customers and have built a bunch of software on top of the APIs have feedback about how those APIs should be designed to solve the orchestration layer problems more cleanly.

And that's my magic wand: think about the APIs, think about the consumer of the API as the orchestration layer code.

>> Beautiful. Uh we're out of time so I'll be wrapping things up here at this point.

Uh once again, thank you so much esteemed speakers for joining us.

Paige, Alberto, Anoop, Kwindla. It has been a great pleasure to take a deep dive and maybe even touch on the philosophical problems in voice AI. And, yeah, thank you everybody for joining us.

>> Dan, thank you. Thank you, Dan.
