
Building Voice AI Agents That Don’t Suck [Kwindla Kramer] - 739

By The TWIML AI Podcast with Sam Charrington

Summary

Topics Covered

  • Voice UI Demands UDP WebRTC Networking
  • Consumer Voice Demos Aren't Products
  • Production Voice Requires Multi-Model Pipelines
  • Turn Detection Beats Fixed Silence Windows
  • Video Agents Multiply Modality Complexity

Full Transcript

I think there's an existence proof that you can use LLMs in conversation very flexibly from the growth of the enterprise voice AI stuff we see. And I think the delta between what we're seeing there on the enterprise side and what you're seeing on the ChatGPT Advanced Voice, Gemini Live side, if you want the hot-take expression of it: those are demos, not products. The version of it you interact with is a demo, not a product. They could be products, but for a whole variety of structural reasons at OpenAI and Google, they are not products today.

[Music] All right, everyone. Welcome to another episode of the TWIML AI Podcast. I am your host, Sam Charrington. Today I'm joined by Kwindla Kramer. Kwindla is co-founder and CEO of Daily and the creator of Pipecat. Before we get going, be sure to take a moment to hit that subscribe button wherever you're listening to today's show. Kwindla, welcome to the podcast.

Thank you for having me. I'm a big fan of what you do on the show. Excited to

be here.

I appreciate that. I guess this is technically your second time on the show, because we did that kind of panel discussion interview at the most recent Google I/O, which was a lot of fun. And shout out to swyx from Latent Space for introducing us and putting that all together. But I think we found a lot of interesting things to talk about and vibe on, and I wanted to dig in a little bit deeper into what you've been up to. For folks that didn't hear that or don't know you, that's going to be primarily around voice AI, which is what you've been focused on. But let's give you an opportunity to introduce yourself to the audience.

Yeah, I'm Kwindla Hultman Kramer. I'm an engineer. I've been doing large-scale real-time network audio and video stuff for most of my career. I co-founded a company called Daily. We make audio and video infrastructure for developers. So if you're building something like a telehealth app or an education app and you're trying to connect people together, you can use our infrastructure and our SDKs. When GPT-4 came out, it started to look to us like not only could computers do all these amazing new things around structured data extraction and kind of open-ended conversation, but those things felt like maybe you could have humans talking to computers in a new way. So we built a bunch of stuff, experimented with customers, and I got more and more convinced that voice AI, and real-time voice AI, was a big part of this platform shift we're all excited about. So we open sourced all the tools we built internally at Daily. That became Pipecat, which is now the most widely used voice agent framework, or, before 2025, I would have called it an orchestration layer for real-time AI.

That's awesome. Yeah, it's amazing how quickly these terms are evolving. I find it funny that with voice, people either love it or hate it as an idea for the way to interact with AI and computers in general. It has always been fascinating for me and really exciting. I don't know if I was in high school or junior high school when I got into Dialogic boards and stuff like that. And I was super excited about Twilio, knew them when they were really early on. And this idea that we can control computers and interact with computers just via natural speech, I find fascinating. How did you get into it? What was the spark beyond the kind of business opportunity you saw?

I mean, I am like you. I think it's super interesting to be able to actually talk to a computer and have that be a big component of the user interface. And it seems obvious to me, like I think it does to a lot of people who've been consciously trying to experiment with this, that this platform shift towards generative AI is going to require a bunch of new interface building blocks. And we haven't even started to scratch the surface there yet. And I am pretty sure that a big part of those building blocks is going to be figuring out what voice-first UI looks like.

People are really comfortable talking.

And even though those of us who are used to doing this to interact with computers feel maybe a little weird when we talk to our computers, you get over that pretty fast. And all of a sudden it opens up this whole efficiency channel that's very, very different from the mouse and the keyboard. And we just couldn't do it before, because we didn't have a way to take that unstructured conversation and actually turn it into something computers could do something with. But LLMs can totally do that. And they're just as good at processing your voice as they are at processing a text stream from your keyboard. And the step-function difference between typing and voice as input when you add an LLM to the mix is, I think, even bigger.

So I'm encouraging everybody I know to talk to your computer as much as you can. If you're a programmer and you're interested in this stuff, experiment with what a little building block for a voice-first experience looks like, because you're totally living in the future when you do that.

It frustrates me to no end that every web page that has a form on it doesn't have a microphone that I can press to do it. And obviously I've got it with the various keyboards on the phone, but you have to go into the fields, and I just want something to take care of that. Let me talk to the web.

I have probably had the same conversation maybe 50 times over the last month or so, with both friends and people I'm working with

professionally, about voice. It's always with programmers, and it always goes like this. I say I'm trying to talk to my computer as much as I can, and these days when I'm programming I talk more than I type. And people are like, yeah, I don't see it, I don't really like talking to the computer, and also, what do I do in my open plan office? And my response to that is twofold. First, I totally hear you. It's a shift. Changing a big part of your professional day, that's a big deal. I'm not discounting that at all. But also, we started Daily to do video and audio communication stuff in 2016. And I've been doing startups long enough that I was lucky that I only had to pitch investors that I already knew.

So, very easy conversations, people who knew me, people who were biased towards taking what I was saying I wanted to do for a new company seriously. Still, 12 out of 15 of those initial pitch conversations in the summer of 2016 for Daily went like this. I said, I think we're all going to be doing real-time audio and video all the time on the internet. That's what I'm starting a company around. And professional tech investors would say to me, I don't know, man. I like phone calls. If I want to talk to somebody, I just want to talk on the phone. I don't want to have to set up a video call. I was like, okay, you will. You will.

That's nuts. That's nuts. It's been astounding how quickly the technology evolves, or has evolved. And this I guess is also preaching to the choir; we all have been living this and holding on with white knuckles. But I remember, it couldn't be more than six months ago, or maybe a little more than six months ago, a little bit before vibe coding was the cool word, I was essentially vibe coding an application that let me, I think at the time, track macros. And so I built this app that let me speak what I ate, with units and quantities and stuff like that, so a little bit more granularity than just take a picture, and it would parse that and then go find the macros and stuff from a database.

But I mention it because the way that I did the voice was to capture a segment of voice locally and send that into, I think, Gemini or some model, it doesn't really matter, to transcribe it, get that back, and then ask it to pull out the quantities. And probably two weeks after I stopped working on that, the way I would do it totally changed. OpenAI came out with the live voice stuff, and all the other providers have since come out with their versions of that. So I bring that up to say two things: just to nod at the way this technology has been evolving, but also to note that, for me, the okay, I'm going to capture a recording and send it into an LLM, was very easy cognitively to understand. But then the live APIs and WebRTC and all this other stuff, if I had spent more than a couple of minutes digging into it, I'm sure it would have been natural and easy to work with, but it was a little bit more complicated. And so I wanted to use this opportunity to ask you to give us a primer on getting started with voice. What's the way to think about it, or the dominant abstractions for building voice AI applications nowadays?

Yeah, I have this conversation a lot too, because there's so much new interest in voice. I'll try out my latest version; you can tell me if it lands well. So, because you've got a technical audience, I think it's worth just talking about the stack. Talking about what the stack is, is probably helpful. At the bottom of the stack, you've got the models, so you've got the weights, basically. Whatever those models are, including the multimodal models you're talking about, like the OpenAI real-time model or the Gemini Live model, or you can have text-mode LLMs and do text-to-speech and speech-to-text and glue everything together, or you might have a bunch of small fine-tuned LLMs all collaborating. But at the bottom you've got the weights. On top of those, you've got the APIs that the model providers, or whatever your inference stack is, are providing. So you're hitting an HTTP endpoint from OpenAI or from Google, or a WebSocket endpoint from somebody. Above that, because generally we're trying to do non-trivial things with this technology, we're trying to go beyond building demos, above the APIs you've got some kind of orchestration. You're gluing things together. You're implementing the kind of pipelining of data that makes it possible to do the multi-turn, real-time conversation. And then on top of that you've got your application code, which is sitting on top of that orchestration layer. And if you put all those things together, you've got a real application.
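To make those layers concrete, here is a minimal sketch of how they tend to stack up in code. This is an editorial illustration rather than any particular framework's API: the `transcribe`, `chat`, and `synthesize` helpers are stand-ins for whichever providers you pick at the API layer.

```python
# Layers, bottom to top: weights (behind the provider), provider APIs,
# orchestration glue, and application code. Everything here is illustrative.
from dataclasses import dataclass, field

@dataclass
class Conversation:
    system_prompt: str
    turns: list[dict] = field(default_factory=list)   # multi-turn context

def transcribe(audio_chunk: bytes) -> str:
    # API layer: replace with a real speech-to-text call.
    return "placeholder transcription"

def chat(convo: Conversation, user_text: str) -> str:
    # API layer: replace with a real LLM call (HTTP or WebSocket endpoint).
    return "placeholder reply"

def synthesize(text: str) -> bytes:
    # API layer: replace with a real text-to-speech call.
    return b"placeholder audio"

def run_turn(convo: Conversation, audio_chunk: bytes) -> bytes:
    # Orchestration layer: glue the models together for one conversational turn.
    user_text = transcribe(audio_chunk)
    reply = chat(convo, user_text)
    convo.turns += [{"role": "user", "content": user_text},
                    {"role": "assistant", "content": reply}]
    return synthesize(reply)

# Application layer: your prompt, business logic, and the transport that moves
# audio between the user's device and run_turn() sit above this point.
convo = Conversation(system_prompt="You are a helpful voice agent.")
audio_out = run_turn(convo, b"\x00\x00")   # fake audio bytes for the example
```

Real orchestration layers add streaming, turn detection, and interruption handling on top of this skeleton, which is what comes up later in the conversation.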

Now, you can get started with voice AI today using a platform where all those things are bundled for you into one kind of interface, where maybe you just build in a dashboard; you don't even have to write any code. Companies like Vapi have pioneered that all-in-one, batteries-included approach with really nice dashboards and really nice developer tooling. On the other end of the spectrum, you could build everything yourself. You could make all of those individual choices in the stack yourself, mix and match, and put everything together as a programmer writing code with libraries. What I work on a lot is this orchestration layer called Pipecat, which tries to give you a little bit of have-your-cake-and-eat-it-too: it's easy to get started, because the core implementation of things like interruption handling and turn detection and multi-turn context management are all there for you, basically as Python functions, but you also have complete control and you can mix and match all those parts, and that's an open source project.

Yeah, totally open source, totally vendor neutral. I spend most of my time these days on Pipecat because it's such an interesting new vector for everything we do in the real-time world. But it is a completely open source, completely vendor-neutral project. Most of the big labs contribute to Pipecat. Hundreds of startups contribute to Pipecat. There are probably 120 contributors now in the GitHub repo.

In thinking about the relationship between Pipecat and Daily, talk a little bit about the overlap, just so I can kind of understand. It's not that Daily is commercializing Pipecat. It's that Daily is providing infrastructure for people who are building these applications, and Pipecat just makes it easier to build those applications, whether they're hosted on Daily or elsewhere.

Yeah, that's exactly right. Many more people use Pipecat without Daily than use Pipecat with Daily, which I think is a mark of success for an open source project. We at Daily are the very low-level network infrastructure. So we move the audio and video bytes around the network at super high reliability and super low latency. Anytime you need to do things very fast in a real-time interaction on the internet, we'd love for you to think about using our network infrastructure. But Pipecat supports lots of different options for network transport; Daily is just one of them. We do increasingly try to help our customers get up and running with production-quality voice AI infrastructure, and we have built, on top of our global infrastructure, a hosting platform for Pipecat and related voice AI things called Pipecat Cloud. But that's, again, totally separate from the open source project, which has no commercial dependencies at all.

And so Pipecat Cloud would be kind of analogous to, like, Cloudflare and Cloudflare functions: Pipecat Cloud would be like the runtime environment and Daily would be like the low-level infrastructure. It's not a perfect analogy, but...

Yeah, I mean, I've been trying to figure out what the best analogies are, in a bunch of ways, for this new era of voice AI agents.

Yeah, the original web hosting platform that I thought did a really good job balancing flexibility with high-level abstractions early on was Heroku. So I sort of think of Pipecat Cloud as Heroku for voice AI. But maybe to a technical audience it's even more clear to just say: you push us a Docker container and we autoscale it and monitor it for you, and everything else you get to choose. So we're just taking our global infrastructure that's very good at scaling things everywhere in the world, and we're hosting a Docker container that is wired up for ultra-low-latency voice AI for you.

Running on Kubernetes, or something else?

Yeah, it's a lot of Kubernetes under the covers. And I'm sure we will talk more about this from a bunch of angles, but there are a lot of things that make voice workloads different from the HTTP workloads, or even the WebSocket workloads, that we all spend a lot of our time building. One of those things is you have to have this very low-latency network transport, and you have to support long-running conversations. And we get a lot of people who come to us and say, I tried to build this on AWS Lambda, or I tried to build this on GCP Cloud Run, both of which are fantastic platforms but do not have the components you actually need to support the voice workflows. I'm sure they will at some point, because this space is growing so fast, but there's just a bunch of things you have to do that are not the normal Kubernetes config to get the voice platform stuff to run, to scale, to have the cold starts be right for voice cold-start times, and stuff like that. We spend so much time on that.

Let's dig into those challenges, because I think that's in a lot of ways where the rubber meets the road in differentiating web applications, as you mentioned, from voice applications.

I think at a high level, the things I talk a lot about with people going from prototype to production on voice AI today are evals, latency, and then the sort of fundamentally multi-model nature of almost every production voice app. And we can take those in order if you want, or you can throw some out that you...

Yeah. Well, let's start from the lower-level types of concerns, which is latency. So for example, when you described the stack, I had questions like, did you have to write your own container runtime, or is there a latency-optimized container runtime, or how much tinkering in the stack do you have to do to support voice applications? Custom kernels and all kinds of weird stuff? Because that gets at what's different between that and just spinning something up in DigitalOcean or AWS or wherever.

So, everything is trade-offs. Probably earlier in my career we would have written our own container runtime, because there are advantages to doing that. But what we did this time was we said, okay, every developer is going to be able to use Docker; let's optimize in other places and stick to Docker compatibility, because that's going to make the onboarding and the growth for every developer who uses Pipecat Cloud a lot easier. So we kept vanilla Docker, but then we have to surround that with a bunch of fairly specialized Kubernetes stuff on a couple of levels. One is just all the cold starts and rolling deployment stuff that's specific to voice AI that you were mentioning. So you've got to get those Docker containers loaded. You've got to get them wired up to UDP networking. You've got to have schedulers and deployment logic that doesn't terminate half-hour-long running conversations; with all the default Kubernetes stuff, when you push code there are fairly short drain times, and you have to have long drain times and a bunch of stuff that goes with that. The other layer is you've got to support UDP networking, so you have to support WebRTC, because for edge-device-to-cloud real-time audio you need to not be using WebSockets or TCP-based protocols. You need to be using UDP-based protocols, and the wrapper for those these days is WebRTC. So that's a big deal, to wire up Kubernetes properly to UDP and do all the routing and be able to start the WebRTC conversations. There's a whole bunch of little things you have to customize in Kubernetes. And the biggest single thing that you do for latency is you get that network layer right, with the UDP networking.

Then on top of that, you just try to pull out every extra few milliseconds of latency everywhere in the data processing pipeline. Which you never really have to worry about doing if you're doing text-mode, HTTP-based inference, because a few tens of milliseconds here and there, you don't really notice it. But you really, really notice it in a voice conversation where you're trying to get below a second of voice-to-voice latency.
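As a rough illustration of why every hop matters, here is a back-of-the-envelope voice-to-voice latency budget. The numbers are assumptions for the sake of the arithmetic, not measurements from any particular system.

```python
# Hypothetical latency budget for one voice-to-voice turn.
# All numbers are illustrative assumptions, not benchmarks.
budget_ms = {
    "mic capture + client processing":   40,
    "network to cloud (UDP/WebRTC)":     50,
    "turn detection wait (silence)":    400,  # why smarter turn detection matters
    "speech-to-text finalization":      100,
    "LLM time-to-first-token":          200,
    "text-to-speech first audio":       120,
    "network back to device + playout":  60,
}

total = sum(budget_ms.values())
for stage, ms in budget_ms.items():
    print(f"{stage:<36} {ms:>4} ms")
print(f"{'total (target: < 1000 ms)':<36} {total:>4} ms")
```

With a budget like that there is almost no slack, which is why shaving tens of milliseconds out of the transport and the pipeline actually shows up in how the conversation feels.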

I'm curious, talking about the networking stuff: when we were chatting before, you mentioned that you listened to the recent episode with Vijoy Pandey from Cisco, and one of the things he talked about that I found interesting was this SLIM protocol that they're building and promoting, or starting to promote. Any takes on how that fits into meeting the requirements that you're describing, or are there other efforts like that?

Yeah, I like that direction. I'd actually love to talk to Vijoy, because I'm interested in how that work he's doing can play nicely with the stuff we're doing. SLIM has a bunch of stuff built in that I think is really important. I don't think it has UDP support yet, so from my perspective that would be a really good thing to add. There aren't any other real-time-oriented transport standards yet, other than what we're doing in the Pipecat ecosystem, where we've defined a standard that lets everybody plug into Pipecat.

I think there will be a real need for a real-time standard. The way I talk about it with the partners we bring into the Pipecat ecosystem is, for better and worse, because standards are always like that, the OpenAI chat completions HTTP standard became the standard for everybody who does text-based inference, and now people have built a bunch of stuff on top of it and a bunch of improvements to it. But chat completions is what we all use: I might use OpenAI or I might deploy my own model using vLLM or whatever, and I'm going to use chat completions. We don't yet have that standard for real-time multimedia. We need it, and we are definitely working towards that in the Pipecat ecosystem, because I think we've learned a lot of lessons about what that standard needs to look like.
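For reference, the chat completions shape he's describing is just an HTTP POST, and the same request body works against OpenAI or against a self-hosted server that exposes an OpenAI-compatible endpoint (such as vLLM's). The base URL, environment variable names, and model name below are placeholders.

```python
import os
import requests  # pip install requests

# Works against api.openai.com or any OpenAI-compatible server; swap
# base_url and "model" for whatever your deployment is actually serving.
base_url = os.environ.get("LLM_BASE_URL", "https://api.openai.com/v1")
resp = requests.post(
    f"{base_url}/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['LLM_API_KEY']}"},
    json={
        "model": "gpt-4o-mini",
        "messages": [
            {"role": "system", "content": "You are a concise voice agent."},
            {"role": "user", "content": "Remind me when my first payment is due."},
        ],
        "stream": False,
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```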

Yeah. And I think this kind of goes back to something we were talking about earlier, or even goes back to the story I mentioned with my own experience. When a developer who's not used to living in the voice world hears UDP and WebRTC, they're like, wait, I don't usually have to think about that stuff. So talk a little bit about the APIs and abstractions that make sense in the voice context, whether it's what you're doing with Pipecat or other popular options if they differ, and how developers should think about the way they suggest interacting with voice LLMs.

Yeah, I mean, the basic idea that you sort of have to use as the first building block when you're thinking about writing these voice agents is: you've got to move the audio. And there are video agents too, but I'll just talk about audio for the moment, because it's simpler and video agents are earlier in the growth curve, even though they're super exciting. You've got to move the audio from the user's device to the cloud, where you're running some piece of code you wrote that takes the audio, processes it however you need to process it, runs one or more inference steps with one or more models, then generates audio at the end of that processing loop, and sends it back to the user, and then does that over and over and over again for every turn in the conversation, managing things like knowing that the user might interrupt the LLM and needing to handle that gracefully. Or you might even have long-running tool or MCP or function calls that are running in the background, and the LLM might actually want to interrupt the user at certain points. So as you start to build these things out in production and cover more and more use cases, you have more and more complexity. But the basic idea is: move the audio to the cloud, because that's where you have the processing power to get the best results from inference, and then move the generated audio back to the user so you can play it out in real time over the speakers or headphones that the user is wearing.
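Sketching that loop in code: the `transport`, `stt`, `llm`, and `tts` objects below are assumed placeholders for whatever network transport and model services you have chosen, not a specific framework's interfaces.

```python
import asyncio

# Illustrative shape of the server-side turn loop. The transport yields caller
# audio (already chunked by turn detection) and accepts generated audio back;
# the model helpers wrap whichever inference services you picked.

async def run_conversation(transport, stt, llm, tts, context):
    while True:
        user_audio = await transport.receive_user_turn()
        if user_audio is None:
            break                                   # caller hung up
        user_text = await stt.transcribe(user_audio)
        context.append({"role": "user", "content": user_text})

        speak_task = asyncio.create_task(
            speak_response(transport, llm, tts, context)
        )
        # If the user starts talking again mid-response, cancel playback.
        # This is the interruption handling the orchestration layer owns.
        interrupted = await transport.wait_for_user_speech_or(speak_task)
        if interrupted:
            speak_task.cancel()

async def speak_response(transport, llm, tts, context):
    reply_text = await llm.complete(context)
    context.append({"role": "assistant", "content": reply_text})
    audio = await tts.synthesize(reply_text)
    await transport.send_audio(audio)
```

A production loop would stream each stage rather than waiting for full results, but the structure of "receive a turn, run inference, send audio back, watch for interruptions" is the same.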

I think that maybe gets us to challenges. One of the ones, well, I guess we were kind of going through challenges already, so maybe it doesn't matter. One of the things you just mentioned there is activity detection and interruption handling. And that is something that, for folks who have tried to use voice AI systems, whether it's OpenAI or Gemini Live, strikes me as the biggest user experience hurdle that we have right now. I'm curious whether you agree with that, but also how you see that evolving. I find that I enjoy using ChatGPT Advanced Voice Mode or Gemini Live, but the conditions have to be fairly perfect for it to not feel like a kind of stunted conversation. Forget about doing it in the car; it's very difficult. So what's the path as an industry for us to overcome that? Is it better models, is it better infrastructure, is it at the API level, glue or pre-processing or something? How do we get there? And starting with: do you agree that that's a big hurdle? Are you seeing that also, or am I just using old models or something?

So, I think there's an existence proof that you can use LLMs in conversation very flexibly from the growth of the enterprise voice AI stuff we see. It's a little bit under the radar to people who are building consumer stuff or just experimenting with the new tech, and I mean that in the best possible way. But I have had multiple industry analysts tell me that the fastest-growing gen AI use cases today from a monetization perspective are programming tools, and the second fastest is enterprise voice AI. So there are things like call centers that are now answering 80% of their calls with voice agents. Or financial services companies where somebody's just taken out a new mortgage, and they really, really want to remind people that the first mortgage payment is coming up, because that's a known failure mode: you take out a new mortgage, you think you've done all the paperwork to get your bank account wired up, and you haven't. It's the end user's, the customer's, fault if that happens, but nobody wants it to happen. Everybody wants that mortgage payment to happen seamlessly.

So you didn't have the human staff bandwidth to call every single new customer five days before their first payment is due. Before, you just couldn't do it cost-effectively. Now you can do that with voice AI. We see a number of our partners and customers doing things like answering the phone for small businesses. They start out answering the phone with an AI agent when the business is closed, when they didn't have anybody answering the phone before. That goes so well that after three or four or five months, they're answering the phone all the time, and humans are only picking up the phone when you actually really need a human, which is 20% or less of the calls, usually.

So there's just a huge amount of growth in these really working enterprise voice agents. And I think the delta between what we're seeing there on the enterprise side and what you're seeing on the ChatGPT Advanced Voice, Gemini Live side, which I agree with, is that if you really are strongly incentivized to build a product that has a particular surface area that works, you're taking certain approaches. If you're building from the models up, and your goal is to build the state-of-the-art model and then wrap it in functionality that sort of shows how to use that model, you're doing something very different: your pain points are different, your timelines are different, your goals are different. I love what the Live API team and the Realtime API team are doing at those two big labs. I also think that, if you want the hot-take expression of it, those are demos, not products. The version of it you interact with is a demo, not a product. They could be products, but for a whole variety of structural reasons at OpenAI and Google, they are not products today. It would be interesting to me, just from a thought-experiment perspective, what it would look like if either of those companies were super serious about that product surface area. They might become so, but today they're not. So you can solve all those problems, like background noise or interruption handling or maintaining the context in flexible ways depending on exactly what is happening in the conversation. Those are solvable problems today, but they're product problems, and you have to have a product team that's working on those problems full-time.

Yeah, that's a super interesting take, and I don't think it would surprise anyone, right? Like, ChatGPT wasn't ever meant to be a product itself, right? It was meant to be a demonstration of capability. And you can see that, if nothing else, in that OpenAI has a lot more to gain by getting folks excited about the idea of using voice, and chewing through a lot of voice tokens, than they do necessarily from investing a lot of money in a specific voice product. Now, there's always a product manager's view on that, like how good does it have to be to really inspire people versus not, but like you said, that's a product decision as opposed to the technology. I think it raises the question that their view of the world, and this is something that came up in our conversation with Google as well, is very much a kind of single-big-model view of the world, as opposed to building modular systems. And it sounded like one of the distinctions you were making between what they're doing and what you might need to do from a product perspective is building out a specific subsystem that's looking for background noise or trying to detect interruptions. Is that kind of the direction you were going?

Yeah, almost all of the production voice agents today, especially on the enterprise side, are multi-model. So they've got a transcription model, and then an LLM operating in text mode, and then a voice generation model. And you've usually got a little dedicated voice activity detection model to help you with turn detection. And you might actually have a semantic turn detection model as well in that pipeline. If you're an enterprise and you're really concerned about certain kinds of compliance and regulatory stuff, you might also have a model doing some inference in parallel with the main voice conversation pipeline, something like a guardrails or content model. You might be doing a bunch of other stuff. So one of the things I think distinguishes voice AI from a lot of other use cases is that basically every voice AI agent today is multi-model as well as multimodal. And that's just a very different architecture from what Google and OpenAI are pushing towards, this worldview where these incredible state-of-the-art models kind of do everything; they're much less multi-model in their philosophy. I think that's actually a really interesting question for all of generative AI, one that I think all of us who are building these solutions think about at least a little bit: how much does the future world where we're building this stuff look like we're using those SOTA models, and how much are we using smaller, mid-sized, maybe fine-tuned models, or are we sort of doing all of it depending on what we're doing at the moment? I think nobody really knows, because, as you said, this technology is evolving so quickly.
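To illustrate the "model running in parallel with the main pipeline" idea, here is a toy sketch of a guardrails check riding alongside the main reply generation. Both helper coroutines are stand-ins, not any vendor's API.

```python
import asyncio
from dataclasses import dataclass

# Illustrative only: run a guardrails/compliance check in parallel with the
# main reply generation and suppress the reply if the check fails.

@dataclass
class Verdict:
    allowed: bool
    reason: str = ""

async def generate_reply(context: list[dict], user_text: str) -> str:
    await asyncio.sleep(0.2)          # pretend this is the main LLM call
    return "Your first payment is due on the 1st."

async def moderate(user_text: str) -> Verdict:
    await asyncio.sleep(0.05)         # pretend this is a small classifier
    return Verdict(allowed="routing number" not in user_text.lower())

async def guarded_turn(context: list[dict], user_text: str) -> str:
    reply, verdict = await asyncio.gather(
        generate_reply(context, user_text), moderate(user_text)
    )
    return reply if verdict.allowed else "Sorry, I can't help with that here."

print(asyncio.run(guarded_turn([], "When is my first payment due?")))
```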

Yeah. I think that theme, for folks that listen to the podcast, that theme of build a system out of modules versus train some end-to-end thing with lots of data and solve the problem that way, is one that comes up quite a bit. The example that comes to mind is autonomous vehicles, or robotics and embodied AI more broadly, because we've got this rich history of physics-based models, or SLAM-based models in the case of autonomous vehicles, that can play a role in the solution and have a lot of interesting properties. But that is often put up against the promise of the model being able to figure out things on its own that we can't teach the model based on our own view of the world. And so it's interesting to hear that in this space as well, that the modular approach is kind of where folks are building today.

And I do think that will change, but I think it'll change in complicated ways that are hard to predict. I mean, we definitely feel that tension you're talking about every day, because the speech-to-speech models from OpenAI and Google are genuinely better at audio understanding and at natural voice output. So one thing I often tell people who are asking me for advice is, if you're building something like a language learning app or a storytelling app for kids, you probably want to use those speech-to-speech models. But if you're building something where you've got to go through a checklist, like collect a bunch of information from a healthcare patient before their visit, you really probably want to use the text-mode LLMs and a multi-model system, because you can guide and control and eval in real time whether you're getting what you need, just much, much more reliably.
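As a tiny illustration of what "guide and control and eval in real time" can look like in a text-mode, multi-model setup: after each turn you can check which checklist items have actually been collected and steer the next prompt accordingly. The field names here are invented for the example.

```python
# Illustrative only: track whether a checklist-style agent has collected the
# required information, based on state extracted from the text transcript.
REQUIRED_FIELDS = ["full_name", "date_of_birth", "current_medications", "allergies"]

def missing_fields(collected: dict[str, str]) -> list[str]:
    """Return the checklist items the agent still needs to ask about."""
    return [f for f in REQUIRED_FIELDS if not collected.get(f)]

def next_instruction(collected: dict[str, str]) -> str:
    """Steer the text-mode LLM by appending an explicit instruction."""
    missing = missing_fields(collected)
    if not missing:
        return "All required intake items are collected. Confirm and wrap up."
    return f"You still need to collect: {', '.join(missing)}. Ask about one of these next."

# Example: after a few turns, the extraction step has filled two fields.
state = {"full_name": "Jane Doe", "date_of_birth": "1990-04-12"}
print(next_instruction(state))
```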

And, you know, I always said we would never train models at Daily, because what we do is infrastructure. But I got so frustrated by the turn detection problem late last year that, when Christmas break came around and I didn't have to do actual meetings-style work all day, I trained a version of a turn detection model, an audio-input turn detection model, that came out well enough that we released it, and now there's a Pipecat ecosystem around it. And there's a really, really good, totally open source, totally open data, open training code turn detection model. And I fully anticipate that turn detection model will not be useful two years from now, because we will have embedded that functionality into these bigger LLMs. But you sure need it now to build something that has kind of best-performing conversational dynamics in an agent.

Yeah, that's super interesting. And I'd like to maybe dig into that in a little bit more detail, if only to help folks get a sense for how these modules fit into a bigger system. So talk a little bit about what the inputs to this turn detection system are, what the outputs are, and how those are used in orchestrating an AI flow.

Yeah. So the classic pipeline looks like: you've got audio coming in from the network connection. You have to chunk that audio up into segments, because no matter what kind of LLM it is today, the LLMs all expect you to ask them to do one thing at a time; you have to fire inference. And this is another tiny little aside, but another big architecture leap I expect to happen in these LLMs in the near future is, yeah, 100%, bidirectional streaming all the time. You're always streaming tokens in, you're always streaming tokens out. When the LLM isn't talking, those tokens are silence tokens or stop tokens or whatever you want to call them; when the LLM is talking, they're meaningful tokens. But you should always be streaming.

Or thought tokens.

Totally, right. And there are some architectural experiments where you actually have multiple output streams: an audio output stream, a text output stream, some kind of internal dialogue output stream that's being fed back in all the time. So there is going to be new architecture stuff that changes how we think about these things. But today, you take that audio, you chunk it, and you start to think about how to feed it to the LLM. You're making those chunks based on trying to decide when the user feels like they're done and they expect the LLM to respond. And that's called turn detection. So the voice activity detection you're talking about as not feeling fully natural is today just a fixed window of the user not talking anymore. It's like 800 milliseconds. If the user doesn't talk for 800 milliseconds, you decide to respond. That is not great, because often I pause longer than 800 milliseconds when I'm trying to figure out what to say to a human or an LLM.

Well, just, people have to look off to the side and figure out what they want to say and come back, right?

And depending on the conversational flow, that can be very short or that can be very long, even in one sentence, even with one person's speaking patterns.

Sorry, the thing that I experience the most, though, I think it's the flip side of that, is maybe overaggressive turn detection, I don't know what it would be called, but it's like Advanced Voice Mode speaking and then it just stops. Like, I said something, but I said nothing. It just heard some background noise and it got thrown off track, and it's waiting for me to say something, but it can't figure out that I'm not saying anything.

So those two things are linked; slightly different problems.

Okay, talk about the linkage.

Yeah, they're linked because they're implemented in the pipeline by the same components. And that's a good call-out that maybe they should be more specialized components as we evolve this stuff. But the beginning of almost every voice pipeline is a small specialized model called a voice activity detection model. And that voice activity detection model's job is to take, you know, 30 milliseconds or so of audio and say, this looks like human speech, or this doesn't look like human speech. It's a classification model. And then you do both turn detection and interruption handling based on that model's classification of those speech frames. So the turn detection would be: say there's an 800-millisecond gap, that's a turn. The interruption handling would be: I got three frames in a row that look like speech, that's an interruption. And so if that's not tuned exactly right, you cough and it can cause the model to be interrupted, or somebody's playing a really loud radio in the next car over and the radio announcer is like, call K105 now, and that's an interruption.
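The naive version of that logic is easy to sketch. This is illustrative only: the VAD classifier itself is stubbed out, and the thresholds mirror the numbers just mentioned (roughly 30 ms frames, an 800 ms silence window, three consecutive speech frames).

```python
# Illustrative only: naive silence-window turn detection plus frame-count
# interruption logic. A real pipeline would feed ~30 ms audio frames through
# a trained VAD model; here the classifier's output is just passed in.

FRAME_MS = 30            # duration of one audio frame fed to the VAD
SILENCE_MS = 800         # fixed end-of-turn window
INTERRUPT_FRAMES = 3     # consecutive speech frames while the bot is speaking

class NaiveTurnTracker:
    def __init__(self):
        self.silence_ms = 0
        self.speech_run = 0

    def on_frame(self, is_speech: bool, bot_speaking: bool) -> str | None:
        """Return 'end_of_turn', 'interruption', or None for this frame."""
        if is_speech:
            self.silence_ms = 0
            self.speech_run += 1
            if bot_speaking and self.speech_run >= INTERRUPT_FRAMES:
                self.speech_run = 0
                return "interruption"     # a cough or loud radio can trip this
        else:
            self.speech_run = 0
            self.silence_ms += FRAME_MS
            if not bot_speaking and self.silence_ms >= SILENCE_MS:
                self.silence_ms = 0
                return "end_of_turn"      # even if the user was just thinking
        return None

# Example: ~27 silent frames after the user stops talking ends the turn.
tracker = NaiveTurnTracker()
events = [tracker.on_frame(is_speech=False, bot_speaking=False) for _ in range(27)]
print([e for e in events if e])   # ['end_of_turn']
```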

So the next step in both turn detection and interruption handling is to make those two components more sophisticated.

Make them more semantic. Make them more aware that some kinds of speech are background speech, not primary speech. There are a bunch of techniques for that, and we are making progress on both those problems, but we're definitely not universally there yet.

Yeah, I think it was actually at I/O they did a demo where someone's talking to a voice agent, and then someone comes into the room and is talking to them, and it doesn't throw it off at all. So that's maybe an existence proof of progress there. I don't recall what specifically they were doing to enable that. You may know.

So this is another good example of the small model versus big model approach, both of which are valuable. The big models are starting to be trained to understand interruptions natively, and to be able to understand both, because if they're multimodal they have access to the audio and they also have a lot of semantic understanding of how language works. So when you combine those two things, you ought to be able to tell, hey, this is a radio in the background talking, this is not the person I'm talking to, and ignore everything that's not the person you're talking to.

So you can do that with the big model.

You can also specially train a small model to try to separate out foreground and background speech. So one of the models I often recommend to people, which is extremely good at that, is a model by the company Krisp, which you may know of because they have some really good desktop audio processing applications.

They also have models that are designed to be run as part of these generative AI workflows to do exactly this kind of primary speaker isolation. And running

those models as part of that initial stage of the pipeline makes a huge difference in enterprise reliability.

Okay. And so what if someone is starting, and they're listening to what we're saying, and they were thinking that they had this problem to solve and they needed to call an API, and now the problem just got a lot bigger because they need all these different components as part of an orchestrated system. Should they have that fear, or are there templates or something that they can get with Pipecat that does all of the crap that they don't care about, and they could just plug in their thing?

Yeah, there are various starter kits for different use cases in the Pipecat open source repos that are 75 or so lines of Python code, including all the imports, and have all of these pieces, totally standard, in the pipeline. And you can just change out the prompt and you've got a working voice agent that you can run locally, you can deploy to the cloud, and you can start to iterate.

You suggested this earlier as we were kind of ticking off challenges, but evals have got to be a big one. It's a challenge for folks that are building text-based applications. Now we're starting to make progress there, but it's an evolving practice. What's the state-of-the-art or landscape like from a voice perspective?

Last year, almost all of us had only vibe-based evals for our voice agents.

Sounds like text-based agents, actually.

Yeah, it kind of does. But I think we're probably a little bit, maybe six months, behind the text-based agent teams in getting all the way there, although we're making progress. And there are a couple of things that are harder for voice agents about evals. One is they're always multi-turn conversations; just the definition of a voice agent is that it's a fairly long multi-turn conversation. And the other is that, whatever your pipeline is, if it's the kind of three-model transcription, LLM, speech-model pipeline, or if it's the voice-to-voice pipelines from the Live API or the Realtime API, you've got audio in there as well. So you've got this end-to-end problem that includes not just text but audio. You have to figure out: do you just do your eval based on text, or do you try to incorporate all the failure modes in audio that are additional to the text failure modes? So those are the things we grapple with kind of uniquely in the voice space.

I think it's worth talking a little bit, and I'm curious how you're thinking about this, because you talk to lots and lots of people. What we have learned in the voice space is that the multi-turn stuff takes you way out of distribution for the current training data from the big models. And you can look at all the benchmarks for, here's how good instruction following is, here's how good function calling is. Those are totally a good guide to how well your agent will perform for the first five turns of the conversation. As you get 10, 15, 20 turns deep, your actual performance on instruction following and function calling falls off a cliff. So you almost have to build custom evals in the voice space, because you kind of don't have benchmarks; you're kind of out of distribution. Every agent is just different. I've had people tell me Gemini 2.5 Flash just doesn't do what I want it to do at all, and other people tell me Gemini 2.5 Flash is the best model by like a factor of five for my voice agents. Like, yeah, I get it. We're just kind of... Do you see that in the non-voice space as much?

I was just going to say, I don't think that's unique to voice. I think for a while now the public benchmarks have become noisier and noisier with regards to an individual engineer's ability to get the results they want for their thing. And so oftentimes now, when there's a new model, you look at the model's performance on the benchmarks, but then you're also going to social media and hearing people talk about their private benchmarks, and you're running it against your own, whatever your pet problem is, or your product requirements and how you've captured those in an eval or benchmark. I think it's the same across the board. But it does strike me as being harder with voice, for the reasons that you mentioned.

Like, if text is my intermediary, that maybe solves a lot of the problems. But, for example, the problem we were talking about with voice activity detection and turn-taking, it doesn't necessarily help me evaluate that part of the process unless I'm end-to-end feeding some voice in and somehow instrumenting the system so that I can evaluate it. Even thinking about how I might do that, it's non-obvious how I would do that end-to-end. Sure, you can evaluate a voice activity detector in isolation, but as part of an end-to-end problem it becomes, I think, a little bit more interesting.

Yeah, and how do you build a success-metrics rubric when you've got even more moving parts, including things like conversation length, number of pauses in the conversation, were those pauses expected or not, number of interruptions in the conversation? Is that good? Is that bad? You really have to build up the intuition. And I think this is similar to all evals, but your domain is going to be specific to your application, always.

I often tell people: get to the point where you feel confident that the agent works based on everything you've been able to throw at it from a vibes perspective, and then do a little bit of production rollout. But before you do the production rollout, make sure you can capture all the traces, or at least capture all the text. And then you will start that data flywheel, where you've got enough captures that you'll be able to manually start to build that intuition up about what success is and what failure modes are. And then you can iterate on that in a bunch of ways, including just doing text-based evals for a while; that's totally better than nothing. Or start to either build yourself, or leverage, some eval or ops platform tooling that's more specific to voice, and more and more of the eval tooling folks are starting to build audio support, which is great. There are, I think, at least half a dozen Pipecat integrations with good ops and eval tools that hopefully make it easier, once you get some of that data flowing into the system, to do that kind of end-to-end analysis you're talking about.
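A minimal sketch of that starting point: log the text of every turn as JSONL traces, then run simple text-based checks over a conversation. The rubric fields here are toy examples, not a recommended eval.

```python
import json, time

# Illustrative only: capture each turn's text (plus a timestamp) as JSONL so
# you can run text-based evals later. The checks are toy per-conversation
# criteria of the kind you build up from real traces.

def log_turn(path: str, conversation_id: str, role: str, text: str) -> None:
    record = {"ts": time.time(), "conversation": conversation_id, "role": role, "text": text}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

def load_conversation(path: str, conversation_id: str) -> list[dict]:
    with open(path) as f:
        rows = [json.loads(line) for line in f]
    return [r for r in rows if r["conversation"] == conversation_id]

def eval_conversation(turns: list[dict]) -> dict:
    """Toy rubric: did the agent mention the due date, and how long did it run?"""
    agent_text = " ".join(t["text"].lower() for t in turns if t["role"] == "assistant")
    return {
        "num_turns": len(turns),
        "mentioned_due_date": "due" in agent_text,
        "under_20_turns": len(turns) < 20,   # deep conversations drift off-distribution
    }

log_turn("traces.jsonl", "c1", "user", "When is my first mortgage payment due?")
log_turn("traces.jsonl", "c1", "assistant", "Your first payment is due June 1st.")
print(eval_conversation(load_conversation("traces.jsonl", "c1")))
```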

Yeah. One of the things that I find interesting in this conversation, and I think it goes back to a conversation I had not too long ago with Scott Stephenson, who founded Deepgram, is, I guess I would put it as: in the context of this modular versus single-model architecture, text as an intermediary is an observability strategy. Right? We don't even really know, unless you're talking about Anthropic circuit-tracing class things, how to observe inside that single multimodal LLM. You get a lot just by using text as an intermediary, in terms of being able to evaluate and monitor what the system is doing and enforce some controls, etc.

Yeah, you really do. One way to put it is that almost every enterprise use case needs that text, for observability, for compliance, for other reasons. You kind of have to have it. I also think it's really interesting to think about the ways you could use text and audio together in consumer applications. We have been building enough of these things, long enough, that we've started to have a lot of fun, I think, and some opinions about how these UIs need to evolve. I was having a conversation with one of the big labs' people before they released a voice product, and they said to me, "Yeah, nobody wants to see the text. When you're in voice mode, you just want to talk." And I was like, no, actually not. There's a whole slice of these use cases where what you want is to talk, and then you actually primarily read, and you maybe have the voice on because that's a useful channel and you can look away if you want to, but the mode is audio-in, text-out from a cognitive perspective. And then there are other voice applications where you literally have no way to display the text, because I've called the agent on the phone or whatever. So there's just this huge spread of use cases. As we move towards these kind of next-generation UIs, you have to figure out how to support a huge variety of things that people are going to actually want to do, and text matters, voice matters, images matter. Increasingly, video input and output are super useful modalities. So there's just a ton of new user interface experimentation that I think we are just barely starting to do.

You've mentioned video a few times. What are some of the use cases you're starting to see, and what are you excited about in terms of opportunities there?

I'm excited about the real-time video avatar and real-time video scene models getting out of the uncanny valley, into a place where they're as good as the real-time voice models we have today. I mean, we have this progression of technology throughout history; it's always text, audio, video, right? And as you add video, you get this deeper level of engagement and connection. So I'm a big believer that a lot of the things we do with real-time AI conversation are going to have a video component. I think it's a year or two away before we're all the way there, but I think we're starting to see really interesting adoption, from models from people like Tavus and Lemon Slice. We're seeing the adoption there, I think, mostly in things like education and corporate training and job interviews, but as the cost comes down and the quality goes up, I think we're just going to see a huge amount of social and gaming use cases. I mean, one of the thought experiments for me is, what would TikTok look like if it wasn't feeding me a bunch of pre-recorded video really well tailored to my revealed preferences, but if it were generating it?

Yeah. 100%. That's what the next TikTok is going to be, right? Well, we've seen Amazon start to experiment with this choose-your-own-adventure style of production. But that's very granular; it's still very much a traditional production model. If those things can be generated on the fly, that's a very different world for them, and for media production and consumption.

One of the first voice AI things we released publicly, in like 2023, was a choose-your-own-adventure voice-interactive story generator for kids. And it was a very clarifying moment for me when we built that and thought it was compelling enough to show other people, because lots of us have kids at Daily, and our kids were just like, "Oh, yeah. No, I would obviously talk to this thing forever." And you can see technological progress sometimes best when you see people younger than you take to it in a new way.

You know, we've talked a little bit about the challenges from a voice perspective. Are those challenges kind of the same but more for video, or does video introduce new challenges, or what are the new challenges? I'm imagining it's "yes, and."

The single biggest challenge for video right now is that it's so much more expensive that the use cases are limited. So that's going to take some time: pushing the per-minute cost of video down, which is basically the GPU cost.

So infrastructure, that was my question.

So I'm sorry, not infrastructure. Inference as opposed to transport.

That's right. And storage. It's the inference; it's the GPU time. You just can't run that many simultaneous video generations on an H100 or whatever, so you just have a lot of GPU cost. As that comes down, I think that will go away. And then the next interesting thing is all the things we talked about with voice: latency matters, everything is sort of multi-model, you have a bunch of stuff going on that you've got to orchestrate. That's even more true for video, because if you think about video, what you've really got going on is an avatar, or more than one avatar. There's voice generation, there's body pose, there's facial expression. Those are like three different things. Now, maybe it's one model producing those things or maybe it's not, but those are three different things that something is orchestrating. Then there's the scene and the lighting and the camera movement. That's three more things that, if you're creating a really dynamic real-time video experience, you may want to change dynamically. So you're just layering on more and more multimodal complexity.
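
For a sense of what those "three plus three things" an orchestrator has to coordinate might look like in code, here is a minimal, hypothetical sketch; the class and field names are illustrative assumptions, not part of Pipecat or any avatar vendor's API.

```python
from dataclasses import dataclass, field

# Hypothetical directive types: three avatar-level signals plus three
# scene-level signals, mirroring the "three plus three things" above.

@dataclass
class AvatarDirective:
    speech_audio: bytes                  # generated voice for this time slice
    body_pose: str = "neutral"           # e.g. "lean_in", "gesture_left"
    facial_expression: str = "neutral"   # e.g. "smile", "surprise"

@dataclass
class SceneDirective:
    scene_description: str = "office interior"
    lighting: str = "soft_key"
    camera_move: str = "static"          # e.g. "slow_push_in", "cut_to_closeup"

@dataclass
class VideoFrameDirective:
    timestamp_ms: int
    avatars: list[AvatarDirective] = field(default_factory=list)
    scene: SceneDirective = field(default_factory=SceneDirective)

def orchestrate(frame: VideoFrameDirective) -> None:
    # Placeholder: hand the combined directive to whatever model or models
    # actually render the audio, the avatar motion, and the scene.
    ...
```

Whether one model or several produce these signals, something has to assemble a structure like this on every frame, which is where the orchestration cost shows up.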

That makes me wonder what you've seen in terms of pushing inference to the edge. Probably not much for video, but for voice it seems like we could be close to that. And if that's the case, how does the pipeline or the orchestration need to change to accommodate it?

That's a little bit related to the question about whether we're going to use these really big models for everything or whether we're going to use a bunch of different models. If you can use medium-sized or small models for everything in the pipeline, you can run a bunch of stuff on the edge. Now, on my fancy Mac laptop that I paid a bunch of money for, I can actually run a really good local voice agent. I can use one of the open source transcription models, I can use something like Google's Gemma open-weights 27B model or the Qwen 3 series of LLMs, and then there are several really good open source voice models. And I can just wire those things all up locally with no network at all. That's out of reach of most people's devices: the typical laptop can't run good enough models to do a good real-time voice agent, and the typical phone can't. But, you know, we're two, three, four, five years from the typical device being good enough, and being able to run a lot more stuff locally, or being able to run parts of the pipeline locally and call out to the cloud only when you really need more inference horsepower. And I do think that's the future. I think there are so many advantages to running a bunch of stuff locally that the hybrid pipeline, in the way I think about the world, the sort of processing pipeline, the hybrid pipeline is where we're going to get.
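
For illustration, here is a minimal sketch of what "wiring those things up locally" could look like. It is an assumption-heavy sketch, not the Pipecat pipeline: it assumes faster-whisper is installed for transcription, that an Ollama server is running locally with a Gemma-class model pulled (the model tag below is an assumption), and it stubs out TTS with a placeholder.

```python
import requests                      # for the local LLM HTTP call
from faster_whisper import WhisperModel

OLLAMA_URL = "http://localhost:11434/api/chat"  # assumes a local Ollama server
LLM_MODEL = "gemma2:27b"                        # assumed tag; any local open-weights chat model works

stt = WhisperModel("small")  # small open transcription model, runs on a laptop

def transcribe(wav_path: str) -> str:
    # Transcribe a recorded user utterance entirely on-device.
    segments, _info = stt.transcribe(wav_path)
    return " ".join(seg.text.strip() for seg in segments)

def local_llm_reply(user_text: str) -> str:
    # One non-streaming chat turn against the local model.
    resp = requests.post(OLLAMA_URL, json={
        "model": LLM_MODEL,
        "messages": [
            {"role": "system", "content": "You are a concise voice assistant."},
            {"role": "user", "content": user_text},
        ],
        "stream": False,
    }, timeout=120)
    resp.raise_for_status()
    return resp.json()["message"]["content"]

def synthesize(text: str) -> bytes:
    # Placeholder for a local open-source TTS model; returns raw audio bytes.
    raise NotImplementedError("plug in a local TTS model here")

if __name__ == "__main__":
    text = transcribe("user_turn.wav")
    reply = local_llm_reply(text)
    audio = synthesize(reply)  # then play it back to the user
```

A real agent would run these stages concurrently and stream partial results forward with interruption handling, which is exactly the orchestration work a framework like Pipecat packages up.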

And is the pipeline amenable to that, or where are there tight couplings? For example, I'm thinking about voice activity detection. That's probably a small enough model that I could run it on my device, but is the latency between that and the cloud, where everything else is being processed, so large that the signal from the voice activity detector is out of date by the time it gets to the rest of the pipeline? Issues like that have to be significant barriers to hybrid pipelines. Where are there specific places where you see opportunity to shift out to the edge?

Yeah, it's a great question, and there's no one-size-fits-all answer, partly because the technology is moving so fast and partly because there's a big diversity of use cases. In some ways, the biggest reason just to do everything in the cloud today is that if you're doing multiple inference calls, you really want everything to be as close as possible to the inference servers. You don't want to be making multiple round trips from the client to the cloud, because the worst connection is always going to be the edge device to the cloud; the best connection is going to be server to server once you're already in the cloud. So it is easier to engineer everything as: send the audio and video to the cloud, do whatever inference you need, send audio and video back. But cost, privacy, and flexibility definitely motivate towards running pieces of it on device. And even though it's harder, I think it's kind of just engineering to figure out how to build the abstractions that make it pretty easy to build those pipelines.
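
As one concrete example of the hybrid split being discussed, a voice activity detector is small enough to run on the client and gate what gets streamed to the cloud. A minimal sketch, assuming the webrtcvad package and 16 kHz, 16-bit mono PCM input; the send_to_cloud function is a hypothetical placeholder.

```python
import webrtcvad

SAMPLE_RATE = 16000                 # webrtcvad supports 8/16/32/48 kHz
FRAME_MS = 20                       # webrtcvad accepts 10, 20, or 30 ms frames
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 16-bit mono samples

vad = webrtcvad.Vad(2)              # aggressiveness 0 (loose) to 3 (strict)

def send_to_cloud(frame: bytes) -> None:
    # Hypothetical placeholder: forward this frame over WebRTC or a WebSocket
    # to the cloud-hosted part of the pipeline.
    ...

def gate_audio(pcm: bytes) -> None:
    # Run VAD locally and only ship frames that contain speech. The
    # speech/no-speech decision happens on-device, so the edge-to-cloud hop
    # only carries frames worth transcribing.
    for offset in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
        frame = pcm[offset:offset + FRAME_BYTES]
        if vad.is_speech(frame, SAMPLE_RATE):
            send_to_cloud(frame)
```

In practice you would also buffer a few frames of leading and trailing audio so you don't clip word onsets, but the division of labor (cheap detection locally, heavy inference remotely) is the point.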

Yeah, just more code to write, more orchestration-layer code to write.

Along those lines, with Pipecat being new and this technology evolving quickly, do the codegen models, vibe coding platforms and the like know about it? Are folks having good success building applications using those kinds of tools? Are there ones that do better than others, or is it Claude Code and the usual suspects for everything now?

Yeah, I mean, this is very much on my mind, because I increasingly see a lot of code in the Pipecat Discord that's clearly AI generated. Which I think is great: we are going to have a new generation of programming tools that make all of us more productive. It is challenging, partly with open source stuff that has changed a lot in the year since, you know, there was a beta and it's now stable, but there are old versions floating around. I think the coding tools have trouble with that. I've also been trying to figure out how you package this up. I don't think this is a solved problem, and I would love to hear from people who have solved it better than we have. There are a bunch of good canonical examples of what code structure in Pipecat should look like, and they're all in the main repo. They aren't necessarily installed in your Python environment locally, because they're examples. So it's not clear to me how to make Cursor and Windsurf and Claude Code know that those are the canonical examples, and the project is big enough that none of those tools, as far as I can tell today, can pull everything into context. So the more agentic the tooling is, the better job it does generating Pipecat code. Claude Code is pretty good. Vanilla Windsurf, which I use every day, without a bunch of help is not so good. So I would like to figure out how we point a programming tool at the canonical examples so they're always in the context, so you don't have the mistakes that seem super solvable, like the import being wrong. If I add something to the middle of my Python code, it sure seems like the import that gets autogenerated at the top of the file should always be right. But that's not true today. And I feel like that's solvable, but not completely solved yet.

Is part of it a conventions.md file that you have in your repo, so that, at least until that becomes a convention and the tools look for that file, you can tell people to register it with their tooling? I know Cursor has its version of project files, and the other tools do as well. Is that part of it, or does that not fully get you to where you're trying to go?

No, I think you're right. I think that's the approach, and maybe somebody just needs to take a week and figure out how to make that file, or maybe that MCP server or whatever, for Pipecat work across all the tools. I've hacked improvements in for my own workflow, but it's clearly not a packaged thing yet. And it really should be. And yeah, maybe somebody just needs to take a week and sit down and make it work across all the common AI editors.
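
As a purely illustrative sketch of what such a conventions file might contain; these rules are assumptions drawn from the discussion above, not official Pipecat guidance or a file the project ships today.

```markdown
# conventions.md (hypothetical example for a Pipecat project)

## Canonical examples
- Generate agent code by following the examples in the main Pipecat repo,
  not older snippets found on the web.
- Target the current stable release; do not use beta-era APIs.

## Code structure
- Build agents as a pipeline of processors (transport in, STT, LLM, TTS,
  transport out) wired together by the framework's pipeline runner.
- Keep tool/function definitions in one module so they are easy to audit.

## Imports
- Re-check imports after inserting code mid-file; verify every import
  against the installed package version before finishing.
```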

You mentioned MCP. What are the interaction points or opportunities with regard to MCP and these other agentic protocols, and voice AI, voice agents generally, Pipecat in particular, in that whole space? What are you seeing with regard to the use of MCP and what have you?

Lots of excitement about MCP. There's native Pipecat MCP client support in the repo, so you can just add MCP servers, and then, like everything in AI, you have to prompt appropriately so that you get what you need from those MCP servers, but then they're just in the pipeline.

Meaning, in your Pipecat-orchestrated workflow, Pipecat can call out to an MCP server. So in that sense, Pipecat is like a voice AI application server, and it's calling out to various MCP servers as a back end. So you don't have to wire that up.

Exactly. And we built that on top of the native function-calling abstractions in Pipecat, because I think in general that's how MCP servers are accessed by LLM-driven workflows, right? There are different ways to access MCP servers, but if what you're doing is talking to an LLM and that LLM is talking to an MCP server, usually what's happening is there's a set of tool call definitions that are the glue. The difference between statically defining all the tool calls and just defining the tool calls that can access the MCP servers is that you have all that beautiful, brittle non-determinism that you've talked about in other podcasts, which gets you a long way and gets you more problems too, right? So you can just use MCP servers in a Pipecat voice agent. What I usually tell people, though, is: don't use an MCP server unless you have a very good reason to use MCP, and that's for two reasons. One is that non-determinism. Start with determinism if you know what you need; move to non-determinism if you have a specific need for non-determinism. The reasons you might want an MCP server are: you are building an ecosystem and you want other people to be able to add stuff to your ecosystem, and MCP is a very good abstraction for letting other people add stuff into your agent ecosystem. Another reason might be that you really do have a specific workflow where that non-determinism is valuable, and packaging up a bunch of endpoints into a single MCP server is a lot more maintainable and modular, if what you're trying to do is have the LLM that's driving the conversation pick from a whole bunch of different things to do anyway. But if you just have four or five things you know your agent needs to do, hardcode those tools. Don't wrap them in an MCP server, because you're going to get more evaluable, better results, and you're going to get lower latency.

It's interesting, because I think about it almost the opposite way. You said start with determinism and, if you have a specific need for non-determinism, then use the MCP server. I tend to think of it as: you've got this generic MCP server, use that for your proof of concept, but then there's going to be some subset of all the tools that that thing exposes, and you'll figure out through your POC and user interactions what those are, then take off that outer wrapper of the API and just use APIs.

Totally, no, that 100% makes sense. The gloss I would put on top of that for the voice-specific development cycle is: know that in that first iteration you're going to have much, much higher latency than you're going to aim for in production. You're going to get to a point where you're like, okay, I need to bring my latency down from 3 seconds to 1.2 seconds, and one of the big things I'm going to have to do is rip out as many of those MCP calls as possible.

Okay, makes sense. And it calls to mind a question. In this space, picking apart what is observability versus what is eval versus what is test and measurement, there are all these different terms, but I'm envisioning, in the voice space in particular, a category of tool that shows me step-by-step latency throughout my workflow. Does that exist, or is it easy to do by hand, or is it something that is lacking and really needed?

The Pipecat pipeline will produce metrics frames that show the latency of each step in the pipeline, so it'll give you a good starting point. There are a couple of things that are harder. Like everything, when you're really trying to dig down and pull out the last bits of latency, there are a couple of things.

The leaky parts of the abstraction.

Yeah, exactly. All the abstractions are leaky, and all of the measurements are gappy, right? There are always gaps between your measurements, so you've got to be aware of those. There are a couple of things people should be aware of. One is the network: the Pipecat pipeline is running somewhere in the cloud, and it's giving you the metrics for everything that's running in the cloud. Then you also have that edge-to-cloud latency. You can measure that programmatically, but it's actually very hard to measure programmatically in any perfect way. So, a little bit like how you should do evals by hand to build up an intuition, what I always tell people to do is: if you are building a production voice agent, record the conversation offline, record it from the client side, load that recording up into an audio editor, and measure the silence periods in the waveform. You can't cheat that; you can't get that wrong. And one of the things that will highlight, for example, is that a bunch of the voice models will produce fairly long runs of silent bytes at the beginning of their voice output, before the speech. You don't know that's happening if you're just measuring the time to first byte from that inference call. You actually have to look at those bytes.

Yeah. And there are good reasons they do that, because if you start with silence, you can tune the model to do much more complex things.

It makes me think of windowing protocols or something like that.

Yeah, totally. So there are a bunch of little things like that in the pipeline that you actually have to dig down and try to measure when you're really trying to squeeze the last 100 milliseconds or so out. But you get a long way with just the standard metrics: how long did transcription and turn detection take, how long did the LLM inference take to start streaming, how much did you have to buffer before sending to the voice model, what was the time to first byte from the voice model. You can add those up and get a pretty good starting point.
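
The "load the recording into an audio editor" check can also be roughed out in code. A minimal sketch, assuming a 16-bit mono WAV recording of the conversation captured client-side; the silence threshold and window size are assumptions you would tune by ear.

```python
import wave
import numpy as np

WINDOW_MS = 20          # analysis window
SILENCE_RMS = 300       # assumed threshold for "silence" in 16-bit sample units

def silence_spans(path: str) -> list[tuple[float, float]]:
    # Return (start_s, end_s) spans where the recording is below the RMS threshold.
    with wave.open(path, "rb") as wf:
        assert wf.getsampwidth() == 2 and wf.getnchannels() == 1, "expects 16-bit mono"
        rate = wf.getframerate()
        samples = np.frombuffer(wf.readframes(wf.getnframes()), dtype=np.int16)

    win = int(rate * WINDOW_MS / 1000)
    spans, start = [], None
    for i in range(0, len(samples) - win + 1, win):
        rms = float(np.sqrt(np.mean(samples[i:i + win].astype(np.float64) ** 2)))
        t = i / rate
        if rms < SILENCE_RMS:
            start = t if start is None else start
        elif start is not None:
            spans.append((start, t))
            start = None
    if start is not None:
        spans.append((start, len(samples) / rate))
    return spans

# The gap between the end of the user's turn and the first audible bot audio is
# the response latency the user actually experiences, including network transit
# and any leading silent bytes the voice model emitted.
for s, e in silence_spans("conversation.wav"):
    print(f"silence {s:7.2f}s -> {e:7.2f}s ({e - s:.2f}s)")
```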

Very cool. If you had to shout out a few interesting use cases that we haven't talked about, ones that are inspirational for you and the community, bonus if they're public and folks can play with them, but perhaps not, given that the focus is enterprise: what's really cool that folks are doing?

I'll take it out of the enterprise and give a couple of things that I think are super inspiring, that I've seen a bunch of good work on, and that I would love to see even more people work on. One is that I'm really convinced AI in education is going to be transformative for our world. Giving every kid a tutor is additive to everything we all do in the classrooms. I'm not talking about trying to replace the classroom teacher, but just giving every kid self-directed, infinitely patient, infinitely scalable one-on-one attention is amazing. And I don't think we're talking enough about the impact on childhood learning, and on adult learning too; I mean, I use LLMs in that way. And voice is a big part of that, because kids, like all of us, are very voice oriented. So voice-driven, but not only voice, tutors I think are really amazing, and I would love to see more people working on that.

The other thing I'm obsessed with is what it looks like when we generate UI on the fly. If I'm having a voice conversation with an application, I want it to write UI code and display that UI dynamically for whatever I'm talking about doing. And I see this increasingly in how I use programming tools. I was debugging some very low-level audio timing thing over the weekend, and in the past I would have dumped out very detailed logs and then written some code myself to analyze those logs. Well, this weekend I didn't write a single line of log analytics code. I captured all those logs and gave them to Claude Code and said, "Here's what I think we need to do to look at these logs. Can you do that?" And it could. It could debug based on very detailed audio timing logs. The next step would have been for it to graph and give me ways to drill down totally dynamically into those logs. So I would like to see people doing more experimentation with on-the-fly generated user interfaces. Shrestha Basu Mallick, who's the PM of the Gemini APIs, who you and I were both hanging out with at Google IO, and I did a talk at Swix's AI Engineer World's Fair with a little bit of autogenerated UI. And I would have liked to do a whole long workshop on that at the World's Fair, because I think that would have been an amazing workshop. But I think we should all figure out how to do that at some big event coming up.
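
A toy sketch of the on-the-fly UI idea, assuming the openai Python client pointed at whatever chat model you use; the model name, prompt, and file-based handoff to a front end are all illustrative assumptions.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; any chat-completions model works

def generate_ui(user_request: str) -> str:
    # Ask the model to emit a small, self-contained HTML fragment for the request.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name; swap for whatever you use
        messages=[
            {"role": "system",
             "content": "Return a single self-contained HTML snippet (inline CSS/JS, "
                        "no external resources) that renders a UI for the user's request."},
            {"role": "user", "content": user_request},
        ],
    )
    return resp.choices[0].message.content

if __name__ == "__main__":
    # In a voice agent, user_request would come from the transcribed conversation,
    # and the front end would swap this fragment into the page instead of a file.
    html = generate_ui("Show a table of the five slowest pipeline steps with a bar per row.")
    with open("generated_ui.html", "w", encoding="utf-8") as f:
        f.write(html)
```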

Oh, that's awesome. Well, Quinn, thanks so much for jumping on and sharing a bit about what you and the team have been up to. It's very cool and very interesting stuff, and certainly an exciting point in time for voice and video and multimodal AI, for sure. Thanks for joining me for the conversation.

Thanks for having me on. It's always super fun to listen to you and super fun to get to talk to you.

Absolutely. Thanks so much.

[Music] [Applause]
