No Priors Ep. 143 | With ElevenLabs Co-Founder Mati Staniszewski
By No Priors: AI, Machine Learning, Tech, & Startups
Summary
## Key takeaways
- **Polish Dubbing Horror Sparks ElevenLabs**: In Poland, foreign movies are dubbed with a single flat voice for every character, creating a terrible experience that people switch away from once they learn English, inspiring the need for high-quality, emotional dubbing in original voices across languages. [04:02], [04:20]
- **$300M ARR, 50/50 Self-Serve and Enterprise**: ElevenLabs has grown to 350 people globally, remote-first with hubs in London, New York, Warsaw, San Francisco, Tokyo, and Brazil, achieving $300 million ARR split roughly 50/50 between a self-serve creative platform with 5 million monthly actives and enterprise agents with thousands of customers, including Fortune 500s. [02:12], [02:34]
- **Lab Sequencing: Research Then Product**: They organize by creating labs of researchers, engineers, and operators, starting with a voice lab for narration and dubbing, then an agent lab to orchestrate speech-to-text, LLMs, and text-to-speech with integrations, sequencing research first and then simple product layers on top. [08:11], [09:20]
- **Proactive Voice Agents Boost E-Commerce**: With Meesho, India's biggest e-commerce shop, agents shifted from reactive support like refunds to a proactive frontend experience where users engage via a voice widget to navigate items, get gift recommendations, and check out based on offers. [18:55], [19:12]
- **Immersive Media: Talk to Darth Vader**: Working with Epic Games, they brought Darth Vader's voice into Fortnite, allowing millions of players to interact live with the character, shifting static IP to interactive immersive media. [20:14], [20:37]
- **Audio Needs Architecture, Not Scale**: In audio, architectural breakthroughs and model innovations matter more than scale; with 10 of the top 50-100 audio researchers, they beat big labs on TTS, STT, and orchestration benchmarks by focusing obsessively on research to production. [29:01], [29:23]
Topics Covered
- Polish Dubbing Reveals Voice AI Opportunity
- Sequence Research Labs Before Products
- Proactive Voice Agents Elevate Experiences
- Voice Unlocks Personalized AI Tutors
- Audio Needs Breakthroughs Over Scale
Full Transcript
Hi listeners, welcome back to No Priors.
Today I'm here with Mati Staniszewski, the co-founder and CEO of ElevenLabs, which was founded to change the way we interact with each other and with computers with voice. Over three short years, they've skyrocketed to more than $300 million in run rate. Mati and I talk about the future of voice in education, customer experience, and the other applications of voice, as well as how to build a multi-segment company from self-serve to enterprise and a combined research and product company.
Welcome, Mati.
>> Sarah, thanks for having me.
>> Our pleasure. And thank you for doing this at 7:00 in the morning. It's great we got to finally do this together. Uh, I think a lot of our listeners will have used or played with ElevenLabs at some point, but for everybody else, can you just reintroduce the company?
>> Definitely. We, uh, at ElevenLabs, we are solving how humans and technology interact, how you can create seamlessly with that technology. Um, what this means in practice is we build foundational audio models. So, models in the space to help you create speech that sounds human, understand speech in a much better way, or orchestrate all those components to make it interactive, and then build products on top of those foundational models. And we have our creative product, which is a platform for helping you with narrations for audiobooks, with voiceovers for ads or movies, or dubs of those movies to other languages. And our agents, uh, platform product, which is effectively an offering to help you elevate customer experience, build an agent for personal AI education, new ways of immersive media. Uh, but all of it is kind of under the light of that mission of solving how we can interact with technology, on our terms, in a better way.
>> You started the company in 2022.
>> That's right.
>> And you've had amazing, like, rocket ship growth since then. I'm sure it's felt up and down in different ways. I want to ask you about that. Can you give a sense of what the scale of the company is today?
>> So we've grown to 350 people globally. We started from Europe. We started as a remote company and are still remote-first, but have hubs around the world, with London being the biggest, New York being second biggest, Warsaw, San Francisco, and now Tokyo and one in Brazil. We are at, uh, $300 million in ARR, which is, uh, roughly 50/50 between self-serve, so a lot of subscriptions and creators using our creative platform, and then approaching 50% on the enterprise side using our agents platform, uh, work, and that's on the classic sales-led side. And we serve more than 5 million monthly actives on that creative side of the work, and then on the enterprise side we have a few thousand customers, from Fortune 500s to some of the fastest-growing AI startups.
>> I think this is such a... you're an amazing founder, but I also think it's such an interesting company because it is, um, very unintuitive to, I think, many people, and investors in particular. I don't know if you faced this at the beginning, but we were both there in 2022. There's a class of companies that allow creation in some way, when we look at your, like, first business beyond the research itself. Uh, and I would put Eleven and Midjourney and Suno and Hunen in this category. And I think there's, like, this overall sense of, like, who really wants to do this? Um, what was your initial read of, like, how many people want to make voices, or what made you believe that was going to be much broader? Because, like, if I look at dubbing, for example, it's not a huge market.
>> I think the first piece was, which is, as you mentioned, there's, like, a very...
>> It's very tricky to do both the product and the research.
>> I'm in a lucky position that my co-founder and I have known each other for 15 years. I think he's the smartest person I know, and he has been able to create a lot of that research work, to create that foundation to then elevate that experience. But both of us are from Poland originally, and the original belief came from Poland. It's a very peculiar thing. But if you watch a foreign movie in the Polish language, all the voices, whether it's a male voice or a female voice, are narrated by one single character. So you have, like, a flat delivery for everything in a movie.
>> A terrible experience.
>> It is a terrible experience, and it still... like, as soon as you learn English, you switch out and you don't want to watch content in this way. Um, and it's crazy that it still happens to this day for the majority of content. Combining that, and I worked at Palantir, my co-founder worked at Google, we knew that that would change in the future and that all the information would be available globally. And then as we started digging further, we realized...
>> In every language, in a high-quality way.
>> That was the starting point, and the big thing was, instead of having it just translated, um, could you have the original voice, original emotions, original intonation carried across?
>> Mhm.
>> So, like, uh, imagine having this podcast, but say people could switch it over to Spanish and they still hear Sarah, they still hear Mati, and the same voice, the same delivery. Um, which is kind of exactly what we did with Lex back when he interviewed Narendra Modi, and you could, like, kind of immerse yourself in that story a lot better.
Mhm.
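The dubbing idea described above, same voice and delivery, new language, can be sketched as a small pipeline. This is a hypothetical illustration with a stubbed translation step, not ElevenLabs' actual API:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    speaker: str   # original speaker identity to preserve
    emotion: str   # delivery/intonation cues carried across languages
    text: str      # transcribed source-language line

# Stub translation table standing in for a real machine-translation model.
TRANSLATIONS = {"Welcome back to the show.": "Bienvenidos de nuevo al programa."}

def translate(text: str, target_lang: str) -> str:
    return TRANSLATIONS.get(text, text)

def dub_segment(seg: Segment, target_lang: str) -> dict:
    """Translate the words, but keep the original voice and emotion."""
    return {
        "voice": seg.speaker,    # same voice, not one flat narrator
        "emotion": seg.emotion,  # same delivery carried across
        "text": translate(seg.text, target_lang),
    }

dubbed = dub_segment(Segment("Sarah", "warm", "Welcome back to the show."), "es")
```

The point of the shape is that voice and emotion pass through untouched while only the text is translated, which is exactly the contrast with the single-narrator Polish dubbing described above.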
>> Um, so that was the original, uh, kind of insight. And, um, we then started digging further, which is that just so much of the technology we interact with will change, whether this is how you create... It's still relatively tricky to bring voice alive. You would need to go through the expensive process of hiring a voice talent, having a studio space, having expensive tooling to then actually adjust it. The tooling isn't intuitive enough to be able to do this. So, like, all of that creation process will and should change, to make it easier for new people with keenness to bring that to life. Then, a lot of the technology wasn't possible for you to be able to, um, recreate a specific voice, or be able to create that in that high-quality way. And then of course, as we dived in further and shifted away from the static piece, the whole interactive piece is still crazy in the way it functions, where most of us have seen this technological evolution over the last decades. But you still will spend most of your time on the keyboard. You will look at the screen, and that interface feels broken. It should be where you can communicate with the devices through speech, through the most natural interface there is, one that kind of started when humanity started. And, um, we realized we want to solve that. And I think now, fast forward from 2022, I feel like many people will carry that belief too, that voice is the interface of the future, as you think about the devices around us, whether it's smartphones, whether it's computers, whether it's robots, speech will be one of the key ones. But in 2022 it wasn't. And, um, if you think about the market, whether for the creative side or for that interactive side, it was, like, very clear it will be a huge, huge one.
>> So even when you think about, uh, just the research part of your business, and then you have products for at least two different markets, and then you have this larger mission... A lot has changed in the last 5 or 10 years, but it used to be a very strongly held traditional belief that one must do one thing well in a startup and there's no other path. Like, you're treating this like an interaction company, a platform company. How did you think about sequencing, like, the research and the product effort? Does that make sense? Or, like, thinking about new markets? And maybe wrapped up in that question, too, is just, like, well, where are we in quality on voice as well? Because I would sort of claim, like, if the models are not good enough for certain use cases at all, it kind of doesn't make sense to do product.
>> And I think that's right. It's almost exactly... like, when we started originally, what we did was try to actually use existing models that were in the market and kind of optimize them. Our first use case was actually starting with a combination of narration and dubbing on that creative side. And, um, we realized pretty quickly that the models that existed just produced such robotic and not-good speech that people didn't want to listen to it. And that's where my co-founder's genius came in, where he was able to assemble the team and do a lot of the research himself to actually create a new version of that work. But, like, to your question, I think the way we are kind of organized internally, and how we think about sequencing... a lot of that was looking at the first problem and then creating effectively a lab around that problem, which is, like, a combination of researchers, engineers, and operators to go after that problem. And the first problem was the problem of voice. So how can we recreate the voice? And, like you say, it needs that research expertise to be able to do that well. So we started with effectively a voice lab, which had that mission of, can we narrate the work in a better way? It was a combination of roughly five people that were doing that work, and then we sequenced the research first and then built a simple layer on top of that work to allow people to use it, and then kind of expanded it from there with a holistic suite for creating a full audiobook, and then creating a full movie narration, a movie dub. Um, and then we moved to the next problem, which is the realization that, okay, we have solved the voice, great, for making content sound human. The first problem is that for that to be useful, for us to interact with the technology, you need to solve how you bring the knowledge on demand into that.
>> Mhm.
>> So we effectively started then the second team, which was a second lab, uh, an agent lab effectively, which was a team that would combine researchers, engineers, and operators once more, uh, which would try to fix: okay, we have text-to-speech, how do we now combine this with LLMs and speech-to-text and orchestrate all those components together, while integrating that with other systems to make it easier? And then, similarly, you know, you kind of expand from looking just at the voice layer into how those systems work together. And here, too, you need the research expertise to do that in a low-latency way, an efficient way, an accurate way. Um, but at the same time, there's that product layer that starts forming, where it's not only the orchestration that matters. It's also the integrations, how you link up to the legacy systems, how you build functions around it, or how you deploy that in production and test, monitor, and evaluate over time.
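The orchestration the agent lab tackles, speech-to-text into an LLM into text-to-speech, can be sketched as a single turn of a conversation loop. The three components are trivial stubs here, placeholders rather than real model calls:

```python
# Stubbed components: in a real agent these would be streaming STT,
# an LLM call, and low-latency TTS; here they are trivial placeholders.
def speech_to_text(audio: bytes) -> str:
    return audio.decode()  # pretend the audio is already its transcript

def llm_reply(history: list, user_text: str) -> str:
    history.append(("user", user_text))
    reply = f"You said: {user_text}"  # placeholder for an LLM response
    history.append(("agent", reply))
    return reply

def text_to_speech(text: str) -> bytes:
    return text.encode()  # placeholder synthesized audio

def handle_turn(history: list, audio_in: bytes) -> bytes:
    text = speech_to_text(audio_in)   # 1. transcribe the caller
    reply = llm_reply(history, text)  # 2. reason over the conversation
    return text_to_speech(reply)      # 3. speak the response back

history = []
audio_out = handle_turn(history, b"where is my order?")
```

In production, the hard work is exactly what Mati lists around this loop: streaming each stage so latency stays low, plus the integrations, function calls, deployment, and monitoring wrapped around it.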
>> Do you feel like you were creating new use cases when you built the tools? Did people know that they wanted to do this already? Um, because one argument that I remember hearing was, like, ah, you know, enterprises don't know what to do with voice, how many people really want to do it? And then you're serving, essentially, perhaps the, like, creator-publisher side of your business.
>> Yeah, it's definitely a combination of, like, initiatives that we believe will happen in the world, and then, like, a response to a lot of that. Like, as I think back, you know, of course voice, the internal voice lab, or the agents lab, that kind of kickstarted so many of the other labs. In response to the problems, we started a music lab, because people wanted to create music with ElevenLabs. It was a fully licensed model, where people wanted to use and create speech, but they wanted to add music in a simple way. We wanted to deliver that. And then, of course, that kind of came together through, how do we combine music, audio, sounds? Uh, we are now integrating partner models from image and video into that suite, so how could you combine all of that in one? And a lot of that was in response to the market saying, like, hey, we would love this. And then you will have completely different use cases, even in that space. Let's say dubbing. Dubbing is a use case where we didn't feel there was, like, a big push for it, but we knew that in the ideal world, in the future, you will be able to have that content delivered naturally across languages, still carrying that. Um, and I still think actually this market will be immense, because it's not going to be only the static delivery in movies, but if you travel around the world and want to communicate in real time, like the full Babel fish idea from The Hitchhiker's Guide to the Galaxy, this will happen. It will be like the biggest...
>> Uh, like the whole breaking down of language barriers, that are the barriers to communication, to creation. Like, all of that will break, and that will be, like, the foundational real-time dubbing concept. So, super excited about that part. And similarly, on the agent side, there are, like, some obvious things that, of course, customers that we work with or partners will want to integrate, which is, we want integrations with XYZ systems. But then there are, like, other parts that might not be as easy to predict, of, as you interact with technology, you of course want to understand what's happening, but you also want to understand how the things are being said, and bring that into the fold, which would be something we tried to prioritize on our side. So then, when people actually interact with the technology, they realize, oh, expressing things is actually so much more enjoyable and beneficial and helpful.
>> So I want to ask a question about this, which relates to quality. Um, uh, you know, I work with a series of companies where we're selling a product to... uh, the buyers are generally not machine learning scientists.
>> Right. Right.
>> And even the scientific community does not have the, like, full suite of eval benchmarks to understand every domain well. That's a well-known problem. But I imagine for a lot of your customers, it's not like they, like, know how to choose a good voice. So how do you deal with that problem? Like, is it like a, hey, I make a clone and, like, that sounds like me and I believe it, I'm going to try all of these different options? Or, you know, actually, are you teaching people to do evals?
>> It's a great question, because I think there are, like, two big problems. One is, like, how do you benchmark the general space in audio, where, like you say, it's so dependent on the specific voice, let alone, like, if you are training into interactive, then it's, like, even more tricky. Um, and then the second piece, which is, as you are working on a specific use case, how do you select a voice? So I'll take the second front first, which is, uh, we have, like, a voice sommelier, effectively, where, as we work with enterprises, we deploy that person to work with them and help them navigate. That person is, like, a voice coach, has an incredible voice themselves, and, uh, and now we have, like, a team under that person that will partner to help you find what's the right branding.
>> And now you have, like, the celebrity marketplace.
>> And now we have a celebrity marketplace, to, like, help you even get iconic talent in there, like Sir Michael Caine. That piece was important, because of course the voice will depend on the use case that you are trying to build, the language, all of that will have an impact on what's the right voice for your customer base. So we have effectively a, um, a voice person helping those companies, and some companies will be very opinionated on what they want. So they will sometimes select it themselves, sometimes give us a brief of, hey, we want a voice that sounds professional, neutral, welcoming. We recently had a company, one of the biggest European companies, that gave us a brief which is very original, uh, that, uh, they wanted as robotic a voice as possible.
>> Okay.
>> It was counterintuitive.
>> Um, but for...
>> You're like, we can't do that anymore. [laughter]
>> Almost. But we were, like, trying to go backwards of, like, how do we do that? But I think we got a good result. Uh, but recently we had a company in Japan... um, Japan and Korea, where they wanted to serve different voices depending on the customer that's calling in. They have an older population and a very young population. For the younger one, they wanted, like, one of the famous voices in the market that's very excitable and happy. Uh, and for the older one, they wanted, like, a calm, slow-speaking one. We help a lot with that. So that's on the voice piece, and I do think it's going to be a big, important...
>> Like, a personalized choice, and then it can even be dynamic per customer.
>> Yes. Okay. Exactly. Exactly. And then maybe in the future it's, like, going to be fully dependent on your interaction. You will, like, have a voice created as we understand the preferences of what people want. So, you know, like, let's say it's the evening and you are tired, and you want a slightly different one. Or maybe not, maybe that's, like, the best focus time that you have, so you want a voice that's giving you that energy. And probably it's a different one when you wake up and it gives you the morning news of what's happening, or what's the weather. So, like, all of those could be different. Yesterday we had a dinner with some of our partners, uh, and one of them, the first thing they said is, like, hey, I have a new request for you: I want a New York voice with a Long Island, uh, accent, which I never knew was a thing, and it supposedly is a thing. So, uh, so we have that. And then on the first piece, I think it's an unsolved problem still, where I think you have good benchmarks, of course, in LLMs, and I think in the image space they are pretty good. In the voice space, you have, of course, the speech quality, but then so much of whether you like the speech or not depends on the voice, such that if you compare a model A to a model B and you serve them with different voices, even if the quality is very different, the voice itself can just make them seem different. We've seen this... I don't know if you know Artificial Analysis benchmarks. I think they're pretty good. Just switching the voice makes such a big impact.
>> That's so interesting. Yeah. And I wonder if, um, as you said, this is, uh, the most dominant interaction mode we've had for millennia, for all of human history, right? And so...
>> And biases of... but I think so.
>> We're just very sensitive to it. Um, and I think people are going to be very sensitive to, uh, their own personalization as well.
>> 100%. I think there's also a third piece, which maybe is not directly related to your note, but we've also realized that... so, you have the benchmarks, you have, like, how do I find the right voice for my audience, but even the understanding of how you describe audio data is still lagging in the industry. Like, when we initially started, we of course went to the traditional players for them to help us label not only what was said, so, like, transcription, but also how it was said, like, what are the emotions used, the accent. And most people just weren't able to do that work effectively, because you kind of need to hear it and have, like, a little bit of a skill set of, like, how would I describe this specific delivery? So we needed to create that ourselves. So I think there is that piece as well, of, like, how do you effectively interpret the data of audio on a more qualitative basis?
>> That's... yeah, trickier. Can you talk about what's happening on, uh, the agent platform side? Like, what is challenging for, you know, businesses or even creators that are trying to build agents, and what are maybe the surprising or high-traction use cases? I think everybody's kind of aware of the idea of, like, agent-based customer support, but I imagine you're doing many things beyond that.
>> Yeah. So, exactly, customer support is probably the one that's, like, kicking off the quickest, and that's the one where we see it overtaking so many use cases, whether it's our work with Cisco or Twilio or Telus Digital, all of them are kind of elevating that to a high extent. I think the second exciting piece within that domain which is happening is the shift from effectively reactive customer support, I have a problem, I'm reaching out to customer support, into more of, like, proactive, part-of-the-experience customer support. So to make it explicit, uh, we work with the biggest e-commerce, uh, um, shop in India, Meesho, where they started working on the customer support side, where, I want the, uh, refund, I want to see the tracking of the package, to actually having an agent be a front part of the experience. So, if you go to the website, you have the widget, you can engage it through voice, and you can ask it, hey, can you help me navigate to item X, item Y, or can you explain what's the right thing for me to give as a gift for this period of time? And then it will actually help you, based on your questions, based on what is on offer, show you those items, navigate to the right parts of the site, maybe go all the way through the checkout. And I think this will be a phenomenal thing of, like, elevating the full experience, where it's more of an assistant across the whole thing. We kicked off our work with Square, which enables, sort of, businesses to do that work. Exactly the same pattern: it started with voice ordering, uh, and how can this now be part of the full discovery experience too, where you get items shown to you, you can have a lot more explanation, which I think will be a phenomenal piece, where effectively it helps from the beginning to the end. So that's
one category. The second one is the wider shift from static to immersive media, where there are just so many incredible stories and IP that today exist in effectively one way of delivery, and now you'll be able to interact with that content in a completely new way. Uh, I think one of the incredible use cases was working with Epic Games. We worked with them on bringing the voice of Darth Vader into Fortnite, where millions of players could interact with Darth Vader live in the game, where you had, like, a full experience of Darth Vader in a new way. And I think this will be a theme across, whether it's talking to a book, talking to the character that you like, the whole space shifting. And then I think the one that I'm most excited about, for the world and for the shift, is going to be education, where you will just be able to have, like, effectively a personal tutor, uh, in your headphones, and you, like, actually study something in an amazing way. I'll give you, like, two quick examples. One is, uh, we recently worked, uh, with Chess.com. I'm a huge fan of chess, I'm a true chess fan. Okay, great. So you can learn chess, but you can have Hikaru Nakamura or Magnus Carlsen be your teacher of how you deliver that, which is amazing, or even the Botez sisters, or, like, the whole plethora of different players that engaged with that, which I think is great. And then maybe a last one, which is MasterClass, who we worked with to shift from, you can of course have the content go through step by step, um, but you can also have, like, an interactive experience. And the best example of that was working with Chris Voss, the FBI negotiator, one of the top negotiators, who has a MasterClass lesson, but then you can actually call him and have a practice negotiation, which is crazy.
>> Yeah. Got to get that hostage out. We'll
definitely try it.
>> Yeah. Um
>> Can I add one more? I think the one last one, which combines all of them together, which I realized just recently, uh, which was crazy. So recently I went to Ukraine, where we are working with the Ministry of Digital Transformation, where they are effectively creating the first agentic government.
>> And the crazy thing is they have all of those...
>> Government?
>> Agentic government. So they want to, like, rethink how they run all the ministries.
>> Okay.
>> And it sounds like a big, ambitious goal, and lofty.
>> No, I think the baseline is, like, here. So actually I'm bought in by that immediately.
>> Yeah. And the crazy thing is, I think they are, like, so ahead in actually doing that. And I think there are, like, uh, two concrete things there. One, they kind of combine all those use cases. So we are looking into how they can have effectively customer support for government, whether it's asking about benefits or employment, about the process of how you leave, uh, the country, all of that run through effectively a digital app. Then two, how you can have a proactive way of informing citizens of things that might be happening. But then having an education system that also runs through, like, this personal, uh, tutoring experience. And all of that is happening. So that was incredible to see. And the second amazing thing was the way they've done it. So they have the digital transformation piece, but they have engineering leaders in each of the ministries that lead those efforts and then bring them back to that one central piece. So that is, like, incredible to see, and I'm also proud to be able to be working with them on that shift, that despite everything that's happening, they're, like, so...
>> That's amazing. That's really encouraging. Um, can I ask you a business model question here? Because, looking at the strategic landscape... um, actually, I have many questions here. Um, one of the observations I'd have is, if I look at one of these, like, rich voice-and-action agent experiences, there are a lot of, uh, let's say Fortune 500, Global 2000 leaders who listen to the pod. Uh, I think a lot of them are going to buy the idea of, like, I want this amazing, um, automatic, like, real-time, available-24/7, every-language experience for my customers that's consistent and high quality. The ways I might get there include working with a Palantir or a large consulting firm; uh, working with Eleven or a, like, platform technology company, or, like, an OpenAI or something, right? Or working with a sort of more use-case-oriented company like Sierra, right? How do you think about how people are making that decision, or how they should make that decision?
>> So, my past is also in Palantir, so I started exactly kind of from that side, and we do blend a lot of the forward-deployed engineering inside of the company too. As I think about our offering and the customers making that choice: if you're looking for just, like, a one-pointed solution, uh, and only that one, then likely we aren't the best choice. If you are looking to deploy that across a plethora of different experiences, so be it customer support, but then you also want internal training, then you might want to elevate your sales part and actually increase the top line with new experiences of how you engage customers beyond that kind of reactive piece...
>> Mhm.
>> Um, then it's a great platform to build on, and then we effectively, as we engage with customers, combine that platform work with, uh, with our engineering resources to help those companies deploy on it. Or, which we also see increasingly in, um, Fortune 500s, Global 2000s, they will want to build parts of the things themselves, because they already have a lot of the investment in that platform, while engaging us on some of the new ones, and combining those. And I think that our model, and the way it's different to a lot of the use-case-specific ones, is that our platform is relatively open, where you can use pieces of that platform and not all of them, um, for those different use cases. Palantir, of course, or some of the consulting companies, will have a lot more resources to go on the wider digital transformation journey. In our case, it's, like, very specifically conversational agents. It's like, if you are looking for a new interface with customers, that's the best way. And, um, companies like Sierra are phenomenal, of course, on how they are thinking about the specific pointed, uh, use case. And maybe the other piece is, uh, like, as we think about our work, it depends on what you are optimizing for. So we have a lot of international partners. If you have, like, a wider geographic user base, great, that's what we optimize for. Our voices, our languages, our support for integrations internationally are just so much broader. There's frequently a piece that you will look into depending on your exact scope; this will be a big factor. But I would summarize that if you're looking for a solution across a set of different use cases, and you want our engineering help to deploy it, then we are the right solution, and probably the best solution.
>> I want to talk a little bit about OpenAI and the LLM foundation model companies. One of the reasons I called this podcast No Priors is that people are making a lot of assumptions all the time about how the market is going to work, and lo and behold, many of those assumptions end up being nonsense; you have to very much decide your own narrative at this point in time. Correct me if I'm wrong, but in 2022 and '23 you probably heard a lot of people say, Google can do this, OpenAI can do this, so why do you get to persist working on voice as a general capability? What's the answer?
>> That also adds another element to a couple of the previous questions. Whether it's the agents work or the creative work, to deploy the value there you need a very strong product layer, you need the integrations, you need to help people deploy the work; that's the most common piece. But our superpower, and our focus for a long time, was building the foundational models to actually make that experience seamless. As I think about a lot of the companies in the market, they will optimize for a lot of other things, and that will be their differentiator; in our case, we will make the whole experience, especially with voice, seamless, human, and controllable in a much better way.
>> And so fundamentally you would argue that the labs just aren't going to focus on this, and haven't.
>> Exactly. I think for most of those companies, and that's the thing about the long term, it's going to be incredible research and incredible product that meets customers where they are and works backwards from there. I don't think the labs will focus on building that product layer, which is so important.
>> But part of the question that I'm asking is how, or why, they haven't done even the research part...
>> ...to the quality that we've been able to. Here I'm also biased, but we are happily beating them on benchmarks with text-to-speech or speech-to-text or the orchestration mechanisms, and here credit to my co-founder and the team that they've been able to do it; it's just mighty researchers continuing their work. But the main thing I think is different in the audio space is that you don't need scale as much as you need the architectural breakthroughs, the model breakthroughs, to really make a dent. We've been able to do that a couple of times, and I think the number of people doesn't matter, but which people you have does. We think there are maybe 50 to 100 researchers in the audio space that could do it, and we think we have probably 10 of them in the company, some of the best ones. And this obsession with those people working across the problems, the company giving them full focus, bringing their work to production and seeing how users interact with it, was so important. That's how I think we've been able to create models better than some of the top companies out there. But the truth is, to a large extent, why they weren't able to do it is also an interesting question; we don't know, because they have such incredible talent there too.
>> How do you think, at the same time, about open source models?
>> Anyone you ask in the company, I think, will say the same, and that's a narrative we think about: in the long term, models will commoditize, or the differences between them will be negligible. For some use cases they will still matter, but for most use cases they won't.
>> And they'll be broadly available.
>> They will be broadly available, exactly. And we don't know when that is, whether it's two years, three years, four years, but it's going to happen at some stage. Then of course you will have a fine-tuning layer on top of those models that will matter a lot, but the base models I think will get pretty good. That's why for us the product piece is so important, from the company perspective but also from the value perspective: you can have a model that's great, but to actually connect your business logic and knowledge, to have the right interface for creating an ad for your work or a completely new piece of material, that's a very different exercise. But open source models are getting there. If I split it in two: for the async content, the narration, open source is great, commercial models are great, and the differences in out-of-the-box quality are getting smaller. What most of the models haven't figured out, and where I think we are ahead, is how to make them controllable.
>> So that's the narration piece. Then there's the whole interaction piece of how you orchestrate the components together, whether that's a cascaded speech-to-text, LLM, text-to-speech approach, or whether in the future it's a fused approach where you train them together. I think this is good for customer support or customer experience, but it's still some way from conversation like the one we're having, from passing that Turing test. I think that's still within a year, though; and then you'll have real-time dubbing, a variation of real-time translated conversation, and I think that's maybe within two years.
>> You know, an uncomfortable belief that I feel comfortable having, but that I think is uncommon in the market right now, is that most advantages in technology could last you a year or could last you ten, but they're not infinitely defensible. If you think about that from a model quality perspective or a product perspective, they allow you to serve the customer better and build momentum and build scale for some period of time. And actually that's really powerful over time, right? But it's not a clean forever answer. And I think that makes, I don't know, business people and investors uncomfortable.
>> And I mean, it's very true as well. [laughter] The way we think about it, research is a head start. It means we can give the advantage to the customer earlier, and it's six to twelve months of advantage. That is also a window for us to build the right product layer for you to get the best of that research. Frequently we do that in parallel, so the moment the research is out there, you have the product, because we know our initiatives and we know what the product is.
>> That's right.
>> So you have research and product in parallel, which extends the head start. But the thing that will really give long-term value is the ecosystem you create around it: whether that's the reach and distribution, the collection of voices you can have, the collection of integrations, the workflows you can build. I think that's the way we sequence it in our minds: research, product, ecosystem. And research, all it is, is a head start, being able to pull the future a little bit closer.
>> I think that's a really powerful insight, especially if the research team and the company believe that internally as well.
>> What's interesting for us, and I think this is the big question for all companies that do research and product, is: do you wait for the research, or do you make a product change? And not only research-and-product companies; do you wait for someone else to do the research? The timeline for that isn't clear. Is it 3 months, 6 months, 12 months? You don't know exactly what it will deliver, which makes it a hard choice: do I invest in the product layer, or do I just wait longer for the research? In our case, we internally share the research initiatives with all the product teams so we can parallelize that work, but we don't hold them to it: if a product team thinks we should deliver value to the customer by doing something different, they can. The rough rule of thumb is three months: if we think the research is going to take longer than three months, we will probably build it in the product; if it's less than that, we probably won't.
>> Can you talk about some of the research that you're doing now, and how you think about the cadence of delivery and what's worth working on?
>> We have a number of different initiatives across the audio space, in two big buckets that roughly relate to the creative and agent sides. On the creative side, it's the text-to-speech models that are controllable. We then added a speech-to-text model that transcribes with high accuracy, across low-resource languages as well, covering almost 100 languages. Then we created a music model, a fully licensed music model. And as you think about the future, it's how those models will also interact with the visual space; a lot of effort goes into how you can get the best of audio and then potentially combine it with existing video you have, to really get the best delivery. Then on the agent side, it's of course how you optimize real-time speech-to-text and real-time text-to-speech. We just released our speech-to-text model, Scribe v2, which is under 150 milliseconds at 93.5% accuracy across the top 30 languages on FLEURS, and it's only the top 30 there because we serve so many others that most people don't; it's beating all the models on benchmarks. But as you think about the future, it's also the orchestration piece, how you bring speech-to-text, an LLM, and text-to-speech together. We'll be releasing over the next couple of months a new orchestration mechanism that will lower the end-to-end latency, we think in a great way. The second thing, which is what is so hard, is that it won't only combine those pieces but also add the emotional context of the conversation, so the model can respond more expressively, in a better way. And in the future, something we're investing in is, in parallel, a speech-to-speech, more fused approach as well. Of course, it depends on the use case: if you are an enterprise with a reliability-critical use case, the cascaded approach is the approach for the next year too.
>> Has more structure, yeah.
>> More structure. You have more visibility into each of the steps, it's reliable, you can call tools. If you want something more expressive, and it can hallucinate, speech-to-speech might be the choice, and maybe over time you'll see one overtake the other depending on the industry. But that's a huge investment on our side; it's the foundation of the whole platform, and the main thing we are continually investing in is a plethora of different models that combine the best of audio with the best of the other modalities.
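The cascaded agent pipeline described above can be sketched as three stages with a clear seam between each, which is exactly where the extra visibility and tool-calling hooks live. The sketch below is illustrative only, with hypothetical stub functions standing in for real models; it is not ElevenLabs' actual API:

```python
# Minimal sketch of a cascaded voice-agent turn: speech-to-text -> LLM ->
# text-to-speech. Every component here is a hypothetical stand-in for a
# real streaming model, not an actual library or vendor API.

from dataclasses import dataclass


@dataclass
class Turn:
    user_audio: bytes        # raw audio from the user
    transcript: str = ""     # filled in by the STT stage
    reply_text: str = ""     # filled in by the LLM stage
    reply_audio: bytes = b"" # filled in by the TTS stage


def speech_to_text(audio: bytes) -> str:
    """Stub STT: a real model would emit streaming partial transcripts."""
    return audio.decode("utf-8")  # pretend the audio is already text


def llm_respond(transcript: str) -> str:
    """Stub LLM: a real one would hold conversation state and call tools."""
    return f"You said: {transcript}"


def text_to_speech(text: str) -> bytes:
    """Stub TTS: a real model would stream audio as tokens arrive."""
    return text.encode("utf-8")


def run_turn(turn: Turn) -> Turn:
    # Each stage's output is observable in isolation -- the "visibility
    # into each of the steps" that makes the cascaded approach reliable.
    turn.transcript = speech_to_text(turn.user_audio)
    turn.reply_text = llm_respond(turn.transcript)
    turn.reply_audio = text_to_speech(turn.reply_text)
    return turn


if __name__ == "__main__":
    result = run_turn(Turn(user_audio=b"where is my refund?"))
    print(result.reply_text)  # -> You said: where is my refund?
```

In a real deployment each stage would stream (partial transcripts in, audio chunks out) to keep end-to-end latency low. The point of the sketch is the design trade-off being discussed: each seam can be inspected, logged, and swapped independently, which a fused speech-to-speech model gives up in exchange for expressiveness.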
>> I want to take our last few minutes and ask you a few questions about the future, which I think you'll have a really good point of view on, given you think about voice and audio all the time. What do you think of AI companions?
>> I think they will be a big thing and exist in a big way. It's not something I'm personally excited about or something that we spend much time on, but I think the whole line of what's an assistant, a companion, a character that you enjoy as part of an experience will blur and blend to a large extent.
>> They can be very common, but you're not enthusiastic about them personally.
>> I'm more excited about the Jarvis version of that, more of a, I have a super assistant, superpower.
>> Versus the social version.
>> Versus the social version. I think it would just be such an incredible unlock, and it also blends in that personal context. I would love to start the day with something that understands me, tells me what's relevant to me, opens the blinds, tells me about the weather and the sunshine, and plays music straight away.
>> It's going to happen.
>> That I'm excited for. The companion use cases will be mentioned as solving loneliness, and that's one way; maybe there are different ways of engaging people back. I do think there will be an interesting future even in education, where you will have a superpower, learning from AI tutors. But on the flip side, and this is my personal take, in education you will have a good percentage of time spent with AI tutors, but then an explicit percentage of time spent without any technology, human to human.
>> So you can learn that part too.
>> Yeah, I think this is the correct model, both in terms of emotional guidance and coaching and, you know, guardrails, right? As well as peer-to-peer.
>> Exactly.
>> What do you think about dictation, or what happens in terms of how we control technology that isn't necessarily personified? Or does it all just become personified?
>> I think not all of it gets personified. Some of it, you know, communicating with an oven or the home, will probably stay pretty static.
>> Or code, I might just...
>> Yeah, exactly. You probably don't need that much additional emotional input there. But I think it's going to be a huge part. In a way, what I hope will happen is that you will have the ability to stay more immersed in real life, with the devices going back into the pocket, back into some version of an attached element, assuming the setting is right, something that acts on your behalf. And in many ways, take dictation: as Karpathy says, it's the decade of agents. Let's call it a decade. Then you'll have a decade of robots. If you are interacting with robots, of course, voice will be the input and the output, one of the key interfaces. So you will need dictation as a huge part of that. But similarly...
>> I think the robots are going to be personified.
>> Yeah, 100%. I think most of the use cases will be personified.
>> Okay, last one. What's one thing that you've seen already exist today, or that if you project out a few years will change, about how we interact with content? Maybe it's personalized voice content, or just something people are going to do with AI voice that they don't do today, or that not everybody knows about.
>> I think the biggest one that hasn't yet kicked into the system is how education will be done. Learning with AI, with voice, where it's on your headphones or on a speaker, is just going to be such a big thing: you have your own teacher on demand, one who understands you, very personified, delivering the right content through your life. I think this will be one of the biggest use cases, and I don't think it has happened yet. We see some of the commercial partners, of course, but how it's deployed in schools and universities in a safeguarded way, in a way that supports the other part of education, the social part, I think all of that will evolve. And maybe there's a cool version of it where you have Richard Feynman or Albert Einstein deliver the lecture notes, or other teachers that you love. It will be sick.
>> That's a great note to end on. Thanks for doing this, Mati.
>> Thanks so much.
>> Find us on Twitter at @NoPriorsPod. Subscribe to our YouTube channel if you want to see our faces. Follow the show on Apple Podcasts, Spotify, or wherever you listen; that way, you get a new episode every week. And sign up for emails or find transcripts for every episode at no-priors.com.