
The Rise of Generative Media: fal's Bet on Video, Infrastructure, and Speed

By Sequoia Capital

Summary

Topics Covered

  • AI Video Overlooked for Lack of Clear Use Cases
  • Video Models Compute-Bound Unlike LLMs
  • Video Thrives on Long Tail Models
  • Top Video Models Half-Life 30 Days
  • Hollywood IP Adapts Like Animation Did

Full Transcript

We recently had our first generative media conference, and Jeffrey Katzenberg, former CEO of DreamWorks, was there, and he made a comparison. He said this is playing out exactly like animation when it first came out. People revolted against it. It was all hand-drawn before that, and computer graphics were new, and there was a lot of rebellion against computer-driven animation. Something very similar is happening with AI right now, but there's no way of stopping technology. It's just going to happen. You're either going to be part of it or not.

In this episode, we sit down with the team from fal, the developer platform and infrastructure powering generative video at scale. fal is a place developers can go to access more than 600 generative media models simultaneously, from OpenAI's Sora and Google's Veo to open-weight models like Kling. We'll discuss why video models present fundamentally different optimization challenges than LLMs, why the open source ecosystem for video has a thriving long tail in ways that text models never did, and why the top video models have a half-life of just 30 days.

The team also shares insights from the demand side of the video model equation.

We discuss what's happening in the app layer from AI native studios to personalized education, what's happening in Hollywood and more. Enjoy the show.

Burkay, Gorkem, Batuhan, thank you so much for joining us today. I want to start with the problem space that you decided to tackle. So, fal is a developer API and platform for generative video and image models. Video is massive. Obviously, it is more than 80% of the internet's bandwidth, and it follows that generative video is going to be similarly massive. But there aren't that many companies focused on this problem. Why do you think that is?

>> Yeah, in a way generative image, and then video, was an overlooked market in this current phase of AI, in my opinion, for two reasons. Number one, there wasn't a very clear industry use case that people were going after. There wasn't vibe coding that automates software engineering, or search, which the LLM market seems to be going after, or customer support, anything like that. Number two, the investment on the research side wasn't as big three years ago, and then it ramped up, a little bit slower than LLMs but still considerably. Now the models are much more capable and much more useful in real industry use cases compared to what they were three years ago. Back then it felt like a toy use case, like it was just going to be for fun on the side and it was going to be a small market in the end. And now we can see that it's going to be a massive market, with very unique use cases and customers compared to the LLM market.

If you actually go back to how we were experiencing it, I think that was an interesting time. We were working on some Python compute infrastructure, and then models like DALL-E 2 had just come out, and soon after that ChatGPT had come out, and then Llama had come out. Initially we didn't know that the image and video market was going to get that big. We were actually just curious about running image models much faster. That was our initial entry point. And then we saw the initial growth. We had a few customers and they were growing really fast. We were like, what the heck is going on? And a few customers later we thought, hey, we should double down here.

Around that time, the other thing that was happening was that people were overindexed on language models. This story of AGI was being told, and that attracted all the dollars, that attracted all the talent. So everyone was working on that, while we thought we had something niche that was growing fast, you know, don't tell anyone. So we just started focusing on it, and soon after, as we got more familiar with the models, I remember we changed our website copy to say generative media, a generative media platform, and it was only two or three months after that that Sora was announced. So we were definitely ahead, but we really saw the whole future coming, with better image models, video models, etc. So yeah, we made this early bet.

>> I mean, you guys have a front row seat to the sorts of new experiences people are building. I think the market's only going to expand from the media market that we know today.

>> Yeah, absolutely. I'll quote a tweet, you know, no good podcast without one. He did one recently where he was talking about why he's excited about media models, and one of the things he said was that people are visual, and we consume so much more video than walls of text. He was making a point around education and all the content you consume just to learn things. I think right now the model quality is so much worse than what it can be. I do a lot of learning on ChatGPT, but it's through text. If it could actually render a video that compresses a concept, instead of 10,000 characters, into something like 15 seconds, it would be so much better. So there's a quality bar that's going to keep going up, and once we have that, we're going to have even more penetration. It's really a function of the quality right now, and we're just in the very early beginnings.

>> Totally. The education market is almost untouched right now with video generation, and there's so much potential there. It's just waiting for the quality and the predictability to get there, and I think it's going to have a lot of potential.

>> Totally. I mean, you guys sent me that generative video Bible app. I think it's a much better way to learn some of the lessons from the Bible, and it's capturing consumers' attention right where they are. I agree with you, we're just at the beginning. So, fal is an infrastructure company, and we're going to structure today's interview, I love infrastructure companies, in terms of the technical layer cake. We're going to start from the core inference engine, the compilers and kernels that you built, go up to the model layer and then the workflows, and end with some observations on the markets and what people are building. Sound good?

>> Let's do it. Sounds exciting.

>> Okay, let's do it. The inference engine. Batuhan, how old are you?

>> 22.

>> You're 22 years old. Okay, give us a bit of your background. I think it's super badass, and it makes complete sense why this company is so hardcore at what it does.

>> I started working on compilers when I was 14, so in a way I have a lot of experience on that front. It's not just that, but I started working on open source projects. My first contributions were around tooling for the Python language, and then I started to slowly contribute back to the Python language itself, the core compiler, the core parser, and the core interpreter, and became one of the core maintainers of it. I think at the time I was the youngest core maintainer of the language, and that gave me a unique appreciation of compilers and how flexible they are.

So when we first started working on serving these image models at fal, the main idea was: okay, there are these three different image models, three different architectures, but this is surely going to explode. There are upscalers, there are going to be video models. We were predicting that, and we didn't want to optimize a single model, put all our eggs into a single basket, and then get invalidated when the next model comes. So we started building this inference engine, which is a tracing compiler: it traces the execution and essentially tries to find common patterns that fit within the templated kernels that we write. Our bread and butter is a dedicated performance team spending all its effort writing kernels that are, say, 95% there but generalized with templates. We trace the execution of a model and find common patterns where these templated, semi-generic kernels can be replaced with specialized kernels at runtime, and that optimizes the performance of these models. We found this technique to yield superior results to pretty much anything out there in the market, and it led us to claim the number one spot on performance on all the benchmarks. Another big thing is that we specialize in kernel-level, mathematically sound abstractions that let us maintain the same quality of these models, which is a very high bar when you're in the media industry and you really care about the output you're getting.

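To make that trace-and-specialize idea concrete, here is a minimal illustrative sketch in Python. It is not fal's engine; the op names, template table, and kernel names are invented for illustration. The point is only the mechanism: walk a traced op stream, match runs of ops against templated patterns, and bake the traced shapes into a specialized kernel at runtime, falling back to a generic kernel otherwise.

```python
# Minimal sketch (not fal's actual engine) of trace-then-specialize kernel selection.
from dataclasses import dataclass

@dataclass(frozen=True)
class Op:
    name: str      # e.g. "matmul", "softmax", "gelu"
    shape: tuple   # output shape, used to pick the template specialization

# Templated "kernels": callables keyed by the op pattern they cover. A real engine
# would emit fused GPU kernels; these stand-ins only record the match.
TEMPLATES = {
    ("matmul", "softmax", "matmul"): "fused_attention<{m}x{n}>",
    ("matmul", "gelu"):              "fused_mlp_gelu<{m}x{n}>",
}

def specialize(trace: list[Op]) -> list[str]:
    """Greedily replace traced op runs with specialized kernels where a template fits."""
    plan, i = [], 0
    while i < len(trace):
        for pattern, template in TEMPLATES.items():
            window = tuple(op.name for op in trace[i:i + len(pattern)])
            if window == pattern:
                m, n = trace[i].shape
                plan.append(template.format(m=m, n=n))   # shapes baked in at runtime
                i += len(pattern)
                break
        else:
            plan.append(f"generic_{trace[i].name}")      # fall back to the generic kernel
            i += 1
    return plan

# Example: a tiny attention-then-MLP trace.
trace = [Op("matmul", (64, 64)), Op("softmax", (64, 64)), Op("matmul", (64, 64)),
         Op("matmul", (64, 256)), Op("gelu", (64, 256))]
print(specialize(trace))
# ['fused_attention<64x64>', 'fused_mlp_gelu<64x256>']
```
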
>> What's different between optimizing a diffusion model versus an autoregressive LLM?

>> In autoregressive LLMs, your bottleneck is how fast you can move all those giant weights from memory to SRAM, because you have, say, a 600-billion-parameter model and you're trying to predict the next token, doing attention over the tokens that came before. In diffusion models, you're trying to denoise thousands or tens of thousands of tokens for a video at the same time, and doing attention over all of it, so you're essentially saturating all the compute of these GPUs. You're not necessarily bound on memory bandwidth, but the computational operations you do are fully saturated. So you're trying to find better ways to execute on the GPU. That could be writing more efficient kernels, or it could be overlapping, say, the softmax with the GEMMs that you run. Essentially, you're trying to use all of the power of the GPU in a way that gets you all the capability.

>> So it's a different binding constraint.

It's on the compute versus the memory.

And what's the intuition for why LLMs are relatively memory constrained, while video models are, by comparison, relatively compute constrained but not as large in terms of sheer number of parameters?

>> I think it's a scaling issue, right? If you scaled a video model to 600 billion parameters with the same dense architecture, you would have to do attention over all those tokens. Say a single video is 100,000 tokens, and you do the denoising step 50 times, and every one of those 50 times you do attention over all 100,000 tokens; it's insanely expensive. So the constraint there is just how fast you can do the compute. The same applies to LLMs at larger batch sizes, but at the traffic patterns people actually run, the batch sizes are not that big and you're mainly constrained by memory bandwidth, so people do optimizations like speculative decoding and other techniques to reduce that overhead.

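A rough back-of-envelope calculation makes the memory-bound versus compute-bound split concrete. The hardware figures and model sizes below are assumptions picked for illustration, not fal's numbers: single-stream LLM decoding re-reads every weight from HBM per token, while one diffusion step amortizes each weight read over the whole token batch.

```python
# Rough arithmetic-intensity comparison (illustrative assumptions only).
GPU_TFLOPS = 1000      # assumed dense BF16 throughput of a modern accelerator
GPU_TBPS   = 3.3       # assumed HBM bandwidth, TB/s
ROOFLINE   = GPU_TFLOPS * 1e12 / (GPU_TBPS * 1e12)  # ~300 FLOPs per byte to stay compute-bound

# LLM decode, batch size 1: every new token streams all weights from HBM once.
params = 600e9
flops_per_token = 2 * params          # one multiply-add per weight
bytes_per_token = 2 * params          # BF16 weights read from memory
llm_intensity = flops_per_token / bytes_per_token   # ~1 FLOP/byte -> far below roofline

# Video diffusion: tens of thousands of tokens denoised together, so each weight
# read is amortized over the whole token batch.
tokens = 100_000
model_params = 30e9                   # roughly the size of current open video models
flops_per_step = 2 * model_params * tokens
bytes_per_step = 2 * model_params
video_intensity = flops_per_step / bytes_per_step    # ~tokens FLOPs/byte -> far above roofline

print(f"roofline        ~ {ROOFLINE:.0f} FLOPs/byte")
print(f"LLM decode      ~ {llm_intensity:.0f} FLOPs/byte (memory-bandwidth bound)")
print(f"video diffusion ~ {video_intensity:,.0f} FLOPs/byte (compute bound)")
```
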
>> Yeah. What exactly goes into being at the top of the leaderboard in terms of performance? Because I would imagine there are other teams that also have very smart people, and this is their Olympics, and I imagine people have very similar ideas about the techniques and the different optimizations they can do.

>> I don't think anyone cares about it as much as us. We are literally obsessed with generative media. We're literally obsessed with these models. We have a team that's just focusing on this. So far, from Nvidia to other inference players, everyone is super obsessed with language models. Everyone is trying to get one more token per second on DeepSeek benchmarks, whatever. We're in a different lane. We have competitors, but no one close to us, because I think we assembled one of the best teams, we found the best way to optimize these generative models, and we just focus on this. It's purely a focus thing. At the end of the day you're constrained by the hardware; there's nothing unique about it. But we're just three months ahead, six months ahead. When we benchmark the latest version of Torch against our inference engine from a year ago, that year-old engine is clearly underperforming, because Torch caught up. The same thing is going to happen with other players. The lead you can maintain is three months, six months at most. The thing that matters is just focus. If you focus on it, if you purely put all your energy into it, I think it's very hard to get out-competed by others.

>> Because models change slightly with each release. It's still the same general architecture, but there are slight differences where we can go in and optimize what's different, and no one else is paying that much attention to it. Also, hardware is changing as well. We were able to adapt to B200s earlier than anyone else, and we were able to run video models much faster basically throughout the year because of that obsession with running video models on the latest hardware.

>> Yeah. Got it.

>> What are the hardest technical problems that you think you're solving?

>> One thing people don't appreciate as much is that we are running 600 different models at the same time. We have to be so good at running them that we should be running any single one of them better than someone else who runs only that single model. When a foundational lab is running models, maybe they have a single version of the model, maybe a couple of other versions, and that's all they care about. We have to be better than them at running those models, and we have to do it for all 600 at the same time. So on top of the inference optimizations that happen on the GPU, a lot of optimization has to happen at the infrastructure level. We need to manage the GPU cluster efficiently, loading and unloading these models at the right times. We need to route traffic to the right GPUs, the ones that have a warm cache of these models. You need to be smart about choosing the right kinds of machines, which kinds of chips run which kinds of models, and the customer traffic is changing all the time, so we need to adapt to that. On top of the inference engine, the overall infrastructure is also a really hard beast to manage, and so far we've done an incredible job at that. Would you add anything to that?

>> I think that's a pretty fair explanation of what we do. I call this distributed supercomputing. I don't know why people don't like that name. But the idea is: we were at 28, that was a month ago, now we're probably at 35 different data centers, and you have these heterogeneous groups of compute split across them, each with their own specs, their own networking, whatever, and you're trying to schedule workloads as if it's a homogeneous cluster that you got from a hyperscaler. It doesn't work like that. So we spent the last three years building abstractions over it, from our own orchestrator to building our own CDN. We go back to the fundamentals of the web: we built our own CDN service, we're deploying racks to colos, we're routing traffic. We built all these technologies to make sure we can tap into capacity wherever it is and schedule our workloads, which is very different from a traditional enterprise LLM usage pattern. The use cases we have are so much more spread out, so much more consumer-facing, and when you consider that, there's a lot of investment going into making sure we can tap into this scarce capacity of GPUs.

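As a sketch of the warm-cache routing idea described above, here is a toy Python router. It is not fal's scheduler; the slot-based VRAM model, the cold-load cost, and the model names are assumptions chosen to show the basic trade-off between sending work to a GPU that already holds the weights and paying to load them onto a cold one.

```python
# Toy warm-cache-aware router (illustrative only).
from dataclasses import dataclass, field

@dataclass
class Gpu:
    name: str
    vram_slots: int                                    # how many models fit resident at once
    warm: list[str] = field(default_factory=list)      # models currently loaded (LRU order)
    queue_s: float = 0.0                               # seconds of work already queued

COLD_LOAD_S = 25.0                                     # assumed cost of pulling weights onto a GPU

def route(request_model: str, gpus: list[Gpu], run_s: float) -> Gpu:
    warm = [g for g in gpus if request_model in g.warm]
    if warm:
        target = min(warm, key=lambda g: g.queue_s)    # warm cache: no load penalty
    else:
        target = min(gpus, key=lambda g: g.queue_s)    # cold: pick the least-loaded GPU
        target.queue_s += COLD_LOAD_S
        if len(target.warm) >= target.vram_slots:
            target.warm.pop(0)                         # evict the least-recently-used model
        target.warm.append(request_model)
    target.queue_s += run_s
    # Refresh LRU order for the model we just served.
    target.warm.remove(request_model)
    target.warm.append(request_model)
    return target

# Example: two GPUs, traffic alternating between an image and a video model.
fleet = [Gpu("gpu-0", vram_slots=2, warm=["flux-dev"]), Gpu("gpu-1", vram_slots=2)]
for model in ["flux-dev", "video-xl", "flux-dev", "video-xl"]:
    g = route(model, fleet, run_s=5.0)
    print(model, "->", g.name, f"queue={g.queue_s:.0f}s")
```
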
>> Yeah, you mentioned hyperscalers, and I hear distributed compute and managing giant clusters, and I naturally think that's somewhere hyperscalers should have the incumbent advantage. Why do you think you've been able to out-execute them so far on the core engine?

>> There are two things about the core engine. There's the inference part, where none of the hyperscalers have any expertise. This is a net new field; inference optimization has only been happening for the past three years. So it's a brand new lane where we have been out-competing anyone in our field. I think that's pretty much an answer of its own. And the second one is infrastructure. Right now hyperscalers are very busy with their traditional pattern of, oh, we have this data center capacity, we'll just deploy GPUs and we don't care about the rest. That has been changing recently. Even Microsoft is going out and buying from neoclouds. There's an interesting pattern happening, because the demand and the growth of GPUs doesn't fit the growth patterns these hyperscalers expect. So I think at this stage not even hyperscalers have that big an advantage of scale, because they're going out and buying GPUs from neoclouds. The tables have turned a bit.

>> Yeah. It almost helps to be slightly earlier in the company journey, right? If you're a public company, you also have to abide by what the market expects of you.

>> The other thing is that there's a huge price discrepancy between hyperscalers and neoclouds, right? It's maybe sometimes 2x or 3x more expensive to use things through a hyperscaler. What's driving that?

>> Well, I think one is market pressure, right? And there are also added operational expenses that hyperscalers have: they just have a better service, better uptime, better SLAs, and all of those things add up. Then on top of that there's an established cloud margin, and the market expects the cloud margin to stay at a certain level, whereas if you're a three-year-old neocloud, a private company, maybe you don't have as much pressure. And assuming infinite demand and limited capacity, hyperscalers can keep their prices high and still fill out the capacity and get slightly better economics, whereas neoclouds compete over that whole infinite demand, and that pushes prices down.

>> Perfect price competition. What does it take to run image versus video models? You guys started the company around the Stable Diffusion moment, when the field was mostly image. How does running video models compare to image?

>> Let's actually do text, image, and video, let's compare all three of them. For an LLM where we know the numbers, say DeepSeek or something like that, running a single prompt of around 200 tokens takes some amount of compute, I think it's tens of teraflops, but let's call that unit 1x. One image is around 100x of that. And if you're doing a 5-second video at 24 fps, that's around 120 frames, so roughly 100x of one image. So you're already at about 10,000x for a standard-definition video, and if you want to do 4K, that's another 10x on top, compared to a single 200-token LLM input. So it is a lot more compute intensive in terms of the amount of FLOPs you're doing.

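Running those quoted ratios through quick arithmetic, with the caveat that the ratios are the speaker's rough estimates and the absolute teraflop figure below is just an assumed stand-in for "tens of teraflops":

```python
# Quick arithmetic on the ratios quoted above (illustrative absolute numbers).
llm_prompt_tflops = 10                        # ~200-token LLM request: "tens of teraflops"

image_tflops     = llm_prompt_tflops * 100    # one image is roughly 100x a short LLM prompt
sd_video_tflops  = image_tflops * 100         # 5 s @ 24 fps is ~120 frames, rounded to ~100x per image
uhd_video_tflops = sd_video_tflops * 10       # 4K adds roughly another 10x

for label, tf in [("image", image_tflops), ("SD video", sd_video_tflops), ("4K video", uhd_video_tflops)]:
    print(f"{label:8s} ~ {tf:>9,} TFLOPs ({tf // llm_prompt_tflops:,}x a 200-token LLM prompt)")
```
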
>> Yeah. In general, when we started with image, the infrastructure was relatively easy to do, because it takes three seconds now, or it took 15 seconds back in the day, to generate an image, so you don't necessarily need to shave off the 50 or 100 milliseconds of overhead you have in the system. And then when we went to video it was even easier, because it takes 20 or 30 seconds to generate a video. What has been happening in the past couple of months is real-time video, where you need to stream 24 fps video over a network link from these GPUs. That's where we actually spend some of our time now. We started this progression with speech-to-speech models a year ago. We started optimizing them, and we were able to reduce the latency of our system with a globally distributed GPU fleet: when you send a request, we route it to the closest GPU, minimize our own overhead, pick the best runner, things like that. We are now applying those same optimizations to real-time video, and we see really interesting demand there, where people want to experience this stuff as they type, as they prompt. That's where the infrastructure challenges differ from traditionally running image and video models, because image and video are similar-ish, just more compute expensive, but you actually have to care about infrastructure when you go to less than a second of generation time for some of these models.

>> Yeah. Another interesting thing is that image models especially, you were able to run them on a single GPU; the parameter counts are much smaller. That actually makes it a little bit easier for us, as opposed to LLMs. With video, the parameter count is going up. Right now I think the open source ones are around 30 billion parameters, whereas we hear rumors about GPT-4 being in the trillions, GPT-5 maybe more. So on the flip side it's a little bit easier, but that doesn't mean video models are not going to grow, right? There are rumors around the numbers for Veo, numbers for Sora. So there's also an increase in parameter count, and you're going to have to use more distributed computing. But if you're at one node or eight nodes, you still have a slight advantage.

>> Yeah totally. Okay let's pop one layer up the stack to the models.

>> Let's do it.

>> One thing I think people don't fully appreciate about the media space, and you alluded to this before, is that there's a very long tail of models that are actually used in practice. I was hoping you could give people a sense of it: on your platform, how many models are people actively using? How is it distributed? And why do you think there's such a long tail of models being used compared to the LLM space?

>> This is actually one of the things I would say people got wrong three years ago. I mean, the jury is still out, but right after ChatGPT, people started talking about omni models: there were going to be these giant models that could generate video, audio, image, code, text, every type of token. That might still happen, but it's become clear that you're better off if you optimize for a certain type of output. This is even true for code generation, and definitely true for image or video output. That's one piece of feedback we got when we were pitching three years ago: oh, there are going to be omni models, there's going to be a single way of running these, it's going to be hard to create an edge on the modality. But it turns out that's not true, and it actually makes sense to have a technical edge on the modality. And this is one of the reasons there's also a variety of models: the best upscaling model still just does upscaling, and the best image editing model, even the best text-to-image model, is different from the image editing model. All these special tasks require their own model. It might be a similar model family or a similar architecture, but at the end of the day it has its own weights that need to be deployed independently, and that creates the variety in the ecosystem.

I think this also applies to language models, where even within the same modality there are different families of models with different taste, different characteristics, different personas. It still happens with language models: the code that Claude writes is very different from the code GPT-5 writes. But the thing about this space is that there are three or four different personas on top of different categories, upscaling, editing, text-to-video, whatever, so that gets you close to 50 models that are active at any point in time, and then you have a very long tail of models that people still choose because they might like that model's persona better.

>> Yeah, totally. Um speaking of model personalities, what are some of the most popular models on your platform? What do

you think are the personalities of them?

>> One thing that's been true since the beginning: the popular models change all the time. There are always new releases from different labs that take over from the others, and it's always a moving target. That said, there are two types of models usually preferred by our customers. There's usually one big, expensive model that has the best quality on video generation; this could be Veo, this could be Kling, this could be Sora. And then there's usually a workhorse model, which is cheaper and smaller but good enough, and people usually use that at higher volumes. I would say this has been true for the past almost two years: there's an expensive, high-quality model that keeps changing, and a cheaper, good-enough model that keeps changing, but overall that pattern has been constant.

>> Is the workhorse model for prototyping, and then you run it through the big expensive model for the final product, or what do people use the workhorse for?

>> It's for higher volume use cases. And depending on the application you're building, you might encourage lots of variations of the same output, but it's very application specific, I would say.

>> Yeah, there's also another dimension that's kind of happening in real time right now, which is based on the different use case you want to use the model for. When OpenAI released its GPT image editing model, it had superior text generation and editing capabilities, and for things that require a lot of text, people started choosing that model over the others. So it also tends to correlate with the different capabilities models bring and what they're good at. Kling, for example, people really like for visual-effects types of workflows, because they had that kind of data in their data set, as opposed to some other models; Cance, for example, is very good at detailed textures and artistic diversity, things like that. So it's really also a matter of this use-case dimension that models excel at. An interesting metric we saw in Q2 and Q3 was that the half-life of a top-five model was 30 days.

>> Wow.

>> That's very interesting to me, that these models are continuously shifting, that the top five models are continuously shifting.

>> Tough depreciation schedule for the model providers.

>> Hopefully they are building on top of the work they've already done, so it's additive in the end.

But yeah.

>> Yeah, I'm teasing, and the market is probably in a more turbulent state right now than what the end state will be. What do you guys think is the most underrated model? Like

what's your personal favorite?

>> I usually like the Kling models for video, but that has been changing because they don't have sound. For sound we have Veo 3 and Sora; they are the only ones. A lot of people are working on it, so I would love to have more variety there as well.

>> For image models, I like Reve's model, and Flux still holds a very nostalgic value for me even though it's been a year; I still go back to Flux. There are variations of Flux models now that I like. I'll go with Midjourney, which is not on fal, it's not available over an API. I just like how they navigated the space, which I think is very interesting.

>> They kind of brought this photorealism, which was a very big deal at the time; no model could do it. And now they're more of this artsy model. Now photorealism is kind of cracked and no one cares about it, so they have this niche of very artistic visuals, which is very cool.

>> Yeah. I'd love to chat about the marketplace dynamics a little bit. I understand your business as a bit of a marketplace, where you aggregate developers on one side of the market, that's the demand side, and you aggregate model vendors on the other side, that's the supply side. And the model vendors are both proprietary API model labs that view you as a distribution partner, and also open models that you host and run yourselves. So maybe talk a little bit about the closed model providers. You have partnerships with OpenAI on Sora, with DeepMind on Veo. What's in it for them? Why did they choose to partner with you?

>> We were one of the first platforms that accumulated developer love and a following, and those developers work at big companies, so they started working with us. We really built the platform for simplicity and for being able to get going really fast. And because, as Batuhan mentioned, the half-life of these models is really short, people usually work with many different models at the same time. So we were able to claim that we have this big developer base that loves the platform, isn't tied to any single model, and is here for the platform. Model research labs see that, and they use the platform as a distribution channel to tap into the developer ecosystem we built. On the other side, this helps us with the next model provider, because they see all the developers and want to be on the platform as well, which attracts more developers to the platform and creates a very nice positive flywheel for us.

>> Yeah, it very much is a marketplace business, and for developers it's a single choke point to access multiple model vendors. And to your point that the model space is changing so quickly, I think they really do value that choice.

>> Yeah, we call it marketplace-plus, because we get to provide infrastructure to the research labs as well as to the developers. So there are additional benefits, which ties into the flywheel effect we're creating. It's a marketplace plus other services next to it.

>> How do you position yourselves to get, in some cases, day-zero launch access, sometimes exclusive launch access, to models like Kling and MiniMax? How have you done that?

>> Yeah, throughout the last two years we were able to build a very robust marketing machine as well, and that is our connection point with the developers who are on the platform. Every time we release something, it creates another opportunity for us to introduce a new capability, introduce a new model, and model developers see that. We usually do co-marketing together, and as part of that co-marketing we get exclusive release access for a certain period of time, sometimes forever. We have a couple of competitors that are on the smaller side, so model developers want to work with the biggest platform out there, and increasingly that platform is ours, and we get to have these exclusive benefits with the model providers.

>> That's awesome. Why do you think the open-source model ecosystem has been so vibrant for video models? It almost feels like the open text models are just consistently a generation behind, whereas in video there's so much happening in the open source realm.

>> Video, and also image editing as well.

>> Why do you think that is?

>> It started with Stability. They first open-sourced Stable Diffusion and got insane adoption, and almost the same team then started Black Forest Labs, and they knew the power of open source, how it helps them create the ecosystem. With image and media models, the ecosystem actually matters. When developers are training LoRAs, building adapters, building on top of your model, it brings free marketing, but it also creates stickiness. There are still people using Stable Diffusion models because they like that ecosystem, because it was so open. So the Flux team saw this from their experience at Stability, and they had a very smart strategy of having at least some models that are open source and some that are closed source, and a lot of the video model providers that came after are following the same playbook, because you can have a very robust ecosystem. It gives you a lot of advantages in terms of marketing and developer love, and I think it's going to keep going like this.

>> Yeah, totally. I want to add to that: the domain is also very interesting. I think in the visual domain the ecosystem actually matters more. When Llama 2 first came out, there were many fine-tunes out there, but if you actually downloaded one and started using it, you couldn't really tell the difference.

>> You can't tell it's a fine-tune.

>> You can't really tell. And something like a ControlNet, that concept doesn't even exist for language models; they're a lot more generalized, so you can't really see the difference if you were to fine-tune one. It kind of ends up being very monolithic. Whereas in the visual realm, any small adjustment you make to the model can have huge implications. So it's just very fertile ground for a lot of customization.

>> Yeah. Speaking of Midjourney, one of David Holz's quotes I like is that he's curating the aesthetic space with Midjourney, and I very much think you just have this combinatorial explosion of styles, aesthetically. And I think that's the reason why some of the models on your platform are fine-tunes of other models, right?

>> Yes. And the thing is, even if you add a lot of diversity of aesthetics, if you train on everything, if you train on too many things, you may not be able to get the exact aesthetic, and there are so many times you want an exact aesthetic. You may still have to fine-tune the model to get exactly the output you want. Whereas with LLMs, that's not really how you operate; you don't want one exact particular outcome, it's a different problem. This is a lot more subjective, so you kind of have to do these post-training things on top of the models. Sora is another good example: Sora 2 is very fine-tuned on social-looking content, and you could have tens of different styles, and you'd still probably want to push the model in that direction with post-training.

>> Yeah, absolutely. It all depends on the use case, too. A customer support chatbot does not need personality; you want it to be as vanilla as possible. But we're talking about filmmakers, marketing teams; they all want to add the personality of their style or their brand, so they want greater control over the outputs. Whereas maybe with LLMs that's not necessarily true all the time. If you have an agent, if you're doing code generation, there's no equivalent of style and personality.

>> Yeah. Okay, that's a good segue for us to go one more layer up the stack. Let's go to workflows. What does the average developer workflow inside fal look like today?

>> They are using many different models, first of all. We looked this up recently: our top 100 customers are using 14 different models at the same time. These are sometimes chained to each other, so one text-to-image model, one upscaler, one image-to-video model, all part of the same workflow, or a more complicated combination of those as part of the same workflow, or different models used in different use cases. I think that's the most interesting part, the variety of models people use on the platform.

We do have a no-code workflow builder as well. We built it in collaboration with Shopify, and it's usually very good for their PMs, their marketing teams, the non-technical members of the team who are playing with these models. It's really good for trying different things and comparing different models, but eventually it makes it into the product as well; you can reach the workflow through an API. It's been very popular recently, and more and more people in a typical software engineering organization are now interested in image and video models, so the set of users of this platform has been increasing.

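As an illustration of that kind of chaining, here is a minimal sketch. The run_model helper, the model IDs, and the argument names are hypothetical stand-ins, not fal's API; the point is only the text-to-image, upscale, image-to-video hand-off.

```python
# Minimal sketch of chaining generative media models (hypothetical helper, not a real SDK).
from typing import Any

def run_model(model_id: str, **inputs: Any) -> dict:
    """Pretend to call a hosted model endpoint and return its outputs."""
    print(f"calling {model_id} with {list(inputs)}")
    return {"url": f"https://example.com/{model_id.replace('/', '_')}.out"}

def storyboard_to_clip(prompt: str) -> str:
    # 1. Text-to-image: lock in the aesthetic with a still keyframe.
    frame = run_model("text-to-image/base", prompt=prompt)
    # 2. Upscaler: bring the approved keyframe up to the target resolution.
    upscaled = run_model("upscaler/4x", image_url=frame["url"])
    # 3. Image-to-video: animate the keyframe into a short clip.
    clip = run_model("image-to-video/base", image_url=upscaled["url"],
                     motion_prompt="slow push-in, soft lighting")
    return clip["url"]

print(storyboard_to_clip("a hand-painted lighthouse at dusk, storybook style"))
```
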
>> Okay, so the average workflow is not just a text prompt. It's not "create a five-minute commercial" and it's done. If I wanted to create a five-minute commercial, what would the workflow be?

>> Yeah. For this reason, people actually prefer open models; that's one of the reasons people prefer open source models, because they get more control over the model and they can add things here and there to steer it toward the outputs they want. And when we go talk to studios or more professional marketing teams, they all love working with the open source models because of the pieces they can replace and the control they can add. These workflows usually resemble the big ComfyUI workflows with many different nodes, if you've seen those, where each piece of the pipeline can be replaced to give the creator more control.

>> Got it.

>> Yeah. And I think our workflow tool is not the final form; there's almost another layer of abstraction on top in terms of workflow. As we talked to these studios, we figured out that there are so many ways of using this, just like there are so many ways of using Photoshop; there's no single workflow. In fact, based on your role, whether you're a marketing person or an animator or whatever, you have different workflows. And I think that is also emerging: as more and more professionals actually start to use these tools, you see the emergence of very particular workflows. One of our favorite creators is PJ Ace; he shares his workflows online, and every time he posts, basically every month, he has a different kind of workflow. It's really driven by the new models; based on a new model, he may have a completely new workflow next time. I think once we reach some level of productivity, with professionals actually adopting these tools, there will probably be more standardized best practices around using these abstractions, but I don't think anyone knows the final form yet. Every day we see new things, and we try to update our product to make sure it caters to those people.

>> Totally. One of the workflows I'm seeing somewhat commonly is: you have a high-level idea of what you want, and you type that in, along with the aesthetics you want, and you iterate on the aesthetics with an image model. Then you use that image model, with the aesthetics you want, to generate a series of images, which form the storyboard, so to speak.

>> Yes.

>> And it cascades down from there: the video models kind of interpolate in between them. And it's funny, because that's actually how Pixar and all these companies work, in terms of storyboards.


beginning. Yeah. Like that's why they had to do it like that. But like it actually also makes sense, right? Like

it makes sense in so many ways to do it to do it like that. And yeah, they they call the the that stuff pre-production and then you know post or production, right? So pre-production is all the all

right? So pre-production is all the all the tooling around storyboarding etc. like that's what everyone does like even today. Uh even though it was like a very

today. Uh even though it was like a very cost cost thing now it's more of a speed thing >> and AI makes the workflow you know very interesting where you have everything

laid out and let's say a new model new text to image model comes out they they built it in such a way that okay you can press a button and now all different combinations are going to be generated

with this other model and then you can like generate all the videos again. We

we've seen those insane workflows. You

want to update one thing and the whole thing is going to cost like a thousand dollars to rerun it again. But these

individuals like they spend a ton of money on on on creator platforms. I've seen bills like half a million dollars just spent by a single individual, maybe

even even more when it's it's a small production studio, stuff like that. So

it's it's it's pretty incredible.

>> Totally wonderful. Okay, speaking of studios who are building building on your platform, let's let's go our final layer up the stack. Let's talk about customers and and markets and then what

the future might hold. Um maybe what are the coolest things that people are building on your platform today and are with are they what we would think of as traditional media businesses or are they net new businesses?

>> It's it's all over the place. Like

what's most what's so exciting about this space is that it just goes across like all of the you know markets you can possibly imagine. Like I'll give you

possibly imagine. Like I'll give you some more um I guess longtail stuff first because because it's it's super fun and interesting. There's a security company that's building on top of fall

and they basically have these like trainings and the trainings are generated on the fly >> and and the content is all dynamic.

Obviously, they have some scripts I'm I'm guessing to to kind of fit like the curriculum, but like the the content you get, you know, per person um is is is all dynamic.

>> This is Brian Long's company.

>> Yeah, this is adaptive security.

>> Um yeah, they they do they do some really cool stuff. Uh I think that's one of the like most unique uh use cases. Uh

you can see how that translates into like rest of education. I think that market is like kind of picking up. Uh

another one I think like um you know this is this is more common use case I guess is is uh like AI native studios you you mentioned like the Bible app that was that was one of my favorites uh

it's called faith it's one of the like highest ranked apps on the on the app store and yeah they have like stories for each of the stories from the Bible and and they're like really well

produced uh and you know this this sort of category of AI native studio studios u either in the form of you know applications or or like they're doing

like you know feature um uh feature films and and you know series and things like that that's a huge category um so I would call this like maybe new media or

like AI native media and entertainment um there is also a lot of like design and productivity like out of our uh public uh customers like Canva is one of

those Adobe is one of those so they're integrating kind of like in this um you know in this older tooling they're integrating new new models um ads is a

big one u so and ads kind of come in many flavors um basically there's like the UGC style ads like the stuff you see like there's a person you know demoing a

product that's like a very big category uh so AI generated versions of those there's also kind of like older styles of ads right more professionallook higher higher production uh maybe you

saw the Coca-Cola ad uh that came out recently. Some

recently. Some >> controversy about that.

>> Yeah. Uh so that's that's like a kind of a higher production um you know style of ads but but you know what we're excited about is also like programmatic ads, right? So where where you can do

right? So where where you can do personalized um you know to the degree of like like literally individuals >> um >> you know yourself being the ad or in the movies whatever. So like that's that's

movies whatever. So like that's that's also a a big like growing use case.

>> Yeah, I'm most excited for the education use case. I think ads are the backbone of commerce and the internet, so that's a super compelling business case, but education is a market that's so important and has never really had that many compelling business cases behind it.

>> Yes.

>> And part of the challenge with education has been the bottleneck of creating high-quality content at scale that's actually ideal for the learner. So I'm personally most excited about education.

>> Same. I really love the education use cases, and I actually think ChatGPT, or just LLMs in general, are already solving it in a way, but it's not the right form factor. If you actually want to fully realize the power these models are bringing, you need to go into the visual space, because it's so much more compact, more approachable. And yeah, I think once we actually crack visual learning through these video models, that's when it's really going to impact people.

>> Do you think that the advent of generative media is going to increase the value of existing IP, so Mario Brothers, Nintendo, Disney, Pikachu, all these things? Or do you think it's going to lead to the democratization of the creation of IP?

>> I love this question, because I would say six months ago it felt like

this was all happening too fast for Hollywood, for the IP holders, to adapt and be part of it. From our viewpoint, we thought, all right, these AI-native studios are just going to take over, Hollywood is just going to be too slow, this is going to go right past them and they're going to be left behind. But this summer something changed, and we've been talking to a lot of the usual suspects from Hollywood. We recently had our first generative media conference, and Jeffrey Katzenberg, former CEO of DreamWorks, was there, and he made a comparison. He said this is playing out exactly like animation when it first came out. People revolted against it. It was all hand-drawn before that, and computer graphics were new, and there was a lot of rebellion against computer-driven animation. And something very similar is happening with AI right now. But there's no way of stopping technology. It's just going to happen. You're either going to be part of it or not.

So we are seeing a lot of existing IP holders now taking this very seriously, and at least for the medium term I think they are pretty well positioned, because they have the technical people behind the scenes who are actually really interested in this technology, they have the IP, and they also have storytelling and filmmaking know-how. You still need quite large budgets; maybe things are going to get cheaper, but in the medium term filmmaking is still going to be expensive. Yes, AI is going to make it maybe a little bit cheaper, but we need these deeply technical people who know filmmaking, who have the IP, who know storytelling, to be part of this in the beginning, and I think they're going to play a big role in the coming years in the AI ecosystem.

>> Yeah, when there's infinite content generation, it almost puts a premium on the things that are finite. And I think, for those of us who grew up with Power Rangers or Neopets or whatever, there's just this nostalgia element and this finite supply of IP that really resonates with us.

>> The opposite is true too. There's also a lot of new IP: we had little toys of these Italian Brainrot characters, characters that no one owns, completely AI generated by the internet community. Once you have cheap generation of content, and very different permutations of it, the things that people like catch on and become part of the zeitgeist.

>> So there are signs of the opposite being true as well.

>> Yeah, both are true. A related question: how do we prevent the infinite slop machine state of the world? There's a version where we're just connected to this machine that knows how to personalize stuff for us, and we're just hooked up to the infinite slop slot machine, and there's a version where human creativity and artistry and things like that are involved. How do you think the world plays out?

>> I think humans eventually converge on the things that are more meaningful. In general, no matter how much slop we fill the world with, I think taste prevails, and people are drawn to experiences that are personal and human. I just think that's going to happen. One interesting example of this was when Meta announced Vibes and then OpenAI announced Sora 2; the reception was very different. One of the reasons, in my mind, was that Vibes was positioned as this slot machine kind of thing. They didn't have the product out at the time, but it was just AI-generated content, characters you have no relation to; it was kind of detached. Whereas Sora really made it about friends, right, with the cameos.

>> And now you can cameo your pets.

>> There you go. It's huge, right? So yeah, I think this connection to friends and pets and things like that is what actually made the difference, and Sora was also being very personal about it. They were very adamant: hey, we want to make this about friends, about these connections, as opposed to an infinite slop machine. So I think that reception was a good signal that there are ways to make this technology work in a good way.

>> Absolutely. Okay, I want to get your perspective on timelines, on what's feasible today and what's still to come. Do you think we'll see Hollywood-grade, feature-length films entirely generated by AI, and if so, on what timeline?

>> What does "entirely generated by AI" mean? Is it no human involvement, or

>> No human filming.

>> Editing is okay?

>> Yes, absolutely human editing, but no human filming.

>> I think in less than a year we'll have advanced video models that, combined with the storyboarding people have been doing, will give you feature-grade short films, under 20 minutes. I think that's a fair estimate. Even today you can make really great films; it's just that not enough investment of time is going into them.

>> But with enough investment of time, the model quality, I think, will be there.

>> I think we're already there.

>> Okay. And you think it's photorealistic? Anime? What categories do you think are more likely to happen sooner? I think photorealistic is what everyone is targeting, but anime would be a cool one, right? You don't see that many anime-specialized models. Why not? I think there needs to be a market for that.

>> Clearly, I think it's going to be animation or anime or something cartoon-like, not photorealistic, as far from photorealism as possible, maybe even as fantastical as possible, because filming photorealism is cheap and doable already. That's not what costs money when people are making movies; it's the non-photorealistic stuff that's actually expensive. And even if you look at animated movies, some of my favorite movies are animated: the Toy Story series, How to Train Your Dragon, Shrek, Ratatouille. People like these things not because they remind them of photorealism; it's the storytelling that matters, and that created a new medium. I think AI is going to be similar to animation in how it brought a whole different angle to filmmaking. I think feature films are hard, because with photorealism people usually like the movies their favorite actors and actresses are in, and that's one step removed from

>> That's the thing that costs money: getting the actors.

>> Yeah, exactly. So we first need to build a connection to an AI-generated character before we can turn it into a film.

>> But among different kinds of content, like shorts, I think Italian Brainrot is an amazing example. It started as these characters and then it became a Roblox game

>> making I don't even know how much revenue. So yeah, I think AI-native stuff and shorter-form content is probably going to be very big. We saw this with VFX, where the effects, one of the most expensive parts of producing these videos and films, got taken over by AI very quickly, because it's very easy for AI to do explosions, right? Or a building collapse. It's almost perfect now, and I think it's just going to continue along that dimension.

>> And maybe facial expressions are going to be hard; they're very hard.

>> You don't have to do facial expressions. That's going to be okay. But now they can do gymnastics.

>> Yeah, gymnastics are important. Good thing we have a lot of footage of the Olympics.

>> You mentioned Roblox. At what point do you think we'll have interactive video games that are generated in real time?

>> Yes, I think so. I'm very excited about it, actually. If you think of text-to-video as the continuation of text-to-image, I would say text-to-game is the continuation of text-to-video, because with a game you're essentially making the video interactive; that's kind of what it means. And I actually think there's a world where, I know hyper-casual games exist, but this is another level of hyper-casual, where the game is actually discardable. I think we're not too far away from that. I actually feel pretty bullish on these one-time playable games, very short games. Yeah.

>> I think that's probably going to happen. I think that's a good use case for world models, among many other great use cases. But I think it's going to happen.

>> What about AAA-quality games? Will these models at least assist with and change the development pipeline for those games?

>> Yeah, I think they're already having an impact. At least LLMs are impacting conversations; there are dynamic conversations, things like that. Pre-production is impacted already. Side-quest-like IP work is impacted too, where you already have the assets and you can make a minigame out of them. People are using it, it's not very public, but that is already happening. Using it for full AAA production, or generating that with a model, is, I don't know, at least three or four years out for me, and it would be insane if we can actually do that. But along the way, just like in the video space, on the way to AAA there are many other things, and I think those are going to be very big.

>> Yeah, the video model space has just exploded in terms of options, quality, and so on. As you look ahead toward what's needed to get us to the promised land for everything generative media can be, do you think there are fundamental R&D breakthroughs still needed, or do you think we're very much in the engineering scale-up leg of the race?

>> I think the architecture needs to change at least slightly if you think about scaling these models by 10x or 100x. I think the architecture is a big

>> bottleneck right now in terms of inference efficiency, right? More compression of the video space is definitely needed. We saw this with images: image models used to be much less compressed, you were operating in pixel space, and then we introduced latent space, and even inside that latent space you took something like 64 pixels and made them a single latent. Now with video we're also compressing on the time dimension, where we're seeing something like 4x ratios. Why not 24x or whatever? You need to increase that compression, and I think that's going to be a big driver of improving both inference efficiency and training efficiency.
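
For intuition, here is a rough, illustrative sketch of why that compression matters: it estimates how many tokens a latent video diffusion transformer would have to attend over at different spatial and temporal compression ratios. All of the numbers (8x per spatial axis, 4x vs. 24x temporal, a 2x2 patch size, a 5-second 720p clip) are assumptions chosen for the example, not the configuration of fal's stack or any specific model.

```python
# Back-of-the-envelope only: token counts for a latent video diffusion
# transformer under assumed compression ratios (not any real model's config).

def latent_token_count(frames, height, width,
                       spatial_ratio=8, temporal_ratio=4, patch=2):
    """Approximate transformer token count after VAE compression and patchifying."""
    lat_t = frames // temporal_ratio   # compressed time dimension
    lat_h = height // spatial_ratio    # compressed height
    lat_w = width // spatial_ratio     # compressed width
    return lat_t * (lat_h // patch) * (lat_w // patch)

# Assume a 5-second, 24 fps, 720p clip.
frames, h, w = 5 * 24, 720, 1280

tokens_4x = latent_token_count(frames, h, w, temporal_ratio=4)    # 108,000 tokens
tokens_24x = latent_token_count(frames, h, w, temporal_ratio=24)  # 18,000 tokens

print(tokens_4x, tokens_24x, tokens_4x / tokens_24x)  # 6x fewer tokens

# Since self-attention cost grows roughly with the square of the token count,
# a 6x reduction in tokens is far more than a 6x reduction in compute.
```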

But for any model you take on the generative media side at this stage, we're far from being scaled up engineering-wise. There hasn't been enough investment put in, or it just started happening within the past six months. Google showed this with their models and how quickly they were able to catch up. They didn't need to innovate that much; they have the resources and can put more effort into it. At the same time, smaller labs are able to demonstrate this too, because there's so much unique and novel work you can do at the data level to train these models. So I think that's also contributing, and then there's the factor of mid-tier labs that raised somewhere between a hundred million and a billion dollars, which are also trying to come up with models and release them open source or otherwise contribute to the ecosystem.

>> Yeah.

>> That's what's so exciting about this space. There's so much more work to do. So far the research community has done the simplest thing possible: they captioned images and trained a model on text prompts. Now we're doing video and image editing, which requires a lot more data engineering to create the datasets. But luckily, it seems we have abundant free video data; we're going to run out of compute before we run out of video data. So that means there's a lot more work to do and a lot more room for improvement.

I mean, Gorkem's math from earlier also indicates that if you want to get to real-time 4K video, that means, I don't know, 100x or maybe more in compute, or the architecture, something has to give to get us there. And right now a lot of models are not that usable, especially for professionals, or even for consumers, right? For the best models you still have to wait 40 seconds or so, sometimes two or three minutes, and that's not really acceptable in a world where we want everything on demand. So yeah, I think something needs to change.
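
To make that roughly-100x figure concrete, here is a hedged back-of-the-envelope comparison of pixel throughput. The baseline (a 5-second, 24 fps, 720p clip generated in about a minute) is an assumption for illustration, not a measured figure from fal or any particular model.

```python
# Rough pixel-throughput comparison: an assumed current generation speed vs.
# what real-time 4K playback would require. Illustrative numbers only.

# Assumed baseline: a model renders a 5-second, 24 fps, 720p clip in ~60 s.
clip_seconds, fps = 5, 24
baseline_pixels = 1280 * 720 * clip_seconds * fps            # ~110 million pixels
baseline_wall_clock = 60                                     # seconds of compute
baseline_throughput = baseline_pixels / baseline_wall_clock  # ~1.8 M pixels/s

# Real-time 4K target: each second of wall-clock time must produce
# one second of 3840x2160 video at 24 fps.
target_throughput = 3840 * 2160 * fps                        # ~199 M pixels/s

print(target_throughput / baseline_throughput)               # ~108x gap

# Under these assumptions the gap is on the order of 100x, which has to come
# from some mix of faster hardware, better kernels, and architecture changes
# (for example, heavier latent compression), as discussed above.
```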

>> And probably the pace of hardware getting faster isn't enough on its own. If that's all we rely on, it'll take much longer; we'll have longer timelines. So I think the architecture needs to get better.

>> Awesome. Thank you, guys. You made a very high-conviction bet on generative media as a theme, way before it was obvious, and I think we're just at the start of what's going to be an explosion of generative media. It's been really cool to hear about everything you've built, from the kernel optimizations and the compiler all the way up to the workflows and what you're seeing from customers in new and old media alike. Thank you for joining us on the show today.

>> Thank you. Thank you so much.

>> This was a lot of fun.
