Google's Nano Banana Team: Behind the Breakthrough as Gemini Tops the Charts
By Unsupervised Learning: Redpoint's AI Podcast
Summary
## Key takeaways
- **Nano Banana's Impact: Gemini Tops App Store Charts**: The "Nano Banana" AI image model has significantly boosted Google's Gemini app, driving it to the top of the App Store charts and surpassing ChatGPT. [00:25], [00:33]
- **Character Consistency is Key for Nano Banana**: A primary breakthrough of the Nano Banana model is its advanced character consistency, enabling users to generate images of themselves or specific characters in various scenarios with high fidelity. [01:56], [02:29]
- **Unexpected Use Case: Colorizing Old Photos**: Beyond creative iterations, a surprisingly emotional and popular use case for Nano Banana has been colorizing old black-and-white photographs, allowing people to see historical images and family members in a new light. [02:48], [02:57]
- **Future of Image Models: World Knowledge & Personalization**: Future image models will integrate world knowledge from language models to be more helpful, offering personalized aesthetics and conversational refinement to better understand user intent. [04:05], [05:03]
- **Prompt to Production is Overhyped**: The expectation of generating production-ready content directly from a single prompt is overhyped, as significant iteration and work are still required for final products. [36:52], [37:14]
- **Image & Video Models Converging Towards OmniModels**: Image and video generation efforts are closely related and moving towards "OmniModels" that can handle multiple modalities, sharing techniques and influencing each other's progress. [34:17], [34:42]
Topics Covered
- Image models are now reasoning engines, not just renderers.
- The hardest UI problem is detecting user intent.
- Chatbots can't replace pro tools for pixel-perfect control.
- Image quality isn't solved; the worst cases must improve.
- Image models will evolve from creative tools to factual engines.
Full Transcript
There's a PM on our team. Her name is Nana. She was up at 2:30
in the morning working on this release, and she came up with the name then,
and it stuck. I'm trying to even conceptualize, like, what the next, you know, 10x
improvement even would be. Just the image quality, I think, has a lot of room to
improve, and I think personalization is an area that's still being improved also on the
tech side as well. I think we're just scratching the surface on what's possible here.
What actually has happened now having released this out into the world? The most exciting
thing for me was... Gemini recently reached the top of the App Store, finally overcoming ChatGPT for the first time since ChatGPT's launch. And what drove it? Well, if you've been on Twitter or the internet, you've definitely seen it. It's Nano Banana, an incredible new image model that has huge breakthroughs in character consistency and image quality. And today
on Unsupervised Learning, I got to talk to two folks who are really driving that
at Google, Nicole and Oliver. We hit on a bunch of different things, including their
favorite use cases for the models and how folks are actually using NanoBanana today. We
talked about some of the challenges in building a product around really good models and
how they thought about solving the blank canvas problem, as well as interacting with other
image editing products. And we hit on the future of image models, what's next, frontiers,
and how it'll interact with video models too. This was just a fascinating conversation with
what I'd say is the main character of the AI ecosystem right now. And so
without further ado, please enjoy this conversation with Nicole and Oliver. Nicole
and Oliver, thanks so much for coming on the show. Really appreciate it. I've been
looking forward to this. I feel like you have pretty much taken over the entirety
of my Twitter feed with Nano Banana, as well as like any spare moment of
free time I have. You know, I figured a lot of things we'll dive into
today, but maybe just to start, you obviously were sitting with this incredible model and
product and before it was released into the world, and I guess it was maybe
anonymously released into the world first, but there were folks, you know, you were some
of the first to play around with it. I'm curious, like, some of the use
cases you thought would be most prevalent or got you most excited, and then what
actually has happened now having released this out into the world. Oliver has seen a
lot of pictures of my face. In various iterations. The most
exciting thing for me was character consistency and seeing yourself in new scenarios. So I
literally, there are slide decks full of my face in like wanted posters
and as an archaeologist and all my childhood dream professions, basically. We've now created an
eval set that has, you know, my face in it and like other people on
the team that we just kind of eyeball when we develop a new model. That's
like the ultimate honor in the AI world, to have your own. I'm very excited. So I was really excited about the character consistency capabilities, because it just felt like it's giving people a new way to imagine themselves in a way that just wasn't very easy to do before. And that is one of the things that people ended up, you know, being really excited about. We're seeing people turning themselves into figurines, which is like a very popular use case.
The one that surprised me, but it probably shouldn't have, is people colorizing old photos. That has been a really emotional use case that comes up for people, of like, I
can now see what I actually looked like as a baby, or I can see
what my parents actually looked like from these black and white photographs. So that's been
really fun. I mean, I'm sure alongside seeing all the ways people use these things,
one of the joys, I'm sure, of having a very popular product is, and I've
seen this on Twitter, you must get like a million feature requests, right? And everyone
wants, you know, these models to do this thing or that thing. What are like
some of the most common things people want? And how do you think about, I
guess, like the next milestones or frontiers for these types of products and models? The
things we get most on Twitter are higher resolution. So a lot of pro feature requests: we're currently at 1K resolution, so people want higher than that. We get a lot of requests
for transparency because it's also a really popular pro use case. And those are
probably the two biggest ones that I've seen, plus better text rendering.
Obviously, I think character consistency for so long was this big thing that needed to
be solved and feels like you guys have done a great job on that. What
are the next frontiers of image model improvement in your minds? Yeah, so I think
one of the things that was most exciting to me about this model is that
you can start to give it harder questions. So instead of having to define every
aspect of the image you want to see, you can really ask for help like
you would from a language model. So for example, people are using this for like
I would like to redecorate my room, but I don't know how, like give me
some ideas. And the model is able to come up with like, oh, you know,
this might be a plausible suggestion or like given the color scheme, like these things
would go together well. So the thing to me that's really interesting is like, is
really like getting the world knowledge from the language model so that we can make
these images truly helpful for people and maybe show things that like they hadn't thought
of or like, you know, they don't know what the answer is going to be,
or maybe there's an information-seeking request. Like, I would like to know how this thing
works and, like, being able to show an image of, like, oh, this is how
it works. Like, I think that's a really important use case for these models going
forward. Where are we in solving that right now? Well, you know, of course, sort
of, for that specific example, aesthetics is always a bit tricky because it requires pretty deep personalization to be able to give useful information. And I think personalization is an
area that's still being improved also on the tech side as well. So I think
we're quite a ways away from really being able to tell exactly what the user
is asking. But I think with a little bit of clarification and being able to
sort of have a conversation with the model, which is another thing I'm really excited
about with this model: you can talk to it in a thread. Then you
can kind of refine down to get the images you're trying to get in the
end. Do you think personalization will happen at like just the prompt layer? Like ultimately
you clarify enough that you can then feed the model enough context to personalize? Or
is it like, well, will folks have different aesthetic models? I think it'll be more
at the prompt layer. I think we'll have like, you know, given things the users
told you about themselves, for example, we'll be able to make much more informed decisions.
So at least I hope it's that way. You know, of course, everybody having their
own model, and serving that sounds like kind of a nightmare, but maybe that's the
direction we're moving. But I do think that you will have very different aesthetics,
right? I do think that there's probably some level of personalization that has to happen
at that level. Because even, you know, you see it probably when you go to
the shopping tab on Google right now, right? Like you're looking for a sweater and
then you get a bunch of recommendations and then you actually wanted to hone into
your own aesthetic and like be able to pull in from your closet to see
what other things work with it. So I hope that a lot of that can
happen in the context window of that model, right? Because we should be able to
feed the model images of the things that you have in your closet and then
try to find something that actually goes with that. And I'm really excited about that.
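A minimal sketch of that kind of multi-image, in-context prompting, assuming the google-genai Python SDK and a public Gemini image model ID; the exact identifier, config options, and file names here are illustrative assumptions rather than the team's setup:

```python
# Multi-image, in-context prompting: several wardrobe photos plus a text request
# go into one generate_content call, and the model returns styling notes and/or
# a generated image suggestion.
from io import BytesIO

from google import genai
from google.genai import types
from PIL import Image

client = genai.Client()  # reads GEMINI_API_KEY from the environment

closet = [Image.open(p) for p in ["sweater.jpg", "jeans.jpg", "boots.jpg"]]  # hypothetical files
prompt = (
    "Here are a few items from my closet. Suggest one outfit built around the sweater "
    "and generate an image of the full look on a neutral background."
)

response = client.models.generate_content(
    model="gemini-2.5-flash-image",  # assumed model ID for the image model discussed here
    contents=[prompt, *closet],
    config=types.GenerateContentConfig(response_modalities=["TEXT", "IMAGE"]),
)

for part in response.candidates[0].content.parts:
    if part.text:              # textual reasoning or styling notes
        print(part.text)
    elif part.inline_data:     # generated image bytes
        Image.open(BytesIO(part.inline_data.data)).save("outfit_suggestion.png")
```

Because the model is conversational, follow-up refinements could be sent as additional turns in the same thread rather than re-prompting from scratch.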
So I hope we can do it there. Maybe you will need some level of
aesthetic control kind of beyond that, that like is with the model, but I suspect
that might happen more in the pro workflows again. I'm surprised that, obviously in the LLM world, it feels like a lot of, or even in the image world, a
lot of the decisions that are made in data that's used for pre-training, like actually
really does impact the, you know, end capabilities or aesthetics of the models. And so
I wonder whether we'll have a world where there's one model that's just prompting... I think we continue to be impressed with just how broad the range of use cases are
that you can support with the off-the-shelf model. But I think one thing... to your
point is like you get a lot of mileage on it on some of these
like consumer facing use cases where, you know, you're trying to just sketch out what
something would look like in your room, et cetera. But once you get again into
the more advanced capability, like you do need to bring in other tools to actually
make it a final product and have it be useful in a workflow for, you
know, marketing or design and those kinds of capabilities. Well, I'm sure folks are really
curious, like what made these models so much better? A lot of special sauce. Yeah.
I think it's one of these boring cases where there's not one specific thing. It's
like figuring out all the details right and really tuning the recipe and having a
team that's been working on the problem for quite a while. I think we were
even kind of a little bit taken aback by the degree of success that the
model had. We knew we had a cool model and we were really excited to
get it out there. When we released it on LMArena, it wasn't just that the Elo scores were high, and that's great; I think it's a good sign the model is useful. But the real metric for me was that we actually had so many users going to LMArena in order to use the model that we had to keep increasing the number of queries per second that we could support.
And we definitely didn't expect that. And I think that was really the first sign
that, oh, this is something that's really special. And there's a lot of people that
have use for this model. I think that's one of the most fun parts of
this whole ecosystem, where it's like you obviously have some sense of the model building
it yourselves, but it's only in releasing it in the wild that you really understand
the extent of the power it has. And clearly, this has struck a chord. Obviously,
some of the reasoning capabilities of the models are really driven by improvements in LLMs
themselves, right? And I think maybe just contextualize a little bit, how much do image
models benefit from these improvements in LLMs? And do you kind of expect that to
continue as LLM progress continues? Hey, guys, this is Rashad. I'm the producer of Unsupervised
Learning. And for once, I'm not here to ask you for a rating of the
show, although that is always welcome. But I would love your help with something else
today. We're running a short listener survey. It's like three or four questions. And it
gives us a little bit of insight into what's resonating and how we can ultimately
make the show even more useful to you, our listeners. The link to the survey
is in the description of this show. I promise you it takes like two or three minutes, and it's a huge help to us. We're always trying to make the show better, and so this is one way of supporting that. And yeah, that's it. Now back to the conversation. How much do image models benefit from these improvements in LLMs, and do you kind of expect that to continue as LLM progress continues? Yeah, I mean, definitely they benefit, you know, almost 100% from the world knowledge of language models. I mean, the thing about Gemini 2.5 Flash Image, which is the name of the model. Is it a Gemini model? It's a
little more fun of a name, but yes, that too. That's a bit easier to
say. I do wonder how much of our success has to do with just the
fact that people like saying the name Nano Banana. But yeah, you know, it is
a Gemini model. So like we're, you know, you can talk to this model like
you do Gemini and it understands all the things that Gemini understands. And that's actually,
I think, was a really important kind of step function and utility for these models
is to integrate them with the language models. Yeah. Yeah. And you probably remember this,
but it used to be the case that, you know, two, three years ago, you
had to be very, very specific about what you asked the models to do. It was like, a cat sitting on a table with this exact background and these are the colors. And now you don't have to do that anymore. And a lot of it is because the language models have just gotten so much better. Yeah, it's
not like magic prompt transformation that's happening on the back end. I feel like that
was the hack back in the day. You'd like have a sentence and they would
turn it into like a 10 sentence prompt that made sure it was specific enough
to get things right. But now it feels like the models are sophisticated enough to
get it, which is exciting. One thing I'm curious about from the product perspective is...
I feel like you have so many different kinds of people that use these products,
right? You have the folks that I guess are flocking to LMArena the second
the product's out. They know what they want to do with it. They are experts
in playing with it. And you have a lot of folks that are just probably
Gemini users and have this blank canvas problem. They're like, what exactly do we
do with the product? I'm curious... How did you think about building product for like
those two different types of users? There's so much more we can do. And to
your point, we're doing a lot, right? I think the LM Arena crowd and even
just developers, they're really sophisticated. They know how to use these tools. They came up
with use cases that are new to us. We've seen people do, you know, turn
objects into holograms within a photo. And that's not something that we trained for or expected it to be good at, but apparently the model is very good at it. For consumers, making it really, really easy is super important. And so even now, when
you go into the Gemini app, you will notice there's banana emojis everywhere. And we
did that because we realized that people were having a really hard time finding the
banana when they hear about it. And then they go into the app to try
it, but there was no obvious place to actually do it. And we've done a
lot of work to, you know, pre-seed some of these use cases with creators that
we partner with, and to kind of just, like, put examples out there that link
directly then back to the Gemini app, and so then the prompt pre-populates. I think
there's a lot more that we can do on kind of just the zero-state problem
of, like, giving you visual guidance. There's probably a lot more that we could do
with making gestures a thing when you're trying to edit images so that it's not
all just prompt-based. And sometimes when you do want something very specific, you still need
a fairly long prompt and that's just not a natural mode of operating for most
consumers. And so I try to give it the parent test. Like if my parents
can use it, then it's probably good enough. And I don't think we're quite there.
So I think we have a long way to go, but a lot of it
does come down to just showing, not telling, giving people the examples that they can
easily reproduce, like making it really easy to share things. Yeah. And so it comes down to a lot of things; it's not kind of one magical answer. The other thing
we've noticed is that sort of social sharing is a
really important part of like the blank slate problem. Like people see things that other
people are doing and because the model is sort of personalizable, by default, you can
try it on yourself or your friends or your pets. There's this very easy way
for people to see something and be like, I'll try that for myself and see
how it works. That's a really big way that this model is being spread around.
Today, the way you interact with these models is all via text. What other design
interfaces get you excited about ways people could interact with them longer term? I think
we're just scratching the surface on what's possible here. Ultimately, I envision all of
these modalities kind of blending together and then having some sort of an interface that
like picks the right one for whatever it is that you're actually trying to do.
I think even now kind of moving more towards a place where the LLMs don't
just output text, but they can output images and visual explainers when it's actually relevant
to the user query. So I think there's a lot of potential in voice, especially.
It's a very natural way for people to interact. I don't think anyone has really
cracked, like how that could actually show up in a user interface, we're still very
much, you know, typing the thing that you want to put in. And so some
combination of that, plus again, gesture. So if you want to erase, you know, an
object from an image, you should be able to just erase it like you would
on a scratch pad. Again, like how do you do that and how do you
seamlessly transition between those modalities based on what the task is, is something I'm
really excited about. And I think there's a lot of headroom in figuring out like,
what does that actually look like? Yeah. What feel like the limitations to that voice
UI today? I mean, you know, I could totally imagine talking to
these images. I think some of it might also just be kind of prioritizing it
because, you know, we're still pushing on a lot of model capabilities. Voice is obviously
getting really good. Um, also in the last couple of years. So I think we're
going to start seeing some people take this on and think about, and probably we'll
do some of this work too, try to think about what this could look like.
I think part of the problem is actually the how do you detect the intent
and then how do you switch different modes based on the user intent and what
they're actually trying to accomplish? Because it's not obvious. And you could also end up
with a surface that, again, is like a blank slate. And then how do you
actually show the user what's possible? Because it's one of the other things that's really
challenging. I think we see users come into chatbots, and they just expect the chatbot
to be able to do anything, right? Because you can just talk to it like
you would talk to a human. And it's actually very hard to explain the limitations.
It's very hard to show people what you can do when the tool can now
do so much. And so I think part of it is just figuring out how
do you scope the problem? How do you show people what's possible in a UI
that can ultimately help them accomplish almost anything? Totally. And it feels like if you
teach them at any moment in time what the chatbot can and can't do, that
changes three months later at maximum. And so you're always having to reteach those
very same capabilities, which I think is a really interesting product challenge a lot of
folks have in consumer and enterprise products. You alluded to it earlier around
evals. I guess you have your own eval data set, Nicole, of yourself. But I'm
curious, image models generally, what do model evals look like for this besides
just putting things out on LMArena? And any learnings since you got started on
tracking what makes these models better? Yeah. Well, one of the nice things about
the language models and the vision language models getting better is that there is starting to be like a feedback loop where we can use the intelligence from the language
model to help with evaluating its own generations. And this kind of like is a
virtuous circle where we can continually improve both of these dimensions. So that's pretty exciting.
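A rough sketch of that kind of model-in-the-loop evaluation, again assuming the google-genai Python SDK; the judge model ID, rubric, and eval files are illustrative placeholders, not the team's actual eval setup:

```python
# Using a vision-language model as a judge: each generated image is sent back to
# a model together with the prompt that produced it and a scoring rubric.
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

# Hypothetical eval set: prompts paired with the files a candidate model produced.
eval_set = [
    {"prompt": "a wanted poster featuring the reference character", "image": "out_001.png"},
    {"prompt": "the same character as an archaeologist at a dig site", "image": "out_002.png"},
]

RUBRIC = (
    "You are judging an AI-generated image against the prompt that produced it. "
    "Reply on one line as: adherence=<1-5> quality=<1-5> critique=<one sentence>."
)

for case in eval_set:
    with open(case["image"], "rb") as f:
        image_part = types.Part.from_bytes(data=f.read(), mime_type="image/png")
    judgment = client.models.generate_content(
        model="gemini-2.5-flash",  # placeholder judge model
        contents=[RUBRIC, f"Generation prompt: {case['prompt']}", image_part],
    )
    print(case["image"], judgment.text)
```

Scores like these can be aggregated across checkpoints to flag regressions, with human eyeballing still deciding the close calls.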
But I think ultimately, people are the arbiters of the images that they're trying to
create for themselves. So that's why I think cases like LMArena, where users are entering
their own prompts, this is the best way to evaluate a model. So taste also
matters a lot. Oliver won't say this because he's being modest, but he's one of
the people on the team who just has a really good eye for what these
things look like and whether or not things are working, where are the flaws in
the images. So we do have a couple of people on the team like that who will do a lot of this initial eyeballing, as we call it, very technical term, on what the outputs from a model in training actually look like. And then that
still plays a pretty significant role. And then to your question, we do get a
lot of feedback from people, including on X, on what's working, what's not working. And
then we do try to adapt our evals to capture the things that are both
working so we don't regress on them, but then also the things that are not
working that the community wants us to push on. So keep sending us feedback. It
feels in some ways like a much harder problem than, you know, some of these LLMs that are working on, say, a legal use case where there's some relatively right answer, you can catch where the models deviate, and there's kind of this pure eval data set that really works. Because images are so subjective, it almost feels hard to know what to measure. Sure, there are things like did the character stay consistent from generation to generation, and there's probably some things that carry over, but the subjectivity of it must make it really challenging, right? To figure out what to hill climb on. Well, I do have to ask, what is
the story behind the name? There's a PM on our team. Her name is Nana.
She was up at 2:30 in the morning working on this release.
And she came up with the name then. And it stuck because it was fun.
And it even now stuck as a semi-official name because to your point, Gemini 2.5
Flash Image is a bit harder to say. Yeah, I mean, clearly very successful when
you have the CEO of Google tweeting out bananas like the name has taken
charge. The branding takeaway, I guess, is your name should have an
appropriate emoji to go with it because it makes it stickier. Yeah. I feel like
Hugging Face was the OG of figuring that out in the AI world with their
emoji. But it seems like we're not too far away from stock tickers just being
emojis for companies. I guess switching gears, another interesting part of what we were alluding to
earlier, this idea that you have really sophisticated users that use this stuff and then
people that come to the blank screen figuring out what to do. What have you
seen your most sophisticated users do? I have a favorite kind of sophisticated use case.
And I used to work most of my career in video. So I really like
video tools and creation tools. And so we found that Nano Banana ends up being a really useful way to make AI-generated videos, you know, in combination with video models like Veo 3, because you can ideate quicker, you can sort
of plan shots out. And interestingly enough, that's also how people plan movies, right? They
start with like the shot boards where you have the story and the views. And
so people have started to do that and build kind of like more coherent, longer
form video. I've been really impressed with people using this in real, say, architecture
workflows, where you can kind of go from a blueprint to something that looks like
a 3D model, but it's not actually a 3D model, to something that then starts
to look like a design where you can iterate on it and kind of just
like it shortens a lot of the part of the pipeline that just is very
tedious. And then it allows people to actually work on the parts of the pipeline
that are like creative and fun and that they actually enjoy doing. And so I
think I was a little bit surprised that it sort of worked as well as
it did out of the box. It's like all these vibe coding image use cases
of getting like, you know, the first basics up and running for, you know, for
across these different spaces. And the other thing is like you could vibe code a
website UI now. And it always felt a little bit abrupt to me that we
went from sort of a prompt to a website being coded. And it feels like
there's a missing step where you can actually iterate on the design and change things
and do that really quickly. And so now we're at a point where you can
actually start to do those things. And then you can code it once you're happy
with the way that it looks. Yeah. Do you think that's the future workflow? I
mean, it makes a tremendous amount of sense. Like why execute, why spend the tokens
on all the code if it's going to come out and you're like, no, that
aesthetic is completely off from what I wanted, or that actually looks nothing like what
I wanted. It also just feels more fun to Oliver's point is how people have
actually been using these, you know, the previous technologies in their existing workflows. It
feels very natural to adapt, but I think just, you know, the LLMs are progressing
so quickly that you can go straight from a prompt to a website and it's
super impressive. I think it's really fun for people to iterate kind of in that
in-between space to make sure that it adheres to your aesthetic. I mean, you obviously
offer the model as well as an API. So obviously there's going to be all
sorts of interfaces and use cases on top of the model. How do you think
about the ones that make sense to live in kind of a general chat tool
like Gemini and then the ones you want to power through other product experiences? I
think they're just very different. So we do see people going to Gemini to do
a lot of this quick iteration, right? So we've had folks on the team already.
They're trying to redesign their garden. And so they will go to the Gemini app
to try to imagine what it could look like. And then they may actually go
and, you know, work with a landscape architect to kind of mold it to their
vision, collaborate on and like take that idea further. Right. So it's really kind of
that first step in your ideation process. It's very rarely like the final product of
the thing that you're actually trying to do. Right. Versus for a lot of these
pros for developers, you are building much more sophisticated tools that chain multiple
models at this point. Right. And so I think it's a much more sophisticated, much more complex kind of multi-tool journey. And the chatbots are
really great at kind of getting you started, giving you inspiration, supporting a lot of
these use cases that are just really fun and shareable so you can share things
with family and friends. And I kind of see that probably sticking around for a
bit because the more sophisticated users who have like more demanding workflows will always want
to go back to something that is just like a more visual forward UI than,
you know, a chatbot. And where does like the editing workflow end up fitting into
this? Obviously, it seems like first generation and getting started makes a ton of sense
to do in these tools. I think you guys power the APIs available in Adobe
and some of these other tools. Oliver, I know you used to work there, where
the classic editing gets done. Do you think those editing workflows end up looking incredibly
different? Obviously, you can do some light editing using just the core models today. Or
do you think to bring something from 95% done to 100% done, it's still going
to look similar to the classic editing workflows of the past? I think it depends
a lot on the user. So some users will have a very, very clear idea
of what they want to the pixel. And for this type of use case, I
think we really have to integrate this with the existing tools and all the existing
like things you can do in Adobe products, for example. And some users are really
looking for inspiration and to have a bit less restrictive requirements. And for these users,
then just being able to get quick ideas out in a chatbot is totally valid.
So both of them are important applications of the model. And to the pixel-level control, one thing that I learned two days ago is that when you're creating ads for different products, different brands, things including where a
model is looking in the ad end up mattering a lot because where your gaze
is at kind of influences what message it communicates. And that's just a level of
control that is very, very hard to get out of a chatbot. And I think
for those kinds of users, for those kinds of use cases, you will continue to
need specialized tools and just like a very precise level of control. I think it
comes down to things that can be specified by language versus not specified. Because language
is a great way to communicate high-level ideas. But if you're trying to move something
left by three pixels, it's not the most elegant way to do that. I do
think there is a place for both. I guess if you watch a real artist
or creator do all their workflows end to end, it would be hard for them
to dictate exactly what they're doing. There's some sort of just feel as you're doing
that. There's lots of folks within Google that are excited about using this in different
ways. What are some of the ways you get excited about an image model proliferating
throughout the many different things that Google does? I think there's a lot of, I
mean, on the creative side, you can imagine that Photos is a really exciting place
to be doing any kind of photo editing, right? Because your library is right there.
And so like turning a family photo into a birthday card, which is a use
case that I have, you know, a couple of times a year. Being able to do it right there, I think, is really exciting. And then to Oliver's point at the beginning, I think a lot of this kind of... the
factuality, whatever we want to call it, set of use cases is really exciting on
a lot of different Google product surfaces where if you want the model to explain
photosynthesis to you and you're a five-year-old and do it in a way that, you
know, it probably doesn't exist anywhere on the Internet visually, we should be able to
do that. And so I think that will just open up again, like a whole
set of use cases and opportunities for people to learn in a way that is
customized to them and is very visual. Because a lot of people are visual learners.
Another one I think is really cool is Workspace as well. Like, you
know, PowerPoint slides and Google slides, like being able to get to the point where
people can have like really compelling presentations that don't all look the same, like same
set of little points. As someone who started their career in consulting, that would be
phenomenal. Me too. If we get that. Yeah. The Lord knows we've spent too much time
formatting all this stuff. Well, and you used to start from just... storyboarding
what the slides look like on a whiteboard, right? With like the exact headline and
then, oh, I need this chart to represent this data set on the left-hand side
of the slide. Like being able to feed that to an LLM and like take
a lot of that legwork off of you. I'm very excited about this part. Just
snapshot the whiteboard. Yeah. Snapshot the whiteboard. That sounds great. That sounds great. I guess
maybe taking a step back for just like, you know, the broader kind of image
model landscape, I feel like it's just been a torrent of progress since maybe Stable Diffusion and then Midjourney. Maybe, Oliver, just contextualize, like, as you think about the last
two, three years in image generation models, like, what have been the major milestones? Like,
how would you kind of characterize the path we've been on and the extent to
which things have changed? Yeah, my God, it's been a rocket ship, definitely. I mean, I...
I worked in this space back when GANs were the de facto method for generating
images, and we were amazed by what GANs could do. But they could really only
generate in very narrow distributions, basically like, okay, we had images of people that looked
pretty good, but you could just generate these front-facing images of people. And I think
that we started to see models come out that could really generalize and give fully text-driven control over anything, but they started out really small and really noisy. But I
think a lot of us at that point were like, oh, wow, this is going
to change everything. And so we all started to like, you know, pour our energy
into this area, but none of us could have predicted the rate at which it improved. And I think I attribute this to there being quite a few... labs of absolutely great people working on these problems. And then this kind of like friendly competition between all the labs, where we see another group come out with this amazing model. You know, for a long time, Midjourney was just like far ahead of everyone. It just looked amazing. And this is really motivating people, like, man, how did they do that? How does it look so good? And
then, of course, Stable Diffusion coming out as an open source model really showed a
little bit of like, what is the size of the developer community and like how
many people out there actually want to build on these models and make things. Um,
so that was like clearly another launching point. But since then it's just been really fun working in this area. Although maybe a little bit depressing sometimes, because not only have the models gotten so much better, but people's expectations have gotten so much higher too. So
you now see people like complaining about these like small little problems where you're like,
Oh man, don't you know how hard we work to make this model? And like
just a year ago, we were looking at these images that don't even look realistic
at all. And like everyone was amazed. So, you know, people have a remarkable ability
to get jaded about new technology. It does feel like, yeah, I mean, with LLMs,
right? It's like we have this incredibly powerful technology that, if you'd told us in 2017 we'd have it, we'd be absolutely dumbfounded by, and still we always complain about all the
places where they have shortcomings. It's a funny part of human nature. In retrospect, what
allowed Midjourney to get out to such a strong lead in the space? I feel
like for a while they were like the brand, and clearly the place everyone was
going. Yeah, I mean, Midjourney... figured out a little bit before everybody else how to
do the post training and specifically how to do post training in these models in
order to give you sort of stylistic and artistic imagery. And that's really been their
bread and butter: focusing on, like, how do we give control
over style? And how do we make sure that like, no matter what you generate,
it's going to look amazing. And at the beginning, that was really important, because when you can sort of narrow the domain of images you generate to just the good-looking ones, you can actually do a better job generating those images as well. So starting with that sort of small section of really, really good-looking images was a great strategy at that point. And then I think all the models, Midjourney included, but all
the other models, like Flux and GPT, have expanded since then and can now generate
much broader categories of images while still kind of retaining the quality. What allowed that
ability to generate a broader set of images and not have to handpick just the
perfect ones to put in? Well, a bunch of things. We all figured out the
details and, in particular, what the data should look like. And at the same time,
you just have this kind of natural scaling curve where all the models get bigger,
everyone's compute gets bigger. So things that weren't possible to do back then start working
now because you can do them at a bigger scale. Yeah. And then you kind
of alluded to it earlier. We've made so much progress in image models, and I
can't tell whether there's like only 10% more to go, or there's actually like, we're
going to look back three years from now and be like, that was funny that
we thought we had good image models. How do you think about that? I'm like,
what, you know, I mean, the generations are pretty good. Like I'm trying to even
conceptualize like what the next, you know, 10x improvement even would be. Yeah, I'm completely in the latter camp there. I think we have much, much farther to go just in terms of image quality. Like let's not even think about all the other applications, but just the image quality, I think, has a lot of room to improve. And really, I think that's going to be
in the expressibility of the models. So we know we can generate some things basically
perfectly. No one could tell that it's a generated image and it's indistinguishable from reality.
But as soon as you start going outside the realm of very common things that
people try to generate, then the quality degrades really quickly. And you can look at
the prompts that require more imagination, more composing multiple concepts together. These break down really
fast. So I think with the models in the future, while the best-case images today may look as good as the best-case images in a few years, the worst-case images today will be significantly worse than the worst case then. So we'll just make the models more
useful and sort of broader in applications. And, you know, we found the broader we
make them, the more use cases people discover and the more useful they become. I'm
curious as you think about the evolution of this broader image model space. If I
compare it to the LLM world, there's really... I don't know, I'd say it feels like it's you guys, OpenAI, Anthropic, and then a host of other folks maybe slightly behind, doing different things. Do you think the image model landscape plays out similarly?
Yeah, I think it's a good question. So until now, I think images have been
an area where it's been possible for smaller teams to develop really state-of-the-art image models.
I mean, we've seen this... There's a few small labs that have just amazing models.
I hope this can continue, because I like the fact that there's small labs doing it. But like I said before, the world knowledge of these models and making them more useful, I think, is something that really benefits from scale, and especially from the scale of the language model. So I suspect we might see the same groups that can do the large language model trainings be the ones that have the image models that have all the world knowledge as well. We're seeing similar trends with the big Chinese labs also coming out with amazing models, the same as with language models.
So I think we'll see that as a major player in the image space as
well. Do you think it's that big a disadvantage for an image model to use the best open-source LLM versus the cutting-edge closed
source ones? This is a great question. I feel like this is going to depend
a little bit on the future of open source models, which is something that changes
quite frequently. I think maybe a year ago, it would have seemed like that was
a very safe solution. And now maybe it's less so, but I don't know what
the future of open source models is. It's definitely a possibility. And that could sustain
a lot of sort of smaller labs training good image models, definitely. Well, Oliver, I'd
like to just shamelessly ask you a question because you mentioned you worked in video
for a while. And I'm always trying to just make sure I grok the relationship
between image models and video models. Obviously, your colleagues have had tremendous advances on the
video side with their recent video model releases. Are these separate efforts? Do they build
off each other? How are the image and video worlds interacting these days? They're very
closely related. I mean, I think in the future, everyone is moving towards
OmniModels, which are sort of models that do everything. And for many reasons, these models
have advantages and may end up winning in the long run. I don't know. But
I would say that a lot of the techniques that we learn for image generation
also sort of went into the video generation models and vice versa. And so this
is kind of one of the reasons why video generation took off as well is
because as a community, we kind of learned how to solve this problem. So
I think that I would say... Very close friends. And we share a lot.
And maybe we'll live together someday. Yeah. I guess
the techniques to do a lot of this stuff end up looking similar across the
models. Yeah. Exactly. And even in the workflows, right, a lot of people do end
up using these models in very complementary ways. So if you're a movie maker and
you're making a movie, a lot of the initial iteration, first off, it happens in
the LLM land. And then iterating in the frame or image space makes a lot
of sense because it's faster and cheaper to do that. And then you actually move
on to video. So I think even from just like workflows and usability perspective, there's
a little complementarity that happens between the models. And a lot of the use cases
and problems that we have to solve are similar. So consistency, characters, objects, scenes, right? It exists
in image land, exists in video, gets a little more complex in video because you
have to do it over multiple frames. Yeah. What do you feel like the next
problems to solve on the video side? Yeah, I think that having the same kind
of control over the models that we get with the latest generation of image models
is something that obviously will be really impactful in the video space. So that's one
thing to look out for. I think video teams are also working on just improving
the sort of resolution and the consistency over time. And of course, like being able
to generate the same character across many scenes, whenever you look at, you know, what people want, that's always sort of top of the list. So I would
say moving towards longer form content with more coherency, definitely a future direction.
And it sounds like you can figure out some of that stuff in the image
context. And a lot of that ends up being transferable to the video side, which
is cool. It's been a fascinating conversation. We always like to end our interview with
a standard set of quick fire questions where I stuff some overly broad questions into
the end of the interview. And so maybe to start, would love your thoughts on
one thing that's overhyped and one thing that's underhyped in the broader AI world today.
I think overhyped is this idea that you can just go from like one short
prompt to something that's like a production ready solution that you can ship anywhere. I
do think that there's a lot of iteration that needs to happen. And if you
even if you look at a lot of the stuff that people post on socials,
there's actually still a lot of work that goes into, you know, producing that final
product and sharing it. So I think that's a little bit overhyped. Underhyped, I
think it's what are the UIs of the future? We already talked about this, but
how does this actually all come together? And how do you make it easier for
people to use these models, show them what's possible, and then make it useful for
specific workflows? Have you seen any products that you're like, oh, that's a really interesting
new take on a UI for the AI age? I'm still waiting. I'm still waiting.
I like node-based interfaces, but I don't think I'm representative of everybody. It's a very
specialized audience. Everyone gets their own UI in the future
anyway. It will be an era of specialization. Do you think image model progress is
going to be more or less the same next year as it's been this year?
I hope more. There's more smart people working on the
problem and putting more resources into it. I think we're going to see, in all axes, the change accelerate, yeah. Well, you've obviously taken over the internet
with everyone's focus on Nano Banana. Are there other things happening in the AI image
world that you're paying attention to that you don't think enough people are thinking about
or paying attention to? I do think it's this factuality dimension.
Even when we look at the things that people are doing with Nano Banana, people
have been making infographics or taking a picture of Niagara Falls and asking the models
to annotate it. And it sort of does a good job as a demo; it looks
fine, but when you hone in, the text is a little bit garbled. It's not
factual. It repeats information. So I think that is a frontier. It's something that not
a ton of people are paying attention to, but I do expect it to get better.
There's a real analogy with the text language models as well because when GPT-1 and
2 came out, people found like, oh, these are kind of cool. I can have
them write haikus and kind of do creative tasks where there's a broader range of
acceptable answers. But now people don't use language models for this as much at all; now they use them for, you know, information seeking, and also for kind of conversation and personal connections, like having someone to talk to. So I think that we could probably find analogies for all of these in the image space as well, where we went from creative tools to maybe information-seeking tools, and then also maybe people will end up talking to a video model, you know, when they need to talk to someone. And this is something that we could see in the future. And the models, I think,
should also get a lot more proactive because right now you always have to ask,
right? I want an image. But the model should offer one if the query warrants it, right? And we're
used to this from search, actually, right? Like you're used to going to search asking
for a thing, and sometimes it comes with text. Sometimes it comes with text and
an image. Sometimes image is the right answer. And so I also expect these models
to just get a lot more proactive and smarter, right? about how they talk to
you and how they leverage these different modalities based on what you're asking about. I
love that you'll be able to seamlessly go back and forth. And then Oliver, to
your earlier point about, it seems like so much of the improvement here is really
around reliability. And in the early days of LLMs, there were glimpses of, oh, wow,
that's really cool. But it was not nearly consistent enough to use for more kind
of work use cases. And it seems like a similar pattern to be followed with
image models as well. The most important question in this interview, definitely: what is your single favorite piece of content that you've seen made using Nano Banana? Both of you.
Okay. So my favorite piece of content is going to be like playing with the
model with my kids, where we like put them in, you know, funny locations, or make their stuffies come to life and these kinds of things, because they're so personal and it's like something I can do with them and they love it. And so like this kind of thing I think is the most important, definitely the most valuable for me. That's amazing. I feel like
people love that. Yeah. Yeah. People love the LLM, you know, bedtime stories for their kids. And now we get to move into image, and soon you'll have movies that you can put your kids into and whatnot. And that'll be... no, it's a really exciting future. Well, I want to make sure to leave the mic to you folks. Besides clicking on, you know, various bananas around the internet and Gemini, where can folks go to learn more about you, the product, anything you want to point folks to? The floor is yours. Please do go click on the bananas. Gemini app, AI Studio, you know
where to find us. And then we're both on X. So do find us.
Give us feedback. Yeah. If you find something that doesn't work or sucks, definitely like
tag me in it. It's very helpful. Amazing. Well, we'll help people contribute to
that and we'll link to you as well. But thank you both so much. This
was a ton of fun. Thank you.