Google's Nano Banana Team: Behind the Breakthrough as Gemini Tops the Charts
By Unsupervised Learning: Redpoint's AI Podcast
Summary
## Key takeaways
- **Nano Banana's Impact: Gemini Tops App Store Charts**: The "Nano Banana" AI image model has significantly boosted Google's Gemini app, driving it to the top of the App Store charts and surpassing ChatGPT. [00:25], [00:33]
- **Character Consistency is Key for Nano Banana**: A primary breakthrough of the Nano Banana model is its advanced character consistency, enabling users to generate images of themselves or specific characters in various scenarios with high fidelity. [01:56], [02:29]
- **Unexpected Use Case: Colorizing Old Photos**: Beyond creative iterations, a surprisingly emotional and popular use case for Nano Banana has been colorizing old black-and-white photographs, allowing people to see historical images and family members in a new light. [02:48], [02:57]
- **Future of Image Models: World Knowledge & Personalization**: Future image models will integrate world knowledge from language models to be more helpful, offering personalized aesthetics and conversational refinement to better understand user intent. [04:05], [05:03]
- **Prompt to Production is Overhyped**: The expectation of generating production-ready content directly from a single prompt is overhyped, as significant iteration and work are still required for final products. [36:52], [37:14]
- **Image & Video Models Converging Towards OmniModels**: Image and video generation efforts are closely related and moving towards "OmniModels" that can handle multiple modalities, sharing techniques and influencing each other's progress. [34:17], [34:42]
Topics Covered
- Image models are now reasoning engines, not just renderers.
- The hardest UI problem is detecting user intent.
- Chatbots can't replace pro tools for pixel-perfect control.
- Image quality isn't solved; the worst cases must improve.
- Image models will evolve from creative tools to factual engines.
Full Transcript
There's a PM on our team. Her name is Nana. She was up at 2:30
in the morning working on this release, and she came up with the name then,
and it stuck. I'm trying to even conceptualize, like, what the next, you know, 10x
improvement even would be. Just the image quality, I think, has a lot of room to
improve, and I think personalization is an area that's still being improved also on the
tech side as well. I think we're just scratching the surface on what's possible here.
What actually has happened now having released this out into the world? The most exciting
thing for me was... Gemini recently reached the top of the App Store, finally overcoming ChatGPT for the first time since ChatGPT's launch. And what drove it? Well, if you've been on Twitter or the internet, you've definitely seen it. It's Nano Banana, an incredible new image model that has huge breakthroughs in character consistency and image quality. And today
on Unsupervised Learning, I got to talk to two folks who are really driving that
at Google, Nicole and Oliver. We hit on a bunch of different things, including their
favorite use cases for the models and how folks are actually using NanoBanana today. We
talked about some of the challenges in building a product around really good models and
how they thought about solving the blank canvas problem, as well as interacting with other
image editing products. And we hit on the future of image models, what's next, frontiers,
and how it'll interact with video models too. This was just a fascinating conversation with
what I'd say is the main character of the AI ecosystem right now. And so
without further ado, please enjoy this conversation with Nicole and Oliver. Nicole
and Oliver, thanks so much for coming on the show. Really appreciate it. I've been
looking forward to this. I feel like you have pretty much taken over the entirety
of my Twitter feed with Nano Banana, as well as like any spare moment of
free time I have. You know, I figured a lot of things we'll dive into
today, but maybe just to start, you obviously were sitting with this incredible model and
product and before it was released into the world, and I guess it was maybe
anonymously released into the world first, but there were folks, you know, you were some
of the first to play around with it. I'm curious, like, some of the use
cases you thought would be most prevalent or got you most excited, and then what
actually has happened now having released this out into the world. Oliver has seen a
lot of pictures of my face. In various iterations. The most
exciting thing for me was character consistency and seeing yourself in new scenarios. So I
literally, there are slide decks full of my face in like wanted posters
and as an archaeologist and all my childhood dream professions, basically. We've now created an
eval set that has, you know, my face in it and like other people on
the team that we just kind of eyeball when we develop a new model. That's
like the ultimate honor in the AI world, to have your own. I'm very excited. So I was really excited about the character consistency capabilities, because it just felt like it's giving people a new way to imagine themselves in a way that just wasn't very easy to do before. And that is one of the things that people ended up, you know, being really excited about. We're seeing people turning themselves into figurines, which is like a very popular use case.
The one that surprised me, but it probably shouldn't have, is people colorizing old photos. That has been a really emotional use case that comes up for people, of like, I
can now see what I actually looked like as a baby, or I can see
what my parents actually looked like from these black and white photographs. So that's been
really fun. I mean, I'm sure alongside seeing all the ways people use these things,
one of the joys, I'm sure, of having a very popular product is, and I've
seen this on Twitter, you must get like a million feature requests, right? And everyone
wants, you know, these models to do this thing or that thing. What are like
some of the most common things people want? And how do you think about, I
guess, like the next milestones or frontiers for these types of products and models? The
things we get most on Twitter are higher resolution. So a lot of pro feature requests: we're currently at 1K resolution, so people want higher than that. We get a lot of requests
for transparency because it's also a really popular pro use case. And those are
probably the two biggest ones that I've seen, plus better text rendering.
Obviously, I think character consistency for so long was this big thing that needed to
be solved and feels like you guys have done a great job on that. What
are the next frontiers of image model improvement in your minds? Yeah, so I think
one of the things that was most exciting to me about this model is that
you can start to give it harder questions. So instead of having to define every
aspect of the image you want to see, you can really ask for help like
you would from a language model. So for example, people are using this for like
I would like to redecorate my room, but I don't know how, like give me
some ideas. And the model is able to come up with like, oh, you know,
this might be a plausible suggestion or like given the color scheme, like these things
would go together well. So the thing to me that's really interesting is like, is
really like getting the world knowledge from the language model so that we can make
these images truly helpful for people and maybe show things that like they hadn't thought
of or like, you know, they don't know what the answer is going to be,
or maybe there's an information-seeking request. Like, I would like to know how this thing
works and, like, being able to show an image of, like, oh, this is how
it works. Like, I think that's a really important use case for these models going
forward. Where are we in solving that right now? Well, you know, of course, sort
of, for that specific example, aesthetics is always a bit tricky because it requires pretty deep personalization to be able to give useful information. And I think personalization is an
area that's still being improved also on the tech side as well. So I think
we're quite a ways away from really being able to tell exactly what the user
is asking. But I think with a little bit of clarification and being able to
sort of have a conversation with the model, which is another thing I'm really excited
about with this model: you can talk to it in a thread. Then you
can kind of refine down to get the images you're trying to get in the
end. Do you think personalization will happen at like just the prompt layer? Like ultimately
you clarify enough that you can then feed the model enough context to personalize? Or
is it like, well, will folks have different aesthetic models? I think it'll be more
at the prompt layer. I think we'll have like, you know, given things the users
told you about themselves, for example, we'll be able to make much more informed decisions.
So at least I hope it's that way. You know, of course, everybody having their
own model, and serving that sounds like kind of a nightmare, but maybe that's the
direction we're moving. But I do think that you will have very different aesthetics,
right? I do think that there's probably some level of personalization that has to happen
at that level. Because even, you know, you see it probably when you go to
the shopping tab on Google right now, right? Like you're looking for a sweater and
then you get a bunch of recommendations and then you actually wanted to hone into
your own aesthetic and like be able to pull in from your closet to see
what other things work with it. So I hope that a lot of that can
happen in the context window of that model, right? Because we should be able to
feed the model images of the things that you have in your closet and then
try to find something that actually goes with that. And I'm really excited about that.
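A minimal sketch of that kind of multi-image, in-context prompting, assuming the google-genai Python SDK and a public Gemini image model ID; the exact identifier, config options, and file names here are illustrative assumptions rather than the team's setup:

```python
# Multi-image, in-context prompting: several wardrobe photos plus a text request
# go into one generate_content call, and the model returns styling notes and/or
# a generated image suggestion.
from io import BytesIO

from google import genai
from google.genai import types
from PIL import Image

client = genai.Client()  # reads GEMINI_API_KEY from the environment

closet = [Image.open(p) for p in ["sweater.jpg", "jeans.jpg", "boots.jpg"]]  # hypothetical files
prompt = (
    "Here are a few items from my closet. Suggest one outfit built around the sweater "
    "and generate an image of the full look on a neutral background."
)

response = client.models.generate_content(
    model="gemini-2.5-flash-image",  # assumed model ID for the image model discussed here
    contents=[prompt, *closet],
    config=types.GenerateContentConfig(response_modalities=["TEXT", "IMAGE"]),
)

for part in response.candidates[0].content.parts:
    if part.text:              # textual reasoning or styling notes
        print(part.text)
    elif part.inline_data:     # generated image bytes
        Image.open(BytesIO(part.inline_data.data)).save("outfit_suggestion.png")
```

Because the model is conversational, follow-up refinements could be sent as additional turns in the same thread rather than re-prompting from scratch.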
So I hope we can do it there. Maybe you will need some level of
aesthetic control kind of beyond that, that like is with the model, but I suspect
that might happen more in the pro workflows again. I'm surprised that, obviously in the LLM world, it feels like a lot of, or even in the image world, a
lot of the decisions that are made in data that's used for pre-training, like actually
really does impact the, you know, end capabilities or aesthetics of the models. And so
I wonder whether we'll have a world where there's one model that's just prompting... I think we continue to be impressed with just how broad the range of use cases are
that you can support with the off-the-shelf model. But I think one thing... to your
point is like you get a lot of mileage on it on some of these
like consumer facing use cases where, you know, you're trying to just sketch out what
something would look like in your room, et cetera. But once you get again into
the more advanced capability, like you do need to bring in other tools to actually
make it a final product and have it be useful in a workflow for, you
know, marketing or design and those kinds of capabilities. Well, I'm sure folks are really
curious, like what made these models so much better? A lot of special sauce. Yeah.
I think it's one of these boring cases where there's not one specific thing. It's
like figuring out all the details right and really tuning the recipe and having a
team that's been working on the problem for quite a while. I think we were
even kind of a little bit taken aback by the degree of success that the
model had. We knew we had a cool model and we were really excited to
get it out there. When we released it on LMArena, it wasn't just that the Elo scores were high, and that's great; I think it's a good sign the model is useful. But the real metric for me was that we actually had so many users going to LMArena in order to use the model that we had to keep increasing the number of queries per second that we could support.
And we definitely didn't expect that. And I think that was really the first sign
that, oh, this is something that's really special. And there's a lot of people that
have use for this model. I think that's one of the most fun parts of
this whole ecosystem, where it's like you obviously have some sense of the model building
it yourselves, but it's only in releasing it in the wild that you really understand
the extent of the power it has. And clearly, this has struck a chord. Obviously,
some of the reasoning capabilities of the models are really driven by improvements in LLMs
themselves, right? And I think maybe just contextualize a little bit, how much do image
models benefit from these improvements in LLMs? And do you kind of expect that to
continue as LLM progress continues? Hey, guys, this is Rashad. I'm the producer of Unsupervised
Learning. And for once, I'm not here to ask you for a rating of the
show, although that is always welcome. But I would love your help with something else
today. We're running a short listener survey. It's like three or four questions. And it
gives us a little bit of insight into what's resonating and how we can ultimately
make the show even more useful to you, our listeners. The link to the survey
is in the description of this show. I promise you it takes like two or three minutes, and it's a huge help to us. We're always trying to make the show better, and so this is one way of supporting that. And yeah, that's it. Now back to the conversation. How much do image models benefit from these improvements in LLMs, and do you kind of expect that to continue as LLM progress continues? Yeah, I mean, definitely they benefit, you know, almost 100% from the world knowledge of language models. I mean, the thing about Gemini 2.5 Flash Image, which is the name of the model. Is it a Gemini model? It's a
little more fun of a name, but yes, that too. That's a bit easier to
say. I do wonder how much of our success has to do with just the
fact that people like saying the name Nano Banana. But yeah, you know, it is
a Gemini model. So like we're, you know, you can talk to this model like
you do Gemini and it understands all the things that Gemini understands. And that's actually,
I think, was a really important kind of step function and utility for these models
is to integrate them with the language models. Yeah. Yeah. And you probably remember this,
but it used to be the case that, you know, two, three years ago, you
had to be very, very specific about what you asked the models to do. It was like, a cat sitting on a table with this exact background and these are the colors. And now you don't have to do that anymore. And a lot of it is because the language models have just gotten so much better. Yeah, it's
not like magic prompt transformation that's happening on the back end. I feel like that
was the hack back in the day. You'd like have a sentence and they would
turn it into like a 10 sentence prompt that made sure it was specific enough
to get things right. But now it feels like the models are sophisticated enough to
get it, which is exciting. One thing I'm curious about from the product perspective is...
I feel like you have so many different kinds of people that use these products,
right? You have the folks that I guess are flocking to LMArena the second
the product's out. They know what they want to do with it. They are experts
in playing with it. And you have a lot of folks that are just probably
Gemini users and have this blank canvas problem. They're like, what exactly do we
do with the product? I'm curious... How did you think about building product for like
those two different types of users? There's so much more we can do. And to
your point, we're doing a lot, right? I think the LM Arena crowd and even
just developers, they're really sophisticated. They know how to use these tools. They came up
with use cases that are new to us. We've seen people do, you know, turn
objects into holograms within a photo. And that's not something that we trained for or expected it to be good at, but apparently the model is very good at it. For consumers, making it really, really easy is super important. And so even now, when
you go into the Gemini app, you will notice there's banana emojis everywhere. And we
did that because we realized that people were having a really hard time finding the
banana when they hear about it. And then they go into the app to try
it, but there was no obvious place to actually do it. And we've done a
lot of work to, you know, pre-seed some of these use cases with creators that
we partner with, and to kind of just, like, put examples out there that link
directly then back to the Gemini app, and so then the prompt pre-populates. I think
there's a lot more that we can do on kind of just the zero-state problem
of, like, giving you visual guidance. There's probably a lot more that we could do
with making gestures a thing when you're trying to edit images so that it's not
all just prompt-based. And sometimes when you do want something very specific, you still need
a fairly long prompt and that's just not a natural mode of operating for most
consumers. And so I try to give it the parent test. Like if my parents
can use it, then it's probably good enough. And I don't think we're quite there.
So I think we have a long way to go, but a lot of it
does come down to just showing, not telling, giving people the examples that they can
easily reproduce, like making it really easy to share things. Yeah. And so it comes down to a lot of things; it's not kind of one magical answer. The other thing
we've noticed is that sort of social sharing is a
really important part of like the blank slate problem. Like people see things that other
people are doing and because the model is sort of personalizable, by default, you can
try it on yourself or your friends or your pets. There's this very easy way
for people to see something and be like, I'll try that for myself and see
how it works. That's a really big way that this model is being spread around.
Today, the way you interact with these models is all via text. What other design
interfaces get you excited about ways people could interact with them longer term? I think
we're just scratching the surface on what's possible here. Ultimately, I envision all of
these modalities kind of blending together and then having some sort of an interface that
like picks the right one for whatever it is that you're actually trying to do.
I think even now kind of moving more towards a place where the LLMs don't
just output text, but they can output images and visual explainers when it's actually relevant
to the user query. So I think there's a lot of potential in voice, especially.
It's a very natural way for people to interact. I don't think anyone has really
cracked, like how that could actually show up in a user interface, we're still very
much, you know, typing the thing that you want to put in. And so some
combination of that, plus again, gesture. So if you want to erase, you know, an
object from an image, you should be able to just erase it like you would
on a scratch pad. Again, like how do you do that and how do you
seamlessly transition between those modalities based on what the task is, is something I'm
really excited about. And I think there's a lot of headroom in figuring out like,
what does that actually look like? Yeah. What feel like the limitations to that voice
UI today? I mean, you know, I could totally imagine talking to
these images. I think some of it might also just be kind of prioritizing it
because, you know, we're still pushing on a lot of model capabilities. Voice is obviously
getting really good. Um, also in the last couple of years. So I think we're
going to start seeing some people take this on and think about, and probably we'll
do some of this work too, try to think about what this could look like.
I think part of the problem is actually the how do you detect the intent
and then how do you switch different modes based on the user intent and what
they're actually trying to accomplish? Because it's not obvious. And you could also end up
with a surface that, again, is like a blank slate. And then how do you
actually show the user what's possible? Because it's one of the other things that's really
challenging. I think we see users come into chatbots, and they just expect the chatbot
to be able to do anything, right? Because you can just talk to it like
you would talk to a human. And it's actually very hard to explain the limitations.
It's very hard to show people what you can do when the tool can now
do so much. And so I think part of it is just figuring out how
do you scope the problem? How do you show people what's possible in a UI
that can ultimately help them accomplish almost anything? Totally. And it feels like if you
teach them at any moment in time what the chatbot can and can't do, that
changes three months later at maximum. And so you're always having to reteach those
very same capabilities, which I think is a really interesting product challenge a lot of
folks have in consumer and enterprise products. You alluded to it earlier around
evals. I guess you have your own eval data set, Nicole, of yourself. But I'm
curious, image models generally, what do model evals look like for this besides
just putting things out on LMArena? And any learnings since you got started on
tracking what makes these models better? Yeah. Well, one of the nice things about
the language models and the vision language models getting better is that there is starting to be like a feedback loop where we can use the intelligence from the language
model to help with evaluating its own generations. And this kind of like is a
virtuous circle where we can continually improve both of these dimensions. So that's pretty exciting.
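A rough sketch of that kind of model-in-the-loop evaluation, again assuming the google-genai Python SDK; the judge model ID, rubric, and eval files are illustrative placeholders, not the team's actual eval setup:

```python
# Using a vision-language model as a judge: each generated image is sent back to
# a model together with the prompt that produced it and a scoring rubric.
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

# Hypothetical eval set: prompts paired with the files a candidate model produced.
eval_set = [
    {"prompt": "a wanted poster featuring the reference character", "image": "out_001.png"},
    {"prompt": "the same character as an archaeologist at a dig site", "image": "out_002.png"},
]

RUBRIC = (
    "You are judging an AI-generated image against the prompt that produced it. "
    "Reply on one line as: adherence=<1-5> quality=<1-5> critique=<one sentence>."
)

for case in eval_set:
    with open(case["image"], "rb") as f:
        image_part = types.Part.from_bytes(data=f.read(), mime_type="image/png")
    judgment = client.models.generate_content(
        model="gemini-2.5-flash",  # placeholder judge model
        contents=[RUBRIC, f"Generation prompt: {case['prompt']}", image_part],
    )
    print(case["image"], judgment.text)
```

Scores like these can be aggregated across checkpoints to flag regressions, with human eyeballing still deciding the close calls.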
But I think ultimately, people are the arbiters of the images that they're trying to
create for themselves. So that's why I think cases like LMArena, where users are entering
their own prompts, this is the best way to evaluate a model. So taste also
matters a lot. Oliver won't say this because he's being modest, but he's one of
the people on the team who just has a really good eye for what these
things look like and whether or not things are working, where are the flaws in
the images. So we do have a couple of people on the team like that who will do a lot of this initial eyeballing, as we call it, very technical term, on what the outputs from a model in training actually look like. And then that
still plays a pretty significant role. And then to your question, we do get a
lot of feedback from people, including on X, on what's working, what's not working. And
then we do try to adapt our evals to capture the things that are both
working so we don't regress on them, but then also the things that are not
working that the community wants us to push on. So keep sending us feedback. It
feels in some ways like a much harder problem than, you know, some of these LLMs that are working on, say, a legal use case where there's some relatively right answer, you can catch where the models deviate, and there's kind of this pure eval data set that really works. Because images are so subjective, it almost feels hard to know what to measure. Sure, there are things like did the character stay consistent from generation to generation, and there's probably some things that carry over, but the subjectivity of it must make it really challenging, right? To figure out what to hill climb on. Well, I do have to ask, what is
the story behind the name? There's a PM on our team. Her name is Nana.
She was up at 2:30 in the morning working on this release.
And she came up with the name then. And it stuck because it was fun.
And it even now stuck as a semi-official name because to your point, Gemini 2.5
Flash Image is a bit harder to say. Yeah, I mean, clearly very successful when
you have the CEO of Google tweeting out bananas like the name has taken
charge. The branding takeaway, I guess, is your name should have an
appropriate emoji to go with it because it makes it stickier. Yeah. I feel like
Hugging Face was the OG of figuring that out in the AI world with their
emoji. But it seems like we're not too far away from stock tickers just being
emojis for companies. I guess switching gears, another interesting part of what we were alluding to
earlier, this idea that you have really sophisticated users that use this stuff and then
people that come to the blank screen figuring out what to do. What have you
seen your most sophisticated users do? I have a favorite kind of sophisticated use case.
And I used to work most of my career in video. So I really like
video tools and creation tools. And so we found that Nano Banana ends up being a really useful way to make AI-generated videos, you know, in combination with video models like Veo 3, because you can ideate quicker, you can sort
of plan shots out. And interestingly enough, that's also how people plan movies, right? They
start with like the shot boards where you have the story and the views. And
so people have started to do that and build kind of like more coherent, longer
form video. I've been really impressed with people using this in real, say, architecture
workflows, where you can kind of go from a blueprint to something that looks like
a 3D model, but it's not actually a 3D model, to something that then starts
to look like a design where you can iterate on it and kind of just
like it shortens a lot of the part of the pipeline that just is very
tedious. And then it allows people to actually work on the parts of the pipeline
that are like creative and fun and that they actually enjoy doing. And so I
think I was a little bit surprised that it sort of worked as well as
it did out of the box. It's like all these vibe coding image use cases
of getting like, you know, the first basics up and running for, you know, for
across these different spaces. And the other thing is like you could vibe code a
website UI now. And it always felt a little bit abrupt to me that we
went from sort of a prompt to a website being coded. And it feels like
there's a missing step where you can actually iterate on the design and change things
and do that really quickly. And so now we're at a point where you can
actually start to do those things. And then you can code it once you're happy
with the way that it looks. Yeah. Do you think that's the future workflow? I
mean, it makes a tremendous amount of sense. Like why execute, why spend the tokens
on all the code if it's going to come out and you're like, no, that
aesthetic is completely off from what I wanted, or that actually looks nothing like what
I wanted. It also just feels more fun to Oliver's point is how people have
actually been using these, you know, the previous technologies in their existing workflows. It
feels very natural to adapt, but I think just, you know, the LLMs are progressing
so quickly that you can go straight from a prompt to a website and it's
super impressive. I think it's really fun for people to iterate kind of in that
in-between space to make sure that it adheres to your aesthetic. I mean, you obviously
offer the model as well as an API. So obviously there's going to be all
sorts of interfaces and use cases on top of the model. How do you think
about the ones that make sense to live in kind of a general chat tool
like Gemini and then the ones you want to power through other product experiences? I
think they're just very different. So we do see people going to Gemini to do
a lot of this quick iteration, right? So we've had folks on the team already.
They're trying to redesign their garden. And so they will go to the Gemini app
to try to imagine what it could look like. And then they may actually go
and, you know, work with a landscape architect to kind of mold it to their
vision, collaborate on and like take that idea further. Right. So it's really kind of
that first step in your ideation process. It's very rarely like the final product of
the thing that you're actually trying to do. Right. Versus for a lot of these
pros for developers, you are building much more sophisticated tools that chain multiple
models at this point. Right. And so I think it's a much more sophisticated, much more complex kind of multi-tool journey. And the chatbots are
really great at kind of getting you started, giving you inspiration, supporting a lot of
these use cases that are just really fun and shareable so you can share things
with family and friends. And I kind of see that probably sticking around for a
bit because the more sophisticated users who have like more demanding workflows will always want
to go back to something that is just like a more visual forward UI than,
you know, a chatbot. And where does like the editing workflow end up fitting into
this? Obviously, it seems like first generation and getting started makes a ton of sense
to do in these tools. I think you guys power the APIs available in Adobe
and some of these other tools. Oliver, I know you used to work there, where
the classic editing gets done. Do you think those editing workflows end up looking incredibly
different? Obviously, you can do some light editing using just the core models today. Or
do you think to bring something from 95% done to 100% done, it's still going
to look similar to the classic editing workflows of the past? I think it depends
a lot on the user. So some users will have a very, very clear idea
of what they want to the pixel. And for this type of use case, I
think we really have to integrate this with the existing tools and all the existing
like things you can do in Adobe products, for example. And some users are really
looking for inspiration and to have a bit less restrictive requirements. And for these users,
then just being able to get quick ideas out in a chatbot is totally valid.
So both of them are important applications of the model. And to the pixel-level control, one thing that I learned two days ago is that when you're creating ads for different products, different brands, things including where a
model is looking in the ad end up mattering a lot because where your gaze
is at kind of influences what message it communicates. And that's just a level of
control that is very, very hard to get out of a chatbot. And I think
for those kinds of users, for those kinds of use cases, you will continue to
need specialized tools and just like a very precise level of control. I think it
comes down to things that can be specified by language versus not specified. Because language
is a great way to communicate high-level ideas. But if you're trying to move something
left by three pixels, it's not the most elegant way to do that. I do
think there is a place for both. I guess if you watch a real artist
or creator do all their workflows end to end, it would be hard for them
to dictate exactly what they're doing. There's some sort of just feel as you're doing
that. There's lots of folks within Google that are excited about using this in different
ways. What are some of the ways you get excited about an image model proliferating
throughout the many different things that Google does? I think there's a lot of, I
mean, on the creative side, you can imagine that Photos is a really exciting place
to be doing any kind of photo editing, right? Because your library is right there.
And so like turning a family photo into a birthday card, which is a use
case that I have, you know, a couple of times a year. Being able to do it right there, I think, is really exciting. And then to Oliver's point at the beginning, I think a lot of this kind of... the
factuality, whatever we want to call it, set of use cases is really exciting on
a lot of different Google product surfaces where if you want the model to explain
photosynthesis to you and you're a five-year-old and do it in a way that, you
know, it probably doesn't exist anywhere on the Internet visually, we should be able to
do that. And so I think that will just open up again, like a whole
set of use cases and opportunities for people to learn in a way that is
customized to them and is very visual. Because a lot of people are visual learners.
Another one I think is really cool is Workspace as well. Like, you
know, PowerPoint slides and Google slides, like being able to get to the point where
people can have like really compelling presentations that don't all look the same, like same
set of little points. As someone who started their career in consulting, that would be
phenomenal. Me too. If we get that. Yeah. The Lord knows we've spent too much time
formatting all this stuff. Well, and you used to start from just... storyboarding
what the slides look like on a whiteboard, right? With like the exact headline and
then, oh, I need this chart to represent this data set on the left-hand side
of the slide. Like being able to feed that to an LLM and like take
a lot of that legwork off of you. I'm very excited about this part. Just
snapshot the whiteboard. Yeah. Snapshot the whiteboard. That sounds great. That sounds great. I guess
maybe taking a step back for just like, you know, the broader kind of image
model landscape, I feel like it's just been a torrent of progress since maybe Stable Diffusion and then Midjourney. Maybe, Oliver, just contextualize, like, as you think about the last
two, three years in image generation models, like, what have been the major milestones? Like,
how would you kind of characterize the path we've been on and the extent to
which things have changed? Yeah, my God, it's been a rocket ship, definitely. I mean, I...
I worked in this space back when GANs were the de facto method for generating
images, and we were amazed by what GANs could do. But they could really only
generate in very narrow distributions, basically like, okay, we had images of people that looked
pretty good, but you could just generate these front-facing images of people. And I think
that we started to see models come out that could really generalize and give fully text-driven control over anything, but they started out really small and really noisy. But I
think a lot of us at that point were like, oh, wow, this is going
to change everything. And so we all started to like, you know, pour our energy
into this area, but none of us could have predicted the rate at which it improved. And I think I attribute this to there being quite a few... labs of absolutely great people working on these problems. And then this kind of like friendly competition between all the labs, where we see another group come out with this amazing model. You know, for a long time, Midjourney was just like far ahead of everyone. It just looked amazing. And this is really motivating people, like, man, how did they do that? How does it look so good? And
then, of course, Stable Diffusion coming out as an open source model really showed a
little bit of like, what is the size of the developer community and like how
many people out there actually want to build on these models and make things. Um,
so that was like clearly another launching point. But since then it's just been really fun working in this area. Although maybe a little bit depressing sometimes, because not only have the models gotten so much better, but people's expectations have gotten so much higher too. So
you now see people like complaining about these like small little problems where you're like,
Oh man, don't you know how hard we work to make this model? And like
just a year ago, we were looking at these images that don't even look realistic
at all. And like everyone was amazed. So, you know, people have a remarkable ability
to get jaded about new technology. It does feel like, yeah, I mean, with LLMs,
right? It's like we have this incredibly powerful technology that, if you'd told us in 2017 we'd have it, we'd be absolutely dumbfounded by, and still we always complain about all the
places where they have shortcomings. It's a funny part of human nature. In retrospect, what
allowed Midjourney to get out to such a strong lead in the space? I feel
like for a while they were like the brand, and clearly the place everyone was
going. Yeah, I mean, Midjourney... figured out a little bit before everybody else how to
do the post training and specifically how to do post training in these models in
order to give you sort of stylistic and artistic imagery. And that's really been their
bread and butter: focusing on, like, how do we give control
over style? And how do we make sure that like, no matter what you generate,
it's going to look amazing. And at the beginning, that was really important, because when you can sort of narrow the domain of images you generate to just the good-looking ones, you can actually do a better job generating those images as well. So starting with that sort of small section of really, really good-looking images was a great strategy at that point. And then I think all the models, Midjourney included, but all
the other models, like Flux and GPT, have expanded since then and can now generate
much broader categories of images while still kind of retaining the quality. What allowed that
ability to generate a broader set of images and not have to handpick just the
perfect ones to put in? Well, a bunch of things. We all figured out the
details and, in particular, what the data should look like. And at the same time,
you just have this kind of natural scaling curve where all the models get bigger,
everyone's compute gets bigger. So things that weren't possible to do back then start working
now because you can do them at a bigger scale. Yeah. And then you kind
of alluded to it earlier. We've made so much progress in image models, and I
can't tell whether there's like only 10% more to go, or there's actually like, we're
going to look back three years from now and be like, that was funny that
we thought we had good image models. How do you think about that? I'm like,
what, you know, I mean, the generations are pretty good. Like I'm trying to even
conceptualize like what the next, you know, 10x improvement even would be. Yeah, I'm completely in the latter camp there. I think we have much, much farther to go just in terms of image quality. Like let's not even think about all the other applications, but just the image quality, I think, has a lot of room to improve. And really, I think that's going to be
in the expressibility of the models. So we know we can generate some things basically
perfectly. No one could tell that it's a generated image and it's indistinguishable from reality.
But as soon as you start going outside the realm of very common things that
people try to generate, then the quality degrades really quickly. And you can look at
the prompts that require more imagination, more composing multiple concepts together. These break down really
fast. So I think with the models in the future, while the best-case images today may look as good as the best-case images in a few years, the worst-case images today will be significantly worse than the worst case then. So we'll just make the models more
useful and sort of broader in applications. And, you know, we found the broader we
make them, the more use cases people discover and the more useful they become. I'm
curious as you think about the evolution of this broader image model space. If I
compare it to the LLM world, there's really... I don't know, I'd say it feels like it's you guys, OpenAI, Anthropic, and then a host of other folks maybe slightly behind, doing different things. Do you think the image model landscape plays out similarly?
Yeah, I think it's a good question. So until now, I think images have been
an area where it's been possible for smaller teams to develop really state-of-the-art image models.
I mean, we've seen this... There's a few small labs that have just amazing models.
I hope this can continue, because I like the fact that there's small labs doing it. But like I said before, the world knowledge of these models and making them more useful, I think, is something that really benefits from scale, and especially from the scale of the language model. So I suspect we might see the same groups that can do the large language model trainings be the ones that have the image models that have all the world knowledge as well. We're seeing similar trends with the big Chinese labs also coming out with amazing models, the same as with language models.
So I think we'll see that as a major player in the image space as
well. Do you think it's that big a disadvantage for an image model to use the best open-source LLM versus the cutting-edge closed
source ones? This is a great question. I feel like this is going to depend
a little bit on the future of open source models, which is something that changes
quite frequently. I think maybe a year ago, it would have seemed like that was
a very safe solution. And now maybe it's less so, but I don't know what
the future of open source models is. It's definitely a possibility. And that could sustain
a lot of sort of smaller labs training good image models, definitely. Well, Oliver, I'd
like to just shamelessly ask you a question because you mentioned you worked in video
for a while. And I'm always trying to just make sure I grok the relationship
between image models and video models. Obviously, your colleagues have had tremendous advances on the
video side with their recent video model releases. Are these separate efforts? Do they build
off each other? How are the image and video worlds interacting these days? They're very
closely related. I mean, I think in the future, everyone is moving towards
OmniModels, which are sort of models that do everything. And for many reasons, these models
have advantages and may end up winning in the long run. I don't know. But
I would say that a lot of the techniques that we learn for image generation
also sort of went into the video generation models and vice versa. And so this
is kind of one of the reasons why video generation took off as well is
because as a community, we kind of learned how to solve this problem. So
I think that I would say... Very close friends. And we share a lot.
And maybe we'll live together someday. Yeah. I guess
the techniques to do a lot of this stuff end up looking similar across the
models. Yeah. Exactly. And even in the workflows, right, a lot of people do end
up using these models in very complementary ways. So if you're a movie maker and
you're making a movie, a lot of the initial iteration, first off, it happens in
the LLM land. And then iterating in the frame or image space makes a lot
of sense because it's faster and cheaper to do that. And then you actually move
on to video. So I think even from just like workflows and usability perspective, there's
a little complementarity that happens between the models. And a lot of the use cases
and problems that we have to solve are similar. So consistency, characters, objects, scenes, right? It exists
in image land, exists in video, gets a little more complex in video because you
have to do it over multiple frames. Yeah. What do you feel like the next
problems to solve on the video side? Yeah, I think that having the same kind
of control over the models that we get with the latest generation of image models
is something that obviously will be really impactful in the video space. So that's one
thing to look out for. I think video teams are also working on just improving
the sort of resolution and the consistency over time. And of course, like being able
to generate the same character across many scenes, whenever you look at, you know, what people want, that's always sort of top of the list. So I would
say moving towards longer form content with more coherency, definitely a future direction.
And it sounds like you can figure out some of that stuff in the image
context. And a lot of that ends up being transferable to the video side, which
is cool. It's been a fascinating conversation. We always like to end our interview with
a standard set of quick fire questions where I stuff some overly broad questions into
the end of the interview. And so maybe to start, would love your thoughts on
one thing that's overhyped and one thing that's underhyped in the broader AI world today.
I think overhyped is this idea that you can just go from like one short
prompt to something that's like a production ready solution that you can ship anywhere. I
do think that there's a lot of iteration that needs to happen. And if you
even if you look at a lot of the stuff that people post on socials,
there's actually still a lot of work that goes into, you know, producing that final
product and sharing it. So I think that's a little bit overhyped. Underhyped, I
think it's what are the UIs of the future? We already talked about this, but
how does this actually all come together? And how do you make it easier for
people to use these models, show them what's possible, and then make it useful for
specific workflows? Have you seen any products that you're like, oh, that's a really interesting
new take on a UI for the AI age? I'm still waiting. I'm still waiting.
I like node-based interfaces, but I don't think I'm representative of everybody. It's a very
specialized audience. Everyone gets their own UI in the future
anyway. It will be an era of specialization. Do you think image model progress is
going to be more or less the same next year as it's been this year?
I hope more. There's more smart people working on the
problem and putting more resources into it. I think we're going to see, in all axes, the change accelerate, yeah. Well, you've obviously taken over the internet
with everyone's focus on Nano Banana. Are there other things happening in the AI image
world that you're paying attention to that you don't think enough people are thinking about
or paying attention to? I do think it's this factuality dimension.
Even when we look at the things that people are doing with Nano Banana, people
have been making infographics or taking a picture of Niagara Falls and asking the models
to annotate it. And it sort of does a good job as a demo; it looks
fine, but when you hone in, the text is a little bit garbled. It's not
factual. It repeats information. So I think that is a frontier. It's something that not
a ton of people are paying attention to, but I do expect it to get better.
There's a real analogy with the text language models as well because when GPT-1 and
2 came out, people found like, oh, these are kind of cool. I can have
them write haikus and kind of do creative tasks where there's a broader range of
acceptable answers. But now people don't use language models for this as much at all; now they use them for, you know, information seeking, and also for kind of conversation and personal connections, like having someone to talk to. So I think that we could probably find analogies for all of these in the image space as well, where we went from creative tools to maybe information-seeking tools, and then also maybe people will end up talking to a video model, you know, when they need to talk to someone. And this is something that we could see in the future. And the models, I think,
should also get a lot more proactive because right now you always have to ask,
right? I want an image. But the model should offer one if the query warrants it, right? And we're
used to this from search, actually, right? Like you're used to going to search asking
for a thing, and sometimes it comes with text. Sometimes it comes with text and
an image. Sometimes image is the right answer. And so I also expect these models
to just get a lot more proactive and smarter, right? about how they talk to
you and how they leverage these different modalities based on what you're asking about. I
love that you'll be able to seamlessly go back and forth. And then Oliver, to
your earlier point about, it seems like so much of the improvement here is really
around reliability. And in the early days of LLMs, there were glimpses of, oh, wow,
that's really cool. But it was not nearly consistent enough to use for more kind
of work use cases. And it seems like a similar pattern to be followed with
image models as well. The most important question in this interview, definitely: what is your single favorite piece of content that you've seen made using Nano Banana? Both of you.
Okay. So my favorite piece of content is going to be like playing with the
model with my kids, where we like put them in, you know, funny locations, or make their stuffies come to life and these kinds of things, because they're so personal and it's like something I can do with them and they love it. And so like this kind of thing I think is the most important, definitely the most valuable for me. That's amazing. I feel like
people love that. Yeah. Yeah. People love the LLM, you know, bedtime stories for their kids. And now we get to move into image, and soon you'll have movies that you can put your kids into and whatnot. And that'll be... no, it's a really exciting future. Well, I want to make sure to leave the mic to you folks. Besides clicking on, you know, various bananas around the internet and Gemini, where can folks go to learn more about you, the product, anything you want to point folks to? The floor is yours. Please do go click on the bananas. Gemini app, AI Studio, you know
where to find us. And then we're both on X. So do find us.
Give us feedback. Yeah. If you find something that doesn't work or sucks, definitely like
tag me in it. It's very helpful. Amazing. Well, we'll help people contribute to
that and we'll link to you as well. But thank you both so much. This
was a ton of fun. Thank you.