
How Google’s Nano Banana Achieved Breakthrough Character Consistency

By Sequoia Capital

Summary

## Key takeaways

- **Breakthrough Character Consistency**: Achieving breakthrough character consistency from a single photo was a primary goal, enabled by high-quality data, long multimodal context windows, and disciplined human evaluations, addressing a key gap in prior models. [01:14]
- **Human Evaluation is Critical**: Subtle aspects like aesthetic quality and face consistency are hard to quantify, making human evaluations a game-changer for judging model capabilities, especially on familiar faces. [09:54], [10:16]
- **Craft and Data Quality Matter**: Beyond scale, the craft of AI involves meticulous attention to detail and data quality, with obsessed team members driving improvements in specific areas like text rendering. [13:18]
- **Fun as a Gateway to Utility**: The 'fun' aspect of Nano Banana, like turning yourself into a 3D figurine, is an accessible entry point that encourages users to discover the model's broader utility for practical tasks. [04:04], [22:14]
- **Unexpected Use Cases Emerge**: Users have creatively adapted the model for tasks like generating sketch notes for complex topics, enabling better understanding and conversation within families, demonstrating emergent utility beyond the initial design. [03:10], [03:41]
- **Accessibility and Imagination**: The goal is to provide tools that let people capture not just reality but possibility, evolving accessibility and imagination together so users can express themselves in new ways. [00:27], [05:35]

Topics Covered

  • Nano Banana's Breakthrough in Video Character Consistency
  • Hacking Nano Banana for Learning and Digestible Sketch Notes
  • The 'Aha!' Moment: Achieving Photorealistic Self-Portraits with AI
  • Personalized Tutors and Textbooks: The Future of Learning
  • AI's Emotional Impact: Visuals, Personalization, and Storytelling

Full Transcript

There's something about like visual

media that really excites people that

it's like the fun thing, but it's not

just fun. It's exciting. It's intuitive.

The visual space is so much of how we as

humans experience uh experience life

that I think I've loved how much it's

moved people.

>> I think we're really now making it

possible to like tell stories that you

never could. And in a way where like the

camera allowed anyone to capture reality

when it became very accessible, you're

kind of capturing people's imagination.

like you're giving them the tools to be

able to like get the stuff that's in

their brain out on paper visually in a

way that they just couldn't before

because they didn't have the tools or

they didn't have the knowledge of the

tools. Like that's been really awesome.

Today we're talking with Nicole Brichtova

and Hansa Srinivasan, the team behind

Google's Nano Banana image model, which

started as a 2 a.m. code name and has

become a cultural phenomenon since. They

walk us through the technical leaps that

made single image character consistency

possible: how high-quality data, long

multimodal context windows, and

disciplined human evals enabled reliable

character consistency from a single

photo, and why craft and infrastructure

matter as much as scale. We discussed

the trade-offs between pushing the

frontier versus broad accessibility and

where this technology is headed:

multimodal creation, personalized

learning, and specialized UIs that marry

fine-grained control with hands-off

automation. Finally, we'll touch on

what's still missing for true AGI and

white space where startups should be

building now. Enjoy the show.

Nicole and Hansa, thank you so much for

joining us today. We're so excited to be

here to chat a little bit more about

Nano Banana, which has taken the world

by storm. We thought we'd start off with

a fun question. What have been some of

your own personal creations using Nano

Banana or some of the most creative

things you've seen from the community?

Yeah. So I think um for me one of the

most exciting things I've been seeing is

like the I it didn't occur to me but

this is very obvious in hindsight is um

the use with video models to get

actually consistent cross scene you know

character and scene preservation. Um

>> how fluid is that workflow today? How

hard is it to do that?

>> So what I've been seeing is people are

really mixing the tools and using

different video models from different

sources. And so I think it's probably

not very fluid. I know some there's some

products out there that are trying to

like integrate with multiple models to

make this more fluid, but I think the

the difference in the the videos I've

been seeing from before and after the

Nano Banana launch has been pretty

pretty remarkable. And it's like much

much smoother and much more like what

you'd want in the video creating process

with scene cuts that feel natural. So

that's been cool. Um, and I don't know

why it didn't totally occur to me that

people would immediately do that. But

yeah,

>> one of my favorite ways that I didn't

expect is how people have hacked around

the model to use it for learning new

things or digesting information. I met

somebody last week who has been using it

to create sketch notes of these like

varied topics. And it's surprising

because text rendering is not something

that it's not where we want it to be.

But this person has hacked around like

these massive prompts that like get the

model to output something that's

coherent and um he's used it to try to

understand the work that his father's

doing who's like a chemist at a

university and it's like a super

technical topic and so he's been feeding

his lectures to Gemini with Nano Banana

and then getting these sketch notes that

are like very coherent and like visually

digestible and for the first time I

think in like decades they've been able

to have a conversation with each other

about his dad's work and that was really

fun.

>> Um, and something that I didn't see

coming.

>> I think people are really working

around,

>> you know, like this model is amazing,

but obviously it's it's not perfect. We

have a lot of things we want to improve.

And I think I've been astounded by the

ways people have found to to work with

the model in ways we didn't anticipate

and give inputs to the models in ways we

didn't anticipate to bring out the best

performance um, and unlock these things

that are kind of mind-blowing. Did you

guys in in the building of it, was there

a moment like an aha moment where you

kind of felt, "Wow, this thing's going

to be pretty good."

>> We just talked about this.

>> Yeah, I think Nicole had the aha moment.

>> I had one where so we always have an

internal demo where we play with the

models as we're developing them. And I

had one where I just took an image of

myself and then I said like, "Hey, put

me on the red carpet and like full glam,

just total vanity prompt, right?" And

then it came out and it looked like me.

And then I compared it to like all the

models that we had before. Um, and no

other model actually looked like me and

I was like so excited.

>> Wow.

>> Um, and then people looked and they were

like, "Okay, yeah, we get it. Like

you're on the red carpet somewhere."

And and then I think it it took a couple

of weeks of other people being able to

take their own photos and play with it

and just kind of realize how magical

that is when you get it to work. And

that's kind of the main thing that

people have been actually doing with the

model, right? turning yourself into a 3D

figurine. Um where it's like you want a

computer, you want a toy box, and then

you as the figurine. So like you three

times. Um like that way to be able to

kind of like express yourself and see

yourself in new ways and almost kind of

like enhance your own identity has just

been really fun. And that for me was the

like, oh man, this is awesome.

>> What was it about what Nano Banana did

with you on the red carpet that was

miles better than what everyone else

has?

>> It looked like me. Um, and it's very,

um, it's very difficult for you to be

able to judge character consistency on

people's faces you don't know.

>> Yeah.

>> Um, and so if I saw, you know, a version

of you that's like an AI version of you,

I might be okay with it, but you would

say like, oh no, the, you know, like

parts of my face are not quite right.

And you can really only do it on

yourself, which is why we now have evals

on many team members where it's like

their own faces and they're looking at

the models output with their own faces

on it because it's really the only way

that you can judge whether or not

someone looks like you

>> yourself and like faces you're familiar

with. I think like when we started doing

it on ourselves and it's like I see

Nicole a lot so like Nicole versus like

random person we might eval on, right?

It's it's just a very big difference in

terms of judging the model capabilities.

And yeah, I think it's one of those

things that it's like so fun that

preservation of the identity is so

fundamental to these models actually

being useful and exciting, but is,

>> you know, surprisingly

>> tricky. Uh, and that's why we see a lot

of other models not quite hitting it.

>> Well, I was going to ask you, I would

imagine that character consistency is

not just an emergent property of scale.

And so may maybe two questions. One, I'm

sure there's stuff you can't tell us,

but what can you tell us about how you

achieved it? And then two, was that an

explicit goal heading into the

development of this model?

>> Yeah, so I would say I mean, yeah, I

think there's definitely things that are

tricky to say here, but I I would say um

there's like sort of different genres of

ways to do image generation. Um, and so

that so that plays that definitely plays

a part uh in how good it is. Um, and I

think it was definitely a goal from the

beginning.

>> It was it was definitely a goal because

we knew it was a gap with the models

that we released in the past. Um, and

generally consistency for us was a goal

because every time you're editing

images, right? Like you want to preserve

some parts of it and then you want to

change something and prior models just

weren't very good at that. And that

makes it not very useful in professional

workflows, but it also doesn't make it

useful for things like character

consistency. And we've heard this for

years from even advertisers who are, you

know, trying to advertise their products

and like putting them in lifestyle

shots. It has to look like your product

like 100% otherwise you can't put it in

an ad. Um, so we knew there was demand

for it. We knew the models had a gap.

Um, and we felt like we had the right

recipe both in terms of like the model

architecture and the data to finally

make it happen. I think what surprised

us was with just how good it was when we

actually finally built the model.

>> Yeah. Right. Cuz like I think we felt

like we had the recipe. Exactly as

Nicole said, but there's still always

until you're seeing the model, you

finished training, you you're actually

using it, you don't know how close

you're going to get right to that goal.

And I think we were all surprised by

that.

>> Um Yeah. And I think the other thing is

if we think about like what people

expect out of editing when that you edit

on your phone apps or like Photoshop,

you expect a high degree of preservation

of things you're not touching.

>> Yeah. And depending on how the models

are made and how how the design

decisions behind them, that's very

tricky to do. But it's something people

really like it's it's one of those

things where like

it's shockingly technically difficult.

Even though it's something I think a lay

person who's using the models would

expect to be like the basic thing about

editing is like you don't mess with the

things you don't want to be messed with.

>> Yeah. back to that moment where you saw

yourself on the red carpet and wow

that's actually me and it took some of

your colleagues a couple weeks to have

the same experience because they tried

it with their own photos. The question

is beyond hey that's actually me that

you know the qualitative test is there

some sort of an eval that you can put

against that to make it quantitative

that you know we have achieved the thing

that we set out to achieve here.

>> Yeah. So I actually think I think face

consistency exactly for the reason

Nicole said is is quite hard. It's quite

hard for other people to do.

>> Yeah. Um, I will say in general, I think

what we found with image generation in

particular that's unlocked a lot for us

is like human evals are important. Um,

and so I think they're foundational.

We have we have a a team that works on

helping us build sort of good tooling

and good practices for evals and having

humans actually eval these things um

that are very subtle. Like if you think

about image generation like faces,

aesthetic quality, these are things that

are very hard to quantify. Um and so I

think human evals have been a big

game changer for us. I think it's

definitely I think it's a combination of

there's human evals, there's the very

technical term "eyeballing"

um of the model results by different

people um and there's also just

community testing and when we do

community testing we start internally

and we have artists um at Google and at

Google deep mind who play with these

models our execs will play with these

models um and that really helps I think

kind of build that qualitative narrative

around like why is this model actually

awesome Um because if you just look at

the quantitative benchmarks, you could

say like, oh, it's 10% better than this

model that we had before. And that

doesn't quite grok that emotional aspect

of like, oh, I can now see myself in new

ways or I can now finally edit this

family photo that I cut up when I was 5

years old. Yeah.

>> Um and I probably shouldn't have. People

have done that. Um when like I'm able to

restore it. Like I think you really need

that qualitative um user feedback in

order to be able to like tell that

emotional story. I think this is

probably true of many of the the Gen AI

and AI capabilities, but I think it's

especially true of

visual media where it's very subjective

versus if you think about something like

math reasoning, logic reasoning where

like you you can really ground it in an

answer, right? Um, and so it's more easy

to have these very objective automated,

you know, quantitative evals.
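
As an illustration of the kind of human-eval aggregation described above, here is a minimal sketch in Python that turns pairwise human judgments into per-model win rates. The record format, field names, and example data are illustrative assumptions, not Google's actual tooling.

```python
# A minimal sketch, assuming a simple pairwise-comparison setup: raters see outputs
# from two models for the same prompt and pick the one with better face consistency.
# Field names and the example records below are hypothetical.
from collections import defaultdict

judgments = [
    {"model_a": "model_v1", "model_b": "model_v2", "winner": "model_v2"},
    {"model_a": "model_v1", "model_b": "model_v2", "winner": "tie"},
    {"model_a": "model_v2", "model_b": "model_v1", "winner": "model_v2"},
]

wins = defaultdict(float)       # credit per model (ties split 0.5/0.5)
comparisons = defaultdict(int)  # number of comparisons each model appeared in

for j in judgments:
    for m in (j["model_a"], j["model_b"]):
        comparisons[m] += 1
    if j["winner"] == "tie":
        wins[j["model_a"]] += 0.5
        wins[j["model_b"]] += 0.5
    else:
        wins[j["winner"]] += 1.0

for model in sorted(comparisons):
    rate = wins[model] / comparisons[model]
    print(f"{model}: win rate {rate:.2f} over {comparisons[model]} comparisons")
```

In practice this kind of score sits alongside the qualitative checks discussed here (eyeballing, community testing, raters judging edits of their own faces), since a single win-rate number cannot capture aesthetic quality or identity preservation on its own.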

>> To get to that level of character

consistency from just one 2D image of of

someone is really, really hard. Can you

walk us through maybe a little bit what

are the technical breakthroughs that

helped you drive to that level of

character consistency that we actually

haven't seen anywhere else? I mean, I

think a key thing is like having good

data that teaches the models to

generalize, right? And the fact that

this this is a based it's a Gemini

model.

>> Yeah.

>> It's it's a multimodal foundational

model that's

>> has seen a lot of data and has good

generalization capabilities. And I think

that like that's kind of the secret

sauce, right? Is like you really need uh

models that generalize well

>> to be able to take advantage of that

>> for this, right?

>> Yeah. And I think the other nice part

about doing this in a model like Gemini

is that you also get this like really

long context window. So like yes, you

can provide one image of yourself, but

you can also provide multiple. And then

on the output side, you can also iterate

across multiple turns and actually have

a conversation with the model, which

wasn't possible before, right?
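
A minimal sketch of that multi-turn, image-in/image-out workflow, assuming the google-genai Python SDK; the model ID, the use of a chat session, and the response handling are assumptions for illustration, not details confirmed in this conversation.

```python
# Minimal sketch of multi-turn image editing with the google-genai SDK.
# Assumptions: the "gemini-2.5-flash-image" model ID, and that an API key is
# available in the environment (e.g. GOOGLE_API_KEY). Adjust to your setup.
from io import BytesIO

from google import genai
from PIL import Image

client = genai.Client()  # picks up the API key from the environment

# A chat session keeps earlier turns (including images) in the model's context,
# which is what lets follow-up edits preserve the same person and scene.
chat = client.chats.create(model="gemini-2.5-flash-image")

reference = Image.open("me.jpg")  # hypothetical reference photo
response = chat.send_message([reference, "Put me on the red carpet at a premiere."])

def save_first_image(resp, path):
    """Save the first inline image returned in the response parts, if any."""
    for part in resp.candidates[0].content.parts:
        if getattr(part, "inline_data", None) is not None:
            Image.open(BytesIO(part.inline_data.data)).save(path)
            return path
    return None

save_first_image(response, "red_carpet.png")

# A second turn edits the previous output without re-uploading the photo,
# relying on the conversation context rather than any per-person fine-tuning.
response = chat.send_message("Same scene, but change the lighting to golden hour.")
save_first_image(response, "red_carpet_golden_hour.png")
```

The design point being made in the conversation is that the edit loop lives in the model's context window rather than in a 20-minute fine-tuning job, which is what makes single-photo identity preservation practical for mainstream use.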

>> One, two years ago, we were fine-tuning

on 10 images of you and it took 20

minutes to actually get something that

looked like you. And that's why it never

took um it never took off in the

mainstream, right? Because it's it's

just too hard. Um and you don't have

that many images of yourself. It's like

too much work. Um and so I think it's

both kind of the like general like

Gemini gets better. You benefit from

that multimodal context window and you

benefit from the like long output and

ability to like maintain context over a

long conversation. And then you also

benefit from the like actually paying

attention to the data, focusing on the

problem. A lot of the things we get

better at come down to there's a person

on the team who's like obsessed with

making them work. Like we have people on

the team who are obsessed with text

rendering and so our text rendering just

keeps getting better because that person

just like is obsessed with the problem.

>> Yeah. It's like it's not just about

throwing high

>> quantities of data in. Right. Uh I think

that's one thing that's really important

is it's there's this like attention to

detail

>> um

>> and quality of you know all the things

you're doing with the model. There's a

lot of there's a lot of small design

decisions and decision points at every

point and uh I think that like detail

orientedness of high-quality data and

selections are really important.

>> Yeah.

>> It's the craft part of I think the AI

which we don't talk about a lot but I

think it's super important.

>> Yeah.

>> How big was the team that worked on it

>> to ship it? It took a village.

>> Yeah. Cuz especially because we split

ship across many products. So I think

like there's like sort of the core sort

of modeling team and then there's you

know our close collaborators across like

all the surfaces.

>> Yeah.

>> When you put them all together you

easily get into like dozens and hundreds

but um the team who works on the model

is much smaller and then the people who

actually make all the magic happen. We

had a lot of infrastructure teams like

optimizing every part of the stack to be

able to surf the demand that we were

seeing which was really awesome.

>> Um but really like to ship it we were

joking that it takes like a small

country. When you build something like

this, do you build it with particular

personas or particular use cases in mind

or do you build it more with a

capability first mindset and then once

the capabilities emerge you can map it

to personas? It's a little bit of both,

I would say. Like before we start

training any new model, we kind of have

an idea of what we want the capabilities

to be. Um, and some design decisions

like um how fast is it at inference

time, right? They also impact which

persona you're going after. Yeah. So

this model, because it's kind of a

conversational editor, we wanted it to

be really snappy because you can't

really have a conversation with a model

if it takes like a minute or two to

generate. That's what's really nice about image

models versus video models. Like you

just don't have to wait that long.

>> And so to us from the beginning it felt

like a very consumer centric model. Um

but obviously we also have developer

products and enterprise products and all

of these capabilities end up being

useful to them. But really, we've seen a

ton of excitement on the consumer side

in a way that I think we haven't before

um with our image models um because it

was very snappy and it kind of made

these like prolevel capabilities just

really easily accessible through a text

prompt. Um and so that's kind of how we

started it out, but then obviously it

ends up being useful in other um in

other domains as well.

>> Yeah. And I think one of the like

differences in philosophy, so like

previously we'd worked on the imagine

line of models which were straight image

generation. And I think one of the like

big philosophical goal changes in these

Gemini image generation models is

generalization is like a more

foundational capability. So, I think

there is also a lot of like there's

there's things where like we want this

model to be able to be good at this,

like representing people and letting

them edit their images and have it look

like themselves. But I think there's

also a lot of things like that are

emergent from the goal of just having a

baseline capable model that like reasons

about visual information. Like I think

one thing that's surprised me I guess as

a call back to your earlier conversation

is people can put in math problems like

a drawing of a math problem and like ask

it to like render the solution right so

like you can put in a geometry problem

and say like what is this angle and

>> that's that's like an emergent thing of

like a foundationally capable model that

has both like reasoning mathematical

understanding and visual understanding.

Yeah.

>> So it's I I think it's there's it's

both.

>> Yeah. Can you maybe share I just out of

curiosity what's a good way to

understand maybe the family mapping and

the relationship between Gemini powering

Nano Banana, Veo, you know, all these other

adjacent products and models that are

all driven and benefit from the

generalization and the scale of Gemini

itself um how you co-develop and then

where you want to take it from here

>> um our goal has always been to build the

single most powerful model that can do

all these things right? You can take in

any modality and you can transform it

into any modality. Um, and that's the

northstar. We're obviously not quite

there yet. And so on the way there, we

had a lot of sort of specialized models

that just got you great results in a

specific domain. So, Imagen was an

example of that for image generation. Veo

is an example of that for video

generation um, and editing. Um, and so I

think we're we're both kind of

developing these models to push the

frontier of that modality. and you get

really useful outputs out of that,

right? A lot of filmmakers are using Veo

in their creative process. Um, but

you're also learning a lot that you can

then bring back into Gemini and then

make it good at that modality. Image is

always a little bit, I think, ahead of

the curve because you just have one

frame, right? It's it's cheaper um both

train and at inference time. Um, so I

think kind of a lot of the developments

you see in image I expect you to

see in video like six to 12 months down the

line. Um, and so that that's always kind

of been the goal. And so we have

separate teams kind of developing these.

And then I think with image we're now

moving closer um to Gemini. Um, and to

that vision of that single most powerful

model. Um, and you will see that I think

with some of the other modalities and

along the way we'll release these

experiences that are just like really

powerful and like really exciting in

that modality. So like Veo 3 was really

awesome because it brought audio into

video generation, right, in a way that

we haven't seen before. Genie 3 was really

awesome because it let you in real time

kind of navigate a world. Um, and so in

order to push that frontier, it's very

hard to like do all of that at the same

time right now in one model. Um, and so

to some extent these specialized models

are kind of a testing ground, but I

would expect that like over time, you

know, Gemini should be able to do all

these things.

>> Oh, that's so interesting.

>> Okay, we got to ask you about the name.

>> Ah,

>> I suspect that the name was a bit of an

it's an amazing product. I suspect that

the name gave it a little bit of a boost

because it's so easy to remember and so

distinct. So, was it a happy accident or

is there some creative genius who knew

that this is going to be just the right

name?

>> It it was a happy accident. Um, so I I

think as many people know, uh, the model

went out

>> on LMArena, where many models do, and

part of that is you give it a code name. If

anyone hasn't used LMArena, you get

to put in your prompt. You'll get back

two responses from two models. They have

code names until they're publicly

released. Um, and I think it was like we

had to someone we were going out at like

2 a.m. and uh, Nicole Nicole's our

wonderful PM. Another PM we have is Nina,

and someone messaged her being like what

do we name it and she was really tired

and exhausted and she was like

this was the name, a stroke of genius

that came to her at 2 a.m.

>> This is you.

>> It was not me. It was somebody on my

team who named who named the model I

can't take

>> works with another one of our PMs.

>> Um but what was really awesome is like a

it was really fun. I think that really

helps. It's easy to pronounce. It has an

emoji which is critical for branding.

>> She didn't overthink it

>> in this era, but she didn't overthink

it. And what was awesome is everybody

just went with it once it went live and

and I think it just like felt very

googly and very organic and ended up

looking like the stroke of marketing

genius. Um, but no, it was it was a

happy accident and it just sort of

worked out and people loved it and so we

leaned into it and now there's, you

know, bananas everywhere when you go

into the Gemini app. Um which we did

because people were complaining that

they were having a really hard time

finding the model when they came into

the app.

>> Yeah.

>> Um and so we just made it easier.

>> Yeah. And Yeah. Exactly. I think there's

like publicly people were like nano

banana nano banana. How do I use nano

banana? I had someone at Google I work

with be be like how do I use nano

banana? And I was like it's Gemini. It's

right there.

Just just ask for an image.

Um, yeah. But I think that's the thing

is like I think Google's always had this

really fun brand, right? Like it's like

it's not like it's been a

consumer-oriented company at its

inception and like

>> I think it was really nice to to play on

that rep that image people have of like

Google as a fun

fun place, fun company um and have this

fun name.

>> It's also just like a really nice path

to fun being kind of a gateway to

utility, right? I I I I think Nano

Banana and just the model in general and

what you can do with it like put

yourself on the red carpet, do all the

childhood dream professions you had,

it's like a really fun entry point. But

what's been awesome to see is that like

once people are in the app and they are

using Gemini, they start to use it for

other things.

>> Yeah.

>> That then become useful in their

day-to-day life. Like you use it to

study and solve math problems or you use

it to learn about something else. And so

I think it's maybe a little bit

undervalued sometimes to like have a

little fun. um not just with the naming

but also just like with the products

that we build because it kind of gets

people in and gets them excited and that

it helps them discover other things that

you know the models are awesome at.

>> Yeah, I think other users like my my um

like my parents and their friends are

using. I think it's cuz it like had this

reputation. It was really easy. It was

really fun. It felt unintimidating to

try.

>> Then you try it and you're like actually

this this is very easy to work.

This works very easily. It's very easy

to interact with. There's no like techn

there's no like you know technology I

think can sometimes be intimidating to

people especially AI right now.

>> Yeah.

>> Um and I think the chatbot naturalness

has broken a lot of those barriers but

maybe more so with younger people.

>> Yeah.

>> Um and I think this like fun like

>> Yeah. My mom like made was like making

these images and having a great time and

and and

>> then realized she can use it to like

remove people from the background of her

images like these very practical things,

right? started very silly

>> turn very practical then people can use

it to realize like actually they can

give them diagrams or help them

understand stuff so I think there's also

like a big accessibility component

>> where do you want to take from here

maybe both from a model side and from a

product side

>> on the product side I think there's kind

of a couple areas like on the consumer

side I still think we have a long way to

go to just like make these things easier

to use right um you will notice that a

lot of the nano banana prompts are like

a hundred words

and people actually go in and copy paste

them into the Gemini app and like go

through the work to make it work because

the payoff is worth it.

>> Um, but I think we have to get past this

prompt engineering phase for consumers

and just like make things really easy

for them to use.

>> I think on the professional side, we

need to get into like much more precise

control kind of robustness like

reproducibility

um to make it useful in actual

professional workflows, right? So like

yes, we you know we're very good at

editing consistency and not changing

pixels, but we're not 100% there. And

when you're a professional, you need to

be 100% there, right? Like you really

need kind of these precise maybe even

like gesture based controls like over

every single pixel in the frame.

>> So we definitely need to go in that

direction. And then I think there's like

a general direction that I'm really

excited about which is just about

visualizing information. Um, so the

example I had about sketch notes at the

beginning and somebody kind of hacking

their way around using Nano Banana for

that use case, you could just imagine

being able to do that for anything,

right? And a lot of people are visual

learners. I think we haven't really

exhausted the potential of LLMs to be

able to like help you digest and

visualize information in whatever way is

most natural for you to consume, right?

So sometimes it's a diagram, sometimes

it's an image, and sometimes maybe it's

a short video, right? that you want to

um to learn about some concept that

you're learning in a biology class or

something like that. So I think that's

like a completely new domain that I'm

really excited about just these models

getting better and getting past the

point where you know 95% of the outputs

that you get out of these models are

just text which is useful but it's it's

not how we consume information in the

real world right now. It's really

interesting. So on the product side then

are you alluding to the fact that you

might want to vertically integrate and

build a little bit more product around

it and and also are you alluding to the

fact that maybe the way you interact

with some of these models isn't just

through pure language and prompting over

time but more UI.

>> Yeah. Yeah. I defin I definitely think

the chatbots I think are an easy entry

point for people because you don't have

to learn a new UI. You just talk to

it and then you say whatever you want to

do, right? I think it starts to become a

little bit limiting for the visual

modalities and I think there's a ton of

headroom to think about like what is the

new visual creation canvas for the

future. Um, and how do you build that in

a way that doesn't become overwhelming,

right? Because as these models can do

more and more things, it's very hard to

explain to the user in something that's

very open-ended like what the

constraints are and like how do you work

around that and like how do you actually

use it in a productive way. So I'm

really excited about people kind of

building products in those directions.

Um and for us, you know, we have a team

called Labs um at Google that's led by

Josh Woodward and they do a lot of this

kind of like frontier thinking

experimentation. They work with us

really closely where they take our

frontier models and they think about

like what's the future of entertainment,

what's the future of creation, what's

the future of productivity. Um and so

they've built products like NotebookLM

and Flow on the video side. And I'm

excited that maybe Flow could kind of

become this place where you could do,

you know, some of this creation and

think about what that looks like in the

future.

>> I think in the short term it's it's very

clear that, you know, this model has

things that it's not perfect at.

>> Um, and so in the short term, it's

obviously it should work the way you

expect it to every time, not just a lot

of the time. Um, and really make it so

seamless. uh and and fix all these like

small small things where it's just like

a little bit inconsistent in its

performance. Um I think long term it's I

think Nicole covered that which is to me

it's in order to have

that reality of really rich multimodal

generation. So like right now if you ask

Gemini to explain something it'll

usually just explain in text unless

you ask it for images. But if you think

about like the platforms that have

really taken off in the last like 10 20

years for learning, right? Like we think

of like Khan Academy started on YouTube.

We think about like Wikipedia has a lot

of images. Like it's very image focused.

If you look up any math thing you got

like diagrams and so like that should

become more like

a natural part of the flow and a part of

the way you use these models. And to

enable that from a modeling point of

view, it's it goes back to like like we

were talking about this this multimodal

understanding and seamless

generalization between modalities.

Um maybe the other interesting area as

we think about kind of you know these

models being more proactive at pulling

in you know whether it's code or images

or video when it's appropriate for the

user intent. I think this other exciting

area. I started out as a consultant um

in my career and so obviously I made a

lot of slide decks in my time. I still

do. Um and I think there are some of

these use cases where you don't actually

really want to be in the weeds of

creation. Like what you really want is

let's say you're updating your

stakeholders on how a project is going,

right? You want to pull in some context.

Maybe it's meeting notes. Maybe it's a

couple of bullet points. Um maybe it's,

you know, some other deck that you've

created in the past. And then you maybe

just want Gemini to go off and like do

all the work for you, right? Like pull

that deck together, format it, create

appropriate visuals to make it really

easy to digest. And that's something

that you probably don't want to be

involved in. And it gets more into these

agentic behaviors versus I think for

some of these creative workflows, like

you actually want to be creating. You

want to be in the weeds. You want to

think about what the UI looks like that

makes it easy for a user to accomplish

the goal. And so like if I'm designing

my house and I'm actually into designing

my house, um then I probably actually

want to play with it and like play with

textures and different colors and like

maybe what would happen if I remove this

wall. And so I think there's kind of

this spectrum of like very hands-off

like just let the model go off and like

pull in relevant visuals, materials for

a task that makes sense all the way to

like how do you actually make a creative

process like more fun and remove the

tedious parts and remove the technical

barriers that exist today with tools

that we have. It's like this mix of

giving the user fine grain control like

the precision control they want but also

at the other extreme having the model be

able to

understand the user request and

anticipate right like the need and the

outcome that it should be and do all the

intervening work in between.

>> Yeah.

>> It's almost like when you actually hire

a professional for something today,

right? Like when you hire a designer,

you give them a spec and then they go

off and then they do all that awesome

work that they do because they have all

this expertise. And so the these models

should be able to do that and they can't

really do that in many domains today.

>> What do you think the next competitive

battleground is in this world?

>> I think there's still work to be done on

making these models more capable. And so

this idea of having a single model that

can take anything and transform it into

anything else, I think nobody has really

figured that out.

>> Um, but I do think in order to actually

drive adoption, there's probably two

things. One is user interfaces. Like we

still rely very heavily on the chat bots

and we talked about this like it's

useful for some things and it's a great

entry point but it maybe isn't useful

for all the things. And so I think

starting to think about much more deeply

about who are the users, what are they

trying to do, how can the technology be

helpful and then what product do you

build around it to make that happen? Um

is probably one. Do you think five or 10

years from now the frontier will be

advancing as quickly as it has advanced

over these last few years?

>> Five to 10 years from now feels like 20

years from now. It just the space and

you guys probably see this too like the

space is moving really quickly.

>> Yeah.

>> And you know if you ask me two years ago

I would have told you the space is

moving really quickly. If you ask me

today I will tell you it's moving faster

than it was two years ago.

>> Okay. I'm gonna ask you a very different

question.

Um, so, um, I know Google's very, uh,

very sort of careful and very concerned

about deep fakes and and that sort of

thing. Um, and I have to imagine when

you saw how capable this model was,

there's a big conversation about, okay,

well, how are we going to make sure

people don't use it in the wrong sorts

of ways? How did that how does that sort

of a conversation go inside of Google

and are you guys sort of like happy with

where it ended up? I think it's an ever

evolving frontier also um because it's

this mix of you want to give people the

creative freedom to be able to use these

tools, right? And you want to give users

control to be able to use these tools in

a way that don't feel overly restrictive

and you want to prevent the worst harm,

right? I think that's always the the

balance that we spend a lot of time

talking about. Um, and so obviously when

you look at the outputs of the model,

there's a visible watermark that says

it's been generated with Gemini. So that

immediately indicates that it's AI

content. Um, and then we also in every

output that we um produce with our

models, image, video, um, you know,

audio, there's SynthID embedded, which

is invisible watermarking. Um, and so

those are kind of the the visible ways

or and invisible ways in which we verify

that content is AI generated. Um, we're

very invested in it and you know we

believe that it is really important um

to give users those tools to be able to

understand that when they're seeing

something it's not it's not a real video

or it's not a real image. Um and then

obviously when we develop these models,

we do a ton of testing um internally and

um also with external partners to kind

of find as the models get more capable,

you find new attack vectors, right? And

like new new ways that you have to

mitigate for um and so that is like a

very important part of model development

for us. And um we continue to invest in

it, and as the models get better and as

there are new things that you can do

with them, we also have to develop kind

of new mitigations for you know making

sure that we don't create harm but also

still give users the creativity and the

control um in order to make these models

usable in a product.

>> I mean I think it's a very very hard

balance to to strike, right? Um because

>> you will always have people using a tool

in good faith. You'll also always have

people using it in bad faith. Um,

>> and I think I think it's hard. It's like

is it a is it a tool? Is it something

that has responsibility? So I think we

we take this very seriously. Um,

users obviously are also responsible for

what they do with the model. But SynthID

really is an important technology

that lets us like release these

capabilities to people and have have

some faith in that we can still verify,

right? And have a tool to combat

the risk of

misinformation. Um, but

it's a super tricky conversation and I

think it's one that I've seen everyone

take very seriously. Um, there's a lot

of a lot of conversations about how to

balance both.

>> Is that the standard now across the

industry?

>> SynthID. Yeah,

>> it's a Google standard.

>> It's the Google standard. I believe

there's like every Google model, the Imagen

line, Veo, they all have SynthID

when you use them in any product

surface.

>> All right. You told us we can't go 5 to

10 years down the road because things

are moving too fast. We'll go one to

three years down the road.

>> Thank you.

>> Um two questions. One

uh what will be possible that we can

only dream about today?

And two,

what will the resulting change be to the

way that we all live our lives?

>> I really hope that a year or two from

now you could really get like

personalized tutors, personalized

textbooks in a way, right?

>> Love it.

>> Yeah.

>> Like I there's no reason why you and I

should be learning from the same

textbook if we have different learning

styles and different starting points.

But that's what we do now, right? That's

how our learning environment is set up.

And I think across all these

breakthroughs, like that should be very

possible where you have an LLM tutor

that just figures out your learning

style. What are the things you like?

Maybe you're into basketball and so I

need to explain physics to you with

basketball analogies, right? Um and so

I'm really excited about learning just

becoming way more personalized and that

feels that feels very achievable. And we

obviously have to make sure that we

don't hallucinate and there's like a

high bar for factuality. Um, and so we

need to ground in sort of real world

content, but that I'm really excited

about. Um, and that really I think just

it removes a lot of barriers for people,

right? To to your question on like what

the impact is going to be. Um, I think

it just becomes much more it becomes

much easier to learn basically anything

in a way that's very tailored to you

that you just can't do right now.

>> Could that be a Google product surface?

>> Somebody should look into it.

>> Yeah. And I think for the way it'll

change how we live and work, I think I

think we I think working on these

technologies,

I've already seen how it changes the way

we work, right? Because we we obviously

use them um a lot. Uh I'm getting

married. We made our save the dates with

our model. Um and so what I

really think we'll see is in how we

work. Part of I think the

reason that the innovation has

accelerated is we have these models you

have like code assistants you have uh

just like you can use models to like

filter things to analyze huge amounts of

data like it's drastically increased our

own workflows like what I can do this

year versus two years ago is just like

an order of magnitude more work. And I

think that's that's true of the tech

industry. It's not true of a lot of

other industries just because that

integration into their workflows or into

their tooling hasn't happened.

>> Um I so I I think you know some people

are like oh

it's going to it's going to replace me.

But at least what I've seen is it it it

really just actually changes the amount

of work an individual can get done. What

that means like for businesses or

economically I'm not sure. But I think

it means we will just see people be more

empowered to hopefully do more in the

same amount of time. Like maybe you

don't have to, you know, I have friends

who are in consulting and spend a lot of

time. They're like, I just spent a lot

of time, like two hours making slides,

tweaking,

>> moving

>> individual, moving logos around and like

hopefully they won't have to do that.

They can actually spend time

>> thinking about what the content of the

slides should be, and working

with clients. Um, and I think that

that's hopefully what we will see in

like one to two years.

>> Given the trajectory that you see in

these capabilities, are there

interesting areas that you think

startups should go do that Google itself

might not get into?

>> I think there's a ton of spaces even

just in the creative tools like I I

think there's a ton of room for people

to figure out like what what do these

UIs of the future look like? Like what

is the creative control? How do you

bring everything together? We see a lot

of people in the creative field work

across LLMs, image, video, and music in

a way where they have to go to four

separate tools to be able to do that. So

like a lot of people ideate with

LLMs, right? Like give me some concepts

like here's an idea that I have. Once

you're happy with that, you take it to

an image model. You start to think about

what are the key frames that I want to

have in my video. You spend a lot of

time iterating there. Then you take it

to a video model, which is yet another

surface. And then at some point you want

to have sound and music and mix it all

together. And then you actually want to

do maybe some heavy-handed editing and

you go to some of the traditional um

software tools. That feels like these

kind of workflow-based tools are

probably going to spin up for a lot of

different verticals. So creative

activity is just one example of it. But

you know maybe there might be one for

consultants so that you can more

efficiently make slide decks and

presentations and pitch decks to clients.

Um and so I think there's a there's a

lot of opportunity there. um that you

know some some of the big companies may

not go into.

>> Yeah, there's a lot of like how do we

make this technology useful for X

workflow right like sales fin like I'm

saying a lot of things I don't know

about in companies like financial

workflows but I I imagine there's like a

lot of tasks that could be automated

could be made much more efficient.

>> Yeah. Um and I think startups are in a

good position to really like go

understand the specific client use case

need that niche need and and do that

application layer right versus what we

really focus on is the fundamental

technology.

>> Um I think I'm just really excited

by the number of people who've been

excited

by this model.

>> Yeah.

>> If that makes sense.

A lot of people in my life like a lot of

aunts, uncles, my parents, like friends,

like they've used chat bots. They ask it

things, they get information. My mom

loves to ask chatbots about

health information. But there's

something about like visual media that

really excites people that it's like the

fun thing, but it's not just fun. It's

exciting. It's intuitive.

the visual space is so much of how we as

humans experience uh experience life

that I think I've loved how much it's

moved people like emotionally in

excitement wise like

>> I think that's been the most exciting

part of this for me.

>> My kids love it.

>> Yeah.

>> He uh my my three-year-old son tied our

dog leash which is this like fraying you

know brown rope like over himself so he

looked like a warrior. I took a picture

of him and turned him into this like

warrior superhero.

>> Yeah. Exactly.

>> And it makes him feel super human.

>> Yeah.

>> And my husband will read. So he uses

Google storybook to to read him these

stories about lessons that he learned in

school. You know, if he if there was

like an incident on the playground with

another kid or adjusting to a new

school. And it's made I mean it's made

these characters that look like him and

my husband and me and our dog and our

and our daughter in in these fun stories

and lessons that we're trying to teach

him to the personalization that you

talked about. So I really really love

this future. It's it's going to be

totally different for him growing up

>> and and and it's awesome, right? Because

this is a story for, you know, one or

five people that you would have never

had made, right? Like and and other

people probably don't want to read it. I

would love to if you want to.

Um, but I I think we're really now

making it possible to like tell stories

that you never could. And in a way where

like the camera allowed anyone to

capture reality when it became very

accessible, you're kind of capturing

people's imagination. Like you're giving

them the tools to be able to like get

the stuff that's in their brain out on

paper visually in a way that they just

couldn't before because they didn't have

the tools or they didn't have the

knowledge of the tools. Like that's been

really awesome.

>> That's a nice way to put it. Thank you

so much for having us. It was awesome to

have you.
