Google DeepMind Lead: Building AI Apps in Minutes with Gemini
By The MAD Podcast with Matt Turck
Summary
Topics Covered
- Gemini Natively Handles All Modalities
- AI Studio Exports Production Code Instantly
- Gemini Live Analyzes Screens in Any Language
- Build Deploys AI Apps with One Prompt
- Solo Founders Build Sci-Fi Apps Now
Full Transcript
[music] Greetings everyone. My name is Paige.
I'm the engineering lead for our developer relations team at Google DeepMind, which means that I get to work with the Gemini APIs, AI Studio, Gemma, and a whole bunch of other stuff every single day.
So, if you have startups that want to use our models, or if you're curious about their capabilities or about something that you see today, please feel free to chat afterwards. I would love to hear about your use case and to see how we can work together.

I don't think it's a secret that Google has been a little bit busy. Over the course of the last couple of months, we have released, on average, a new model or a new feature every single day, sometimes multiple in a day. Our latest flagship model is still Gemini 2.5 Pro, at least for now. There's Gemini 2.5 Flash Image Preview, aka Nano Banana. How many people have tried Nano Banana? Excellent. We're not the best at naming things, Nano Banana being the exception, but we'll see that in a little bit. There's Veo 3.1, our latest video generation model, which also includes audio, including audio in multiple languages, background effects, music, etc. There's Gemma 3n, which is part of our open model family, Genie 3, which is a world model, and our Computer Use API, which, if we have time, we will also be demoing.
So Gemini is special in a couple of ways. One of which is that it's natively multimodal: it can understand video and images and audio and text and code, all of the above, all at once. But it can also output multiple modalities, which means that you can output audio, including audio in multiple languages; you can output images, or images and text interleaved, or edit images; and you can do all of that with the same model. This is just a subset of the things that you can do with Gemini.

Now, it's a lot cooler to show rather than tell, so let's explore within AI Studio. It's a great way to play with DeepMind's models as soon as they come out. You can select different models here off to the right and get different insights about them.
But you can also do things like add images. So, I'm just going to select a sample image, maybe this one. I'm going to turn on structured outputs, which hopefully folks are using, and add some properties: maybe a landmark name, a country name, a city name, maybe a description. I'll make all of those available as arrays, so if the model wants to return multiple values, it can. And then hit run, and we should see a JSON blob. Yep, one that has the landmark name, country name, etc. So you can play with the models pretty quickly.

And then, once you get something working in AI Studio, if you click Get Code, it gives you everything that you would need to replicate whatever you just did in AI Studio as part of your app. So if I go to Python (I hear that all of y'all are data humans, so you probably love Python, or at least tolerate it), you can see that it has selected the specific model and given you the blurb to configure. It has also incorporated all of the information about the schema structure that I had asked for in structured outputs, and you're off to the races, as opposed to trying to figure out how to do this via documentation or via wizardry. So this is just an example of something that you could do with one of the base models themselves.
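The structured-output setup from the demo boils down to a JSON schema. Here's a minimal sketch: the field names (`landmark_name`, etc.) are illustrative assumptions, not the exact ones AI Studio's Get Code emits, and the SDK call is left in comments so the sketch runs without an API key.

```python
import json

# Illustrative response schema matching the demo: the top level is an
# array so the model may return multiple landmarks for one image.
landmark_schema = {
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "landmark_name": {"type": "string"},
            "country_name": {"type": "string"},
            "city_name": {"type": "string"},
            "description": {"type": "string"},
        },
        "required": ["landmark_name", "country_name"],
    },
}

# With the google-genai Python SDK, the schema is passed roughly like this
# (commented out; defer to the Get Code snippet for the exact call):
#
#   from google import genai
#   client = genai.Client()  # reads GEMINI_API_KEY from the environment
#   response = client.models.generate_content(
#       model="gemini-2.5-pro",
#       contents=[image_part, "Identify the landmark in this image."],
#       config={"response_mime_type": "application/json",
#               "response_schema": landmark_schema},
#   )
#   data = json.loads(response.text)  # parses into a list of dicts

print(json.dumps(landmark_schema["items"]["properties"], indent=2))
```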
But we've also recently introduced something called Gemini Live, which gives you the ability to talk to the models as if you were talking to a human, in multiple languages, and even to do really cool things like sharing your screen.

So, this is a blog post for our Genie 3 world model, which, if you haven't seen it, I really suggest taking a look through. You can describe a scene that you'd like to experience and then, holodeck-style, just navigate through it. Each frame is generated pixel by pixel, and none of this is using a physics engine behind the scenes; it's just a harness of models creating these images as you click the arrow keys off here to the left. You can have anything from these watery landscapes that respond really well (if you hit the lamp too hard, it sometimes gets knocked into the water, in which case, if you circle back, you can see it again). These experiences last for 60 seconds, 90 seconds, at least for now, but it really does feel like you're in a completely different world. And you can start from either a text description or something like a photo. So, if you want to relive one of your vacations, or a hurricane in Florida, you can. This is just very magical, and also beside the point.
So, I was looking for something that we could ask Gemini about, and this table is a pretty decent example. I'm going to go back to AI Studio, and then I am going to share my screen. I'm also going to turn on something called grounding with Google Search, and let's see how well it does. Hey there, Gemini. Could you tell me what you see on the screen? I'm going to zoom in so we can all see.

>> I see a table comparing different models with respect to various capabilities in world simulation or video generation. The models are: game engine, Genie 2, Veo, and Genie 3. The capabilities compared are resolution, domain, control, interaction horizon, and interaction latency.
>> Yeah, that's pretty cool. Which one would I care about if I wanted 720p?

>> Based on the table, both Veo and Genie 3 support resolutions of 720p. Veo also supports resolutions up to 4K; Genie 3's specified resolution is 720p.

>> Yeah. Could you tell me that again, but in Mandarin Chinese? Actually, does anybody here speak a language other than English? I probably should have asked first; I'll be polite this time. Portuguese? Brazilian Portuguese. Okay. Could you tell me that again, Gemini, but in Brazilian Portuguese?
So, this is our Gemini Live feature. You can send video data, like video data from your screen or from your phone camera, plus audio, and then again, if you click Get Code, all of it is handled for you through the code that you can copy and paste. And again, this is a single model call: a single model capable of supporting all of these different languages, over 140 the last time we checked, but we keep finding ones that we hadn't included as part of the 140. So test out a language that you're skeptical about. It supports Klingon. It supports Elvish. Try to put it through its paces and let us know what you find. It also ends up being on the order of, I think, one penny per minute of interaction. So this is the speech-to-text, LLM understanding, and text-to-speech pipeline, all bundled into one API call, for just about a penny a minute. So that is Gemini Live.
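As a sketch of what that single call looks like: the google-genai SDK exposes Live as an async streaming session. The session setup below is commented out (the model name is a placeholder, and streaming needs an API key), and the runnable part just sanity-checks the "about a penny a minute" figure quoted above, which is an approximate number from the talk, not published pricing.

```python
# Gemini Live bundles speech-to-text, model reasoning, and text-to-speech
# into one streaming session. Roughly, with the google-genai SDK:
#
#   from google import genai
#   client = genai.Client()
#   config = {"response_modalities": ["AUDIO"]}
#   async with client.aio.live.connect(model="...", config=config) as session:
#       await session.send_realtime_input(...)   # stream mic / screen frames
#       async for message in session.receive():  # stream audio responses back
#           ...

PRICE_PER_MINUTE_USD = 0.01  # "on the order of one penny per minute" (approximate)

def session_cost(minutes: float) -> float:
    """Rough cost estimate for a Gemini Live session at the quoted rate."""
    return round(minutes * PRICE_PER_MINUTE_USD, 4)

print(session_cost(30))  # a half-hour conversation at the quoted rate
```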
Actually, the other thing that I was going to point out was grounding with Google Search. So, as an example: hey Gemini, could you tell me what the temperature is like in New York City today, and then also what it's going to be like for the rest of the week?

>> The current temperature in New York City is 53°F, or 12°C. It feels like 51°F. Today, there will be clouds during the day and light at night.

>> That's a lot of information, so I'm just going to hit stop. It gave me, obviously, a lot of information, and also citations as part of the REST API call. And it used code execution and our search grounding feature, which is baked in, in order to look up information on Google Search. And when you look at the code snippet, so this guy over here, if you scroll down, you can see that incorporating search as a tool call is a one-liner. You can also add custom tools if you would prefer, to cobble together with Gemini Live or with any of the other models.
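That one-liner looks roughly like this in the google-genai Python SDK. A minimal sketch with the network call commented out; the dict shape mirrors the SDK's typed equivalent, `types.Tool(google_search=types.GoogleSearch())`.

```python
# Grounding with Google Search is enabled by adding a single tool entry
# to the request config: this is the "one-liner" mentioned above.
config = {
    "tools": [{"google_search": {}}],
}

# Commented out so the sketch runs without an API key:
#
#   from google import genai
#   client = genai.Client()
#   response = client.models.generate_content(
#       model="gemini-2.5-flash",
#       contents="What's the weather in New York City this week?",
#       config=config,
#   )
#   # response.candidates[0].grounding_metadata carries the search citations

print(config["tools"])
```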
Something else that we've been working on pretty significantly is our Build feature, for generating apps, including AI-enabled apps, and then deploying them on Google Cloud. So this is available today at a.dev/build. You can do really interesting things, like saying: create an app that takes a photo of the user with the webcam. It's Halloween, so I'm going to say something like: the app should then ask the user for their favorite animal, and what costume they would like to have for Halloween. The app should then use Nano Banana to show the user dressed in a DIY costume while cuddling their favorite animal. Be creative. Dressed in a related costume. And I've never tried this before, so we'll see if it works. Fingers crossed.
So I hit enter, and immediately you're launched into something that looks a little bit like an IDE. If you expand out the thinking section off to the left, you can see Gemini 2.5 Pro walking through the architecture. We have a very opinionated stack for the app-building process itself: React, Tailwind, and probably some other things that you've seen before. And it incorporates all of the latest features and models that are available in our GenAI SDK. So even though I said Nano Banana, it knows that I mean Gemini 2.5 Flash Image, and it should be able to figure out what should be incorporated into the app itself. And if it encounters any errors along the way, it should also be able to take the error, put it back into the Gemini model, and fix itself. So it makes it very, very easy to experiment (and bless you) and to keep running through a lot of these images. So let's see what happens. There's also a Save to GitHub feature in the upper right, and if you want to add different files, you can: you can upload images, text files, and PDF files. So, let's see how this goes.
Oh gosh. So that is, hopefully, it sees me and not the... yep. So it's got... let me zoom out a little bit. My favorite animal: a fennec fox. A dream Halloween costume: not a rocket ship, but maybe a robot. And then let's see how well it does. It says "powered by Gemini Nano Banano," which I did not ask for it to add, but... oh, there we go. So, there's a robot with me inside it, and a fennec fox that is, honestly, a very cute fennec fox. And he has a little spaceship thing, which is very cute. So clearly everybody else wants to try this as well.
Obviously, if I click Deploy App (and I will absolutely be sending this app to all of my nieces and nephews), you can select a cloud project and click deploy, and what it does is create logging and storage. It deploys the app via Cloud Run. It hides all your API keys so they can't be exfiltrated from the app itself. And it also gives you a unique URL. So, if all of y'all wanted to go and experiment with this, you could. Maybe I will say my favorite animal is a labradoodle puppy, and maybe a goblin costume, and we'll see how well that works. And then, if I go to my Google Cloud console, you can see the request counts and the logs. And as the app gets more utilization, since this is based on Cloud Run, it will also scale out and support increased workloads. So we should see some sort of logging data coming in pretty soon, if the... oh my gosh. Yeah, that is also very cute.
A couple more demos, because we've got time for it. We also just recently added Gemini support in Colab. So again, all of y'all are data humans, or hopefully data humans. Actually, how many folks have seen Colab before? Okay, more than have seen AI Studio, but still, here's a second piece of homework: go try Colab if you enjoy Python programming. It's a notebook-based environment where you can access GPUs (single-node GPUs or TPUs) and interact with a compute runtime that can do a broad spectrum of data tasks. But we've also incorporated Gemini's step-by-step reasoning process to do things like analyze CSVs. So, I'm going to just grab one that's handy here. You could also just add a URL, and it knows how to curl or wget the file itself, or to use something like pandas' CSV reader to pull in the data, too.
You can say something to the effect of: please do exploratory data analysis on these data, and also experiment with a few different models to predict California housing prices (which is always a very sad thing), and then hit enter. Similar to what we just saw with the app builder example, it should give us a step-by-step process to do the data analysis. So, I'm going to accept and run it. It will load the dataset and do EDA, so we've got this step-by-step preparation that's going to happen. I'm going to collapse this guy, collapse this guy, so it's easier to see, and then zoom out so it's easier as well. It's adding the text and then the code, running it, and using each previous step to inform the next steps. So as it does EDA, it will take the shape of the data and the distribution of some of the features, decide which ones should be included, visualize them, wrangle with Matplotlib so I do not have to, and then show me the results along the way. I love this just because I hate trying to remember all of the seaborn and Matplotlib incantations. And it's a really nice way to get a quick overview of a lot of different EDA tasks without necessarily sitting in front of a screen and interacting with it yourself. I also feel like not enough people look at their data, and a big reason for that is that they don't necessarily have the tools or the ability to do it by themselves. This kind of solves a lot of that problem.
But while this is cooking, I'm going to go back to the presentation. Veo is our video generation model. And no presentation is complete without a quote from Andrej Karpathy, so here's mine. I truly believe that video is a powerful medium for understanding everything and communicating it with folks.

Veo 3.1 can do a broad spectrum of things. It can do grounding based on reference images, so you can have that same character in a different collection of scenes. It can animate images, including with guidance, different camera controls, and outpainting, so you can take a full-screen video and turn it into widescreen. You can add things to a video; you can remove things. You can have reference face movements: a character, and then some way in which you want the character to react. You can use a first and last frame and then interpolate between them. And you can also do things like this, which is probably copyright-protected.
>> Hey, my name's K. And what makes the chicken sandwich original to me is the crispiness of the bread.

>> ...a commercial that I want to replicate, and I'm going to show you how you can do that with different Veo models. Also, this is not me, Paige, talking about the Chick-fil-A sandwich; it's some other Paige who also enjoys chicken sandwiches. The process to replicate this with Veo 2 is: you feed the video in to Gemini; it creates a very detailed prompt, which you can give back to the Veo 2 model; you use Gemini text-to-speech to do the voiceover; and you use Lyria, which is one of our music generation models, to do the background music. It takes a little bit less than 30 minutes. I stitched it together with Camtasia; you could also use MoviePy. And it ends up like this.
>> Hey, my name's Paige. And what makes a Chick-fil-A chicken sandwich to me is the crispiness of the breading and the tenderness of the fillet. It's tasty. It's warm. It's total satisfaction.

>> Which I thought was pretty good; I was very pleased with the chill guitar music. But the process with Veo 3 got a little bit better: you just give the video to Gemini 2.5, and it generates the prompt, which you give to Veo 3. And this is what it looked like.

>> My name is Paige. And what makes the Chick-fil-A chicken sandwich to me is the crispiness of the breading and the tenderness of the fillet.

>> So, that was first shot, in June, so maybe four or five months ago. Same prompt into Veo 3.1, same process. And this is what it looks like now.

>> Hi, my name's Paige. And what makes the Chick-fil-A chicken sandwich original to me is the crispiness of the breading and the tenderness of the fillet.

>> So: a four-month delta, same prompt, same process, and it's amazing to see what has changed.
So, with that, I'm going to fly through, because I have no time left, to just say that even the tiny models are getting better and better. Our Gemma 3n release is just about as good as Gemini 1.5 Pro, which was our best model around eight months ago, but it's four billion parameters in size. You can download it to your laptop. There's a quantized version which fits on mobile devices and can be stuffed into a browser.

And my final point, for all of y'all in the room: it has never been a better time to be a founder, especially a solo founder or a founder with a small team. I think that people can do very, very magical things right now in a very short amount of time. I remember when I first started training models, it took months to do anything that was reasonably interesting. And now it feels like, with these truly democratized tools and approaches, you can build things that felt like sci-fi even just a year ago. So go build stuff, get these nice people to fund you, and share it with the world. Thanks again for staying here, for coming out. Go test out AI Studio, and send me emails, because I love questions.