
The Gemini Interactions API

By Sam Witteveen

Summary

Topics Covered

  • APIs Evolved: Completions to Agent Calls
  • Server-Side State Saves Tokens
  • Reasoning Tokens Persist Server-Side
  • Call Research Agents Directly

Full Transcript

Okay, so in this video I'm going to look at the new Gemini interactions API, which was released last week. This is something that the team has been working on for a few months now, and it really makes a lot of sense in that, as the way we're using large language models has changed over the past few years, the API endpoints really have had to change as well.

If we go back in history and look at some of the main ideas for these kinds of APIs, it really in many ways started with the completions API from OpenAI. The whole idea there was that it was just a simple text in, text out model. You basically send a prompt in and get a completion out. And if you think about it, that was exactly what the first LLMs were: they were just taking the text that you were putting in and using that to generate a completion out of the model. Now, while that API made a lot of sense at the start, you had to handle everything in there. There was no built-in conversation memory. There was no way of signaling to the model which bits the user said versus which bits the model said. Later on, people brought in system prompts and things like that, and that brought about the revolution of the chat completions API. This was where OpenAI kind of realized that, okay, as they were bringing out new models, most of the people at the time were using them for chat interfaces, things like ChatGPT, etc. And along with the introduction of a system prompt, they actually brought in the whole idea of roles: having a system role, a user role, an assistant role. And while the API was still stateless, it suddenly had much better scaffolding for context management and for doing conversations between a user and a model. Along with this, function calling came out as being one of the killer features that people could use the models for, and that required changes in the API as well.

Now, jump forward to 2025. Everyone's been talking about agents, and agents don't necessarily work in this conversational scaffold of user and assistant. There are lots of things going on in there, and this is why OpenAI brought out their responses API, moving more to structured outputs and establishing response schemas as first-class citizens in the returns that you get. Not to mention that, along with all of this, multimodality has become a huge use case for these models across the entire industry.

And that brings us to the Gemini interactions API. Whereas in the past perhaps Google was trying to catch up to things that OpenAI had done, now it certainly seems that Google's much more in a position to be thinking, okay, what do they really want to do with this? A big factor that seems to be key here is that they're not just thinking about models themselves, but agents that are using models. And this is certainly a common trend that we're seeing for user or consumer interfaces. Really, most of these models in the cloud now, when people are using them through ChatGPT or the Gemini app or something, you're really interacting with an agent. You're not just purely interacting with a model. You're interacting with a system that can use multiple models. It can do multiple loops of calls to models. It can use tools. It can do code execution on the back end. And in many ways, these same features are now starting to come to developer API endpoints.

And this is what the Gemini interactions API is really all about. So, one of the first things that Google's changed here is that they've made it optional so that you can actually have a server-side history for conversations. You don't need to resend every single thing each time. Now, that gives you a whole bunch of different options. You can still manage it stateless like you did in the past if you want to, but you now have the ability to refer to a previous call and use that as the history for the current call. That enables a whole bunch of different things around token efficiency. One of them is being able to make use of the implicit caching that is going on in that backend, so you don't have to pay as much for the tokens that you're calling.

Another big factor with the new interactions API is that not only can you call a model, you can actually call agents. This is where the Gemini team have exposed the Gemini research agent as something that you can call as a developer. Obviously this tool and idea has been available in the consumer interfaces for quite a while, and we've seen it not only in the Gemini app but in the OpenAI app, in Claude, and in a whole bunch of different consumer interfaces. The big advantage that we've got now, though, is that you're able to actually use this as a developer and call it with the interactions API.

Now, another feature that makes that possible in the new API is this whole idea of background execution. So, if you've got a really long-running task, whether that's going to be something like running an agent, or perhaps doing something that's going to have code execution as a tool on the back end, you can now actually set it to be a background execution, where you can offload long-running inference loops to the server without needing to maintain the connection from your client, and then just poll it when you want to get the final result back.

On top of these features, they've also added in a bunch of new things for multimodality. Obviously, now the Gemini models can not only take multimodality in, in the form of images or audio or PDF files, etc., but they can also generate things out. Don't forget that Nano Banana Pro is actually Gemini 3 Pro Image, and this new API supports using all of those models as well.

So, I think the best thing is to just jump into the code, and I will walk you through a notebook of using the new interactions API. We can look at some of the changes and how you're probably going to want to do most of the common tasks that you would do with an API like this, including how you can run the new Gemini deep research agent using this API. So, let's jump in and take a look at the code.

Okay, so the first thing you need to do is make sure that you've got the latest version of the SDK installed. You actually want google-genai 1.55.0 or higher to get the interactions API. You can see here we're just basically importing the genai SDK and checking that it's 1.55. And then to use this, we're going to just use client.interactions.
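
Roughly, that setup looks like the sketch below. The client call is standard google-genai; the client.interactions surface is what the rest of this walkthrough assumes.

```python
# Install or upgrade the SDK first:
#   pip install -U "google-genai>=1.55.0"
from importlib.metadata import version

from google import genai

print(version("google-genai"))  # the notebook wants 1.55.0 or higher

# The client reads GEMINI_API_KEY (or GOOGLE_API_KEY) from the environment.
client = genai.Client()
# From here on, calls go through client.interactions instead of the older
# client.models.generate_content style.
```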

So, this is the main difference from the previous API. And the previous API is still in there; you can actually still use that. I'm not sure how long that's going to be there, but with the interactions API we've now got a simpler way to basically just go through and pass in a model, pass in an input, and get a response back. Obviously, if you're using a model that has reasoning, you're going to be using reasoning tokens as well, so you want to be aware of that. You can come in here and see, okay, what are your input tokens, what are your output tokens, reasoning tokens, etc. And of course this will work with the new Gemini 3 models; you just pass those in.

We can also pass in a system instruction if we want to and get a normal response back from this kind of thing. We also have the ability to pass in configs, and the config is now quite simple. You can basically pass in things like temperature, max output tokens, a whole variety just like you could before. You can also pass in the thinking level here if you want to adjust the thinking tokens or the reasoning tokens coming back. You can see in this case we used a lot fewer tokens going through there.
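
As a rough sketch of that basic call with a config block, treat the exact method and field names (create, input, system_instruction, the config keys, the usage attribute) as approximations of what the notebook shows rather than confirmed SDK names, and swap in whichever Gemini 3 model id you have access to.

```python
interaction = client.interactions.create(
    model="gemini-3-pro-preview",   # any Gemini 3 model id you have access to
    input="Write a one-line description of the JAX library.",
    system_instruction="You are a concise technical writer.",
    config={
        "temperature": 0.7,
        "max_output_tokens": 512,
        "thinking_level": "low",    # dial reasoning tokens down for simple tasks
    },
)

# The final text comes back in the interaction outputs.
print(interaction.outputs[-1].text)

# Input, output and reasoning tokens are reported separately (attribute name
# approximate).
print(interaction.usage)
```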

Obviously we can also stream responses back. All I need to do for that is just pass in stream equals true, and then I can filter each chunk for whether it's a reasoning token, like a thought token, or whether it's a final token. And then if I do want to make use of the reasoning or thinking tokens in here, I can basically set the thinking level to high in the generation configs and set the thinking summaries to auto. Now, by default I think the thinking summaries are just turned off; you don't get those. And of course, don't forget that's going to be using tokens. So, you can see here we've gone through and printed that out, and our final response back is this joke. We can see that the output tokens were only 13 tokens for this, but the reasoning tokens were nearly 600 tokens there. And if we want to see those thought tokens, we can actually look at the thought summaries. Now, I don't think it's guaranteed that the thought summaries will be the exact same size as the reasoning tokens; that's something that may change based on the output. But at least here we can see a summary of the thinking that got us to this final response.
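
A sketch of the streaming pattern described here; the stream flag, thinking config keys, and the per-chunk thought marker are assumptions based on the walkthrough.

```python
stream = client.interactions.create(
    model="gemini-3-pro-preview",
    input="Tell me a short joke about compilers.",
    stream=True,
    config={
        "thinking_level": "high",
        "thinking_summaries": "auto",   # off by default in the walkthrough
    },
)

for chunk in stream:
    # Each chunk is either a thought/reasoning summary piece or final output text.
    if getattr(chunk, "is_thought", False):
        print("[thought]", chunk.text)
    else:
        print(chunk.text, end="")
```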

And in relation to the reasoning, one of the things that you'll see, as I'll show you with chat going forward, is that you can pass in a previous call. So if you're making multiple calls, one of the cool things is that this can now persist the state server side. The simplest use of that is going to be for a chat. You can see here we've got where I'm saying, okay, hi, my name is Sam. It responds back with this, and we've got this interaction ID and we can pass that in. So that becomes like a memory, and you can see for the next one all I do is pass in "what is my name" and it's able to retrieve that from the memory based on the fact that we've passed in the previous call. So you can just keep passing in calls and having the memory persist server side.
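
A minimal sketch of that chained-chat pattern, assuming the previous-call reference is passed by interaction id (the previous_interaction and id names are my guesses at what the notebook shows).

```python
# Turn 1: start the conversation.
first = client.interactions.create(
    model="gemini-3-pro-preview",
    input="Hi, my name is Sam.",
)
print(first.outputs[-1].text)

# Turn 2: instead of resending the whole history, point at the previous
# interaction id so the server-side state acts as the memory.
second = client.interactions.create(
    model="gemini-3-pro-preview",
    input="What is my name?",
    previous_interaction=first.id,   # field name approximate
)
print(second.outputs[-1].text)       # should recall "Sam" from server-side state
```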

And you can see with this second turn here, I've just called it new interaction, and same for the third turn, and I can just keep passing that back in. So each time we're passing the new interaction in, we're adding to the memory, and that memory is getting longer as we go through. You can see here I've asked it to do 1500 words. It hasn't done that; it's done, I think, a few hundred words on this. And the reason why I wanted to show this was that once you get above around a thousand tokens, it starts to do the implicit caching of those tokens server side. So that can be something useful that allows you to actually save money as you're doing something like a conversation or a multi-turn set of interactions. And you see, sure enough, the tokens getting used are going up.

Most of those tokens are reasoning tokens. And this brings up something that's really interesting: one of the things you want to have is this thought signature. So basically Google is not giving us the real raw reasoning tokens, but you could imagine that as we do multiple calls, we would like the model to know the real reasoning that it did beforehand. One of the ways we can do this is by passing this back in and having these signatures for those actual thought tokens. It can then access those reasoning tokens, the raw thought tokens that it had, not just the summaries. So while we can't see them client side, we can actually use them in follow-up calls going through. And you can see even after five turns of doing this, if I ask it, okay, what is my name, it's still got that in memory. So it shows that this has actually persisted as we've gone through.
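
The practical takeaway, as a hedged sketch: when you chain calls through the server-side state, the prior reasoning travels with the interaction; when you manage history yourself, keep the model's output items verbatim (including any thought signature fields) rather than just the text. The field and parameter names here are assumptions.

```python
# Server-side state: just chain the calls, the signatures are handled for you.
turn_1 = client.interactions.create(
    model="gemini-3-pro-preview",
    input="Think through how you'd debug a flaky unit test.",
)
turn_2 = client.interactions.create(
    model="gemini-3-pro-preview",
    input="Apply that approach to a test that fails only in CI.",
    previous_interaction=turn_1.id,   # prior reasoning stays available server side
)

# Client-side state: when replaying a conversation yourself, store the output
# objects (text plus any thought_signature fields), not a re-typed copy of
# turn_1.outputs[-1].text alone.
```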

Now, you can also retrieve an interaction to basically get that interaction back. You can see that if I just go back and get this last interaction here, sure enough, I'm able to get out the response for that. And I guess this could be useful if you're going to serialize a whole conversation yourself as you go through.
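
Continuing from the chat example above, a one-liner sketch of fetching an earlier interaction back by id (the get method name is an assumption).

```python
# Retrieve a stored interaction by its id, e.g. to serialize a conversation.
stored = client.interactions.get(second.id)
print(stored.outputs[-1].text)
```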

If you do want to do conversations in the previous sort of chat format, that's not a problem. You can just instantiate it like before, where you have your role as a user role plus content, and you can just pass that in. Passing in a list of dictionaries like that allows you to basically build a full conversation, and then we can just append to that the model output and a new user response and pass that in as well. Obviously the issue here is that we're not necessarily saving on tokens; we're passing everything in each time.
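
A sketch of that fully client-side style, assuming the input accepts a plain list of role/content dictionaries (role names as the SDK expects them, "model" rather than "assistant" for Gemini).

```python
# Fully client-side history: a plain list of role/content dicts, resent on
# every call (no token savings, but no server-side state either).
conversation = [
    {"role": "user", "content": "Hi, my name is Sam."},
]
reply = client.interactions.create(model="gemini-3-pro-preview", input=conversation)

# Append the model's turn and the next user turn, then send the whole list again.
conversation.append({"role": "model", "content": reply.outputs[-1].text})
conversation.append({"role": "user", "content": "What is my name?"})
reply = client.interactions.create(model="gemini-3-pro-preview", input=conversation)
print(reply.outputs[-1].text)
```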

All right. Next up: multimodal stuff.

Obviously, now they have both multimodal understanding and multimodal generation in here. So for multimodal understanding, we can pass in an image, and we can do things with the files API like before, but we can also just convert everything to base64. You can see if I pass in this image, a picture of the blog post, I then just base64 encode it and pass that in. Sure enough, it has no problem processing that. Same for audio as well. I've got some audio here, which was actually generated with the Gemini TTS, of two speakers interacting. I can take that in, whether it's a WAV file or an MP3 file, base64 encode it, pass it up to the model, and then ask it, in this case, to please transcribe this audio and put in speaker tags. And sure enough, it's able to get out exactly what the audio said and the fact that there were two speakers in there.
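
A sketch of the base64 approach for understanding, with hypothetical file names; the exact shape of an inline-media input part is an assumption, and the files API remains an alternative.

```python
import base64

# Read a local image and send it inline as base64 alongside a text prompt.
with open("blog_post.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

interaction = client.interactions.create(
    model="gemini-3-pro-preview",
    input=[
        {"type": "text", "text": "Describe what this screenshot shows."},
        {"type": "image", "mime_type": "image/png", "data": image_b64},
    ],
)
print(interaction.outputs[-1].text)

# Audio works the same way: base64-encode the WAV/MP3 bytes and send them with
# a prompt like "Please transcribe this audio and add speaker tags."
```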

The same sort of thing can be done for video and also for PDF files. Now, if we want to generate multimodal stuff out, whether that's images or audio or video, etc., the only key thing is that we change the model name, but we still just pass in our prompt like before, and we pass in the response modalities here. And this is one of the key things about the interactions API: you can define what the outputs will be. You'll see this when we look at the structured outputs as well. But here we're basically defining the response modality out as image. It will then return that back in the interaction outputs. We've now got a type of image in there, and we can just save that out and then display it. And sure enough, it's made an image with the Nano Banana Pro model. So, it does seem that the Gemini API team is really working on trying to make it simple for us to do lots of different multimodal calls to the model, to be able to not only have multimodal in but also multimodal out. And the same thing that I've shown you here with the image out, we could also do with audio out, etc.
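
A sketch of the image-generation path: switch to the image-capable model and declare the output modality. The model id and the response_modalities / output field names are approximations of what the notebook shows.

```python
import base64

interaction = client.interactions.create(
    model="gemini-3-pro-image-preview",   # "Nano Banana Pro" in the video
    input="A watercolor painting of a banana wearing sunglasses.",
    config={"response_modalities": ["IMAGE"]},
)

# The generated image comes back in the interaction outputs as an image-typed
# item; decode and save it to disk.
for item in interaction.outputs:
    if getattr(item, "type", None) == "image":
        with open("banana.png", "wb") as f:
            f.write(base64.b64decode(item.data))
```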

If we look at structured outputs, we've got a similar kind of thing here. We're just using Pydantic to define some model classes, etc. And we can see, sure enough, that for the response output we can basically just say it's going to be this class, passed in as its model JSON schema. So in many ways this makes it super simple now to do structured outputs for anything that you want out of the model. And you can just define them; you can see here we're nesting classes, nesting models within this moderation result, and this moderation result is what's actually being passed back. When we want to parse the output, we can just take this class, validate the JSON, run it through, and print it out here. So more and more, I think, for anything where you're not doing actual chat stuff, or even if you are doing chat, you can have one class for the chat and one class for analysis of the chat, or other sorts of things when you're generating multiple pieces of text that perhaps relate to the same input in a single output.
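
A sketch of that pattern with nested Pydantic models. The class and field names here are illustrative (the notebook's moderation result will differ), and the response-format config key is an assumption; the model_json_schema() and model_validate_json() calls are standard Pydantic v2.

```python
from pydantic import BaseModel

# Nested Pydantic models describing the structured output we want back.
class CategoryScore(BaseModel):
    category: str
    score: float

class ModerationResult(BaseModel):
    flagged: bool
    categories: list[CategoryScore]

interaction = client.interactions.create(
    model="gemini-3-pro-preview",
    input="Moderate this comment: 'This tutorial is useless and so are you.'",
    # Pass the class's JSON schema as the response format (key name approximate).
    config={"response_format": ModerationResult.model_json_schema()},
)

# Validate the returned JSON straight back into the Pydantic class.
result = ModerationResult.model_validate_json(interaction.outputs[-1].text)
print(result.flagged, result.categories)
```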

All right, so tools and function calling. For function calling, I've just gone with their particular example here. We can see this is standard; I don't think there's a huge amount of difference. We're now just passing in a list of tools, and we've got the standard sort of function call type coming back. We can then execute the tool and pass the result back into the model like this. And in this case, we're making use of that server-side state. If we wanted to do it fully client side, we can do it just like here, where we're building this sort of history as we go through, passing in that history and then appending to it, just like we would have done before when we had a stateless solution.
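
A sketch of that loop under stated assumptions: the tool-declaration shape, the function_call/function_result output types, and the previous_interaction chaining are my reading of the walkthrough, not confirmed SDK schema, and get_weather is a hypothetical local tool.

```python
def get_weather(city: str) -> str:
    return f"It is 31C and humid in {city}."   # stand-in for a real lookup

weather_tool = {
    "type": "function",
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

first = client.interactions.create(
    model="gemini-3-pro-preview",
    input="What's the weather like in Singapore?",
    tools=[weather_tool],
)

# Pull out the function call the model asked for, execute it locally, and send
# the result back in a follow-up turn chained off the previous interaction.
call = next(o for o in first.outputs if getattr(o, "type", None) == "function_call")
result = get_weather(**call.arguments)

second = client.interactions.create(
    model="gemini-3-pro-preview",
    previous_interaction=first.id,
    input=[{"type": "function_result", "name": call.name, "output": result}],
)
print(second.outputs[-1].text)
```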

Of course, we've also got access to built-in tools, and the built-in tools are now very simple to use. You can basically just pass in the built-in tool as being Google Search. We can then run this through, and it will give us an answer back. We can get some URLs and stuff like that. I'll talk more about the URLs when we get to the agent; this is my one big gripe about Google with all this stuff, and it's to do with the URLs that they return.

We've also got the code execution tool, which allows us to basically get the model to actually write Python code. You can see here I've put in "calculate the seventh prime number" and given it the code execution tool. So it comes back with the right answer here. And to see how it actually did that, we can see that, okay, it actually wrote code and ran that server side to get the correct answer out there.
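
A sketch of both built-in tools used so far; the tool identifiers ("google_search", "code_execution") and the tools-list shape are assumptions based on how the walkthrough describes them.

```python
# Built-in, server-side tools: just name them in the tools list.
search = client.interactions.create(
    model="gemini-3-pro-preview",
    input="What did Google announce about the Gemini API last week?",
    tools=[{"type": "google_search"}],
)
print(search.outputs[-1].text)

calc = client.interactions.create(
    model="gemini-3-pro-preview",
    input="Calculate the seventh prime number.",
    tools=[{"type": "code_execution"}],
)
# The outputs include the Python the model wrote and ran server side,
# plus the final answer (17).
for item in calc.outputs:
    print(item)
```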

The third tool is the URL context tool. Here you can see we can basically just pass that in as the tool, and then pass in any URL, and it's supposed to go and get it. Now, there are certain URLs that I think are blocked. I think that's got to do with the AI robots.txt conventions nowadays, whether sites are giving permission for AI to scrape them or not. So you'll find that for certain sites this just won't work, and you want to check whether it's going to actually work for your particular use case. All right. Lastly for the tools section is the whole remote MCP support.

You can now actually just pass in a remote MCP server. You can see here this is just telling it, okay, this is a weather service; they've got it running on an appspot instance here, and we can just pass that in as a tool. So this is nice: all we need to do is pass in that it's an MCP server, the name of it, and the URL, and then the model can work out what to do with that. You can see here that I've just passed in "what is the weather like in Singapore" (it wasn't working if I just asked it for today), and it's given me a response back. And if I look at the interaction outputs here, I can see what was actually sent to that MCP server. We can see it's an MCP server tool call there, and I can see the response that came back with the different temperatures, etc., going through this.
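
A sketch of registering a remote MCP server as a tool: just a type, a name, and a URL. The tool shape is an assumption, and the URL below is a placeholder, not the service used in the video.

```python
# Point the model at your own remote MCP endpoint (placeholder URL).
weather_mcp = {
    "type": "mcp_server",
    "name": "weather_service",
    "url": "https://your-weather-service.appspot.com/mcp",
}

interaction = client.interactions.create(
    model="gemini-3-pro-preview",
    input="What is the weather like in Singapore?",
    tools=[weather_mcp],
)
print(interaction.outputs[-1].text)

# interaction.outputs also contains the MCP tool call itself and the raw
# response that came back, which is handy for debugging.
```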

All right. So the final selling point of this whole new interactions API is its ability to use agents. You can see here that they've currently released just one agent, the deep research pro preview agent for December 2025. And rather than passing in a model, we can just pass in the agent. We don't know exactly what the agent is doing; we're basically just making use of an agent that's running server side on the Gemini API. I've asked it to give me a clear outline of the history of JAX. And you'll see that one of the things this agent really shows is that we can run tasks in the background. Once we've started this off, we get our interaction ID, but we're no longer connected; we're not waiting for a response. So what we can do is ping the interaction to see if it's completed, or if it's failed, or if it's still in progress. In this case, I'm basically just printing out the interaction status and pausing every 10 seconds. You can see it took a while to actually run all of this in the background, and then finally it comes back when it's completed and actually gives us the output here. So you can see we've got a markdown report on the history of JAX that we asked for, with the initial release in 2018 and the different things that happened that led up to it.
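
A sketch of the background-agent pattern described here. The agent id string, the background flag, the get method, and the status values are assumptions based on the walkthrough; the polling loop itself is plain Python.

```python
import time

# Kick off the server-side research agent as a background task: we get an
# interaction id back immediately and can disconnect.
job = client.interactions.create(
    agent="deep-research-pro-preview-12-2025",   # agent id approximate
    input="Give me a clear outline of the history of JAX.",
    background=True,                              # flag name approximate
)

# Poll the interaction every 10 seconds until it finishes (or fails).
while True:
    job = client.interactions.get(job.id)
    print("status:", job.status)
    if job.status in ("completed", "failed"):
        break
    time.sleep(10)

if job.status == "completed":
    print(job.outputs[-1].text)   # the markdown research report
```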

And it's done a pretty good job in here. Now, my biggest gripe about this, though, is that you'll see it gives us a bunch of citations, which is really good. And if we launch those citations in here, let's say I come in and look at one of them, I can actually get the citation to open. But these URLs, if I save them and try to use them in a different session or something, are just not going to work. So it's not actually giving you the raw URLs. You can see that what we end up getting is this Vertex AI search URL plus the source of the actual site, but not the raw URL itself. This, for me, is very frustrating. If I'm going to make a report for someone with citations or something like that, I want them to be able to click on the URLs from a PDF file and have it actually work. Whereas with what I get here, I'm not getting the actual URLs. Now, if you do want to actually take these URLs and convert them, you can do it with something like the requests library pretty simply to get the original URL back.
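
A small sketch of that trick with the requests library: follow the redirect and read the final URL it lands on.

```python
import requests

def resolve_citation(url: str) -> str:
    """Follow the redirect chain and return the final URL.

    Check the terms of service before doing this at scale; as noted in the
    video, it's unclear whether unwrapping these links is allowed.
    """
    resp = requests.get(url, allow_redirects=True, timeout=15)
    return resp.url

# Example usage with one of the returned citation links:
# print(resolve_citation(citation_url))
```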

I'm not sure if that's against the terms of service. I know one of the frustrating things with the Google Search tool is that they also don't want to give you the raw URLs; they want to give you redirection URLs. And I think with Google Search there are actually a number of rules about how you're supposed to display it in your app. For me, the frustrating thing is that if you wanted to export this out to a PDF file, you would like to have proper links that are clickable, and these links won't stay clickable for long. So while it generates them and displays them so you can see what the actual domain name is, having something like medium.com as a citation is not very good if I want to see a particular article or something.

So overall, the interactions API really unifies a number of different calls. It makes it much clearer to do certain tasks and gives you some really nice features, like the ability to have state server side. And this really sets the path for more agents, and perhaps more things that can run in sandboxes server side. You can imagine those being things like computer use agents, and a number of other things that we'll probably see coming over the next few months. So overall, this is a really nice addition to the Gemini API, and it basically just updates it so that Google can add a lot of new features as they roll out the Gemini 3 models.

Anyway, let me know in the comments what your take on this is. I'm definitely interested to hear how people are looking at using this, and any hot takes on how you think you can get better use out of having state persisted on the server side of things. Let me know what you think in the comments and I will talk to you in the next video. Bye for now.
