Gemini Embedding 2 - Audio, Text, Images, Docs, Videos
By Sam Witteveen
Summary
Topics Covered
- One Model Replaces Five Embeddings
- Combine Image and Text Embeddings
- Chunk Long Videos for Precise Search
- Aggregate Embeddings for Multimodal Posts
Full Transcript
Okay.
If someone asked you today to build a search system that could handle text, images, audio recordings, video clips, and even things like documents and PDFs, all in the same search, what would your pipeline kind of look like?
So up until recently when I worked on things like this, you would end up using multiple vector stores, you would end up having multiple embedding models, and the system would get pretty complicated, pretty fast.
Now jump forward to a model I covered almost two months ago which came from the Qwen team that basically allowed us to do embeddings with both text and images at the same time.
Now I do think the Qwen thing is really cool, but what I couldn't talk about at that time was that I was already testing another model, which just got released yesterday, which takes this whole multimodal embedding system even further than just text and images.
And this is the Gemini Embedding 2 model.
So this is the first natively multimodal embedding model from the Gemini team that can not only cover text and images, but can also take videos of up to two minutes without having to convert them to any other format.
It can take audio files without having to transcribe them or anything like that.
And it can even take files like PDFs and embed them natively in their format without needing to convert them to plain text or anything like that.
So whereas in the past you would need to spin up perhaps a text embedding model, perhaps something like a CLIP or SigLIP embedding model, and then use something like Whisper to transcribe your audio, this model basically replaces all of those sort of five problems, and perhaps five different models that you needed, and five indexes, and five different headaches, and they've collapsed all of that into a single API call.
So the idea with this model is that it can take text, images, video, audio and PDFs and put them into this same shared vector space, one model, one index, and one query to basically access an embedding that you can then use for a variety of different tasks.
Okay, so real quick for anyone who's new to this and not really sure how embeddings work: I covered that quite a bit in the video that I made about the Qwen models and how multimodal embeddings work.
But the simplest way to think about this is that what we're gonna do here is you can take any piece of content that could be a sentence or some text, an image, a chunk of audio, a video, or even a PDF file.
We're just gonna convert that into a list of numbers, specifically a vector that lives in a high dimensional space.
And the key property that those numbers have is that they've extracted out semantic information from that particular piece of content.
And you can think about that vector as basically like an address in n-dimensional space that allows you to sort of see, okay, things that are in a similar space tend to be semantically similar overall.
Meaning if I've got some text about a cat and I've got an image of a cat and I've got some speech talking about a cat, they're all gonna be in roughly the same location.
Now, when we think of space, we tend to think about it in three dimensions.
But for a model like this, the full representations are over 3,000 dimensions.
So that's what allows it to encode things so well, to be able to find, when you do similarity lookups, find content that relates to your specific query, et cetera.
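To make the "nearby means similar" idea concrete, here is a tiny illustration (not from the video) using cosine similarity on toy vectors; real embeddings from a model like this are 3,072-dimensional, but the math is identical:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: close to 1.0 means the vectors point the same way
    # (semantically close), near 0.0 means unrelated, regardless of length.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings" standing in for real 3072-dim vectors.
text_cat  = [0.9, 0.1, 0.0, 0.2]   # "a photo of a cat"
image_cat = [0.8, 0.2, 0.1, 0.3]   # an actual cat picture
text_car  = [0.0, 0.9, 0.8, 0.1]   # "a red sports car"

print(cosine_similarity(text_cat, image_cat))  # high, roughly 0.98
print(cosine_similarity(text_cat, text_car))   # low, roughly 0.10
```

A similarity lookup is then just "compute this score against every stored vector and return the highest scorers" (in practice an approximate nearest-neighbor index does this efficiently).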
Now, historically, to do this, every modality needed its own model to do this kind of conversion.
So you would need things like a text embedding model, an image embedding model, perhaps something like a CLIP or a SigLIP model.
And then usually for things like audio speech, people wouldn't even just encode it.
What they would actually do is transcribe it and then encode it as text.
Now what this did was it made sort of multimodal RAG and multimodal search both challenging to do, because not only would you require multiple models, but you would often have multiple indexes that you would be searching, and you would need a whole sort of re-ranking layer or fusion layer to work out what to actually bring back to the user.
It was usually very messy to do, very expensive to maintain.
Often very slow to run as well, and this is where the Gemini Embedding 2 model really changes that.
We've now got this one unified space that we can have both a written description of a product and a photo of that product.
They end up being close together in that same vector space.
This means that your users can write a text query and retrieve the most semantically similar results, whether they be text, images, video, audio, or PDFs, or they can throw in an image.
Or you can even just take a raw recording of the person saying what they want and then get an embedding that actually then finds the content that they want.
This whole sort of unified element is really a game changer here.
One of the other key parts of this is that you can not only embed each of the different modalities separately, you can pass in multiple modalities at once. For example, we can pass in an image and text in a single request to actually get back an embedding that represents the combination of those two.
So this allows you to build a whole bunch of different things.
For example, like if I've got a picture of a watch band I like and I describe the sort of watch, but I don't actually have a picture of that part.
I can actually create an embedding that represents the two of those things combined, and then use that to do lookups against pictures of watches or videos of watches, et cetera.
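As a rough sketch of what a combined image-plus-text request could look like with the Google GenAI SDK: the model id `gemini-embedding-2-preview` and the exact multimodal call shape are assumptions here, so check the official docs for the real names before using this.

```python
# Sketch only: assumes the google-genai SDK and an assumed preview model id.
try:
    from google import genai
    from google.genai import types
except ImportError:            # SDK not installed; the sketch still defines the helper
    genai = types = None

MODEL = "gemini-embedding-2-preview"  # hypothetical model id, check the docs

def embed_image_plus_text(client, image_bytes: bytes, text: str):
    """Embed an image and a text description together as ONE fused vector,
    e.g. a picture of a watch band plus a description of the watch face."""
    content = types.Content(parts=[
        types.Part.from_bytes(data=image_bytes, mime_type="image/jpeg"),
        types.Part.from_text(text=text),
    ])
    resp = client.models.embed_content(model=MODEL, contents=content)
    return resp.embeddings[0].values  # a single combined embedding
```

The returned vector can then be used for similarity lookups against your indexed pictures and videos exactly like a single-modality embedding.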
And in fact, if we come into the demo that they've got here, you can see that we can do searches by images.
So here I'm passing in the picture of the cats, and sure enough, we are getting pictures of cats that look like these black and white cats coming back here.
We're also getting videos that are coming back that have been embedded into a similar space.
So you can notice here that the model has not only worked out cats semantically, it's worked out this sort of black and white, and specifically this sort of black and white on the face of a cat and represented that.
So it's giving us back both images and videos that match that.
You can see we're doing the same with the soccer team here, where not only are we getting soccer content back, but it's picked up on the sort of yellow uniforms that they're actually wearing here.
Now, just as we could do this via sort of image search, we can also do it with speech search.
So if we listen to this audio.
Tiger.
You can see that, okay, that's very simple little sort of piece of speech.
But if we now do a search on the embedding made from that, we are getting back images of tigers, and we are even getting back videos where the tiger is sort of in the video. So you can see in this demo, this has got over a million images in it.
It's got over half a million videos in it.
It's able then to basically just take these audio clips, quickly do a lookup, and find the relevant images and the relevant videos in there.
Alright, so if we look at some of the details and limitations of this, when you are passing the different modalities in, obviously you are limited to what you can pass in.
So for text, that's up to 8,000 tokens.
Now, most of the time you're probably gonna be wanting to do some kind of chunking.
I'm not sure I'd really want to be putting sort of six to eight thousand words in to get a representation of the whole thing.
You're probably more likely to go for smaller chunks in that, but if you want to, you can go up to 8,000 tokens.
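A minimal sketch of the kind of chunking being described: this word-based splitter with overlap is my own illustration (real pipelines usually chunk by tokens or sentences), just to show the shape of the idea.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40):
    """Split text into word-based chunks with a little overlap, so each
    embedding covers a focused passage instead of one huge 8,000-token blob."""
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
        start += chunk_size - overlap  # slide forward, keeping some overlap
    return chunks

# A 500-word document in 200-word chunks with 40 words of overlap:
doc = ("word " * 500).strip()
pieces = chunk_text(doc, chunk_size=200, overlap=40)
print(len(pieces))  # 3 chunks
```

Each chunk would then be embedded separately, giving you finer-grained retrieval than one embedding over the whole document.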
You can also go up to six images at a time.
You can also go up to videos that are two minutes long.
Now, again, with the video example, I might wanna actually chunk it much smaller than two minutes.
If I've got a video that's six hours long and I want to be able to do search over that, I might even chunk that down to sort of 15-second or 30-second chunks, embed all of those, and then allow myself to do sort of text search over that video.
Perhaps if I wanna type in something like when does a woman in a red dress appear in the video?
Obviously, the smaller the chunks are, the more specific I will be able to be at returning the exact time that happened in the video.
But this is definitely a way now that you can do search over really long videos.
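The chunk-then-embed approach above maps neatly to a list of (start, end) windows; here is a small illustration (my own, not from the video) of computing those windows so a matching chunk can be traced back to a timestamp:

```python
def video_chunks(duration_s: float, chunk_s: float = 15.0):
    """Break a long video into (start, end) windows. Embed each window
    separately, then a text query that matches a window's embedding can be
    mapped straight back to a precise timestamp in the video."""
    chunks = []
    t = 0.0
    while t < duration_s:
        chunks.append((t, min(t + chunk_s, duration_s)))
        t += chunk_s
    return chunks

# A 6-hour lecture in 15-second windows:
windows = video_chunks(6 * 60 * 60, chunk_s=15)
print(len(windows))   # 1440 windows
print(windows[0])     # (0.0, 15.0)
print(windows[-1])    # (21585.0, 21600.0)
```

The trade-off is exactly as described: smaller windows mean more embeddings to store, but a more precise answer to "when does the woman in the red dress appear?"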
I can certainly see some really nice use cases where you could take, for example, some of the university courses that are maybe 25 hours, 30 hours, 50 hours long, and encode both the video, the audio, and any PDFs of slides that you had for each of those lessons, and then be able to ask it, hey, which lessons did they talk about specifically this and have a diagram about it?
Up until now, that really hasn't been something that you've been able to easily do for this kind of task, and I do think this is where all this gets really interesting is in the ability to come up with new ideas for new products and new ways of using this technology.
That just wasn't possible in the past.
Alright, so Google's published a bunch of benchmarks for this.
I'm not really gonna go through these.
Interestingly, this is already doing just text-to-text similarity better than the original gemini-embedding-001 model.
And it's outperforming the other sort of multimodal models that are out there for image to text and text to image.
But really where this shines is the fact that it can do all of the modalities together.
On top of this, the way this model's actually built is that it incorporates Matryoshka Representation Learning, which means that if you don't want to get the full-size embedding back, which is 3072 dimensions, you can get embeddings back that are either half that size or a quarter of that size.
And that can be useful where perhaps you don't need the fine grain semantics of knowing exactly what color the cats were.
You want to just know that, okay, that was a cat, that was not a cat there, but you want the performance boost of not having to store such large embeddings.
And then also just the speed of being able to look up the embeddings faster because they're actually shorter.
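With Matryoshka-style embeddings, the leading dimensions carry the coarsest semantics, so you can usually just keep a prefix of the vector and re-normalize it; this is a generic client-side sketch of that recipe (the API can also return smaller sizes directly, which is the normal route):

```python
import math

def truncate_embedding(vec, dims):
    """Matryoshka-style shrink: keep the first `dims` values, then
    re-normalize to unit length so cosine similarity still behaves."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

# Stand-in for a full 3072-dim vector; in practice you might go 3072 -> 1536.
full = [0.5, 0.5, 0.5, 0.5, 0.01, 0.02]
short = truncate_embedding(full, 4)
print(len(short))  # 4
print(round(sum(x * x for x in short), 6))  # 1.0, i.e. unit length again
```

Halving the dimensions halves your storage and roughly halves similarity-computation cost, at the price of the finer-grained distinctions (the "what color was the cat" level of detail).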
So on top of releasing this in the Gemini API for AI Studio and Vertex AI, they've also teamed up with many of the agentic frameworks like LangChain and LlamaIndex, and the vector store companies like ChromaDB and Qdrant, to actually get support for this on day zero as it's released.
So I think the best thing is let's jump into a Colab and have a play with this, and then you can get a sense of what this can actually do.
And I would love to hear from you in the comments of what are some of the ideas that you can see yourself using this now that you can index across all these modalities?
So let's jump into the notebook.
Alright, so if we come into the notebook here, I basically made a little notebook just to show you some of the key features of this.
So the model itself is still in preview, it's the Gemini Embedding 2 preview, and I've basically put together just how you would use it without any external agent frameworks or anything like that. So we need the Google GenAI SDK to do this, and you'll need your Gemini key to do this.
Alright, so we bring down some example content.
If we run that through and just see what that is.
We can see we've got the jetpack backpack, which is an image that's been used since Gemini 1 for a lot of the demos.
We've got a scones image in here.
We've got a cat image in here, and we've got an audio file in here.
It is so peaceful walking through the trees with the leaves crunching under foot.
Alright, so next up we've got some just helper functions that I've put together here.
So basically, to call the actual model, you're just gonna use client.models.embed_content in here. So I've made one for embedding text, embedding images, embedding audio, and then later on we'll look at some simple, straightforward ones as well.
Alright, so you can see that with these, it's pretty easy to embed something, and we'll get back these 3072-dimensional vectors here. That's gonna be the same size vector we get back whether we're doing images, text, or audio; in fact, all the modalities will produce this length.
If we wanna do something like text to image similarity here, I've basically got a bunch of text descriptions.
You can see that some of those will fit the images that we've got there.
We can then just embed those.
We can embed the actual images that we had as well, these three images, and then we can compute the similarity in here.
Now, this obviously normally would be done by a framework; if you're using something like LangChain and stuff like that, it can handle all this for you.
But just to show you that, sure enough, when we look at the jetpack picture, which is a sketch of a jetpack backpack, what comes back the highest? This is coming back the highest.
Now remember, this is not a percentage kind of thing, it's just think of it as a score.
That's a similarity score.
You can see here that for that one, we've got "a person flying with a jetpack". Well, the jetpack is there, but not the person really, and not really flying through the sky and stuff like that, but that still registers higher than, obviously, the cat image, for example. And we can see that "rocket launching into space", which kind of makes sense, registers higher than the cat image.
If we look at the cat image, by the way, we can see the cat image registers much higher for this "cute ginger cat sitting and looking at the camera" than any of the other things in here.
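The ranking step being demonstrated is just "score every caption against the image embedding and sort"; here is a self-contained illustration with made-up toy vectors (real ones would come from the model):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Toy stand-ins for real 3072-dim embeddings of three captions and one image.
captions = {
    "a sketch of a jetpack backpack": [0.9, 0.1, 0.1],
    "a person flying with a jetpack": [0.6, 0.5, 0.1],
    "a cute ginger cat":              [0.1, 0.1, 0.9],
}
jetpack_image = [0.85, 0.2, 0.1]

# Rank captions by similarity to the image, best first.
ranked = sorted(captions, key=lambda c: cosine(captions[c], jetpack_image), reverse=True)
print(ranked[0])   # the sketch caption wins
print(ranked[-1])  # the cat caption scores lowest
```

Because everything lives in one shared space, the same loop works unchanged whether the query side is an image, an audio clip, or text.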
Now, if we do the same for audio, all we need to do now is basically encode our audio file that we've got there.
We've already encoded our text, so now we can just compare that audio file to the text and sure enough, that file that was talking about trees and walking in nature and stuff like that is getting the highest score by far here.
Right, so this one is coming back as the closest text to that particular audio file. If we wanna do sort of a reverse search, where we pass in an image and get text back, we can also do that here.
So this is just putting in the various texts. For this image, the closest text match is "a sketch of a jetpack backpack", which makes sense in here. Now you could go through and try adding "a pen sketch on lined paper of a jetpack backpack", and that would probably score higher than this too.
So you can play around with sort of the detail of these.
You can see that for this one, sure enough, it's the "baked scones with jam and cream on a plate". And for our last one, "cute ginger cat" is scoring the highest there. So this is definitely working for doing the similarity matching and stuff like that; we've got something going well here.
The other thing too is we can just run a full cross-modality similarity matrix and basically see, okay, what is similar to what, right? So obviously the text "a sketch of a backpack" is gonna come back with a score of one for matching itself; that's what this one line is doing. But we can see that the second highest thing in there is actually this text matching this other text, higher than the actual image, in this case. So they're pretty close, right? We can see that there.
And so we can see that we could actually check the audio to the text, compare the audio to the images.
Obviously they're not anywhere near sort of as high as this audio to this text of "trees and nature sounds in a peaceful forest".
Okay, so in this one we're just looking at how you would embed a video.
So we just download a video file, and you can see here that this is still just using client.models.embed_content; the model we are using is the Gemini Embedding 2 preview model, and then the important part is we're gonna do it from bytes, and we need to pass in the actual MIME type so that it knows that it's a video file. Once it's got that, it can encode it fine.
Same sort of thing for embedding PDFs, right?
So you've got the option of where you could use the files API if you wanted to upload them and provide them that way.
Or you can do it from bytes here.
And again, this MIME type now is gonna be application/pdf. And both of these are gonna give us back the embedding, which is 3,072 long in there.
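Pulling the from-bytes pattern together, here is a hedged sketch of one helper covering video, PDF, audio, and images; the model id is an assumption, and the `Part.from_bytes`/`embed_content` shapes should be checked against the SDK docs.

```python
# Sketch only: assumes the google-genai SDK and an assumed preview model id.
import os

try:
    from google import genai
    from google.genai import types
except ImportError:            # SDK not installed; the sketch still defines the helper
    genai = types = None

MODEL = "gemini-embedding-2-preview"  # hypothetical model id, check the docs

# The MIME type tells the model what kind of bytes it is receiving.
MIME_TYPES = {
    ".mp4": "video/mp4",
    ".pdf": "application/pdf",
    ".mp3": "audio/mpeg",
    ".jpg": "image/jpeg",
}

def embed_file(client, path: str):
    """Embed a raw file (video, PDF, audio, image) directly from bytes,
    without transcription or conversion to text first."""
    ext = os.path.splitext(path)[1].lower()
    with open(path, "rb") as f:
        data = f.read()
    part = types.Part.from_bytes(data=data, mime_type=MIME_TYPES[ext])
    resp = client.models.embed_content(model=MODEL, contents=part)
    return resp.embeddings[0].values  # 3072 floats, same length for every modality

print(MIME_TYPES[".pdf"])
```

For larger files, the Files API upload route mentioned in the video is the alternative to passing raw bytes inline.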
Another thing that you wanna think about as you're using this is if you're passing in multiple pieces of content, do you want to have separate embeddings for all of them, or do you want to basically aggregate the embeddings?
So for example, let's say we've got a Twitter post or some kind of social media post where we've got a text component and an image component.
Now, if we wanna make one particular embedding just for the whole post, then what we would do is like this, where in the actual call, in the types.Content, we basically just pass in a list of parts, right? All of them are gonna be joined up, and we're gonna get this averaged-out embedding. So you'll see that even though we've got two parts of one content going in there, the text and an image, we're getting one embedding back here if we do it like this. So you can see here, the contents is just one piece of content with two parts to it.
Here, in the second one, we've actually got two pieces of content going in there.
When you pass it in like this, you will get multiple embeddings back.
So if we wanted to pass in six images and get six embeddings back for them, we would do it like this.
But if we wanted to aggregate an embedding over those images, then we would basically just pass those in as parts, with each part being a separate image in there.
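The distinction being described comes down to where the list sits in the request; this hedged sketch (assuming the google-genai SDK's `types.Content`/`embed_content` shapes) shows the two call patterns side by side:

```python
# Sketch only: the key difference is WHERE the list goes --
# one Content with many parts vs. a list of Contents.
try:
    from google.genai import types
except ImportError:            # SDK not installed; the sketch still defines the helpers
    types = None

def one_fused_embedding(client, model, part_list):
    """One Content, many parts -> ONE aggregated embedding for the whole post
    (e.g. a social media post's text plus its image)."""
    content = types.Content(parts=part_list)
    resp = client.models.embed_content(model=model, contents=content)
    return resp.embeddings         # expected length: 1

def separate_embeddings(client, model, part_list):
    """A list of Contents -> one embedding per item (e.g. six images in,
    six embeddings back)."""
    contents = [types.Content(parts=[p]) for p in part_list]
    resp = client.models.embed_content(model=model, contents=contents)
    return resp.embeddings         # expected length: len(part_list)
```

Which one you want depends on whether the items should be retrieved as a unit or individually.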
So they've got some notes in here about how to use that.
I really think you want to sort of experiment with that yourself if you are gonna be working out, okay, what is gonna be the best way to represent things in a RAG system. If you've got sort of like social media posts where, for all of them, you want to have one embedding, you might go for the aggregated way of doing it.
You can always do that with this too, where you basically take these out and you just average them out, right? So you just work out the average of the two embeddings, or the three, four, five embeddings that you've got there.
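The client-side averaging just mentioned is a few lines; this pure-Python sketch takes the element-wise mean and re-normalizes so the result still works with cosine similarity:

```python
import math

def average_embeddings(vectors):
    """Client-side aggregation: element-wise mean of several embeddings,
    re-normalized to unit length so cosine comparisons still behave."""
    dims = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean]

# Toy example: fuse a post's text embedding and image embedding into one.
text_vec  = [1.0, 0.0, 0.0]
image_vec = [0.0, 1.0, 0.0]
post_vec = average_embeddings([text_vec, image_vec])
print(post_vec)  # equal weight on both, unit length
```

Note this is a client-side approximation; passing multiple parts in one request lets the model do the fusing itself, which may capture interactions that a plain average cannot.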
Overall though, while this may not be as sexy as something like a Gemini 3.1 model coming out, if you're actually building AI apps, embedding stuff is one of the sort of core tools that you're using all the time.
So it's definitely worth checking this out.
And while generally I like to use open embeddings just for having the control, unfortunately there's nothing out there like this that has the quality of these embeddings over all of the modalities with one model.
Anyway, let me know in the comments what you think.
If you've got any really good ideas of how you plan to use this, I would definitely be interested to sort of see what sort of use cases people are most interested in using this for.
As always, if you found the video useful, please click like and subscribe, and I will talk to you in the next video.
Bye for now.