Lecture 4 – Multimodal Alignment (MIT How to AI Almost Anything, Spring 2025)

By Paul Liang

Full Transcript

Okay, very good. All right, welcome back, everyone. I hope you had a great week.

Today's lecture will be an introduction to multimodal AI. So, logistics first, for the project.

Thank you everyone for submitting your proposals. They were all a pleasure to read. I've tried to give a fair amount of feedback on the proposals. I've also assigned a primary TA, one of David and Chanyaka, to your projects, and I CC'd them in an email. The goal is to meet with myself and your primary TA in alternating weeks: with me every other week and with your primary TA every other week, so that on average, every week, someone takes a look at your progress.

All right. And our goal is really to take these projects and turn them into state-of-the-art, really good research-level work.

Reading assignments: the second reading assignment is due tomorrow, Wednesday, in preparation for this Thursday's reading discussion. Again, two papers, mostly around the theme of modern AI architectures. One of the papers is about scaling laws of these modern autoregressive models. A scaling law essentially describes how performance scales as you put in more data and increase the number of parameters in your model. And surprisingly, there's a pretty nice trend to these scaling laws, so you could essentially predict what the performance would be as you scale up your models. That's the first paper. The second paper also has a very interesting insight: perhaps you don't need all tokens to do autoregressive pre-training of your models. Some might be more important than others, and that has implications for how you could perhaps train your models more efficiently.

All right, so for the project, start meeting with us and make progress on the projects, and for the reading assignment, that'll be due tomorrow, with the discussion on

Thursday. Great, so what have we seen so far? We've covered the basic foundations of AI: an introduction to AI and AI research, different forms of data, the structure and information within them, ways to learn from data, and, at a very high level, a unifying paradigm of different model architectures, including sequential, convolutional, spatial, and models for sets and graphs. Now, naturally, a lot of problems in the real world are also going to involve many different data sources and many different modalities, which is an active area of research. So in the next three weeks, up until spring break, we'll be covering, broadly, a bird's-eye view of different paradigms in multimodal AI. Today will be more of an introduction, plus aligning different modalities. Next week will be about how to better fuse information between different modalities. And finally, in week seven, we'll look at this idea of transfer: you might not have a lot of data in the modality that you care about, but you could use some other, more common modality to supplement your information.

All right, so that's just an overview of today's lecture: an introduction to multimodal AI, some core principles and challenges, and then diving deeper into the first challenge, alignment.

So before I start, a little bit of behavioral history about multimodal AI.

Like many things in AI, it really started from people who were looking at psychology and human behavior. Before this psychologist, David McNeill, did his line of research, most people thought that we primarily communicate using language, and that anything else, such as our gestures and our voice, was just secondary, not very useful, and would not displace language. And David McNeill came up with this experiment. It's called the McGurk effect, and it really shook some of the predominant thinking at that time.

So I'll demonstrate this effect by playing two videos, and I want you to quickly tell me what the similarities and differences are between these two videos. So here's the first one.

Okay, first video, and here's the second one. Ba ba ba ba ba ba?

Okay, who can tell me what is interesting, either similarities or differences, between these two video clips? Mike?

Similarity in the visual information, but in sound the first one was "ba" and the second one was "fa", even though it was very well masked. So similar in visual information but different in sound.

Yeah.

Anybody else, a counterpoint? Yes.

Yeah, I'm not sure that the sounds are different. I think maybe just the visuals are different, since the lips look different.

And you have some thought as well?

So if I didn't have the information... that information, yes.

So, I thought you raised your hand, that's why I called on you. But anyway, I think

most of you got the key idea. When I give this talk, half the people are on their laptops, so they'd say, "Oh, have you even played the second clip yet?" Because if you're only on your laptop and just listening to it, you realize that the audio clips are exactly the same, right? If you start looking at the person's lips, you can see that the first person is doing "ba" with a B, and the second person is doing "fa" with an F. So our perception of the same audio clip can actually be very different based on how the person is saying it.

So that really kickstarted this first era of multimodal research back in the 1980s, which is the behavioral era: realizing that when we recognize speech from others and interact with others, we actually use a good mix of different communicative modalities. That was the behavioral era, and then naturally there was an ambition to build AI tools, computational tools, that could replicate this behavioral perception of different modalities in humans. After this computational era of using AI to replicate that perception came the interaction era, where multimodal HCI and multimodal interfaces became really popular, so you had AI and humans interacting through different modalities. And finally, nowadays, we have the deep learning era.

Ever since the 2010s, we've seen huge amounts of GPU compute, huge amounts of data, and large models that really revolutionized this space of multimodal AI. And of course, as a sub-era, there's the foundation model era that really started in the 2020s, where we've really seen the huge potential of scaling these large transformer models for single-modality and multimodal tasks.

So just historically, what tasks did people care about? I gave the example of audiovisual speech recognition, where your goal is to recognize what the person was saying from both audio and visual. That started from the behavioral era, and naturally audiovisual speech recognition was the first kind of task that people applied computational methods to.

Soon after came this early boom in internet-based data, so there was lots of interest in building systems that could digitize multimedia data on the internet. So, given content, can you retrieve videos? And soon after that, there was lots of interest in recognizing human affect and emotions from our multimodal behaviors.

Once we got into the interaction era, there were systems based on recognizing events in videos. There's work on recognizing the sentiment and emotion of multiple people interacting with each other.

And something that really kickstarted some of this multimodal deep learning work was image captioning.

So if I give you an image, can you come up with the right caption for the image?

That was kind of the birth of modern language and vision research. And of course, within the past couple of years, we've really seen a ton of applications in multimodal deep learning, from captioning images to captioning videos to writing descriptions of videos. And then people found that captioning was really cool, but it was really hard to evaluate, right? How do you evaluate whether a given caption for an image or video is correct? It's so subjective. So then people converted captioning into question answering: if I give you a question about a particular video, how do you train an AI system to give me the right answer? That was visual question answering, and again people extended that to video-based question answering. And then it went from static systems, just writing one caption or answering one question, to AI systems capable of holding multimodal dialogue.

So, given a particular image, referencing it and conversing in a dialogue with humans about those modalities. Then more large-scale YouTube data became a source of data to pre-train all these video models that could recognize, caption, and retrieve different types of videos.

Soon after, we saw lots of work in autonomous agents. These were agents that are able to navigate using vision and language. So you might give it some instruction like "go there and then turn left at the nearest gas station." So those are agents capable of navigating based on vision and language.

And of course self-driving cars also do a lot of multimodal fusion of different visual, language, and sensory information.

And nowadays in generative AI, we've seen a lot of applications that take in text to generate images and videos, right? These generative text-to-image and text-to-video models.

And finally, probably the biggest application today is these interactive agents: interactive agents on your computer and your robots that can take in lots of data and are grounded in either the digital or physical world to complete difficult tasks, and many more. So that's just a brief history of various tasks that multimodal AI has made an impact on.

So in the first part of today's lecture, we're going to just cover an introduction to multimodal AI, and then in the second half we'll dive deep into some of these core methods. A lot of this will be based on this survey paper that we have on foundations and trends in multimodal AI, and if you're interested in longer versions, we give tutorials at international conferences.

So firstly, what is a modality? A modality is some data source representing a way in which information is expressed or perceived in the world.

So like all forms of data that we've seen in AI, you think about some sensor that is collecting your data, such as cameras or microphones, and you think of it as a spectrum from more raw modalities, closer to a sensor, to more abstract modalities that are further away from a sensor because they involve more processing. So you can think of a raw speech signal from a microphone or a raw image taken from a camera as raw modalities. You can start extracting language from your speech signal. You can detect objects from images. You can extract sentiment from language, and you can categorize the objects. So all this becomes more abstract information that you could even represent as words, right? "Positive sentiment," or "this is a table, this is a chair."

And as we've seen, the raw data can be very different. Different data comes with different types of structure and different dimensions. But as you start processing these different data modalities using AI systems, like the model architectures that we've seen, you're basically learning abstractions and semantic information from different forms of data. So there's a possibility of bringing them closer together.

So multimodal problems are therefore those with different modalities. And when multiple modalities are in the picture, there are often several challenges. These are challenges of, firstly, heterogeneity: the fact that different modalities are very different, and processing them together will cause difficulties. And then there's the idea that there are some interconnections between your modalities. We break down this word "interconnections" into, first, connections, so the two modalities themselves having some connected, overlapping information, and second, this idea of interactions: that there is some fusion that goes on to create new information when they're brought together for some task.

So in more detail, this idea of heterogeneity should again be seen as a spectrum, from more homogeneous modalities, with more similar qualities and representations, to more heterogeneous modalities, with more diverse qualities and representations.

So you can think of images from two cameras as being more homogeneous. They're both visual modalities. They both have this spatial structure. They're both sensitive to the same type of noise, like camera blur. So they're more homogeneous, but again they're giving you different information, right? Different angles, different viewpoints of the same object.

And then you can think of something that's more different, for example, text from two different languages: English and French, English and Chinese.

Different language families would have different structures, different grammars, and different vocabularies, but they're still kind of similar, because you can train a model to translate from one language to another, sometimes with perfect information and sometimes losing a little bit of information. Then you can think of language and vision, images and text. They're much more different. You start to see their differences. And you can even go to something like language and sensors, which is even more different, because at least in vision and language you can think about language to describe the image, but is there even a language that can successfully describe sensor data? So these are more different modalities. Yeah?

What do you mean by qualities here?

So, everything in the modality profile that we've seen: the basic elements themselves, the distribution of elements, whether the data is discrete or continuous, what kind of information they contain. I just highlighted structure and representations because those are perhaps slightly more important qualities to consider.

So it's not about the image quality or the language quality, it's just about the modality?

Yes, not actual quality, but just the profile, the information, the properties of the modalities.

Okay.

And abstract modalities are more likely to be homogeneous. We've seen that raw modalities are those that are collected closer to the sensor, so raw image and raw sensor data. Abstract modalities are the ones that undergo more processing.

So, extracting whether I'm happy or sad from my speech signal, and whether I'm happy or sad from maybe my physiological sensor. So abstraction allows you to essentially bring heterogeneous modalities closer to each other.

All right. So that's the first principle: modalities are heterogeneous and therefore different. The second key principle in multimodal learning is the fact that if we just look at these modalities, there's often going to be some shared information that connects them together. And again, you want to think of this as a spectrum of shared and unique information: cases where there's more overlap, less overlap, and maybe settings where modalities are completely independent from each other, with no overlap. So given any pair that co-occurs in the world, for example image and text, you can see that there are some connections. They're both describing a teacup and a sofa. But there's a lot of information in the image that is not captured in the caption, for example the color of the sofa, the size of the table, even the presence of the table. None of these are really captured in the language.

So modalities in general have some connections, but they don't perfectly overlap in information.

And finally, interactions. Interactions describe this phenomenon where, when you bring your modalities together and try to combine them, some new information will typically emerge for your task. Given these modalities and your task, one common type of interaction that people care about is this idea of redundancy, where both modalities give you some common information and your goal is to really exploit this common information. So if I'm saying something positive like "this movie is great" and I'm smiling at the same time, then this is the idea of redundancy. You're reinforcing the idea that the person really liked the movie.

That's in opposition to uniqueness, where there's information only in one, not in the other. For example, you're saying something slightly positive like "the movie does a good job developing the characters" with a neutral facial expression, and only one of these has the unique information. So the challenge here is to identify which one has the right information and which others you should maybe throw away.

And finally, synergy. This is the idea that some information only emerges when you fuse the modalities together.

They might be saying something like "wow," which seems positive, but with some anger or frustration on their face, and then you can infer that this person might actually be sarcastic about this topic.

So you have to look at both, and you have to compute that there are some differences between the modalities, and that indicates sarcasm.
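One way to make redundancy, uniqueness, and synergy concrete is a toy example with two binary "modalities" and a binary label. This is only an illustrative sketch: using mutual information to quantify the three cases, and the names `mutual_information` and `interaction_profile`, are assumptions made here, not something stated in the lecture.

```python
import math
from collections import Counter
from itertools import product


def mutual_information(pairs):
    """Mutual information I(X; Y) in bits, treating each (x, y) pair as equally likely."""
    n = len(pairs)
    joint = Counter(pairs)
    p_x = Counter(x for x, _ in pairs)
    p_y = Counter(y for _, y in pairs)
    return sum(
        (c / n) * math.log2((c / n) / ((p_x[x] / n) * (p_y[y] / n)))
        for (x, y), c in joint.items()
    )


def interaction_profile(rows):
    """rows: list of (modality_1, modality_2, label) samples.

    Returns (I(m1; y), I(m2; y), I((m1, m2); y)) in bits.
    """
    i1 = mutual_information([(m1, y) for m1, _, y in rows])
    i2 = mutual_information([(m2, y) for _, m2, y in rows])
    i12 = mutual_information([((m1, m2), y) for m1, m2, y in rows])
    return i1, i2, i12


bits = list(product([0, 1], repeat=2))

# Redundancy: both modalities carry the same signal, which is also the label.
redundant = [(a, a, a) for a, _ in bits]
# Uniqueness: only modality 1 determines the label; modality 2 is unrelated.
unique = [(a, b, a) for a, b in bits]
# Synergy: the label is the XOR of the two, so neither modality helps alone.
synergy = [(a, b, a ^ b) for a, b in bits]

for name, rows in [("redundancy", redundant), ("uniqueness", unique), ("synergy", synergy)]:
    i1, i2, i12 = interaction_profile(rows)
    print(f"{name}: I(m1;y)={i1:.1f}  I(m2;y)={i2:.1f}  I(m1,m2;y)={i12:.1f}")
```

In the synergy case each modality alone tells you nothing about the label, but the pair determines it completely, which mirrors the sarcasm example: the words and the face only become informative once they are compared.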

So that's the key idea of interactions. At a very high level, multimodal problems are those that involve heterogeneous, connected, and interacting data.

Heterogeneous because data sources are different from each other and there's difficulty in processing different sources. Connected because these modalities overlap in some information. And interacting because different parts of the information are going to combine when you try to fuse them together for some task.

All right. So that's a brief explanation of what multimodal problems are.

Next we'll cover: why is it hard? What are some unique challenges in multimodal AI that are typically not seen in single-modality problems like computer vision and NLP separately, and what are some of the state-of-the-art methods to deal with these challenges?

So all of these multimodal models look something like this. There are multiple modalities, and I'm using different colors and shapes to represent them, and I'm also breaking down these modalities into elements, right? These might be words in a sentence that you're saying, they might be image regions in some big image that you're trying to process, or they might be time steps in long-range sensor data that you're looking at. So just like how all of our discussion about machine learning is about breaking things down into individual parts, learning representations, and then aggregating information, we're going to use a similar schematic over here.

And typically you're designing some AI system that takes as input multiple of these modalities and either learns a representation or predicts a label as the output.

Right? So essentially we're looking at what really goes in here and what the challenges are in designing the components within this block.

So at a high level, there are going to be six challenges that make multimodal AI difficult, unique, and interesting to study.

The first challenge is this idea of representation, right? So you have data from different modalities, and there's some heterogeneity between them. How do you learn representations that can essentially bring them together and combine them in the right way? Representation is often a core building block in almost all problems, right? When you start with data that's raw, that's just been collected, the first thing you always want to think about is how to represent your modalities.

So for simplicity, again, this is the schematic. You might have data coming in with different elements, for example multiple words that you're saying, or multiple time steps in sensor data.

We're just going to look at it more simply, as just one element in each.

We're just going to care about representing one element in each and ignore the rest. We'll look at the rest in the subsequent challenges.

Okay. So there are three key ways of doing representation, each achieving different purposes.

One way of doing representation is what we call fusion, right? So you take in two elements. You might already have some representations for them separately, but how do you bring them together into one common representation that best uses the information in some appropriate way? This fusion might exploit the overlap in the information.

It might be one that picks the right modality depending on the information.

Or it might combine them in more sophisticated ways to capture some synergy between them. So in general, one class of methods is called fusion.
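To make the fusion idea concrete, here is a minimal sketch of a few common ways to merge two per-element feature vectors into one. The specific operators and function names below (`concat_fusion`, `additive_fusion`, `gated_fusion`, `multiplicative_fusion`) are illustrative assumptions, not methods named in the lecture; real systems would typically learn the weights and gate rather than fix them by hand.

```python
def concat_fusion(x1, x2):
    """Keep everything from both modalities side by side."""
    return list(x1) + list(x2)


def additive_fusion(x1, x2, w1=0.5, w2=0.5):
    """Weighted sum; averaging like this leans on overlapping (redundant) information."""
    return [w1 * a + w2 * b for a, b in zip(x1, x2)]


def gated_fusion(x1, x2, gate):
    """gate in [0, 1]: 1.0 trusts modality 1 only, 0.0 trusts modality 2 only."""
    return [gate * a + (1.0 - gate) * b for a, b in zip(x1, x2)]


def multiplicative_fusion(x1, x2):
    """Elementwise product; the output depends on both inputs jointly."""
    return [a * b for a, b in zip(x1, x2)]


# Hypothetical 2-dimensional features for one text element and one image element.
text_features = [1.0, 2.0]
image_features = [3.0, 4.0]

print(concat_fusion(text_features, image_features))
print(additive_fusion(text_features, image_features))
print(gated_fusion(text_features, image_features, gate=1.0))
print(multiplicative_fusion(text_features, image_features))
```

Roughly, the additive form corresponds to exploiting overlap, the gate to picking the right modality, and the multiplicative form to a simple joint combination; which one fits depends on the task.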

A second class of methods is what we call coordination. These methods take in your two elements and learn representations separately for them.

Right? You have one for the first one and one for the second one. But they're not independent, because they're going to align these two representations using some similarity function, enforcing that there is some structure between the two representations that you learn. There's a lot of research that goes into how, for example, you can define this alignment between them. This is more useful when you might want to do retrieval. For example, you define this function as cosine or dot-product similarity.

Then you can take something, learn a representation, find the nearest one in the other modality, and therefore retrieve the closest in the other modality. So, going from an image to retrieve the nearest caption, or vice versa. Okay. So that's coordination.

Fission is this final example, where you start maybe with two elements and now you're trying to learn a partitioned space that has more representations than the number of modalities that you started with. This can be very useful if you want to learn more disentangled and more interpretable features, where one of them captures the overlap between the two, one captures what's unique in the first, and one captures what's unique in the second. Right? So now you have three representations for two modalities, and you could go even further. Right? So, three general ways of learning representations, divided by whether you're reducing and combining representations, keeping them separate, or actually learning more representations that capture different information.

All right. So that's the first challenge of representation.
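The retrieval use of coordinated representations described above can be sketched in a few lines. This assumes the two encoders have already been trained so that matching image and caption embeddings point in similar directions; the embedding values, the captions, and the helper `retrieve_nearest` are made up for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve_nearest(query, candidates):
    """Return the key of the candidate embedding most similar to the query."""
    return max(candidates, key=lambda name: cosine_similarity(query, candidates[name]))


# Hypothetical embeddings from separate (already coordinated) image and text encoders.
image_embedding = [0.9, 0.1, 0.2]
caption_embeddings = {
    "a teacup on a sofa": [0.8, 0.2, 0.1],
    "a hockey player on ice": [0.1, 0.9, 0.3],
}

best_caption = retrieve_nearest(image_embedding, caption_embeddings)
print(best_caption)
```

The same function works in the other direction (caption to image) by swapping which modality supplies the query and which supplies the candidates.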

The second challenge is alignment. As we've seen, in representation we just focus on one element in one modality and one element in the other. Of course, in most problems you have elements in some sequence, following some structure, right? You have words that you're saying across time, you have image regions in some spatial image, you have sensor data also across time. So alignment essentially studies this problem of how you align your modalities at both the element level and the global level.

So one example: if I'm saying something and trying to describe an image, which word that I'm saying, for example "person," actually matches the image region showing the person? Right, I'm saying "hockey stick": which part actually relates to the area of the image that shows the hockey stick? So it's about learning this mapping between your modalities that captures the parts that have similar semantic information.

So there are several sub-challenges here. One sub-challenge is the easiest case, where your modalities are all subdivided into discrete elements. For example, words are discrete, image bounding boxes are discrete, and the goal is to essentially learn this matching: which word corresponds to which region in the image, which other word represents some other region in the image. So you're solving this matching problem to find the alignment between your two modalities.

So that's called learning discrete connections.
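As a small illustration of the discrete matching problem, suppose we already have a similarity score between every word and every image region. The sketch below assumes a one-to-one matching and finds the best one by brute force over permutations, which is only practical for a handful of elements; the scores, region names, and the helper `best_matching` are hypothetical. For larger problems an assignment solver such as SciPy's `linear_sum_assignment` would be the usual choice.

```python
from itertools import permutations


def best_matching(similarity):
    """similarity[i][j] is the score between word i and region j (square matrix).

    Returns a tuple `perm` where perm[i] is the region index assigned to word i,
    chosen to maximize the total similarity of a one-to-one matching.
    """
    n = len(similarity)
    return max(
        permutations(range(n)),
        key=lambda perm: sum(similarity[i][perm[i]] for i in range(n)),
    )


words = ["person", "hockey stick", "ice"]
regions = ["region_A", "region_B", "region_C"]

# Made-up word-to-region similarity scores (rows: words, columns: regions).
similarity = [
    [0.2, 0.9, 0.1],
    [0.8, 0.3, 0.2],
    [0.1, 0.2, 0.7],
]

matching = best_matching(similarity)
alignment = {words[i]: regions[j] for i, j in enumerate(matching)}
print(alignment)
```

In practice the similarity scores themselves are what a model has to learn; the matching step shown here is the easy part once they exist.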

A second sub-challenge is a more difficult version, where you can no longer segment your data cleanly along discrete boundaries and your data is more continuous. So let's say you have some high-frequency sensor data, medical data and so on, that is continuous sensor data. At the same time, I want to align it with something a doctor is writing.

This part indicates that your heart is palpitating. This part indicates that a person fell down. So how do you do this segmentation while also aligning it to the other modalities? You have to add this step of segmentation and discretization of continuous data.

And while the first two aim to just achieve the alignment, where the goal is just to find the matching between the modalities, this third sub-challenge is to use alignment to learn a better representation. Right? So we call that contextualized representations.

It's doing better representation learning by contextualizing how each element matches with other parts. A lot of work nowadays in multimodal transformers essentially falls into this category, because I'm trying to learn a better representation of, say, some language by looking at other words in its context and perhaps also looking at parts of the image that are given as context, right? All of that is to learn a better representation of words by taking into account how they're aligned with other words and maybe other images or videos. So sometimes we also call the first two explicit alignment, because the alignment is explicit and the alignment is what you're trying to do as the end goal. We sometimes call this one implicit alignment, because alignment is just some intermediate step to learn a better representation.
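The implicit-alignment idea can be sketched with scaled dot-product attention, the standard building block of transformers: a word's vector is updated as a weighted mix of context vectors (other words and image regions), and the attention weights act as a soft alignment. This is a minimal sketch with hand-picked vectors and no learned projections; the names `softmax` and `attend` and all numbers are assumptions for illustration, not a specific model from the lecture.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector.

    Returns (weights, context): the soft alignment over the context elements
    and the resulting contextualized representation.
    """
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    context = [
        sum(w * value[i] for w, value in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return weights, context


# Hypothetical 2-d vectors: one word as the query, and a mixed context made of
# one image region that matches it, one unrelated region, and one other word.
word_query = [1.0, 0.0]
context_keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
context_values = context_keys  # values equal keys here purely for simplicity

weights, contextualized = attend(word_query, context_keys, context_values)
print(weights)
print(contextualized)
```

Nothing here is trained to output an alignment; the weights simply fall out as an intermediate quantity on the way to a better representation, which is the sense in which the alignment is implicit.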

Challenge three is reasoning. So we've seen in representation, how do you represent one element in one modality and one element in the other? In alignment, we've looked at how you could learn the connections between multiple elements in each. Reasoning essentially aims to take all this information and combine it through multiple inferential steps to make some prediction.

And when you think about combining information through multiple steps, at some point people were doing it using multiple layers of neural networks that each learn more and more abstract information that's closer and closer to what you want to predict. But nowadays, as all of you have seen on social media, there's a lot of interest in doing this reasoning more explicitly, right?

Am I combining my information in a tree structure, in a graph structure? Do I use attention maps to say, I first look at this part of the image, where the person is behind the other one, and then I look at this part of the image to show that the person fell down, and so on? Or nowadays you can even use words as an intermediate medium for reasoning.

We have these LLMs that can not just make predictions but can really explain step by step: I first do this, then do this, then look at the image, then combine it with this, and therefore this is what disease the patient might have.

Words can also be used as an intermediate step to reason.

So a lot of work in reasoning is about first defining what the structure is. Is it sequential, one step to the next, and how many steps? Is it some tree or graph? How do you parameterize the intermediate steps? Is it attention maps, which are more visual, or words, which might be more language-based? And of course you cannot do reasoning without some external knowledge. You need to know about the problem and know what a good reasoning sequence is before you can do it for AI systems. Yeah?

Do we know if reasoning actually improves performance with multimodal inputs? Chain of thought is typically used in just the language domain, but I've seen a lot of examples where if you use chain-of-thought reasoning with an image, it actually performs worse.

We're going to have a whole lecture on reasoning, on state-of-the-art reasoning, so only things that happened in the past year. My students are helping me prepare that lecture and they'll be helping to give it. The answer is, reasoning in language is easier.

Reasoning with images might be harder.

For example, visual chain of thought is harder. But if done right, there is lots of potential. Right? So we have some research going on where, let's say you want to solve a geometry problem, with some starting figure and some question. It'd be very intuitive if a system could do the mathematical proof step by step while also drawing and annotating step by step: looking at this angle, I'm drawing this perpendicular line, this makes it a right-angled triangle. All these are visual steps, and if done well, we can improve performance and, more importantly, make it more understandable for humans, if you want to actually give it to a student as a tutoring system. We need this intermediate language, visual, and probably more mediums of reasoning. But we'll look at all this. Sorry, I wasn't clear.

This is an overview. We're going to look at the specific models for each challenge in the subsequent lectures.

Yeah. So that question there is one paper which patterns on image classification and they have shown that it actually

Great.

All right. Challenge four is generation. So far you've seen mostly predictive systems that align and fuse and make some prediction, and of course there's lots of interest in building generative models that are able to operate across different modalities. So we look at summarization, where you take more data and summarize it into the most important features. There's translation, where you take one modality and try to map it exactly to the other: text-to-image generation, for example, or text-to-video. And creation is kind of a holy grail, but no one's really been able to do it successfully. That's when you start with less data, perhaps a first frame, and you try to generate multiple frames to make an image into a video, or even make music soundtracks where video, audio, and music are all synchronized and generated at the same time.

So that's generation, and there's also a lot of work on multimodal generative models. Transference is the fifth challenge, where you might be trying to make a prediction in some modality that you care about, say medical images, but you don't have that much data, you don't have that many labels, and even if you do they might be noisy. There are many reasons why prediction in just the one modality that you care about might be unsuccessful, and the goal is to use some extra modality to support your learning and to learn a better representation for the primary one that you actually care about. So nowadays, for example, you could use LLMs to supplement image classification models. Back then, you could pre-train on ImageNet as a second modality to support medical image classification. There are many examples of using one modality as extra information to help the primary modality that you care about.

So there are several ways of doing that. One is to pre-train and then initialize. We'll cover lots of examples of this. There's this set of methods called co-learning, where you introduce some extra information either as an input during training or as a prediction target during training. If you introduce it as an input during training, you would just zero it out during testing, so during testing you only have the modality that you care about running inference on. And if you use it as a prediction target during training, you don't need it during testing, so it's also just operating on the one that you care about during inference. So these are ways of supplementing more data as input, or more training objectives, during training to help you get to a better model.
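The "extra input during training, zeroed out during testing" variant of co-learning can be sketched in a few lines. This is only an illustration under simple assumptions: features are plain Python lists, and the function name `build_input` and the example values are hypothetical, not from the lecture.

```python
def build_input(primary, extra, training):
    """Concatenate primary-modality features with an extra modality.

    During training the extra modality is used as-is. At test time it is
    replaced by zeros, so inference depends only on the primary modality.
    """
    if not training:
        extra = [0.0] * len(extra)  # zero out the extra modality
    return list(primary) + list(extra)


# Hypothetical feature values, for illustration only.
train_x = build_input([0.2, 0.7], [1.5, -0.3, 0.9], training=True)
test_x = build_input([0.2, 0.7], [1.5, -0.3, 0.9], training=False)
```

The input keeps the same shape in both modes, which is what lets a model trained with the extra modality run unchanged when that modality is missing.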

And finally, model induction is a class of methods where you keep the classifiers separate but again encourage them to share some information, so that strong classifiers can help weaker classifiers.

Okay. And the final challenge is quantification. So challenges one to five are more about building and designing new models. Challenge six is more about better understanding this entire space, from your data to the models that you build to the training objectives and evaluation. I say a lot of words, but measuring how different modalities are, formally, is still a challenge, and understanding how that influences how the models are optimized and trained is still an open question. There are lots of heuristics and no deep understanding. We have intuitions about when modalities are redundant or give unique information, but again none of these are formalized into actual estimators, and it is not clear how this influences the learning process. So quantification contains a lot of challenges around just better understanding things: understanding when they work and when they don't work.

All right. So, the summary of these six challenges. Representation and alignment are probably the most important. Given any problem, you're going to break it down into different elements, break these large data down into individual elements, and you have to first decide how to represent your data. Are we looking at fusion problems, where the goal is to combine? Are you looking at retrieval problems, where we might want to keep things separate and align them? Are you looking at maybe vision problems? Alignment is this problem where you're trying to match different parts of your modality on one side to different parts of your modality on the other side. Often you do representation and then you have to think about alignment.

And once you've done that, it'd be great if we always did reasoning, so that you could break down your problem step by step, explain step by step what's going on in language and vision and more modalities, and use that to predict your label. When you don't care about predicting a label, sometimes you care about generating more data, and sometimes you care about transferring information from one to the other.

And of course, although it'd be great if we did reasoning all the time, sometimes people skip over the reasoning step, and most things look like this black box over here.

And finally, quantification is kind of this magnifying glass that revisits these previous challenges and tries to really understand when things work and when they don't in some principled manner.

All right. So this is a quick intro to multimodal challenges. Any questions so far?

Great.

So in the last 15 minutes or so I'll try to cover some parts of alignment. Next week we're going to cover everything about fusion, but alignment is kind of shorter, so I'll just cover it in maybe 15 minutes.

So as you recall, this alignment challenge is broken down into three. It's all about looking at elements in one modality and elements in the other and trying to find out where the connections are, where the matchings are. It is easiest when your data can be broken down into discrete elements like words and maybe bounding boxes. It's going to be harder when your data is continuous, so you've got to solve this question of taking continuous data and breaking it down along semantic boundaries before doing alignment. We'll cover later in the class this idea of learning implicit alignment, so using alignment to learn better representations, like in these large transformer models.

So again, in discrete alignment, one example is that you might try to tie language, which contains words and phrases, to non-linguistic elements that are more visual. So for example, "a woman reading a newspaper." You might have data that's only given at the high level, so entire sentences with entire images. And maybe your goal is to first align the entire sentence with the entire image at a global level, but also at a more fine-grained level: where does "the woman" correspond to? Where does "the newspaper" correspond to? And so on.

And a common class of ways of doing this is contrastive learning. In most of these settings your supervision will come from some paired data. So this is maybe one image with a corresponding caption, another image with another caption, a third image with a third caption. You can get this from Instagram, Wikipedia, Flickr. These are all common datasets with paired data between image and text. Of course, if you want to do it for some other application that's not image and text, you can think about how to get this paired data.

And oftentimes these representations are learned by putting the two elements through separate encoders, each getting some representation, and you're trying to enforce some similarity function between the two of them. This similarity function is what captures the fact that these two representations are aligned, while these other two representations should not be aligned; they should be further away. Okay. So that's some similarity function that you can define. All right.

So in contrastive learning, what you would do is try to make this similarity function really high, or bring things really close together, for these positive pairs, so images and the actual captions they correspond to. Everything in red is what we call negative pairs. So it might be an image and an incorrect caption, and there might be multiple incorrect captions for that image, and you are trying to make the similarity function smaller, so not as similar. And you can even prove that these kinds of methods provably learn the common information between two modalities. So if you formalize common information by the mutual information between these two random variables, then this class of contrastive methods essentially keeps everything in the middle and discards everything that is unique to one and not present in the other.

All right. So that's a high-level view of what people do in learning these connections. A bit more formally, these encoders can be designed to be very specialized. So we saw this roadmap of different model architectures: if you have spatial data you might use spatial methods like CNNs or vision transformers; if you have sensor data you might use other encoders. So the encoders can be very different. They'll capture what's unique about each modality, and this alignment similarity function would essentially capture the connections: which representations should be connected to each other and which ones should not. So there are several choices here. What is f_A? What is f_B? What are your specialized encoders? And also, what is this g function, this similarity function?

And once you design those, so f_A on your first modality and f_B on your second, you then compute this similarity function g between those representations, and your loss would essentially be maximizing the similarity, or maybe minimizing the distance, between those representations. The parameters that you can update are any parameters within f_A and f_B, your encoding models, and perhaps any parameters defining the similarity function.

So there are several choices of such a similarity function. Cosine similarity is a common one. So you learned z_A for the first and z_B for the second. These are vectors, and cosine similarity essentially asks whether these vectors are pointing in the same direction in the embedding space. It's very simple to compute: essentially a dot product normalized by the norms of those vectors.
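As a concrete reference for "a dot product normalized by the norms," here is a minimal sketch in plain Python. The function name `cosine_similarity` and the use of lists for z_A and z_B are choices made for this illustration.

```python
import math


def cosine_similarity(z_a, z_b):
    """Dot product of two embedding vectors divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(z_a, z_b))
    norm_a = math.sqrt(sum(a * a for a in z_a))
    norm_b = math.sqrt(sum(b * b for b in z_b))
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])   # vectors point the same way
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])       # vectors are perpendicular
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])        # vectors point opposite ways
```

Because of the normalization, only the direction of each vector matters, not its length, so the score always lies between -1 and 1.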

People have also designed kernel similarity functions. So you can think of, for example, the dot product as basically just measuring whether these two vectors are nearby in the original space, or what we call a linear space. There are also ways of essentially projecting those vectors into some other, more expressive space and measuring how similar those embeddings are in that more expressive space. They call those kernel similarity functions. Again, it's not super important that you understand the details; just understand that you can measure the similarity in the original space of the embeddings, and you can also measure the similarity in some transformed, more expressive space of the embeddings.
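To make the contrast concrete, the sketch below puts a plain dot product next to one kernel similarity. The lecture does not name a specific kernel; the RBF (Gaussian) kernel and the `gamma` value are assumptions chosen here because the RBF kernel is a standard example of measuring similarity in an implicit, more expressive feature space.

```python
import math


def linear_similarity(z_a, z_b):
    """Similarity in the original (linear) embedding space: a plain dot product."""
    return sum(a * b for a, b in zip(z_a, z_b))


def rbf_kernel_similarity(z_a, z_b, gamma=0.5):
    """RBF kernel: exp(-gamma * squared distance).

    Equivalent to an inner product in an implicit higher-dimensional space.
    gamma is an illustrative hyperparameter, not a value from the lecture.
    """
    squared_distance = sum((a - b) ** 2 for a, b in zip(z_a, z_b))
    return math.exp(-gamma * squared_distance)


identical = rbf_kernel_similarity([0.3, 0.4], [0.3, 0.4])
far_apart = rbf_kernel_similarity([0.0, 0.0], [1.0, 1.0])
linear = linear_similarity([1.0, 2.0], [3.0, 4.0])
```

Either function could play the role of g in the setup above; the encoders stay the same and only the notion of "close" changes.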

Something that has also become quite popular is correlation-based methods. So maybe you don't care about just two embeddings being very close to each other, but given a batch of embeddings, you want them to be correlated with each other. So with 32 images and maybe 32 captions, you want them to be correlated, perhaps across some axis. Things that are outdoors should all be on this side; things that are more indoors maybe on this other side. So these are correlation-based methods.
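A very simplified way to picture a correlation-based objective is to take one coordinate of the image embeddings and one coordinate of the caption embeddings across a batch and measure their Pearson correlation. This is only a sketch of the idea: real methods in this family (canonical correlation analysis, for example) work over full embedding vectors, and the batch values below are made up for illustration.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Hypothetical "outdoor-ness" coordinate for a batch of 4 images and their captions.
image_axis = [1.0, 2.0, 3.0, 4.0]
caption_axis = [2.0, 4.0, 6.0, 8.0]
batch_correlation = pearson_correlation(image_axis, caption_axis)
```

A training objective in this style would push `batch_correlation` toward 1, without requiring each image embedding to sit exactly on top of its caption embedding.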

And again, if you don't care about just one element in one modality matching one in the other, you can define these similarities based on pairwise relationships. So for example, given a pair of images that are very similar, I want the pair of captions that describe those images to also be very similar. So these are methods that go beyond defining similarity at individual data points, to defining it across multiple data points.

All right. So visually it looks like this.

You might have paired data, images on one side and descriptions on the other side. You would define positive pairs as those that actually correspond to each other, for example blue car, yellow bus. Negative pairs are those that are incorrect captions for your images. And you're essentially training this function to maximize the similarity for the positive pairs and minimize the similarity for your negative pairs. Okay? And your similarity function can be something like cosine similarity.

And although this is a very simple idea, it's been around for a long time and it gives you some really cool results. So now that you've learned this kind of common embedding space that brings your blue cars next to the text for them and your red cars next to the text for them, you can do this kind of manipulation. You can take a blue car, embed it into your embedding space, subtract the word embedding for "blue," and add the word embedding for "red." That gives you another embedding, and you try to retrieve the nearest image to that, and you actually get red cars. Here is another cool example. You can take this cat in a bowl, embed it into your space, subtract the embedding for "bowl," and add the embedding for "box." These are word embeddings. That gives you another embedding, and you find the nearest image, and you actually see cats in a box.
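The "image minus word plus word" manipulation can be sketched with toy vectors. Everything below is hypothetical: the hand-written 4-dimensional embeddings (axes: car, bus, blue, red), the file names, and the function `edit_and_retrieve` stand in for a learned joint embedding space and an image gallery.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


# Toy joint embedding space with axes [car, bus, blue, red] (illustrative only).
image_embs = {
    "blue_car.jpg": [1.0, 0.0, 1.0, 0.0],
    "red_car.jpg": [1.0, 0.0, 0.0, 1.0],
    "red_bus.jpg": [0.0, 1.0, 0.0, 1.0],
}
word_embs = {"blue": [0.0, 0.0, 1.0, 0.0], "red": [0.0, 0.0, 0.0, 1.0]}


def edit_and_retrieve(image_name, remove_word, add_word):
    """Compute image - remove_word + add_word, then return the nearest gallery image."""
    query = [
        i - r + a
        for i, r, a in zip(image_embs[image_name], word_embs[remove_word], word_embs[add_word])
    ]
    return max(image_embs, key=lambda name: cosine(query, image_embs[name]))


result = edit_and_retrieve("blue_car.jpg", "blue", "red")
```

This only works because images and words live in the same space, which is exactly what the alignment objective is meant to produce.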

And this is 2014 stuff, right? Images might be lower resolution, the text is just two words, looking at bigrams. And of course nowadays people have really scaled these systems up. You have all probably seen CLIP, which is essentially an example of scaling up these alignment-based methods. You have millions of images and captions describing these images, and you can again compute this similarity matrix, where entries on the diagonal represent an image paired with the right caption. Those are similarities that should be maximized, bringing those together. Everything that is off-diagonal represents images with wrong captions, so those similarities should be minimized. And all these entries are just cosine-similarity dot products: the image embedding dot product with a text embedding.

And that's basically the contrastive loss that you often see nowadays. You sum over your data points, take the average, and take the log to make things more numerically stable. You're essentially maximizing the similarity of your positive pairs and minimizing the similarity of your negative pairs. Sometimes people add the positive pair here as well just for computational simplicity, but it's essentially going to be dominated by all the n minus one negative pairs. All right. And again, the similarity function is cosine similarity.
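One common way to write such a loss is the InfoNCE form: for each image, take the negative log of the softmax probability assigned to its matching caption. The sketch below assumes that form, includes the positive pair in the denominator (the variant mentioned above), covers only the image-to-text direction, and adds a `temperature` parameter that the lecture does not discuss. It operates on an already computed similarity matrix whose diagonal holds the positive pairs.

```python
import math


def info_nce_loss(sim, temperature=1.0):
    """Average image-to-text contrastive loss over an n x n similarity matrix.

    sim[i][j] is the similarity between image i and caption j; the diagonal
    entries are the positive pairs. temperature is an assumed hyperparameter.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        logits = [s / temperature for s in sim[i]]
        # log-sum-exp with the max subtracted for numerical stability
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += -(logits[i] - log_denominator)
    return total / n


aligned_loss = info_nce_loss([[1.0, 0.0], [0.0, 1.0]])      # high similarity on the diagonal
misaligned_loss = info_nce_loss([[0.0, 1.0], [1.0, 0.0]])   # high similarity off the diagonal
```

The loss is lower when the diagonal dominates each row, which is the "maximize positives, minimize negatives" behavior described above.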

Cool. So CLIP is very useful because now you can do something like zero-shot classification. Zero-shot in the sense that after you've trained this, you can give it a new image, like this thing that looks like a television studio. Previously, when people wanted to classify what's in an image, they had to train specific image classifiers that could only predict five classes or 10 classes. If you pre-train on ImageNet, you have a fixed, predefined number of classes. One of the biggest benefits of CLIP is that now you can do this idea of open-set classification: any caption or any class that you want to classify your image into, you can just give it as a list to the model. And what the model would do is embed the image into the feature space, embed each of your potential classes or captions into the feature space, and take the dot product. So maybe one of these has a higher-similarity dot product and some have lower-similarity dot products, and it would essentially give you a normalized score. So you can classify it into any categories that you want. It doesn't have to be a fixed, predefined set of a thousand image categories. All right. As you can see, it does really well on these zero-shot classification tasks.
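The zero-shot procedure described here (embed the image, embed each candidate class, take dot products, normalize) can be sketched as follows. The embeddings are hand-written stand-ins for the outputs of trained image and text encoders, and using a softmax for the "normalized score" is an assumption of this sketch.

```python
import math


def zero_shot_classify(image_emb, class_embs):
    """Score an image embedding against text embeddings of arbitrary class names.

    Returns a dict mapping each class name to a softmax-normalized score.
    """
    names = list(class_embs)
    scores = [sum(a * b for a, b in zip(image_emb, class_embs[name])) for name in names]
    m = max(scores)  # subtract the max before exponentiating, for stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return {name: e / z for name, e in zip(names, exps)}


# Hypothetical 2-d embeddings; a real system would get these from trained encoders.
probs = zero_shot_classify(
    [0.9, 0.1],
    {"television studio": [1.0, 0.0], "beach": [0.0, 1.0]},
)
best = max(probs, key=probs.get)
```

The class list is just an argument, so it can be changed at inference time without retraining anything.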

As I mentioned, you can think about the relationship between what alignment does using contrastive learning and what information it captures. So we saw that modalities can have shared and unique information. You define shared information through the mutual information, and you can prove that contrastive learning essentially captures the mutual information. So it learns this and it will throw everything else away. Okay. And that is something that you can actually prove, which is both good and bad. It means that if you care about what is common, then these alignment-based contrastive methods are really great: you're guaranteed to keep this and throw away everything else. It's a very scalable method. But if you care about what is here and what is here, the unique parts, then you're in trouble, because the more contrastive learning and the more alignment you do, the worse your performance will get. You're throwing away information that you actually might care about.

There are several ways of dealing with this. We have some work, and other folks also have some work. It's a very easy modification. You can do contrastive learning across your modalities, so you take your image and your caption and you do your contrastive learning. That keeps the information in the middle. If you want to keep what is unique in the image, then you take the image and do, in some sense, image-only contrastive learning. There are ways of doing that: you take the image, you modify it a little bit, and then you do contrastive learning only in the images. And it should be intuitive that you also do this alignment-based contrastive learning only in text: you take the caption, you modify it a little bit, and you do contrastive learning only in text. Then you can learn what is both shared and unique across your data.

All right. I'm just going to quickly go through continuous alignment. Most of these are references. Continuous alignment is harder because now you have continuous signals without discrete elements, and usually what these methods do is basically try to break your continuous data down into discrete parts. I'll leave some references for people here because I know some people are working with sensor data and videos. There are ways of, for example, taking continuous data and learning discrete boundaries within it. It's a well-studied problem in computer vision. For time series data, some of my group is looking at change point detection, so they can automatically break your sensor data apart at discrete boundaries, and each of these segments should be semantically meaningful. For example, this is when the person was sleeping, this is when they were awake and exercising, this is when they're working, and so on. So there are ways of automatically breaking down continuous data into discrete segments, each of them with some semantic meaning, so you could align it with text, for example.

And finally, another class of methods, which is very common for speech: you take continuous data, choose some fixed sampling boundary, get chunks of continuous data, and then do clustering. Clustering is one way of turning something that is continuous into discrete embeddings. So maybe just three clusters, and then you can do your self-supervised learning with transformers to predict discrete tokens.
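The clustering step can be pictured as nearest-centroid assignment: each fixed-length chunk of the signal is replaced by the index of its closest cluster center, and those indices become the discrete tokens. The sketch assumes the centroids have already been found (for example by k-means) and uses made-up one-dimensional frame features; the function name `quantize` is hypothetical.

```python
def quantize(frames, centroids):
    """Map each continuous frame to the index of its nearest centroid.

    frames and centroids are lists of equal-length feature vectors; the
    returned indices act as discrete tokens.
    """
    tokens = []
    for frame in frames:
        distances = [
            sum((f - c) ** 2 for f, c in zip(frame, centroid))
            for centroid in centroids
        ]
        tokens.append(distances.index(min(distances)))
    return tokens


# Three clusters, as in the example above; feature values are illustrative.
centroids = [[0.0], [1.0], [2.0]]
frames = [[0.1], [0.9], [2.1], [1.1]]
tokens = quantize(frames, centroids)
```

Once the signal is a token sequence like this, it can be treated much like text, for instance as prediction targets for a transformer.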

All right. Okay. So I'll end it here. We looked at an introduction to multimodal AI: several key challenges of data being heterogeneous, there being some connected, shared information between modalities, and how they might interact to fuse and get some label that you care about. We looked at several challenges, and we also dived deep into alignment.

All right. And just a quick reminder again: the reading assignment is due tomorrow in preparation for Thursday's discussion, and make sure you make progress on your project so that you can meet with myself and the TA every week. All right.

Thanks everyone.

Okay.
