Lecture 4 – Multimodal Alignment (MIT How to AI Almost Anything, Spring 2025)
By Paul Liang
Summary
Topics Covered
- Highlights from 00:00-11:41
- Highlights from 11:39-21:51
- Highlights from 21:42-33:15
- Highlights from 33:03-42:57
- Highlights from 42:46-52:20
Full Transcript
Okay, very good. All right, welcome back everyone. So, hope you had a great week.
Today's lecture will be an introduction to multimodal AI. So, logistics first, for the project.
Thank you everyone for submitting your proposals. They were all a pleasure to read. I've tried to give a fair amount of feedback on the proposals. I've also assigned a primary TA, one of David and Chanyaka, to your projects, and I CC'd them in an email. The goal is to meet with myself and your primary TA, alternating every week. So with me every other week, with your primary TA every other week, so that on average every week someone takes a look at your progress.
All right. And our goal is really to take these projects and turn them into state-of-the-art, really good research-level work.
Reading assignments. The second reading assignment is due tomorrow, Wednesday, in preparation for this Thursday's reading discussion. Again, two papers, mostly around the theme of modern AI architectures. One of the papers is about scaling laws of these modern autoregressive models. Scaling laws essentially describe how performance scales as you put in more data and increase the number of parameters in your model. And surprisingly, there's a pretty nice trend to these scaling laws, so you can essentially predict what the performance will be as you scale up your models. That's the first paper.
The second paper has a very interesting insight: perhaps you don't need all tokens to do autoregressive pre-training of your models. Some might be more important than others, and that has implications for how you could perhaps train your models more efficiently.
All right, so for the project, start meeting with us and make progress on the projects, and the reading assignment will be due tomorrow, with discussion on
Thursday. Great. So what have we seen so far? We've covered the basic foundations of AI: an introduction to AI and AI research, different forms of data, the structure and information within them, and ways to learn from data. We also covered, at a very high level, a unifying paradigm of different model architectures, including sequential, convolutional and spatial models, and models for sets and graphs.
Now, naturally, a lot of problems in the real world are also going to involve many different data sources, many different modalities, and this is an active area of research. So in the next three weeks, up until spring break, we'll be covering a bird's-eye view of different paradigms in multimodal AI. Today will be more of an introduction, plus aligning different modalities. Next week will be about how to better fuse information between different modalities. And finally, in week seven, we'll look at this idea of transfer. You might not have a lot of data in the modality that you care about, but you could use some other, more common modality to supplement your information.
All right, so that's the overview of today's lecture: an introduction to multimodal AI, some core principles and challenges, and then diving deeper into this first challenge of alignment.
So before I start, a little bit of behavioral history about multimodal AI. Like many things in AI, it really started from people who were looking at psychology and human behavior. Before this psychologist called David McNeill did his line of research, most people thought that we primarily communicate using language, and that anything else, such as our gestures and our voice, was just secondary, not very useful, and would not displace language. And David McNeill came up with this experiment. It's called the McGurk effect, which really shook some of the predominant thinking at that time.
So I'll demonstrate this effect by playing two videos, and I want you to quickly tell me what the similarities and differences are between these two videos. So here's the first one.
Okay, first video, and here's the second one. Ba ba ba ba ba ba?
Okay, who can tell me what is interesting, either similarities or differences, between these two video clips? Mike.
Similarity in the visual information, but in sound, the first one was B and the second one was B, even though, like, very well masked. Like, similar in visual information but different in sound.
Yeah.
Anybody else have a counterpoint? Yes.
Yeah. I'm not sure that the sounds are different. I think maybe just the visuals are different, since the lips look different.
And you have some thought as well?
So if I didn't have the information, that information...
Yes. So I thought you raised your hand, that's why I called on you. But anyways, I think most of you got the key idea. When I give this talk, half the people are on their laptops, so they'll say, oh, have you even played the second clip yet? Because if you're only on your laptop and just listening to it, you realize that the audio clips are exactly the same, right? If you start looking at the person's lips, you can see that the first person is doing "ba" with a B, and the second person is doing "fa" with an F. So our perception of the same audio clip can actually be very different based on how the person is saying it.
So that really kickstarted this first era of multimodal research back in the 1980s, which is this behavioral era: realizing that when we recognize speech from others and interact with others, we actually use a good mix of different communicative modalities. That was the behavioral era, and then naturally there was an ambition to build AI tools, computational tools, that could replicate this behavioral perception of different modalities in humans. After this computational era, using AI just to replicate, came this interaction era, where multimodal HCI and multimodal interfaces became really popular. So you had AI and humans interacting through different modalities. And finally, nowadays we have this deep learning era.
Ever since the 2010s we've seen huge numbers of GPUs, huge amounts of data, and a huge number of large models that really revolutionized this space of multimodal AI. And of course, as a sub-era, there's this foundation model era that really started in the 2020s, where we've really seen the huge potential of scaling these large transformer models for single-modality and multimodal tasks.
So just historically, what tasks did people care about? I gave the example of audiovisual speech recognition, where your goal is to recognize what the person was saying from both audio and visual. That started from this behavioral era, and naturally it was the first kind of task that people applied computational methods to: audiovisual speech recognition.
Soon after came this early boom in internet-based data. So there was lots of interest in building systems that could digitize multimedia data on the internet. So given content, can you retrieve videos? And soon after that there was lots of interest in recognizing human affect and emotions from our multimodal behaviors.
Once we started getting into the interaction era, there were systems based on recognizing events in videos. There was work on recognizing the sentiment and emotion of multiple people interacting with each other.
And something that really kickstarted some of this multimodal deep learning work was image captioning. So if I give you an image, can you come up with the right caption for the image? That was kind of the birth of modern language and vision research. And of course, within the past couple of years, we've really seen a ton of applications in multimodal deep learning, from captioning images to captioning videos to writing descriptions of videos.
Then people found that captioning was really cool, but it was really hard to evaluate, right? How do you evaluate whether a given caption for an image or video is correct? It's so subjective. So then people converted captioning into question answering. If I give you a question about a particular video, how do you train an AI system to give me the right answer? That was visual question answering, and again people extended that to video-based question answering.
And then it went from static systems, just writing one caption and answering one question, to AI systems capable of holding multimodal dialogue: given a particular image, referencing it and conversing in a dialogue with humans about those modalities.
Then larger-scale YouTube data became a source of data to pre-train all these video models that could recognize, caption, and retrieve different types of videos.
Soon after, we saw lots of work in autonomous agents. These were agents that are able to navigate using vision and language. So you might give it some instruction like "go there and then turn left at the nearest gas station." Those are agents capable of navigating based on vision and language.
And of course self-driving cars also do a lot of multimodal fusion of different visual, language, and sensory information.
And nowadays in generative AI we've seen a lot of applications that take in text to generate images and videos, right? These generative text-to-image and text-to-video models.
And finally, probably the biggest application today is these interactive agents. Interactive agents on your computer and your robots that can take in lots of data and are grounded in either the digital or physical world to complete difficult tasks, and many more. So that's just a brief history of various tasks that multimodal AI has made an impact on.
So in the first part of today's lecture, we're going to cover an introduction to multimodal AI, and then in the second half we'll dive deep into some of these core methods. A lot of this will be based on this survey paper that we have on foundations and trends in multimodal AI, and if you're interested in longer versions, we give tutorials at international conferences.
So firstly, what is a modality? A modality is some data source representing a way in which information is expressed or perceived in the world.
So like all forms of data that we've seen in AI, you think about some sensor which is collecting your data, such as cameras or microphones, and you think of it as a spectrum from more raw modalities, closer to a sensor, to more abstract modalities that are further away from a sensor because they've undergone more processing. So you can think of a raw speech signal from a microphone or a raw image taken from a camera as raw modalities. You can start extracting language from your speech signal. You can detect objects from images. You can extract sentiment from language, and you can categorize the objects. So all this becomes more abstract information that you could even represent as words, right? Positive sentiment, or this is a table, this is a chair.
And as we've seen, the raw data can be very different. Different data comes with different types of structure and different dimensions. But as you start processing these different data modalities using AI systems, like the model architectures that we've seen, you're basically learning abstractions. You're learning abstractions and semantic information from different forms of data. So there's a possibility of bringing them closer together.
So multimodal problems are therefore those with different modalities. And when multiple modalities are in the picture, there are often several challenges. These are challenges of, firstly, heterogeneity: the fact that different modalities are very different and processing them together will cause difficulties. And then the idea that there are some interconnections between your modalities. We break down this word interconnection into, first, connections: the two modalities themselves have some connected, overlapping information. And then this idea of interactions: there is some fusion that goes on to create new information when they're brought together for some task.
So in more detail, this idea of heterogeneity should again be seen as a spectrum, from more homogeneous modalities with more similar qualities and representations to more heterogeneous modalities with more diverse qualities and representations.
So you can think of images from two cameras as being more homogeneous. They're both visual modalities. They both have this spatial structure. They're both sensitive to the same type of noise, like camera blur. So they're more homogeneous, but again they're giving you different information, right? Different angles, different viewpoints of the same object.
And then you can think of something that's more different, for example text from two different languages: English and French, English and Chinese. Different language families have different structures, different grammars, and different vocabularies, but they're still kind of similar, because you can train a model to translate from one language to another, sometimes with perfect information and sometimes losing a little bit of information.
Then you can think of language and vision, images and text. They're much more different. You start to see their differences. And you can even go to something like language and sensors, which is even more different, because at least in vision and language you can think of language that describes the image, but is there even a language that can successfully describe sensor data? So these are more different modalities. Yeah.
What do you mean by qualities here?
So, everything in the modality profile that we've seen: the base elements themselves, the distribution of elements, whether the data is discrete or continuous, what kind of information they contain. I just highlighted structure and representations because those are perhaps slightly more important qualities to consider.
So it's not about the image quality or the language quality, it's just about the modality?
Yes, not actual quality, but just the profile, the information, the properties of the modalities.
Okay.
And abstract modalities are more likely to be homogeneous. We've seen that raw modalities are those collected closer to the sensor, so raw images and raw sensor data. Abstract modalities are the ones that undergo more processing: extracting whether I'm happy or sad from my speech signal, and whether I'm happy or sad from maybe my physiological sensor. So abstraction allows you to essentially bring heterogeneous modalities closer to each other.
All right. So that's the first principle: modalities are heterogeneous and therefore different. The second key principle in multimodal learning is that if we just look at these modalities, there's often going to be some shared information that connects them together. And again, you want to think of this as a spectrum of shared and unique information: cases where there's more overlap, cases with less overlap, and maybe settings where modalities are completely independent from each other with no overlap. So given any pair that co-occurs in the world, for example an image and text, you can see that there are some connections. They're both describing a teacup and a sofa. But there's a lot of information in the image that is not captured in the caption. For example, the color of the sofa, the size of the table, even the presence of the table: none of these are really captured in language.
So modalities in general have some connections, but they don't perfectly overlap in information.
And finally, interactions. Interactions describe this phenomenon where, when you bring your modalities together and try to combine them, some new information will typically emerge for your task. Given these modalities and your task, one common type of interaction that people care about is this idea of redundancy, where both modalities give you some common information and your goal is to really exploit this common information. So if I'm saying something positive like "this movie is great" and I'm smiling at the same time, then this is the idea of redundancy. You're reinforcing the idea that the person really liked the movie.
That's in opposition to uniqueness, where there's information only in one modality, not in the other. For example, you're saying something slightly positive like "the movie does a good job developing the characters" with a neutral facial expression, so only one of these has unique information. The challenge here is to identify which one has the right information and which others you should maybe throw away.
And finally, synergy. This is the idea that some information only emerges when you fuse the modalities together. They might be saying something like "wow," which seems positive, but with some anger or frustration on their face, and then you can infer that this person might actually be sarcastic about this topic. So you have to look at both. You have to compute that there are some differences between the modalities, and that indicates sarcasm.
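One way to make redundancy, uniqueness, and synergy concrete is to measure how much each modality tells you about the label, alone and jointly. The sketch below is an illustration only, not something from the lecture slides: it assumes binary toy modalities `x1`, `x2` and a label `y`, and uses mutual information in bits as the measure. The three toy datasets and the function names are hypothetical.

```python
import math
from collections import Counter
from itertools import product


def mutual_information(pairs):
    """I(A;B) in bits, treating the (a, b) samples as equally likely."""
    n = len(pairs)
    joint = Counter(pairs)
    count_a = Counter(a for a, _ in pairs)
    count_b = Counter(b for _, b in pairs)
    return sum(
        (c / n) * math.log2((c / n) / ((count_a[a] / n) * (count_b[b] / n)))
        for (a, b), c in joint.items()
    )


def info_profile(samples):
    """For (x1, x2, y) samples, return I(x1;y), I(x2;y), I((x1,x2);y)."""
    i1 = mutual_information([(x1, y) for x1, _, y in samples])
    i2 = mutual_information([(x2, y) for _, x2, y in samples])
    i12 = mutual_information([((x1, x2), y) for x1, x2, y in samples])
    return i1, i2, i12


bits = [0, 1]
# Redundancy: both modalities carry the same copy of the label.
redundant = [(b, b, b) for b in bits]
# Uniqueness: only the first modality determines the label.
unique = [(x1, x2, x1) for x1, x2 in product(bits, bits)]
# Synergy: the label is the XOR of the two, so neither helps alone.
synergy = [(x1, x2, x1 ^ x2) for x1, x2 in product(bits, bits)]

for name, data in [("redundant", redundant), ("unique", unique), ("synergy", synergy)]:
    print(name, [round(v, 3) for v in info_profile(data)])
```

In the synergy case each modality alone carries zero bits about the label while the pair carries one full bit, which mirrors the sarcasm example: the signal only appears when both are examined together.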
So that's the key idea of interactions. At a very high level, multimodal problems are those that involve heterogeneous, connected, and interacting data.
Heterogeneous because data sources are different from each other and there's difficulty in processing different sources. Connected because these modalities overlap in some information. And interacting because different parts of the information are going to combine when you try to fuse them together for some task.
All right. So that's a brief explanation of what multimodal problems are.
So next we'll cover why it is hard. What are some unique challenges in multimodal AI that are typically not seen in single-modality problems like computer vision and NLP separately, and what are some of the state-of-the-art methods to deal with these challenges?
So all of these multimodal models look something like this. There are multiple modalities, and I'm using different colors and shapes to represent them. I'm also breaking down these modalities into elements, right? These might be words in a sentence that you're saying, image regions in some big image that you're trying to process, or time steps in long-range sensor data that you're looking at. So just like how all of our discussion about machine learning is about breaking things down into individual parts, learning representations, and then aggregating information, we're going to use a similar schematic over here.
And typically you're designing some AI system that takes as input multiple of these modalities and either learns a representation or predicts a label as the output.
Right? So essentially we're looking at what really goes in here, and what the challenges are in designing the components within this block.
So at a high level, there are going to be six challenges that make multimodal AI difficult, unique, and interesting to study.
The first challenge is this idea of representation, right? So you have data from different modalities, and there's some heterogeneity between them. How do you learn representations that can essentially bring them together and combine them in the right way? Representation is often a core building block in almost all problems, right? When you start with data that's raw, that's just been collected, the first thing you always want to think about is how to represent your modalities.
So for simplicity, again, this is the schematic. You might have data coming in with different elements, for example multiple words that you're saying, or multiple time steps in sensor data. We're going to look at it more simply as just one element in each. We're just going to care about representing one element in each and ignoring the rest. We'll look at the rest in the subsequent challenges.
Okay. So there are three key ways of doing representation, each achieving different purposes.
One way of doing representation is what we call fusion, right? You take in two elements. You might already have some representations for them separately, but how do you bring them together into one common representation that best uses the information in some appropriate way? This fusion might exploit the overlap in the information. It might be one that picks the right modality depending on the information. It might combine them in more sophisticated ways to capture some synergy between them. So in general, one class of methods is called fusion.
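The three flavors of fusion mentioned above can be sketched with plain Python lists standing in for learned feature vectors. This is a hedged illustration under simple assumptions, not the specific fusion methods covered later in the course: the function names are hypothetical, the gate is a fixed number rather than a learned quantity, and the multiplicative variant is a flattened outer product.

```python
def concat_fusion(x, y):
    """Keep everything from both modalities in one longer vector."""
    return list(x) + list(y)


def gated_fusion(x, y, gate):
    """Weight the two modalities; gate=1.0 trusts x only, gate=0.0 trusts y only.

    In a real model the gate would be predicted from the inputs;
    here it is passed in by hand (an assumption for illustration).
    """
    return [gate * a + (1.0 - gate) * b for a, b in zip(x, y)]


def multiplicative_fusion(x, y):
    """Flattened outer product: every feature of x multiplied by every feature of y.

    Products of features are one simple way to expose interactions
    that neither vector shows on its own.
    """
    return [a * b for a in x for b in y]


audio = [1.0, 2.0]
video = [3.0, 4.0]
print(concat_fusion(audio, video))
print(gated_fusion(audio, video, gate=0.25))
print(multiplicative_fusion(audio, video))
```

Concatenation defers all the work to whatever model consumes the fused vector, gating corresponds to picking the right modality, and the multiplicative form is a minimal stand-in for the more sophisticated combinations that target synergy.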
A second class of methods is what we call coordination. These methods take in your two elements and learn representations for them separately. Right? You have one for the first one and one for the second one. But they're not independent, because you're going to align these two representations using some similarity function, enforcing that there is some structure between the two representations that you learn. There's a lot of research that goes into how, for example, you can define this alignment between them. This is more useful when you want to do retrieval. For example, you define this function as cosine or dot-product similarity. Then you can take something, learn a representation, find the nearest one in the other modality, and thereby retrieve the closest item in the other modality. So going from an image to retrieve the nearest caption, or vice versa. Okay. So that's coordination.
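The retrieval use of coordination can be sketched as nearest-neighbor search under cosine similarity. The example below assumes the two encoders have already been trained so that matching items land close together; the embedding values, item names, and the `retrieve` helper are made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve(query, candidates):
    """Return the name of the candidate whose embedding is closest to the query."""
    return max(candidates, key=lambda name: cosine(query, candidates[name]))


# Hypothetical outputs of an image encoder and a text encoder
# that were trained to share a coordinated space.
image_embeddings = {
    "dog_photo": [0.9, 0.1, 0.0],
    "beach_photo": [0.1, 0.8, 0.3],
}
caption_embeddings = {
    "a dog playing fetch": [0.8, 0.2, 0.1],
    "waves on a sandy shore": [0.0, 0.9, 0.4],
}

# Image to caption, then caption to image.
print(retrieve(image_embeddings["dog_photo"], caption_embeddings))
print(retrieve(caption_embeddings["waves on a sandy shore"], image_embeddings))
```

Because each modality keeps its own representation, either side can be embedded and indexed ahead of time, which is part of why this setup suits retrieval.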
Fission is this final example, where you start maybe with two elements and now you're trying to learn a partitioned space that has more representations than the number of modalities that you started with. This can be very useful if you want to learn more disentangled and more interpretable features, where one of them captures the overlap between the two, one captures what's unique in the first, and one captures what's unique in the second. Right? So now you have three representations for two modalities, and you could go even further. Right? So there are three general ways of learning representations, divided by whether you're reducing and combining representations, keeping them separate, or actually learning more representations that capture different information.
All right. So that's the first challenge of representation.
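To picture what fission is aiming for, the toy below uses sets of attribute tags in place of learned vectors and reuses the teacup-and-sofa example from earlier. This is only an illustration of the target factorization (shared, unique to the first, unique to the second); real fission methods learn these parts from data, and the tag names and `fission` helper here are hypothetical.

```python
def fission(features_a, features_b):
    """Split two feature sets into a shared part and two modality-unique parts."""
    shared = features_a & features_b
    return {
        "shared": shared,
        "unique_a": features_a - shared,
        "unique_b": features_b - shared,
    }


# The image shows more than the caption mentions.
image = {"teacup", "sofa", "table", "sofa_color"}
caption = {"teacup", "sofa"}

parts = fission(image, caption)
print(sorted(parts["shared"]))    # what both modalities describe
print(sorted(parts["unique_a"]))  # only visible in the image
print(sorted(parts["unique_b"]))  # only stated in the caption
```

Two inputs produce three outputs, which is the "more representations than modalities" property described above.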
The second challenge is alignment. As we've seen, in representation we just focus on one element in one modality and one element in the other. Of course, in most problems you have elements in some sequence following some structure, right? You have words that you're saying across time, you have image regions in some spatial image, you have sensor data also across time. So alignment essentially studies this problem of how you align your modalities at both the element level and a global level.
One example: if I'm saying something and trying to describe an image, which word that I'm saying, for example "person," actually matches the image region showing the person? If I'm saying "hockey stick," which part actually relates to the area of the image that shows the hockey stick? So it's about learning this mapping between your modalities that captures the parts that have similar semantic information.
So there are several sub-challenges here. One sub-challenge is the easiest case, where your modalities are all subdivided into discrete elements. For example, words are discrete, image bounding boxes are discrete, and the goal is to essentially learn this matching: which word corresponds to which region in the image, which other word represents some other region in the image. So you're solving this matching problem to find the alignment between your two modalities.
So that's called learning discrete connections.
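The discrete matching problem can be sketched as picking the one-to-one assignment of words to regions with the highest total similarity. The version below is a minimal brute-force sketch that assumes words and regions already have embeddings in a shared space; the vectors, region names, and the `best_one_to_one_alignment` helper are hypothetical, and trying every permutation is only practical for a handful of elements.

```python
from itertools import permutations


def dot(u, v):
    """Dot-product similarity between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def best_one_to_one_alignment(word_vecs, region_vecs):
    """Assign each word a distinct region so that total similarity is maximal."""
    words = list(word_vecs)
    regions = list(region_vecs)
    best = max(
        permutations(regions, len(words)),
        key=lambda perm: sum(
            dot(word_vecs[w], region_vecs[r]) for w, r in zip(words, perm)
        ),
    )
    return dict(zip(words, best))


# Hypothetical embeddings for two words and three bounding boxes.
word_vecs = {
    "person": [1.0, 0.0, 0.0],
    "hockey stick": [0.0, 1.0, 0.0],
}
region_vecs = {
    "region_a": [0.1, 0.9, 0.0],
    "region_b": [0.8, 0.1, 0.1],
    "region_c": [0.0, 0.1, 0.9],  # background, should stay unmatched
}

print(best_one_to_one_alignment(word_vecs, region_vecs))
```

For larger problems the same objective is usually solved with a dedicated assignment algorithm rather than enumeration.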
A second sub-challenge is a more difficult version, where you can no longer segment your data cleanly along discrete boundaries and your data is more continuous. So let's say you have some high-frequency sensor data, medical data and so on, that is continuous. At the same time, I want to align it with something a doctor is writing: this part indicates that your heart is palpitating, this part indicates that a person fell down. So how do you do this segmentation while also aligning it to the other modalities? You have to add this step of segmentation and discretization of continuous data.
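As a very rough sketch of the extra discretization step, the function below cuts a one-dimensional signal into contiguous runs that sit above or below a threshold. Thresholding is a stand-in chosen for simplicity, not the method the lecture has in mind, and the signal values, threshold, and function name are hypothetical.

```python
def segment_by_threshold(signal, threshold):
    """Split a 1-D signal into runs that are all above or all at/below the threshold.

    Returns a list of (start, end, is_high) tuples, with end exclusive.
    """
    segments = []
    start = 0
    for i in range(1, len(signal) + 1):
        # Close the current run at the end of the signal or when the state flips.
        if i == len(signal) or (signal[i] > threshold) != (signal[start] > threshold):
            segments.append((start, i, signal[start] > threshold))
            start = i
    return segments


# Toy sensor trace with one burst of high activity in the middle.
trace = [0.1, 0.2, 0.1, 0.9, 1.0, 0.8, 0.2, 0.1]
print(segment_by_threshold(trace, threshold=0.5))
```

Once the continuous stream has been turned into a short list of segments, each segment can be treated as a discrete element and matched to phrases in the accompanying text.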
And while the first two aim to just achieve the alignment, and that was the goal, just to find the matching between the modalities, this third challenge is to use alignment to learn a better representation. Right? We call that contextualized representations: doing better representation learning by contextualizing how an element matches with other parts. A lot of work nowadays in multimodal transformers essentially falls into this category, because I'm trying to learn a better representation of, say, some language by looking at other words in its context, and perhaps also looking at parts of the image that are given as context, right? All of that is to learn a better representation of words by taking into account how they're aligned with other words and maybe other images or videos.
So sometimes we also call the first two explicit alignment, because the alignment is explicit and it's what you're trying to do as the end goal. We sometimes call this third one implicit alignment, because alignment is just some intermediate step to learn a better representation.
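Implicit alignment in a multimodal transformer can be illustrated with a single step of scaled dot-product attention, where one word looks at a set of image regions and builds a context vector from them. This is a stripped-down sketch with no learned projection matrices, which a real transformer layer would include; the vectors and function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for one query over several key/value pairs.

    Returns the attention weights (a soft alignment) and the weighted
    sum of the values (the contextualized representation).
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, context


# One word embedding attending over two image-region embeddings.
word = [2.0, 0.0]
region_keys = [[2.0, 0.0], [0.0, 2.0]]
region_values = [[1.0, 0.0], [0.0, 1.0]]

weights, context = attend(word, region_keys, region_values)
print([round(w, 3) for w in weights])
print([round(c, 3) for c in context])
```

The weights are never supervised directly; they are an intermediate quantity that exists only to produce the context vector, which is the sense in which the alignment here is implicit.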
Challenge three is reasoning. We've seen in representation how you represent one element in one modality and one element in the other. In alignment, we've looked at how you could learn the connections between multiple elements and multiple elements. Reasoning essentially aims to take all this information and combine it through multiple inferential steps to make some prediction.
And when you think about combining information through multiple steps, at some point people were doing it using multiple layers of neural networks that each learn more and more abstracted information that's closer and closer to what you want to predict. But nowadays, as all of you have seen on social media, there's a lot of interest in doing this reasoning more explicitly, right?
Am I combining my information in a tree structure, or in a graph structure? Do I use attention maps to say, I first look at this part of the image, where one person is behind the other, and then I look at this part of the image to show that the person fell down, and so on? Or nowadays you can even use words as an intermediate medium for reasoning.
We have these LLMs that can not just make predictions but can really explain step by step: I first do this, then do this, then look at the image, then combine it with this, and therefore this is what disease the patient might have.
Words can also be used as an intermediate step to reason.
So a lot of work in reasoning is about first defining what the structure is. Is it sequential, one step to the next, and how many steps? Is it some tree or graph? How do you parameterize intermediate steps? Whether it's attention maps, which are more visual, or words, which are more language-based. And of course you cannot do reasoning without some external knowledge. You need to know about the problem and know what a good reasoning sequence is before you can do it for AI systems. Yeah.
Do we know if reasoning actually improves things for multimodal inputs? Chain of thought is typically used in just the language domain, but I've seen a lot of examples where, if you use chain-of-thought reasoning with an image, it actually performs worse.
We're going to have a whole lecture on reasoning and state-of-the-art reasoning, so only things that happened in the past year. My students are helping me prepare that lecture and they'll be helping to give it. The answer is: reasoning in language is easier.
Reasoning with images might be harder.
For example, visual chain of thought is harder. But if done right, there is lots of potential. Right? So we have some research going on where, let's say, you want to solve a geometry problem with some starting figure and some question. It'd be very intuitive if a system could do the mathematical proof step by step while also drawing and annotating step by step: looking at this angle, I'm drawing this perpendicular line, this makes it a right-angled triangle. All these are visual steps, and if done well, we can improve performance and, more importantly, make it more understandable for humans, if you want to actually give it to a student as a tutoring system. We need this intermediate language, visual, and probably more mediums of reasoning. But we'll look at all this. Sorry, I wasn't clear.
This is an overview. We're going to look at the specific models for each challenge in the subsequent lectures.
Yeah. So that question there is one paper which patterns on image classification and they have shown that it actually
Great.
All right. Challenge four is generation.
You've seen mostly predictive systems that align and fuse and make some prediction, and of course there's lots of interest in building generative models that are able to operate across different modalities. So we look at summarization: you take more data and summarize it into the most important features. There's translation, where you take one modality and try to map it exactly to the other, so text-to-image generation, for example, or text-to-video. And creation is kind of a holy grail, but no one's really been able to do it successfully. That's when you start with less data, perhaps a first frame, and you try to generate multiple frames to make an image into a video, or even make music soundtracks where video and audio and music are all synchronized and generated at the same time.
So that's generation, all the work in multimodal generative models. Transference is the fifth challenge, where you might be trying to make predictions in some modality that you care about, say medical images, but you don't have that much data, you don't have that many labels, and even if you do, they might be noisy. There are many reasons why prediction in just the one modality that you care about might be unsuccessful, and the goal is to use some extra modality to support your learning and to learn a better representation for the primary one that you actually care about. Nowadays, for example, you could use LLMs to supplement image classification models. Back then, you could pre-train on ImageNet as a second modality to support medical image classification. There are many examples of using one modality as extra information to help the primary modality that you care about.
So there are several ways of doing that. One is to pre-train and then initialize. We'll cover lots of examples of this. There's also this set of methods called co-learning, where you introduce some extra information either as an input during training or as a prediction target during training. If you introduce it as an input during training, you would just zero it out during testing, so during testing you only have the modality that you care about running inference on. And if you use it as a prediction target during training, you don't need it during testing, so it's also just operating on the one that you care about during inference. So these are ways of supplementing more data as input, or more training objectives during training, to help you get to a better model.
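
As a rough illustration of the "zero it out during testing" idea, here is a minimal sketch. The linear scorer, its weights, and the name `predict` are all hypothetical and chosen only to make the mechanics visible; the lecture does not specify a model.

```python
def predict(primary, auxiliary=None, w_primary=(0.5, -0.2), w_aux=(0.3, 0.1)):
    """Linear score over [primary, auxiliary] features (toy weights, assumed).

    If the auxiliary modality is missing, as it would be at test time,
    it is zero-filled so that only the primary modality contributes.
    """
    if auxiliary is None:
        auxiliary = [0.0] * len(w_aux)
    score = sum(w * x for w, x in zip(w_primary, primary))
    score += sum(w * x for w, x in zip(w_aux, auxiliary))
    return score


# Training time: both modalities are available.
train_score = predict([1.0, 2.0], auxiliary=[4.0, 5.0])
# Test time: only the primary modality; the auxiliary input is zeroed out.
test_score = predict([1.0, 2.0])
```

The point of the sketch is only that the model's input shape stays the same between training and inference, while the extra modality is present only during training.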
And finally, there's the model induction class of methods, where you keep classifiers separate but again encourage them to share some information, so that strong classifiers can help weaker classifiers.
Okay. And the final challenge is quantification. Challenges one to five are more about building and designing new models. Challenge six is more about better understanding this entire space, from your data to the models that you build to the training objectives and evaluation. I say a lot of words, but formally measuring how different modalities are is still a challenge, and understanding how that influences how models are optimized and trained is still an open question. Lots of heuristics, no deep understanding. We have intuitions about when modalities are redundant or give unique information, but again, none of these are formalized or can be made formal through actual estimators, and it is not clear how this can influence the learning process. So quantification contains a lot of challenges around better understanding things: understanding when they work and when they don't.
All right. So, the summary of these six challenges. Representation and alignment are probably the most important. Given any problem, you're going to break it down into different elements, break down this large data into individual elements, and you have to first decide how to represent your data. Are we looking at fusion problems, where the goal is to combine? Are you looking at retrieval problems, where we might want to keep things separate and align them? Are you looking at maybe fission problems? Alignment is this problem where you're trying to match different parts of one modality to different parts of the modality on the other side. And often you do representation and then you have to think about alignment.
And once you've done that, it'd be great if we always did reasoning, so that you could break down your problem step by step, explain step by step what's going on across language and vision and more modalities, and use that to predict your label. When you don't care about predicting a label, sometimes you care about generating more data, and sometimes you care about transferring information from one modality to the other.
And of course, although it'd be great if we did reasoning all the time, sometimes people skip over the reasoning step and most things are like this black box over here.
And finally, quantification is kind of this magnifying glass that revisits these previous challenges and tries to really understand when things work and when things don't in some principled manner.
All right. So this is a quick intro to multimodal challenges. Any questions so far?
Great.
So in the last 15 minutes or so, I'll try to cover some parts of alignment. Next week we're going to cover everything about fusion, but alignment is kind of shorter, so I'll just cover it in maybe 15 minutes.
So as you recall, this alignment challenge is broken down into three parts. It's all about looking at elements in one modality and elements in the other and trying to find out where the connections are, where the matchings are. It is easiest when your data can be broken down into discrete elements, like words and maybe bounding boxes. It's going to be harder when your data is continuous, so you have to solve this question of taking continuous data and breaking it down along semantic boundaries before doing alignment. We'll cover later in the class this idea of learning implicit alignment, so using alignment to learn better representations, like in these large transformer models.
So again, in discrete alignment, one example is that you might try to tie language, which contains words and phrases, to non-linguistic elements that are more visual. For example, "a woman reading newspaper". You might have data that's only given at the high level, so entire sentences with entire images. And maybe your goal is to first align the entire sentence with the entire image at a global level, but also maybe at a more granular level. Where does "the woman" correspond to? Where does "the newspaper" correspond to? And so on.
A common class of ways of doing this is contrastive learning. In most of these settings, your supervision will come from some paired data. So this is maybe one image with a corresponding caption, another image with another caption, a third image with a third caption. You can get this from Instagram, Wikipedia, Flickr. These are all common datasets with paired data between image and text.
Of course, if you want to do it for some other application that's not image and text, you can think about how to get this paired data.
Oftentimes these representations are learned by putting the two elements through separate encoders, each producing some representation, and you're trying to enforce some similarity function between the two of them. This similarity function is what captures the fact that these two representations are aligned, while these other two representations should not be aligned. They should be further away.
Okay. So that's some similarity function that you can define.
All right.
So in contrastive learning, what you would do is try to make this similarity function really high, bringing things really close together, for these positive pairs, so images and the actual captions they correspond to. Everything in red is what we call negative pairs. So it might be an image and an incorrect caption, and there might be multiple incorrect captions for that image, and you're trying to make the similarity function smaller, so not as similar. You can even prove that these kinds of methods provably learn the common information between two modalities. So if you formalize common information by the mutual information between these two random variables, then this class of contrastive methods essentially keeps everything in the middle and discards everything that is unique in one and not present in the other.
All right. So that's a high-level view of what people do in learning these connections.
A bit more formally, these encoders can be designed to be very specialized. We saw this roadmap of different model architectures: if you have spatial data you might use spatial methods like CNNs or vision transformers, and if you have sensor data you might use other encoders. So the encoders can be very different. They'll capture what's unique about each modality, and this alignment similarity function would essentially capture the connections: which representations should be connected to each other and which ones should not. So there are several choices here. What is f_A? What is f_B? What are your specialized encoders? And also, what is this g function, this similarity function?
Once you design those, so f_A on your first modality and f_B on your second, you then compute this similarity function g between those representations, and your loss would essentially be maximizing the similarity, or maybe minimizing the distance, between those representations. The parameters that you can update are any parameters within f_A and f_B, your encoding models, and perhaps any parameters defining the similarity function.
So there are several choices of such a similarity function. Cosine similarity is a common one. You learn z_A for the first modality and z_B for the second. These are vectors, and cosine similarity essentially asks whether these vectors are pointing in the same direction in the embedding space. It's very simple to compute: essentially a dot product normalized by the norms of those vectors.
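
To make "a dot product normalized by the norms" concrete, here is a minimal sketch in plain Python. The names `z_a` and `z_b` follow the notation in the lecture; the function name is only illustrative.

```python
import math


def cosine_similarity(z_a, z_b):
    """Dot product of two embeddings divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(z_a, z_b))
    norm_a = math.sqrt(sum(a * a for a in z_a))
    norm_b = math.sqrt(sum(b * b for b in z_b))
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])   # vectors point the same way
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])       # vectors are perpendicular
```

Because of the normalization, only the direction of each vector matters, not its length.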
People have also designed kernel similarity functions. You can think of the dot product, for example, as basically just measuring whether these two vectors are nearby in the original space, or what we call a linear space. There are also ways of essentially projecting those vectors into some other, more expressive space and measuring how similar those embeddings are in that more expressive space. They call those kernel similarity functions. Again, it's not super important that you understand the details, but just understand that you can measure the similarity in the original space of the embeddings, and you can also measure the similarity in some transformed, more expressive space of the embeddings.
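
As a small sketch of the contrast between the two, the snippet below compares a plain dot product (linear kernel) with an RBF (Gaussian) kernel. The lecture does not name a specific kernel, so the RBF kernel and the `gamma` parameter are assumptions chosen as a standard example of similarity in an implicit, more expressive feature space.

```python
import math


def linear_kernel(z_a, z_b):
    """Similarity in the original (linear) embedding space: a plain dot product."""
    return sum(a * b for a, b in zip(z_a, z_b))


def rbf_kernel(z_a, z_b, gamma=1.0):
    """RBF kernel: exp(-gamma * ||z_a - z_b||^2), an assumed example kernel.

    It equals an inner product in an implicit higher-dimensional feature
    space, without ever computing that projection explicitly.
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(z_a, z_b))
    return math.exp(-gamma * sq_dist)


identical = rbf_kernel([1.0, 2.0], [1.0, 2.0])   # maximal similarity for identical vectors
far_apart = rbf_kernel([1.0, 2.0], [4.0, 6.0])   # decays toward zero with distance
```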
Something that has also become quite popular is correlation-based methods. So maybe you don't care about just two embeddings being very close to each other, but given a batch of embeddings, you want them to be correlated with each other. So with 32 images and maybe 32 captions, you want them to be correlated, perhaps across some axis. Things that are outdoors should all be on this side, and things that are more indoors maybe on this side. So these are correlation-based methods.
You can also go further: you don't care about just one element in one modality matching one in the other, so you can define these similarities based on pairwise relationships. For example, if a pair of images are very similar, I want the pair of captions that describe those images to also be very similar. So these are methods that go beyond defining similarity at individual data points and define it across multiple data points.
All right. So visually, it looks like this.
You might have paired data, with images on one side and descriptions on the other side. You would define positive pairs as those that actually correspond to each other. For example, blue car, yellow bus. These actually correspond to each other. Negative pairs are those that are incorrect captions for your images.
And you're essentially training this function to maximize the similarity for the positive pairs and minimize the similarity for your negative pairs. Okay?
And your similarity function can be something like cosine similarity.
Although this is a very simple idea that's been around for a long time, it gives you some really cool results. Now that you've learned this kind of common embedding space that brings your blue cars next to the text for them and your red cars next to the text for them, you can do this kind of manipulation. You can take a blue car, embed it into your embedding space, subtract the word embedding for "blue", and add the word embedding for "red". That gives you another embedding, and you try to retrieve the nearest image corresponding to that, and you actually get red cars. Here is another cool example. You can take this cat in a bowl, embed it into your space, subtract the embedding for "bowl", and add the embedding for "box". These are word embeddings. That gives you another embedding, and you find the nearest image, and you actually see cats in a box.
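
The manipulation above can be sketched with a toy shared embedding space. Every vector below is hand-made and hypothetical, picked so the arithmetic is easy to follow; a real system would get these from trained image and text encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


# Toy shared space (assumed): dimensions roughly mean [blue, red, car-vs-bus].
image_emb = {
    "blue_car": [1.0, 0.0, 1.0],
    "red_car": [0.0, 1.0, 1.0],
    "blue_bus": [1.0, 0.0, -1.0],
    "red_bus": [0.0, 1.0, -1.0],
}
word_emb = {"blue": [1.0, 0.0, 0.0], "red": [0.0, 1.0, 0.0]}


def retrieve(query, gallery):
    """Return the name of the gallery item most similar to the query."""
    return max(gallery, key=lambda name: cosine(query, gallery[name]))


# image("blue car") - word("blue") + word("red")
query = [
    i - b + r
    for i, b, r in zip(image_emb["blue_car"], word_emb["blue"], word_emb["red"])
]
nearest = retrieve(query, image_emb)
```

In this toy space the query lands exactly on the red car embedding, which mirrors the retrieval result described in the lecture.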
And this is 2014 stuff, right? Images might be smaller resolution, and the text is just two words, looking at bigrams. Of course, nowadays people have really scaled these systems up. You have probably all seen CLIP, which is essentially an example of scaling up these alignment-based methods. You have millions of images and captions describing these images, and you can again compute this similarity matrix, where entries on the diagonal represent the image paired with the right caption. Those are similarities that should be maximized, so bringing those together. Everything that is off-diagonal represents images with wrong captions, so those similarities should be minimized. And all these entries here are just cosine-similarity dot products: an image embedding dot product with a text embedding.
And that's basically this contrastive loss that you often see nowadays. You sum over your data points, take the average, and take the log to make things more numerically stable. You're essentially maximizing the similarity of your positive pairs and minimizing the similarity of your negative pairs. Sometimes people add the positive pair here as well, just for computational simplicity, but it's essentially just going to be dominated by all the N minus one negative pairs.
All right. And again, the similarity function is cosine similarity.
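
A minimal sketch of such a loss is given below, in the common softmax (InfoNCE-style) form for the image-to-text direction only. The exact formula on the slide is not in the transcript, so this form, the `temperature` parameter, and the function names are assumptions. The positive pair is included in the denominator, which is the variant described as being used for computational simplicity.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def contrastive_loss(image_embs, text_embs, temperature=0.1):
    """Average over images of -log softmax(similarity to the matching caption).

    Row i of the similarity matrix compares image i with every caption;
    the diagonal entry (i, i) is the positive pair, the rest are negatives.
    """
    n = len(image_embs)
    total = 0.0
    for i in range(n):
        logits = [cosine(image_embs[i], text_embs[j]) / temperature for j in range(n)]
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += -(logits[i] - log_denominator)
    return total / n


images = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = contrastive_loss(images, [[1.0, 0.0], [0.0, 1.0]])     # captions match
misaligned_loss = contrastive_loss(images, [[0.0, 1.0], [1.0, 0.0]])  # captions swapped
```

When each image sits next to its own caption the loss is close to zero, and it grows large when captions are swapped, which is the behavior the loss is meant to reward.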
Cool. So CLIP is very useful because now you can do something like zero-shot classification. Zero-shot in the sense that after you've trained this, you can give it a new image, like this thing that looks like a television studio. Previously, when people wanted to classify what's in an image, they had to train specific image classifiers that could only predict five classes or 10 classes. If you pre-train on ImageNet, you have a fixed, predefined number of classes. One of the biggest benefits of CLIP is that now you can do this idea of open-set classification. Any caption or any class that you want to classify your image into, you can just give as a list to the model. What the model would do is embed the image into the feature space, embed each of your potential classes or captions into the feature space, and take the dot product. So maybe one of these has a higher-similarity dot product and some have lower-similarity dot products, and it would essentially give you a normalized score. So it can classify the image into any categories that you want. It doesn't have to be a fixed, predefined set of a thousand image categories. All right. As you can see, it does really well on these zero-shot classification tasks.
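
The procedure just described (embed the image, embed each candidate class, take dot products, normalize) can be sketched as follows. The embeddings are hand-made placeholders standing in for the outputs of trained image and text encoders, and the softmax normalization is an assumption about how the "normalized score" is computed.

```python
import math


def softmax(scores):
    """Turn raw similarity scores into probabilities that sum to one."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_emb, class_embs):
    """Score an image embedding against an arbitrary list of class embeddings."""
    names = list(class_embs)
    scores = [sum(a * b for a, b in zip(image_emb, class_embs[n])) for n in names]
    return dict(zip(names, softmax(scores)))


# Hypothetical embeddings; the class list can be anything the user supplies.
image_emb = [0.9, 0.1, 0.4]
class_embs = {
    "television studio": [1.0, 0.0, 0.5],
    "dog": [0.0, 1.0, 0.0],
    "airplane": [0.1, 0.2, -0.9],
}
probs = zero_shot_classify(image_emb, class_embs)
best = max(probs, key=probs.get)
```

Swapping in a different dictionary of class embeddings changes the label set without any retraining, which is what makes the classification open-set.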
As I mentioned, you can think about the relationship between what alignment does using contrastive learning and what information it captures. We saw that modalities can have connected and unique information, shared and unique information.
You define shared information through the mutual information, and you can prove that contrastive learning essentially captures the mutual information. So it learns this and it will throw everything else away. That is something that you can actually prove, and it is both good and bad. It means that if you care about what is common, then these alignment-based contrastive methods are really great. You're guaranteed to keep this and throw away everything else. It's a very scalable method. But if you care about what is here and what is here, the information unique to each modality, then you're in trouble, because the more contrastive learning and the more alignment you do, the worse your performance will get. You're throwing away information that you actually might care about.
There are several ways of dealing with this. We have some work, and other folks also have some work, on a very easy modification. You do contrastive learning across your modalities, so you take your image and your caption and you do your contrastive learning. That keeps the information in the middle. If you want to keep what is unique in the image, then you take the image and you do, in some sense, image-only contrastive learning. There are ways of doing that. You take the image, you modify it a little bit, and then you do contrastive learning only on the images. And it should be intuitive that you also do this alignment-based contrastive learning only in text. So you take the caption, you modify it a little bit, and you do contrastive learning only in text. Then you can learn what is both shared and unique across your data.
All right. I'm just going to quickly go through continuous alignment. Most of these are references. Continuous alignment is harder because now you have continuous signals without discrete elements, and usually what these methods do is try to break your continuous data down into discrete parts. I'll leave some references for people here because I know some people are working with sensor data and videos. There are ways of, for example, taking continuous data and learning discrete boundaries within it. It's a well-studied problem in computer vision.
For time series data, some of my group is looking at change point detection, so they can automatically break apart your sensor data into discrete segments, and each of these segments should be semantically meaningful. For example, this is when the person was sleeping, this is when they were awake and exercising, this is when they're working, and so on. So there are ways of automatically breaking down continuous data into discrete segments, each of them with some semantic meaning, so you could align it with text, for example.
And finally, there's another class of methods which is very common for speech. You take continuous data, fix some sampling boundary, get chunks of continuous data, and then do clustering. Clustering is one way of turning something that is continuous into discrete embeddings. So maybe you have just three clusters, and then you can do your self-supervised learning with transformers to predict discrete tokens.
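
Here is a minimal sketch of the clustering step, using a small k-means on one scalar feature per fixed-size frame. The frame values, the one-dimensional feature, and the deterministic initialization are all assumptions made for illustration; real speech systems cluster high-dimensional frame features.

```python
def kmeans_1d(values, k, iters=20):
    """Tiny 1-D k-means with a deterministic, evenly spread initialization."""
    ordered = sorted(values)
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centroids[c]))
            clusters[nearest].append(v)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return centroids


def tokenize(values, centroids):
    """Map each continuous frame value to the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda c: abs(v - centroids[c]))
        for v in values
    ]


# One hypothetical feature value per fixed-length frame of a continuous signal.
frames = [0.1, 0.2, 0.15, 5.0, 5.1, 4.9, 9.8, 10.0, 10.1]
centroids = kmeans_1d(frames, k=3)
tokens = tokenize(frames, centroids)
```

The resulting cluster indices play the role of discrete tokens that a sequence model could then be trained to predict.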
All right. Okay. So I'll end it here. We looked at an introduction to multimodal AI and several key challenges: data being heterogeneous, there being some connected, shared information between modalities, and how they might interact to fuse and give some label that you care about. We looked at several challenges, and we also dived deep into alignment.
All right. And just a quick reminder again: reading assignments are due tomorrow in preparation for Thursday's discussion, and make sure you make progress on your project so that you can meet with myself and the TA every week. All right. Thanks, everyone.
Okay.