Lucas Beyer - Sigmoid Loss for Language Image Pre-Training
By Cohere
Summary
## Key takeaways
- **Sigmoid Loss (SigLIP) Outperforms Softmax**: Sigmoid loss, when applied to language-image pre-training, offers advantages over the traditional softmax loss. It performs better, especially at smaller batch sizes, and scales more efficiently, requiring less memory and compute. [24:51], [33:11]
- **Batch Size Benefits Diminish Above 32k**: While larger batch sizes generally benefit contrastive learning, experiments with SigLIP showed that the benefits plateau around 32,000. Increasing batch sizes further, even up to one million, yielded diminishing returns on key metrics. [32:42], [32:49]
- **SigLIP Enables Scalable Multilingual Models**: SigLIP's architecture allows for more effective training of multilingual models. Although initial results showed multilingual models lagging behind English-only ones, SigLIP provides a path towards better cross-lingual performance. [34:32], [42:01]
- **Captioning Models Learn Relationships Better**: Contrastive models like CLIP often struggle with understanding relationships between objects, behaving more like 'bag-of-words' detectors. Image captioning models, by contrast, are inherently trained to understand and generate sequential relationships, potentially leading to richer visual understanding. [55:33], [57:15]
- **SigLIP Achieves State-of-the-Art Results**: Models trained with SigLIP consistently outperform previous state-of-the-art image-text models across various sizes and benchmarks. This includes achieving 84.5% ImageNet zero-shot accuracy with a SigLiT model trained on just four TPUv4 chips in two days. [37:08], [02:45]
Topics Covered
- What does true "general visual understanding" mean for AI?
- Language as an API: The CLIP/ALIGN paradigm shift.
- Sigmoid vs. Softmax: Why SigLIP is more efficient.
- Batch Size Surprises: When more data doesn't help.
- Multilingual AI: Understanding culture beyond literal words.
Full Transcript
Hi everyone, thank you so much for joining us today. Today we have Lucas with us, who, if you are into computer vision, does not need any introduction, but I'll try to introduce him anyway: he's a senior staff researcher at Google DeepMind and author of some very impactful works such as Vision Transformers, and he has his PhD in robotics and computer vision. So I'll just hand it over to him. Thank you so much.

Yeah, thank you.
I was unsure what we do with the timing; the calendar event is for one hour. Is this a one-hour talk with questions throughout, or do we do a half-hour talk? (Yeah, as you're comfortable, no problem, but we have a full hour and can split it any way.) Okay, then let's do it so people can ask questions throughout, and if at some point we get stuck too long on one slide I'll just say okay, no more questions here, let's move on. And I'll keep track of the time as well.
Yeah, so thanks for the introduction. I will present SigLIP, but I actually want to present a little bit of the surrounding context. I like this title slide; we used it for ICCV, which was in Paris, which is the only reason there are two Eiffel Towers on it, and I didn't edit them out. Xiaohua is actually here at the office, but not with me. These are the authors of the SigLIP paper.
What is this AI assistant? Okay, then let's get started.
So as I said, I want to give a little bit of broader context before diving straight into the technical details of SigLIP. Some of you who have seen talks from me before may be familiar with my first few slides, but I still want to present them to get everybody to the same starting place. The main goal of me and the various people in my group, and I think of a lot of computer vision researchers in general, is to somehow get a model that is a good general visual representation. And why we want this is because we believe it is necessary for building models or machines or robots or whatever that can actually perform meaningful tasks in the real world. For example, for me concretely, I want that, at the latest by the end of my career, there exists a robot that any non-technical person can teach to do any task, and I believe that really good, robust, generalizable visual understanding of the surroundings is one of the requirements for that, though not the only one. So I work towards that.
Now, what does that mean concretely? This is my favorite slide, which I've used a million times, but I'll use it again. Let's play a little game together, and that will show you very clearly what I mean by general visual understanding. I show you these pictures and say these are class A, these are class B, and these are class C, and then I show you a new one, and just in your mind: is it class A, B, or C? I can probably already stop, and you have probably all made up your mind: yeah, this is class A. Another one, with things that are maybe a little less common in your life but that you may still have seen a few times: class A, class B, and class C, and then I show you a new one. This may be a little harder than the previous one, but I'm sure you all got by now that this is very likely class B, the basketball court in a satellite image. And then one that may be quite a lot harder, depending on whether you've seen this dataset before; probably some of you have never seen this image type in your life: class A and class B, and I show a new one. This may need you to think and wonder a little bit more, but usually by now, let's say, 80% of the people get it right that this is class A: it's three objects, no matter what, and class B was five objects, no matter what. So this is what I mean by general visual understanding. You come equipped with it, and however you got it I don't really care, but you have it: I show you a few pictures of new things that I can group in different ways, or I describe to you, this is blah, this is blah, this is blah, and you quickly get it without me having to explain much, just from a few examples, and then you generalize and make reasonable guesses on new images. So this is really the goal.
And then some time ago, a few years ago, with our colleagues we set out to tackle this, and the first thing you need to do to tackle something is to measure how far you are from it. So we designed a benchmark for this, which we call the Visual Task Adaptation Benchmark (VTAB); if you're more of an NLP person, it is very similar to GLUE, and it appeared around the same time, actually. Basically we say: you pre-train, or do whatever, to come up with your model that is supposed to have this general visual representation. Then there is this abstract landscape of all visual tasks that make sense (not complete noise as a task, for example, but something meaningful), and we sample from this landscape of, let's say, almost infinite sensible tasks. We sample a bunch of them; concretely, this means we got 19 concrete but very diverse visual tasks. Each of them comes with a very small training set, just like in the game we played, and a test set, and we tried to cover a very broad diversity of tasks. Then your model, your general visual representation, gets the chance to adapt on the small training set of each task, is tested on the test set, and you get a score. We just take the average of all of these, and that is the VTAB score. The assumption, or hypothesis, is that by pushing this average number up, we get a more and more general visual representation. Typically in practice this is done by pre-training in some way (supervised, self-supervised, whatever) and then fine-tuning, or transferring with adapters, to each of these tasks, but we don't prescribe exactly how it should be done. These things go by many names: upstream or pre-training, transfer, downstream, fine-tuning, and so on.
Now, this is the setting and the ultimate goal that we have, and we then spent a couple of years trying different approaches to get there. I don't want to give too many details, but basically the one that worked best, and also most efficiently, is simply large-scale supervised pre-training. Let's not talk through all of these graphs; the bottom line is that we had available internally a huge dataset called JFT, which back then had 300 million images, and there were a few public datasets like ImageNet with 1 million images, plus a much less known variant, the "full" ImageNet, with 14 million images. By now I think everybody just agrees, and that's good, but back then what we figured out is that if you pre-train a large model on a large dataset for a very long time, these three things together, then you get a really good model, a really general visual representation if you want. On our VTAB benchmark this topped everything, and we did try a lot of other approaches before: self-supervised, semi-supervised, all kinds of things, generative models too. And when you do these three things, you basically get benefits everywhere: you get much more robust on typical robustness benchmarks, much better at quickly recognizing new things, like the few-shot examples we had before, and also much better at fine-tuning on large datasets. Basically you get wins across the board. Okay, but this has one issue: you actually need this large labeled dataset. This was 300 million images labeled with, if I remember correctly, about 30,000 class labels, and it is Google-internal and there is no way we publish it. So really nobody else can use it, and I think even a couple of years after we published this, nobody else had any dataset like that, or at least didn't report it in any papers.
So in that sense it's a nice proof of existence, in the sense that if you can manage to have this, you can get great results. But practically speaking it's not that useful to the rest of the world, because the rest of the world doesn't have such data, and it seemed that over the course of a couple of years nobody stepped up to build such a dataset in public, for the public, which is reasonable because it is quite expensive and a ton of work. But then something really cool happened, really cool in two ways, which is that language became the API. What do I mean by this? Basically, CLIP and ALIGN happened.
I believe most of you know CLIP, but I still want to explain it to set the stage for SigLIP. CLIP and ALIGN are essentially the same thing, released at essentially the same time, by OpenAI and by Google; CLIP is much more popular because they open-sourced the models and Google didn't. What does it do? It basically changes the pre-training from a large set of labeled images to a large set of, let's say, not explicitly labeled images, just image-text pairs that you can get in whatever way, which may be much, much less expensive than requiring humans to hand-label a predefined set of classes. For instance, I think CLIP didn't say exactly how they got them, but the ALIGN paper said, for example, that they use images on the web together with the alt-text tag that comes with them: when you include an image in a web page, it's this IMG tag, the src is the path to the image, and then you can have an alt attribute with some string, which is there for accessibility, for screen readers and things like that, but also for search engines to understand the content of the image without needing computer vision. And this is nice: it's a text that is supposed to describe exactly what is in the image. The vast majority of images on the web don't have it, but a significant amount still do. Both of these models figure out a way to pre-train on that instead of pre-training in a classification way on labeled data. This not only has the benefit of much easier, more readily available data, it also has a huge effect on how you use the model, which I think is really cool. Basically, you can now use this model (and I'm going to explain in great detail how) and transfer it to all kinds of tasks without needing any small dataset for the task to fine-tune on, and without doing any fine-tuning or training in any way, by just literally spelling out the class names. That's it.
So yeah, these are, I think, two great things that came out of CLIP/ALIGN. Just to exemplify the change in the pre-training data: previously, basically all datasets in vision that had anything to do with pre-training were something along this line: a set of classes that somebody comes up with, be it an engineer at a company or a PhD student in a lab, but somebody comes up with the set of classes, the concepts they want the model to learn, and then collects a list of images for each of these classes. Versus what CLIP and ALIGN can train on, which is just random image-text pairs from the internet. These can be much, much worse, like this one, which is just a crappy "image and thumbnail for version blah blah blah" that says nothing about the image, or they can be quite precise, like "motorcycle front wheel" here, or this "Frankfurt airport skyline" plus a date. These are things that are more detailed than any class anybody would come up with when they sit down to write a list of classes. So you can see that if we are able to leverage this kind of data, our models could actually understand a lot more; the question is just how to leverage it.
Then, basically, what CLIP and ALIGN both do is this. You have a mini-batch of image-text pairs; here you have this image and the text that goes with it, "boat on a mountain lake with a lighthouse", and here we have three of them. The images you all send through an image model that takes an image and outputs a single vector, let's say a 512-dimensional vector for each image. Similarly with the text: you send each of the texts separately through the text encoder, and for each text you also get a vector of the same, say, 512 dimensions. And how you train this is simply to say, in the loss: I want the vectors of the image and text that belong to each other to be more similar than the vectors of images and texts that do not belong to each other. That's how you train it, and we'll go through that point in even more detail pretty soon.
Then, when you train the model in this way, you can later use it in a very nice way. If you have a new task, you basically describe the things you want to recognize by writing down the text. Let's go back to the example we had in the beginning, these flowers, these three classes. Now I don't call them A, B, C; I actually write "a picture of a pink primrose", "a picture of a pocket orchid", and "a picture of a daisy", and notice I don't need any images for this, I just define my task by writing it down. I send these strings through the text encoder and get my three text embedding vectors, and these embeddings are now the representatives of the classes, of the things I want to detect or recognize. Now I give you a new image that you have never seen before; you send it through the image encoder and get the corresponding image vector, and you can just compare this image vector to each of the three text vectors and get a score for each one. This is basically the score for how well the image matches each of these texts, and if you want to choose one you can just take the max, or take the softmax if you want probabilities and then pick the highest one. That's basically how you use it, or how you can build a classifier without even needing images to build it. So far so good?
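To make that concrete, here is a minimal NumPy sketch of the zero-shot recipe. The two encoder functions are random stand-ins for the real image and text towers (placeholder names of my own, not the released API); the point is just the shapes and the dot-product comparison.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 512  # shared embedding dimension

# Stand-ins for the trained encoders; in reality these are the image tower
# and the text tower of a CLIP/SigLIP model, both ending in an L2-normalized vector.
def encode_image(image):
    v = rng.standard_normal(DIM)
    return v / np.linalg.norm(v)

def encode_text(text):
    v = rng.standard_normal(DIM)
    return v / np.linalg.norm(v)

# 1) Define the task purely in text; no images are needed.
class_names = ["a picture of a pink primrose",
               "a picture of a pocket orchid",
               "a picture of a daisy"]
text_embs = np.stack([encode_text(t) for t in class_names])  # [3, DIM]

# 2) Embed a new image and compare it against every class text.
img_emb = encode_image(np.zeros((224, 224, 3)))               # [DIM]
scores = text_embs @ img_emb                                  # one similarity per class

# 3) Take the max (or softmax the scores if you want "probabilities").
print(class_names[int(np.argmax(scores))])
```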
Okay, no questions; then maybe this was all perfectly known to you all, and I apologize for taking the time.
Of course, as you may also know, it's not always quite that simple: you need to nicely decorate the text, and there is a whole prompt engineering that goes into it. It's better to write "a picture of" or "a flower called pink primrose" than just "pink primrose", but hopefully this is something we, as a community, will get past at some point. Now, that's the cool thing: in the setting I presented before, this whole step of adapting and needing a training set per task just goes away, and we just write down what we want as the new task. Okay, so that was CLIP, and I think somebody wrote a question, let me see if I can read it.
The question is: "Naively, in a way, CLIP is learning search rankings, since the data is curated from search by respective text descriptions?" The training data is not related to search. That is how people classically used to build datasets, right: for ImageNet and so on they came up with a list of classes, entered those classes into some image search engine, took the images they found, and said okay, these are my classes (I think this is the process behind at least ImageNet, plus maybe some sanity checking by human raters). But in this case, for CLIP and ALIGN, there is no such thing, because nobody came up with a list of things to search for in the first place; there is just, maybe by starting from some seed like Common Crawl or something like that, a large amount of image-text pairs from the web.

A related question: "Would it help to pose it as a listwise ranking problem with a ColBERT-like alignment module instead of a basic contrastive loss?" I actually don't know what a ColBERT-like alignment module is.
But about the softmax classification that CLIP does: let's go there now, I think it's even my next slide. Yeah, so here is the loss that CLIP uses in a bit more detail. It's a bidirectional softmax. For this text embedding, compare it to all the image embeddings and ask for the corresponding one to be rated higher than all the others; specifically, take the dot product of this text embedding with all image embeddings, take the softmax, and say the ground-truth label is one for the correct image and zero for all others, so it's kind of a classification task. And I believe (let's say I'm 90% sure) that by optimizing this classification task you are at the same time implicitly optimizing a ranking task. The thing is, this is just one direction, and there is an asymmetry, so the CLIP loss also does it in the other direction: take this image embedding, compare it against all the text embeddings, again take the softmax, and ask for the correct one to be the class, basically class one, and all others to be zero.
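As a rough sketch, this is roughly what that bidirectional softmax (InfoNCE-style) loss looks like in NumPy, assuming already L2-normalized embeddings and a fixed temperature (function names are mine); the global row and column sums mentioned a bit later in the talk show up as the two logsumexp reductions.

```python
import numpy as np

def logsumexp(x, axis):
    m = x.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))

def clip_softmax_loss(zimg, ztxt, t=100.0):
    """zimg, ztxt: [n, d] L2-normalized embeddings; row i of each forms a matching pair."""
    n = zimg.shape[0]
    logits = zimg @ ztxt.T * t                      # [n, n] similarity matrix
    # image -> text: softmax over each row (a sum over the whole row),
    # with the diagonal entry as the "correct class".
    log_p_img2txt = logits - logsumexp(logits, axis=1)
    # text -> image: softmax over each column (a second set of global sums).
    log_p_txt2img = logits - logsumexp(logits, axis=0)
    # Negative log-likelihood of the diagonal, averaged over both directions.
    return -(np.trace(log_p_img2txt) + np.trace(log_p_txt2img)) / (2 * n)
```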
Oh, there was a clarification: late interaction on the token embeddings instead of a single vector-to-vector comparison. Basically, if you do anything on top of these image and text embeddings, as opposed to them just being vectors on which you take a dot product, if you put some other neural-net module on any pair of them, it makes everything you want to do later more expensive, for example retrieval. Say you want to build an image retriever: you can have your huge database of, I don't know, 100 million images, embed them all, and store these embeddings in a database; then any user comes with a text query, you embed the text, and you can quickly compare it by taking dot products with all the image embeddings and retrieve the most relevant images that way. You could try to learn a small neural-net module on top of pairs of embeddings, or even more than that, to get slightly more precise rankings, but then at retrieval time you need to run all pairs through it, so it gets more expensive. You can probably perform a bit better; it's a trade-off.
All right, so this is what CLIP does, and I mentioned a couple of disadvantages of it. One I just mentioned: it's bidirectional, we need to take the softmax both ways. The next thing is that training a CLIP is usually done with a quite large batch size; the CLIP paper used a batch size of 32,000 across 7 or 800 GPUs, I don't remember, and for ALIGN I don't remember the specifics, but they also had a similarly large batch size, probably on TPU devices. And in the softmax you have the normalization, so you need to sum across whole rows in order to normalize, and across whole columns again, so you have these global sums going on just to compute the softmax; that is yet another disadvantage. And then it just intuitively feels like a maybe weird learning task, although I guess that depends on your taste. As I mentioned, in the direction from image to text it is kind of an image classification task with a list of classes that is made up on the fly, defined by whatever texts appear in the mini-batch, so it changes for every mini-batch; and at the same time there is the classification task the other way around, for each text, with the classes defined by the images that appear in the mini-batch. It can be argued whether this is a nice, magically regularizing thing, or whether, as it feels to me, it is just arbitrary and wrong; that's why I put a question mark there, I think it's a matter of opinion, or of results in the end. The main disadvantages are the first two: it's bidirectional, so you need to compute the softmax twice, and there are these global sums. In a small, single-device setting with a small batch, maybe 128, this seems a bit overblown, but as you scale up to a batch size of, say, 32,000, which is the default, then this matrix here and the operations on it actually make up the vast majority of your memory usage (sometimes more than the model itself) and of your time. So that's why we started digging into this and figuring out: is this really needed, or are there other ways that are simpler and maybe more scalable?
And that's how we got to SigLIP, which in hindsight is super simple: just change this from a softmax-based loss to a sigmoid-based loss. Meaning, instead of doing this kind of classification across the other modality, take each image-text pair, each entry in this similarity matrix, individually and look at it in isolation: this pair, this image here with this text here, do they match, yes or no? That's your task, your loss for this entry. Same for this one: do they match, yes or no? That's the loss for this entry. So there needs to be no global normalization; you don't even need this whole matrix to exist at once, which you do need if you want the global normalization; and there is no bidirectionality needed. It's just each entry in this matrix: do they match, yes or no? We found this works a little bit better, but above all it scales a lot better and requires much less memory and less compute for the loss. And because it's basically CLIP with the softmax exchanged for a sigmoid, we call it SigLIP. I just want to show that algorithm-wise, code-wise, it's actually super simple: if you take the pseudo-code from CLIP and turn it into SigLIP, it's really just changing to a log-sigmoid loss instead of the softmax. I think the code is even shorter than the pseudo-code in CLIP because it doesn't have the bidirectional part, but you do need to add a bias here, and we will talk a bit more about that later. Will we? Yeah, maybe.
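In NumPy-style pseudo-code, along the lines of the pseudo-code in the SigLIP paper, the change really is just the loss: t' and b are the learnable temperature and bias, and the label matrix is +1 on the diagonal and -1 everywhere else (a sketch, not the released training code).

```python
import numpy as np

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)) = -log(1 + exp(-x)).
    return -np.logaddexp(0.0, -x)

def siglip_loss(zimg, ztxt, t_prime, b):
    """zimg, ztxt: [n, d] L2-normalized embeddings; t_prime, b: learnable scalars."""
    n = zimg.shape[0]
    t = np.exp(t_prime)                  # temperature, kept positive via log-parameterization
    logits = zimg @ ztxt.T * t + b       # [n, n] pairwise scores, plus the learnable bias
    labels = 2.0 * np.eye(n) - 1.0       # +1 for matching pairs, -1 for every other pair
    # Each entry is an independent binary "do these match?" decision:
    # no row or column sums, and no second direction.
    return -np.sum(log_sigmoid(labels * logits)) / n
```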
Now, one thing we can do to make it even more efficient and even more scalable, which becomes very simple with the sigmoid because we don't have a global normalization sum to take care of, as we would in the softmax, is what we call the chunked sigmoid loss. It works as follows. Say you have three devices and a batch size of 12, so on each device (a device could be a GPU or a TPU, it doesn't really matter) we have four images and their corresponding texts, and this is the global view of the similarity matrix; we want to compute the loss on each of these entries, and that is then the global loss. The naive and super-inefficient way to do this, but the way that is most obvious when you do a softmax, is to just gather all entries from all devices onto one device and do the softmax computation, the loss, and the sums there. However, once we realize that in the sigmoid case the loss of each entry in this matrix is completely independent, we can come up with a much nicer algorithm. Basically, first step: each device computes the loss on the entries available on itself, so no communication is needed; we compute the loss on this subset for this device, on this subset for this device, and on this subset for this device.
Then, and this is a little bit hard to illustrate (we tried our best in this picture by moving the device over), in reality you shift the texts: each device gives its text embeddings to the device to its left. That is only a little communication, and we end up in the state shown here: device one now has texts 5, 6, 7, 8, which device two previously had, and so on. So each device now sees, without the image embeddings ever moving, a different part of this matrix and can compute the loss on that part, or chunk (that's why we call it chunked). Then again you shift, move the texts to the device on your left, and compute the loss on the next chunk, always accumulating the loss in the same buffer, because at the end of the day, once we have the loss of all entries, we need the global sum of everything, and we can get that by just accumulating. And that's it. That way the whole contrastive loss basically no longer takes up any considerable amount of memory on our devices, which was the major thing holding back the scaling of CLIP-like models before.
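Here is a single-process sketch of the chunked idea, with array slices standing in for devices (the real implementation uses collective permutes between accelerators; the function and variable names here are mine, not the paper's code).

```python
import numpy as np

def log_sigmoid(x):
    return -np.logaddexp(0.0, -x)

def chunked_siglip_loss(zimg, ztxt, t, b, num_devices):
    """Each 'device' keeps its own image chunk, computes the loss against the
    text chunk it currently holds, then passes the texts to its neighbour."""
    n = zimg.shape[0]
    per_dev = n // num_devices
    img_chunks = [zimg[i * per_dev:(i + 1) * per_dev] for i in range(num_devices)]
    txt_chunks = [ztxt[i * per_dev:(i + 1) * per_dev] for i in range(num_devices)]

    total = 0.0
    for step in range(num_devices):
        for d in range(num_devices):
            logits = img_chunks[d] @ txt_chunks[d].T * t + b
            # Matching pairs only exist while a device still holds its *own*
            # texts (step 0); every shifted chunk is all negatives.
            labels = 2.0 * np.eye(per_dev) - 1.0 if step == 0 else -np.ones((per_dev, per_dev))
            total += -np.sum(log_sigmoid(labels * logits))   # accumulate into one buffer
        # "Give your text embeddings to the device to the left of you."
        txt_chunks = txt_chunks[1:] + txt_chunks[:1]
    return total / n
```

Run on the same embeddings, this should give the same value as the full-matrix loss above, just computed block by block, with only the text chunks ever moving between devices.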
There was always this intuition that for contrastive models, most examples in the mini-batch are really trivial. Take these three examples: if one of the texts just has the word "dog" in it, it's a no-brainer to figure out which of the images it belongs to. In the initial phase of learning maybe that's good, it's not too difficult, but once the model has learned the simple, trivial things, it would be better to have much more difficult cases in the mini-batch, say four or five different breeds of dog in the same mini-batch, so the model has to differentiate between them. Because suppose that throughout training your mini-batch always contains just one picture of a dog, maybe of different breeds, and the text is always the breed of the dog (it's a bad example because I don't know dog breeds off the top of my head, but you can think of some). Then the model would just learn to associate the different breed names it saw in the text with the concept of dog; there is no reason to differentiate them if there is only ever one dog visible per mini-batch. So it would learn the concept of dog and the names of dog breeds that exist, but not how to actually tell those breeds apart. That's why it has always been the aim, and the intuition, that for any kind of contrastive learning we need to scale up the batch size to get more difficult examples, or hard negatives, in there, so the model learns more fine-grained nuance.
And with this change we can now actually test that hypothesis, because the size of this matrix, the batch size, no longer really influences our memory usage; we can scale to an almost arbitrary batch size using the chunked sigmoid implementation. So we did that, and that was actually the main aim of the project: we wanted to scale to outrageously large batch sizes and see how much better CLIP-style models get by doing so. So we kept increasing the batch size more and more. We have two different settings, and I will say a bit more about them later, but we went basically up to a batch size of 1 million. What we found is that in both settings, left and right, a batch size of around 32,000 seems to be close to the best, and beyond that, much larger batch sizes don't really give you much benefit anymore, at least according to the few metrics we actively track, like ImageNet zero-shot classification, COCO retrieval, and things like that. So that was one learning from this.
But the other thing you can see here is that between the sigmoid loss and the softmax loss, almost throughout training the sigmoid loss performs a bit better than the softmax loss, and the nice thing is that going to smaller batch sizes, the difference gets larger. So if you cannot use a large batch size, maybe because you don't have many devices at the same time, then the sigmoid is definitely the better loss to use than the softmax.
Then there is maybe a negative result, one thing we were hoping for. Oh wait, a question about hard negative mining: yeah, I will come to that towards the end. Maybe one negative result: in our group we also really try to build multilingual models and to push them as far as we can. I have a slide on that later, but basically the original CLIP and almost all published CLIP models are usually trained on English-only subsets of the data and really perform well only on English. The way I described how we can train models on just data from the web, without needing to predefine lists of classes, at first sounds great: we can get data from the whole web, which means all languages, and basically for free have models that perform well in all languages. But somehow people don't do this and stick to training English-only CLIP models, including ourselves. Some of us always push to train on multilingual data and evaluate in multilingual ways, but the models trained on multilingual data are always behind those trained on English data on the standard benchmarks. Of course, if you evaluate a multilingual model on, say, a French benchmark against an English-only model, the English-only model will be crap; but on the standard benchmarks, which are mostly English, like ImageNet, COCO captions, and so on, the multilingual model is worse than the English one. We had a hypothesis that if we could scale up the batch size a lot, maybe that would help the multilingual model catch up. The intuition: if you have a fixed batch size of 32,000 examples and only English texts and images in there, there is no easy shortcut, you need to match them. But say you have a mixture of languages and images in there, still a lot of English but also a bunch of Chinese images and texts; then there could be a shortcut: the model matches Chinese characters with typically Chinese-looking images, images of cities or traditional things or people even. At the very least such a shortcut would make the effective batch be two smaller batches, the English one and the Chinese one, because it's trivial to separate them. So that's why we thought scaling up the batch size a lot might make the multilingual model a lot better. Unfortunately that turns out not to be the case, which is what this plot shows: it's the same, around 32,000 it hits the peak accuracy and then it even goes down a bit. So we were quite disappointed by this result.
All right. This is a huge table, but it basically says that across all model sizes, Base with different sequence lengths or resolutions, and Large models, we train SigLIP models that are the best by far. For example, the original CLIP from OpenAI gets 76% ImageNet validation accuracy, and we also have COCO numbers, but basically SigLIP is much better than the other models across the sizes, and we were actually able to open-source all of these SigLIP models, so they are just available; you can download and use them. We also trained a larger model, the So400m one: 400 million parameters is a size, or rather a shape, of Vision Transformer that we developed in a separate paper where we looked at whether scaling laws can be used to predict optimal Transformer shapes (shape meaning the detailed size: not just the number of parameters, but the number of layers, the model dimension, and the MLP dimension). This is much smaller than the g or H or even e models, but the image-text model we trained at that shape performs better than even those larger ones. So, a big table with many numbers, just to say: if you need an image-text model that is really good, just download SigLIP and try it out.
And we don't just look at benchmarks; we also looked at a bunch of difficult pictures that we picked ourselves. The first ones are from the internet, so we cannot be sure the model has never seen them; they are not part of standard benchmarks, but they are out there, so the model may have seen them. This one is quite famous from an OpenAI blog post a long time ago about fooling models, but it actually works pretty well, as you can see here with the text. And keep in mind that in the usual demos of contrastive image-text models, the score given to an image-text pair depends on the other texts that are present, because people usually take the softmax; but here, for SigLIP, these are just individual pairwise scores. So this 100% match between this image and the text "an apple with a note saying iPod" is completely independent of all the other texts. So basically it works quite well on difficult examples, and here, "a cold drink on a hot day" versus "a hot drink on a cold day", it also works quite well. If you search for it on GitHub there is a Colab; this table is just a straight screenshot from a Colab that we published, which you can play around with, run, and try out different examples. The images on the right are images of ourselves that were definitely never on the internet before, and it still works really well on them. I'm not saying OpenAI's CLIP would not work on them, it probably also does reasonably well, but this was our sanity check that it also works on never-before-seen things. It can also reasonably well understand text in the image: here we held up signs saying SigLIP, and if we write "a photo of the SigLIP authors" it is very happy about it. And this one is interesting: my colleague Basil and I wore t-shirts that both had to do with coffee. Mine says something like "current mood: need coffee", and Basil's shows the caffeine molecule, and I think it has the text "caffeine" on it. It fires really strongly on the text "a photo of two guys in need of caffeine" and not at all on "in need of water". Also, one general thing with these kinds of models: the more precise your text is, the higher the matching score; if you are very vague with the text, it gives a reasonably small score. Here is an example: just the word "cow", no prompt, no "a picture of", nothing, and it gives something like 35% to these two pictures. But if you are much more precise, like "a cow in a tuxedo", boom, almost 100% for this one and almost 0% for that one, because it has no tuxedo. That's just a general thing to know when working with these kinds of models.
Okay, then I also want to say again what I said previously about multilinguality. I think this is the first CLIP-like image-text model that has been open-sourced and is pretty good at multilinguality. We also have a second Colab, linked somewhere in the GitHub or in the first Colab, I don't remember, that tests this, and we tried to test some cultural things. I think most of them you would only know if you're Chinese, but "ants climbing a tree", if you literally translate it to Chinese (which I cannot, but it is these two texts), is actually the name of a dish. If you write it in English, a person wouldn't know that "ants climbing a tree" refers to a dish, so it only fires on the literal picture, not on the dish; but if you write it in Chinese, it totally knows what you're talking about, so it fires on the dish and not on the literal ants climbing a tree. If you have more examples of this kind of thing in mind, please shoot us an email with them; we're actually looking for them, to investigate more.

How many languages was it trained on, or does it support? Hard to say. In the training set there are more than 100 languages; I think it's maybe not the exact same, but a very similar dataset to the one in the PaLI paper, and in the first PaLI paper we give some statistics; I think we said it has more than 100 detected languages. That doesn't mean it's equally good on all of them; on the rarer languages I would expect it to be worse. But basically try any language and see; there is a good chance it understands, at least to a reasonable degree, whatever language you think of. Actually, I should ask my Swiss colleagues to try Swiss German too. And definitely, especially for the multilingual/multicultural aspect, just shoot us an email or a tweet or whatever about your experience; we're especially curious about that, and given that we don't cover all cultures among our colleagues, we also cannot think of all the different ideas or ways to test it.
All right. Another thing that happened a bit later: you may have seen the PaLI-3 paper. I forgot to give context: PaLI is an image-text model that can generate text, from our group and an extended group of colleagues across Google. It basically takes image and text as input and produces text as output, and it is trained in a multi-task way, so it can do things like captioning images, but also answering questions about images, doing OCR on images; if the images are documents it can answer questions about them or read parts of them out, and things like that. In the first two versions and papers of PaLI, the image embedding model was always based on the best image embedding model we had at the time, which was always pre-trained in the classification way on the internal JFT data. The first PaLI model came out and was state-of-the-art across a lot of these combined image-text tasks; then the second one, called PaLI-X, came out, essentially the same story but with more tasks, more benchmarks, and especially a significantly scaled-up image and text model, and it was again better across the board. On the other hand, some other labs also had image-text models based on a CLIP encoder, I think mostly for the image model, but they were worse compared to PaLI.
So it may have given the impression that even for these image-text models, pre-training the image encoder in a classification way on a large private dataset like JFT may just be better. But there was never an explicit comparison anywhere, only implicit ones, comparing, say, PaLI to some other existing model, let's say Florence from Microsoft (I'm just thinking of a random one), and there are many, many things that change across these models, so you cannot really draw such a conclusion. So for PaLI-3 we did an exact apples-to-apples comparison: across different scales, we compared differently pre-trained image encoders plugged into the PaLI model. The PaLI model is an image encoder, a text encoder, and a text decoder; so for example an image encoder plus a T5 encoder-decoder, where you concatenate the embedded image tokens with the text tokens, send them through the T5 encoder, and the T5 decoder generates the answer. That was the first version; the next version used a UL2 text encoder-decoder instead of T5, which is just a better one, and that's it. Basically, what we found is that almost across the board it's better to use the SigLIP-pretrained image encoder, except for linear classification probes, which is an important metric that many people, including ourselves, often use to judge model quality. That means taking the image model, embedding a bunch of images to get their representations, and then doing few-shot classification with just a linear head on top (these are sometimes called linear probes). Across eight different tasks, the classification-pretrained model was always significantly better there. So, just to say: if you plug the model into a broader model and do more tasks than just classification, the result may be very different, and, for example, SigLIP may be much better, which is the case here.
Then, I think this is the last one. Okay, let's see about the time; 10 minutes, okay. So, one thing: there was a question earlier about hard negative mining and so on. We did a little investigation of this, I would say a preliminary indicator, in the SigLIP paper. The question is roughly: if we could somehow get more difficult negatives into the mini-batch, would that be helpful? With SigLIP, where each entry in this final loss matrix is independent of the rest, we can relatively easily run an experiment to check this hypothesis, by taking the big matrix but computing the loss only on a subset of the entries. For example, okay, let's see.
This point here is the score we get with regular contrastive training in some setting that we fix. This point here, the blue one, is what we get with the same training setting, but where in the loss matrix we mask out a random half of the entries. So we go through the same number of image-text pairs, or training examples, but only half of the pair combinations actually get a loss; then we get this blue point. And then this is a 16th, and then a 60th, and so on, so it gets worse and worse. This is a baseline, not interesting in itself. Now, what if we keep only the hardest half, i.e. mask out the easiest half, the half with the lowest loss, the pairs the model already handles, and don't put a loss on those? Then we get this orange line. This is basically: what if we could train with half the batch size, but all of them harder ones? It is definitely better than the random masking; you need to compare this orange point with the blue point. But in principle we have only seen half the number of pairs. Keep in mind there are two distinct quantities here: the number of examples seen, which, as you iterate through training with, say, a batch size of 32,000, grows by 32,000 examples per step; and the number of pairs seen, which grows by 32,000 squared per step, because in one batch you get a loss on all image-text combinations. So while here we did equally many steps with the same batch size, and hence saw equally many examples as this point, we only got a loss on, meaning "saw", half the number of pairs. You could say maybe that's why it didn't actually improve the results, so what if we match the number of pairs seen? That is the pink curve: it trained for double the steps of the orange and blue ones, but in each step the easier half of the pairs in the mini-batch is masked out. So the pink curve roughly shows you what happens if you could fill your batch with only harder examples, and that would improve things a little bit. However, how to fill your batch with only harder examples without it being more expensive than random examples, we don't know, and we don't have a solution for that in this paper.
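A sketch of that masking experiment, under the same assumptions as the loss code above, and with my own choice to always keep the positives and only mask negatives, which the talk does not spell out:

```python
import numpy as np

def masked_siglip_loss(zimg, ztxt, t, b, mode="hardest_half"):
    """Compute the per-pair sigmoid losses, then drop half of the negatives:
    either a random half or the easiest (lowest-loss) half."""
    n = zimg.shape[0]
    logits = zimg @ ztxt.T * t + b
    labels = 2.0 * np.eye(n) - 1.0
    per_pair = np.logaddexp(0.0, -labels * logits)     # loss of each (image, text) entry

    neg = labels < 0
    if mode == "random_half":
        keep = np.random.default_rng(0).random((n, n)) < 0.5
    else:  # "hardest_half": keep the negatives the model still gets most wrong
        keep = per_pair >= np.median(per_pair[neg])
    keep = keep | ~neg                                 # positives are always kept here
    return np.sum(np.where(keep, per_pair, 0.0)) / n
```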
So that is the situation with hard examples. Now, very quickly, two more things, and then I want to leave at least some time for a couple of questions. One is that across the different types of artificial noise we introduce, the sigmoid seems to be consistently more robust, and this is consistent with a previous paper of ours where, in classification, we replaced the softmax with a sigmoid loss and found the same thing: it was more robust to noise.
And let's forget about this one, but maybe this one can be useful in general; we also put it in the paper, and it's not really related to SigLIP. If you have a loss curve that looks like this, spiky, spiky, spiky, but doesn't really die, and you're using Adam, this can be fixed in most cases (not all) by reducing Adam's beta2 from 0.999 to 0.95, and then you get something like the green curve, which is much nicer.
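Concretely, that is just the second entry of Adam's betas; for example in PyTorch (a generic illustration, not the actual training setup from the paper):

```python
import torch

model = torch.nn.Linear(512, 512)  # any model

# Adam's default betas are (0.9, 0.999); the suggestion is to shorten the
# second-moment horizon to 0.95 when the loss is spiky but keeps recovering.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.95))
```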
And I think I'll stop here; you can find the stuff in the Colab, and maybe we can have the last five minutes for questions, if you write them in the chat. What are the next directions you're most interested in?
Well, one thing is the PaLI-3 work I mentioned, where we plug this into a more general model that does all kinds of things; that's interesting. Also, as I mentioned a couple of times, I'm really interested in making as good a multilingual/multicultural model as possible, and I think we only scratched the surface with this Colab that we made, so I'm quite interested in that. No more questions? People need time to write; you can also, if you want, just raise your hand, unmute, and ask. There was another question: if you have a second, can you talk about the captioning paper? Oh yeah, I guess, if there is no other question.
So, there is one issue with all contrastive models, including SigLIP. (Why is the sigmoid more robust than the softmax? We only have a vague intuition; we don't really know why.) Remember what I described with the example of dogs and breeds; there is actually a simpler example, with relationships. Say I have, in the mini-batch, the image and the text of a cat sitting left of a dog, and no other images of cats or dogs in the mini-batch, and no other texts containing cat or dog. Then the model, to solve the loss, doesn't actually need to understand in detail that a cat is sitting to the left of a dog; it just needs to check: oh, there's a cat, and this text says something about a cat, so that's the matching pair, and otherwise not. It doesn't need to understand the "left of", and it would only need to understand the "left of" if the exact opposite happened an equal amount of the time, ideally even in the same mini-batch, but that is just not realistic, even with insanely large batch sizes. There have been a few papers, research, and benchmarks showing this: these contrastive models mostly behave like a bag-of-words thing. Bag-of-words is a bit exaggerated, but it's more like detecting individual things in the image, not necessarily their relationships to each other. And this is just inherent in the loss, in how they are trained.
trained and me and another few
colleagues thought about this
and how would we train a model that
doesn't have this and what would be the
simplest possible way to train a model
that doesn't have this because we always
like simple things that are scalable and
eventually convert it to a captioning
model uh should nail this or should be
able to learn this so which really is
for each image generate the caption that
belongs to it um because then it needs
to generate each token in the text like
to give each token in the text higher
probability than anything else possible
like then for for this example of a cat
sitting left left of a dog it needs to
give left higher probability than
everything else including right So
eventually it needs to learn this kind
of things um and we wrote a paper about
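The training objective of such a captioner is ordinary teacher-forced next-token prediction conditioned on the image; a shape-only NumPy sketch with stand-in logits rather than a real decoder:

```python
import numpy as np

def captioning_loss(token_logits, caption_ids):
    """token_logits: [T, V] next-token scores from a decoder conditioned on the
    image; caption_ids: [T] ground-truth caption token ids, e.g. for
    "a cat sitting left of a dog". Every token, including "left", must get
    higher probability than all alternatives, including "right"."""
    m = token_logits.max(axis=-1, keepdims=True)
    log_probs = token_logits - (m + np.log(np.exp(token_logits - m).sum(axis=-1, keepdims=True)))
    return -np.mean(log_probs[np.arange(len(caption_ids)), caption_ids])

# Shape-only smoke test: 6 caption tokens, vocabulary of 1000.
rng = np.random.default_rng(0)
print(captioning_loss(rng.standard_normal((6, 1000)), rng.integers(0, 1000, size=6)))
```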
We wrote a paper about this, which was called something like "Image Captioners Are Scalable Vision Learners Too". We call the model CapPa because it's a captioner, but we also had a twist on it with a parallel decoder; that doesn't matter too much, what matters is that it's a captioner model.
Actually, on the earlier question about what I'm most interested in: it's also to push this captioner pre-training much further. Maybe the last question, from the chat: if every image-text pair is independent in the loss, why does a very big batch size still help so much, apart from accumulating the loss across devices and adding it up at the end?
Yeah, this is the thing I tried to, hopefully, explain well: the difference between examples seen and pairs seen. If you do, say, 10 steps with batch size 32,000, you will see 32,000 squared times 10 pair combinations that you learn from, whereas if you do the 10 steps with a smaller batch size... okay, I cannot make up the numbers on the spot, but if you go through the same number of examples, the number of pairs you go through is different depending on the batch size, and with a larger batch size, because the number of pairs generated is the square of the batch size, you still go through a much higher number of pairs. So that's the reason why a large enough batch size still helps.
helps I this was a little bit difficult
for me to explain I hope hope I did
reasonable
job okay and I do need to go yeah I
think we can end the session thank you
so much for the
talk thank you for having
me
Bye-bye.