Lucas Beyer - Sigmoid Loss for Language Image Pre-Training
By Cohere
Summary
## Key takeaways
- **Sigmoid Loss (SigLIP) Outperforms Softmax**: Sigmoid loss, when applied to language-image pre-training, offers advantages over the traditional softmax loss. It performs better, especially at smaller batch sizes, and scales more efficiently, requiring less memory and compute. [24:51], [33:11]
- **Batch Size Benefits Diminish Above 32k**: While larger batch sizes generally benefit contrastive learning, experiments with SigLIP showed that the benefits plateau around 32,000. Increasing batch sizes further, even up to one million, yielded diminishing returns on key metrics. [32:42], [32:49]
- **SigLIP Enables Scalable Multilingual Models**: SigLIP's architecture allows for more effective training of multilingual models. Although initial results showed multilingual models lagging behind English-only ones, SigLIP provides a path towards better cross-lingual performance. [34:32], [42:01]
- **Captioning Models Learn Relationships Better**: Contrastive models like CLIP often struggle with understanding relationships between objects, behaving more like 'bag-of-words' detectors. Image captioning models, by contrast, are inherently trained to understand and generate sequential relationships, potentially leading to richer visual understanding. [55:33], [57:15]
- **SigLIP Achieves State-of-the-Art Results**: Models trained with SigLIP consistently outperform previous state-of-the-art image-text models across various sizes and benchmarks. This includes achieving 84.5% ImageNet zero-shot accuracy with a SigLiT model trained on just four TPUv4 chips in two days. [37:08], [02:45]
Topics Covered
- What does true "general visual understanding" mean for AI?
- Language as an API: The CLIP/ALIGN paradigm shift.
- Sigmoid vs. Softmax: Why SigLIP is more efficient.
- Batch Size Surprises: When more data doesn't help.
- Multilingual AI: Understanding culture beyond literal words.
Full Transcript
Hi everyone, thank you so much for joining us today. Today we have Lucas with us, who, if you are into computer vision, does not need any introduction, but I'll try to introduce him anyway: he's a senior staff researcher at Google DeepMind and author of some very impactful works such as Vision Transformers, and he has his PhD in robotics and computer vision. So I'll just hand it over to him. Thank you so much.

Yeah, thank you.
I was unsure what we do with the timing; the calendar event is for one hour. Is this a one-hour talk with questions throughout, or do we do a half-hour talk? (Yeah, as you're comfortable, no problem, but we have a full hour and can split it any way.) Okay, then let's do it so people can ask questions throughout, and if at some point we get stuck too long on one slide I'll just say okay, no more questions here, let's move on. And I'll keep track of the time as well.
Yeah, so thanks for the introduction. I will present SigLIP, but I actually want to present a little bit of the surrounding context. I like this title slide; we used it for ICCV, which was in Paris, which is the only reason there are two Eiffel Towers on it, and I didn't edit them out. Xiaohua is actually here at the office, but not with me. These are the authors of the SigLIP paper.
What is this AI assistant? Okay, then let's get started.
So as I said, I want to give a little bit of broader context before diving straight into the technical details of SigLIP. Some of you who have seen talks from me before may be familiar with my first few slides, but I still want to present them to get everybody to the same starting place. The main goal of me and the various people in my group, and I think of a lot of computer vision researchers in general, is to somehow get a model that is a good general visual representation. And why we want this is because we believe it is necessary for building models or machines or robots or whatever that can actually perform meaningful tasks in the real world. For example, for me concretely, I want that, at the latest by the end of my career, there exists a robot that any non-technical person can teach to do any task, and I believe that really good, robust, generalizable visual understanding of the surroundings is one of the requirements for that, though not the only one. So I work towards that.
Now, what does that mean concretely? This is my favorite slide, which I've used a million times, but I'll use it again. Let's play a little game together, and that will show you very clearly what I mean by general visual understanding. I show you these pictures and say these are class A, these are class B, and these are class C, and then I show you a new one, and just in your mind: is it class A, B, or C? I can probably already stop, and you have probably all made up your mind: yeah, this is class A. Another one, with things that are maybe a little less common in your life but that you may still have seen a few times: class A, class B, and class C, and then I show you a new one. This may be a little harder than the previous one, but I'm sure you all got by now that this is very likely class B, the basketball court in a satellite image. And then one that may be quite a lot harder, depending on whether you've seen this dataset before; probably some of you have never seen this image type in your life: class A and class B, and I show a new one. This may need you to think and wonder a little bit more, but usually by now, let's say, 80% of the people get it right that this is class A: it's three objects, no matter what, and class B was five objects, no matter what. So this is what I mean by general visual understanding. You come equipped with it, and however you got it I don't really care, but you have it: I show you a few pictures of new things that I can group in different ways, or I describe to you, this is blah, this is blah, this is blah, and you quickly get it without me having to explain much, just from a few examples, and then you generalize and make reasonable guesses on new images. So this is really the goal.
And then some time ago, a few years ago, with our colleagues we set out to tackle this, and the first thing you need to do to tackle something is to measure how far you are from it. So we designed a benchmark for this, which we call the Visual Task Adaptation Benchmark (VTAB); if you're more of an NLP person, it is very similar to GLUE, and it appeared around the same time, actually. Basically we say: you pre-train, or do whatever, to come up with your model that is supposed to have this general visual representation. Then there is this abstract landscape of all visual tasks that make sense (not complete noise as a task, for example, but something meaningful), and we sample from this landscape of, let's say, almost infinite sensible tasks. We sample a bunch of them; concretely, this means we got 19 concrete but very diverse visual tasks. Each of them comes with a very small training set, just like in the game we played, and a test set, and we tried to cover a very broad diversity of tasks. Then your model, your general visual representation, gets the chance to adapt on the small training set of each task, is tested on the test set, and you get a score. We just take the average of all of these, and that is the VTAB score. The assumption, or hypothesis, is that by pushing this average number up, we get a more and more general visual representation. Typically in practice this is done by pre-training in some way (supervised, self-supervised, whatever) and then fine-tuning, or transferring with adapters, to each of these tasks, but we don't prescribe exactly how it should be done. These things go by many names: upstream or pre-training, transfer, downstream, fine-tuning, and so on.
Now, this is the setting and the ultimate goal that we have, and we then spent a couple of years trying different approaches to get there. I don't want to give too many details, but basically the one that worked best, and also most efficiently, is simply large-scale supervised pre-training. Let's not talk through all of these graphs; the bottom line is that we had available internally a huge dataset called JFT, which back then had 300 million images, and there were a few public datasets like ImageNet with 1 million images, plus a much less known variant, the "full" ImageNet, with 14 million images. By now I think everybody just agrees, and that's good, but back then what we figured out is that if you pre-train a large model on a large dataset for a very long time, these three things together, then you get a really good model, a really general visual representation if you want. On our VTAB benchmark this topped everything, and we did try a lot of other approaches before: self-supervised, semi-supervised, all kinds of things, generative models too. And when you do these three things, you basically get benefits everywhere: you get much more robust on typical robustness benchmarks, much better at quickly recognizing new things, like the few-shot examples we had before, and also much better at fine-tuning on large datasets. Basically you get wins across the board. Okay, but this has one issue: you actually need this large labeled dataset. This was 300 million images labeled with, if I remember correctly, about 30,000 class labels, and it is Google-internal and there is no way we publish it. So really nobody else can use it, and I think even a couple of years after we published this, nobody else had any dataset like that, or at least didn't report it in any papers.
So in that sense it's a nice proof of existence, in the sense that if you can manage to have this, you can get great results. But practically speaking it's not that useful to the rest of the world, because the rest of the world doesn't have such data, and it seemed that over the course of a couple of years nobody stepped up to build such a dataset in public, for the public, which is reasonable because it is quite expensive and a ton of work. But then something really cool happened, really cool in two ways, which is that language became the API. What do I mean by this? Basically, CLIP and ALIGN happened.
I believe most of you know CLIP, but I still want to explain it to set the stage for SigLIP. CLIP and ALIGN are essentially the same thing, released at essentially the same time, by OpenAI and by Google; CLIP is much more popular because they open-sourced the models and Google didn't. What does it do? It basically changes the pre-training from a large set of labeled images to a large set of, let's say, not explicitly labeled images, just image-text pairs that you can get in whatever way, which may be much, much less expensive than requiring humans to hand-label a predefined set of classes. For instance, I think CLIP didn't say exactly how they got them, but the ALIGN paper said, for example, that they use images on the web together with the alt-text tag that comes with them: when you include an image in a web page, it's this IMG tag, the src is the path to the image, and then you can have an alt attribute with some string, which is there for accessibility, for screen readers and things like that, but also for search engines to understand the content of the image without needing computer vision. And this is nice: it's a text that is supposed to describe exactly what is in the image. The vast majority of images on the web don't have it, but a significant amount still do. Both of these models figure out a way to pre-train on that instead of pre-training in a classification way on labeled data. This not only has the benefit of much easier, more readily available data, it also has a huge effect on how you use the model, which I think is really cool. Basically, you can now use this model (and I'm going to explain in great detail how) and transfer it to all kinds of tasks without needing any small dataset for the task to fine-tune on, and without doing any fine-tuning or training in any way, by just literally spelling out the class names. That's it.
So yeah, these are, I think, two great things that came out of CLIP/ALIGN. Just to exemplify the change in the pre-training data: previously, basically all datasets in vision that had anything to do with pre-training were something along this line: a set of classes that somebody comes up with, be it an engineer at a company or a PhD student in a lab, but somebody comes up with the set of classes, the concepts they want the model to learn, and then collects a list of images for each of these classes. Versus what CLIP and ALIGN can train on, which is just random image-text pairs from the internet. These can be much, much worse, like this one, which is just a crappy "image and thumbnail for version blah blah blah" that says nothing about the image, or they can be quite precise, like "motorcycle front wheel" here, or this "Frankfurt airport skyline" plus a date. These are things that are more detailed than any class anybody would come up with when they sit down to write a list of classes. So you can see that if we are able to leverage this kind of data, our models could actually understand a lot more; the question is just how to leverage it.
Then, basically, what CLIP and ALIGN both do is this. You have a mini-batch of image-text pairs; here you have this image and the text that goes with it, "boat on a mountain lake with a lighthouse", and here we have three of them. The images you all send through an image model that takes an image and outputs a single vector, let's say a 512-dimensional vector for each image. Similarly with the text: you send each of the texts separately through the text encoder, and for each text you also get a vector of the same, say, 512 dimensions. And how you train this is simply to say, in the loss: I want the vectors of the image and text that belong to each other to be more similar than the vectors of images and texts that do not belong to each other. That's how you train it, and we'll go through that point in even more detail pretty soon.
Then, when you train the model in this way, you can later use it in a very nice way. If you have a new task, you basically describe the things you want to recognize by writing down the text. Let's go back to the example we had in the beginning, these flowers, these three classes. Now I don't call them A, B, C; I actually write "a picture of a pink primrose", "a picture of a pocket orchid", and "a picture of a daisy", and notice I don't need any images for this, I just define my task by writing it down. I send these strings through the text encoder and get my three text embedding vectors, and these embeddings are now the representatives of the classes, of the things I want to detect or recognize. Now I give you a new image that you have never seen before; you send it through the image encoder and get the corresponding image vector, and you can just compare this image vector to each of the three text vectors and get a score for each one. This is basically the score for how well the image matches each of these texts, and if you want to choose one you can just take the max, or take the softmax if you want probabilities and then pick the highest one. That's basically how you use it, or how you can build a classifier without even needing images to build it. So far so good?
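To make that concrete, here is a minimal NumPy sketch of the zero-shot recipe. The two encoder functions are random stand-ins for the real image and text towers (placeholder names of my own, not the released API); the point is just the shapes and the dot-product comparison.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 512  # shared embedding dimension

# Stand-ins for the trained encoders; in reality these are the image tower
# and the text tower of a CLIP/SigLIP model, both ending in an L2-normalized vector.
def encode_image(image):
    v = rng.standard_normal(DIM)
    return v / np.linalg.norm(v)

def encode_text(text):
    v = rng.standard_normal(DIM)
    return v / np.linalg.norm(v)

# 1) Define the task purely in text; no images are needed.
class_names = ["a picture of a pink primrose",
               "a picture of a pocket orchid",
               "a picture of a daisy"]
text_embs = np.stack([encode_text(t) for t in class_names])  # [3, DIM]

# 2) Embed a new image and compare it against every class text.
img_emb = encode_image(np.zeros((224, 224, 3)))               # [DIM]
scores = text_embs @ img_emb                                  # one similarity per class

# 3) Take the max (or softmax the scores if you want "probabilities").
print(class_names[int(np.argmax(scores))])
```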
Okay, no questions; then maybe this was all perfectly known to you all, and I apologize for taking the time.
Of course, as you may also know, it's not always quite that simple: you need to nicely decorate the text, and there is a whole prompt engineering that goes into it. It's better to write "a picture of" or "a flower called pink primrose" than just "pink primrose", but hopefully this is something we, as a community, will get past at some point. Now, that's the cool thing: in the setting I presented before, this whole step of adapting and needing a training set per task just goes away, and we just write down what we want as the new task. Okay, so that was CLIP, and I think somebody wrote a question, let me see if I can read it.
The question is: "Naively, in a way, CLIP is learning search rankings, since the data is curated from search by respective text descriptions?" The training data is not related to search. That is how people classically used to build datasets, right: for ImageNet and so on they came up with a list of classes, entered those classes into some image search engine, took the images they found, and said okay, these are my classes (I think this is the process behind at least ImageNet, plus maybe some sanity checking by human raters). But in this case, for CLIP and ALIGN, there is no such thing, because nobody came up with a list of things to search for in the first place; there is just, maybe by starting from some seed like Common Crawl or something like that, a large amount of image-text pairs from the web.

A related question: "Would it help to pose it as a listwise ranking problem with a ColBERT-like alignment module instead of a basic contrastive loss?" I actually don't know what a ColBERT-like alignment module is.
But about the softmax classification that CLIP does: let's go there now, I think it's even my next slide. Yeah, so here is the loss that CLIP uses in a bit more detail. It's a bidirectional softmax. For this text embedding, compare it to all the image embeddings and ask for the corresponding one to be rated higher than all the others; specifically, take the dot product of this text embedding with all image embeddings, take the softmax, and say the ground-truth label is one for the correct image and zero for all others, so it's kind of a classification task. And I believe (let's say I'm 90% sure) that by optimizing this classification task you are at the same time implicitly optimizing a ranking task. The thing is, this is just one direction, and there is an asymmetry, so the CLIP loss also does it in the other direction: take this image embedding, compare it against all the text embeddings, again take the softmax, and ask for the correct one to be the class, basically class one, and all others to be zero.
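As a rough sketch, this is roughly what that bidirectional softmax (InfoNCE-style) loss looks like in NumPy, assuming already L2-normalized embeddings and a fixed temperature (function names are mine); the global row and column sums mentioned a bit later in the talk show up as the two logsumexp reductions.

```python
import numpy as np

def logsumexp(x, axis):
    m = x.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))

def clip_softmax_loss(zimg, ztxt, t=100.0):
    """zimg, ztxt: [n, d] L2-normalized embeddings; row i of each forms a matching pair."""
    n = zimg.shape[0]
    logits = zimg @ ztxt.T * t                      # [n, n] similarity matrix
    # image -> text: softmax over each row (a sum over the whole row),
    # with the diagonal entry as the "correct class".
    log_p_img2txt = logits - logsumexp(logits, axis=1)
    # text -> image: softmax over each column (a second set of global sums).
    log_p_txt2img = logits - logsumexp(logits, axis=0)
    # Negative log-likelihood of the diagonal, averaged over both directions.
    return -(np.trace(log_p_img2txt) + np.trace(log_p_txt2img)) / (2 * n)
```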
Oh, there was a clarification: late interaction on the token embeddings instead of a single vector-to-vector comparison. Basically, if you do anything on top of these image and text embeddings, as opposed to them just being vectors on which you take a dot product, if you put some other neural-net module on any pair of them, it makes everything you want to do later more expensive, for example retrieval. Say you want to build an image retriever: you can have your huge database of, I don't know, 100 million images, embed them all, and store these embeddings in a database; then any user comes with a text query, you embed the text, and you can quickly compare it by taking dot products with all the image embeddings and retrieve the most relevant images that way. You could try to learn a small neural-net module on top of pairs of embeddings, or even more than that, to get slightly more precise rankings, but then at retrieval time you need to run all pairs through it, so it gets more expensive. You can probably perform a bit better; it's a trade-off.
All right, so this is what CLIP does, and I mentioned a couple of disadvantages of it. One I just mentioned: it's bidirectional, we need to take the softmax both ways. The next thing is that training a CLIP is usually done with a quite large batch size; the CLIP paper used a batch size of 32,000 across 7 or 800 GPUs, I don't remember, and for ALIGN I don't remember the specifics, but they also had a similarly large batch size, probably on TPU devices. And in the softmax you have the normalization, so you need to sum across whole rows in order to normalize, and across whole columns again, so you have these global sums going on just to compute the softmax; that is yet another disadvantage. And then it just intuitively feels like a maybe weird learning task, although I guess that depends on your taste. As I mentioned, in the direction from image to text it is kind of an image classification task with a list of classes that is made up on the fly, defined by whatever texts appear in the mini-batch, so it changes for every mini-batch; and at the same time there is the classification task the other way around, for each text, with the classes defined by the images that appear in the mini-batch. It can be argued whether this is a nice, magically regularizing thing, or whether, as it feels to me, it is just arbitrary and wrong; that's why I put a question mark there, I think it's a matter of opinion, or of results in the end. The main disadvantages are the first two: it's bidirectional, so you need to compute the softmax twice, and there are these global sums. In a small, single-device setting with a small batch, maybe 128, this seems a bit overblown, but as you scale up to a batch size of, say, 32,000, which is the default, then this matrix here and the operations on it actually make up the vast majority of your memory usage (sometimes more than the model itself) and of your time. So that's why we started digging into this and figuring out: is this really needed, or are there other ways that are simpler and maybe more scalable?
And that's how we got to SigLIP, which in hindsight is super simple: just change this from a softmax-based loss to a sigmoid-based loss. Meaning, instead of doing this kind of classification across the other modality, take each image-text pair, each entry in this similarity matrix, individually and look at it in isolation: this pair, this image here with this text here, do they match, yes or no? That's your task, your loss for this entry. Same for this one: do they match, yes or no? That's the loss for this entry. So there needs to be no global normalization; you don't even need this whole matrix to exist at once, which you do need if you want the global normalization; and there is no bidirectionality needed. It's just each entry in this matrix: do they match, yes or no? We found this works a little bit better, but above all it scales a lot better and requires much less memory and less compute for the loss. And because it's basically CLIP with the softmax exchanged for a sigmoid, we call it SigLIP. I just want to show that algorithm-wise, code-wise, it's actually super simple: if you take the pseudo-code from CLIP and turn it into SigLIP, it's really just changing to a log-sigmoid loss instead of the softmax. I think the code is even shorter than the pseudo-code in CLIP because it doesn't have the bidirectional part, but you do need to add a bias here, and we will talk a bit more about that later. Will we? Yeah, maybe.
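In NumPy-style pseudo-code, along the lines of the pseudo-code in the SigLIP paper, the change really is just the loss: t' and b are the learnable temperature and bias, and the label matrix is +1 on the diagonal and -1 everywhere else (a sketch, not the released training code).

```python
import numpy as np

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)) = -log(1 + exp(-x)).
    return -np.logaddexp(0.0, -x)

def siglip_loss(zimg, ztxt, t_prime, b):
    """zimg, ztxt: [n, d] L2-normalized embeddings; t_prime, b: learnable scalars."""
    n = zimg.shape[0]
    t = np.exp(t_prime)                  # temperature, kept positive via log-parameterization
    logits = zimg @ ztxt.T * t + b       # [n, n] pairwise scores, plus the learnable bias
    labels = 2.0 * np.eye(n) - 1.0       # +1 for matching pairs, -1 for every other pair
    # Each entry is an independent binary "do these match?" decision:
    # no row or column sums, and no second direction.
    return -np.sum(log_sigmoid(labels * logits)) / n
```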
Now, one thing we can do to make it even more efficient and even more scalable, which becomes very simple with the sigmoid because we don't have a global normalization sum to take care of, as we would in the softmax, is what we call the chunked sigmoid loss. It works as follows. Say you have three devices and a batch size of 12, so on each device (a device could be a GPU or a TPU, it doesn't really matter) we have four images and their corresponding texts, and this is the global view of the similarity matrix; we want to compute the loss on each of these entries, and that is then the global loss. The naive and super-inefficient way to do this, but the way that is most obvious when you do a softmax, is to just gather all entries from all devices onto one device and do the softmax computation, the loss, and the sums there. However, once we realize that in the sigmoid case the loss of each entry in this matrix is completely independent, we can come up with a much nicer algorithm. Basically, first step: each device computes the loss on the entries available on itself, so no communication is needed; we compute the loss on this subset for this device, on this subset for this device, and on this subset for this device.
Then, and this is a little bit hard to illustrate (we tried our best in this picture by moving the device over), in reality you shift the texts: each device gives its text embeddings to the device to its left. That is only a little communication, and we end up in the state shown here: device one now has texts 5, 6, 7, 8, which device two previously had, and so on. So each device now sees, without the image embeddings ever moving, a different part of this matrix and can compute the loss on that part, or chunk (that's why we call it chunked). Then again you shift, move the texts to the device on your left, and compute the loss on the next chunk, always accumulating the loss in the same buffer, because at the end of the day, once we have the loss of all entries, we need the global sum of everything, and we can get that by just accumulating. And that's it. That way the whole contrastive loss basically no longer takes up any considerable amount of memory on our devices, which was the major thing holding back the scaling of CLIP-like models before.
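Here is a single-process sketch of the chunked idea, with array slices standing in for devices (the real implementation uses collective permutes between accelerators; the function and variable names here are mine, not the paper's code).

```python
import numpy as np

def log_sigmoid(x):
    return -np.logaddexp(0.0, -x)

def chunked_siglip_loss(zimg, ztxt, t, b, num_devices):
    """Each 'device' keeps its own image chunk, computes the loss against the
    text chunk it currently holds, then passes the texts to its neighbour."""
    n = zimg.shape[0]
    per_dev = n // num_devices
    img_chunks = [zimg[i * per_dev:(i + 1) * per_dev] for i in range(num_devices)]
    txt_chunks = [ztxt[i * per_dev:(i + 1) * per_dev] for i in range(num_devices)]

    total = 0.0
    for step in range(num_devices):
        for d in range(num_devices):
            logits = img_chunks[d] @ txt_chunks[d].T * t + b
            # Matching pairs only exist while a device still holds its *own*
            # texts (step 0); every shifted chunk is all negatives.
            labels = 2.0 * np.eye(per_dev) - 1.0 if step == 0 else -np.ones((per_dev, per_dev))
            total += -np.sum(log_sigmoid(labels * logits))   # accumulate into one buffer
        # "Give your text embeddings to the device to the left of you."
        txt_chunks = txt_chunks[1:] + txt_chunks[:1]
    return total / n
```

Run on the same embeddings, this should give the same value as the full-matrix loss above, just computed block by block, with only the text chunks ever moving between devices.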
There was always this intuition that for contrastive models, most examples in the mini-batch are really trivial. Take these three examples: if one of the texts just has the word "dog" in it, it's a no-brainer to figure out which of the images it belongs to. In the initial phase of learning maybe that's good, it's not too difficult, but once the model has learned the simple, trivial things, it would be better to have much more difficult cases in the mini-batch, say four or five different breeds of dog in the same mini-batch, so the model has to differentiate between them. Because suppose that throughout training your mini-batch always contains just one picture of a dog, maybe of different breeds, and the text is always the breed of the dog (it's a bad example because I don't know dog breeds off the top of my head, but you can think of some). Then the model would just learn to associate the different breed names it saw in the text with the concept of dog; there is no reason to differentiate them if there is only ever one dog visible per mini-batch. So it would learn the concept of dog and the names of dog breeds that exist, but not how to actually tell those breeds apart. That's why it has always been the aim, and the intuition, that for any kind of contrastive learning we need to scale up the batch size to get more difficult examples, or hard negatives, in there, so the model learns more fine-grained nuance.
And with this change we can now actually test that hypothesis, because the size of this matrix, the batch size, no longer really influences our memory usage; we can scale to an almost arbitrary batch size using the chunked sigmoid implementation. So we did that, and that was actually the main aim of the project: we wanted to scale to outrageously large batch sizes and see how much better CLIP-style models get by doing so. So we kept increasing the batch size more and more. We have two different settings, and I will say a bit more about them later, but we went basically up to a batch size of 1 million. What we found is that in both settings, left and right, a batch size of around 32,000 seems to be close to the best, and beyond that, much larger batch sizes don't really give you much benefit anymore, at least according to the few metrics we actively track, like ImageNet zero-shot classification, COCO retrieval, and things like that. So that was one learning from this.
But the other thing you can see here is that between the sigmoid loss and the softmax loss, almost throughout training the sigmoid loss performs a bit better than the softmax loss, and the nice thing is that going to smaller batch sizes, the difference gets larger. So if you cannot use a large batch size, maybe because you don't have many devices at the same time, then the sigmoid is definitely the better loss to use than the softmax.
Then there is maybe a negative result, one thing we were hoping for. Oh wait, a question about hard negative mining: yeah, I will come to that towards the end. Maybe one negative result: in our group we also really try to build multilingual models and to push them as far as we can. I have a slide on that later, but basically the original CLIP and almost all published CLIP models are usually trained on English-only subsets of the data and really perform well only on English. The way I described how we can train models on just data from the web, without needing to predefine lists of classes, at first sounds great: we can get data from the whole web, which means all languages, and basically for free have models that perform well in all languages. But somehow people don't do this and stick to training English-only CLIP models, including ourselves. Some of us always push to train on multilingual data and evaluate in multilingual ways, but the models trained on multilingual data are always behind those trained on English data on the standard benchmarks. Of course, if you evaluate a multilingual model on, say, a French benchmark against an English-only model, the English-only model will be crap; but on the standard benchmarks, which are mostly English, like ImageNet, COCO captions, and so on, the multilingual model is worse than the English one. We had a hypothesis that if we could scale up the batch size a lot, maybe that would help the multilingual model catch up. The intuition: if you have a fixed batch size of 32,000 examples and only English texts and images in there, there is no easy shortcut, you need to match them. But say you have a mixture of languages and images in there, still a lot of English but also a bunch of Chinese images and texts; then there could be a shortcut: the model matches Chinese characters with typically Chinese-looking images, images of cities or traditional things or people even. At the very least such a shortcut would make the effective batch be two smaller batches, the English one and the Chinese one, because it's trivial to separate them. So that's why we thought scaling up the batch size a lot might make the multilingual model a lot better. Unfortunately that turns out not to be the case, which is what this plot shows: it's the same, around 32,000 it hits the peak accuracy and then it even goes down a bit. So we were quite disappointed by this result.
All right. This is a huge table, but it basically says that across all model sizes, Base with different sequence lengths or resolutions, and Large models, we train SigLIP models that are the best by far. For example, the original CLIP from OpenAI gets 76% ImageNet validation accuracy, and we also have COCO numbers, but basically SigLIP is much better than the other models across the sizes, and we were actually able to open-source all of these SigLIP models, so they are just available; you can download and use them. We also trained a larger model, the So400m one: 400 million parameters is a size, or rather a shape, of Vision Transformer that we developed in a separate paper where we looked at whether scaling laws can be used to predict optimal Transformer shapes (shape meaning the detailed size: not just the number of parameters, but the number of layers, the model dimension, and the MLP dimension). This is much smaller than the g or H or even e models, but the image-text model we trained at that shape performs better than even those larger ones. So, a big table with many numbers, just to say: if you need an image-text model that is really good, just download SigLIP and try it out.
And we don't just look at benchmarks; we also looked at a bunch of difficult pictures that we picked ourselves. The first ones are from the internet, so we cannot be sure the model has never seen them; they are not part of standard benchmarks, but they are out there, so the model may have seen them. This one is quite famous from an OpenAI blog post a long time ago about fooling models, but it actually works pretty well, as you can see here with the text. And keep in mind that in the usual demos of contrastive image-text models, the score given to an image-text pair depends on the other texts that are present, because people usually take the softmax; but here, for SigLIP, these are just individual pairwise scores. So this 100% match between this image and the text "an apple with a note saying iPod" is completely independent of all the other texts. So basically it works quite well on difficult examples, and here, "a cold drink on a hot day" versus "a hot drink on a cold day", it also works quite well. If you search for it on GitHub there is a Colab; this table is just a straight screenshot from a Colab that we published, which you can play around with, run, and try out different examples. The images on the right are images of ourselves that were definitely never on the internet before, and it still works really well on them. I'm not saying OpenAI's CLIP would not work on them, it probably also does reasonably well, but this was our sanity check that it also works on never-before-seen things. It can also reasonably well understand text in the image: here we held up signs saying SigLIP, and if we write "a photo of the SigLIP authors" it is very happy about it. And this one is interesting: my colleague Basil and I wore t-shirts that both had to do with coffee. Mine says something like "current mood: need coffee", and Basil's shows the caffeine molecule, and I think it has the text "caffeine" on it. It fires really strongly on the text "a photo of two guys in need of caffeine" and not at all on "in need of water". Also, one general thing with these kinds of models: the more precise your text is, the higher the matching score; if you are very vague with the text, it gives a reasonably small score. Here is an example: just the word "cow", no prompt, no "a picture of", nothing, and it gives something like 35% to these two pictures. But if you are much more precise, like "a cow in a tuxedo", boom, almost 100% for this one and almost 0% for that one, because it has no tuxedo. That's just a general thing to know when working with these kinds of models.
Okay, then I also want to say again what I said previously about multilinguality. I think this is the first CLIP-like image-text model that has been open-sourced and is pretty good at multilinguality. We also have a second Colab, linked somewhere in the GitHub or in the first Colab, I don't remember, that tests this, and we tried to test some cultural things. I think most of them you would only know if you're Chinese, but "ants climbing a tree", if you literally translate it to Chinese (which I cannot, but it is these two texts), is actually the name of a dish. If you write it in English, a person wouldn't know that "ants climbing a tree" refers to a dish, so it only fires on the literal picture, not on the dish; but if you write it in Chinese, it totally knows what you're talking about, so it fires on the dish and not on the literal ants climbing a tree. If you have more examples of this kind of thing in mind, please shoot us an email with them; we're actually looking for them, to investigate more.

How many languages was it trained on, or does it support? Hard to say. In the training set there are more than 100 languages; I think it's maybe not the exact same, but a very similar dataset to the one in the PaLI paper, and in the first PaLI paper we give some statistics; I think we said it has more than 100 detected languages. That doesn't mean it's equally good on all of them; on the rarer languages I would expect it to be worse. But basically try any language and see; there is a good chance it understands, at least to a reasonable degree, whatever language you think of. Actually, I should ask my Swiss colleagues to try Swiss German too. And definitely, especially for the multilingual/multicultural aspect, just shoot us an email or a tweet or whatever about your experience; we're especially curious about that, and given that we don't cover all cultures among our colleagues, we also cannot think of all the different ideas or ways to test it.
All right. Another thing that happened a bit later: you may have seen the PaLI-3 paper. I forgot to give context: PaLI is an image-text model that can generate text, from our group and an extended group of colleagues across Google. It basically takes image and text as input and produces text as output, and it is trained in a multi-task way, so it can do things like captioning images, but also answering questions about images, doing OCR on images; if the images are documents it can answer questions about them or read parts of them out, and things like that. In the first two versions and papers of PaLI, the image embedding model was always based on the best image embedding model we had at the time, which was always pre-trained in the classification way on the internal JFT data. The first PaLI model came out and was state-of-the-art across a lot of these combined image-text tasks; then the second one, called PaLI-X, came out, essentially the same story but with more tasks, more benchmarks, and especially a significantly scaled-up image and text model, and it was again better across the board. On the other hand, some other labs also had image-text models based on a CLIP encoder, I think mostly for the image model, but they were worse compared to PaLI.
So it may have given the impression that even for these image-text models, pre-training the image encoder in a classification way on a large private dataset like JFT may just be better. But there was never an explicit comparison anywhere, only implicit ones, comparing, say, PaLI to some other existing model, let's say Florence from Microsoft (I'm just thinking of a random one), and there are many, many things that change across these models, so you cannot really draw such a conclusion. So for PaLI-3 we did an exact apples-to-apples comparison: across different scales, we compared differently pre-trained image encoders plugged into the PaLI model. The PaLI model is an image encoder, a text encoder, and a text decoder; so for example an image encoder plus a T5 encoder-decoder, where you concatenate the embedded image tokens with the text tokens, send them through the T5 encoder, and the T5 decoder generates the answer. That was the first version; the next version used a UL2 text encoder-decoder instead of T5, which is just a better one, and that's it. Basically, what we found is that almost across the board it's better to use the SigLIP-pretrained image encoder, except for linear classification probes, which is an important metric that many people, including ourselves, often use to judge model quality. That means taking the image model, embedding a bunch of images to get their representations, and then doing few-shot classification with just a linear head on top (these are sometimes called linear probes). Across eight different tasks, the classification-pretrained model was always significantly better there. So, just to say: if you plug the model into a broader model and do more tasks than just classification, the result may be very different, and, for example, SigLIP may be much better, which is the case here.
Then, I think this is the last one. Okay, let's see about the time; 10 minutes, okay. So, one thing: there was a question earlier about hard negative mining and so on. We did a little investigation of this, I would say a preliminary indicator, in the SigLIP paper. The question is roughly: if we could somehow get more difficult negatives into the mini-batch, would that be helpful? With SigLIP, where each entry in this final loss matrix is independent of the rest, we can relatively easily run an experiment to check this hypothesis, by taking the big matrix but computing the loss only on a subset of the entries. For example, okay, let's see.
This point here is the score we get with regular contrastive training in some setting that we fix. This point here, the blue one, is what we get with the same training setting, but where in the loss matrix we mask out a random half of the entries. So we go through the same number of image-text pairs, or training examples, but only half of the pair combinations actually get a loss; then we get this blue point. And then this is a 16th, and then a 60th, and so on, so it gets worse and worse. This is a baseline, not interesting in itself. Now, what if we keep only the hardest half, i.e. mask out the easiest half, the half with the lowest loss, the pairs the model already handles, and don't put a loss on those? Then we get this orange line. This is basically: what if we could train with half the batch size, but all of them harder ones? It is definitely better than the random masking; you need to compare this orange point with the blue point. But in principle we have only seen half the number of pairs. Keep in mind there are two distinct quantities here: the number of examples seen, which, as you iterate through training with, say, a batch size of 32,000, grows by 32,000 examples per step; and the number of pairs seen, which grows by 32,000 squared per step, because in one batch you get a loss on all image-text combinations. So while here we did equally many steps with the same batch size, and hence saw equally many examples as this point, we only got a loss on, meaning "saw", half the number of pairs. You could say maybe that's why it didn't actually improve the results, so what if we match the number of pairs seen? That is the pink curve: it trained for double the steps of the orange and blue ones, but in each step the easier half of the pairs in the mini-batch is masked out. So the pink curve roughly shows you what happens if you could fill your batch with only harder examples, and that would improve things a little bit. However, how to fill your batch with only harder examples without it being more expensive than random examples, we don't know, and we don't have a solution for that in this paper.
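A sketch of that masking experiment, under the same assumptions as the loss code above, and with my own choice to always keep the positives and only mask negatives, which the talk does not spell out:

```python
import numpy as np

def masked_siglip_loss(zimg, ztxt, t, b, mode="hardest_half"):
    """Compute the per-pair sigmoid losses, then drop half of the negatives:
    either a random half or the easiest (lowest-loss) half."""
    n = zimg.shape[0]
    logits = zimg @ ztxt.T * t + b
    labels = 2.0 * np.eye(n) - 1.0
    per_pair = np.logaddexp(0.0, -labels * logits)     # loss of each (image, text) entry

    neg = labels < 0
    if mode == "random_half":
        keep = np.random.default_rng(0).random((n, n)) < 0.5
    else:  # "hardest_half": keep the negatives the model still gets most wrong
        keep = per_pair >= np.median(per_pair[neg])
    keep = keep | ~neg                                 # positives are always kept here
    return np.sum(np.where(keep, per_pair, 0.0)) / n
```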
So that is the situation with hard examples. Now, very quickly, two more things, and then I want to leave at least some time for a couple of questions. One is that across the different types of artificial noise we introduce, the sigmoid seems to be consistently more robust, and this is consistent with a previous paper of ours where, in classification, we replaced the softmax with a sigmoid loss and found the same thing: it was more robust to noise.
And let's forget about this one, but maybe this one can be useful in general; we also put it in the paper, and it's not really related to SigLIP. If you have a loss curve that looks like this, spiky, spiky, spiky, but doesn't really die, and you're using Adam, this can be fixed in most cases (not all) by reducing Adam's beta2 from 0.999 to 0.95, and then you get something like the green curve, which is much nicer.
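Concretely, that is just the second entry of Adam's betas; for example in PyTorch (a generic illustration, not the actual training setup from the paper):

```python
import torch

model = torch.nn.Linear(512, 512)  # any model

# Adam's default betas are (0.9, 0.999); the suggestion is to shorten the
# second-moment horizon to 0.95 when the loss is spiky but keeps recovering.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.95))
```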
And I think I'll stop here; you can find the stuff in the Colab, and maybe we can have the last five minutes for questions, if you write them in the chat. What are the next directions you're most interested in?
Well, one thing is the PaLI-3 work I mentioned, where we plug this into a more general model that does all kinds of things; that's interesting. Also, as I mentioned a couple of times, I'm really interested in making as good a multilingual/multicultural model as possible, and I think we only scratched the surface with this Colab that we made, so I'm quite interested in that. No more questions? People need time to write; you can also, if you want, just raise your hand, unmute, and ask. There was another question: if you have a second, can you talk about the captioning paper? Oh yeah, I guess, if there is no other question.
So, there is one issue with all contrastive models, including SigLIP. (Why is the sigmoid more robust than the softmax? We only have a vague intuition; we don't really know why.) Remember what I described with the example of dogs and breeds; there is actually a simpler example, with relationships. Say I have, in the mini-batch, the image and the text of a cat sitting left of a dog, and no other images of cats or dogs in the mini-batch, and no other texts containing cat or dog. Then the model, to solve the loss, doesn't actually need to understand in detail that a cat is sitting to the left of a dog; it just needs to check: oh, there's a cat, and this text says something about a cat, so that's the matching pair, and otherwise not. It doesn't need to understand the "left of", and it would only need to understand the "left of" if the exact opposite happened an equal amount of the time, ideally even in the same mini-batch, but that is just not realistic, even with insanely large batch sizes. There have been a few papers, research, and benchmarks showing this: these contrastive models mostly behave like a bag-of-words thing. Bag-of-words is a bit exaggerated, but it's more like detecting individual things in the image, not necessarily their relationships to each other. And this is just inherent in the loss, in how they are trained.
trained and me and another few
colleagues thought about this
and how would we train a model that
doesn't have this and what would be the
simplest possible way to train a model
that doesn't have this because we always
like simple things that are scalable and
eventually convert it to a captioning
model uh should nail this or should be
able to learn this so which really is
for each image generate the caption that
belongs to it um because then it needs
to generate each token in the text like
to give each token in the text higher
probability than anything else possible
like then for for this example of a cat
sitting left left of a dog it needs to
give left higher probability than
everything else including right So
eventually it needs to learn this kind
of things um and we wrote a paper about
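The training objective of such a captioner is ordinary teacher-forced next-token prediction conditioned on the image; a shape-only NumPy sketch with stand-in logits rather than a real decoder:

```python
import numpy as np

def captioning_loss(token_logits, caption_ids):
    """token_logits: [T, V] next-token scores from a decoder conditioned on the
    image; caption_ids: [T] ground-truth caption token ids, e.g. for
    "a cat sitting left of a dog". Every token, including "left", must get
    higher probability than all alternatives, including "right"."""
    m = token_logits.max(axis=-1, keepdims=True)
    log_probs = token_logits - (m + np.log(np.exp(token_logits - m).sum(axis=-1, keepdims=True)))
    return -np.mean(log_probs[np.arange(len(caption_ids)), caption_ids])

# Shape-only smoke test: 6 caption tokens, vocabulary of 1000.
rng = np.random.default_rng(0)
print(captioning_loss(rng.standard_normal((6, 1000)), rng.integers(0, 1000, size=6)))
```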
We wrote a paper about this, which was called something like "Image Captioners Are Scalable Vision Learners Too". We call the model CapPa because it's a captioner, but we also had a twist on it with a parallel decoder; that doesn't matter too much, what matters is that it's a captioner model.
Actually, on the earlier question about what I'm most interested in: it's also to push this captioner pre-training much further. Maybe the last question, from the chat: if every image-text pair is independent in the loss, why does a very big batch size still help so much, apart from accumulating the loss across devices and adding it up at the end?
Yeah, this is the thing I tried to, hopefully, explain well: the difference between examples seen and pairs seen. If you do, say, 10 steps with batch size 32,000, you will see 32,000 squared times 10 pair combinations that you learn from, whereas if you do the 10 steps with a smaller batch size... okay, I cannot make up the numbers on the spot, but if you go through the same number of examples, the number of pairs you go through is different depending on the batch size, and with a larger batch size, because the number of pairs generated is the square of the batch size, you still go through a much higher number of pairs. So that's the reason why a large enough batch size still helps.
helps I this was a little bit difficult
for me to explain I hope hope I did
reasonable
job okay and I do need to go yeah I
think we can end the session thank you
so much for the
talk thank you for having
me
Bye-bye.