How AI Image Generators Work (Stable Diffusion / Dall-E) - Computerphile
By Computerphile
Summary
## Key takeaways

- **GANs vs. Diffusion Models for Image Generation**: Generative Adversarial Networks (GANs) can struggle with training and may produce repetitive outputs (mode collapse), whereas diffusion models simplify image generation into iterative steps, making the process more stable and manageable. [01:46], [02:16]
- **Diffusion: Adding and Removing Noise Iteratively**: Diffusion models work by gradually adding noise to an image until it becomes pure noise, and then training a network to reverse this process step-by-step, making it easier to generate complex images from noise. [03:09], [04:05]
- **Predicting Noise for Image Reconstruction**: Instead of directly generating a clean image, diffusion models are trained to predict the noise that was added at a specific step, allowing for a more stable and easier-to-train process for image reconstruction. [06:55], [08:31]
- **Text-to-Image Generation with Classifier-Free Guidance**: To guide image generation with text prompts, diffusion models use 'classifier-free guidance', which involves running the noise prediction process twice, once with text embeddings and once without, to amplify the text's influence on the output. [11:46], [14:30]
- **Accessibility of Diffusion Models**: While training large diffusion models can be extremely expensive, open-source versions like Stable Diffusion are accessible and can be run for free using platforms like Google Colab, allowing users to experiment with AI image generation. [16:06], [16:22]
Topics Covered
- Why generative adversarial networks are difficult to train.
- The surprising secret: Diffusion models predict noise, not the image.
- Diffusion models: Iteratively removing noise to create images.
- How text prompts guide diffusion models to specific images.
- Classifier-Free Guidance: The hack for precisely targeted AI images.
Full Transcript
Generating images using diffusion: what is that? Right, so I should probably find out. It's things like DALL-E and DALL-E 2, yeah, Imagen from Google, and Stable Diffusion now as well. I've spent quite a long time messing about with Stable Diffusion; I'm having quite a lot of fun with that. So what I thought I'd do is download the code, read the paper, work out what's going on, and then we can talk about it.

I delved into this code and realized there's actually quite a lot to these things. It's not so much that they're complicated; it's just that there are a lot of moving parts.
So let's just have a quick reminder of generative adversarial networks, which were, I suppose, before now the standard way of generating images, and then we can talk about how diffusion is different and why we're doing it that way.

Having a deep network trained to just produce the same image over and over again is not very interesting, so we have some kind of random noise that we use to make the output different each time. Then we have some kind of very large generator network; I'm going to draw it as a black box, a big neural network that turns out an image which hopefully looks like the thing we're trying to produce: faces, landscapes, people.

Is that how those anonymous people on This Person Does Not Exist are made? Yeah, that's exactly how they work. I think that's using StyleGAN, and it's that exact idea: it's trained on a large corpus of faces, and it just generates faces at random, or at least mostly at random.

The way we train this is that we have millions and millions of pictures of the thing we're trying to produce. So we give the generator noise, it produces an image, and we have to tell it: is that good, or is that bad? We need to give this network some instruction on whether the image actually looks like a face, otherwise it's not going to train. So we have another network, which is sort of like the opposite, and it says: is this a real or a fake image? Half the time we give it fake images, and half the time we give it real faces. So the discriminator trains and gets better at discriminating between the fake images produced by the generator and the real images from the training set, and in doing so the generator has to get better at faking them, and so on and so forth, and the hope is that they just get better and better.

Now, that kind of works. The problem is that GANs are very hard to train. You have a lot of problems with things like mode collapse, where the generator just produces the same face: if it produces a face that fools the discriminator every time, there's not a lot of incentive for it to do anything interesting, because it has solved the problem; it's beaten the discriminator, let's move on. If you're not careful with your training process, these kinds of things can happen. And I suppose, intuitively, it's quite difficult to go from a bit of noise to a really beautiful-looking, high-resolution image in one go without there being some oddities.
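As a refresher, here is a minimal sketch of that adversarial training step, assuming PyTorch; the tiny stand-in models, names, and sizes are my own illustration, not anything from the video:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Tiny stand-in models; a real GAN uses much larger networks.
latent_dim, image_dim, batch_size = 64, 784, 32
generator = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                          nn.Linear(256, image_dim), nn.Tanh())
discriminator = nn.Sequential(nn.Linear(image_dim, 256), nn.ReLU(),
                              nn.Linear(256, 1), nn.Sigmoid())
g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

real_images = torch.rand(batch_size, image_dim) * 2 - 1  # placeholder for a real batch

# One adversarial training step.
z = torch.randn(batch_size, latent_dim)   # random noise in
fake_images = generator(z)                # generated images out

# Discriminator: score real images as 1, fakes as 0.
d_opt.zero_grad()
d_loss = (F.binary_cross_entropy(discriminator(real_images), torch.ones(batch_size, 1)) +
          F.binary_cross_entropy(discriminator(fake_images.detach()), torch.zeros(batch_size, 1)))
d_loss.backward()
d_opt.step()

# Generator: try to make the discriminator say "real".
g_opt.zero_grad()
g_loss = F.binary_cross_entropy(discriminator(fake_images), torch.ones(batch_size, 1))
g_loss.backward()
g_opt.step()
```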
So what we're going to do in diffusion models is try to simplify this process into a kind of iterative, small-step situation, where the work the network has to do at each step is slightly smaller, and you just run it a number of times.

We'll start again on the paper so we can clean things up a bit. We've got an image; let's say it's an image of a rabbit. We add some noise, so we've got the same rabbit with a bit of noise on it. It's not speckly noise, but I can't draw Gaussian noise. Then we add another bit of noise: it's the same shape of rabbit, there's just a bit more noise. And we keep going, and eventually we end up with just noise; it looks like nonsense. So the question is: how do we craft some kind of training algorithm, and some kind of inference process, so that we can actually deploy a network that can undo this process?

The first question is how much noise we add. Why don't we just add loads of noise? Delete all the intermediate images, add loads of noise in one go, and say: give me back the original. Then you've got a pair of training examples you could use. The answer is that it'll kind of work, but that's a very difficult job, and you've got sort of the same problem as with the GAN: you're trying to do everything in one go. The intuition, perhaps, is that it's maybe slightly easier to go from one step back to the previous one, just removing a little bit of noise at a time.

Well, in traditional image processing there are noise-removal techniques, right? It's not difficult to do that, is it? No, I mean, it's difficult in the sense that you don't know what the original image was. So what we're trying to do is train a network to undo this process. That's the idea, and if we can do that, then we can start with random noise, a bit like the GAN, and just iterate this process and produce an image. Now, there are a lot of missing parts here, so we'll start building up the complexity a little bit.
Okay, so the first thing is: let's go back to our question of how much noise we add. We could add a small amount of noise, and then the same amount again, and the same amount again, and keep adding it until we have essentially what looks like random noise. That would be what we'd call a linear schedule: the same amount of noise each time, basically. It's not interesting, but it works. The other thing you could do is add very little noise at the beginning and then ramp up the amount of noise you add later, and there are different strategies, depending on what paper you read, for the best approach to adding noise. But it's called the schedule. The idea is that you have a schedule that says: this is the image at time T equals nought, this is T equals one, and so on, up to some capital T, which is the final number of steps you've got. That last one represents essentially all noise, and each one in between represents some amount of noise, and you can change how much each step has.

The nice thing is that Gaussians add together very nicely, so you can say "I want T equals seven" and you don't have to produce all the intermediate images: you can jump straight to t equals seven, add exactly the right amount of noise, and hand that to the network. So when you train this, you can give it random images from your training set with random amounts of noise added based on this schedule, varying randomly between 1 and T, and you can say: here's a really noisy image, undo it; here's a slightly less noisy image, undo it.
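That "jump straight to any step" trick is, in the standard DDPM formulation the video is describing informally, the closed-form forward process. A minimal sketch, assuming PyTorch and the linear schedule described above; the variable names are mine:

```python
import torch

T = 1000                                   # total diffusion steps
betas = torch.linspace(1e-4, 0.02, T)      # linear noise schedule
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)   # cumulative product: signal kept by step t

def add_noise(x0, t):
    """Jump straight to timestep t: sample from q(x_t | x_0)."""
    eps = torch.randn_like(x0)             # fresh Gaussian noise
    xt = alpha_bar[t].sqrt() * x0 + (1 - alpha_bar[t]).sqrt() * eps
    return xt, eps                         # keep eps: it becomes the training target

# Example: a dummy "rabbit" image, noised to step 700 of 1000.
x0 = torch.rand(3, 64, 64) * 2 - 1         # images scaled to [-1, 1], centred on zero
xt, eps = add_noise(x0, 700)
```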
is you take your noise image image
right I'm going to keep going with this
rabbit it's taller than it was before
right you take your noisy image at some
time let's say t equals five right you
have a giant unit shaped Network we've
talked about encoder decoder networks
before there's nothing particularly
surprising about this one
and then you also put in the time right
because if we're running a funny
schedule where your at different times
have different amounts of noise you need
to tell the network where it is so that
it knows okay I'm gonna have to remove a
lot of noise this time or just a little
bit of noise what do we produce here
so we could go for the whole hog and we
just say we'll just produce the original
rabbit image but then you've got a
situation where you have to go from here
all the way back to the rabbit that's a
little bit difficult right
mathematically it works out a little bit
easier if we just try and predict the
noise we want to know what is the noise
that was added to this image
that you could use to get back to the
original image so this is all the noise
from
t1234 and five so you just get noise
basically out here like this right with
no rabbit that's the hope
and then theoretically you could take
that away from this and you get the
rabbit back right now if you did that
from here you would find that it's a
little bit iffy right because you know
you're predicting the noise all the way
back to this rabbit is maybe quite
difficult but if you did it from here it
may be not quite so difficult we want to
predict the noise so what we could do is
predict the noise at let's say time T
equals five and to say give me the noise
it takes us back to T equals four right
and then T equals three and T equals two
the problem if you do that is that
you're very
stuck doing the exact time steps of the
schedule used right if you used a
thousand time steps for training now
you've got to use a thousand time steps
of inference right you can't speed it up
so what we might try and do instead is
say well okay whatever time step you're
at you've got some amount of noise
remove it all predict me all the noise
in the image and just give me back that
noise that I can take away and get back
to the original image and so that's what
we do so during training we pick a
random Source image we pick a random
time step and we add based on our
schedule that amount of noise right so
we have
a noisy image
a Time step T we put that into the
network
and we say what was the noise
that
we've just added to that image right now
we haven't given it the original image
right so that's what's Difficult about
this we we have the original image
without any noise on it that we're not
showing it and we added some noise and
we want that noise back right so we can
do that very easily we've got millions
of images in our or billions of images
in our data set right we can add random
bits of noise and we can say what was
that noise right and over time it starts
to build up a picture of what that noise
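Combined with the schedule sketch above, one training step might look like the following, assuming PyTorch, the `T` and `alpha_bar` defined earlier, and a hypothetical `unet(xt, t)` that returns a noise prediction:

```python
import torch
import torch.nn.functional as F

def training_step(unet, optimizer, images):
    """One DDPM-style training step: predict the noise that was added.
    Reuses T and alpha_bar from the schedule sketch above."""
    b = images.shape[0]
    t = torch.randint(0, T, (b,))            # a random timestep per image
    eps = torch.randn_like(images)           # the noise we add (and must recover)
    ab = alpha_bar[t].view(b, 1, 1, 1)       # broadcast the schedule over image dims
    xt = ab.sqrt() * images + (1 - ab).sqrt() * eps

    eps_pred = unet(xt, t)                   # network sees only noisy image + time
    loss = F.mse_loss(eps_pred, eps)         # simple L2 loss on the predicted noise

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```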
So it sounds like a really good plug-in for Photoshop or something: a noise-removal plug-in. How does that turn into creating new images? Yeah, in some sense that's the clever bit: how we use this network, which predicts noise, to undo the noise.

We've got a network which, given an image with some noise added to it, and a time step that represents roughly how much noise that is, or where we are in the noising process, produces an estimate of what that noise is in total. Theoretically, if we take that noise away, we get back the original image. Now, that is not a perfect process: this network is not going to be perfect, and if you give it an incredibly noisy image and take away what it predicts, you'll get maybe a vague shape. So what we want to do is take it a little bit more slowly.

We take the predicted noise and subtract it from our image to get an estimate of what the original image was, at t equals nought. It's not going to look very good the first time. But then we add a bunch of the noise back again, and we get to a t that's slightly less than the one we started at. So maybe this was t equals ten: we add, say, nine tenths of the noise back, and we get to roughly t equals nine. Now we have a slightly less noisy image, and we can repeat the process: we put the slightly less noisy image in, we predict how to get back to t equals nought, and we add back most, but not all, of the noise. Each time we loop, we get a little bit closer to the original image. It was very difficult to predict the noise at t equals ten; it's slightly easier at t equals nine; and it's very easy at t equals one, because it's mostly the image with a little bit of noise on it. So if we just sort of feel our way towards it, taking off little bits of noise at a time, we can actually produce an image.

So you start off with a noisy image, you predict all the noise and remove it, and then you add back most of it. At each step you have an estimate of what the original image was, and a next image which is just a little bit less noisy than the one before, and you loop this a number of times. That's basically how the image generation process works: you take your noisy image, you loop, and you gradually remove noise until you end up at what the network thinks was the original image. And you're doing this by predicting the noise and taking it away, rather than spitting out an image with less noise, which mathematically works out a lot easier to train, and it's a lot more stable than a GAN.
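That loop ("predict all the noise, subtract it, add most of it back") can be sketched like this, reusing `T`, `alpha_bar`, and the hypothetical `unet` from above. What's described here maps most closely onto DDIM-style sampling; this is an illustration of the idea, not a claim about the exact code used in the video:

```python
import torch

@torch.no_grad()
def sample(unet, steps=50):
    """Generate an image by iteratively denoising pure Gaussian noise."""
    xt = torch.randn(1, 3, 64, 64)                  # start from pure noise
    ts = torch.linspace(T - 1, 0, steps).long()     # e.g. 50 of the 1000 trained steps
    for i, t in enumerate(ts):
        eps_pred = unet(xt, t.view(1))              # estimate ALL the noise in the image
        ab = alpha_bar[t]
        # Subtract the predicted noise to get an estimate of the clean image...
        x0_est = (xt - (1 - ab).sqrt() * eps_pred) / ab.sqrt()
        x0_est = x0_est.clamp(-1, 1)
        if i + 1 < len(ts):
            # ...then add most (not all) of the noise back, landing at a
            # slightly less noisy timestep, and loop.
            ab_prev = alpha_bar[ts[i + 1]]
            xt = ab_prev.sqrt() * x0_est + (1 - ab_prev).sqrt() * eps_pred
        else:
            xt = x0_est                             # final step: keep the estimate
    return xt
```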
There's an elephant in the room here. There is. You're talking about how to make random images, effectively; how do we direct this? So that's where the complexity starts ramping up. We've got a structure where we can train a network to produce random images, but it's not guided: there's no way of saying "I want a frog-rabbit hybrid", which I've done, and it's very weird. So how do we do that? The answer is that we condition this network; that's the word we'd use. We basically give it access to the text as well.

So let's actually run inference on an image, on my piece of paper, bearing in mind the output is going to be hand-drawn by me, so it's going to be terrible. You start off with a random noise image: just an image you've generated by taking random Gaussian noise. Mathematically, this is centred around zero, so you have negative and positive numbers; you don't go from 0 to 255, because it's just easier for the network to train. You put in your time step. Let's say you're going to do 50 iterations, so we put in a time step right at the end of our schedule: time step equals 50, which is our most noised image. Then you pass it through the network and say: estimate me the noise. We also take our string, which is "frogs on stilts". I'll have to try that later; we could spend, let's say, another 20 or 30 minutes producing frogs on stilts. We embed this string using our GPT-style transformer embedding, we stick that in as well, and then the network produces an estimate of how much noise it thinks is in that image.

That estimate at t equals 50 is going to be a bit average: it's not going to produce you a frog-on-a-stilt picture; it's going to produce you a grey image, or a brown image, or something like that, because that is a very, very difficult problem to solve. However, if you subtract this noise from this image, you get your first estimate of what your final image is, and when you add back a bunch of noise, you get to t equals 49. So now we've got slightly less noise, and maybe there's the vaguest outline of a frog on a stilt. At t equals 49 you take your embedding and you put that in as well, and you get another, maybe slightly better, estimate of the noise in the image. And then we loop: it's a for loop, we've done those before. You take the output, you subtract it, you add noise back, you repeat the process, and you keep adding in this text embedding.
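The video doesn't pin down which embedding model is used; for Stable Diffusion v1 specifically, the text encoder is CLIP's transformer, so an illustrative sketch using the Hugging Face transformers library would be:

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# Stable Diffusion v1 conditions on CLIP's text encoder; other models
# (e.g. Google's Imagen) use different transformers, but the idea is the same.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

tokens = tokenizer(["frogs on stilts"], padding="max_length",
                   max_length=77, truncation=True, return_tensors="pt")
with torch.no_grad():
    # One embedding vector per token; the U-Net attends to these
    # at every denoising step.
    text_emb = text_encoder(tokens.input_ids).last_hidden_state  # shape (1, 77, 768)
```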
Now, there's one final trick they use to make things a little bit better. If you just do this, you will get a picture that maybe looks slightly frog-like, maybe there's a stilt in it, but it won't look anything like the images you see on the internet that have been produced by these tools, because they do another trick to make the output even more tied to the text. What you do is something called classifier-free guidance. You actually put the image through twice: once where you include the embeddings of the text, and once where you don't. The network is maybe slightly better at estimating the noise when it has the text. So you put in two versions: this one with the embedding, and this one with no embedding. The no-embedding prediction is maybe slightly more random noise, and the embedded one is slightly more frog-like, or at least it's slightly moving towards the right thing. We can calculate the difference between these two noise predictions and amplify that signal, and then feed that back. So what we essentially do is say: okay, if this network wasn't given any information on what was in the image, and then this version of the network was, what's the difference between those two predictions, and can we amplify that as we loop, to really target this kind of output? The idea is that you're really forcing this network, or this loop, to point in the direction of the scene we want.

That's called classifier-free guidance, and it is somewhat of a hack at the end of the network, but it does work. If you turn it off, which I've done, it produces vague sorts of structures that kind of look right; it's not terrible. I think I did "a Muppet cooking in the kitchen", and it just produced me a picture of a generic kitchen with no Muppet in it. But if you do this, then you suddenly are really targeting what you want.
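In code, "run it twice and amplify the difference" is the standard classifier-free guidance update. A sketch, assuming the hypothetical `unet` now also takes a text embedding; the `guidance_scale` name follows the diffusers convention, and 7.5 is a common default rather than a value from the video:

```python
import torch

def guided_noise_estimate(unet, xt, t, text_emb, empty_emb, guidance_scale=7.5):
    """Classifier-free guidance: two predictions, amplify their difference."""
    eps_uncond = unet(xt, t, empty_emb)   # no information about the prompt
    eps_text = unet(xt, t, text_emb)      # conditioned on "frogs on stilts"
    # Move further in the direction the text pushed the prediction.
    return eps_uncond + guidance_scale * (eps_text - eps_uncond)
```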
Standard question, got to ask it: is this something people can play with, without just going to one of these websites and typing in some words? Well, yeah. The thing is, it costs hundreds of thousands of dollars to train one of these networks, because of how many images they use and how much processing power they use. The good news is that there are ones, like Stable Diffusion, that are available to use for free, and you can use them through things like Google Colab. I did this through Google Colab, and it works really, really well; maybe we'll talk about that in another video, where we delve into the code and see all of these bits happening within it. I blew through my free Google allowance very, very quickly, and I had to pay my eight pounds for premium Google access. So, you know: eight pounds. Eight pounds! Thank you. Never let it be said that I spare expense: I spare no expense on Computerphile when it comes to getting access to proper compute hardware.
Could Beast do something like that? It could, yeah; most of our servers could. I'm just a bit lazy and haven't set them up to do so. But actually, the code is quite easy to run. With the sort of entry-level version of the code, you can literally just call basically one Python function and it will produce you an image. I'm using code which is perhaps a little bit more detailed: it's got the full loop in it, and I can go in and inject things and change things, so I can understand it better. We'll talk through that next time, perhaps.
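That "call one Python function" entry-level route is presumably something like the Hugging Face diffusers pipeline; this sketch uses common defaults and is not necessarily the exact code from the video:

```python
import torch
from diffusers import StableDiffusionPipeline

# Download the open-source Stable Diffusion weights and run the whole
# denoising loop behind one call. Runs on a Colab GPU.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe("frogs on stilts",
             num_inference_steps=50,        # the 50 iterations from the example
             guidance_scale=7.5).images[0]  # classifier-free guidance strength
image.save("frogs_on_stilts.png")
```

Everything discussed above (the schedule, the U-Net noise prediction, the text embedding, and the classifier-free guidance loop) is hidden inside that one call.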