How AI Image Generators Work (Stable Diffusion / Dall-E) - Computerphile
By Computerphile
Summary
## Key takeaways

- **GANs vs. Diffusion Models for Image Generation**: Generative Adversarial Networks (GANs) can struggle with training and may produce repetitive outputs (mode collapse), whereas diffusion models simplify image generation into iterative steps, making the process more stable and manageable. [01:46], [02:16]
- **Diffusion: Adding and Removing Noise Iteratively**: Diffusion models work by gradually adding noise to an image until it becomes pure noise, and then training a network to reverse this process step-by-step, making it easier to generate complex images from noise. [03:09], [04:05]
- **Predicting Noise for Image Reconstruction**: Instead of directly generating a clean image, diffusion models are trained to predict the noise that was added at a specific step, allowing for a more stable and easier-to-train process for image reconstruction. [06:55], [08:31]
- **Text-to-Image Generation with Classifier-Free Guidance**: To guide image generation with text prompts, diffusion models use 'classifier-free guidance', which involves running the noise prediction process twice, once with text embeddings and once without, to amplify the text's influence on the output. [11:46], [14:30]
- **Accessibility of Diffusion Models**: While training large diffusion models can be extremely expensive, open-source versions like Stable Diffusion are accessible and can be run for free using platforms like Google Colab, allowing users to experiment with AI image generation. [16:06], [16:22]
Topics Covered
- Why generative adversarial networks are difficult to train.
- The surprising secret: Diffusion models predict noise, not the image.
- Diffusion models: Iteratively removing noise to create images.
- How text prompts guide diffusion models to specific images.
- Classifier-Free Guidance: The hack for precisely targeted AI images.
Full Transcript
Generating images using diffusion: what is that? Right, so I should probably find out. It's things like DALL-E and DALL-E 2, yeah, Imagen from Google, and Stable Diffusion now as well. I've spent quite a long time messing about with Stable Diffusion; I'm having quite a lot of fun with that. So what I thought I'd do is download the code, read the paper, work out what's going on, and then we can talk about it.

I delved into this code and realized there's actually quite a lot to these things. It's not so much that they're complicated; it's just that there are a lot of moving parts.
So let's just have a quick reminder of generative adversarial networks, which were, I suppose, before now the standard way of generating images, and then we can talk about how diffusion is different and why we're doing it that way.

Having a deep network trained to just produce the same image over and over again is not very interesting, so we have some kind of random noise that we use to make the output different each time. Then we have some kind of very large generator network; I'm going to draw it as a black box, a big neural network that turns out an image which hopefully looks like the thing we're trying to produce: faces, landscapes, people.

Is that how those anonymous people on This Person Does Not Exist are made? Yeah, that's exactly how they work. I think that's using StyleGAN, and it's that exact idea: it's trained on a large corpus of faces, and it just generates faces at random, or at least mostly at random.

The way we train this is that we have millions and millions of pictures of the thing we're trying to produce. So we give the generator noise, it produces an image, and we have to tell it: is that good, or is that bad? We need to give this network some instruction on whether the image actually looks like a face, otherwise it's not going to train. So we have another network, which is sort of like the opposite, and it says: is this a real or a fake image? Half the time we give it fake images, and half the time we give it real faces. So the discriminator trains and gets better at discriminating between the fake images produced by the generator and the real images from the training set, and in doing so the generator has to get better at faking them, and so on and so forth, and the hope is that they just get better and better.

Now, that kind of works. The problem is that GANs are very hard to train. You have a lot of problems with things like mode collapse, where the generator just produces the same face: if it produces a face that fools the discriminator every time, there's not a lot of incentive for it to do anything interesting, because it has solved the problem; it's beaten the discriminator, let's move on. If you're not careful with your training process, these kinds of things can happen. And I suppose, intuitively, it's quite difficult to go from a bit of noise to a really beautiful-looking, high-resolution image in one go without there being some oddities.
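As a refresher, here is a minimal sketch of that adversarial training step, assuming PyTorch; the tiny stand-in models, names, and sizes are my own illustration, not anything from the video:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Tiny stand-in models; a real GAN uses much larger networks.
latent_dim, image_dim, batch_size = 64, 784, 32
generator = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                          nn.Linear(256, image_dim), nn.Tanh())
discriminator = nn.Sequential(nn.Linear(image_dim, 256), nn.ReLU(),
                              nn.Linear(256, 1), nn.Sigmoid())
g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

real_images = torch.rand(batch_size, image_dim) * 2 - 1  # placeholder for a real batch

# One adversarial training step.
z = torch.randn(batch_size, latent_dim)   # random noise in
fake_images = generator(z)                # generated images out

# Discriminator: score real images as 1, fakes as 0.
d_opt.zero_grad()
d_loss = (F.binary_cross_entropy(discriminator(real_images), torch.ones(batch_size, 1)) +
          F.binary_cross_entropy(discriminator(fake_images.detach()), torch.zeros(batch_size, 1)))
d_loss.backward()
d_opt.step()

# Generator: try to make the discriminator say "real".
g_opt.zero_grad()
g_loss = F.binary_cross_entropy(discriminator(fake_images), torch.ones(batch_size, 1))
g_loss.backward()
g_opt.step()
```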
So what we're going to do in diffusion models is try to simplify this process into a kind of iterative, small-step situation, where the work the network has to do at each step is slightly smaller, and you just run it a number of times.

We'll start again on the paper so we can clean things up a bit. We've got an image; let's say it's an image of a rabbit. We add some noise, so we've got the same rabbit with a bit of noise on it. It's not speckly noise, but I can't draw Gaussian noise. Then we add another bit of noise: it's the same shape of rabbit, there's just a bit more noise. And we keep going, and eventually we end up with just noise; it looks like nonsense. So the question is: how do we craft some kind of training algorithm, and some kind of inference process, so that we can actually deploy a network that can undo this process?

The first question is how much noise we add. Why don't we just add loads of noise? Delete all the intermediate images, add loads of noise in one go, and say: give me back the original. Then you've got a pair of training examples you could use. The answer is that it'll kind of work, but that's a very difficult job, and you've got sort of the same problem as with the GAN: you're trying to do everything in one go. The intuition, perhaps, is that it's maybe slightly easier to go from one step back to the previous one, just removing a little bit of noise at a time.

Well, in traditional image processing there are noise-removal techniques, right? It's not difficult to do that, is it? No, I mean, it's difficult in the sense that you don't know what the original image was. So what we're trying to do is train a network to undo this process. That's the idea, and if we can do that, then we can start with random noise, a bit like the GAN, and just iterate this process and produce an image. Now, there are a lot of missing parts here, so we'll start building up the complexity a little bit.
Okay, so the first thing is: let's go back to our question of how much noise we add. We could add a small amount of noise, and then the same amount again, and the same amount again, and keep adding it until we have essentially what looks like random noise. That would be what we'd call a linear schedule: the same amount of noise each time, basically. It's not interesting, but it works. The other thing you could do is add very little noise at the beginning and then ramp up the amount of noise you add later, and there are different strategies, depending on what paper you read, for the best approach to adding noise. But it's called the schedule. The idea is that you have a schedule that says: this is the image at time T equals nought, this is T equals one, and so on, up to some capital T, which is the final number of steps you've got. That last one represents essentially all noise, and each one in between represents some amount of noise, and you can change how much each step has.

The nice thing is that Gaussians add together very nicely, so you can say "I want T equals seven" and you don't have to produce all the intermediate images: you can jump straight to t equals seven, add exactly the right amount of noise, and hand that to the network. So when you train this, you can give it random images from your training set with random amounts of noise added based on this schedule, varying randomly between 1 and T, and you can say: here's a really noisy image, undo it; here's a slightly less noisy image, undo it.
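That "jump straight to any step" trick is, in the standard DDPM formulation the video is describing informally, the closed-form forward process. A minimal sketch, assuming PyTorch and the linear schedule described above; the variable names are mine:

```python
import torch

T = 1000                                   # total diffusion steps
betas = torch.linspace(1e-4, 0.02, T)      # linear noise schedule
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)   # cumulative product: signal kept by step t

def add_noise(x0, t):
    """Jump straight to timestep t: sample from q(x_t | x_0)."""
    eps = torch.randn_like(x0)             # fresh Gaussian noise
    xt = alpha_bar[t].sqrt() * x0 + (1 - alpha_bar[t]).sqrt() * eps
    return xt, eps                         # keep eps: it becomes the training target

# Example: a dummy "rabbit" image, noised to step 700 of 1000.
x0 = torch.rand(3, 64, 64) * 2 - 1         # images scaled to [-1, 1], centred on zero
xt, eps = add_noise(x0, 700)
```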
is you take your noise image image
right I'm going to keep going with this
rabbit it's taller than it was before
right you take your noisy image at some
time let's say t equals five right you
have a giant unit shaped Network we've
talked about encoder decoder networks
before there's nothing particularly
surprising about this one
and then you also put in the time right
because if we're running a funny
schedule where your at different times
have different amounts of noise you need
to tell the network where it is so that
it knows okay I'm gonna have to remove a
lot of noise this time or just a little
bit of noise what do we produce here
so we could go for the whole hog and we
just say we'll just produce the original
rabbit image but then you've got a
situation where you have to go from here
all the way back to the rabbit that's a
little bit difficult right
mathematically it works out a little bit
easier if we just try and predict the
noise we want to know what is the noise
that was added to this image
that you could use to get back to the
original image so this is all the noise
from
t1234 and five so you just get noise
basically out here like this right with
no rabbit that's the hope
and then theoretically you could take
that away from this and you get the
rabbit back right now if you did that
from here you would find that it's a
little bit iffy right because you know
you're predicting the noise all the way
back to this rabbit is maybe quite
difficult but if you did it from here it
may be not quite so difficult we want to
predict the noise so what we could do is
predict the noise at let's say time T
equals five and to say give me the noise
it takes us back to T equals four right
and then T equals three and T equals two
the problem if you do that is that
you're very
stuck doing the exact time steps of the
schedule used right if you used a
thousand time steps for training now
you've got to use a thousand time steps
of inference right you can't speed it up
so what we might try and do instead is
say well okay whatever time step you're
at you've got some amount of noise
remove it all predict me all the noise
in the image and just give me back that
noise that I can take away and get back
to the original image and so that's what
we do so during training we pick a
random Source image we pick a random
time step and we add based on our
schedule that amount of noise right so
we have
a noisy image
a Time step T we put that into the
network
and we say what was the noise
that
we've just added to that image right now
we haven't given it the original image
right so that's what's Difficult about
this we we have the original image
without any noise on it that we're not
showing it and we added some noise and
we want that noise back right so we can
do that very easily we've got millions
of images in our or billions of images
in our data set right we can add random
bits of noise and we can say what was
that noise right and over time it starts
to build up a picture of what that noise
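Combined with the schedule sketch above, one training step might look like the following, assuming PyTorch, the `T` and `alpha_bar` defined earlier, and a hypothetical `unet(xt, t)` that returns a noise prediction:

```python
import torch
import torch.nn.functional as F

def training_step(unet, optimizer, images):
    """One DDPM-style training step: predict the noise that was added.
    Reuses T and alpha_bar from the schedule sketch above."""
    b = images.shape[0]
    t = torch.randint(0, T, (b,))            # a random timestep per image
    eps = torch.randn_like(images)           # the noise we add (and must recover)
    ab = alpha_bar[t].view(b, 1, 1, 1)       # broadcast the schedule over image dims
    xt = ab.sqrt() * images + (1 - ab).sqrt() * eps

    eps_pred = unet(xt, t)                   # network sees only noisy image + time
    loss = F.mse_loss(eps_pred, eps)         # simple L2 loss on the predicted noise

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```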
So it sounds like a really good plug-in for Photoshop or something: a noise-removal plug-in. How does that turn into creating new images? Yeah, in some sense that's the clever bit: how we use this network, which predicts noise, to undo the noise.

We've got a network which, given an image with some noise added to it, and a time step that represents roughly how much noise that is, or where we are in the noising process, produces an estimate of what that noise is in total. Theoretically, if we take that noise away, we get back the original image. Now, that is not a perfect process: this network is not going to be perfect, and if you give it an incredibly noisy image and take away what it predicts, you'll get maybe a vague shape. So what we want to do is take it a little bit more slowly.

We take the predicted noise and subtract it from our image to get an estimate of what the original image was, at t equals nought. It's not going to look very good the first time. But then we add a bunch of the noise back again, and we get to a t that's slightly less than the one we started at. So maybe this was t equals ten: we add, say, nine tenths of the noise back, and we get to roughly t equals nine. Now we have a slightly less noisy image, and we can repeat the process: we put the slightly less noisy image in, we predict how to get back to t equals nought, and we add back most, but not all, of the noise. Each time we loop, we get a little bit closer to the original image. It was very difficult to predict the noise at t equals ten; it's slightly easier at t equals nine; and it's very easy at t equals one, because it's mostly the image with a little bit of noise on it. So if we just sort of feel our way towards it, taking off little bits of noise at a time, we can actually produce an image.

So you start off with a noisy image, you predict all the noise and remove it, and then you add back most of it. At each step you have an estimate of what the original image was, and a next image which is just a little bit less noisy than the one before, and you loop this a number of times. That's basically how the image generation process works: you take your noisy image, you loop, and you gradually remove noise until you end up at what the network thinks was the original image. And you're doing this by predicting the noise and taking it away, rather than spitting out an image with less noise, which mathematically works out a lot easier to train, and it's a lot more stable than a GAN.
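That loop ("predict all the noise, subtract it, add most of it back") can be sketched like this, reusing `T`, `alpha_bar`, and the hypothetical `unet` from above. What's described here maps most closely onto DDIM-style sampling; this is an illustration of the idea, not a claim about the exact code used in the video:

```python
import torch

@torch.no_grad()
def sample(unet, steps=50):
    """Generate an image by iteratively denoising pure Gaussian noise."""
    xt = torch.randn(1, 3, 64, 64)                  # start from pure noise
    ts = torch.linspace(T - 1, 0, steps).long()     # e.g. 50 of the 1000 trained steps
    for i, t in enumerate(ts):
        eps_pred = unet(xt, t.view(1))              # estimate ALL the noise in the image
        ab = alpha_bar[t]
        # Subtract the predicted noise to get an estimate of the clean image...
        x0_est = (xt - (1 - ab).sqrt() * eps_pred) / ab.sqrt()
        x0_est = x0_est.clamp(-1, 1)
        if i + 1 < len(ts):
            # ...then add most (not all) of the noise back, landing at a
            # slightly less noisy timestep, and loop.
            ab_prev = alpha_bar[ts[i + 1]]
            xt = ab_prev.sqrt() * x0_est + (1 - ab_prev).sqrt() * eps_pred
        else:
            xt = x0_est                             # final step: keep the estimate
    return xt
```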
There's an elephant in the room here. There is. You're talking about how to make random images, effectively; how do we direct this? So that's where the complexity starts ramping up. We've got a structure where we can train a network to produce random images, but it's not guided: there's no way of saying "I want a frog-rabbit hybrid", which I've done, and it's very weird. So how do we do that? The answer is that we condition this network; that's the word we'd use. We basically give it access to the text as well.

So let's actually run inference on an image, on my piece of paper, bearing in mind the output is going to be hand-drawn by me, so it's going to be terrible. You start off with a random noise image: just an image you've generated by taking random Gaussian noise. Mathematically, this is centred around zero, so you have negative and positive numbers; you don't go from 0 to 255, because it's just easier for the network to train. You put in your time step. Let's say you're going to do 50 iterations, so we put in a time step right at the end of our schedule: time step equals 50, which is our most noised image. Then you pass it through the network and say: estimate me the noise. We also take our string, which is "frogs on stilts". I'll have to try that later; we could spend, let's say, another 20 or 30 minutes producing frogs on stilts. We embed this string using our GPT-style transformer embedding, we stick that in as well, and then the network produces an estimate of how much noise it thinks is in that image.

That estimate at t equals 50 is going to be a bit average: it's not going to produce you a frog-on-a-stilt picture; it's going to produce you a grey image, or a brown image, or something like that, because that is a very, very difficult problem to solve. However, if you subtract this noise from this image, you get your first estimate of what your final image is, and when you add back a bunch of noise, you get to t equals 49. So now we've got slightly less noise, and maybe there's the vaguest outline of a frog on a stilt. At t equals 49 you take your embedding and you put that in as well, and you get another, maybe slightly better, estimate of the noise in the image. And then we loop: it's a for loop, we've done those before. You take the output, you subtract it, you add noise back, you repeat the process, and you keep adding in this text embedding.
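The video doesn't pin down which embedding model is used; for Stable Diffusion v1 specifically, the text encoder is CLIP's transformer, so an illustrative sketch using the Hugging Face transformers library would be:

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# Stable Diffusion v1 conditions on CLIP's text encoder; other models
# (e.g. Google's Imagen) use different transformers, but the idea is the same.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

tokens = tokenizer(["frogs on stilts"], padding="max_length",
                   max_length=77, truncation=True, return_tensors="pt")
with torch.no_grad():
    # One embedding vector per token; the U-Net attends to these
    # at every denoising step.
    text_emb = text_encoder(tokens.input_ids).last_hidden_state  # shape (1, 77, 768)
```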
Now, there's one final trick they use to make things a little bit better. If you just do this, you will get a picture that maybe looks slightly frog-like, maybe there's a stilt in it, but it won't look anything like the images you see on the internet that have been produced by these tools, because they do another trick to make the output even more tied to the text. What you do is something called classifier-free guidance. You actually put the image through twice: once where you include the embeddings of the text, and once where you don't. The network is maybe slightly better at estimating the noise when it has the text. So you put in two versions: this one with the embedding, and this one with no embedding. The no-embedding prediction is maybe slightly more random noise, and the embedded one is slightly more frog-like, or at least it's slightly moving towards the right thing. We can calculate the difference between these two noise predictions and amplify that signal, and then feed that back. So what we essentially do is say: okay, if this network wasn't given any information on what was in the image, and then this version of the network was, what's the difference between those two predictions, and can we amplify that as we loop, to really target this kind of output? The idea is that you're really forcing this network, or this loop, to point in the direction of the scene we want.

That's called classifier-free guidance, and it is somewhat of a hack at the end of the network, but it does work. If you turn it off, which I've done, it produces vague sorts of structures that kind of look right; it's not terrible. I think I did "a Muppet cooking in the kitchen", and it just produced me a picture of a generic kitchen with no Muppet in it. But if you do this, then you suddenly are really targeting what you want.
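In code, "run it twice and amplify the difference" is the standard classifier-free guidance update. A sketch, assuming the hypothetical `unet` now also takes a text embedding; the `guidance_scale` name follows the diffusers convention, and 7.5 is a common default rather than a value from the video:

```python
import torch

def guided_noise_estimate(unet, xt, t, text_emb, empty_emb, guidance_scale=7.5):
    """Classifier-free guidance: two predictions, amplify their difference."""
    eps_uncond = unet(xt, t, empty_emb)   # no information about the prompt
    eps_text = unet(xt, t, text_emb)      # conditioned on "frogs on stilts"
    # Move further in the direction the text pushed the prediction.
    return eps_uncond + guidance_scale * (eps_text - eps_uncond)
```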
Standard question, got to ask it: is this something people can play with, without just going to one of these websites and typing in some words? Well, yeah. The thing is, it costs hundreds of thousands of dollars to train one of these networks, because of how many images they use and how much processing power they use. The good news is that there are ones, like Stable Diffusion, that are available to use for free, and you can use them through things like Google Colab. I did this through Google Colab, and it works really, really well; maybe we'll talk about that in another video, where we delve into the code and see all of these bits happening within it. I blew through my free Google allowance very, very quickly, and I had to pay my eight pounds for premium Google access. So, you know: eight pounds. Eight pounds! Thank you. Never let it be said that I spare expense: I spare no expense on Computerphile when it comes to getting access to proper compute hardware.
Could Beast do something like that? It could, yeah; most of our servers could. I'm just a bit lazy and haven't set them up to do so. But actually, the code is quite easy to run. With the sort of entry-level version of the code, you can literally just call basically one Python function and it will produce you an image. I'm using code which is perhaps a little bit more detailed: it's got the full loop in it, and I can go in and inject things and change things, so I can understand it better. We'll talk through that next time, perhaps.
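That "call one Python function" entry-level route is presumably something like the Hugging Face diffusers pipeline; this sketch uses common defaults and is not necessarily the exact code from the video:

```python
import torch
from diffusers import StableDiffusionPipeline

# Download the open-source Stable Diffusion weights and run the whole
# denoising loop behind one call. Runs on a Colab GPU.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe("frogs on stilts",
             num_inference_steps=50,        # the 50 iterations from the example
             guidance_scale=7.5).images[0]  # classifier-free guidance strength
image.save("frogs_on_stilts.png")
```

Everything discussed above (the schedule, the U-Net noise prediction, the text embedding, and the classifier-free guidance loop) is hidden inside that one call.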