Accept-Reject Sampling : Data Science Concepts
By ritvikmath
Summary
## Key takeaways
- **Accept-Reject Sampling: No CDF Needed**: Accept-Reject sampling allows sampling from a distribution without needing the cumulative distribution function (CDF) or even the full probability density function (PDF). [00:55], [01:03]
- **PDF Numerator is Sufficient**: Often in real-world scenarios, you know the numerator of a probability density function (PDF) but not the normalizing constant. Accept-Reject sampling works with just this numerator. [02:01], [03:05]
- **The Two-Step Process**: The method involves sampling from an easy-to-sample-from distribution 'G' that spans the same domain as the target distribution, and then accepting or rejecting that sample based on a calculated probability. [04:26], [06:31]
- **Intuition: F/G Ratio is Key**: The core intuition lies in the ratio F(s)/G(s); high ratios indicate samples rare in G but common in the target distribution P, thus increasing their acceptance probability. [12:38], [14:34]
- **Choosing 'G' Impacts Efficiency**: While not a requirement for the algorithm to work, choosing a proposal distribution 'G' that closely matches the shape of the target distribution 'P' significantly impacts the efficiency of the sampling process. [05:07], [17:02]
- **Inefficiency from Poor 'G' Choice**: If the proposal distribution 'G' is poorly chosen (e.g., a normal distribution for a highly spiked target distribution), the scaling factor 'M' becomes very large, leading to a very low acceptance probability and an inefficient process. [15:53], [16:14]
Topics Covered
- Why inverse transform sampling isn't always practical.
- How accept-reject sampling generates target distribution samples.
- Why the F/G ratio drives sample acceptance.
- How a poor proposal distribution ruins sampling efficiency.
Full Transcript
hey everyone, welcome back. So we're
filming this around the holidays, so I've got
my holiday pajamas on so we can be
festive. The topic of this video is
going to be called accept-reject
sampling, and sometimes people just call
this rejection sampling. So the last
video we made on sampling from a
distribution was a little while ago I'll
post that video in the description and
it was called inverse transform sampling
so that was a method to sample from some
distribution it's a cool method but the
biggest issue with it by far is that we
explicitly need the cumulative
distribution function the CDF in order
to do this method and for a lot of real
world distributions either those that
you're using in your research or your
job or ones you're forming yourself
you're simply just not going to have
access to the CDF and even if you do
another step in inverse transform
sampling is to invert the CDF we need to
take the inverse and that might be
difficult even if you have it. So this
method, accept-reject sampling, solves
a lot of those issues, at least the first
two: it's a method for which you don't need
the CDF, and in fact you don't even need
the full form of the PDF which is the
probability density function so let's go
on we're going to be working with a real
world kind of example here let's say
that you are the counselor for some
department at a university and the
students in this department normally
take about four years to complete their
studies and let's say that the way your
department works halfway into their
studies so two years in they take an
assessment which tells us how well or
how badly they're doing in their studies
let's say this assessment can take any
score s, where s is the score they
get on the assessment, between negative
infinity (more negative means they're
not doing well) and positive infinity
(higher values mean they are doing well),
and now let's say that the PDF the
probability density function of the
scores that students get on these exams
is given by P of s so again to be clear
P of s is a probability density function
for s, the score that the student gets
on the exam. Now, we don't know P of s,
that's where this method starts to get
interesting we do know however the
numerator of P of s so P of s is equal
to some numerator F of s so F of s is
the function which is the numerator of P
of s divided by NC NC stands for
normalizing constant and now this is
actually a very common thing that we see
in probability and statistics especially
as we get into research and work in
these fields where we get this case
where we often know the numerator of
some probability density function but we
don't know the denominator. And why is
that the case? Well, if we think about
this NC, how would we compute it? NC
would simply be the integral of the
numerator between negative infinity and
infinity; that's how we define these
normalizing constants and so that form
is given down here but if you think
about taking the integral of this
function which we haven't talked about
yet but it already looks pretty
disgusting. If you think about taking the
integral of this thing between negative
infinity and infinity, it doesn't look like a
fun time. Oftentimes it's not even
possible; oftentimes it's just really
difficult so we often get this situation
where we can easily get the numerator of
the probability density function but we
don't know the explicit form for the
normalizing constant and that's the case
we're going to be assuming in this video
here so speaking of that numerator this
is the form of the numerator we won't
need to worry too much about it for this
video we're just going to be talking
about it in general but to talk about it
a little bit we see that it's a
piece-wise function so if the student's
score is positive it's given by this sum
of exponential functions if the student
score is negative it's given by this
slightly different sum of exponential
functions and the graph is loosely shown
here so it looks almost symmetric but
it's not because there's two different
functions going on it looks almost
normal but it's not again because it
doesn't follow the exact form of a
normal distribution so while it looks
similar to a normal distribution kind of
it's not the same, and so we can't as
easily sample from it. Which does beg the
question: if we have this probability
density function P of s, but we don't
explicitly know it, we only know its
numerator, what hope do we even have of
sampling from this distribution? It seems
pretty hard, if not impossible.
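To make this concrete, here is a minimal sketch of what such a numerator might look like: a piecewise pair of sums of exponential terms, almost symmetric and almost normal but not exactly either, together with a brute-force numerical approximation of the normalizing constant. The specific constants below are illustrative assumptions, not the exact function from the video.

```python
import math

def f(s):
    # Hypothetical unnormalized numerator F(s): two different sums of
    # exponential terms on either side of zero, so the curve is almost
    # symmetric and almost bell-shaped, but not exactly normal.
    # (These constants are illustrative, not the video's exact function.)
    if s >= 0:
        return math.exp(-0.6 * s * s) + math.exp(-1.2 * s * s)
    return math.exp(-0.8 * s * s) + math.exp(-1.0 * s * s)

# The normalizing constant NC is the integral of f over (-inf, inf).
# In real problems this integral is exactly the thing we can't get in
# closed form; here we just approximate it with a midpoint rule.
ds = 0.001
nc = sum(f(-20.0 + (i + 0.5) * ds) * ds for i in range(int(40.0 / ds)))
print(round(nc, 2))  # approximately 3.83
```

The point is that f is cheap to evaluate at any single point s, even when its integral (and therefore the full PDF P of s) is out of reach, and pointwise evaluation of the numerator is all accept-reject sampling will need.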
well let's move on here so how do we
sample from P of s the next thing I'm
going to do so I went through several
iterations of this whiteboard trying to
figure out the best way to convey this
message and I think what I'll do first
is show you the method so for the next
minute or so just I'm going to show you
the method but I promise to come back
and explain it intuitively and show you
why it works but first let's see how the
method of accept reject sampling works
the first thing we do is we sample from
some other distribution G of s, which is
in some sense close to P of s, so the
shape is as similar to P of s as we can
make it, and also we want this to be
easy to sample from. To be more clear
here, G of s only needs to satisfy two
properties the first is that we can
easily sample from it and the second is
that it needs to span the same domain as
our Target distribution so our Target
distribution goes from minus to positive
infinity and therefore the only two
conditions we absolutely need for this
different distribution G of s are that it is
easy to sample from and it also goes
from negative to positive infinity. This
idea of being close to P of s, the
shape being similar to P of s, is a big
added bonus that's going to make our
whole process more efficient as we'll
see at the end of this video but it's
not a requirement but since we were
talking about how this is almost
symmetric and almost normal I think the
natural thing to use here is G of s
being the normal distribution so that's
what we'll use here and so if we plot F
which is again this black curve here
that's the same curve we saw on this
side of the board a moment ago and we
plot this green curve G which is the
normal distribution then this is what
they look like now the next step we need
to do is ensure that the green line is
always above the black line so we want
to scale G up we want to multiply G by
some large enough number such that it is
always going to be above F which is that
black curve so we're going to pick some
number M such that for any value of s I
choose so any value of s along this line
If I multiply M and G together it's
always going to be above F and now that
we have all of this set up this is how
accept reject sampling actually works so
step one is to get a sample from G of s;
again, we said that that is possible
because G of s is easy to sample from. So
we're going to, boop, get a sample from G
of s. That could be anywhere: it could be
here, it could be here. It's going to be
more likely in places where G of s has a
higher density and less likely in places
where G of s has a lower density,
naturally so that's step number one and
step number two is we're either going to
accept this sample or reject this sample
we accept this sample s with some
probability given by this formula, which
is F of s, s being the thing we just
sampled from G, divided by M times G of s. And
now just a quick note how do we know
that this thing can be interpreted as a
probability how do we know it's bounded
between zero and one well that comes
back to why we did this transformation
of multiplying G by some big enough
constant M because we know that this
form M * G of s which is exactly what we
see in the denominator here is always
going to be above
F of s which is the numerator here so
this is safely assumed to be interpreted
as a probability it's bounded between 0
and one okay so no issues there so again
the process is pretty simple once you
have this set up: you simply keep
getting samples from G of s, and for each
sample you compute this acceptance
probability and then you simply just
choose to accept or reject that sample
you just got with that probability and
so as you go through this method you'll
have a bunch of samples going to the
accept pile which is this green bar I
visualized up here and other samples
will go into the reject pile if they
don't meet that probability which is
that reject pile over here and then
after some stopping condition however
many samples you want to get you're
going to stop and the samples that are
in the accept pile so all of the samples
that you've accepted through this
process are, very mysteriously right now,
actually going to be as if you sampled
from P of s itself. Now I want to pause
here and say that when I first learned
this when I first learned the method it
seemed like magic it didn't seem like it
should work it doesn't seem like there's
anything in it that inherently really
has to do with P of s so how can we say
that all these things that we've
accepted are actually a draw from P of s
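Before the proof, the two-step recipe itself can be sketched in code. This is a minimal sketch assuming an illustrative unnormalized numerator f (not the video's exact function), a standard normal proposal for G, and a bound M chosen by hand so that M times G stays above f everywhere; all constants here are assumptions for illustration.

```python
import math
import random

def f(s):
    # Illustrative unnormalized target numerator F(s) (assumed constants).
    if s >= 0:
        return math.exp(-0.6 * s * s) + math.exp(-1.2 * s * s)
    return math.exp(-0.8 * s * s) + math.exp(-1.0 * s * s)

def g(s):
    # Standard normal proposal density G(s): easy to sample from, and it
    # spans the same (-inf, inf) domain as the target.
    return math.exp(-0.5 * s * s) / math.sqrt(2.0 * math.pi)

# M scales G so that M * g(s) >= f(s) for every s. For this particular f
# the ratio f/g peaks at s = 0, where it equals 2*sqrt(2*pi) ~ 5.01, so
# M = 6 keeps the green curve above the black curve everywhere.
M = 6.0

def accept_reject(n):
    accepted, proposed = [], 0
    while len(accepted) < n:
        s = random.gauss(0.0, 1.0)               # step 1: sample from G
        proposed += 1
        if random.random() < f(s) / (M * g(s)):  # step 2: accept w.p. f/(M*g)
            accepted.append(s)                   # goes in the accept pile
    return accepted, proposed

random.seed(0)
samples, proposed = accept_reject(5000)
print(len(samples) / proposed)  # empirical acceptance rate
```

The printed acceptance rate hovers around the normalizing constant divided by M, which also previews the efficiency point at the end of the video: if a spiky target forced M to be 600 instead of 6, the same loop would accept roughly 100 times fewer proposals.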
and so how I'll do it first I'll prove
it to you mathematically but that's
usually not enough for me to really
understand it from first principles so
after we prove it mathematically I'm
going to go back and talk about more
intuitively why this works what we want
to prove mathematically is if we look at
all of the samples that we have accepted
in this process so if we look at all the
samples that are in this green accept
bar up here
then we want to show that the density of
those samples is actually P of s if we
can show that that proves that all the
samples in there are samples from P of s
so we'll start by asking what is the
density. This capital D is just kind
of a stand-in for a density; if you want,
you can think of it as a probability
symbol. I want to be a little bit
careful here, since densities are not
exactly probabilities, but if it mentally
helps, just replace all these capital
D's by probability symbols and you'll get
the same intuition. So, the density of a
sample given that we have accepted
it okay that's exactly what we're after
right now we're going to use a little
bit of Bayes' theorem here so whenever we
have a conditional we can write it as
the combination of the following terms
the first one being the probability of
accepting a sample given the value of
the sample is little s times the
unconditional density of that sample
little s all divided by the probability
that we accept a sample again
unconditional so getting from here to
here is just using Bayes' theorem so now
we can actually get some good forms for
all these components so first of all
what's the probability of accepting a
sample given that its value is little s
well we literally know that's equal to
this formula that's how we constructed
the process. So we're going to put in F
of s divided by M times G of s for that term. Now
what is the unconditional density of
getting the sample little s in the first
place well that's just the probability
or more specifically the probability
density that our different distribution
G would give us that sample back and so
that would just be G of s and that's all
divided by the probability that we
accept a sample unconditionally so the
only difficult thing to compute now is
what's the probability that we accept a
sample unconditionally and we're going
to work that out here it's not too
tricky so we're asking about what's the
probability that we accept a sample
unconditionally so it turns out this can
be written as an integral because notice
there's no conditional on here so we
need to go over all the different
possible samples from negative Infinity
to infinity and ask about what's the
probability that we would get this and
accept this and the first part that I
just said the probability that we would
get this sample that it would emerge is
exactly the probability that g would
suggest this sample which is G of s so
that is the probability that we would
get proposed this sample and given that
we're proposed that sample what's the
probability that we accept it that is
again the same form here, F of s divided
by M times G of s, the acceptance probability.
Now we get this nice cancellation of G
of s and G of s, so this integral
simplifies to 1 over capital M times the
integral of F of s ds, from negative infinity to infinity.
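Written out symbolically, the cancellation just described is:

```latex
P(\text{accept})
  = \int_{-\infty}^{\infty} G(s) \cdot \frac{F(s)}{M \, G(s)} \, ds
  = \frac{1}{M} \int_{-\infty}^{\infty} F(s) \, ds
  = \frac{NC}{M}
```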
seems a little bit tricky but have we
seen that form before well that is
exactly the formula for the normalizing
constant so it turns out interestingly
enough that the probability of accepting
a sample using this method is given by
normalizing constant whatever that is
divided by Big M Big M again being the
thing that we multiply G by to make it
always above F. And now, completing this
formula, let's bring this down here: we
see that the density of s given accept is
equal to... this G and this G will
actually cancel, so the numerator is just
F of s over M, and the denominator is the
probability of accepting, which we just said is the
normalizing constant divided by M. These M's
also cancel out, and mysteriously,
magically enough, we get back that the
density of getting a sample, given that
we accepted it, is equal to F of s divided by the
normalizing constant, which is exactly P
of s. We have
just proved mathematically using the
still mysterious method where we use
this different distribution G to get
candidates and accept them with this
probability here if we do that we are
exactly going to be sampling from P of s
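Collecting the whole argument into one chain, in the same notation:

```latex
D(s \mid \text{accept})
  = \frac{P(\text{accept} \mid s)\, G(s)}{P(\text{accept})}
  = \frac{\frac{F(s)}{M\,G(s)}\, G(s)}{\frac{NC}{M}}
  = \frac{F(s)/M}{NC/M}
  = \frac{F(s)}{NC}
  = P(s)
```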
so this works; this is mathematically
sound, it works. But if you're anything
like me you're not satisfied at this
point because we don't really understand
why it works what's the intuition behind
it and so let's talk about that next so
I think the key in intuitively
understanding this method is hidden in
this board and let me highlight the
place that I think is the most important
so this ratio here, F of s divided by G of s.
Let's ignore the M for a second; we'll
bring that back in. What does that ratio,
F of s divided by G of s, inherently mean?
well let's think about if that ratio was
very high if that ratio that I just
circled was very high that means the
numerator was very high and the
denominator was low if the numerator is
high that means that the sample we are
talking about is very likely in the
distribution P because P and F are
proportional to each other so if there's
a value of s such that f is very high
that means that value of s is actually
very high in P of s as well so if the
numerator is high it means that it is a
sample that is very very
likely. If the denominator is low, that
means it's a sample that is very
rare in G of s; it has a very low density.
So what would you say if I told you that
here's a sample that is very likely in
your target distribution but very
unlikely in your proposal distribution,
your candidate distribution G? Well, my
first instinct would be: accept it, accept
it! We might not see it again, because we
only see samples if they are proposed by
G, and if this G is very low for that
sample, we might not see the sample again
for a really long time. So let's jump on
it and accept that sample now, while we can,
and that's where the intuition comes in
because High values for this ratio imply
that this is a very important sample
important because it's rare to get
proposed and also because it's very high
probability in our Target distribution
that leads to very high probabilities of
accept which is why this is exactly the
form of our ratio. And this big M is just
there so that this F over G can be
interpreted as a probability, because F
over G can be higher than one,
arbitrarily higher, or lower than one. So M
is just there to scale everything down,
to make sure we can interpret this as
a probability. But the main point I want
to get across is that the thing I've
circled, F over G, is exactly the quantity
that is driving this whole process if
F divided by G is high, these are samples that are
very rare to get proposed, but if they
do get proposed they're extremely common in
our target distribution, so we should
accept them. If F divided by G is low,
that means these are samples that are
very commonly proposed (G is high) while F
is low, which means they don't
actually occur much in our target
distribution, so we don't care too much
about accepting them. So once I saw that
link, I think the process made
little bit more sense to me where some
different distribution G is throwing
samples at us and we are accepting them
based on this ratio here M just being a
normalizer. And so that's accept-reject
sampling. And the last thing I'll talk
about is just an issue that people
commonly face in accept-reject
sampling, and that is the choice of G of
s as we said here we used a normal
distribution because it was sort of
similar to this and it was an easy
distribution to sample from but even
though I framed this as kind of a real
world problem
real world distributions are usually
even more complex than something that is
kind of nice looking as this for example
if this black line this crazy squiggly
black line was your F then we need to go
through the same steps if we want to use
the normal distribution again we need to
multiply the normal distribution by some
big enough constant M so it's always
going to be above this. And that wouldn't
really be a problem, except for this
terrible spike that happens in F, which
means we need to scale the normal
distribution up so much that it's going
to be above this spike. What that means
is that M is going to be huge, because we
need to multiply the normal distribution
by a very large number to always be
above F. And what does it mean if M is
huge? Well, look at this probability of
accept: that was the normalizing constant
divided by M. If M is massive, then this
quantity becomes very small, so the
probability of accepting a sample goes
towards zero. What that means for our
process is that G is going to keep
proposing samples: hey, do you want to
accept this? Do you want to accept this?
But because our M was so large, our
probability of accepting any of those
samples is extremely low, very close to
zero. So this process is going to take a
very, very long time for us
to get however many samples we need to
get. So although this algorithm, this
method that I've proposed, seems pretty
simple, it's just a two-step process, the
real heavy lifting and the real creative
work, if you choose to use this process,
come before that, in choosing a G that
is sort of similar to your target
distribution. Which isn't a problem if
your target distribution looks nice, but
is potentially a problem if your target
distribution looks kind of disgusting
like this. So choosing your G is a
big art, and if you do it wrong, you're
going to end up with a very inefficient
process here. Okay, so that was accept-reject
sampling, or sometimes it's called
just rejection sampling. Hopefully
I've shown you a couple of things: I've
shown you why we need it instead of
inverse transform sampling many times in
the real world, I've shown you how it
works, I've shown you mathematically
that it works, but most importantly,
hopefully you understand now intuitively
why it works, and it all comes down to
this F over G ratio. Okay, so if you learned
something, please like and subscribe; any
comments are always welcome below, and
I'll see you next time.