Accept-Reject Sampling : Data Science Concepts
By ritvikmath
Summary
## Key takeaways
- **Accept-Reject Sampling: No CDF Needed**: Accept-Reject sampling allows sampling from a distribution without needing the cumulative distribution function (CDF) or even the full probability density function (PDF). [00:55], [01:03]
- **PDF Numerator is Sufficient**: Often in real-world scenarios, you know the numerator of a probability density function (PDF) but not the normalizing constant. Accept-Reject sampling works with just this numerator. [02:01], [03:05]
- **The Two-Step Process**: The method involves sampling from an easy-to-sample-from distribution 'G' that spans the same domain as the target distribution, and then accepting or rejecting that sample based on a calculated probability. [04:26], [06:31]
- **Intuition: F/G Ratio is Key**: The core intuition lies in the ratio F(s)/G(s); high ratios indicate samples rare in G but common in the target distribution P, thus increasing their acceptance probability. [12:38], [14:34]
- **Choosing 'G' Impacts Efficiency**: While not a requirement for the algorithm to work, choosing a proposal distribution 'G' that closely matches the shape of the target distribution 'P' significantly impacts the efficiency of the sampling process. [05:07], [17:02]
- **Inefficiency from Poor 'G' Choice**: If the proposal distribution 'G' is poorly chosen (e.g., a normal distribution for a highly spiked target distribution), the scaling factor 'M' becomes very large, leading to a very low acceptance probability and an inefficient process. [15:53], [16:14]
Topics Covered
- Why inverse transform sampling isn't always practical.
- How accept-reject sampling generates target distribution samples.
- Why the F/G ratio drives sample acceptance.
- How a poor proposal distribution ruins sampling efficiency.
Full Transcript
hey everyone, welcome back. So we're
filming this around the holidays, so I've got
my holiday pajamas on so we can be
festive. The topic of this video is
going to be called accept-reject
sampling, and sometimes people just call
this rejection sampling. So the last
video we made on sampling from a
distribution was a little while ago I'll
post that video in the description and
it was called inverse transform sampling
so that was a method to sample from some
distribution it's a cool method but the
biggest issue with it by far is that we
explicitly need the cumulative
distribution function the CDF in order
to do this method and for a lot of real
world distributions either those that
you're using in your research or your
job or ones you're forming yourself
you're simply just not going to have
access to the CDF and even if you do
another step in inverse transform
sampling is to invert the CDF we need to
take the inverse and that might be
difficult even if you have it. So this
method, accept-reject sampling, solves
a lot of those issues, at least the first
two: it's a method for which you don't need
the CDF, and in fact you don't even need
the full form of the PDF which is the
probability density function so let's go
on we're going to be working with a real
world kind of example here let's say
that you are the counselor for some
department at a university and the
students in this department normally
take about four years to complete their
studies and let's say that the way your
department works halfway into their
studies so two years in they take an
assessment which tells us how well or
how badly they're doing in their studies
let's say this assessment can take any
score s, where s is the score they
get on the assessment, between negative
infinity (more negative means they're
not doing well) and positive infinity
(higher values mean they are doing well),
and now let's say that the PDF the
probability density function of the
scores that students get on these exams
is given by P of s so again to be clear
P of s is a probability density function
for s, the score that the student gets
on the exam. Now, we don't know P of s,
that's where this method starts to get
interesting we do know however the
numerator of P of s so P of s is equal
to some numerator F of s so F of s is
the function which is the numerator of P
of s divided by NC NC stands for
normalizing constant and now this is
actually a very common thing that we see
in probability and statistics especially
as we get into research and work in
these fields where we get this case
where we often know the numerator of
some probability density function but we
don't know the denominator. And why is
that the case? Well, if we think about
this NC, how would we compute it? NC
would simply be the integral of the
numerator between negative infinity and
infinity; that's how we define these
normalizing constants and so that form
is given down here but if you think
about taking the integral of this
function which we haven't talked about
yet but it already looks pretty
disgusting. If you think about taking the
integral of this thing between negative
infinity and infinity, it doesn't look like a
fun time. Oftentimes it's not even
possible; oftentimes it's just really
difficult so we often get this situation
where we can easily get the numerator of
the probability density function but we
don't know the explicit form for the
normalizing constant and that's the case
we're going to be assuming in this video
here so speaking of that numerator this
is the form of the numerator we won't
need to worry too much about it for this
video we're just going to be talking
about it in general but to talk about it
a little bit we see that it's a
piece-wise function so if the student's
score is positive it's given by this sum
of exponential functions if the student
score is negative it's given by this
slightly different sum of exponential
functions and the graph is loosely shown
here so it looks almost symmetric but
it's not because there's two different
functions going on it looks almost
normal but it's not again because it
doesn't follow the exact form of a
normal distribution so while it looks
similar to a normal distribution kind of
it's not the same, and so we can't as
easily sample from it. Which does beg the
question: if we have this probability
density function P of s, but we don't
explicitly know it, we only know its
numerator, what hope do we even have of
sampling from this distribution? It seems
pretty hard, if not impossible.
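To make this concrete, here is a minimal sketch of what such a numerator might look like: a piecewise pair of sums of exponential terms, almost symmetric and almost normal but not exactly either, together with a brute-force numerical approximation of the normalizing constant. The specific constants below are illustrative assumptions, not the exact function from the video.

```python
import math

def f(s):
    # Hypothetical unnormalized numerator F(s): two different sums of
    # exponential terms on either side of zero, so the curve is almost
    # symmetric and almost bell-shaped, but not exactly normal.
    # (These constants are illustrative, not the video's exact function.)
    if s >= 0:
        return math.exp(-0.6 * s * s) + math.exp(-1.2 * s * s)
    return math.exp(-0.8 * s * s) + math.exp(-1.0 * s * s)

# The normalizing constant NC is the integral of f over (-inf, inf).
# In real problems this integral is exactly the thing we can't get in
# closed form; here we just approximate it with a midpoint rule.
ds = 0.001
nc = sum(f(-20.0 + (i + 0.5) * ds) * ds for i in range(int(40.0 / ds)))
print(round(nc, 2))  # approximately 3.83
```

The point is that f is cheap to evaluate at any single point s, even when its integral (and therefore the full PDF P of s) is out of reach, and pointwise evaluation of the numerator is all accept-reject sampling will need.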
well let's move on here so how do we
sample from P of s the next thing I'm
going to do so I went through several
iterations of this whiteboard trying to
figure out the best way to convey this
message and I think what I'll do first
is show you the method so for the next
minute or so just I'm going to show you
the method but I promise to come back
and explain it intuitively and show you
why it works but first let's see how the
method of accept reject sampling works
the first thing we do is we sample from
some other distribution G of s, which is
in some sense close to P of s, so the
shape is as similar to P of s as we can
make it, and also we want this to be
easy to sample from. To be more clear
here, G of s only needs to satisfy two
properties the first is that we can
easily sample from it and the second is
that it needs to span the same domain as
our Target distribution so our Target
distribution goes from minus to positive
infinity and therefore the only two
conditions we absolutely need for this
different distribution G of s are that it is
easy to sample from and it also goes
from negative to positive infinity. This
idea of being close to P of s, the
shape being similar to P of s, is a big
added bonus that's going to make our
whole process more efficient as we'll
see at the end of this video but it's
not a requirement but since we were
talking about how this is almost
symmetric and almost normal I think the
natural thing to use here is G of s
being the normal distribution so that's
what we'll use here and so if we plot F
which is again this black curve here
that's the same curve we saw on this
side of the board a moment ago and we
plot this green curve G which is the
normal distribution then this is what
they look like now the next step we need
to do is ensure that the green line is
always above the black line so we want
to scale G up we want to multiply G by
some large enough number such that it is
always going to be above F which is that
black curve so we're going to pick some
number M such that for any value of s I
choose so any value of s along this line
If I multiply M and G together it's
always going to be above F and now that
we have all of this set up this is how
accept reject sampling actually works so
step one is to get a sample from G of s;
again, we said that that is possible
because G of s is easy to sample from. So
we're going to, boop, get a sample from G
of s. That could be anywhere: it could be
here, it could be here. It's going to be
more likely in places where G of s has a
higher density and less likely in places
where G of s has a lower density,
naturally so that's step number one and
step number two is we're either going to
accept this sample or reject this sample
we accept this sample s with some
probability given by this formula, which
is F of s, s being the thing we just
sampled from G, divided by M times G of s. And
now just a quick note how do we know
that this thing can be interpreted as a
probability how do we know it's bounded
between zero and one well that comes
back to why we did this transformation
of multiplying G by some big enough
constant M because we know that this
form M * G of s which is exactly what we
see in the denominator here is always
going to be above
F of s which is the numerator here so
this is safely assumed to be interpreted
as a probability it's bounded between 0
and one okay so no issues there so again
the process is pretty simple once you
have this set up: you simply keep
getting samples from G of s, and for each
sample you compute this acceptance
probability and then you simply just
choose to accept or reject that sample
you just got with that probability and
so as you go through this method you'll
have a bunch of samples going to the
accept pile which is this green bar I
visualized up here and other samples
will go into the reject pile if they
don't meet that probability which is
that reject pile over here and then
after some stopping condition however
many samples you want to get you're
going to stop and the samples that are
in the accept pile so all of the samples
that you've accepted through this
process are, very mysteriously right now,
actually going to be as if you sampled
from P of s itself. Now I want to pause
here and say that when I first learned
this when I first learned the method it
seemed like magic it didn't seem like it
should work it doesn't seem like there's
anything in it that inherently really
has to do with P of s so how can we say
that all these things that we've
accepted are actually a draw from P of s
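Before the proof, the two-step recipe itself can be sketched in code. This is a minimal sketch assuming an illustrative unnormalized numerator f (not the video's exact function), a standard normal proposal for G, and a bound M chosen by hand so that M times G stays above f everywhere; all constants here are assumptions for illustration.

```python
import math
import random

def f(s):
    # Illustrative unnormalized target numerator F(s) (assumed constants).
    if s >= 0:
        return math.exp(-0.6 * s * s) + math.exp(-1.2 * s * s)
    return math.exp(-0.8 * s * s) + math.exp(-1.0 * s * s)

def g(s):
    # Standard normal proposal density G(s): easy to sample from, and it
    # spans the same (-inf, inf) domain as the target.
    return math.exp(-0.5 * s * s) / math.sqrt(2.0 * math.pi)

# M scales G so that M * g(s) >= f(s) for every s. For this particular f
# the ratio f/g peaks at s = 0, where it equals 2*sqrt(2*pi) ~ 5.01, so
# M = 6 keeps the green curve above the black curve everywhere.
M = 6.0

def accept_reject(n):
    accepted, proposed = [], 0
    while len(accepted) < n:
        s = random.gauss(0.0, 1.0)               # step 1: sample from G
        proposed += 1
        if random.random() < f(s) / (M * g(s)):  # step 2: accept w.p. f/(M*g)
            accepted.append(s)                   # goes in the accept pile
    return accepted, proposed

random.seed(0)
samples, proposed = accept_reject(5000)
print(len(samples) / proposed)  # empirical acceptance rate
```

The printed acceptance rate hovers around the normalizing constant divided by M, which also previews the efficiency point at the end of the video: if a spiky target forced M to be 600 instead of 6, the same loop would accept roughly 100 times fewer proposals.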
and so how I'll do it first I'll prove
it to you mathematically but that's
usually not enough for me to really
understand it from first principles so
after we prove it mathematically I'm
going to go back and talk about more
intuitively why this works what we want
to prove mathematically is if we look at
all of the samples that we have accepted
in this process so if we look at all the
samples that are in this green accept
bar up here
then we want to show that the density of
those samples is actually P of s if we
can show that that proves that all the
samples in there are samples from P of s
so we'll start by asking what is the
density. This capital D is just kind
of a stand-in for a density; if you want,
you can think of it as a probability
symbol. I want to be a little bit
careful here, since densities are not
exactly probabilities, but if it mentally
helps, just replace all these capital
D's by probability symbols and you'll get
the same intuition. So, the density of a
sample given that we have accepted
it okay that's exactly what we're after
right now we're going to use a little
bit of Bayes' theorem here so whenever we
have a conditional we can write it as
the combination of the following terms
the first one being the probability of
accepting a sample given the value of
the sample is little s times the
unconditional density of that sample
little s all divided by the probability
that we accept a sample again
unconditional so getting from here to
here is just using Bayes' theorem so now
we can actually get some good forms for
all these components so first of all
what's the probability of accepting a
sample given that its value is little s
well we literally know that's equal to
this formula that's how we constructed
the process. So we're going to put in F
of s divided by M times G of s for that term. Now
what is the unconditional density of
getting the sample little s in the first
place well that's just the probability
or more specifically the probability
density that our different distribution
G would give us that sample back and so
that would just be G of s and that's all
divided by the probability that we
accept a sample unconditionally so the
only difficult thing to compute now is
what's the probability that we accept a
sample unconditionally and we're going
to work that out here it's not too
tricky so we're asking about what's the
probability that we accept a sample
unconditionally so it turns out this can
be written as an integral because notice
there's no conditional on here so we
need to go over all the different
possible samples from negative Infinity
to infinity and ask about what's the
probability that we would get this and
accept this and the first part that I
just said the probability that we would
get this sample that it would emerge is
exactly the probability that g would
suggest this sample which is G of s so
that is the probability that we would
get proposed this sample and given that
we're proposed that sample what's the
probability that we accept it that is
again the same form here, F of s divided
by M times G of s, the acceptance probability.
Now we get this nice cancellation of G
of s and G of s, so this integral
simplifies to 1 over capital M times the
integral of F of s ds, from negative infinity to infinity.
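Written out symbolically, the cancellation just described is:

```latex
P(\text{accept})
  = \int_{-\infty}^{\infty} G(s) \cdot \frac{F(s)}{M \, G(s)} \, ds
  = \frac{1}{M} \int_{-\infty}^{\infty} F(s) \, ds
  = \frac{NC}{M}
```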
seems a little bit tricky but have we
seen that form before well that is
exactly the formula for the normalizing
constant so it turns out interestingly
enough that the probability of accepting
a sample using this method is given by
normalizing constant whatever that is
divided by Big M Big M again being the
thing that we multiply G by to make it
always above F. And now, completing this
formula, let's bring this down here: we
see that the density of s given accept is
equal to... this G and this G will
actually cancel, so the numerator is just
F of s over M, and the denominator is the
probability of accepting, which we just said is the
normalizing constant divided by M. These M's
also cancel out, and mysteriously,
magically enough, we get back that the
density of getting a sample, given that
we accepted it, is equal to F of s divided by the
normalizing constant, which is exactly P
of s. We have
just proved mathematically using the
still mysterious method where we use
this different distribution G to get
candidates and accept them with this
probability here if we do that we are
exactly going to be sampling from P of s
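Collecting the whole argument into one chain, in the same notation:

```latex
D(s \mid \text{accept})
  = \frac{P(\text{accept} \mid s)\, G(s)}{P(\text{accept})}
  = \frac{\frac{F(s)}{M\,G(s)}\, G(s)}{\frac{NC}{M}}
  = \frac{F(s)/M}{NC/M}
  = \frac{F(s)}{NC}
  = P(s)
```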
so this works; this is mathematically
sound, it works. But if you're anything
like me you're not satisfied at this
point because we don't really understand
why it works what's the intuition behind
it and so let's talk about that next so
I think the key in intuitively
understanding this method is hidden in
this board and let me highlight the
place that I think is the most important
so this ratio here, F of s divided by G of s.
Let's ignore the M for a second; we'll
bring that back in. What does that ratio,
F of s divided by G of s, inherently mean?
well let's think about if that ratio was
very high if that ratio that I just
circled was very high that means the
numerator was very high and the
denominator was low if the numerator is
high that means that the sample we are
talking about is very likely in the
distribution P because P and F are
proportional to each other so if there's
a value of s such that f is very high
that means that value of s is actually
very high in P of s as well so if the
numerator is high it means that it is a
sample that is very very
likely. If the denominator is low, that
means it's a sample that is very
rare in G of s; it has a very low density.
So what would you say if I told you that
here's a sample that is very likely in
your target distribution but very
unlikely in your proposal distribution,
your candidate distribution G? Well, my
first instinct would be: accept it, accept
it! We might not see it again, because we
only see samples if they are proposed by
G, and if this G is very low for that
sample, we might not see the sample again
for a really long time. So let's jump on
it and accept that sample now, while we can,
and that's where the intuition comes in
because High values for this ratio imply
that this is a very important sample
important because it's rare to get
proposed and also because it's very high
probability in our Target distribution
that leads to very high probabilities of
accept which is why this is exactly the
form of our ratio. And this big M is just
there so that this F over G can be
interpreted as a probability, because F
over G can be higher than one,
arbitrarily higher, or lower than one. So M
is just there to scale everything down,
to make sure we can interpret this as
a probability. But the main point I want
to get across is that the thing I've
circled, F over G, is exactly the quantity
that is driving this whole process if
F divided by G is high, these are samples that are
very rare to get proposed, but if they
do get proposed they're extremely common in
our target distribution, so we should
accept them. If F divided by G is low,
that means these are samples that are
very commonly proposed (G is high) while F
is low, which means they don't
actually occur much in our target
distribution, so we don't care too much
about accepting them. So once I saw that
link, I think the process made
little bit more sense to me where some
different distribution G is throwing
samples at us and we are accepting them
based on this ratio here M just being a
normalizer. And so that's accept-reject
sampling. And the last thing I'll talk
about is just an issue that people
commonly face in accept-reject
sampling, and that is the choice of G of
s as we said here we used a normal
distribution because it was sort of
similar to this and it was an easy
distribution to sample from but even
though I framed this as kind of a real
world problem
real world distributions are usually
even more complex than something that is
kind of nice looking as this for example
if this black line this crazy squiggly
black line was your F then we need to go
through the same steps if we want to use
the normal distribution again we need to
multiply the normal distribution by some
big enough constant M so it's always
going to be above this. And that wouldn't
really be a problem, except for this
terrible spike that happens in F, which
means we need to scale the normal
distribution up so much that it's going
to be above this spike. What that means
is that M is going to be huge, because we
need to multiply the normal distribution
by a very large number to always be
above F. And what does it mean if M is
huge? Well, look at this probability of
accept: that was the normalizing constant
divided by M. If M is massive, then this
quantity becomes very small, so the
probability of accepting a sample goes
towards zero. What that means for our
process is that G is going to keep
proposing samples: hey, do you want to
accept this? Do you want to accept this?
But because our M was so large, our
probability of accepting any of those
samples is extremely low, very close to
zero. So this process is going to take a
very, very long time for us
to get however many samples we need to
get. So although this algorithm, this
method that I've proposed, seems pretty
simple, it's just a two-step process, the
real heavy lifting and the real creative
work, if you choose to use this process,
come before that, in choosing a G that
is sort of similar to your target
distribution. Which isn't a problem if
your target distribution looks nice, but
is potentially a problem if your target
distribution looks kind of disgusting
like this. So choosing your G is a
big art, and if you do it wrong, you're
going to end up with a very inefficient
process here. Okay, so that was accept-reject
sampling, or sometimes it's called
just rejection sampling. Hopefully
I've shown you a couple of things: I've
shown you why we need it instead of
inverse transform sampling many times in
the real world, I've shown you how it
works, I've shown you mathematically
that it works, but most importantly,
hopefully you understand now intuitively
why it works, and it all comes down to
this F over G ratio. Okay, so if you learned
something, please like and subscribe; any
comments are always welcome below, and
I'll see you next time.