
The Algorithm That Made Modern AI Possible

By StringsandTheory

Summary

Topics Covered

  • Markov chains need no memory of where they started
  • MCMC bypasses the intractable evidence problem
  • Random walks that visit probability proportionally
  • Inference over distributions you cannot compute

Full Transcript

Markov chain Monte Carlo is one of those methods that shows up everywhere, from physics to machine learning, weather forecasting, and predicting the next word in a sentence.

But despite how widely it's used, for many people these methods still feel like a black box. In this video, I'll walk through how it works and why it's such a powerful tool. To understand MCMC, we need to first understand the two ingredients that it's made from: Monte Carlo simulations and Markov chains. Let's start with the first one.

Monte Carlo methods estimate quantities by using random sampling. A classic example is estimating the value of pi. Imagine that you draw a square of side 2R, and inside of it you draw a circle with radius R. To estimate pi, we can start by throwing random points uniformly into the square. The probability that a point lands inside the circle is simply the ratio of the two areas. Because the area of the circle is pi R squared and the area of the square is (2R) squared, or 4 R squared, the probability of a point landing inside the circle is simply pi over four. So, by throwing many points at the square, we can get an estimate for pi.
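That dart-throwing estimate fits in a few lines of Python (the function name here is just for illustration):

```python
import random

def estimate_pi(n_points, seed=0):
    """Estimate pi by throwing random points at a square of side 2
    and counting how many land inside the inscribed unit circle."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_points):
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:   # the point landed inside the circle
            inside += 1
    # P(inside) = pi/4, so pi is roughly 4 times the fraction inside.
    return 4.0 * inside / n_points

print(estimate_pi(1_000_000))  # close to 3.14
```

With a million points the estimate is typically within a few thousandths of the true value.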

And this is what the Monte Carlo method does: we approximate a quantity by using many random samples. In this example, the samples are independent, but in many problems independent sampling is not possible. And that leads us to the next ingredient, Markov chains. A Markov chain is a sequence of random states where the probability of the next state depends only on the current state. This is known as the Markov property. In other words, the system is memoryless. Mathematically, we can write this as: the probability of the next state at time t plus one, given the current state and all previous states, is the same as the probability of the next state given only the current state.

Let's make this concrete. Say our system has three possible states: rainy, cloudy, and sunny. At each step, the chain moves between states according to a fixed probability. From rainy, there's a 60% chance of moving to cloudy and a 40% chance of staying in rainy, and so on. Now, let's actually run the chain. We start from rainy, we move to cloudy, then sunny, sunny again. Each step depends only on the current state. That's the Markov property in action. If the chain is allowed to explore all possible states and it doesn't get stuck in one region forever, then over time it will settle into a stable distribution.
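The rainy/cloudy/sunny chain can be simulated directly. Only the transitions out of rainy are specified above; the other rows of the transition table below are made-up numbers for illustration:

```python
import random

STATES = ["rainy", "cloudy", "sunny"]

# Transition probabilities. The "rainy" row matches the example
# (60% to cloudy, 40% stay); the other rows are invented.
P = {
    "rainy":  {"rainy": 0.4, "cloudy": 0.6, "sunny": 0.0},
    "cloudy": {"rainy": 0.3, "cloudy": 0.4, "sunny": 0.3},
    "sunny":  {"rainy": 0.1, "cloudy": 0.3, "sunny": 0.6},
}

def run_chain(start, n_steps, seed=0):
    """Run the chain and return the fraction of time spent in each state."""
    rng = random.Random(seed)
    counts = {s: 0 for s in STATES}
    state = start
    for _ in range(n_steps):
        # The next state depends only on the current one: Markov property.
        weights = [P[state][s] for s in STATES]
        state = rng.choices(STATES, weights=weights)[0]
        counts[state] += 1
    return {s: counts[s] / n_steps for s in STATES}

print(run_chain("rainy", 100_000))
```

Run it long enough from rainy and from sunny and the occupation fractions come out nearly identical: the stable distribution doesn't depend on where you started.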

And when we do MCMC, we carefully design the algorithm so that the stable distribution is exactly the distribution that we want to sample from. To summarize, Markov chain Monte Carlo combines the following two ideas: Monte Carlo simulations, which give us a way to estimate quantities using random samples, and Markov chains, which give us a way to generate those samples when independent sampling is difficult.

Let's see how this connects to something like Bayesian inference. I recently made a video about this topic, but if you didn't watch it, here is a quick recap.

In Bayesian inference, we want to determine which parameter values are most consistent with the observed data.

Instead of returning a single best-fit value, the method produces a posterior distribution over all parameters using Bayes' theorem. Each term in this equation has a clear interpretation. The prior encodes our initial belief about the parameters before observing the data. And the likelihood tells us how well a set of parameters explains the observed data. Combining them gives us the posterior distribution, which tells us which parameter values are likely and how uncertain we are. The challenge in computing the posterior lies in the denominator of this equation, which is called the evidence and requires evaluating an integral over all possible parameter values. In low-dimensional problems, this integral can sometimes be computed directly, but in realistic scenarios the number of parameters quickly grows, and with it the number of possible combinations. So, this integral becomes computationally intractable.

So, now we are stuck with a problem. We have a distribution that we want to work with, but we can't even compute it. This is where MCMC enters, because instead of drawing independent samples directly from the posterior, we construct a Markov chain whose stationary distribution is the posterior. Then we simulate this chain for a long time and collect the samples that it produces. Those samples will behave as if they were drawn directly from the posterior distribution.

One of the simplest MCMC algorithms is the Metropolis-Hastings algorithm. It gives us a concrete recipe for constructing a Markov chain that targets any distribution we want, including a posterior that we can't compute directly. We start by picking an initial parameter value. This can be a random guess or something informed by your domain knowledge. But you might already be wondering: what if our starting value is bad and we end up in a region of very low posterior probability? The chain will indeed spend some time wandering in that low-probability region before finding its way to where it needs to be.

And if we keep those early samples, they will bias our inference. But as we saw before, Markov chains are memoryless. Future states only depend on the current one, which means that the chain has no memory of where it started. Once it reaches the high-probability region, it stays there. So, in practice, we can just throw away the first few hundred samples. This is often called a burn-in period. After that, the samples will behave as if they were drawn from the posterior, regardless of where we started.

From our initial value, we propose a new candidate value by sampling from a proposal distribution. A common choice is a Gaussian centered at the current value. The width of that Gaussian controls how bold our proposals are. If it is too narrow, the chain moves slowly, and if it is too wide, most of our proposals will be rejected.

And to decide whether to accept or reject a new position, we compute the acceptance ratio. This is simply the ratio of the unnormalized posterior at the proposed point to the unnormalized posterior at the current point. Because we are computing a ratio of two posteriors, the evidence cancels out completely, so we don't even need to compute it. This is the core reason why MCMC is so powerful: we only need to evaluate the posterior up to a constant, which we can always do. If the proposed point has a higher posterior probability, we always accept, because we are moving uphill. But if the proposed point has a lower posterior probability, we accept with probability R. So, we sometimes move downhill, but not very often. And this is what prevents the chain from getting stuck on a single peak, which allows it to explore the full distribution.
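The whole loop (initial guess, Gaussian proposal, acceptance ratio, burn-in) fits in a short sketch. As a stand-in for a real posterior, the code below targets an unnormalized Gaussian; the function names are just for illustration:

```python
import math
import random

def unnormalized_posterior(x):
    # Stand-in target: a standard normal, deliberately left unnormalized.
    # In real Bayesian inference this would be prior(x) * likelihood(x).
    return math.exp(-0.5 * x * x)

def metropolis_hastings(n_samples, step=1.0, start=5.0, seed=0):
    rng = random.Random(seed)
    x = start                        # initial guess, possibly in a bad region
    samples = []
    for _ in range(n_samples):
        # Propose from a Gaussian centered at the current value.
        proposal = x + rng.gauss(0.0, step)
        # Acceptance ratio: the evidence cancels, so only the
        # unnormalized posterior is ever evaluated.
        r = unnormalized_posterior(proposal) / unnormalized_posterior(x)
        if rng.random() < r:         # always accept uphill; downhill w.p. r
            x = proposal
        samples.append(x)
    return samples

samples = metropolis_hastings(50_000)
kept = samples[1000:]                # throw away the burn-in period
print(sum(kept) / len(kept))         # near the true mean of the target, 0
```

Even though the chain starts far out at 5.0, after burn-in the kept samples have mean near 0 and variance near 1, exactly the target distribution.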

Repeating this procedure produces a random walk through parameter space, but not an aimless one, because the acceptance rule is carefully constructed so that the chain visits each region of parameter space in proportion to its posterior probability. High-probability regions are visited often, and low-probability regions are visited rarely. So, the chain is not just wandering; it is, by construction, sampling from the posterior.

Once we have samples from the posterior, we can estimate anything we want: the mean of a parameter, the 95% credible intervals, or the probability that a parameter exceeds some threshold. The unifying idea is that any quantity you want from the posterior can be written as an expectation, which is approximated by averages over samples. That's why it's called Markov chain Monte Carlo: the chain gives you the samples, and the samples give you everything else.

So, what did we just do? We started with a posterior distribution that we wanted to compute, but were unable to.
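As a sketch of that last step, here is how a pile of samples turns into those summaries. Synthetic Gaussian draws stand in for real MCMC output, and the function name is just for illustration:

```python
import random

def summarize(samples):
    """Turn a flat list of posterior draws for one parameter
    into the quantities we actually care about."""
    s = sorted(samples)
    n = len(s)
    mean = sum(s) / n
    # Central 95% credible interval: the 2.5th and 97.5th percentiles.
    interval = (s[int(0.025 * n)], s[int(0.975 * n)])
    # Probability that the parameter exceeds a threshold (here zero):
    # just the fraction of samples above it.
    p_positive = sum(1 for x in s if x > 0.0) / n
    return mean, interval, p_positive

# Stand-in for real MCMC output: draws from a known standard normal,
# so we can check that the summaries land where they should.
rng = random.Random(0)
draws = [rng.gauss(0.0, 1.0) for _ in range(100_000)]
print(summarize(draws))
```

For a standard normal the mean comes out near 0, the credible interval near (-1.96, 1.96), and the exceedance probability near one half: every summary is just an average over samples.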

We couldn't draw independent samples from it, and we couldn't even evaluate it exactly. But with MCMC, none of that matters. We just need to evaluate the posterior up to a constant, which we can always do, and the algorithm takes care of the rest. So, the next time someone asks you how you can do inference over a distribution you can't compute, you build a chain that explores the parameter space for you.

That's MCMC, and it's one of the most useful ideas in all of statistics. If you like this video, you might also want to see my most recent video on Bayesian statistics, or check out one of my videos on black holes, where these methods are widely used. And I'll see you if you choose to watch another video from me.

Bye.
