Clustering (4): Gaussian Mixture Models and EM
By Alexander Ihler
Summary
Key Takeaways
- **GMMs: Beyond K-Means for Overlapping, Non-Circular Clusters**: Gaussian Mixture Models (GMMs) extend K-Means by modeling clusters with Gaussian distributions, allowing for non-circular shapes and overlapping data, which K-Means struggles with. [00:15], [00:44]
- **EM Algorithm: Iterative Soft Assignment for GMMs**: The Expectation-Maximization (EM) algorithm iteratively refines GMM parameters by first assigning soft probabilities (responsibilities) of data points belonging to each cluster (E-step), then updating cluster parameters based on these probabilities (M-step). [05:10], [06:35]
- **GMMs as Generative Models: Sampling and Imputation**: Trained GMMs act as generative models, enabling tasks like sampling new data points similar to the training set or imputing missing data values. [01:45], [02:04]
- **EM Convergence: Guaranteed Likelihood Increase, Not Global Optimum**: While EM strictly increases the log-likelihood of the data with each iteration, ensuring convergence, it's not guaranteed to find the global optimum and may require multiple initializations. [08:40], [09:28]
- **Red Blood Cell Data: EM Identifies Hidden Patient Groups**: In a red blood cell dataset, EM successfully identified two distinct patient groups (normal and anemic) based on their feature measurements, even though the diagnoses were hidden. [11:02], [12:51]
Topics Covered
- Why K-Means Fails: Overlapping and Non-Circular Clusters?
- Beyond Clustering: GMMs as Generative Probability Models
- Unveiling Hidden Structures: GMMs as Latent Variable Models
- EM Algorithm Pitfalls: Understanding Local Optima & Convergence
- How EM Algorithm Iteratively Refines Cluster Assignments
Full Transcript
another common and important clustering
technique is based on probability
density estimation using gaussian
mixture models and a procedure called
expectation maximization or em to fit
the model parameters
if you recall the k-means algorithm we
described each cluster using just
a single point in feature space the
cluster center and assigned each data
point to its nearest cluster
but what if our concept of the grouping
of the data has groups that overlap in
the feature space
so then first of all it's hard to know
which assignment is right since both are
plausible
and secondly
in k-means we always use euclidean
distance to the center but what if our
cluster is defined by some non-circular
shape
so for example in these data over here
the data can be seen as forming two
groups one of which has a lot of spread
in X2 here and the other of which has a
lot of spread in X1
both are centered at the same place
so it's easy mentally to see these as
falling into two groups and to Cluster
these data in one and these data in the
other
but something like k-means won't be able
to discover that assignment
gaussian mixture models are an extension
of the k-means model in which clusters
are modeled with gaussian distributions
so we have not only their mean but also a
covariance that describes their
ellipsoidal shape
so then we can fit the model by
maximizing the likelihood of the
observed data we do so with an
algorithm called em it'll be very
reminiscent of k-means except it'll
assign data to each cluster with some
soft probability
as a positive side effect after
clustering we're actually creating a
generative model for the data x a
probability model which means we can do
a lot of useful tasks like sampling new
examples that we think are like the data
that we measured or comparing two
collections of data like the training
and test set to see if they differ
or imputing missing data values from
some of our data
in fact often k-means is actually viewed
as a special case of a gaussian mixture
model to derive some of these benefits
a gaussian mixture model is a useful
probability distribution model
we begin with several mixture components
indexed by C Each of which is described
by a gaussian distribution so each has a
mean mu c a variance or covariance Sigma
c and a size Pi C
for example in this figure we have three
gaussian components which means here
here and here
each of these components has a different
variance
and a different height
or area
so the overall probability distribution is
then defined by the weighted average of
these individual components so Pi c
times the gaussian defined by mu C and
sigma C
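The weighted-average density described above can be sketched directly in numpy; this is a minimal one-dimensional illustration, and the particular component values below are invented for the example:

```python
import numpy as np

def gmm_density(x, pi, mu, sigma):
    """Mixture density p(x) = sum_c pi_c * N(x; mu_c, sigma_c^2), in one dimension."""
    x = np.atleast_1d(x)[:, None]                       # shape (n, 1) for broadcasting
    comp = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return comp @ pi                                    # weighted average over components

# three illustrative components with different means, spreads, and sizes
pi = np.array([0.5, 0.3, 0.2])       # sizes pi_c (must sum to one)
mu = np.array([-2.0, 0.0, 3.0])      # means mu_c
sigma = np.array([0.5, 1.0, 1.5])    # standard deviations
```

Because the sizes pi sum to one and each Gaussian component integrates to one, the weighted average is itself a valid probability density.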
we can interpret this joint probability
distribution over X in a simple
generative way
to draw a sample x from p of x we
first select one of the components
with discrete probability pi
so components with large Pi are selected
more often
as in k-means we'll view this as the
assignment of that sample to one of the
components denoted by Z
then given the component assignment Z
equals c
we can draw a value of x from the
corresponding gaussian distribution
so together these two distributions make
a joint model over x and z
discarding the value of Z gives a sample
from the marginal P of X defined above
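The two-stage generative procedure just described (pick a component z with probability pi, then draw x from that component's Gaussian, then discard z) can be sketched as follows; the parameter values are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

pi = np.array([0.6, 0.4])                        # component sizes
mu = np.array([[0.0, 0.0], [5.0, 5.0]])          # component means (2 features)
Sigma = np.array([np.eye(2), 0.5 * np.eye(2)])   # component covariances

def sample_gmm(m):
    """Draw m samples: z ~ pi, then x ~ N(mu_z, Sigma_z); dropping z gives the marginal p(x)."""
    z = rng.choice(len(pi), size=m, p=pi)        # component assignments (larger pi chosen more often)
    x = np.array([rng.multivariate_normal(mu[c], Sigma[c]) for c in z])
    return x, z

X, Z = sample_gmm(2000)
```

Keeping the pair (x, z) gives samples from the joint model over x and z; keeping only x gives samples from the marginal mixture density.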
models like this are sometimes called
latent variable models the data X are
modeled jointly with an additional
variable Z that we don't get to observe
it's hidden the presence of the unknown
value of Z helps explain patterns in the
values of X for example in this case
groups or clusters
usually our features are multivariate
even high dimensional so we'll typically
use a multivariate gaussian which has
the same quadratic form but now involves
a vector mean mu of length n the same
size as the number of features in a data
point x
and an N by n covariance Matrix Sigma
recall that if we were given data from a
multivariate gaussian the maximum
likelihood estimate for the mean
parameter is simply the mean of the
data
so the first moment of the data and the
covariance estimate is simply the mean
of the N by n matrices formed by the
outer product of x minus mu with itself
so this is the centered second moment of
the data
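These two maximum likelihood estimates, the first moment and the centered second moment, can be computed in a couple of numpy lines; the toy data matrix here is invented for illustration:

```python
import numpy as np

X = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]])   # m = 3 points, n = 2 features

mu_hat = X.mean(axis=0)                  # first moment of the data
diff = X - mu_hat                        # centered data
Sigma_hat = diff.T @ diff / len(X)       # mean of the outer products (x - mu)(x - mu)^T
```

Note that the maximum likelihood covariance divides by the number of data points m, not the unbiased m - 1.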
em then proceeds iteratively in two
steps
first the expectation or e-step treats
the gaussian parameters mu Sigma and Pi
as fixed
then for each data point I and each
cluster C we compute a responsibility
value R sub IC that measures the relative
probability that the data point x i belongs
to cluster C
to do this we just compute the
probability of X under model component c
a weighted gaussian and normalize by the
total over all the values of c
so here we evaluate a data point x
under component one so Pi 1 times the
gaussian defined by mean mu 1 and
covariance sigma 1.
we then also evaluate its probability
under component two so Pi 2 times the
gaussian defined by mu 2 and sigma 2.
if a particular component C is not a
very good explanation for X it will
typically have a small Ric value
conversely if it's by far the best
possible explanation for x
it'll have Ric approximately equal to
one
here component 2 is about twice as good
an explanation as component one for that
data point
so component two gets responsibility
two-thirds and component one gets
responsibility one-third
practically speaking R is a number-of-data
by number-of-clusters so M by K
matrix that sums to 1 over the index C
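The e-step above can be sketched in numpy: evaluate each weighted Gaussian and normalize over the components. The `gauss_pdf` helper is our own illustrative implementation of the multivariate normal density, not part of the lecture:

```python
import numpy as np

def gauss_pdf(X, mu, Sigma):
    """Multivariate normal density N(x; mu, Sigma) evaluated at each row of X."""
    d = X.shape[1]
    diff = X - mu
    inv = np.linalg.inv(Sigma)
    quad = np.einsum('ij,jk,ik->i', diff, inv, diff)   # (x - mu)^T Sigma^-1 (x - mu) per row
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

def e_step(X, pi, mu, Sigma):
    """E-step: R[i, c] proportional to pi_c * N(x_i; mu_c, Sigma_c), normalized over c."""
    R = np.stack([pi[c] * gauss_pdf(X, mu[c], Sigma[c]) for c in range(len(pi))], axis=1)
    return R / R.sum(axis=1, keepdims=True)            # each row sums to one over clusters
```

A point far from a component's mean gets a responsibility near zero for that component; a point explained only by one component gets a responsibility near one there.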
then in the second step of em the
maximization or m-step we fix these
assignment responsibilities Ric and
update the parameters of the Clusters mu
Sigma and pi
then for each cluster C we update its
parameters using an estimate weighted by
the probabilities Ric as if it observed
some fraction Ric of data point I
so cluster C sees some total number of
data points MC that's the sum of these
soft memberships or fractional weights
assigned to Cluster C
then Pi C is just this value normalized
by the total number of data M so this is
the fraction of data point probabilities
that's assigned to Cluster C
the weighted mean mu C is just the
weighted average of the data so each
point x i is given weight Ric
and we divide by the total MC
so if a point x I was poorly explained
by some cluster C its Ric will be small
and that point won't influence this
average very much
but on the other hand if C was the best
explanation for X then Ric will be large
and X I will influence the mean more
the covariance is similarly just a
weighted average of the N by N matrices
formed by taking the outer product of x
i minus cluster C's mean with itself
and again they're weighted by Ric so
that if x i is a strong member of
cluster C this weight will be
nearly one if x i is not very well
explained by cluster C then it won't
enter into this average very much
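The m-step updates described above (soft counts, weighted means, weighted outer products) can be sketched as one numpy function, assuming a responsibility matrix R of shape m by k:

```python
import numpy as np

def m_step(X, R):
    """M-step: update pi, mu, Sigma from soft responsibilities R (m x k)."""
    m, d = X.shape
    mc = R.sum(axis=0)                      # soft count of points assigned to each cluster
    pi = mc / m                             # fraction of data assigned to each cluster
    mu = (R.T @ X) / mc[:, None]            # responsibility-weighted means
    Sigma = np.empty((len(mc), d, d))
    for c in range(len(mc)):
        diff = X - mu[c]
        # weighted average of outer products (x_i - mu_c)(x_i - mu_c)^T
        Sigma[c] = (R[:, c, None] * diff).T @ diff / mc[c]
    return pi, mu, Sigma
```

As a sanity check, if R happens to be a hard (one-hot) assignment matrix, these formulas reduce to the ordinary per-cluster sample mean and covariance.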
it's straightforward to prove although I
won't do it here that these iterations
strictly increase the log likelihood of
the model
increasing its fit to the data
the log likelihood is just the log
probability of the data points under the
mixture model so it's a sum over data
points of log of the probability
P of x from before where that
probability is a mixture of gaussians
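That sum-over-data-points of log mixture probabilities can be written compactly; here is a one-dimensional sketch matching the density form used earlier:

```python
import numpy as np

def log_likelihood(X, pi, mu, sigma):
    """Sum over data points of log p(x_i), where p is a 1-D mixture of Gaussians."""
    comp = np.exp(-0.5 * ((X[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return np.log(comp @ pi).sum()          # log of the weighted mixture, summed over i
```

Tracking this quantity across EM iterations is how one monitors the guaranteed increase in fit and decides when to stop.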
in fact em is interpretable as a form of
coordinate Ascent just like our previous
k-means algorithm so thus we're again
guaranteed to converge
however convergence here won't typically
be as abrupt as it was in k-means so in
practice one usually stops once the
parameters or the log likelihood
objective have stopped changing very
much
note that like in k-means convergence
is not guaranteed to be to a global
Optimum so we may have to start from
several initializations and use the log
likelihood to select the best
one quick point the result of this
algorithm is a description of a
collection of clusters so centers and
covariances and so on and soft
membership probabilities for each data
point
if we want to identify our data points
with a single cluster like in k-means we
could actually just choose the most
probable cluster so the cluster C that
has largest value Ric for data point I
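Picking the most probable cluster is just a row-wise argmax over the responsibility matrix; a tiny sketch with made-up responsibilities:

```python
import numpy as np

# soft memberships R (each row sums to one over clusters)
R = np.array([[0.9, 0.1],
              [0.3, 0.7],
              [0.5, 0.5]])

z = R.argmax(axis=1)   # hard assignment: most probable cluster per point (ties go to lower index)
```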
just like in K means the view of
clusters is easily applied to new
out-of-sample data points as well so
these points can be given either a soft
cluster membership Ric using an e-step
or a hard membership assignment zi by
again taking the largest Ric
also like k-means selecting the number
of clusters is important and can't be
done using the clustered data themselves
so like K means one option is to use a
complexity penalty in our score an
analog to the mean squared error cost in
k-means is the negative log likelihood
of the data in our mixture model and we
can then include a complexity penalty
like BIC just like before
alternatively since the gaussian mixture
model is a true probability model we can
use the log likelihood of a validation
or test set to assess the model fit as
well by evaluating the mixture
probability on the new data points
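One common form of that complexity-penalized score is the BIC; sign conventions vary between references, but a typical version (lower is better) with the usual free-parameter count for a k-component GMM in d dimensions is sketched here:

```python
import numpy as np

def gmm_bic(loglik, k, d, m):
    """BIC = -2 * log-likelihood + (num free parameters) * log(num data points).

    Free parameters for a full-covariance GMM: (k - 1) mixture weights,
    k * d mean entries, and k * d * (d + 1) / 2 covariance entries per component.
    """
    p = (k - 1) + k * d + k * d * (d + 1) // 2
    return -2.0 * loglik + p * np.log(m)
```

One would compute this score for several candidate values of k and keep the model with the lowest BIC.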
here's a demonstration of em courtesy of
some slides by Professor Padhraic Smyth here
at UC Irvine run on data measuring
properties of patients red blood cells
the data include both a group of normal
patients and a group of anemic patients
but the diagnosis is actually hidden
from us so we only see the feature
measurements themselves
probably you can see the two groups of
patients already in the data up here
with high red blood cell volume and high
hemoglobin are the normals and then this
long dispersed group with lower volume
and lower hemoglobin are the anemics
let's see how em identifies these two
groups automatically
we initialize our two clusters randomly
in this case they're actually quite
similar but the means and covariances of
the green cluster and the red cluster
are slightly different and that'll
actually be enough
during the e-step these points over here
will be slightly more probable under the
red model than the green while these
points will be about equally explained
and these points will be slightly more
probable under green
given these soft assignments or
responsibilities when we update our
model parameters to compute the weighted
mean and covariances we'll find that the
red model shifts slightly to better
accommodate the data down here that
were given higher probability under the
red model
and similarly green shifts to better
accommodate the data that were given
slightly higher probability under its
component
then recomputing the probabilities in
the e-step makes these data even more
likely under red and these data even
more likely under green
another m-step moves the components
even further away to help explain their
respective data points better
repeating we see
that the Clusters are continuing to
separate
by iteration 10 they're explaining very
different groups of data
and by iteration 15 red is almost
entirely responsible for the dispersed
data and the green cluster for the tight
grouping of normals
eventually this process converges the
model ceases to change very much and we
can stop
note that the model doesn't know which
of these two groups is normal or anemic
just that it can find two distinct
groups or clusters in the data that
behave differently and a description of
what the data in each cluster looked
like
if we look at the log likelihood of the
data over the course of the algorithm we
find that each iteration strictly
increases the log likelihood score so
after about 15 iterations
the score plateaus and doesn't
increase any further indicating again
that the model has converged
the em algorithm is actually quite general
and can be used for many models or
problems involving partially observed
data from this viewpoint the
complete data are the feature values and
the cluster assignments but
unfortunately for us the assignments zi
are missing
then em corresponds to First Computing
the distribution over the hidden
variable zi given the current parameters
of the model
and then maximizing the complete log
likelihood over x and z together in
expectation over the distribution of Z
in gaussian mixture models this gives
us this elegant algorithm that we saw
where we compute soft assignments for
each data point and then plug those soft
assignments into the maximum likelihood
estimates
but in more General models it may be
procedurally a bit more complex
two simple alternatives are often used
when em is difficult to implement
there's a stochastic version of em or a
hard version of em in stochastic em
instead of taking the expectation over Z
we just sample its value and then fix it
in hard em we select the most probable
value and fix it
often it's easier to work with a fixed
value of Z than maximizing the
expectation this process of plugging a
value into Z is often called imputing
the value of Z
both variants stochastic and hard em
are quite similar to regular em hard
em as you might imagine is a little less
smooth in its optimization since the
best value of Z might change
discontinuously and it's often prone to
more local Optima since once we have a
hard assignment we'll move the
parameters to explain it which then
reinforces that hard assignment
hard em is very closely related to
k-means with this best assignment update
stochastic em is less prone to local
Optima but has more Randomness built in
making it harder to gauge convergence
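The three variants differ only in how the e-step turns responsibilities into assignments, soft, most probable, or sampled; a sketch (the function names are ours, chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def soft_assign(R):
    """Standard EM: keep the full responsibility matrix R."""
    return R

def hard_assign(R):
    """Hard EM: impute the single most probable cluster for each point."""
    Z = np.zeros_like(R)
    Z[np.arange(len(R)), R.argmax(axis=1)] = 1.0
    return Z

def stochastic_assign(R):
    """Stochastic EM: sample one cluster per point from its responsibilities."""
    Z = np.zeros_like(R)
    for i, r in enumerate(R):
        Z[i, rng.choice(len(r), p=r)] = 1.0
    return Z
```

Each variant's output can be fed to the same m-step; hard and stochastic EM just replace the soft weights with one-hot rows.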
summary gaussian mixture models are a
useful and flexible class of probability
distributions here we're using them for
clustering they explain the distribution
of the data X using a latent variable
framework so we posit that there's some
hidden grouping or clustering that
explains a lot of the variation in the
data our model then contains a latent
membership or assignment variable zi and
then the actual feature values x i are
gaussian given that cluster assignment
Zi
we can learn the parameters of a
gaussian mixture model using expectation
maximization or em
em applied to gaussian mixture models has a
simple and elegant form we just iterate
between Computing soft membership or
assignment probabilities which we call
the responsibilities R sub IC and
updating the parameters of the mixture
model components so the cluster centers
covariances and so on using these soft
memberships
this procedure is guaranteed to strictly
increase the log likelihood that is decrease
the negative log likelihood of the model on
training data which ensures that it's
convergent but it's non-convex so it may
have local Optima which makes
initialization potentially very
important
we also discussed how we can set the
number of clusters using either a
complexity penalty like BIC just like we
did in k-means
or since the gaussian mixture model is a
true probability model over X we can
actually use the log likelihood score on
held out validation or test data as well