
Clustering (4): Gaussian Mixture Models and EM

By Alexander Ihler

Summary

Key Takeaways

  • **GMMs: Beyond K-Means for Overlapping, Non-Circular Clusters**: Gaussian Mixture Models (GMMs) extend K-Means by modeling clusters with Gaussian distributions, allowing for non-circular shapes and overlapping data, which K-Means struggles with. [00:15], [00:44]
  • **EM Algorithm: Iterative Soft Assignment for GMMs**: The Expectation-Maximization (EM) algorithm iteratively refines GMM parameters by first assigning soft probabilities (responsibilities) of data points belonging to each cluster (E-step), then updating cluster parameters based on these probabilities (M-step). [05:10], [06:35]
  • **GMMs as Generative Models: Sampling and Imputation**: Trained GMMs act as generative models, enabling tasks like sampling new data points similar to the training set or imputing missing data values. [01:45], [02:04]
  • **EM Convergence: Guaranteed Likelihood Increase, Not Global Optimum**: While EM strictly increases the log-likelihood of the data with each iteration, ensuring convergence, it's not guaranteed to find the global optimum and may require multiple initializations. [08:40], [09:28]
  • **Red Blood Cell Data: EM Identifies Hidden Patient Groups**: In a red blood cell dataset, EM successfully identified two distinct patient groups (normal and anemic) based on their feature measurements, even though the diagnoses were hidden. [11:02], [12:51]

Topics Covered

  • Why K-Means Fails: Overlapping and Non-Circular Clusters
  • Beyond Clustering: GMMs as Generative Probability Models
  • Unveiling Hidden Structures: GMMs as Latent Variable Models
  • EM Algorithm Pitfalls: Understanding Local Optima & Convergence
  • How EM Algorithm Iteratively Refines Cluster Assignments

Full Transcript

Another common and important clustering technique is based on probability density estimation, using Gaussian mixture models and a procedure called expectation-maximization, or EM, to fit the model parameters.

If you recall the K-means algorithm, we described each cluster using just a single point in feature space, the cluster center, and assigned each data point to its nearest cluster.

But what if our concept of the grouping of the data has groups that overlap in the feature space? Then, first of all, it's hard to know which assignment is right, since both are plausible. And secondly, in K-means we always use Euclidean distance to the center, but what if our cluster is defined by some non-circular shape?

For example, in these data over here, the data can be seen as forming two groups, one of which has a lot of spread in x2 and the other of which has a lot of spread in x1; both are centered at the same place. So it's easy, mentally, to see these as falling into two groups, and to cluster these data in one and these data in the other, but something like K-means won't be able to discover that assignment.

Gaussian mixture models are an extension of the K-means model in which clusters are modeled with Gaussian distributions, so we have not only their mean but also a covariance that describes their ellipsoidal shape. We can then fit the model by maximizing the likelihood of the observed data. We do this with an algorithm called EM; it'll be very reminiscent of K-means, except that it'll assign data to each cluster with some soft probability.

As a positive side effect, after clustering we're actually creating a generative model for the data x, a probability model, which means we can do a lot of useful tasks: sampling new examples that we think are like the data that we measured, comparing two collections of data, like the training and test sets, to see if they differ, or imputing missing data values. In fact, K-means is often viewed as a special case of a Gaussian mixture model in order to derive some of these benefits.

A Gaussian mixture model is a useful probability distribution model. We begin with several mixture components, indexed by c, each of which is described by a Gaussian distribution: each has a mean μ_c, a variance or covariance Σ_c, and a size π_c. For example, in this figure we have three Gaussian components, with means here, here, and here; each of these components has a different variance and a different height, or area. The joint probability distribution is then defined by the weighted average of these individual components: π_c times the Gaussian defined by μ_c and Σ_c.
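The mixture density described here can be sketched in a few lines of NumPy; the function names are illustrative, not from the lecture:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Multivariate Gaussian density N(x; mu, sigma)."""
    d = len(mu)
    diff = x - mu
    inv = np.linalg.inv(sigma)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(sigma))
    return np.exp(-0.5 * diff @ inv @ diff) / norm

def gmm_density(x, pis, mus, sigmas):
    """Mixture density p(x) = sum_c pi_c * N(x; mu_c, sigma_c)."""
    return sum(pi * gaussian_pdf(x, mu, sig)
               for pi, mu, sig in zip(pis, mus, sigmas))
```

With a single standard-normal component, `gmm_density` reduces to the ordinary Gaussian density.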

We can interpret this joint probability distribution over x in a simple generative way. To draw a sample x from p(x), we first select one of the components with discrete probability π, so components with large π_c are selected more often. As in K-means, we'll view this as the assignment of that sample to one of the components, denoted by z. Then, given the component assignment z = c, we draw a value of x from the corresponding Gaussian distribution. Together, these two distributions make a joint model over x and z; discarding the value of z gives a sample from the marginal p(x) defined above.
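This two-stage sampling procedure (pick a component z from π, then draw x from that component's Gaussian) might look like the following sketch, with an illustrative function name:

```python
import numpy as np

def sample_gmm(n, pis, mus, sigmas, rng=None):
    """Draw n samples: pick z ~ Categorical(pi), then x ~ N(mu_z, sigma_z)."""
    rng = np.random.default_rng(rng)
    zs = rng.choice(len(pis), size=n, p=pis)   # component assignments z
    xs = np.array([rng.multivariate_normal(mus[z], sigmas[z]) for z in zs])
    return xs, zs
```

Keeping `zs` gives a sample from the joint over (x, z); discarding it gives a sample from the marginal p(x).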

Models like this are sometimes called latent variable models: the data x are modeled jointly with an additional variable z that we don't get to observe; it's hidden. The presence of the unknown value of z helps explain patterns in the values of x, in this case, groups or clusters.

Usually our features are multivariate, even high dimensional, so we'll typically use a multivariate Gaussian, which has the same quadratic form but now involves a vector mean μ of length n, the same size as the number of features in a data point x, and an n-by-n covariance matrix Σ.

Recall that if we were given data from a multivariate Gaussian, the maximum likelihood estimate for the mean parameter is simply the mean of the data, the first moment of the data, and the covariance estimate is simply the mean of the n-by-n matrices formed by the outer product of x minus μ with itself, the centered second moment of the data.
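These two moment estimates are one-liners in NumPy (illustrative helper, assuming X holds one data point per row):

```python
import numpy as np

def gaussian_mle(X):
    """ML estimates for a multivariate Gaussian:
    mu = sample mean (first moment);
    sigma = mean of outer products of (x - mu) (centered second moment)."""
    mu = X.mean(axis=0)
    diff = X - mu
    sigma = diff.T @ diff / len(X)
    return mu, sigma
```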

EM then proceeds iteratively, in two steps. First, the expectation or E-step treats the Gaussian parameters μ, Σ, and π as fixed. Then, for each data point i and each cluster c, we compute a responsibility value r_ic that measures the relative probability that data point x_i belongs to cluster c. To do this, we just compute the probability of x under model component c, a weighted Gaussian, and normalize by the total over all values of c.

Here we evaluate a data point x under component one, so π_1 times the Gaussian defined by mean μ_1 and covariance Σ_1. We then also evaluate its probability under component two, so π_2 times the Gaussian defined by μ_2 and Σ_2. If a particular component c is not a very good explanation for x, it will typically have a small r_ic value; conversely, if it's by far the best possible explanation for x, it'll have r_ic approximately equal to one. Here, component two is about twice as good an explanation as component one for that data point, so component two gets responsibility two-thirds and component one gets responsibility one-third. Practically speaking, R is a number-of-data by number-of-clusters, so m-by-k, matrix that sums to one over the index c.
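The E-step, weighted Gaussian evaluations normalized over components, can be sketched in NumPy (illustrative function name, not from the lecture):

```python
import numpy as np

def e_step(X, pis, mus, sigmas):
    """Responsibilities r[i, c] proportional to pi_c * N(x_i; mu_c, sigma_c)."""
    m, k = X.shape[0], len(pis)
    r = np.zeros((m, k))
    for c in range(k):
        diff = X - mus[c]                              # (m, n)
        inv = np.linalg.inv(sigmas[c])
        quad = np.einsum('ij,jk,ik->i', diff, inv, diff)
        norm = np.sqrt((2 * np.pi) ** X.shape[1] * np.linalg.det(sigmas[c]))
        r[:, c] = pis[c] * np.exp(-0.5 * quad) / norm
    return r / r.sum(axis=1, keepdims=True)            # each row sums to 1
```

When two components explain a point equally well, each receives responsibility one-half.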

Then, in the second step of EM, the maximization or M-step, we fix these assignment responsibilities r_ic and update the parameters of the clusters: μ, Σ, and π. For each cluster c, we update its parameters using an estimate weighted by the probabilities r_ic, as if it observed some fraction r_ic of data point i. So cluster c sees some total number of data points m_c, the sum of the soft memberships or fractional weights assigned to cluster c.

Then π_c is just this value normalized by the total number of data points m, so it's the fraction of the data point probabilities that's assigned to cluster c. The weighted mean μ_c is just the weighted average of the data, so each point x_i is given weight r_ic, and we divide by the total m_c. If a point x_i was poorly explained by some cluster c, its r_ic will be small and that point won't influence this average very much; on the other hand, if c was the best explanation for x_i, then r_ic will be large and x_i will influence the mean more.

The covariance is similarly just a weighted average of the n-by-n matrices formed by taking the outer product of x_i minus its cluster's mean, again weighted by r_ic, so that if x_i is a strong member of cluster c this weight will be nearly one, and if x_i is not very well explained by cluster c it won't enter into this average very much.
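These three weighted updates can be sketched as follows (illustrative name; X is m-by-n, r is the m-by-k responsibility matrix):

```python
import numpy as np

def m_step(X, r):
    """Update (pi, mu, sigma) from data X and responsibilities r."""
    m, n = X.shape
    mc = r.sum(axis=0)                       # soft count of points per cluster
    pis = mc / m                             # fraction of data assigned to c
    mus = (r.T @ X) / mc[:, None]            # responsibility-weighted means
    sigmas = []
    for c in range(r.shape[1]):
        diff = X - mus[c]
        # responsibility-weighted average of outer products (x_i - mu_c)(x_i - mu_c)^T
        sigmas.append((r[:, c, None] * diff).T @ diff / mc[c])
    return pis, mus, np.array(sigmas)
```

With all responsibility on a single cluster, this reduces to the ordinary Gaussian ML estimates.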

It's straightforward to prove, although I won't do it here, that these iterations strictly increase the log likelihood of the model, increasing its fit to the data. The log likelihood is just the log probability of the data points under the mixture model, so it's a sum over data points of the log of the probability p(x) from before, where that probability is a mixture of Gaussians.
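That objective, the sum over points of the log of the mixture density, is the quantity one monitors during training; a minimal sketch (illustrative name):

```python
import numpy as np

def gmm_log_likelihood(X, pis, mus, sigmas):
    """l = sum_i log( sum_c pi_c * N(x_i; mu_c, sigma_c) )."""
    m, n = X.shape
    p = np.zeros((m, len(pis)))
    for c in range(len(pis)):
        diff = X - mus[c]
        inv = np.linalg.inv(sigmas[c])
        quad = np.einsum('ij,jk,ik->i', diff, inv, diff)
        norm = np.sqrt((2 * np.pi) ** n * np.linalg.det(sigmas[c]))
        p[:, c] = pis[c] * np.exp(-0.5 * quad) / norm
    return np.log(p.sum(axis=1)).sum()
```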

In fact, EM is interpretable as a form of coordinate ascent, just like our previous K-means algorithm, so we're again guaranteed to converge. However, convergence here won't typically be as abrupt as it was in K-means, so in practice one usually stops once the parameters or the log likelihood objective have stopped changing very much.

Note that, as in K-means, convergence is not guaranteed to be to a global optimum, so we may have to start from several initializations and use the log likelihood to select the best.

A quick point: the result of this algorithm is a description of a collection of clusters, so centers, covariances, and so on, plus soft membership probabilities for each data point. If we want to identify our data points with a single cluster, as in K-means, we can just choose the most probable cluster: the cluster c that has the largest value r_ic for data point i.

Just like in K-means, this view of clusters is easily applied to new, out-of-sample data points as well: those points can be given either a soft cluster membership r_ic using an E-step, or a hard membership assignment z_i by again taking the largest r_ic.
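Hardening the soft memberships is just an argmax over the responsibility matrix (illustrative helper):

```python
import numpy as np

def hard_assign(r):
    """Convert soft responsibilities r (m, k) to hard labels z_i = argmax_c r[i, c]."""
    return r.argmax(axis=1)
```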

Also like K-means, selecting the number of clusters is important and can't be done using the clustered data themselves. As in K-means, one option is to use a complexity penalty in our score: the analog of the mean squared error cost in K-means is the negative log likelihood of the data under our mixture model, and we can then include a complexity penalty like BIC, just like before.

Alternatively, since the Gaussian mixture model is a true probability model, we can use the log likelihood of a validation or test set to assess the model fit as well, by evaluating the mixture probability on the new data points.
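A BIC-style score for a GMM can be sketched as follows; the lecture doesn't give the parameter count, so the standard full-covariance count used here (k-1 mixing weights, k·n means, k·n(n+1)/2 covariance entries) is my assumption:

```python
import numpy as np

def gmm_bic(loglik, k, n, m):
    """BIC = -2 * log-likelihood + (free parameters) * log(num data points).
    Assumed parameter count for a full-covariance k-component GMM in n dims."""
    p = (k - 1) + k * n + k * n * (n + 1) // 2
    return -2.0 * loglik + p * np.log(m)
```

Lower BIC is better; the penalty term grows with k, trading fit against complexity.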

Here's a demonstration of EM, courtesy of some slides by Professor Padhraic Smyth here at UC Irvine, run on data measuring properties of patients' red blood cells. The data include both a group of normal patients and a group of anemic patients, but the diagnosis is actually hidden from us, so we only see the feature measurements themselves.

You can probably already see the two groups of patients in the data: up here, with high red blood cell volume and high hemoglobin, are the normals, and this long, dispersed group with lower volume and lower hemoglobin are the anemics. Let's see how EM identifies these two groups automatically.

We initialize our two clusters randomly. In this case they're actually quite similar, but the means and covariances of the green cluster and the red cluster are slightly different, and that'll actually be enough. During the E-step, these points over here will be slightly more probable under the red model than the green, while these points will be about equally explained, and these points will be slightly more probable under green.

Given these soft assignments, or responsibilities, when we update our model parameters to compute the weighted means and covariances, we'll find that the red model shifts slightly to better accommodate the data down here that were given higher probability under the red model, and similarly green shifts to better accommodate the data that were given slightly higher probability under its component. Then recomputing the probabilities in the E-step makes these data even more likely under red and these data even more likely under green.

Another M-step moves the components even further apart, to help explain their respective data points better. Repeating, we see that the clusters continue to separate. By iteration 10 they're explaining very different groups of data, and by iteration 15 red is almost entirely responsible for the dispersed data and the green cluster for the tight grouping of normals. Eventually this process converges: the models cease to change very much, and we can stop.

Note that the model doesn't know which of these two groups is normal or anemic, just that it can find two distinct groups, or clusters, in the data that behave differently, along with a description of what the data in each cluster look like.

If we look at the log likelihood of the data over the course of the algorithm, we find that each iteration strictly increases the log likelihood score. After about 15 iterations, the score plateaus and doesn't increase any further, indicating again that the model has converged.

The EM algorithm is actually quite general, and can be used for many models or problems involving partially observed data. From this viewpoint, here the complete data are the feature values and the cluster assignments, but unfortunately for us the assignments z_i are missing. EM then corresponds to first computing the distribution over the hidden variables z_i given the current parameters of the model, and then maximizing the complete log likelihood over x and z together, in expectation over the distribution of z. In Gaussian mixture models this gives us the elegant algorithm that we saw, where we compute soft assignments for each data point and then plug those soft assignments into the maximum likelihood estimates; but in more general models it may be procedurally a bit more complex.

Two simple alternatives are often used when EM is difficult to implement: a stochastic version of EM, and a hard version of EM. In stochastic EM, instead of taking the expectation over z, we just sample its value and then fix it; in hard EM, we select the most probable value and fix it. Often it's easier to work with a fixed value of z than to maximize the expectation. This process of plugging a value into z is often called imputing the value of z.
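Both variants just replace the soft responsibility matrix with a one-hot assignment before the M-step; a small sketch with a hypothetical helper name:

```python
import numpy as np

def harden(r, rng=None, mode='hard'):
    """Replace soft responsibilities r (m, k) with a one-hot imputation of z:
    'hard' picks the most probable component; 'stochastic' samples one."""
    m, k = r.shape
    if mode == 'hard':
        z = r.argmax(axis=1)
    else:  # stochastic EM: sample z_i from the responsibility distribution
        rng = np.random.default_rng(rng)
        z = np.array([rng.choice(k, p=row) for row in r])
    onehot = np.zeros_like(r)
    onehot[np.arange(m), z] = 1.0
    return onehot
```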

Both variants, stochastic and hard EM, are quite similar to regular EM. Hard EM, as you might imagine, is a little less smooth in its optimization, since the best value of z might change discontinuously, and it's often prone to more local optima: once we have a hard assignment, we'll move the parameters to explain it, which then reinforces that hard assignment. Hard EM, with its best-assignment update, is very closely related to K-means. Stochastic EM is less prone to local optima, but has more randomness built in, making it harder to gauge convergence.

To summarize: Gaussian mixture models are a useful and flexible class of probability distributions; here we're using them for clustering. They explain the distribution of the data x using a latent variable framework: we posit that there's some hidden grouping or clustering that explains a lot of the variation in the data. Our model then contains a latent membership or assignment variable z_i, and the actual feature values x_i are Gaussian given that cluster assignment z_i.

We can learn the parameters of a Gaussian mixture model using expectation-maximization, or EM. Applied to Gaussian mixture models, EM has a simple and elegant form: we just iterate between computing the soft membership or assignment probabilities, which we call the responsibilities r_ic, and updating the parameters of the mixture model components, so the cluster centers, covariances, and so on, using these soft memberships.

This procedure is guaranteed to strictly increase the log likelihood (decrease the negative log likelihood) of the model on the training data, which ensures that it's convergent; but the objective is non-convex, so it may have local optima, which makes initialization potentially very important.

We also discussed how we can set the number of clusters, using either a complexity penalty like BIC, just as we did in K-means, or, since the Gaussian mixture model is a true probability model over x, by using the log likelihood score on held-out validation or test data as well.
