Representation Learning: Part 2
By IIT Madras - B.S. Degree Programme
Summary
Topics Covered
- Exact linearity is a myth
- Choose line by minimal reconstruction error
- Minimize total squared projection errors
- PCA direction maximizes variance
Full Transcript
Here we are saying: if we know that four points are on a line and the fifth point is not, then how can we find the proxy for the fifth point along the line through the four points? But in practice, nobody is going to come and tell us, hey, these points are on the line and those points are not, so go find proxies for the off-line points along the line. Nobody tells us that.
We are just given a data set with a bunch of points. Which means that we also need to find the line onto which to project these data points. There are multiple candidate lines, and we need to somehow find the right one.
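Before finding that line, recall the step we already know from the previous part: given a line through the origin, represented by a unit vector w, the proxy of a point x is (x^T w) w. A minimal numpy sketch of that step (my own illustration; the numbers are made up):

```python
import numpy as np

# A line through the origin is represented by a unit vector w.
w = np.array([3.0, 4.0])
w = w / np.linalg.norm(w)          # enforce ||w|| = 1

x = np.array([2.0, 1.0])           # a data point in R^2

# Proxy (projection) of x along the line: (x^T w) w
coeff = x @ w                      # the single scalar stored per point
proxy = coeff * w

# Reconstruction error of this point: ||x - (x^T w) w||^2
err = np.sum((x - proxy) ** 2)
```

The scalar `coeff` is the compressed representation of the point; `err` measures what is lost.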
So, that is what we are going to do next. We are going to use the fact that given a line, we know how to find the proxy for a data point on this line to actually find that line itself. Let us see how we can do that. So, now just to make this goal a little bit more precise. So, this is our goal now. So, we want to develop
a way to find a compressed representation of the data when the data points do not necessarily fall along a line. So, we are not going to assume that all the data points lie on a line, with just one or two points off it. In fact, this situation is not an exception; it is almost a rule. It is never going to be the case that almost all points fall on a line and only one or two do not. Data not lying exactly on a line is the rule, not the exception. What do I mean by that? Let's say there was a class of 100 people and I measured their heights and weights and tried to plot them. We might get some values like this.
Of course, as the height increases, the weight increases. Now, we may want to represent this data set using this line. But as you can see, only one or two points actually fall on the line, and maybe none of them do. Yet we want to represent these data points using this line. Which means that an exact linear relationship between the features is a myth; that is never going to happen, or is at least very, very rare. Physical laws might satisfy something like that, but real-world data will perhaps never satisfy it. So, we are always going to be finding proxies for the data points along a line; you
will have to find these proxies. Now, if I knew that the red line is the best line, then I would know how to find a proxy for each of these data points. But nobody gives us the red line. Maybe it is not the red line; maybe it is the blue line. So how do we know? Given a line, I can always find a proxy for any data point along that line. So I could have found proxies for all these purple data points along the blue line as well, by projecting them onto it; the brown points would then act as the proxies, and I can still compress the data. Both lines give me the same amount of compression; there is no doubt about that, because once I have found proxies along a line, I can store one representative vector for the line and one coefficient per data point. Both are good ideas. But one line is still better than the other, not in terms of the amount of compression you achieve, but in
terms of what. Think about that. Let me tell you what it is. So, it
is obvious why the red line is better than the blue line. Though both give us 50% compression, the red line is the better line because, if you had to reconstruct the data set, the error you suffer with respect to the original data is much smaller with the red line than with the blue line. So it is the reconstruction error that is smaller for the red line, not the compression ratio.
So, which means we have to find the line to represent our data that gives us the least reconstruction error. Now, how do we find it? That is the question.
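To make the comparison between lines concrete, here is a small sketch (my own; the data set and the two directions are made up for illustration) that computes the total reconstruction error of a data set for two candidate unit directions, one roughly along the data and one across it:

```python
import numpy as np

def reconstruction_error(X, w):
    """Sum of ||x_i - (x_i^T w) w||^2 over all rows x_i of X (w must be unit-norm)."""
    coeffs = X @ w                     # one scalar per data point
    proxies = np.outer(coeffs, w)      # proxies (x_i^T w) w, stacked as rows
    return np.sum((X - proxies) ** 2)

# Points scattered near the direction (1, 1)
rng = np.random.default_rng(0)
t = rng.normal(size=100)
X = np.outer(t, [1.0, 1.0]) + 0.1 * rng.normal(size=(100, 2))

red  = np.array([1.0, 1.0]) / np.sqrt(2)   # roughly along the data
blue = np.array([1.0, -1.0]) / np.sqrt(2)  # perpendicular to the data

# The line along the data reconstructs it far better.
assert reconstruction_error(X, red) < reconstruction_error(X, blue)
```

Both directions give the same 50% compression (one scalar per 2-D point), yet their reconstruction errors differ by a large factor.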
So, now we have a very precise goal to tackle. Our specific goal now is: given a data set, find the line that has the least reconstruction error. This is what we want to find. So, let us do this. As usual, we have an unsupervised data set x_1, x_2, ..., x_n, where each x_i is in R^d.
Now, how do I define the error of a given line with respect to the data set? Basically, you sum up the errors for each of the data points. The data set has n data points, so you sum up the error incurred by each of them along the line. If I give you a line, you can measure its performance as the error that the data set incurs on it, by summing up the per-point errors. But what is the error of a particular data point on a line? Well, that is just the squared length of the difference vector x - (x^T w) w. Remember, when I say line, the line is going to be represented by a vector w such that the length of w is 1, that is, w^T w = ||w||^2 = 1. That is how we are going to think of lines. Among all possible lines represented by vectors on the unit circle, I want to find the best line.
So, given a line, which means I give you a w, and I ask what the error is: that is the sum of the errors for all the data points along the line. Now, what is the squared length? For any vector z, the squared norm ||z||^2 is just z^T z; the dot product of a vector with itself is its squared length. So the total error is the sum over i = 1 to n of ||x_i - (x_i^T w) w||^2. This is the error of a given line w with respect to the data set, and our goal is to find the w that minimizes it. So we can think of this as a function of w, the sum of the errors ||x_i - (x_i^T w) w||^2, and we want to minimize this function.
A couple of things that we will do here. We will think of this as the average error instead of the sum of the errors. That does not really change the answer: if some w minimizes the sum of the errors, the same w also minimizes the average error, so I can think of this as the function whose minimizer is the w minimizing the average error as well. The whole derivation would go through even if you stick with the sum; in terms of interpretability, we will see later why the average is a better thing to look at. Okay. Now, about this squared length: remember, x_i is in R^d, w is in R^d, and x_i^T w is in R. So (x_i^T w) w is a scalar times w; it is the proxy for x_i along the line w. You are looking at the difference, which is a vector, and at its length. Because it is a vector, the squared length is just the transpose of the vector times itself, so I can write each term as (x_i - (x_i^T w) w)^T (x_i - (x_i^T w) w).
Okay. So, now I can expand this. I am sure, I mean anybody who has done an MLF or a linear algebra course will know how to do this expansion.
I will do this nevertheless. Expanding, the term is x_i^T x_i minus (x_i^T w)(x_i^T w), which is (x_i^T w)^2, minus again (x_i^T w)(w^T x_i), which is also (x_i^T w)^2, plus the last term, where the scalars multiply to give (x_i^T w)^2 and the vectors dot to give w^T w. Well, w^T w is ||w||^2, but the w's we care about have length 1, so that is just a multiplication by 1. After simplification, one pair of terms cancels out, and we are left with f(w) = (1/n) sum over i = 1 to n of (x_i^T x_i - (x_i^T w)^2), where x_i^T x_i is just ||x_i||^2. Now, if you think about it, we are after the w that minimizes this value.
But if you look at the first term, it does not contain w; it is a constant added inside each summand, so removing it does not affect the minimizing w. So, equivalently, I can minimize a different function g without that constant, g(w) = (1/n) sum over i = 1 to n of -(x_i^T w)^2. The constant goes away, and whichever w minimizes g(w) is also the one that minimizes f(w).
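As a quick numerical confirmation of this decomposition (a sketch of my own, with random data), one can verify that the average reconstruction error equals the constant (1/n) sum ||x_i||^2 minus (1/n) sum (x_i^T w)^2, for any unit w:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))           # n = 50 points in R^3

w = rng.normal(size=3)
w = w / np.linalg.norm(w)              # a random unit direction

# Average reconstruction error f(w) = (1/n) sum ||x_i - (x_i^T w) w||^2
proxies = np.outer(X @ w, w)
f_w = np.mean(np.sum((X - proxies) ** 2, axis=1))

# The claimed decomposition: a constant term minus (1/n) sum (x_i^T w)^2
constant = np.mean(np.sum(X ** 2, axis=1))   # does not depend on w
g_term = np.mean((X @ w) ** 2)

assert np.isclose(f_w, constant - g_term)
```

Since `constant` is the same for every w, minimizing `f_w` over unit vectors is the same as maximizing `g_term`.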
If you are not convinced about this, pause and think about why this constant does not matter in finding w. Okay. So, now this is g(w), which has a negative sign, and of course we want to minimize it over all w such that w^T w = ||w||^2 = 1. That is the goal. But instead of holding on to the negative sign, we can write this as a maximization instead: max over w, such that ||w||^2 = 1, of (1/n) sum over i = 1 to n of (x_i^T w)^2, without the negative sign. We will see why we are doing this rewriting. Now, in this objective each term is (x_i^T w)^2, and I can write that as (w^T x_i)(x_i^T w).
It is the same thing: w^T x_i is the same as x_i^T w, and I am squaring it, so I can write the square as a product of two factors. Now, why am I doing that? Remember, w is a vector in R^d, so w^T can be thought of as a 1 x d row vector; x_i is a d x 1 column vector, and x_i^T is a 1 x d row vector. Vectors are all column vectors by default, and their transposes are row vectors. So I can write this whole thing as w^T (x_i x_i^T) w. I have just changed the brackets, and that is fine because matrix multiplication is associative; it is the same three factors in the same order, I am not commuting anything. The order in which the multiplication happens is still the same. Now, note that w does not depend on i.
So, I can bring w outside the sum and say that this whole thing is w^T ((1/n) sum over i = 1 to n of x_i x_i^T) w. Now, what kind of object is the quantity in the middle? If you have not already seen it, pause and think: is it a vector, a scalar, a matrix? I will tell you: it is a d x d matrix. It is the average of a bunch of d x d matrices, one matrix x_i x_i^T per data point; you take all the data points and average these matrices. Let us give this matrix a name; call it C. Essentially, then, the problem we can solve to achieve our objective is the following: maximize, over w such that ||w||^2 = 1, the quantity w^T C w, where C = (1/n) sum over i = 1 to n of x_i x_i^T.
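Stacking the data points as rows of a matrix X, the matrix C = (1/n) sum x_i x_i^T can be assembled as X^T X / n, and w^T C w matches the averaged objective. A small sketch of my own to check both facts:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 200, 4
X = rng.normal(size=(n, d))            # data points as rows

# C = (1/n) sum_i x_i x_i^T, assembled two equivalent ways
C_loop = sum(np.outer(x, x) for x in X) / n
C = X.T @ X / n                        # the same d x d matrix, vectorized
assert np.allclose(C_loop, C)

w = rng.normal(size=d)
w = w / np.linalg.norm(w)

# w^T C w equals the objective (1/n) sum (x_i^T w)^2
assert np.isclose(w @ C @ w, np.mean((X @ w) ** 2))
```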
Basically, what we are saying is that finding the line through the origin that best represents the data, in the sense of minimum reconstruction error, is the same as finding the w that maximizes this quantity for a matrix that you can generate from your data. This is a d x d matrix, and some of you might already recognize what it is. We will talk about this matrix in a bit more detail later, but it is nothing but the covariance matrix.
So, this matrix is called the covariance matrix of the data set. Basically, we want to find the direction w that maximizes the quadratic form w^T C w for the covariance matrix C. Again, those who have already done a linear algebra course will immediately recognize that the solution to this problem is the eigenvector of C corresponding to its maximum eigenvalue. So: the line that best represents the data in terms of minimizing reconstruction error is the line that maximizes w^T C w, where C is the covariance matrix, and this line is given by the eigenvector corresponding to the maximum eigenvalue of the covariance matrix. We will talk about this covariance matrix in a little more detail later. For
now, the main takeaway at this point is that this is not a hard problem to solve. We will understand this covariance matrix, and where covariance comes into the picture when all we were doing is error minimization, a little later. But for now, note that this problem has a well-known, well-understood solution in terms of the eigenvectors of a certain matrix associated with the data set.
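Putting the whole derivation together: a minimal sketch (my own, using numpy's eigendecomposition for symmetric matrices) that builds C, takes the eigenvector of its largest eigenvalue, and checks that no random unit direction reconstructs the data better:

```python
import numpy as np

rng = np.random.default_rng(3)
# Synthetic data stretched along one direction
t = rng.normal(size=300)
X = np.outer(t, [2.0, 1.0]) + 0.2 * rng.normal(size=(300, 2))

C = X.T @ X / len(X)                   # the d x d matrix from the derivation

# eigh returns eigenvalues in ascending order for a symmetric matrix
eigvals, eigvecs = np.linalg.eigh(C)
w_best = eigvecs[:, -1]                # eigenvector of the maximum eigenvalue

def avg_reconstruction_error(X, w):
    proxies = np.outer(X @ w, w)
    return np.mean(np.sum((X - proxies) ** 2, axis=1))

# The top eigenvector should not be beaten by any random unit direction
best = avg_reconstruction_error(X, w_best)
for _ in range(20):
    w = rng.normal(size=2)
    w = w / np.linalg.norm(w)
    assert best <= avg_reconstruction_error(X, w) + 1e-9
```

This is exactly the equivalence derived above: minimum average reconstruction error over unit vectors is achieved by the top eigenvector of C.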